API Reference

warcio.statusandheaders.StatusAndHeaders

class warcio.statusandheaders.StatusAndHeaders(statusline, headers, protocol='', total_len=0, is_http_request=False)

Representation of a parsed HTTP-style status line and headers. The status line is the first line of a request or response. Headers is a list of (name, value) tuples. An optional protocol, which appears on the first line, may be specified. If is_http_request is true, the HTTP verb (instead of the protocol) is split from the start of statusline.

ENCODE_HEADER_RX = re.compile('[=]["\\\']?([^;"]+)["\\\']?(?=[;]?)')

get_header(name, default_value=None): Return the value of the header name if found (the name is matched case-insensitively), otherwise default_value.

add_header(name, value)

replace_header(name, value): Replace the header with a new value, or add a new header. Returns the old header value, if any.

remove_header(name): Remove the header (case-insensitive). Returns True if the header was removed, False otherwise.

get_statuscode(): Return the status code part of the response status line (assumes no protocol in the statusline).

validate_statusline(valid_statusline): Check that the statusline is valid, e.g. starts with a numeric code. If not, replace it with the passed-in valid_statusline.

add_range(start, part_len, total_len): Add range headers indicating that this is a partial response.
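The arithmetic behind range headers is easy to get off by one, so here is a small standalone sketch of it. It does not call warcio: range_headers is a hypothetical helper that follows the HTTP Content-Range convention (inclusive end offset), and it is only an assumption that add_range sets equivalent headers.

```python
def range_headers(start, part_len, total_len):
    """Hypothetical helper: headers describing a partial response.

    start     -- offset of the first byte being returned
    part_len  -- number of bytes in this part
    total_len -- size of the complete resource
    """
    if part_len <= 0 or start < 0 or start + part_len > total_len:
        raise ValueError('range does not fit inside the resource')

    # Content-Range uses an inclusive end offset, hence the "- 1".
    end = start + part_len - 1
    return [
        ('Content-Range', 'bytes {0}-{1}/{2}'.format(start, end, total_len)),
        ('Content-Length', str(part_len)),
    ]


print(range_headers(0, 100, 1000))
```

A partial response carrying such headers is normally sent with a 206 Partial Content status.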

compute_headers_buffer(header_filter=None): Set the buffer representing the headers.

to_str(filter_func=None)

to_bytes(filter_func=None, encoding='utf-8')

to_ascii_bytes(filter_func=None): Attempt to encode the headers block as ASCII. If encoding fails, call percent_encode_non_ascii_headers() to encode any headers per the RFCs.

percent_encode_non_ascii_headers(encoding='UTF-8'): Encode any headers that are not plain ASCII as UTF-8, as per:
https://tools.ietf.org/html/rfc8187#section-3.2.3
https://tools.ietf.org/html/rfc5987#section-3.2.2
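To make the RFC 8187 encoding concrete, the following standard-library sketch percent-encodes a single header value. It is an approximation under stated assumptions, not warcio's implementation: percent_encode_value and PARAM_VALUE_RX are hypothetical names, and the pattern is the one shown for ENCODE_HEADER_RX, written as a raw string.

```python
import re
from urllib.parse import quote

# Same pattern as ENCODE_HEADER_RX, written as a raw string (assumed use:
# locating the value part of a name="value" parameter).
PARAM_VALUE_RX = re.compile(r'[=]["\']?([^;"]+)["\']?(?=[;]?)')


def percent_encode_value(value, encoding='UTF-8'):
    """Hypothetical helper: return an ASCII-safe form of one header value."""
    try:
        value.encode('ascii')
        return value  # already plain ASCII, nothing to do
    except UnicodeEncodeError:
        pass

    if ';' not in value:
        # No parameters: percent-encode the whole value.
        return quote(value, encoding=encoding)

    def to_ext_value(match):
        # name="value" becomes name*=UTF-8''<percent-encoded value>
        encoded = quote(match.group(1), encoding=encoding)
        return "*={0}''{1}".format(encoding, encoded)

    return PARAM_VALUE_RX.sub(to_ext_value, value)


print(percent_encode_value('attachment; filename="café.txt"'))
```

The `*=UTF-8''` prefix is the extended-parameter form defined by RFC 8187: charset, an empty language tag between the two quotes, then the percent-encoded value.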

get(name, default_value=None): Return the value of the header name if found, otherwise default_value.

__getitem__(name, default_value=None): Return the value of the header name if found, otherwise default_value.

warcio.archiveiterator.ArchiveIterator

class warcio.archiveiterator.ArchiveIterator(fileobj, no_record_parse=False, verify_http=False, arc2warc=False, ensure_http_headers=False, block_size=16384, check_digests=False)

Iterate over records in WARC and ARC files, both gzip chunk compressed and uncompressed.

The iterator automatically detects the format and decompresses if necessary.

GZIP_ERR_MSG = '\n ERROR: non-chunked gzip file detected, gzip block continues\n beyond single record.\n\n This file is probably not a multi-member gzip but a single gzip file.\n\n To allow seek, a gzipped {1} must have each record compressed into\n a single gzip member and concatenated together.\n\n This file is likely still valid and can be fixed by running:\n\n warcio recompress <path/to/file> <path/to/new_file>\n\n'
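The difference between a multi-member gzip and a single gzip stream, which the error message above refers to, can be shown with the standard library alone. The sketch below is not warcio code; split_members is a hypothetical helper and the "records" are placeholder byte strings rather than real WARC records.

```python
import gzip
import zlib

records = [b'record one\r\n', b'record two\r\n']

# Each record compressed as its own gzip member, then concatenated.
members = [gzip.compress(r) for r in records]
multi_member = b''.join(members)

# One gzip stream spanning both records (the layout the error describes).
single_stream = gzip.compress(b''.join(records))


def split_members(data):
    """Hypothetical helper: decompress gzip members one at a time.

    Returns a list of (offset, payload) pairs, one per member.
    """
    found = []
    offset = 0
    while offset < len(data):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decomp.decompress(data[offset:])
        found.append((offset, payload))
        # Bytes after the end of this member are left in unused_data.
        offset = len(data) - len(decomp.unused_data)
    return found


print(split_members(multi_member))
print(split_members(single_stream))
```

With one member per record, every record starts at a byte offset that can be decompressed on its own, which is what makes seeking possible. In the single-stream layout only offset 0 is usable.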

INC_RECORD = ' WARNING: Record not followed by newline, perhaps Content-Length is invalid\n Offset: {0}\n Remainder: {1}\n'

close()

read_to_end(record=None): Read the remainder of the stream. If a digester is included, update it with the data read.

get_record_offset()

get_record_length()

warcio.warcwriter.WARCWriter

class warcio.warcwriter.WARCWriter(filebuf, *args, **kwargs)

write_record(record, params=None)
