API Reference

warcio.statusandheaders.StatusAndHeaders

class warcio.statusandheaders.StatusAndHeaders(statusline, headers, protocol='', total_len=0, is_http_request=False)
ENCODE_HEADER_RX = re.compile('[=]["\\\']?([^;"]+)["\\\']?(?=[;]?)')

Representation of a parsed HTTP-style status line and headers. The status line is the first line of the request or response. Headers is a list of (name, value) tuples. An optional protocol, which appears on the first line, may be specified. If is_http_request is true, the HTTP verb (instead of the protocol) is split from the start of statusline.
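The verb-splitting behaviour described above can be illustrated with a small standard-library sketch. The helper name split_statusline is hypothetical and is not part of warcio; the sketch only assumes that the verb is the first space-delimited token of a request line.

```python
def split_statusline(statusline, protocol="", is_http_request=False):
    """Hypothetical helper (not warcio's API): return (protocol, statusline)."""
    if is_http_request:
        # For a request line such as "GET /index.html HTTP/1.1",
        # the leading verb is split off and takes the place of the protocol.
        protocol, statusline = statusline.split(" ", 1)
    return protocol, statusline


print(split_statusline("GET /index.html HTTP/1.1", is_http_request=True))
print(split_statusline("200 OK", protocol="HTTP/1.1"))
```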

get_header(name, default_value=None)

Return the value of the header with the given name (matched case-insensitively) if found; otherwise return default_value.

add_header(name, value)
replace_header(name, value)

Replace the header with a new value, or add a new header if it is not present. Return the old header value, if any.

remove_header(name)

Remove the header (case-insensitive). Return True if the header was removed, False otherwise.
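The lookup, replace and remove semantics documented above can be sketched over a plain list of (name, value) tuples. These are free functions written for illustration only, not warcio's implementation; keeping the existing spelling of the header name on replacement is an assumption of this sketch.

```python
def get_header(headers, name, default_value=None):
    """Return the value of the first header matching name, ignoring case."""
    name_lower = name.lower()
    for key, value in headers:
        if key.lower() == name_lower:
            return value
    return default_value


def replace_header(headers, name, value):
    """Replace a header in place, or append it; return the old value, if any."""
    name_lower = name.lower()
    for index, (key, old_value) in enumerate(headers):
        if key.lower() == name_lower:
            # Assumption: the existing spelling of the name is kept.
            headers[index] = (key, value)
            return old_value
    headers.append((name, value))
    return None


def remove_header(headers, name):
    """Remove the first matching header; return True if one was removed."""
    name_lower = name.lower()
    for index, (key, _) in enumerate(headers):
        if key.lower() == name_lower:
            del headers[index]
            return True
    return False


headers = [("Content-Type", "text/html"), ("Content-Length", "42")]
print(get_header(headers, "content-type"))
print(replace_header(headers, "CONTENT-LENGTH", "100"))
print(remove_header(headers, "X-Missing"))
```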

get_statuscode()

Return the status code part of the status line. Assumes there is no protocol in the statusline.

validate_statusline(valid_statusline)

Check that the statusline is valid, e.g. starts with a numeric code. If not, replace it with the passed-in valid_statusline.
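As a rough illustration of the two methods above, the following sketch extracts the leading token of a status line and falls back to a known-good line when that token is not a positive integer. Both functions are standalone stand-ins written for this example; treating "positive integer" as the validity test is an assumption, not a statement about warcio's exact rule.

```python
def get_statuscode(statusline):
    """Return the first space-delimited token, e.g. "200" from "200 OK"."""
    return statusline.split(" ", 1)[0]


def validate_statusline(statusline, valid_statusline):
    """Return statusline if it starts with a numeric code, else the fallback."""
    code = get_statuscode(statusline)
    try:
        is_valid = int(code) > 0
    except ValueError:
        # Non-numeric (or empty) leading token.
        is_valid = False
    return statusline if is_valid else valid_statusline


print(get_statuscode("404 Not Found"))
print(validate_statusline("Not Found", "200 OK"))
```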

add_range(start, part_len, total_len)

Add range headers indicating that this is a partial response.
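The arithmetic behind a partial-response range can be sketched as follows. The function range_headers is hypothetical, and the exact set of headers it returns is an assumption for illustration; the Content-Range value follows the standard HTTP form "bytes first-last/total", where the last byte position is inclusive.

```python
def range_headers(start, part_len, total_len):
    """Hypothetical helper: headers describing bytes [start, start + part_len)."""
    # The last byte position in Content-Range is inclusive, hence the -1.
    end = start + part_len - 1
    return [
        ("Content-Range", "bytes {0}-{1}/{2}".format(start, end, total_len)),
        ("Content-Length", str(part_len)),
        ("Accept-Ranges", "bytes"),
    ]


for name, value in range_headers(start=100, part_len=50, total_len=1000):
    print("{0}: {1}".format(name, value))
```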

compute_headers_buffer(header_filter=None)

Set buffer representing headers

to_str(filter_func=None)
to_bytes(filter_func=None, encoding='utf-8')
to_ascii_bytes(filter_func=None)

Attempt to encode the headers block as ASCII. If encoding fails, call percent_encode_non_ascii_headers() to encode any headers per RFCs.

percent_encode_non_ascii_headers(encoding='UTF-8')

Encode any headers that are not plain ASCII as UTF-8, as per https://tools.ietf.org/html/rfc8187#section-3.2.3 and https://tools.ietf.org/html/rfc5987#section-3.2.2
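The extended-parameter form defined by the RFCs referenced above can be sketched with urllib.parse.quote. The helper ascii_safe_param is hypothetical and handles a single parameter only; it is a simplified illustration of the encoding, not a reproduction of how warcio rewrites whole header values.

```python
from urllib.parse import quote


def ascii_safe_param(name, value, charset="UTF-8"):
    """Hypothetical helper: format one header parameter so it is pure ASCII."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        # RFC 8187 ext-value: charset, empty language tag, percent-encoded bytes.
        encoded = quote(value, safe="", encoding=charset)
        return "{0}*={1}''{2}".format(name, charset, encoded)
    # Plain ASCII values keep the ordinary quoted-string form.
    return '{0}="{1}"'.format(name, value)


print(ascii_safe_param("filename", "report.pdf"))
print(ascii_safe_param("title", "\u00a3 and \u20ac rates"))
```

The second call uses the sample string from RFC 8187 (a pound sign and a euro sign); percent-encoded hex digits are case-insensitive, so uppercase output from quote() is equivalent to the lowercase form printed in the RFC.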

get(name, default_value=None)

Return the value of the header with the given name (matched case-insensitively) if found; otherwise return default_value.

__getitem__(name, default_value=None)

Return the value of the header with the given name (matched case-insensitively) if found; otherwise return default_value.

warcio.archiveiterator.ArchiveIterator

class warcio.archiveiterator.ArchiveIterator(fileobj, no_record_parse=False, verify_http=False, arc2warc=False, ensure_http_headers=False, block_size=16384, check_digests=False)

Iterate over records in WARC and ARC files, both gzip chunk-compressed and uncompressed.

The indexer will automatically detect format, and decompress if necessary.
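The per-record ("chunked") gzip layout mentioned here, with each record compressed into its own gzip member and the members concatenated, can be demonstrated with the standard library alone. The function names below are hypothetical and the record payloads are placeholders rather than valid WARC records; the sketch only shows why member boundaries make per-record offsets possible.

```python
import gzip
import zlib


def compress_per_record(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def iter_members(data):
    """Yield (offset, payload) for each gzip member found in data."""
    offset = 0
    while offset < len(data):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decomp.decompress(data[offset:]) + decomp.flush()
        # Bytes left in unused_data belong to the following member(s).
        consumed = len(data) - offset - len(decomp.unused_data)
        yield offset, payload
        offset += consumed


records = [b"placeholder record one\r\n", b"placeholder record two\r\n"]
blob = compress_per_record(records)
for offset, payload in iter_members(blob):
    print(offset, payload)
```

Because every member starts at a known byte offset, a reader can seek directly to a record and decompress only that member, which a single gzip stream spanning the whole file does not allow.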

GZIP_ERR_MSG = '\n    ERROR: non-chunked gzip file detected, gzip block continues\n    beyond single record.\n\n    This file is probably not a multi-member gzip but a single gzip file.\n\n    To allow seek, a gzipped {1} must have each record compressed into\n    a single gzip member and concatenated together.\n\n    This file is likely still valid and can be fixed by running:\n\n    warcio recompress <path/to/file> <path/to/new_file>\n\n'
INC_RECORD = '    WARNING: Record not followed by newline, perhaps Content-Length is invalid\n    Offset: {0}\n    Remainder: {1}\n'
close()
read_to_end(record=None)

Read the remainder of the stream. If a digester is included, update it with the data read.
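The read-and-digest pattern described above can be sketched with hashlib and an in-memory stream. This standalone function is an illustration only, not warcio's method: it takes the stream and digester as explicit arguments, which is an assumption of the sketch, and borrows the 16384-byte default block size from the constructor signature.

```python
import hashlib
import io


def read_to_end(stream, digester=None, block_size=16384):
    """Consume the rest of stream in blocks; return the number of bytes read."""
    total = 0
    while True:
        buf = stream.read(block_size)
        if not buf:
            break
        total += len(buf)
        if digester is not None:
            # Feed every block to the digest so it covers the full payload.
            digester.update(buf)
    return total


stream = io.BytesIO(b"x" * 40000)
digester = hashlib.sha1()
print(read_to_end(stream, digester))
print(digester.hexdigest() == hashlib.sha1(b"x" * 40000).hexdigest())
```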

get_record_offset()
get_record_length()

warcio.warcwriter.WARCWriter

class warcio.warcwriter.WARCWriter(filebuf, *args, **kwargs)
write_record(record, params=None)