API Reference
warcio.statusandheaders.StatusAndHeaders
- class warcio.statusandheaders.StatusAndHeaders(statusline, headers, protocol='', total_len=0, is_http_request=False)
- ENCODE_HEADER_RX = re.compile('[=]["\\\']?([^;"]+)["\\\']?(?=[;]?)')
Representation of a parsed HTTP-style status line and headers. The status line is the first line of the request or response. Headers is a list of (name, value) tuples. An optional protocol, which appears on the first line, may be specified. If is_http_request is true, the HTTP verb (instead of the protocol) is split from the start of statusline.
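To make the is_http_request behaviour concrete, here is a standalone sketch in plain Python (not warcio code) of splitting the verb from the start of a request line. It assumes the split happens at the first space; the helper name split_verb is hypothetical.

```python
def split_verb(statusline):
    """Split the leading HTTP verb from a request line at the first space."""
    verb, rest = statusline.split(' ', 1)
    return verb, rest


verb, rest = split_verb('GET /index.html HTTP/1.1')
print(verb)  # GET
print(rest)  # /index.html HTTP/1.1
```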
- get_header(name, default_value=None)
Return the value of the header name if found (the name is matched case-insensitively); otherwise return default_value.
- add_header(name, value)
- replace_header(name, value)
Replace the header with a new value, or add a new header if it is not present. Return the old header value, if any.
- remove_header(name)
Remove the header (case-insensitive). Return True if a header was removed, False otherwise.
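The lookup, replace and remove semantics described above can be sketched on a plain list of (name, value) tuples. This is an illustrative, standalone version rather than the class's own code: the free functions below are hypothetical stand-ins for the methods, and the sketch assumes that lookups return the header's value and that all name matching ignores case.

```python
def get_header(headers, name, default_value=None):
    """Return the value of the first header matching name, ignoring case."""
    name_lower = name.lower()
    for header_name, value in headers:
        if header_name.lower() == name_lower:
            return value
    return default_value


def replace_header(headers, name, value):
    """Replace a matching header in place, or append it; return the old value."""
    name_lower = name.lower()
    for index, (header_name, old_value) in enumerate(headers):
        if header_name.lower() == name_lower:
            headers[index] = (header_name, value)
            return old_value
    headers.append((name, value))
    return None


def remove_header(headers, name):
    """Remove one matching header; return True if something was removed."""
    name_lower = name.lower()
    for index, (header_name, _) in enumerate(headers):
        if header_name.lower() == name_lower:
            del headers[index]
            return True
    return False


headers = [('Content-Type', 'text/html'), ('Content-Length', '120')]
print(get_header(headers, 'content-type'))             # text/html
print(replace_header(headers, 'CONTENT-LENGTH', '5'))  # 120
print(remove_header(headers, 'X-Missing'))             # False
```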
- get_statuscode()
Return the status code part of the response status line. (Assumes no protocol in the statusline.)
- validate_statusline(valid_statusline)
Check that the statusline is valid, e.g. that it starts with a numeric code. If not, replace it with the passed-in valid_statusline.
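As a rough illustration of the two status line helpers, the sketch below works on plain strings instead of a StatusAndHeaders object. It assumes "valid" means that the first space-separated token parses as a positive integer, and it returns the line to use rather than modifying an object; both function bodies are approximations, not the library's code.

```python
def get_statuscode(statusline):
    """Return the first space-separated token of the status line."""
    return statusline.split(' ', 1)[0]


def validate_statusline(statusline, valid_statusline):
    """Return statusline if it starts with a positive integer, else the fallback."""
    try:
        if int(get_statuscode(statusline)) > 0:
            return statusline
    except ValueError:
        pass
    return valid_statusline


print(get_statuscode('200 OK'))                         # 200
print(validate_statusline('204 No Content', '200 OK'))  # 204 No Content
print(validate_statusline('OK', '200 OK'))              # 200 OK
```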
- add_range(start, part_len, total_len)
Add range headers indicating that this is a partial response.
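This reference does not list which headers add_range sets, so the following is only a sketch of the usual HTTP partial-response convention for the same three arguments. The helper name range_headers and the exact set of headers are assumptions; the point is the arithmetic, where the end offset in Content-Range is inclusive.

```python
def range_headers(start, part_len, total_len):
    """Build headers describing bytes [start, start + part_len) of total_len."""
    end = start + part_len - 1  # Content-Range uses an inclusive end offset
    return [
        ('Content-Range', 'bytes {0}-{1}/{2}'.format(start, end, total_len)),
        ('Content-Length', str(part_len)),
        ('Accept-Ranges', 'bytes'),
    ]


for name, value in range_headers(start=100, part_len=50, total_len=1000):
    print(name + ': ' + value)
# Content-Range: bytes 100-149/1000
# Content-Length: 50
# Accept-Ranges: bytes
```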
- compute_headers_buffer(header_filter=None)
Set the buffer representing the headers.
- to_str(filter_func=None)
- to_bytes(filter_func=None, encoding='utf-8')
- to_ascii_bytes(filter_func=None)
Attempt to encode the headers block as ASCII. If encoding fails, call percent_encode_non_ascii_headers() to encode any headers per the RFCs.
- percent_encode_non_ascii_headers(encoding='UTF-8')
Encode any headers that are not plain ASCII as UTF-8, as per https://tools.ietf.org/html/rfc8187#section-3.2.3 and https://tools.ietf.org/html/rfc5987#section-3.2.2
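The "try ASCII first, percent-encode on failure" approach can be illustrated with the standard library alone. The sketch below is not the library's implementation: encode_ext_value and content_disposition are hypothetical helpers, and the Content-Disposition filename parameter is just one convenient place where the RFC 8187 form (charset, two single quotes, then the percent-encoded value) shows up.

```python
from urllib.parse import quote


def encode_ext_value(value, encoding='UTF-8'):
    """Encode value in the RFC 8187 ext-value form: charset''percent-encoded."""
    return "{0}''{1}".format(encoding, quote(value, safe='', encoding=encoding))


def content_disposition(filename):
    """Build an ASCII-only Content-Disposition value for filename."""
    try:
        # Plain ASCII names can be sent as an ordinary quoted parameter.
        filename.encode('ascii')
        return 'attachment; filename="{0}"'.format(filename)
    except UnicodeEncodeError:
        # Otherwise fall back to the extended "filename*=" parameter.
        return 'attachment; filename*={0}'.format(encode_ext_value(filename))


print(content_disposition('report.pdf'))
# attachment; filename="report.pdf"
print(content_disposition('résumé.pdf'))
# attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.pdf
```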
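- get(name, default_value=None)
Alias of get_header(): return the value of the header name if found; otherwise return default_value.
- __getitem__(name, default_value=None)
Alias of get_header(): return the value of the header name if found; otherwise return default_value.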
warcio.archiveiterator.ArchiveIterator
- class warcio.archiveiterator.ArchiveIterator(fileobj, no_record_parse=False, verify_http=False, arc2warc=False, ensure_http_headers=False, block_size=16384, check_digests=False)
Iterate over records in WARC and ARC files, both gzip chunk-compressed and uncompressed.
The iterator automatically detects the format and decompresses if necessary.
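A typical use is to open the archive in binary mode and loop over the records. The sketch below assumes warcio is installed and that each record exposes rec_type and rec_headers, neither of which is documented in this section; the function name list_response_uris and its path argument are placeholders.

```python
def list_response_uris(path):
    """Return the WARC-Target-URI of every response record in the file at path."""
    # Imported here so this module still loads where warcio is not installed.
    from warcio.archiveiterator import ArchiveIterator

    uris = []
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uris.append(record.rec_headers.get_header('WARC-Target-URI'))
    return uris
```

Because the format is detected automatically, the same function should work for compressed and uncompressed files alike.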
- GZIP_ERR_MSG = '\n ERROR: non-chunked gzip file detected, gzip block continues\n beyond single record.\n\n This file is probably not a multi-member gzip but a single gzip file.\n\n To allow seek, a gzipped {1} must have each record compressed into\n a single gzip member and concatenated together.\n\n This file is likely still valid and can be fixed by running:\n\n warcio recompress <path/to/file> <path/to/new_file>\n\n'
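The difference between a multi-member gzip and a single gzip file, which this message refers to, can be shown with the standard library. The sketch below is not how warcio reads files; it only illustrates why per-record members make each record addressable by offset. The helper iter_members and the sample records are made up for the example.

```python
import gzip
import zlib

records = [b'record one\r\n\r\n', b'record two\r\n\r\n']

# Multi-member: compress each record on its own, then concatenate the members.
raw_members = [gzip.compress(r) for r in records]
multi_member = b''.join(raw_members)

# Single gzip: compress everything as one stream, so record boundaries are lost.
single = gzip.compress(b''.join(records))


def iter_members(data):
    """Yield (offset, decompressed_bytes) for each gzip member in data."""
    offset = 0
    while offset < len(data):
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decomp.decompress(data[offset:]) + decomp.flush()
        yield offset, payload
        # unused_data holds whatever follows the end of this member.
        offset = len(data) - len(decomp.unused_data)


print(len(list(iter_members(multi_member))))  # 2
print(len(list(iter_members(single))))        # 1
```

With one member per record, the offset of each member is also the offset of a record, which is what makes seeking possible.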
- INC_RECORD = ' WARNING: Record not followed by newline, perhaps Content-Length is invalid\n Offset: {0}\n Remainder: {1}\n'
- close()
- read_to_end(record=None)
Read the remainder of the stream. If a digester is included, update it with the data read.
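The idea of draining a stream while feeding a digester can be sketched independently of the iterator. The function below is a simplified stand-in with a different signature from the method: it assumes the digester has a hashlib-style update() method and borrows the 16384-byte block size from the constructor's default.

```python
import hashlib
import io


def read_to_end(stream, digester=None, block_size=16384):
    """Consume the rest of stream in blocks; return the number of bytes read."""
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += len(block)
        if digester is not None:
            digester.update(block)
    return total


stream = io.BytesIO(b'header line\r\n' + b'x' * 40000)
stream.readline()  # a caller has already consumed part of the stream
digester = hashlib.sha1()
remaining = read_to_end(stream, digester)
print(remaining)   # 40000
```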
- get_record_offset()
- get_record_length()