[issue20132] Many incremental codecs don’t handle fragmented data

Sun Jan 5 14:48:58 CET 2014

New submission from Martin Panter:

Many of the incremental codecs do not handle fragmented data correctly. I think I was originally interested in using the Base-64 and Quoted-printable codecs incrementally, and testing other codecs today revealed many more issues. Many of the issues reflect missing functionality, so the simplest solution may be to document which codecs do not work incrementally.

Incremental decoding issues:

>>> str().join(codecs.iterdecode(iter((b"\\", b"u2013")), "unicode-escape"))
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
# The same error occurs with raw-unicode-escape.

>>> bytes().join(codecs.iterdecode(iter((b"3", b"3")), "hex-codec"))
binascii.Error: Odd-length string

>>> bytes().join(codecs.iterdecode(iter((b"A", b"A==")), "base64-codec"))
binascii.Error: Incorrect padding

>>> bytes().join(codecs.iterdecode(iter((b"=", b"3D")), "quopri-codec"))
b'3D'  # Should return b"="

>>> codecs.getincrementaldecoder("uu-codec")().decode(b"begin ")
ValueError: Truncated input data
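
The hex-codec failure above happens because each fragment is decoded on its own. A minimal sketch of the buffering an incremental decoder would need, assuming a hypothetical class name BufferedHexDecoder (this is not the standard library's implementation, and it omits getstate()/setstate()):

```python
import binascii
import codecs


class BufferedHexDecoder(codecs.IncrementalDecoder):
    """Hypothetical decoder that carries an unpaired hex digit between calls."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.pending = b""  # at most one leftover hex digit

    def decode(self, input, final=False):
        data = self.pending + bytes(input)
        # Decode only whole digit pairs; hold back a trailing odd digit.
        cut = len(data) - (len(data) % 2)
        self.pending = data[cut:]
        if final and self.pending:
            # Nothing more will arrive, so an odd digit is a real error.
            raise binascii.Error("Odd-length string")
        return binascii.unhexlify(data[:cut])

    def reset(self):
        self.pending = b""


decoder = BufferedHexDecoder()
result = decoder.decode(b"3") + decoder.decode(b"3", final=True)
```

Here result is b"3", which is what the iterdecode() call above would be expected to produce. The same hold-back idea would presumably apply to the Base-64 and Quoted-printable decoders, with different group boundaries.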

Incremental encoding issues:

>>> e = codecs.getincrementalencoder("base64-codec")(); codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "base64-codec")
b'1'  # Should be b"12"

>>> e = codecs.getincrementalencoder("quopri-codec")(); e.encode(b"1" * 50) + e.encode(b"2" * 50, final=True)
b'1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222'
# The 100-character line exceeds the 76-character Quoted-printable limit, so I suspect it should have been split in two

>>> e = codecs.getincrementalencoder("uu-codec")(); codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "uu-codec")
b'1'  # Should be b"12"
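
For the Base-64 case, the encoder would need to hold back bytes until it has a complete 3-byte group. A rough sketch, assuming a hypothetical class name BufferedBase64Encoder; unlike the existing codec it uses base64.b64encode(), so it does not insert the line breaks that base64.encodebytes() adds:

```python
import base64
import codecs


class BufferedBase64Encoder(codecs.IncrementalEncoder):
    """Hypothetical encoder that emits only complete 3-byte groups until final."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.pending = b""  # zero to two bytes awaiting a full group

    def encode(self, input, final=False):
        data = self.pending + bytes(input)
        if final:
            cut = len(data)  # flush everything, with padding if needed
        else:
            cut = len(data) - (len(data) % 3)
        self.pending = data[cut:]
        return base64.b64encode(data[:cut])

    def reset(self):
        self.pending = b""


encoder = BufferedBase64Encoder()
encoded = encoder.encode(b"1") + encoder.encode(b"2", final=True)
decoded = base64.b64decode(encoded)
```

With this sketch, decoded is b"12" rather than b"1", because padding is only written once final=True is seen.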

I also noticed iterdecode() does not work with “uu-codec”:

>>> bytes().join(codecs.iterdecode(iter((b"begin 666 <data>\n \nend\n",)), "uu-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1032, in iterdecode
    output = decoder.decode(b"", True)
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 80, in decode
    return uu_decode(input, self.errors)[0]
  File "/usr/lib/python3.3/encodings/uu_codec.py", line 45, in uu_decode
    raise ValueError('Missing "begin" line in input data')
ValueError: Missing "begin" line in input data

iterencode() also fails with every bytes-to-bytes encoder, because it always passes an empty str, rather than an empty bytes object, to IncrementalEncoder.encode() for the final call:

>>> bytes().join(codecs.iterencode(iter(()), "base64-codec"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
    output = encoder.encode("", True)
  File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
    return base64.encodebytes(input)
  File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
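
As a possible workaround, the encoder can be driven by hand with an empty bytes object for the final call. The helper name iterencode_bytes below is hypothetical, and the sketch only avoids the TypeError; it does not address the fragmentation bugs in the encoders themselves:

```python
import codecs


def iterencode_bytes(iterator, encoding, errors="strict"):
    """Hypothetical variant of codecs.iterencode() for bytes-to-bytes codecs."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in iterator:
        output = encoder.encode(chunk)
        if output:
            yield output
    # Flush with an empty bytes object rather than an empty str.
    output = encoder.encode(b"", True)
    if output:
        yield output


empty = b"".join(iterencode_bytes(iter(()), "base64_codec"))
hexed = b"".join(iterencode_bytes(iter((b"ab", b"c")), "hex_codec"))
```

The hex codec is used for the second call because each byte encodes independently there, so splitting the input does not change the output.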

Finally, incremental UTF-7 encoding is suboptimal, and the decoder seems to buffer unlimited data, both defeating the purpose of using an incremental codec:

>>> bytes().join(codecs.iterencode("\xA9" * 2, "utf-7"))
b'+AKk-+AKk-'  # b"+AKkAqQ-" would be better

>>> d = codecs.getincrementaldecoder("utf-7")()
>>> d.decode(b"+")
''
>>> any(d.decode(b"AAAAAAAA" * 100000) for _ in range(10))
False  # No data returned: everything must be buffered
>>> d.decode(b"-") == "\x00" * 3000000
True  # Returned all buffered data in one go

----------
components: Library (Lib)
messages: 207374
nosy: vadmium
priority: normal
severity: normal
status: open
title: Many incremental codecs don’t handle fragmented data
type: behavior
versions: Python 3.3

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20132>
_______________________________________