[Python-3000] Comment on iostack library

Fri Sep 1 00:24:35 CEST 2006

tomer filiba wrote:

> [...]
> besides, encoding suffers from many issues. suppose you have a
> damaged UTF8 file, which you read char-by-char. when you reach the
> damaged part, you'll never be able to "skip" it, as you'll just keep
> read()ing bytes, hoping to make a character out of them, until you
> reach EOF, i.e.:
> 
> def read_char(self):
>     buf = ""
>     while not self._stream.eof:
>         buf += self._stream.read(1)
>         try:
>             return buf.decode("utf8")
>         except ValueError:
>             pass
> 
> which leads me to the following thought: maybe we should have
> an "enhanced" encoding library for py3k, which would report
> *incomplete* data differently from *invalid* data. today it's just a
> ValueError: suppose decode() would raise IncompleteDataError
> when the given data is not sufficient to be decoded successfully,
> and ValueError when the data is just corrupted.
> 
> that could aid iostack greatly.

We *do* have that functionality in Python 2.5: an incremental decoder
buffers an incomplete byte sequence passed to decode() and completes it
on the next call, while invalid data raises an exception immediately.
Only when final=True is passed to decode() are incomplete and invalid
data treated the same way: both raise an exception.

Incomplete input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1")
u''
>>> d.decode("\x88")
u''
>>> d.decode("\xb4")
u'\u1234'

Invalid input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\x80")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

Incomplete input with final=True:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1", final=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
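
For the read_char() case quoted above, the same mechanism could be used
roughly as follows. This is only a sketch, written against Python 3
bytes/str rather than the 2.5 session shown here; read_chars() is a
hypothetical helper (not part of codecs or iostack), and it assumes the
stream's read(1) returns b"" at EOF.

```python
import codecs
import io


def read_chars(stream, encoding="utf-8", errors="strict"):
    """Yield decoded text from a binary stream, reading one byte at a time.

    Hypothetical helper for illustration only; assumes stream.read(1)
    returns b"" at EOF.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    while True:
        byte = stream.read(1)
        if not byte:
            # EOF: final=True turns a dangling partial sequence into an
            # error (or a replacement, depending on `errors`).
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            return
        # Incomplete sequences are buffered inside the decoder and give
        # "", invalid bytes are reported (or replaced) right away.
        text = decoder.decode(byte)
        if text:
            yield text


if __name__ == "__main__":
    good = io.BytesIO("a\u1234b".encode("utf-8"))
    print(list(read_chars(good)))

    damaged = io.BytesIO(b"a\x80b")
    print(list(read_chars(damaged, errors="replace")))
```

With errors="strict" the damaged byte raises UnicodeDecodeError at the
point of damage instead of being accumulated until EOF; with
errors="replace" the decoder substitutes U+FFFD and carries on, which
gives the "skip" behaviour asked for.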

Servus,
   Walter