[Python-3000] Comment on iostack library

Fri Sep 1 10:05:10 CEST 2006

very well, i'll use it. thanks.

On 9/1/06, Walter Dörwald <walter at livinglogic.de> wrote:
> tomer filiba wrote:
>
> > [...]
> > besides, encoding suffers from many issues. suppose you have a
> > damaged UTF-8 file, which you read char-by-char. when you reach the
> > damaged part, you'll never be able to "skip" it, as you'll just keep
> > read()ing bytes, hoping to make a character out of them, until you
> > reach EOF, e.g.:
> >
> > def read_char(self):
> >     buf = ""
> >     while not self._stream.eof:
> >         buf += self._stream.read(1)
> >         try:
> >             return buf.decode("utf8")
> >         except ValueError:
> >             pass
> >
> > which leads me to the following thought: maybe we should have
> > an "enhanced" encoding library for py3k, which would report
> > *incomplete* data differently from *invalid* data. today it's just a
> > ValueError: suppose decode() would raise IncompleteDataError
> > when the given data is not sufficient to be decoded successfully,
> > and ValueError when the data is just corrupted.
> >
> > that could aid iostack greatly.
>
> We *do* have that functionality in Python 2.5: incremental decoders can
> retain incomplete byte sequences on the call to the decode() method
> until the next call. Only when final=True is passed in the decode() call
> will it treat incomplete and invalid data in the same way: by raising an
> exception.
>
> Incomplete input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1")
> u''
> >>> d.decode("\x88")
> u''
> >>> d.decode("\xb4")
> u'\u1234'
>
> Invalid input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\x80")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte
>
> Incomplete input with final=True:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1", final=True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
>
> Servus,
>    Walter
>
>
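
for reference, a minimal sketch of how the incremental decoder shown above
could separate *incomplete* from *invalid* input, and how a char-by-char
reader could get past a damaged region instead of reading to EOF. it
assumes Python 3 (bytes input, codecs.getincrementaldecoder) rather than
the 2.5 sessions quoted above. classify() and iter_text() are hypothetical
helper names, not part of codecs or iostack.

```python
import codecs
import io


def classify(data, encoding="utf-8"):
    """Return 'complete', 'incomplete' or 'invalid' for a bytes object."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False (the default): a trailing partial sequence is
        # buffered inside the decoder instead of raising.
        decoder.decode(data)
    except UnicodeDecodeError:
        return "invalid"
    # getstate() returns (buffered_bytes, flags); leftover bytes mean the
    # input ended in the middle of a character.
    pending, _flags = decoder.getstate()
    return "incomplete" if pending else "complete"


def iter_text(stream, encoding="utf-8"):
    """Yield decoded text from a binary stream, reading one byte at a time."""
    # errors="replace" turns damaged bytes into U+FFFD, so the reader
    # moves past them rather than accumulating bytes until EOF.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    while True:
        byte = stream.read(1)
        # An empty read means EOF; final=True flushes any partial sequence.
        text = decoder.decode(byte, final=not byte)
        if text:
            yield text
        if not byte:
            break


if __name__ == "__main__":
    print(classify(b"\xe1\x88"))      # partial three-byte sequence
    print(classify(b"\x80"))          # stray continuation byte
    damaged = io.BytesIO(b"a\xe1\x88\xb4\x80b")
    print(ascii("".join(iter_text(damaged))))
```

a reader that wants an exception instead of U+FFFD could keep the default
errors="strict" and handle UnicodeDecodeError itself, at the cost of
deciding how many bytes to discard before resuming.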