[Python-3000] Comment on iostack library
Walter Dörwald
walter at livinglogic.de
Fri Sep 1 00:24:35 CEST 2006
tomer filiba wrote:
> [...]
> besides, encoding suffers from many issues. suppose you have a
> damaged UTF8 file, which you read char-by-char. when we reach the
> damaged part, you'll never be able to "skip" it, as we'll just keep
> read()ing bytes, hoping to make a character out of them, until we
> reach EOF, i.e.:
>
> def read_char(self):
>     buf = ""
>     while not self._stream.eof:
>         buf += self._stream.read(1)
>         try:
>             return buf.decode("utf8")
>         except ValueError:
>             pass
>
> which leads me to the following thought: maybe we should have
> an "enhanced" encoding library for py3k, which would report
> *incomplete* data differently from *invalid* data. today it's just a
> ValueError: suppose decode() would raise IncompleteDataError
> when the given data is not sufficient to be decoded successfully,
> and ValueError when the data is just corrupted.
>
> that could aid iostack greatly.
We *do* have that functionality in Python 2.5: an incremental decoder
retains an incomplete byte sequence from one decode() call and completes
it on the next. Only when final=True is passed to decode() does the
decoder treat incomplete and invalid data the same way: by raising an
exception.
Incomplete input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1")
u''
>>> d.decode("\x88")
u''
>>> d.decode("\xb4")
u'\u1234'
Invalid input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\x80")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte
Incomplete input with final=True:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1", final=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: unexpected end of data
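Applied to the read_char() loop quoted above, this could look roughly
like the sketch below. It is only an illustration: read_chars is a
made-up helper name, the stream is assumed to be any object whose
read(1) returns a byte string (empty at EOF), and the code is written
for a Python where byte strings and text are separate types (bytes
literals, io.BytesIO), not for 2.5 as shipped.

```python
import codecs
import io


def read_chars(stream, encoding="utf-8", errors="strict"):
    """Yield decoded characters from a binary stream, reading one byte at a time.

    Hypothetical helper: incomplete sequences are buffered by the
    incremental decoder, invalid bytes are handled according to `errors`.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    while True:
        byte = stream.read(1)
        at_eof = not byte
        # An empty read means EOF; final=True then turns any bytes still
        # buffered in the decoder into an error instead of waiting for more.
        text = decoder.decode(byte, final=at_eof)
        for ch in text:
            yield ch
        if at_eof:
            break


# A complete three-byte sequence between two ASCII characters.
complete = list(read_chars(io.BytesIO(b"a\xe1\x88\xb4b")))

# A stray continuation byte, skipped via the "replace" error handler.
damaged = list(read_chars(io.BytesIO(b"a\x80b"), errors="replace"))
```

With the default errors="strict", the damaged byte raises
UnicodeDecodeError as soon as it is read, while a truncated sequence at
the end of the stream raises only once EOF is reached. The reader never
loops to EOF hoping that more bytes will help.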
Servus,
Walter