Python 3.2 bug? Reading the last line of a file

Ian Kelly ian.g.kelly at gmail.com
Wed May 25 21:06:04 EDT 2011


On Wed, May 25, 2011 at 3:52 PM, MRAB <python at mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes.  For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6:
truncated data

The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data.  However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state.  Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent.  This is just a
demonstration of how the seek() value can include decoder state.

Cheers,
Ian



More information about the Python-list mailing list