Python 3.2 bug? Reading the last line of a file
Ian Kelly
ian.g.kelly at gmail.com
Wed May 25 21:06:04 EDT 2011
On Wed, May 25, 2011 at 3:52 PM, MRAB <python at mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.
If you take a look at the source code, tell() returns an integer that
packs decoder state data into its upper bits, above the byte offset.
For example:
>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data
The problem, of course, is the initial space: it pushes the BOM off the
start of the file and leaves an odd number of bytes, which throws off
the decoder. We can try to seek past it:
>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'
But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data. However:
>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'
And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state. Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent. This is just a
demonstration of how the seek() value can include decoder state.
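To make the arithmetic concrete, here is a rough sketch of how that
seek() value splits apart. It assumes the byte offset lives in the low
64 bits and everything above that is decoder state; that layout is an
implementation detail, and the name split_cookie is made up for this
illustration rather than anything the io module provides.

```python
# Hypothetical helper, not part of the io module: split a text-mode
# seek()/tell() value into a byte offset and the remaining state bits.
# ASSUMPTION: the byte offset occupies the low 64 bits. This is an
# implementation detail and may differ between platforms and versions.
POSITION_BITS = 64


def split_cookie(cookie):
    """Return (byte_offset, state_bits) for an opaque seek/tell value."""
    position = cookie & ((1 << POSITION_BITS) - 1)  # low 64 bits
    state_bits = cookie >> POSITION_BITS            # everything above them
    return position, state_bits


cookie = 1 + (2 << 65)
position, state_bits = split_cookie(cookie)
print(cookie)      # 73786976294838206465
print(position)    # 1
print(state_bits)  # 4
```

Under that assumption the low part is the byte offset 1, and the
nonzero upper part is what gets handed to the decoder as state rather
than being treated as a file position.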
Cheers,
Ian