[issue4862] utf-16 BOM is not skipped after seek(0)

Marc-Andre Lemburg report at bugs.python.org
Thu Jan 8 14:50:51 CET 2009


Marc-Andre Lemburg <mal at egenix.com> added the comment:

On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
> First write a utf-16 file with its signature:
> 
>>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
>>>> f1.write('0123456789')
>>>> f1.close()
> 
> Then read it twice:
> 
>>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
>>>> print('read1', ascii(f2.read()))
> read1 '0123456789'
>>>> f2.seek(0)
> 0
>>>> print('read2', ascii(f2.read()))
> read2 '\ufeff0123456789'
> 
> The second read returns the BOM!
> This is because the zero in seek(0) is a "cookie" which contains both the position 
> and the decoder state. Unfortunately, state=0 means 'endianness has been determined: 
> native order'.
> 
> maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
> The patch implement this idea.

This is a problem with the utf_16.py codec, not the io layer.
Opening a file in append mode is something that the io layer
would have to handle, since the codec doesn't know anything about
the underlying file mode.

Using .reset() will not help. The code for the StreamReader
and StreamWriter in utf_16.py will have to be modified to undo
the adjustment of the .encode() and .decode() method after using
.seek(0).

Note that there's also the case .seek(1) - I guess this must
be considered as resulting in undefined behavior.

----------
nosy: +lemburg

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue4862>
_______________________________________


More information about the Python-bugs-list mailing list