[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Antoine Pitrou solipsis at pitrou.net
Fri Jan 8 17:31:33 CET 2010


Guido van Rossum <guido <at> python.org> writes:
> 
> On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver <tseaver <at> palladion.com> wrote:
> > The BOM should not be seekable if the file is opened with the proposed
> > "guess encoding from BOM" mode:  it isn't properly part of the stream at
> > all in that case.
> 
> This feels about right to me. There are still questions though:
> immediately after opening a file with a BOM, what should .tell()
> return?

tell() in the context of text I/O is specified to return an "opaque cookie". So
whatever value it returns would probably be fine, as long as seeking to that
value leaves the file in an acceptable state.
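
As a rough sketch of what "opaque cookie" means in practice (the sample bytes
and the variable names are made up for illustration), the only thing a caller
should rely on is that a value obtained from tell() can be handed back to
seek():

```python
import io

# UTF-16 little-endian BOM followed by "abcd" (illustrative data only)
raw = io.BytesIO(b'\xff\xfea\x00b\x00c\x00d\x00')
f = io.TextIOWrapper(raw, encoding='utf-16')

first = f.read(2)        # consume two characters
cookie = f.tell()        # opaque: not promised to be a byte or character count
rest = f.read()          # read the remainder

f.seek(cookie)           # go back to the saved position
rest_again = f.read()    # same remainder as before
```

The numeric value of the cookie is deliberately not inspected here, since it
may also encode decoder state and need not match a byte offset.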

Rewinding (seeking to 0) in the presence of a BOM is already reasonably
supported by the TextIOWrapper object:

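>>> import codecs, io
>>> # A bare incremental decoder only treats the first BOM as a signature:
>>> dec = codecs.getincrementaldecoder('utf-16')()
>>> dec.decode(b'\xff\xfea\x00b\x00')
'ab'
>>> # Byte order is already known, so the BOM now decodes as U+FEFF.
>>> dec.decode(b'\xff\xfea\x00b\x00')
'\ufeffab'
>>>
>>> # TextIOWrapper handles the BOM again after rewinding:
>>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
>>> f = io.TextIOWrapper(bio, encoding='utf-16')
>>> f.read()
'ab'
>>> f.seek(0)
0
>>> f.read()
'ab'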

There are tests for this in test_io.py (test_encoded_writes, line 1929, and
test_append_bom and test_seek_bom, line 2045).
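
To spell out the difference between the two sessions above: the bare
incremental decoder remembers the byte order it detected, so feeding it the
BOM a second time yields a literal U+FEFF unless the decoder is reset first.
The sketch below only illustrates that reset step with the codecs API. That
TextIOWrapper does something equivalent internally when seeking to 0 is my
assumption here, not something the session itself demonstrates.

```python
import codecs

data = b'\xff\xfea\x00b\x00'   # UTF-16 little-endian BOM followed by "ab"

dec = codecs.getincrementaldecoder('utf-16')()
first = dec.decode(data)    # BOM consumed as a signature, byte order detected
second = dec.decode(data)   # byte order already fixed: BOM decodes as U+FEFF

dec.reset()                 # forget the detected byte order
third = dec.decode(data)    # BOM is treated as a signature again
```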

Regards

Antoine.




