[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Tres Seaver tseaver at palladion.com
Fri Jan 8 22:09:54 CET 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 10:12 PM, Tres Seaver <tseaver at palladion.com> wrote:
>> The BOM should not be seekable if the file is opened with the proposed
>> "guess encoding from BOM" mode:  it isn't properly part of the stream at
>> all in that case.
> 
> This feels about right to me. There are still questions though:
> immediately after opening a file with a BOM, what should .tell()
> return? And regardless of that, .seek(0) should put the file in that
> same initial state.

I think the behavior should be something like:

 >>> f = open('/path/to/maybe-BOM-encoded-file', 'r', encoding='BOM')
 >>> f.tell()
 0L
 >>> f.seek(0, 2) # seek to the end of the stream
 >>> f.tell() # count of unicode chars in decoded stream
 45L
 >>> f.seek(0)
 >>> f.read(1) # read first unicode char decoded from stream.
 'A'

In other words, the BOM is not readable / seekable at all:  it is
invisible to the consumer of the decoded stream.
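
For comparison, here is a rough sketch of how a caller could approximate
that behavior with what the io and codecs modules already provide. The
helper name open_guessing_bom and its "default" parameter are hypothetical;
they are not an existing API and not the proposed open() signature. The
sketch assumes that sniffing the first four bytes is enough, and it relies
on the "utf-8-sig", "utf-16" and "utf-32" codecs consuming the BOM while
decoding.

```python
import codecs
import io

# Hypothetical lookup table. The UTF-32 BOMs are checked first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_guessing_bom(path, default="utf-8"):
    """Open path for text reading, choosing the codec from a leading BOM.

    Hypothetical helper: falls back to `default` when no BOM is found.
    """
    # Sniff the first bytes in binary mode, then reopen in text mode.
    with open(path, "rb") as raw:
        head = raw.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            # These codecs strip the BOM, so it never reaches the reader.
            return io.open(path, "r", encoding=encoding)
    return io.open(path, "r", encoding=default)
```

With a file produced this way, tell() right after opening returns 0, and
seek(0) followed by read(1) yields the first decoded character rather than
the BOM. One difference from the session above: in the current io module,
tell() on a text stream returns an opaque position cookie, not a count of
decoded characters.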


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktHnyIACgkQ+gerLs4ltQ6s3QCgznD+7FbUzfCbe5TS6OcoXjMg
rdgAoJAMEXe2xwLCIwJaZ6XA6rVyTIAi
=oXb3
-----END PGP SIGNATURE-----



