No subject


Mon Jan 18 10:37:10 CET 2010


Hi,

Thanks for all the answers! I will try to sum up all ideas here.


(1) Change default open() behaviour or make it optional?

Guido would like to add an option and keep open() unchanged. He wrote that 
checking for a BOM and using the system locale are too different to share the 
same option (encoding=None).

Antoine would like to check for a BOM by default, because to him both options 
(system locale vs checking for a BOM) amount to the same thing.

About Antoine's choice (encoding=None): which file modes would check for a BOM? 
I would like to answer "only the read-only mode", but then open(filename, "r") 
and open(filename, "r+") would behave differently?

  => 1 point for Guido (encoding="BOM" is more explicit)

Antoine's choice has the advantage of directly supporting UTF-8+BOM, UTF-16 and 
UTF-32 for all applications and all modules using open(filename).

  => 1 point for Antoine
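
To illustrate what is at stake for UTF-8+BOM files, here is a minimal sketch 
using only codecs that already exist. The sample bytes are made up for the 
example; it only shows that a plain UTF-8 decode keeps the BOM in the text 
while "utf-8-sig" drops it.

```python
import codecs

# Bytes of a hypothetical file saved as UTF-8 with a BOM.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain UTF-8 keeps the BOM as a U+FEFF character at the start of the text.
plain = data.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM when one is present.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```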


(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and should 
not be limited to open().

Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
name or a binary file-like object: it returns the encoding and seeks to the 
file start or just after the BOM.
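
A rough sketch of what such a function could look like, to make the 
discussion concrete. This is not an existing codecs function: the name 
guess_file_encoding, the "default" parameter and the returned encoding names 
are assumptions, and only the binary file-like object case is handled.

```python
import codecs
import io

# Longest BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_file_encoding(fileobj, default=None):
    """Hypothetical helper: needs a seekable binary file-like object."""
    start = fileobj.tell()
    head = fileobj.read(4)
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            # Leave the file positioned just after the BOM.
            fileobj.seek(start + len(bom))
            return encoding
    # No BOM found: go back to where we started.
    fileobj.seek(start)
    return default


bio = io.BytesIO(b'\xff\xfea\x00b\x00')
encoding = guess_file_encoding(bio)
text = bio.read().decode(encoding)
```

Note the read() and seek() calls: these are the extra file operations, and 
the seek() is what fails on a non-seekable stream.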

I dislike this function because it requires extra file operations (an optional 
open(), then read() and seek()) and it doesn't work if the file is not seekable 
(e.g. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to 
avoid extra file operations.

Note: I implemented the BOM check in TextIOWrapper; so it's already usable for 
any file-like object.
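
The idea of checking without seek() can be sketched outside of TextIOWrapper 
with BufferedReader.peek(), which looks ahead without consuming data. This is 
not the patch itself: Pipe and wrap_with_bom_check are hypothetical names, and 
the existing "utf-8-sig", "utf-16" and "utf-32" codecs are used to consume the 
BOM.

```python
import codecs
import io


class Pipe(io.RawIOBase):
    """Tiny non-seekable binary stream, standing in for a pipe."""

    def __init__(self, data):
        self._data = data

    def readable(self):
        return True

    def readinto(self, b):
        n = min(len(b), len(self._data))
        b[:n] = self._data[:n]
        self._data = self._data[n:]
        return n


# Codecs that skip the BOM themselves; UTF-32 is tested before UTF-16.
_BOM_CODECS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def wrap_with_bom_check(raw, default="utf-8"):
    """Hypothetical helper: pick the codec from the first bytes, no seek()."""
    buffered = io.BufferedReader(raw)
    # peek() may return fewer bytes than asked on a slow stream;
    # this sketch does not retry in that case.
    head = buffered.peek(4)[:4]
    encoding = default
    for bom, name in _BOM_CODECS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(buffered, encoding=encoding)


fp = wrap_with_bom_check(Pipe(codecs.BOM_UTF8 + b"ab"))
text = fp.read()  # the BOM is skipped by the codec
```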


(3) tell() and seek() on a text file starting with a BOM

To be consistent with Antoine's example:

   >>> bio = io.BytesIO(b'\xff\xfea\x00b\x00')
   >>> f = io.TextIOWrapper(bio, encoding='utf-16')
   >>> f.read()
   'ab'
   >>> f.seek(0)
   0
   >>> f.read()
   'ab'

TextIOWrapper:

 * tell() should return zero at file start,
 * seek(0) should go back to the file start,
 * and the BOM should always be "ignored".

I mean:

  with open("utf8bom.txt", encoding="BOM") as fp:
      assert fp.tell() == 0
      text = fp.read()  # no BOM here
      fp.seek(0)
      assert fp.read() == text
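
The same expectations can be checked today with an existing codec. In this 
sketch "utf-8-sig" stands in for encoding="BOM" (which only exists in the 
patch), and a BytesIO replaces the utf8bom.txt file; both are assumptions made 
so that the example is self-contained.

```python
import codecs
import io

# In-memory replacement for a file starting with a UTF-8 BOM.
bio = io.BytesIO(codecs.BOM_UTF8 + b"ab")
fp = io.TextIOWrapper(bio, encoding="utf-8-sig")

assert fp.tell() == 0      # zero at file start
text = fp.read()           # no BOM in the result
assert text == "ab"
fp.seek(0)                 # back to the file start
assert fp.read() == text   # the BOM is skipped again
```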

--

About my patch:

 - BOM check is explicit: open(filename, encoding="BOM")
 - tell() / seek(0) work as expected
 - BOM bytes are always skipped in the read() / readlines() result

Hum, I don't know if this email can be called a summary ;-)

-- 
Victor Stinner
http://www.haypocalc.com/



More information about the Python-Dev mailing list