[Python-Dev] Quick sum up about open() + BOM

M.-A. Lemburg mal at egenix.com
Sat Jan 9 13:45:58 CET 2010


Victor Stinner wrote:
> (2) Check for a BOM while reading or detect it before?
> 
> Everybody agree that checking BOM is an interesting option and should not be 
> limited to open().
> 
> Marc-Andre proposed a codecs.guess_file_encoding() function accepting a file 
> name or a binary file-like object: it returns the encoding and seek to the 
> file start or just after the BOM.
> 
> I dislike this function because it requires extra file operations (open 
> (optional), read() and seek()) and it doesn't work if the file is not seekable 
> (eg. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to 
> avoid extra file operations.
> 
> Note: I implemented the BOM check in TextIOWrapper; so it's already usable for 
> any file-like object.

Yes, but the implementation is limited to just BOM checking
and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

With a codecs module function we could easily extend the
encoding detection to more file types, e.g. XML files,
Python source code files, etc. that use other mechanisms
for defining the encoding.

BTW: I haven't looked at your implementation, but what happens
when your BOM check fails ? Will the implementation add the
already read bytes back to a buffer ?

This rollback action is the only reason for needing a
seekable stream in codecs.guess_stream_encoding().

Another point to consider:

AFAIK, we currently have a moratorium on changes to Python
builtins. How does that match up with the proposed changes ?

Using a new codec like Walter suggested would move the
implementation into the stdlib for which doesn't the
moratorium doesn't apply.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 09 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/



More information about the Python-Dev mailing list