[Python-Dev] Improve open() to support reading file starting with an unicode BOM

M.-A. Lemburg mal at egenix.com
Fri Jan 8 22:51:26 CET 2010


Tres Seaver wrote:
> M.-A. Lemburg wrote:
> 
>> Shouldn't this encoding guessing be a separate function that you call
>> on either a file or a seekable stream ?
> 
>> After all, detecting encodings is just as useful to have for non-file
>> streams.
> 
> Other stream sources typically have out-of-band ways to signal the
> encoding:  only when reading from the filesystem do we pretty much
> *have* to guess, and in that case the BOM / signature is the best
> heuristic we have.  Also, some non-file streams are not seekable, and so
> can't be guessed via a pre-pass.

Sure, there are non-seekable file streams, but at least when
using StringIO-type streams you don't have that restriction.

An encoding detection function would be useful in other
cases as well, so instead of hiding the functionality away in
the open() constructor, I'm suggesting we expose it via
the codecs module.

Applications can then use it where necessary and also provide their
own defaults, using other heuristics.

>> You'd then avoid having to stuff everything into
>> a single function call and also open up the door for more complex
>> application specific guess work or defaults.
> 
>> The whole process would then have two steps:
> 
>>  1. guess encoding
> 
>>   import codecs
>>   encoding = codecs.guess_file_encoding(filename)
> 
> Filename is not enough information:  or do you mean that API to actually
> open the stream?

Yes. The API would open the file, guess the encoding and then
close it again. If you don't want that to happen, you could use
the second API I mentioned below on the already open file.

Note that this function could detect, with reasonably high
probability, many more encodings than just the BOM-prefixed ones,
e.g. we could also add support for detecting encoding declarations
such as the ones we use in Python source files.
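
To make the idea concrete, here is a rough sketch of what such a
function might look like. It is only an illustration: the codecs
module has no guess_file_encoding(), and the "default" parameter, the
returned codec names and the two-line limit for the declaration
search (borrowed from the rule for Python source files) are all
assumptions.

```python
import codecs
import re

# Order matters: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
# The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
# themselves when the file is later decoded from its beginning.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

# Declaration in the style used by Python source files,
# e.g. "# -*- coding: latin-1 -*-".
_DECLARATION = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def guess_file_encoding(filename, default=None):
    """Hypothetical helper: open the file, guess, and close it again."""
    with open(filename, "rb") as f:
        head = f.read(4)
        for bom, name in _BOMS:
            if head.startswith(bom):
                return name
        # No BOM found: look for a declaration in the first two lines.
        f.seek(0)
        for _ in range(2):
            match = _DECLARATION.match(f.readline())
            if match:
                return match.group(1).decode("ascii")
    # Nothing recognised: the application supplies its own default.
    return default
```

The result can be passed straight to open(filename, encoding=...);
a return value of None means nothing was detected and open() falls
back to its usual default.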

>>  2. open the file with the found encoding
> 
>>   f = open(filename, encoding=encoding)
> 
>> For seekable streams f, you'd have:
> 
>>  1. guess encoding
> 
>>   import codecs
>>   encoding = codecs.guess_stream_encoding(f)

I forgot to mention: This API needs to position the file pointer
at the start of the first data byte, i.e. just past any BOM it found.

>>  2. wrap the stream with a reader for the found encoding
> 
>>   reader_class = codecs.getreader(encoding)
>>   g = reader_class(f)

-- 
Marc-Andre Lemburg
eGenix.com



