[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Olemis Lang olemis at gmail.com
Mon Jan 11 19:58:01 CET 2010

> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> <victor.stinner at haypocalc.com> wrote:
>> Hi,
>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>> BOM should be "ignored".

I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> talk. And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?

About guessing the encoding: I ran into this issue while I was
developing a Trac plugin. What I did was the following:

- I guessed the MIME type + charset encoding using the Trac MIME API (it
was a CSV file encoded using UTF-16)
- I read the file using `open`
- Then wrapped the file using `codecs.EncodedFile`
- Then used `csv.reader`

... and still got the BOM in the first value of the first row of the CSV file.


>>> mimetype
>>> ef = EncodedFile(f, 'utf-8', mimetype)
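
One way a BOM can survive this kind of recoding is sketched below. It is written for Python 3, and the sample bytes, the helper name `recode_to_utf8`, and the choice of the `utf-16-le` codec are assumptions for illustration only, not the exact values from the plugin. With an endianness-specific codec the BOM is decoded as an ordinary U+FEFF character and passed through, while the generic `utf-16` codec consumes it.

```python
import codecs
import io

# Hypothetical sample: one CSV line encoded as UTF-16 little-endian, with a BOM.
raw = codecs.BOM_UTF16_LE + "name,value\r\n".encode("utf-16-le")

def recode_to_utf8(data, file_encoding):
    """Recode bytes from file_encoding to UTF-8 through codecs.EncodedFile."""
    ef = codecs.EncodedFile(io.BytesIO(data), "utf-8", file_encoding)
    return ef.read()

# Endianness-specific codec: the BOM is kept and comes out as a UTF-8 BOM.
kept = recode_to_utf8(raw, "utf-16-le")

# Generic codec: the BOM is used to detect the byte order and then dropped.
stripped = recode_to_utf8(raw, "utf-16")

print(kept.startswith(codecs.BOM_UTF8))      # BOM still present
print(stripped.startswith(codecs.BOM_UTF8))  # BOM gone
```

If the charset reported for the file names the byte order explicitly, the first CSV cell would start with the BOM exactly as described above.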

IMO I am +1 for leaving `open` just as it is and using the `codecs`
module to deal with encodings, but I am strongly -1 on returning
the BOM when using `EncodedFile` (mainly because the encoding is
explicitly supplied ;o)
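
A minimal sketch of what a separate helper along those lines might look like, leaving `open` untouched: it peeks at the first bytes of a seekable binary stream, picks a codec that consumes the BOM, and returns a text stream. The name `open_text_guessing_bom` and the `default` parameter are hypothetical, and the code assumes Python 3; nothing like this is claimed to exist in the standard library.

```python
import codecs
import io

# Longer BOMs are checked first: the UTF-32-LE BOM begins with the UTF-16-LE one.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def open_text_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text stream.

    The encoding is chosen from a leading BOM if there is one, otherwise
    `default` is used. The chosen codecs drop the BOM while decoding.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)
    binary_stream.seek(start)  # rewind so the decoder sees the BOM too
    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)

# Usage: the returned text never contains the BOM.
data = codecs.BOM_UTF16_BE + "a,b".encode("utf-16-be")
text = open_text_guessing_bom(io.BytesIO(data)).read()
print(repr(text))
```

This only looks at BOMs; a file without one falls back to `default`, so it is not a general encoding detector.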

> --Guido

CMIIW anyway ...



Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
