[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Olemis Lang olemis at gmail.com
Mon Jan 11 19:58:01 CET 2010


> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> <victor.stinner at haypocalc.com> wrote:
>> Hi,
>>
>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>> BOM should be "ignored".
>>
[...]
>

I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> talk. And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?
>

About guessing the encoding, I experienced this issue while I was
developing a Trac plugin. What I was doing is as follows :

- I guessed the MIME type + charset encoding using Trac MIME API (it
was a CSV file encoded using UTF-16)
- I read the file using `open`
- Then wrapped the file using `codecs.EncodedFile`
- Then used `csv.reader`

... and still get the BOM in the first value of the first row in the CSV file.

{{{
#!python

>>> mimetype
'utf-16-le'
>>> ef = EncodedFile(f, 'utf-8', mimetype)
}}}

IMO I think I am +1 for leaving `open` just like it is, and use module
`codecs` to deal with encodings, but I am strongly -1 for returning
the BOM while using `EncodedFile` (mainly because encoding is
explicitly supplied in ;o)

> --Guido
>

CMIIW anyway ...

-- 
Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/

Featured article:



More information about the Python-Dev mailing list