[Python-Dev] Improve open() to support reading file starting with an unicode BOM
olemis at gmail.com
Mon Jan 11 19:58:01 CET 2010
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
> <victor.stinner at haypocalc.com> wrote:
>> Builtin open() function is unable to open an UTF-16/32 file starting with a
>> BOM if the encoding is not specified (raise an unicode error). For an UTF-8
>> file starting with a BOM, read()/readline() returns also the BOM whereas the
>> BOM should be "ignored".
I had similar issues too (please read below ;o) ...
On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
> talk. And for the other two, perhaps it would make more sense to have
> a separate encoding-guessing function that takes a binary stream and
> returns a text stream wrapping it with the proper encoding?
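For illustration only, here is a rough sketch of what such an encoding-guessing function could look like. The name `open_guessing_bom`, its `default` parameter and the `_BOMS` table are hypothetical, not an existing API. It only sniffs a BOM and assumes a seekable binary stream.

```python
import codecs
import io

# Order matters: the UTF-32 LE BOM starts with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 entries are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text stream.

    The encoding is picked from a leading BOM if there is one, otherwise
    `default` is used. The codecs chosen here consume the BOM themselves.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)   # the longest BOM is 4 bytes
    binary_stream.seek(start)      # rewind so the codec sees the BOM too
    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)
```

Something like `open_guessing_bom(open(path, 'rb'))` would then give back a text stream without the BOM in the data, and `open` itself stays untouched.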
About guessing the encoding: I ran into this issue while I was
developing a Trac plugin. What I was doing was as follows:
- I guessed the MIME type + charset encoding using the Trac MIME API (it
was a CSV file encoded in UTF-16)
- I read the file using `open`
- Then wrapped the file using `codecs.EncodedFile`
- Then used `csv.reader`
... and I still got the BOM in the first value of the first row of the CSV file.
>>> ef = EncodedFile(f, 'utf-8', mimetype)
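A small sketch of how a BOM can end up in the first CSV value. The sample bytes and the helper `first_cell` are made up for illustration, and it is written against Python 3's `io` module rather than `EncodedFile`, so it only shows the symptom under the assumption that an endian-specific codec is doing the decoding. It does not claim to be the exact cause in the Trac case.

```python
import codecs
import csv
import io

# Hypothetical sample data: a tiny CSV encoded as UTF-16 (little endian)
# with a leading BOM.
raw = codecs.BOM_UTF16_LE + "name,value\r\nspam,1\r\n".encode("utf-16-le")


def first_cell(data, encoding):
    """Decode `data` with `encoding` and return the first CSV cell."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))[0]


# 'utf-16-le' fixes the byte order up front, so the BOM is decoded as an
# ordinary U+FEFF character and lands in the first cell.
bom_kept = first_cell(raw, "utf-16-le")

# 'utf-16' reads the BOM to pick the byte order and then drops it.
bom_stripped = first_cell(raw, "utf-16")

print(repr(bom_kept))
print(repr(bom_stripped))
```

For UTF-8 files the analogous pair is `utf-8` (keeps the BOM) and `utf-8-sig` (drops it).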
IMO I am +1 for leaving `open` just as it is and using module
`codecs` to deal with encodings, but I am strongly -1 on returning
the BOM when using `EncodedFile` (mainly because the encoding is
explicitly supplied ;o)
CMIIW anyway ...
Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/