[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Olemis Lang
olemis at gmail.com
Mon Jan 11 22:29:38 CET 2010
Probably part of this is OT, but I think it could complement the
discussion ;o)
On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> Olemis Lang wrote:
>>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
>>> <victor.stinner at haypocalc.com> wrote:
>>>> Hi,
>>>>
>>>> The builtin open() function is unable to open a UTF-16/32 file starting with a
>>>> BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8
>>>> file starting with a BOM, read()/readline() also returns the BOM, whereas the
>>>> BOM should be "ignored".
>>>>
>> [...]
>>>
>>
>> I had similar issues too (please read below ;o) ...
>>
>> On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
>>> talk. And for the other two, perhaps it would make more sense to have
>>> a separate encoding-guessing function that takes a binary stream and
>>> returns a text stream wrapping it with the proper encoding?
>>>
>>
>> About guessing the encoding, I experienced this issue while I was
>> developing a Trac plugin. What I was doing was as follows:
>>
>> - I guessed the MIME type + charset encoding using the Trac MIME API (it
>> was a CSV file encoded using UTF-16)
>> - I read the file using `open`
>> - Then wrapped the file using `codecs.EncodedFile`
>> - Then used `csv.reader`
>>
>> ... and still got the BOM in the first value of the first row of the CSV file.
>
> You didn't say, but I presume that the charset guessing logic
> returned either 'utf-16-le' or 'utf-16-be'
Yes. In fact it returns the full MIME type 'text/csv; charset=utf-16-le' ... ;o)
> - those encodings don't
> remove the leading BOM.
... and they should?
> The 'utf-16' codec will remove the BOM.
>
In this particular case there's nothing I can do; I have to process
whatever charset is detected in the input ;o)
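
To make the difference concrete, here is a minimal interactive sketch
(plain codecs on Python 3, no Trac involved):

{{{
#!python

>>> import codecs
>>> data = codecs.BOM_UTF16_LE + 'a,b'.encode('utf-16-le')
>>> data.decode('utf-16-le')   # BOM-less codec: U+FEFF stays in the text
'\ufeffa,b'
>>> data.decode('utf-16')      # BOM-aware codec: the leading BOM is stripped
'a,b'
}}}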
>> {{{
>> #!python
>>
>>>>> mimetype
>> 'utf-16-le'
>>>>> ef = EncodedFile(f, 'utf-8', mimetype)
>> }}}
>
> Same here: the UTF-8 codec will not remove the BOM, you have
> to use the 'utf-8-sig' codec for that.
>
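Right, e.g. (same kind of sketch, Python 3):

{{{
#!python

>>> import codecs
>>> data = codecs.BOM_UTF8 + b'a,b'
>>> data.decode('utf-8')       # plain utf-8 keeps the BOM as U+FEFF
'\ufeffa,b'
>>> data.decode('utf-8-sig')   # utf-8-sig strips a leading BOM
'a,b'
}}}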
>> IMO I am +1 for leaving `open` just as it is and using the `codecs`
>> module to deal with encodings, but I am strongly -1 on returning
>> the BOM when using `EncodedFile` (mainly because the encoding is
>> explicitly supplied ;o)
>
> Note that EncodedFile() doesn't do any fancy BOM detection or
> filtering.
... directly.
> This is the job of the codecs.
>
OK ... to come back to the scope of the subject: in the general case,
I think the BOM (if any) should be handled by the codecs, and therefore
indirectly by EncodedFile. If that's the explicit way of working with
encodings, I'd prefer to use that wrapper explicitly in order to
(encode | decode) the file, and let the codec detect whether there's a
BOM or not and «adjust» `tell`, `read` and everything else in that
wrapper (instead of in `open`).
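
For instance (just a sketch, assuming the guessed charset could be
mapped to the BOM-aware codec name):

{{{
#!python

>>> import codecs, io
>>> raw = io.BytesIO(codecs.BOM_UTF16_LE + 'a,b\r\n'.encode('utf-16-le'))
>>> ef = codecs.EncodedFile(raw, 'utf-8', 'utf-16')   # BOM-aware file encoding
>>> ef.readline()    # the codec consumes the BOM, so no BOM bytes leak through
b'a,b\r\n'
}}}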
> Also note that BOM removal is only valid at the beginning of
> a file. All subsequent BOM-bytes have to be read as-is (they
> map to a zero-width non-breaking space) - without removing them.
>
;o)
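
FWIW that's what the BOM-aware codecs already do: only the leading BOM
goes away, an interior U+FEFF is kept as-is.

{{{
#!python

>>> import codecs
>>> data = codecs.BOM_UTF16_LE + 'a\ufeffb'.encode('utf-16-le')
>>> data.decode('utf-16')    # leading BOM stripped, interior U+FEFF preserved
'a\ufeffb'
}}}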
--
Regards,
Olemis.
Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
Featured article:
Test cases for custom query (i.e report 9) ... PASS (1.0.0) -
http://simelo.hg.sourceforge.net/hgweb/simelo/trac-gviz/rev/d276011e7297