[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Guido van Rossum guido at python.org
Fri Jan 8 01:52:20 CET 2010


I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy
talk. And for the other two, perhaps it would make more sense to have
a separate encoding-guessing function that takes a binary stream and
returns a text stream wrapping it with the proper encoding?
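
A minimal sketch of what such a helper might look like, assuming a seekable binary stream. The name guess_text_stream, its default parameter and the BOM table are hypothetical, not an existing API; only the codecs constants and io.TextIOWrapper come from the standard library.

```python
import codecs
import io

# Check the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_text_stream(binary_stream, default="utf-8"):
    """Hypothetical helper: wrap a seekable binary stream in a text stream.

    The encoding is picked from a leading BOM if there is one,
    otherwise `default` is used.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)   # the longest BOM is 4 bytes
    binary_stream.seek(start)      # rewind so the codec sees the BOM itself

    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM
    # while decoding, so it does not show up in the returned text.
    return io.TextIOWrapper(binary_stream, encoding=encoding)
```

It could be used as guess_text_stream(open(path, "rb")). Note that the guess is only a heuristic: a UTF-16-LE file whose first character after the BOM is U+0000 would be taken for UTF-32-LE.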

--Guido

On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner
<victor.stinner at haypocalc.com> wrote:
> Hi,
>
> The builtin open() function is unable to open a UTF-16/32 file starting with a
> BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8
> file starting with a BOM, read()/readline() also return the BOM, whereas the
> BOM should be "ignored".
>
> See recent issues related to reading a UTF-8 text file including a BOM: #7185
> (csv) and #7519 (ConfigParser). Such a file can be opened in unicode mode with
> the UTF-8-SIG encoding, but it's possible to do better.
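
For illustration, the UTF-8 behaviour described in the quoted text can be reproduced with a short script. This is only a sketch: the temporary file and its contents are made up for the example.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (contents are made up).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + "key=value\n".encode("utf-8"))
    path = tmp.name

try:
    with open(path, encoding="utf-8") as f:
        plain = f.read()      # the BOM survives as a leading U+FEFF
    with open(path, encoding="utf-8-sig") as f:
        stripped = f.read()   # the codec consumes the BOM
finally:
    os.remove(path)

print(repr(plain))     # '\ufeffkey=value\n'
print(repr(stripped))  # 'key=value\n'
```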
>
> I propose to improve open() (TextIOWrapper) by using the BOM to choose the
> right encoding. I think that only files opened in read-only mode should
> support this new feature. *Reading* the BOM in a *write*-only file would cause
> unexpected behaviours.
>
> Since my proposal changes the result of TextIOWrapper.read()/readline() for
> files starting with a BOM, we might introduce an option to open() to enable
> the new behaviour. But is it really necessary to keep backward compatibility?
>
> I wrote a proof of concept attached to issue #7651. My patch only changes
> the behaviour of TextIOWrapper for reading files starting with a BOM. It
> doesn't work yet if seek() is called before the first read.
>
> --
> Victor Stinner
> http://www.haypocalc.com/



-- 
--Guido van Rossum (python.org/~guido)


