[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Victor Stinner victor.stinner at haypocalc.com
Fri Jan 8 01:10:35 CET 2010


Hi,

The builtin open() function cannot read a UTF-16/32 file starting with a 
BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 
file starting with a BOM, read()/readline() also return the BOM, whereas the 
BOM should be "ignored".

See the recent issues about reading a UTF-8 text file that includes a BOM: 
#7185 (csv) and #7519 (ConfigParser). Such a file can be opened in text mode 
with the UTF-8-SIG encoding, but it's possible to do better.
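
To illustrate the UTF-8 case, here is a minimal sketch (Python 3 syntax; the 
temporary file and its content are made up for the example) comparing the 
plain UTF-8 codec with UTF-8-SIG:

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM (illustrative content).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello\n".encode("utf-8"))

# With the plain UTF-8 codec, the BOM comes back as the first character.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# With UTF-8-SIG, the BOM is consumed by the decoder.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

The caller has to know in advance that the file may start with a BOM, which 
is the part that could be improved.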

I propose to improve open() (TextIOWrapper) by using the BOM to choose the 
right encoding. I think that only files opened in read-only mode should 
support this new feature. *Reading* the BOM of a *write*-only file would cause 
unexpected behaviour.
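
The idea can be sketched in pure Python, outside TextIOWrapper. This is not 
the patch itself, only an illustration: guess_encoding_from_bom() and 
open_with_bom() are hypothetical helper names, and the sketch assumes that a 
file starting with FF FE 00 00 is UTF-32 rather than UTF-16-LE text beginning 
with a NUL character.

```python
import codecs

# Test the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order matters.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding_from_bom(head, default=None):
    """Return a codec name for the BOM found at the start of *head* (bytes).

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    themselves, so it does not show up in the decoded text.
    Hypothetical helper, not part of the io module.
    """
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_with_bom(path, default=None):
    """Open *path* read-only in text mode, choosing the encoding from the BOM.

    Falls back to *default* (None means the usual open() default) when the
    file has no BOM. Hypothetical helper, read-only on purpose.
    """
    with open(path, "rb") as raw:
        head = raw.read(4)  # the longest BOM is 4 bytes
    return open(path, "r", encoding=guess_encoding_from_bom(head, default))
```

Only read mode is handled here, in line with the restriction proposed above.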

Since my proposal changes the result of TextIOWrapper.read()/readline() for 
files starting with a BOM, we might add an option to open() to enable 
the new behaviour. But do we really need to keep backward compatibility?

I wrote a proof of concept and attached it to issue #7651. My patch only 
changes the behaviour of TextIOWrapper when reading files starting with a BOM. 
It doesn't work yet if seek() is called before the first read.

-- 
Victor Stinner
http://www.haypocalc.com/


