[Python-Dev] Quick sum up about open() + BOM

Sat Jan 9 02:12:28 CET 2010

Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from 
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
> 
> One concern I have with this encoding="BOM" implementation is that if 
> there is no BOM, it assumes UTF-8.  That is probably a good assumption in 
> some circumstances, but not in others.
> 
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE 
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32 
> (cases where the endianness is unspecified) contain a BOM.  Hence, it 
> might be that someone would expect a UTF-16LE (or any of the formats 
> that don't require a BOM, rather than UTF-8), but be willing to accept 
> any BOM-discriminated format.
> 
> * Potentially, this could be expanded beyond the various Unicode 
> encodings... one could envision that a program whose data files 
> historically were in any particular national language locale, could want 
> to be enhanced to accept Unicode, and could declare that they will accept 
> any BOM-discriminated format, but want to default, in the absence of a 
> BOM, to the original national language locale that they historically 
> accepted.  That would provide a migration path for their old data files.
> 
> So the point is, that it might be nice to have 
> "BOM-otherEncodingForDefault" for each other encoding that Python 
> supports.  Not sure that is the right API, but I think it is expressive 
> enough to handle the cases above.  Whether the cases solve actual 
> problems or not, I couldn't say, but they seem like reasonable cases.
> 
> It would, of course, be nicest if OS metadata had been invented way back 
> when, for all OSes, such that all text files were flagged with their 
> encoding... then languages could just read the encoding and do the right 
> thing! But we live in the real world, instead.
> 
What about listing the possible encodings? It would try each in turn
until it found one whose BOM matched, or one that has no BOM:

     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')

or is that taking it too far?
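
To make the idea concrete, here is a rough sketch of how such a list
could be resolved. It is only an illustration, not a proposed patch:
the names pick_encoding and open_candidates are hypothetical, the
'|'-separated string is not something open() accepts today, and the
BOM table covers only a few codecs.

```python
import codecs

# BOMs for each candidate name (lowercased). This table is an assumption of
# the sketch: names missing from it are treated as "has no BOM" and always
# match, so they act as the default.
_BOMS = {
    "utf-8-sig": (codecs.BOM_UTF8,),
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    "utf-32": (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
}


def pick_encoding(head, spec):
    """Return the first encoding in spec whose BOM starts head, or the
    first encoding that has no BOM at all."""
    for name in spec.split("|"):
        boms = _BOMS.get(name.lower())
        if boms is None:
            # No BOM defined for this candidate: use it as the fallback.
            return name
        if any(head.startswith(bom) for bom in boms):
            return name
    raise ValueError("no candidate encoding matched the file")


def open_candidates(filename, spec):
    """Hypothetical helper: peek at the first bytes, then open as text."""
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM (UTF-32) is four bytes
    return open(filename, "r", encoding=pick_encoding(head, spec))
```

With this, open_candidates(filename, 'UTF-8-sig|UTF-16|UTF-8') behaves
like the example above, and ending the list with a legacy codec such as
'cp1252' would give the migration path Glenn describes. Order matters in
this sketch: the UTF-32-LE BOM begins with the same two bytes as the
UTF-16-LE BOM, so 'UTF-32' would have to be listed before 'UTF-16'.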