[Python-Dev] Quick sum up about open() + BOM

Sat Jan 9 11:51:57 CET 2010

On 09.01.10 01:47, Glenn Linderman wrote:

> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
> 
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8.  That is probably a good assumption in
> some circumstances, but not in others.
> 
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM.  It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM.  Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
> 
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhanced to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted.  That would provide a migration path for their old data files.
> 
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports.  Not sure that is the right API, but I think it is expressive
> enough to handle the cases above.  Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.

This is doable with the current API. Simply define a codec search
function that handles all encoding names that start with "BOM-" and pass
the "otherEncodingForDefault" part along to the codec.
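
A rough sketch of such a search function, to make the idea concrete. The
names bom_search, _make_decode and _BOMS are made up for this example, and
the "BOM-<fallback>" naming is only the scheme suggested above, not an
existing codec. It covers the stateless decode path only; for open() the
CodecInfo would also need an incremental decoder, which is left out here.
Encoding simply delegates to the fallback codec and writes no BOM, which
is an assumption on my part.

```python
import codecs

# (BOM bytes, codec name) pairs. UTF-32 is checked before UTF-16 because
# the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _make_decode(fallback_info):
    """Build a stateless decode function that falls back to fallback_info."""
    def decode(data, errors="strict"):
        data = bytes(data)
        for bom, name in _BOMS:
            if data.startswith(bom):
                # Strip the BOM and decode with the codec it announces.
                text, _ = codecs.getdecoder(name)(data[len(bom):], errors)
                return text, len(data)
        # No BOM found: use the encoding named after the "BOM-" prefix.
        text, _ = fallback_info.decode(data, errors)
        return text, len(data)
    return decode


def bom_search(name):
    """Codec search function for hypothetical "BOM-<fallback>" names."""
    # The name arrives lower-cased; depending on the Python version the
    # hyphens may or may not have been turned into underscores.
    if name[:4] not in ("bom-", "bom_"):
        return None
    try:
        fallback_info = codecs.lookup(name[4:])
    except LookupError:
        return None  # unknown fallback: let other search functions try
    return codecs.CodecInfo(
        encode=fallback_info.encode,  # assumption: no BOM is written
        decode=_make_decode(fallback_info),
        name=name,
    )


codecs.register(bom_search)
```

With that registered, data.decode("BOM-latin-1") would honour a BOM if
one is present and treat the bytes as Latin-1 otherwise.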

> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.

Servus,
   Walter