[Python-Dev] Quick sum up about open() + BOM
MRAB
python at mrabarnett.plus.com
Sat Jan 9 02:12:28 CET 2010
Glenn Linderman wrote:
> On approximately 1/8/2010 3:59 PM, came the following characters from
> the keyboard of Victor Stinner:
>> Hi,
>>
>> Thanks for all the answers! I will try to sum up all ideas here.
>
> One concern I have with this implementation encoding="BOM" is that if
> there is no BOM it assumes UTF-8. That is probably a good assumption in
> some circumstances, but not in others.
>
> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
> encoded files include a BOM. It is only required that UTF-16 and UTF-32
> (cases where the endianness is unspecified) contain a BOM. Hence, it
> might be that someone would expect a UTF-16LE (or any of the formats
> that don't require a BOM, rather than UTF-8), but be willing to accept
> any BOM-discriminated format.
>
> * Potentially, this could be expanded beyond the various Unicode
> encodings... one could envision that a program whose data files
> historically were in any particular national language locale, could want
> to be enhanced to accept Unicode, and could declare that they will accept
> any BOM-discriminated format, but want to default, in the absence of a
> BOM, to the original national language locale that they historically
> accepted. That would provide a migration path for their old data files.
>
> So the point is, that it might be nice to have
> "BOM-otherEncodingForDefault" for each other encoding that Python
> supports. Not sure that is the right API, but I think it is expressive
> enough to handle the cases above. Whether the cases solve actual
> problems or not, I couldn't say, but they seem like reasonable cases.
>
> It would, of course, be nicest if OS metadata had been invented way back
> when, for all OSes, such that all text files were flagged with their
> encoding... then languages could just read the encoding and do the right
> thing! But we live in the real world, instead.
>
What about listing the possible encodings? It would try each in turn
until it found one where the BOM matched or had no BOM:
    my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
or is that taking it too far?
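
For what it's worth, here is a rough sketch of how that lookup might behave, written as ordinary helper functions rather than as a change to open(). The names pick_encoding and open_by_bom and the _BOMS table are made up for illustration; only the codecs.BOM_* constants and the 'utf-8-sig', 'utf-16' and 'utf-32' codecs are existing Python. The sketch assumes that a candidate with no BOM of its own (such as 'UTF-8' or a legacy locale encoding) acts as the default and ends the search.

```python
import codecs

# BOMs recognised for each BOM-aware codec name (lower-cased).
# This table is an assumption of the sketch, not an existing Python API.
_BOMS = {
    "utf-8-sig": (codecs.BOM_UTF8,),
    "utf-32": (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
}


def pick_encoding(head, candidates):
    """Return the first candidate whose BOM starts `head`, or the first
    candidate that defines no BOM (which then serves as the default)."""
    for name in candidates.split("|"):
        boms = _BOMS.get(name.lower())
        if boms is None:
            # No BOM known for this encoding: unconditional fallback.
            return name
        if head.startswith(boms):
            return name
    raise ValueError("no candidate encoding matched the start of the data")


def open_by_bom(filename, candidates):
    """Open `filename` for reading as text, choosing the encoding from the
    '|'-separated `candidates` by inspecting the first four bytes."""
    with open(filename, "rb") as f:
        head = f.read(4)
    # The utf-8-sig, utf-16 and utf-32 codecs consume the BOM themselves.
    return open(filename, "r", encoding=pick_encoding(head, candidates))
```

One caveat: the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE), so if both are wanted, 'UTF-32' has to be listed before 'UTF-16'. Glenn's "default to the historical locale" case would then just be something like open_by_bom(filename, 'UTF-8-sig|UTF-16|latin-1').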