[Python-Dev] Quick sum up about open() + BOM

Sat Jan 9 02:49:12 CET 2010

On approximately 1/8/2010 5:12 PM, came the following characters from 
the keyboard of MRAB:
> Glenn Linderman wrote:
>> On approximately 1/8/2010 3:59 PM, came the following characters from 
>> the keyboard of Victor Stinner:
>>> Hi,
>>>
>>> Thanks for all the answers! I will try to sum up all ideas here.
>>
>> One concern I have with this implementation of encoding="BOM" is that if 
>> there is no BOM it assumes UTF-8.  That is probably a good assumption 
>> in some circumstances, but not in others.
>>
>> * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE 
>> encoded files include a BOM.  It is only required that UTF-16 and 
>> UTF-32 (cases where the endianness is unspecified) contain a BOM.  
>> Hence, it might be that someone would expect a UTF-16LE (or any of 
>> the formats that don't require a BOM, rather than UTF-8), but be 
>> willing to accept any BOM-discriminated format.
>>
>> * Potentially, this could be expanded beyond the various Unicode 
>> encodings... one could envision that a program whose data files 
>> historically were in any particular national language locale, could 
>> want to be enhanced to accept Unicode, and could declare that they 
>> will accept any BOM-discriminated format, but want to default, in the 
>> absence of a BOM, to the original national language locale that they 
>> historically accepted.  That would provide a migration path for their 
>> old data files.
>>
>> So the point is, that it might be nice to have 
>> "BOM-otherEncodingForDefault" for each other encoding that Python 
>> supports.  Not sure that is the right API, but I think it is 
>> expressive enough to handle the cases above.  Whether the cases solve 
>> actual problems or not, I couldn't say, but they seem like reasonable 
>> cases.
>>
>> It would, of course, be nicest if OS metadata had been invented way 
>> back when, for all OSes, such that all text files were flagged with 
>> their encoding... then languages could just read the encoding and do 
>> the right thing! But we live in the real world, instead.
>>
> What about listing the possible encodings? It would try each in turn
> until it found one where the BOM matched or had no BOM:
>
>     my_file = open(filename, 'r', encoding='UTF-8-sig|UTF-16|UTF-8')
>
> or is that taking it too far?

That sounds very flexible -- but in net effect it would only make 
illegal a subset of the BOM-containing encodings (those not listed) 
without making legal any additional encodings other than the non-BOM 
encoding.  Whether prohibiting a subset of BOM-containing encodings is a 
useful use case, I couldn't say... but my goal would be to include as 
many different file encodings on input as possible: without a BOM, that 
is exactly 1 (unless there are other heuristics), with a BOM, it is 
1+all-BOM-containing encodings.  Your scheme would permit numbers of 
encodings accepted to vary between 1 and 1+all-BOM-containing encodings.

(I think everyone can agree there are 5 different byte sequences that 
can be called a Unicode BOM.  The likelihood of them appearing in any 
other text encoding created by mankind depends on those other encodings 
-- but it is not impossible.  It is truly up to the application to 
decide whether BOM detection could potentially conflict with files in 
some other encoding that would be acceptable to the application.)
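
To make the parenthetical concrete, here is a small sketch using the BOM 
constants from the standard codecs module.  The names FIVE_BOMS and 
legacy are only illustrative.  It lists the five byte sequences and 
shows one way a conflict can arise: a Latin-1 file that happens to start 
with the two characters U+00FF U+00FE begins with the same bytes as the 
UTF-16LE BOM.

```python
import codecs

# The five byte sequences usually called a Unicode BOM.
FIVE_BOMS = {
    "UTF-8": codecs.BOM_UTF8,         # EF BB BF
    "UTF-16LE": codecs.BOM_UTF16_LE,  # FF FE
    "UTF-16BE": codecs.BOM_UTF16_BE,  # FE FF
    "UTF-32LE": codecs.BOM_UTF32_LE,  # FF FE 00 00
    "UTF-32BE": codecs.BOM_UTF32_BE,  # 00 00 FE FF
}

# Illustrative legacy data: valid Latin-1 text whose first two
# characters encode to the bytes FF FE.
legacy = "\xff\xfe starts this Latin-1 file".encode("latin-1")

# A BOM sniffer would take this file for UTF-16LE.
looks_like_utf16le = legacy.startswith(codecs.BOM_UTF16_LE)
```

Note also that the UTF-32LE BOM begins with the UTF-16LE BOM, so any 
detection code has to test the longer sequences first.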

So I think it is taking it further than I can see value in, but I'm 
willing to be convinced otherwise.  I see only a need for detecting BOM, 
and specifying a default encoding to be used if there is no BOM.  Note 
that it might be nice to have a specification for using the current 
encoding=None heuristic -- perhaps encoding="BOM-None" per my originally 
proposed syntax.  But I'm still not saying that is the best syntax.
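
As a rough sketch of the behaviour I have in mind (not a proposal for 
the actual API): the helper names decode_bom and read_text_bom below are 
hypothetical, and default=None is my assumption for how "BOM-None" might 
behave, approximated here by the locale's preferred encoding.

```python
import codecs
import locale

# Longer BOMs first: the UTF-32LE BOM starts with the UTF-16LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def decode_bom(data, default="utf-8"):
    """Hypothetical helper: decode bytes using the BOM if there is one,
    otherwise using the caller-supplied default encoding."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode with the encoding it identifies.
            return data[len(bom):].decode(name)
    if default is None:
        # Assumed stand-in for the encoding=None heuristic.
        default = locale.getpreferredencoding(False)
    return data.decode(default)

def read_text_bom(filename, default="utf-8"):
    """Hypothetical helper: read a whole text file with BOM detection."""
    with open(filename, "rb") as f:
        return decode_bom(f.read(), default)
```

With this, read_text_bom(name, default="latin-1") would accept old 
Latin-1 data files as well as any BOM-marked Unicode file, which is the 
migration path described earlier in the thread.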

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking