[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 22:01:10 CEST 2006

Josiah Carlson wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>>Here is a very simple, reasonably (although not completely) safe, and much
>>more predictable guessing algorithm, based on a generalization of
>><http://www.w3.org/TR/REC-xml/#sec-guessing>:
>>
>>   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>>     corresponding byte is past end-of-stream.
>>
>>   Let other be any encoding which is to be used as a default if no specific
>>     UTF is detected.
>>
>>   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>>   if B == None: return other
>>   if A == 0 and B == 0 and D != None: return UTF32BE
>>   if C == 0 and D == 0: return UTF32LE
>>   if A == 0xFE and B == 0xFF: return UTF16BE
>>   if A == 0xFF and B == 0xFE: return UTF16LE
>>   if A != 0 and B != 0: return other
>>   if A == 0: return UTF16BE
>>   return UTF16LE
>>
>>This would normally be used with 'other' as the system encoding, as an alternative
>>to just assuming that the file is in the system encoding.
> 
> Using the xml guessing mechanism is fine, as long as you get it right. 
> A first pass with BOM detection and a second pass to "guess" based on
> content in the case that a BOM isn't detected seems to make sense.

... if you think that guessing based on content is a good idea -- I don't.
In any case, such guessing necessarily depends on the expected file format,
so it should be done by the application itself, or by a library that knows
more about the format.

If the encoding of a text stream were settable after it had been opened,
then it would be easy for anyone to implement whatever guessing algorithm
they needed, without having to write an encoding implementation or include
any other support for guessing in the I/O library itself.

(This also requires the ability to seek back to the beginning of the stream
after reading the data needed for the guess.)
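
A minimal sketch of how the quoted algorithm and the seek-back step might
fit together, written against Python's io.TextIOWrapper. The names
guess_encoding and wrap_guessed are hypothetical, and the mapping from
UTF8/UTF16BE/... to Python codec names is an assumption of this sketch,
not part of the proposal:

```python
import io

def guess_encoding(head, other):
    """Guess an encoding from the first (up to) 4 bytes of a stream."""
    # A..D are None when the corresponding byte is past end-of-stream.
    a, b, c, d = (list(head[:4]) + [None] * 4)[:4]
    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"
    if b is None:
        return other
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"
    if c == 0 and d == 0:
        return "utf-32-le"
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"
    if a != 0 and b != 0:
        return other
    if a == 0:
        return "utf-16-be"
    return "utf-16-le"

def wrap_guessed(raw, other="latin-1"):
    """Wrap a seekable binary stream in a text stream with a guessed encoding.

    'other' stands in for the system encoding; latin-1 is only a placeholder.
    """
    head = raw.read(4)   # read the data needed for the guess
    raw.seek(0)          # seek back to the beginning of the stream
    return io.TextIOWrapper(raw, encoding=guess_encoding(head, other))

# Usage: a BOM-less UTF-16LE stream is detected from its zero bytes.
text = wrap_guessed(io.BytesIO("hi".encode("utf-16-le"))).read()
```

With these codec names a leading BOM is not stripped; it shows up as
U+FEFF at the start of the decoded text, so a caller would have to
drop it.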

> Note that the above algorithm returns UTF32BE for files beginning with
> 4 null bytes.

Yes. But such a thing probably isn't a text file at all -- in which case
there will be subsequent decoding errors when most of the code units are
not in the range 0 to 0x10FFFF.
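
As a rough illustration of that failure mode, assuming Python's built-in
"utf-32-be" codec (the helper decodes_as_utf32be and the sample bytes are
made up for this sketch): four null bytes followed by arbitrary binary
data produce a code unit above 0x10FFFF, and the decoder rejects it.

```python
def decodes_as_utf32be(data):
    """Return True if the bytes decode cleanly as UTF-32BE."""
    try:
        data.decode("utf-32-be")
    except UnicodeDecodeError:
        # Raised for code units outside 0..0x10FFFF (and truncated input).
        return False
    return True

# Four null bytes, then binary data: the second unit is 0x7F454C46.
binary_like = b"\x00\x00\x00\x00" + b"\x7fELF"
real_text = "hi".encode("utf-32-be")

binary_ok = decodes_as_utf32be(binary_like)
text_ok = decodes_as_utf32be(real_text)
```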

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>