[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 20:25:43 CEST 2006

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
> 
>    Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>      corresponding byte is past end-of-stream.
> 
>    Let other be any encoding which is to be used as a default if no specific
>      UTF is detected.
> 
>    if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>    if B == None: return other
>    if A == 0 and B == 0 and D != None: return UTF32BE
>    if C == 0 and D == 0: return UTF32LE
>    if A == 0xFE and B == 0xFF: return UTF16BE
>    if A == 0xFF and B == 0xFE: return UTF16LE
>    if A != 0 and B != 0: return other
>    if A == 0: return UTF16BE
>    return UTF16LE
> 
> This would normally be used with 'other' as the system encoding, as an alternative
> to just assuming that the file is in the system encoding.

Using the XML guessing mechanism is fine, as long as you get it right.
A first pass with BOM detection, followed by a second pass that "guesses"
based on content when no BOM is detected, seems to make sense.
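
To make the two-pass idea concrete, here is a rough Python sketch. The names
detect_bom, guess_from_content and detect are hypothetical, the 'latin-1'
default merely stands in for the system encoding, and the null-byte heuristic
in the second pass is only one possible assumption about what "guessing based
on content" could mean. It is not a proposed implementation.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(head):
    """First pass: return an encoding name if `head` starts with a BOM."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def guess_from_content(head, other):
    """Second pass (assumed heuristic): look at where the null bytes fall."""
    zeros = b"\x00\x00\x00"
    if len(head) >= 4:
        if head[:3] == zeros and head[3] != 0:
            return "utf-32-be"
        if head[1:4] == zeros and head[0] != 0:
            return "utf-32-le"
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:
            return "utf-16-be"
        if head[0] != 0 and head[1] == 0:
            return "utf-16-le"
    # Nothing recognisable, including a run of four null bytes.
    return other


def detect(head, other="latin-1"):
    """BOM detection first, content guess only if no BOM was found."""
    return detect_bom(head) or guess_from_content(head, other)
```

With this split, detect(b"\xff\xfe\x00\x00") is settled by the BOM pass alone,
and the content pass only ever sees BOM-less input.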

Note that the quoted algorithm returns UTF32BE for a file beginning with
4 null bytes, since the "A == 0 and B == 0 and D != None" rule matches
regardless of C and D.
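
For anyone who wants to poke at it, here is a line-by-line transliteration of
the quoted rules into Python. The function name guess_encoding and the
'latin-1' default for 'other' are placeholders of mine, and the return values
are Python codec names rather than anything from the original post.

```python
def guess_encoding(head, other="latin-1"):
    """Apply the quoted rules to `head`, the first bytes of the stream."""
    # A, B, C, D: the first four bytes, or None past end-of-stream.
    a, b, c, d = (list(head[:4]) + [None] * 4)[:4]

    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"
    if b is None:
        return other
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"
    if c == 0 and d == 0:
        return "utf-32-le"
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"
    if a != 0 and b != 0:
        return other
    if a == 0:
        return "utf-16-be"
    return "utf-16-le"


# Four null bytes hit the third rule.
print(guess_encoding(b"\x00\x00\x00\x00"))  # utf-32-be
```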

 - Josiah