[Python-3000] Pre-PEP: Easy Text File Decoding
Josiah Carlson
jcarlson at uci.edu
Sun Sep 10 20:25:43 CEST 2006
David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
>
> Let A, B, C, and D be the first 4 bytes of the stream, or None if the
> corresponding byte is past end-of-stream.
>
> Let other be any encoding which is to be used as a default if no specific
> UTF is detected.
>
> if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
> if B == None: return other
> if A == 0 and B == 0 and D != None: return UTF32BE
> if C == 0 and D == 0: return UTF32LE
> if A == 0xFE and B == 0xFF: return UTF16BE
> if A == 0xFF and B == 0xFE: return UTF16LE
> if A != 0 and B != 0: return other
> if A == 0: return UTF16BE
> return UTF16LE
>
> This would normally be used with 'other' as the system encoding, as an alternative
> to just assuming that the file is in the system encoding.
Using the XML guessing mechanism is fine, as long as you get it right.
A first pass that checks for a BOM, followed by a second pass that
"guesses" from the content only when no BOM is found, seems to make sense.
Note that the above algorithm returns UTF32BE for files beginning with
4 null bytes: they satisfy the "A == 0 and B == 0 and D != None" test
even though no BOM is present.
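
To make that concrete, here is a rough Python transcription of the quoted
rules, kept in the same order. It is only a sketch: the function name
guess_encoding, the "system-default" placeholder for 'other', and the use
of Python codec names as return values are my own choices, not part of
the proposal.

```python
def guess_encoding(head, other="system-default"):
    """Guess an encoding from the first bytes of a stream.

    `head` is a bytes object holding the start of the stream; `other`
    is the fallback returned when no specific UTF is detected.
    """
    # A..D are the first four bytes, or None past end-of-stream.
    a, b, c, d = (list(head[:4]) + [None] * 4)[:4]

    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"          # UTF-8 BOM
    if b is None:
        return other            # fewer than two bytes available
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"      # also matches four null bytes
    if c == 0 and d == 0:
        return "utf-32-le"
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"      # UTF-16 big-endian BOM
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"      # UTF-16 little-endian BOM
    if a != 0 and b != 0:
        return other            # no null in the first two bytes
    if a == 0:
        return "utf-16-be"
    return "utf-16-le"
```

With this sketch, guess_encoding(b"\x00\x00\x00\x00") returns "utf-32-be",
which is the case I mean: nothing in those four bytes is a BOM, yet a
specific UTF is reported instead of 'other'.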
- Josiah