[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 15:52:44 CEST 2006

Antoine Pitrou wrote:
> On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
>>Antoine Pitrou wrote:
>>
>>>So, here is an alternative proposal:
>>>Make it so that textfile() doesn't recognize system-wide defaults (as in
>>>your proposal), but also provide autotextfile() which would recognize
>>>those defaults (with a by_content=False optional argument to enable
>>>content-based guessing).
>>>
>>>textfile() being clearly marked for use by large well thought-out
>>>applications, and autotextfile() for small scripts and the like.
>>>Different names make it clear that they are for different uses, and
>>>make them easy to spot when reading source code (whether by a human
>>>reader or a quality measurement tool).
>>
>>How does your "autotextfile('myfile.txt')" differ from Paul's 
>>"textfile('myfile.txt', encoding='guess')"?
> 
> Paul's "encoding='guess'" specifies a complicated and dangerous guessing
> algorithm.

Indeed, to the extent that it specifies anything. However, guessing algorithms
can differ greatly in how complicated and dangerous they are.

Here is a very simple, reasonably (although not completely) safe, and much
more predictable guessing algorithm, based on a generalization of
<http://www.w3.org/TR/REC-xml/#sec-guessing>:

   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
     corresponding byte is past end-of-stream.

   Let other be any encoding which is to be used as a default if no specific
     UTF is detected.

   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
   if B == None: return other
   if A == 0 and B == 0 and D != None: return UTF32BE
   if C == 0 and D == 0: return UTF32LE
   if A == 0xFE and B == 0xFF: return UTF16BE
   if A == 0xFF and B == 0xFE: return UTF16LE
   if A != 0 and B != 0: return other
   if A == 0: return UTF16BE
   return UTF16LE

This would normally be used with 'other' as the system encoding, as an alternative
to just assuming that the file is in the system encoding.
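
The steps above can be sketched in Python roughly as follows. This is only an
illustration under stated assumptions: the name guess_encoding, the choice to
pass in the already-read prefix as a bytes object, and the returned Python
codec names are not part of the proposal. The byte-order mark is not stripped
here; a caller would have to skip a leading U+FEFF itself after decoding.

```python
def guess_encoding(prefix, other):
    """Guess an encoding from the first (up to) 4 bytes of a stream.

    `prefix` is a bytes object; `other` is the fallback encoding name
    returned when no specific UTF is detected.  Hypothetical helper,
    not an existing library function.
    """
    # A, B, C, D are the first four bytes, or None past end-of-stream.
    a, b, c, d = (list(prefix[:4]) + [None] * 4)[:4]

    if a == 0xEF and b == 0xBB and c == 0xBF:
        return "utf-8"       # UTF-8 BOM
    if b is None:
        return other         # fewer than two bytes: nothing to go on
    if a == 0 and b == 0 and d is not None:
        return "utf-32-be"   # 00 00 xx xx (includes BOM 00 00 FE FF)
    if c == 0 and d == 0:
        return "utf-32-le"   # xx xx 00 00 (includes BOM FF FE 00 00)
    if a == 0xFE and b == 0xFF:
        return "utf-16-be"   # UTF-16 big-endian BOM
    if a == 0xFF and b == 0xFE:
        return "utf-16-le"   # UTF-16 little-endian BOM
    if a != 0 and b != 0:
        return other         # no NUL in the first two bytes
    if a == 0:
        return "utf-16-be"   # 00 xx: first character in U+0001..U+00FF
    return "utf-16-le"       # xx 00: same, little-endian
```

For example, guess_encoding(b"hello", "latin-1") falls through to the
'other' case, while the first bytes of "hi" encoded as UTF-16LE (68 00 69 00)
reach the final return.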

There is very little chance of this algorithm misdetecting a file in a non-Unicode
encoding as Unicode. For that to happen, either the first two or three bytes would
have to be encoded in exactly the same way as a UTF-16 or UTF-8 BOM, or one of the
first three characters would have to be NUL.

However, if the file *is* Unicode and it starts with a BOM, then its UTF will
always be correctly detected.

Furthermore, UTF-16 and UTF-32 will be correctly detected if the file starts with
a character from U+0001 to U+00FF (i.e. non-NUL and in the ISO-8859-1 range).

Another advantage of this algorithm is that it never needs to read more than
the first 4 bytes of the stream.

> However, autotextfile('myfile.txt') would mean:
> - use Paul's "site" if such a thing is defined
> - otherwise, use Paul's "locale"
> (no content-based guessing)
> 
> On the other hand "autotextfile('myfile.txt', by_content=True)" would
> enable content-based guessing, thus be equivalent to Paul's
> "encoding='guess'".

As I pointed out earlier, any file open function that guesses the encoding
should return which encoding has been guessed. Alternatively, it could be
possible to allow the encoding to be set after the file has been opened,
in which case a separate function could do the guessing.
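
As a rough sketch of the first option, an open function could hand back the
guessed encoding together with the file object. Everything named here is an
assumption for illustration only: open_guessed is hypothetical, the guesser
is passed in as a callable taking (prefix, other), and utf8_bom_or_other is a
deliberately trivial stand-in guesser, not the algorithm given earlier.

```python
import io


def utf8_bom_or_other(prefix, other):
    # Trivial stand-in guesser for demonstration: only recognises a UTF-8 BOM.
    # "utf-8-sig" is the Python codec that skips the BOM when decoding.
    return "utf-8-sig" if prefix.startswith(b"\xef\xbb\xbf") else other


def open_guessed(path, guess, other):
    """Open `path` for reading text; return (text_file, encoding_used).

    `guess` is any callable taking (first_bytes, other) and returning an
    encoding name.  Hypothetical helper, not a proposed API.
    """
    raw = open(path, "rb")
    try:
        prefix = raw.read(4)   # at most 4 bytes are inspected
        raw.seek(0)            # rewind so decoding starts at the beginning
        encoding = guess(prefix, other)
        return io.TextIOWrapper(raw, encoding=encoding), encoding
    except Exception:
        raw.close()
        raise
```

The caller can then log or check the second element of the returned pair
instead of having to trust the guess blindly.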

>>The 'additional symbolic values' should be implemented as true
>>encodings (i.e., it should be possible to look up 'site', 'guess' and
>>'locale' in the codecs registry, and replace them there as well).
> 
> Treating different things as "true encodings" does not help
> understandability IMHO. "guess", "site" and "locale" are not encodings
> in themselves, they are decision algorithms.

+1.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>