[Python-Dev] str object going in Py3K

Michael Foord fuzzyman at voidspace.org.uk
Wed Feb 15 23:54:09 CET 2006


Guido van Rossum wrote:
> On 2/15/06, Fuzzyman <fuzzyman at voidspace.org.uk> wrote:
>   
>>  Forcing the programmer to be aware of encodings, also pushes the same
>> requirement onto the user (who is often the source of the text in question).
>>     
>
> The programmer shouldn't have to be aware of encodings most of the
> time -- it's the job of the I/O library to determine the end user's
> (as opposed to the language's) default encoding dynamically and act
> accordingly. Users who use non-ASCII characters without informing the
> OS of their encoding are in a world of pain, *unless* they use the OS
> default encoding (which may vary per locale). If the OS can figure out
> the default encoding, so can the Python I/O library. Many apps won't
> have to go beyond this at all.
>
> Note that I don't want to use this OS/user default encoding as the
> default encoding between bytes and strings; once you are reading bytes
> you are writing "grown-up" code and you will have to be explicit. It's
> only the I/O library that should automatically encode on write and
> decode on read.
>
>   
>>  Currently you can read a text file and process it - making sure that any
>> changes/requirements only use ascii characters. It therefore doesn't matter
>> what 8 bit ascii-superset encoding is used in the original. If you force the
>> programmer to specify the encoding in order to read the file, they would
>> have to pass that requirement onto their user. Their user is even less
>> likely to be encoding aware than the programmer.
>>     
>
> I disagree -- the user most likely has set or received a default
> encoding when they first got the computer, and that's all they are
> using. If other tools (notepad, wordpad, emacs, vi etc.) can figure
> out the encoding, so can Python's I/O library.
>
>   
I'm intrigued by the encoding-guessing techniques you envisage. I 
currently use a modified version of some code from docutils.

I read the file in binary mode and first check for a UTF-8 or UTF-16 BOM.

Then I try to decode the text using the following encodings, in this 
order:

ascii
UTF8
locale.nl_langinfo(locale.CODESET)
locale.getlocale()[1]
locale.getdefaultlocale()[1]
ISO8859-1
cp1252

(The encodings returned by the locale calls are only used on platforms 
for which they exist.)
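
In rough outline, the approach looks something like this (my own sketch 
of the description above, not the actual docutils-derived code; the 
function name and the exact exception handling are assumptions):

    import codecs
    import locale

    def guess_decode(data):
        """Decode raw bytes: check for a BOM first, then try candidates in order."""
        # Explicit byte-order marks are unambiguous, so handle them first.
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):].decode('utf-8'), 'utf-8'
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return data.decode('utf-16'), 'utf-16'

        # Build the candidate list; the locale-derived entries only exist
        # on some platforms, so guard each lookup.
        candidates = ['ascii', 'utf-8']
        for getter in (lambda: locale.nl_langinfo(locale.CODESET),
                       lambda: locale.getlocale()[1],
                       lambda: locale.getdefaultlocale()[1]):
            try:
                enc = getter()
            except (AttributeError, ValueError):
                enc = None
            if enc and enc.lower() not in candidates:
                candidates.append(enc)
        candidates.extend(['iso8859-1', 'cp1252'])

        # The first decode that doesn't blow up wins.
        for enc in candidates:
            try:
                return data.decode(enc), enc
            except (UnicodeDecodeError, LookupError):
                continue
        raise UnicodeError('no candidate encoding worked')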

The first decode that doesn't blow up is assumed to be correct. The 
problem I have is that I usually (for the application I have in mind, 
anyway) then want to re-encode into a consistent encoding rather than 
back into the original encoding. If the original encoding (usually 
unspecified) is an arbitrary 8-bit ASCII superset, as it usually is, it 
will probably decode without error under any other arbitrary 8-bit 
encoding. This means I sometimes get junk.
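
As a made-up illustration of that failure mode (not taken from my actual 
code): bytes written in one 8-bit superset of ASCII will usually decode 
without error under another, so the "first decode that works" rule can 
silently pick the wrong one:

    # "Привет" encoded as cp1251 (Cyrillic) bytes
    data = u'\u041f\u0440\u0438\u0432\u0435\u0442'.encode('cp1251')
    print(data.decode('iso8859-1'))  # no exception, but mojibake: 'Ïðèâåò'
    print(data.decode('cp1251'))     # the intended text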

I'm curious whether there is anything extra I could do. This is possibly 
beyond the scope of this discussion (in which case I apologise), but we 
are discussing the techniques the I/O layer would use to 'guess' the 
encoding of a file opened in text mode - so maybe it's not so off topic.

There is also the following cookbook recipe, which uses a heuristic to 
guess the encoding:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743

XML, HTML, or other text streams may also contain additional information 
about their encoding - which may be unreliable. :-)
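
For what it's worth, a quick-and-dirty sniff for such a declaration 
might look like this (a sketch of my own, not from any particular 
library; the regexes and the 1024-byte window are arbitrary choices):

    import re

    def declared_encoding(data):
        """Return the encoding named in an XML declaration or HTML meta
        charset within the first 1024 bytes, or None if there isn't one."""
        # iso8859-1 maps every byte value, so this decode can never fail.
        head = data[:1024].decode('iso8859-1')
        match = re.search(r'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head)
        if match is None:
            match = re.search(r'charset=["\']?([\w.-]+)', head, re.IGNORECASE)
        return match.group(1) if match else None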

All the best,

Michael Foord



