I wrote:
A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
Another way of putting this is:
- utf-8 in an 8-bit string is to a unicode string what a pickle is to an object.
- defaulting to utf-8 upon coercing is like implicitly trying to unpickle an 8-bit string when comparing it to an instance. Bad idea.
Defaulting to Latin-1 is the only logical choice, no matter how western-culture-centric this may seem.
Just
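A small sketch of why Latin-1 has this special position. It assumes Python 3, where `bytes` stands in for the 8-bit string and `str` for the Unicode string (the message itself is about the older string and unicode types). The variable names are purely illustrative.

```python
# Python 3 sketch: bytes plays the role of the 8-bit string, str the Unicode string.
raw = bytes(range(256))  # one of every possible byte value

# Latin-1 maps byte N to code point N, so decoding can never fail
# and the round trip gives back exactly the same bytes.
as_latin1 = raw.decode("latin-1")
assert [ord(ch) for ch in as_latin1] == list(range(256))
assert as_latin1.encode("latin-1") == raw

# UTF-8 is a serialized form (the "pickle" in the analogy above):
# arbitrary bytes are generally not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Decoding as Latin-1 loses nothing for any input, whereas decoding as UTF-8 is an "unpickle" step that can reject the data.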
The Van Rossum Common Sense gene strikes again! You guys owe it to the world to have lots of children.

I agree 100%. Let me also add that if you want to do encoding work that goes beyond what the library gives you, you absolutely need a 'byte array' type which makes no assumptions and does nothing magic to its content. I have always thought of 8-bit strings as 'byte arrays' and not 'character arrays', and doing anything magic to them in literals or standard input is going to cause lots of trouble.

I think our proposal is BETTER than Java, Tcl, Visual Basic etc. for the following reasons:
- you can work with old-fashioned strings, which are understood by everyone to be arrays of bytes, and there is no magic conversion going on. The bytes in literal strings in your script file are the bytes that end up in the program.
- you can work with Unicode strings if you want
- you are in explicit control of conversions between them
- both types have similar methods so there isn't much to learn or remember

The 'no magic' thing is very important with Japanese, where very often you need to roll your own codecs and look at the raw bytes; any auto-conversion might not go through the filter you want, and you've already lost information before you started. This matters especially if your job is to repair possibly corrupt data. Any company with a few extra custom characters in the user-defined Shift-JIS range is going to suddenly find their Perl scripts are failing or trashing all their data as a result of the UTF-8 decision.

I'm also convinced that the majority of Python scripts won't need to work in Unicode. Even working with exotic languages, there is always a native 8-bit encoding.
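As a rough illustration of the 'explicit control' and 'no magic' points, here is a sketch in Python 3 (an assumption; there `bytes` is the byte array and `str` is the Unicode string). The helper `try_decode` is hypothetical, not a library function, and the codec names are the ones shipped with Python's standard codecs.

```python
def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None if the
    bytes do not fit the encoding. The raw bytes are never modified."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None

# Three kanji written with escapes so the example does not depend
# on the encoding of this source file.
original = "\u65e5\u672c\u8a9e"
raw = original.encode("shift_jis")      # explicit conversion to bytes

text = try_decode(raw, "shift_jis")     # explicit conversion back
as_ascii = try_decode(raw, "ascii")     # wrong codec: reported, not guessed

# Append a byte pair whose lead byte lies in the user-defined range.
# Whether a given codec accepts it depends on its mapping table, so no
# particular outcome is assumed here; the bytes themselves stay intact.
custom = raw + b"\xf0\x40"
custom_text = try_decode(custom, "shift_jis")
```

Because nothing is converted implicitly, `custom` still holds the exact bytes that came in, and a home-grown codec could be tried on them next.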
I have only used Unicode when:
(a) working with data that is in several languages
(b) doing conversions, which requires a 'central point'
(c) wanting to do per-character operations safely on multi-byte data

I still haven't sorted out in my head whether the default encoding thing is a big red herring or is important; I already have a safe way to construct Unicode literals in my source files if I want to, using unicode('rawdata','myencoding'). But if there has to be one, I'd say the following:
- strict ASCII is an option
- Latin-1 is the more generous option that is right for the most people, and has a 'special status' among 8-bit encodings
- UTF-8 is not one byte per character and will confuse people

Just my 2p worth,

Andy
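For readers who want to see the "not one byte per character" point concretely, a minimal sketch follows. It assumes Python 3, where `str(raw, encoding)` is the nearest equivalent of the `unicode('rawdata','myencoding')` call mentioned above; the sample word is arbitrary.

```python
word = "caf\u00e9"                 # four characters, the last one accented

latin1 = word.encode("latin-1")    # one byte per character
utf8 = word.encode("utf-8")        # the accented letter takes two bytes

latin1_len = len(latin1)
utf8_len = len(utf8)

# Slicing the UTF-8 bytes by "character count" cuts the last
# character's two-byte sequence in half.
truncated = utf8[:4]
try:
    truncated.decode("utf-8")
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False

# Per-character work is safe once the bytes have been converted
# explicitly, with the encoding named by the programmer.
text = str(latin1, "latin-1")
first_four = text[:4]
```

Here the byte slice fails to decode, while the same slice on the explicitly constructed Unicode string returns the whole word.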