M.-A. Lemburg wrote:
and as I've pointed out a zillion times, Python 1.6a2 doesn't.
Just a side note: we never discussed turning the native 8-bit strings into any encoding aware type.
hey, you just argued that we should use UTF-8 because Tcl and Perl use it, didn't you? my point is that they don't use it the way Python 1.6a2 uses it, and that their design is correct, while our design is slightly broken. so let's fix it !
Why not name the beast ?! In your proposal, the old 8-bit strings simply use Latin-1 as native encoding.
in my proposal, there's an important distinction between character sets and character encodings. unicode is a character set. latin 1 is one of many possible encodings of (portions of) that set.

maybe it's easier to grok if we get rid of the term "character set"? http://www.hut.fi/u/jkorpela/chars.html suggests the following replacements:

    character repertoire
        A set of distinct characters.

    character code
        A mapping, often presented in tabular form, which defines one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers.

    character encoding
        A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets.

now, in my proposal, the *repertoire* contains all characters described by the unicode standard. the *codes* are defined by the same standard.

but strings are sequences of characters, not sequences of octets: strings have *no* encoding. (the encoding used for the internal string storage is an implementation detail).

(but sure, given the current implementation, the internal storage for an 8-bit string happens to use Latin-1. just as the internal storage for a 16-bit string happens to use UCS-2 stored in native byte order. but from the outside, they're just character sequences).
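The repertoire / code / encoding distinction above can be seen directly in a short sketch. This uses modern Python 3 (not 1.6a2), where `str` already behaves as a sequence of characters; the variable names are only for illustration.

```python
# One character from the Unicode repertoire.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Character code: the nonnegative integer the standard assigns to it.
code = ord(text)

# Character encodings: different octet sequences for the same character code.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")

# The string itself has length 1 no matter how it could be encoded.
print(len(text), code, as_latin1, as_utf8, as_utf16)
```

The string has one character and one code, while the number of octets depends entirely on which encoding is applied at the boundary.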
The current version doesn't make any encoding assumption as long as the 8-bit strings do not get auto-converted. In that case they are interpreted as UTF-8 -- which will (usually) fail for Latin-1 encoded strings using the 8th bit, but hey, at least you get an error message telling you what is going wrong.
sure, but I don't think you get the right message, or that you get it at the right time. consider this:

if you're going from 8-bit strings to unicode using implicit conversion, the current design can give you:

    "UnicodeError: UTF-8 decoding error: unexpected code byte"

if you go from unicode to 8-bit strings, you'll never get an error.

however, the result is not always a string -- if the unicode string happened to contain any characters larger than 127, the result is a binary buffer containing encoded data. you cannot use string methods on it, you cannot use regular expressions on it. indexing and slicing won't work.

unlike earlier versions of Python, and unlike unicode-aware versions of Tcl and Perl, the fundamental assumption that a string is a sequence of characters no longer holds.

in my proposal, going from 8-bit strings to unicode always works. a character is a character, no matter what string type you're using.

however, going from unicode to an 8-bit string may give you an OverflowError, say:

    "OverflowError: unicode character too large to fit in a byte"

the important thing here is that if you don't get an exception, the result is *always* a string. string methods always work. etc.

[8. Special cases aren't special enough to break the rules.]
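A minimal sketch of the conversion rules proposed above, written in modern Python 3 with `bytes` standing in for the 8-bit string type. The function names `widen` and `narrow` are hypothetical and were not part of any Python release; only the error message is taken from the text.

```python
def widen(octets: bytes) -> str:
    """8-bit string -> unicode string. Always works: each value 0..255
    keeps its character code (same result as octets.decode("latin-1"))."""
    return "".join(chr(value) for value in octets)


def narrow(text: str) -> bytes:
    """unicode string -> 8-bit string. Either every character fits in a
    byte, or the caller gets an exception; never a silently encoded buffer."""
    for ch in text:
        if ord(ch) > 255:
            raise OverflowError("unicode character too large to fit in a byte")
    return bytes(ord(ch) for ch in text)


eight_bit = b"caf\xe9"
wide = widen(eight_bit)

# Character codes and length survive the round trip unchanged.
assert len(wide) == len(eight_bit)
assert narrow(wide) == eight_bit

try:
    narrow("\u20ac")  # EURO SIGN, code 8364, does not fit in a byte
except OverflowError as exc:
    print(exc)
```

Under these assumptions the failure shows up at the narrowing step, where information would actually be lost, rather than at some later point where encoded data is mistaken for characters.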
The key to these problems is using explicit conversions where 8-bit strings meet Unicode objects.
yeah, but the flaw in the current design is the implicit conversions, not the explicit ones. [2. Explicit is better than implicit.] (of course, the 8-bit string type also needs an "encode" method under my proposal, but that's just a detail ;-)
Some more ideas along the convenience path:
Perhaps changing just the way 8-bit strings are coerced to Unicode would help: strings would then be interpreted as Latin-1.
ok.
str(Unicode) and "t" would still return UTF-8 to assure lossless conversion.
maybe. or maybe str(Unicode) should return a unicode string? think about it! (after all, I'm pretty sure that ord() and chr() should do the right thing, also for character codes above 127)
Another way to tackle this would be to first try UTF-8 conversion during auto-conversion and then fallback to Latin-1 in case it fails. Has anyone tried this ? Guido mentioned that TCL does something along these lines...
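The fallback idea in the paragraph above can be sketched as follows, in modern Python 3. The helper name `decode_with_fallback` is hypothetical; this is not how Python 1.6a2 or Tcl actually behaves.

```python
def decode_with_fallback(octets: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1 if the data is not valid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this cannot fail.
        return octets.decode("latin-1")


# Latin-1 data with the 8th bit set is usually invalid UTF-8, so the fallback kicks in.
from_latin1 = decode_with_fallback(b"caf\xe9")

# Valid UTF-8 is decoded as UTF-8.
from_utf8 = decode_with_fallback(b"caf\xc3\xa9")

# But some Latin-1 text is also valid UTF-8, and then the guess is wrong:
# the two Latin-1 characters U+00C3 U+00A9 come back as a single U+00E9.
ambiguous = "\u00c3\u00a9".encode("latin-1")
guessed = decode_with_fallback(ambiguous)
print(len(guessed))
```

The last case is the catch: the same octets are legitimate in both encodings, so the function has to guess, and the caller gets no error when the guess is wrong.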
haven't found any traces of that in the source code.

hmm, you're right -- it looks like it attempts to "fix" invalid UTF-8 data (on a character by character basis), instead of choking on it. scary.

[12. In the face of ambiguity, refuse the temptation to guess.]

more tomorrow.

</F>