Encoding of 8-bit strings and Python source code
After the discussion about #pragmas two weeks ago and some interesting ideas in the direction of source code encodings and ways to implement them, I would like to restart the discussion about encodings in source code and runtime auto-conversions.

Fredrik recently posted patches to the patches list which loosen the currently hard-coded default encoding used throughout the Unicode design and add a layer of abstraction which would make it easily possible to change the default encoding at some later point. While making things more abstract is certainly a wise thing to do, I am not sure whether this particular case fits into the design decisions made a few months ago.

Here's a short summary of what was discussed recently:

1. Fredrik posted the idea of changing the default encoding from UTF-8 to Latin-1 (he calls this 8-bit Unicode, which points to the motivation behind this: 8-bit strings should behave like 8-bit Unicode). His recent patches work in this direction.

2. Fredrik also posted an interesting idea which enables writing Python source code in any supported encoding by having the Python tokenizer read Py_UNICODE data instead of char data. A preprocessor would take care of converting the input to Py_UNICODE; the parser would assure that 8-bit string data gets converted back to char data (using e.g. UTF-8 or Latin-1 for the encoding).

3. Regarding the addition of pragmas to allow specifying the used source code encoding, several possibilities were mentioned:

   - addition of a keyword "pragma" to define pragma dictionaries
   - usage of a "global" as basis for this
   - adding a new keyword "decl" which also allows defining other things such as type information
   - XML-like syntax embedded into Python comments

Some comments:

Ad 1. UTF-8 is used as basis in many other languages such as TCL or Perl. It is not an intuitive way of writing strings and causes problems due to one character spanning 1-6 bytes. Still, the world seems to be moving in this direction, so going the same way can't be all wrong...

Note that stream IO can be recoded in a way which allows Python to print and read e.g. Latin-1 (see below). The general idea behind the fixed default encoding design was to give all the power to the user, since she eventually knows best which encoding to use or expect.

Ad 2. I like this idea because it enables writing Unicode-aware programs *in* Unicode... the only problem which remains is again the encoding to use for the classic 8-bit strings.

Ad 3. For 2. to work, the encoding would have to appear close to the top of the file. The preprocessor would have to be BOM-mark aware to tell whether UTF-16 or some ASCII extension is used by the file.

Guido asked me for some code which demonstrates Latin-1 recoding using the existing mechanisms. I've attached a simple script to this mail. It is not much tested yet, so please give it a try. You can also change it to use any other encoding you like.

Together with the Japanese codecs provided by Tamito Kajiyama (http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz) you should be able to type Shift-JIS at the raw_input() or interactive prompt, have it stored as UTF-8 and then printed back as Shift-JIS, provided you add a recoder similar to the attached one for Latin-1 to your PYTHONSTARTUP or site.py script.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
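The attached recoder script itself isn't preserved in the archive. As a rough sketch of the same idea in modern terms (using the io module rather than the streamcodecs of the time, and an in-memory buffer standing in for sys.stdout so the example is self-contained), transparent Latin-1 recoding of a stream looks like this:

```python
import io

# Hypothetical stand-in for the attached recoder script: wrap a byte
# stream so that text written to it is transparently encoded as Latin-1.
raw = io.BytesIO()
latin1_out = io.TextIOWrapper(raw, encoding="latin-1")

latin1_out.write("café")   # 'é' is code point 0xE9
latin1_out.flush()

assert raw.getvalue() == b"caf\xe9"   # one byte per character in Latin-1
```

Swapping the BytesIO for the real stdout byte stream gives the stdout-redirection behaviour the post describes.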
I'll follow up with a longer reply later; just one correction: M.-A. Lemburg <mal@lemburg.com> wrote:
Ad 1. UTF-8 is used as basis in many other languages such as TCL or Perl. It is not an intuitive way of writing strings and causes problems due to one character spanning 1-6 bytes. Still, the world seems to be moving into this direction, so going the same way can't be all wrong...
the problem here is the current Python implementation doesn't use UTF-8 in the same way as Perl and Tcl. Perl and Tcl only expose one string type, and that type behaves exactly like it should:

    "The Tcl string functions properly handle multibyte UTF-8
    characters as single characters."

    "By default, Perl now thinks in terms of Unicode characters
    instead of simple bytes. /.../ All the relevant built-in
    functions (length, reverse, and so on) now work on a
    character-by-character basis instead of byte-by-byte, and
    strings are represented internally in Unicode."

or in other words, both languages guarantee that given a string s:

- s is a sequence of characters (not bytes)
- len(s) is the number of characters in the string
- s[i] is the i'th character
- len(s[i]) is 1

and as I've pointed out a zillion times, Python 1.6a2 doesn't. this should be solved, and I see (at least) four ways to do that:

-- the Tcl 8.1 way: make 8-bit strings UTF-8 aware. operations like len and getitem usually search from the start of the string. to handle binary data, introduce a special ByteArray type. when mixing ByteArrays and strings, treat each byte in the array as an 8-bit unicode character (conversions from strings to byte arrays are lossy).

[imho: lots of code, and seriously affects performance, even when unicode characters are never used. this approach was abandoned in Tcl 8.2]

-- the Tcl 8.2 way: use a unified string type, which stores data as UTF-8 and/or 16-bit unicode:

    struct {
        char* bytes;           /* 8-bit representation (utf-8) */
        Tcl_UniChar* unicode;  /* 16-bit representation */
    }

if one of the strings is modified, the other is regenerated on demand. operations like len, slice and getitem always convert to 16-bit first. still need a ByteArray type, similar to the one described above.

[imho: faster than before, but still not as good as a pure 8-bit string type. and the need for a separate byte array type would break a lot of existing Python code]

-- the Perl 5.6 way? (haven't looked at the implementation, but I'm pretty sure someone told me it was done this way). essentially the same as Tcl 8.2, but with an extra encoding field (to avoid conversions if data is just passed through):

    struct {
        int encoding;
        char* bytes;           /* 8-bit representation */
        Tcl_UniChar* unicode;  /* 16-bit representation */
    }

[imho: see Tcl 8.2]

-- my proposal: expose both types, but let them contain characters from the same character set -- at least when used as strings.

as before, 8-bit strings can be used to store binary data, so we don't need a separate ByteArray type. in an 8-bit string, there's always one character per byte.

[imho: small changes to the existing code base, about as efficient as can be, no attempt to second-guess the user, fully backwards compatible, fully compliant with the definition of strings in the language reference, patches are available, etc...]

</F>
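The four guarantees Fredrik lists can be checked directly. This sketch uses a Unicode-aware string type (as in later Python versions, where str satisfies all four) purely to make the invariants concrete:

```python
s = "påtchwork"                 # a string with one non-ASCII character

assert len(s) == 9              # number of characters, not bytes
assert s[1] == "å"              # s[i] is the i'th character
assert len(s[1]) == 1           # a single character has length 1

# the byte length under a given encoding is a different quantity
# entirely -- this is exactly the distinction 1.6a2 blurred:
assert len(s.encode("utf-8")) == 10
```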
Fredrik Lundh wrote:
I'll follow up with a longer reply later; just one correction:
M.-A. Lemburg <mal@lemburg.com> wrote:
Ad 1. UTF-8 is used as basis in many other languages such as TCL or Perl. It is not an intuitive way of writing strings and causes problems due to one character spanning 1-6 bytes. Still, the world seems to be moving into this direction, so going the same way can't be all wrong...
the problem here is the current Python implementation doesn't use UTF-8 in the same way as Perl and Tcl. Perl and Tcl only expose one string type, and that type behaves exactly like it should:
"The Tcl string functions properly handle multibyte UTF-8 characters as single characters."
"By default, Perl now thinks in terms of Unicode characters instead of simple bytes. /.../ All the relevant built-in functions (length, reverse, and so on) now work on a character-by-character basis instead of byte-by-byte, and strings are represented internally in Unicode."
or in other words, both languages guarantee that given a string s:
- s is a sequence of characters (not bytes)
- len(s) is the number of characters in the string
- s[i] is the i'th character
- len(s[i]) is 1
and as I've pointed out a zillion times, Python 1.6a2 doesn't.
Just a side note: we never discussed turning the native 8-bit strings into any encoding aware type.
this should be solved, and I see (at least) four ways to do that:
... -- the Perl 5.6 way? (haven't looked at the implementation, but I'm pretty sure someone told me it was done this way). essentially the same as Tcl 8.2, but with an extra encoding field (to avoid conversions if data is just passed through).
    struct {
        int encoding;
        char* bytes;           /* 8-bit representation */
        Tcl_UniChar* unicode;  /* 16-bit representation */
    }
[imho: see Tcl 8.2]
-- my proposal: expose both types, but let them contain characters from the same character set -- at least when used as strings.
as before, 8-bit strings can be used to store binary data, so we don't need a separate ByteArray type. in an 8-bit string, there's always one character per byte.
[imho: small changes to the existing code base, about as efficient as can be, no attempt to second-guess the user, fully backwards compatible, fully compliant with the definition of strings in the language reference, patches are available, etc...]
Why not name the beast ?! In your proposal, the old 8-bit strings simply use Latin-1 as native encoding.

The current version doesn't make any encoding assumption as long as the 8-bit strings do not get auto-converted. In that case they are interpreted as UTF-8 -- which will (usually) fail for Latin-1 encoded strings using the 8th bit, but hey, at least you get an error message telling you what is going wrong.

The key to these problems is using explicit conversions where 8-bit strings meet Unicode objects.

Some more ideas along the convenience path:

Perhaps changing just the way 8-bit strings are coerced to Unicode would help: strings would then be interpreted as Latin-1. str(Unicode) and "t" would still return UTF-8 to assure loss-less conversion.

Another way to tackle this would be to first try UTF-8 conversion during auto-conversion and then fall back to Latin-1 in case it fails. Has anyone tried this ? Guido mentioned that TCL does something along these lines...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
M.-A. Lemburg wrote:
and as I've pointed out a zillion times, Python 1.6a2 doesn't.
Just a side note: we never discussed turning the native 8-bit strings into any encoding aware type.
hey, you just argued that we should use UTF-8 because Tcl and Perl use it, didn't you? my point is that they don't use it the way Python 1.6a2 uses it, and that their design is correct, while our design is slightly broken. so let's fix it !
Why not name the beast ?! In your proposal, the old 8-bit strings simply use Latin-1 as native encoding.
in my proposal, there's an important distinction between character sets and character encodings. unicode is a character set. latin 1 is one of many possible encodings of (portions of) that set.

maybe it's easier to grok if we get rid of the term "character set"? http://www.hut.fi/u/jkorpela/chars.html suggests the following replacements:

character repertoire
    A set of distinct characters.

character code
    A mapping, often presented in tabular form, which defines
    one-to-one correspondence between characters in a character
    repertoire and a set of nonnegative integers.

character encoding
    A method (algorithm) for presenting characters in digital form
    by mapping sequences of code numbers of characters into
    sequences of octets.

now, in my proposal, the *repertoire* contains all characters described by the unicode standard. the *codes* are defined by the same standard. but strings are sequences of characters, not sequences of octets: strings have *no* encoding. (the encoding used for the internal string storage is an implementation detail).

(but sure, given the current implementation, the internal storage for an 8-bit string happens to use Latin-1. just as the internal storage for a 16-bit string happens to use UCS-2 stored in native byte order. but from the outside, they're just character sequences).
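The three terms can be made concrete with a small sketch (the byte values shown are fixed by the respective standards, not an assumption):

```python
ch = "é"                        # a character from the Unicode repertoire

# character code: the one-to-one character <-> integer mapping
assert ord(ch) == 0xE9
assert chr(0xE9) == ch

# character encodings: different algorithms mapping sequences of
# code numbers to sequences of octets -- same character, different bytes
assert ch.encode("latin-1") == b"\xe9"         # one octet
assert ch.encode("utf-8") == b"\xc3\xa9"       # two octets
assert ch.encode("utf-16-be") == b"\x00\xe9"   # two different octets
```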
The current version doesn't make any encoding assumption as long as the 8-bit strings do not get auto-converted. In that case they are interpreted as UTF-8 -- which will (usually) fail for Latin-1 encoded strings using the 8th bit, but hey, at least you get an error message telling you what is going wrong.
sure, but I don't think you get the right message, or that you get it at the right time. consider this: if you're going from 8-bit strings to unicode using implicit conversion, the current design can give you:

    "UnicodeError: UTF-8 decoding error: unexpected code byte"

if you go from unicode to 8-bit strings, you'll never get an error. however, the result is not always a string -- if the unicode string happened to contain any characters larger than 127, the result is a binary buffer containing encoded data. you cannot use string methods on it, you cannot use regular expressions on it. indexing and slicing won't work. unlike earlier versions of Python, and unlike unicode-aware versions of Tcl and Perl, the fundamental assumption that a string is a sequence of characters no longer holds.

in my proposal, going from 8-bit strings to unicode always works. a character is a character, no matter what string type you're using. however, going from unicode to an 8-bit string may give you an OverflowError, say:

    "OverflowError: unicode character too large to fit in a byte"

the important thing here is that if you don't get an exception, the result is *always* a string. string methods always work. etc.

[8. Special cases aren't special enough to break the rules.]
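Both failure modes are easy to reproduce in a sketch. Here the modern exception names stand in for the messages quoted above (UnicodeDecodeError for the implicit-conversion failure; UnicodeEncodeError playing the role of the OverflowError Fredrik proposes):

```python
# 8-bit data -> unicode with a UTF-8 default fails on Latin-1 bytes:
try:
    b"caf\xe9".decode("utf-8")          # 0xE9 is not valid UTF-8 here
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded

# unicode -> one-byte-per-character string fails loudly when a
# character doesn't fit in a byte, instead of silently producing
# non-string data:
try:
    "\u20ac".encode("latin-1")          # EURO SIGN, code > 0xFF
    encoded = True
except UnicodeEncodeError:
    encoded = False
assert not encoded

# and when no exception is raised, the result is always well-formed:
assert "café".encode("latin-1") == b"caf\xe9"
```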
The key to these problems is using explicit conversions where 8-bit strings meet Unicode objects.
yeah, but the flaw in the current design is the implicit conversions, not the explicit ones.

[2. Explicit is better than implicit.]

(of course, the 8-bit string type also needs an "encode" method under my proposal, but that's just a detail ;-)
Some more ideas along the convenience path:
Perhaps changing just the way 8-bit strings are coerced to Unicode would help: strings would then be interpreted as Latin-1.
ok.
str(Unicode) and "t" would still return UTF-8 to assure loss-less conversion.
maybe. or maybe str(Unicode) should return a unicode string? think about it! (after all, I'm pretty sure that ord() and chr() should do the right thing, also for character codes above 127)
Another way to tackle this would be to first try UTF-8 conversion during auto-conversion and then fallback to Latin-1 in case it fails. Has anyone tried this ? Guido mentioned that TCL does something along these lines...
haven't found any traces of that in the source code.

hmm, you're right -- it looks like it attempts to "fix" invalid UTF-8 data (on a character by character basis), instead of choking on it. scary.

[12. In the face of ambiguity, refuse the temptation to guess.]

more tomorrow.

</F>
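The fallback strategy under discussion can be sketched as a small helper (hypothetical, not from any posted patch). Since every byte value is a valid Latin-1 character, the fallback can never fail -- which is exactly why it silently "fixes" data that was never Latin-1 to begin with:

```python
def guess_decode(raw: bytes) -> str:
    """Try strict UTF-8 first, then fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")   # cannot fail: all 256 bytes map

assert guess_decode(b"caf\xc3\xa9") == "café"   # valid UTF-8 wins
assert guess_decode(b"caf\xe9") == "café"       # invalid UTF-8 -> Latin-1

# the ambiguity: b"caf\xc3\xa9" is *also* well-formed Latin-1 text,
# and the guess silently picks the other reading.
```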
[Fredrik]
-- my proposal: expose both types, but let them contain characters from the same character set -- at least when used as strings.
as before, 8-bit strings can be used to store binary data, so we don't need a separate ByteArray type. in an 8-bit string, there's always one character per byte.
[imho: small changes to the existing code base, about as efficient as can be, no attempt to second-guess the user, fully backwards com- patible, fully compliant with the definition of strings in the language reference, patches are available, etc...]
Sorry, all this proposal does is change the default encoding on conversions from UTF-8 to Latin-1. That's very western-culture-centric. You already have control over the encoding: use unicode(s, "latin-1"). If there are places where you don't have enough control (e.g. file I/O), let's add control there.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Sorry, all this proposal does is change the default encoding on conversions from UTF-8 to Latin-1. That's very western-culture-centric.
That decision was made by ISO and the Unicode consortium, not me. I don't know why, and I don't really care -- I'm arguing that strings should contain characters, just like the language reference says, and that all characters should be from the same character repertoire and use the same character codes.
Fredrik Lundh wrote:
...
But alright, I give up. I've wasted way too much time on this, my patches were rejected, and nobody seems to care. Not exactly inspiring.
I can understand how frustrating this is. Sometimes something seems just so clean and mathematically obvious that you can't see why others don't see it that way.

A character is the "smallest unit of text." Strings are lists of characters. Characters in character sets have numbers. Python users should never know or care whether a string object is an 8-bit string or a Unicode string. There should be no distinction. u"" should be a syntactic shortcut.

The primary reason I have not been involved is that I have not had a chance to look at the implementation and figure out if there is an overriding implementation-based reason to ignore the obvious right thing (e.g. the right thing will break too much code or be too slow or...).

"Unicode objects" should be an implementation detail (if they exist at all). Strings are strings are strings. The Python programmer shouldn't care about whether one string was read from a Unicode file and another from an ASCII file and one typed in with "u" and one without. It's all the same thing!

If the programmer wants to do an explicit UTF-8 decode on a string (whether it is a Unicode or 8-bit string...no difference) then that decode should proceed by looking at each character, deriving an integer and then treating that integer as an octet according to the UTF-8 specification.

Char -> Integer -> Byte -> Char

The end result (and hopefully the performance) would be the same but the model is much, much cleaner if there is only one kind of string. We should not ignore the example set by every other language (and yes, I'm including XML here :) ). I'm as desperate (if not as vocal) as Fredrik is here.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
        - http://www.cs.yale.edu/~perlis-alan/quotes.html
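The Char -> Integer -> Byte -> Char model can be sketched directly (a hypothetical helper; it assumes every character code in the input fits in one octet, which holds whenever the string came from byte-per-character input):

```python
def utf8_decode_chars(s: str) -> str:
    # Char -> Integer: each character's code number
    codes = [ord(c) for c in s]
    # Integer -> Byte: treat the codes (all < 256 here) as octets
    octets = bytes(codes)
    # Byte -> Char: apply the UTF-8 decoding algorithm to the octets
    return octets.decode("utf-8")

# "caf\xc3\xa9" is UTF-8-encoded "café" read one character per byte:
assert utf8_decode_chars("caf\xc3\xa9") == "café"
```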
I haven't weighed in on this one, mainly because I don't even need ISO-1, let alone Unicode (and damned proud of it, too!). But Fredrik's glob example was horrifying. I do know that I am always conscious of whether a particular string is a sequence of characters, or a sequence of bytes. Seems to me the Py3K answer is to make those separate types. Until then, I guess I'll just remain completely xenophobic (and damned proud of it, too!).

- Gordon
[/F]
... But alright, I give up. I've wasted way too much time on this, my patches were rejected, and nobody seems to care. Not exactly inspiring.
I lost track of this stuff months ago, and since I use only 7-bit ASCII in my own source code and file names and etc etc, UTF-8 and Latin-1 are identical to me <0.5 wink>.

[Guido]
Sorry, all this proposal does is change the default encoding on conversions from UTF-8 to Latin-1. That's very western-culture-centric.
Well, if you talk with an Asian, they'll probably tell you that Unicode itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for non-Latin-1 Unicode characters). Most everyone likes their own national gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is that it annoys everyone.

I do expect that the vast bulk of users would be less surprised if Latin-1 *were* the default encoding. Then the default would be usable as-is for many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans). The non-Euros are in for a world of pain no matter what.

just-because-some-groups-can't-win-doesn't-mean-everyone-must-lose-ly y'rs - tim
Tim Peters wrote:
[Guido about going Latin-1]
Sorry, all this proposal does is change the default encoding on conversions from UTF-8 to Latin-1. That's very western-culture-centric.
Well, if you talk with an Asian, they'll probably tell you that Unicode itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for non-Latin-1 Unicode characters). Most everyone likes their own national gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is that it annoys everyone.
I do expect that the vast bulk of users would be less surprised if Latin-1 *were* the default encoding. Then the default would be usable as-is for many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans). The non-Euros are in for a world of pain no matter what.
just-because-some-groups-can't-win-doesn't-mean-everyone-must-lose-ly y'rs - tim
People tend to forget that UTF-8 is a loss-less Unicode encoding while Latin-1 reduces Unicode to its lower 8 bits: conversion from non-Latin-1 Unicode to strings would simply not work, conversion from non-Latin-1 strings to Unicode would only be possible via unicode().

Thus mixing Unicode and strings would then run perfectly in all western countries using Latin-1 while the rest of the world would need to convert all their strings to Unicode... giving them an advantage over the western world we couldn't possibly accept ;-)

FYI, here's a summary of which conversions take place (going Latin-1 would disable most of the Unicode integration in favour of conversion errors):

Python:
-------

string + unicode:
    unicode(string,'utf-8') + unicode

string.method(unicode):
    unicode(string,'utf-8').method(unicode)

print unicode:
    print unicode.encode('utf-8'); with stdout redirection this can
    be changed to any other encoding

str(unicode):
    unicode.encode('utf-8')

repr(unicode):
    repr(unicode.encode('unicode-escape'))

C (PyArg_ParseTuple):
---------------------

"s" + unicode:
    same as "s" + unicode.encode('utf-8')

"s#" + unicode:
    same as "s#" + unicode.encode('unicode-internal')

"t" + unicode:
    same as "t" + unicode.encode('utf-8')

"t#" + unicode:
    same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module wants to receive a certain predefined encoding, it can use the new "es" and "es#" parser markers.

Ways to enter Unicode:
----------------------

u'' + string
    same as unicode(string,'utf-8')

unicode(string,encname)
    any supported encoding

u'...unicode-escape...'
    unicode-escape currently accepts Latin-1 chars as single-char
    input; using escape sequences any Unicode char can be entered (*)

codecs.open(filename,mode,encname)
    opens an encoded file for reading and writing Unicode directly

raw_input() + stdin redirection
    returns UTF-8 strings based on the input encoding (see one of my
    earlier posts for code)

Hmm, perhaps a codecs.raw_input(encname) which returns Unicode directly wouldn't be a bad idea either ?!

(*) This should probably be changed to be source code encoding dependent, so that u"...data..." matches "...data..." in appearance in the Python source code (see below).

IO:
---

open(file,'w').write(unicode)
    same as open(file,'w').write(unicode.encode('utf-8'))

open(file,'wb').write(unicode)
    same as open(file,'wb').write(unicode.encode('unicode-internal'))

codecs.open(file,'wb',encname).write(unicode)
    same as open(file,'wb').write(unicode.encode(encname))

codecs.open(file,'rb',encname).read()
    same as unicode(open(file,'rb').read(),encname)

stdin + stdout can be redirected using StreamRecoders to handle any of the supported encodings.

The Python parser should probably also be extended to read encoded Python source code using some hint at the start of the source file (perhaps only allowing a small subset of the supported encodings, e.g. ASCII, Latin-1, UTF-8 and UTF-16).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
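The codecs.open() equivalences in the summary can be illustrated with a round trip (a sketch using a temporary file; io.open with an explicit encoding plays the role of codecs.open(file, mode, encname)):

```python
import io
import os
import tempfile

text = "Grüße"   # contains two non-ASCII Latin-1 characters

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")

    # codecs.open(file,'wb',encname).write(unicode): encoded text write
    with io.open(path, "w", encoding="latin-1") as f:
        f.write(text)

    # open(file,'rb').read() yields the raw encoded bytes...
    with io.open(path, "rb") as f:
        raw = f.read()
    assert raw == b"Gr\xfc\xdfe"

    # ...and unicode(open(file,'rb').read(), encname) is the same as
    # reading back through an encoding-aware stream:
    assert raw.decode("latin-1") == text
```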
participants (7)

- Fredrik Lundh
- Fredrik Lundh
- Gordon McMillan
- Guido van Rossum
- M.-A. Lemburg
- Paul Prescod
- Tim Peters