Tim Peters wrote:
[Guido about going Latin-1]
Sorry, all this proposal does is change the default encoding on conversions from UTF-8 to Latin-1. That's very western-culture-centric.
Well, if you talk with an Asian, they'll probably tell you that Unicode itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for non-Latin-1 Unicode characters). Most everyone likes their own national gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is that it annoys everyone.
I do expect that the vast bulk of users would be less surprised if Latin-1 *were* the default encoding. Then the default would be usable as-is for many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans). The non-Euros are in for a world of pain no matter what.
just-because-some-groups-can't-win-doesn't-mean-everyone-must-lose-ly y'rs - tim
People tend to forget that UTF-8 is a lossless Unicode encoding, while Latin-1 reduces Unicode to its lower 8 bits. With Latin-1 as the default, conversion from non-Latin-1 Unicode to strings would simply not work, and conversion from non-Latin-1 strings to Unicode would only be possible via unicode().
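The asymmetry is easy to check. What follows is only a minimal sketch in present-day Python 3, where `str` plays the role of the Unicode type discussed here and `bytes` the role of the 8-bit string; the sample texts and the helper name `can_encode` are made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


western = "na\u00efve caf\u00e9"   # only characters that exist in Latin-1
greek = "\u03b1\u03b2\u03b3"       # Greek letters, outside Latin-1

# UTF-8 round-trips both samples without loss.
utf8_roundtrip_ok = all(
    s.encode("utf-8").decode("utf-8") == s for s in (western, greek)
)

# Latin-1 only copes with code points below 256.
latin1_western_ok = can_encode(western, "latin-1")
latin1_greek_ok = can_encode(greek, "latin-1")
```

Under these assumptions the Greek sample is exactly the case where a Latin-1 default would produce a conversion error instead of a string.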
Thus mixing Unicode and strings would then run perfectly in all western countries using Latin-1 while the rest of the world would need to convert all their strings to Unicode... giving them an advantage over the western world we couldn't possibly accept ;-)
FYI, here's a summary of which conversions take place (going Latin-1 would disable most of the Unicode integration in favour of conversion errors):
Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))
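To make the first rule concrete, here is a rough model of the `string + unicode` coercion, written for present-day Python 3 (which no longer coerces implicitly, so the decoding step is spelled out). The helper `coerce_concat` and its `default_encoding` parameter are hypothetical names, not part of any API.

```python
def coerce_concat(string, text, default_encoding="utf-8"):
    """Model of `string + unicode`: decode the byte string, then concatenate."""
    return string.decode(default_encoding) + text


# The euro sign as a UTF-8 encoded byte string: b'\xe2\x82\xac'
euro_utf8 = "\u20ac".encode("utf-8")

# With UTF-8 as the default, the three bytes become one character again.
as_utf8 = coerce_concat(euro_utf8, "!", "utf-8")

# With Latin-1 as the default, each byte is taken as a character of its own,
# so the same data silently turns into three unrelated characters.
as_latin1 = coerce_concat(euro_utf8, "!", "latin-1")
```

The point of the sketch is only that the default encoding decides what mixed operations mean; it does not reproduce the actual interpreter logic.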
C (PyArg_ParseTuple):
---------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')
This affects all C modules and builtins. In case a C module wants to receive a certain predefined encoding, it can use the new "es" and "es#" parser markers.
Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding
Hmm, perhaps a codecs.raw_input(encname) which returns Unicode directly wouldn't be a bad idea either ?!
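Something along these lines is what I have in mind. This is only a sketch for present-day Python 3: no `codecs.raw_input` exists, the names `make_raw_input` and `read_line` are hypothetical, and an in-memory byte stream stands in for stdin. Only `codecs.getreader` is a real codecs function.

```python
import codecs
import io


def make_raw_input(encname, stream):
    """Return a function that reads one decoded line per call from stream.

    The reader is created once because it buffers ahead; building a fresh
    reader for every line would lose the buffered data.
    """
    reader = codecs.getreader(encname)(stream)

    def read_line():
        # Strip the line terminator, as raw_input() does.
        return reader.readline().rstrip("\r\n")

    return read_line


# Stand-in for a Latin-1 encoded stdin.
fake_stdin = io.BytesIO("Gr\u00fc\u00dfe\nsecond line\n".encode("latin-1"))
read_line = make_raw_input("latin-1", fake_stdin)

first = read_line()
second = read_line()
```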
(*) This should probably be changed to be source code encoding dependent, so that u"...data..." matches "...data..." in appearance in the Python source code (see below).
IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings
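The two codecs.open() rows and the StreamRecoder remark can be tried out as follows. This is a sketch for present-day Python 3, where `str` stands in for the Unicode type; the file names, the sample text and the choice of Latin-1 are arbitrary, and recent Python versions mark `codecs.open` as deprecated in favour of the builtin `open`.

```python
import codecs
import io
import os
import tempfile

text = "Gr\u00fc\u00dfe"
encname = "latin-1"

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")

    # Writing through codecs.open() encodes on the way out ...
    with codecs.open(path_a, "wb", encname) as f:
        f.write(text)
    # ... which should match encoding by hand and writing raw bytes.
    with open(path_b, "wb") as f:
        f.write(text.encode(encname))

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        same_bytes = fa.read() == fb.read()

    # Reading through codecs.open() decodes on the way in.
    with codecs.open(path_a, "rb", encname) as f:
        read_back = f.read()

# codecs.EncodedFile() builds a StreamRecoder: the program writes UTF-8
# bytes, the wrapped stream receives the same text as Latin-1 bytes.
sink = io.BytesIO()
recoder = codecs.EncodedFile(sink, data_encoding="utf-8", file_encoding=encname)
recoder.write(text.encode("utf-8"))
recoded = sink.getvalue()
```

Wrapping stdin or stdout works the same way as wrapping the `io.BytesIO` object here.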
The Python parser should probably also be extended to read encoded Python source code using some hint at the start of the source file (perhaps only allowing a small subset of the supported encodings, e.g. ASCII, Latin-1, UTF-8 and UTF-16).
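For comparison, present-day Python 3 ships `tokenize.detect_encoding`, which looks for such a hint in the first two lines of a source file. The sketch below only shows what reading the hint looks like; the wrapper name `source_encoding` and the sample sources are made up, and the comment syntax shown is the one that function understands, not something settled in this thread.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding declared near the top of a source file."""
    encoding, _consumed_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# A source file carrying an explicit hint in its first line.
declared = source_encoding(b"# -*- coding: latin-1 -*-\nname = 'x'\n")

# A source file without any hint falls back to the default.
undeclared = source_encoding(b"name = 'x'\n")
```

Note that the function reports a normalized codec name, so a `latin-1` hint comes back as `iso-8859-1`.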