[Python-Dev] Encoding of 8-bit strings and Python source code
M.-A. Lemburg
mal@lemburg.com
Thu, 27 Apr 2000 13:23:23 +0200
Tim Peters wrote:
>
> [Guido about going Latin-1]
> > Sorry, all this proposal does is change the default encoding on
> > conversions from UTF-8 to Latin-1. That's very
> > western-culture-centric.
>
> Well, if you talk with an Asian, they'll probably tell you that Unicode
> itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for
> non-Latin-1 Unicode characters). Most everyone likes their own national
> gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is
> that it annoys everyone.
>
> I do expect that the vast bulk of users would be less surprised if Latin-1
> *were* the default encoding. Then the default would be usable as-is for
> many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans).
> The non-Euros are in for a world of pain no matter what.
>
> just-because-some-groups-can't-win-doesn't-mean-everyone-must-
> lose-ly y'rs - tim
People tend to forget that UTF-8 is a lossless Unicode
encoding, while Latin-1 reduces Unicode to its lower 8 bits
(the first 256 code points): conversion from non-Latin-1
Unicode to strings would simply not work, and conversion from
non-Latin-1 strings to Unicode would only be possible via unicode().
Thus mixing Unicode and strings would then run perfectly in all
western countries using Latin-1 while the rest of the
world would need to convert all their strings to Unicode...
giving them an advantage over the western world we couldn't
possibly accept ;-)
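
The difference between the two encodings can be sketched as follows. This is a minimal illustration written for Python 3 (an assumption, since the post predates it), where `str` is Unicode text and `bytes` plays the role of the 8-bit string; the sample text and variable names are made up for the example.

```python
# Python 3 sketch: str is Unicode text, bytes is the 8-bit string.
# Sample text: Latin-1 characters plus a Euro sign and two CJK characters.
text = "Gr\u00fc\u00dfe \u20ac \u65e5\u672c"

# UTF-8 can encode every code point and decodes back to the same text.
utf8_bytes = text.encode("utf-8")
roundtrip_ok = utf8_bytes.decode("utf-8") == text

# Latin-1 only covers U+0000..U+00FF, so the same text cannot be encoded.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# The Latin-1 subset alone still works and uses one byte per character.
latin1_bytes = "Gr\u00fc\u00dfe".encode("latin-1")
```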
FYI, here's a summary of which conversions take place (going Latin-1
would disable most of the Unicode integration in favour of conversion
errors):
Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))
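
Spelled out explicitly, the conversions in this table look roughly like the sketch below. It is written for Python 3 as an assumption: there the implicit coercion no longer exists, so each step the table describes has to be written by hand, and `bytes` stands in for the 8-bit string. The sample values are illustrative only.

```python
raw = "caf\u00e9".encode("utf-8")   # 8-bit string holding UTF-8 data
uni = " \u20ac5"                    # Unicode string

# Python 3 refuses to mix the two types implicitly.
try:
    raw + uni
    mixing_failed = False
except TypeError:
    mixing_failed = True

# "string + unicode": decode the 8-bit string as UTF-8, then concatenate.
combined = str(raw, "utf-8") + uni

# "str(unicode)": encode the Unicode string as UTF-8.
as_bytes = combined.encode("utf-8")

# "repr(unicode)" builds on the unicode-escape codec, which turns
# non-ASCII characters into backslash escapes.
escaped = combined.encode("unicode-escape")
```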
C (PyArg_ParseTuple):
---------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')
This affects all C modules and builtins. If a C module
wants to receive a specific predefined encoding, it can
use the new "es" and "es#" parser markers.
Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding
Hmm, perhaps a codecs.raw_input(encname) which returns Unicode
directly wouldn't be a bad idea either ?!
(*) This should probably be changed to be source code
encoding dependent, so that u"...data..." matches
"...data..." in appearance in the Python source code
(see below).
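
A short sketch of the first few entry points listed above, again assuming Python 3, where `str(data, encname)` is the counterpart of `unicode(string,encname)`. The byte strings and variable names are made up for the example.

```python
# The same text stored as 8-bit data in two different encodings.
data_latin1 = b"\xe9t\xe9"            # Latin-1: one byte per character
data_utf8 = b"\xc3\xa9t\xc3\xa9"      # UTF-8: two bytes for each accented char

# unicode(string, encname): decode with an explicitly named encoding.
from_latin1 = str(data_latin1, "latin-1")
from_utf8 = str(data_utf8, "utf-8")

# Escape sequences in a literal can name any Unicode character.
euro = "\u20ac"

# Decoding with the wrong encoding name fails instead of guessing.
try:
    str(data_latin1, "utf-8")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
```

Both decodes yield the same three-character Unicode string, which is the point of naming the encoding explicitly.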
IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings
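
The codecs.open() rows and the StreamRecoder idea can be sketched like this. The example assumes Python 3 and uses a temporary file and an in-memory stream in place of real files and stdin/stdout; `codecs.EncodedFile` is the standard helper that returns a `StreamRecoder`. File name and sample text are illustrative only.

```python
import codecs
import io
import os
import tempfile

text = "Gr\u00fc\u00dfe \u20ac"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# codecs.open(file,'wb',encname).write(unicode): text is encoded on write.
with codecs.open(path, "wb", "utf-8") as f:
    f.write(text)

# A plain binary read shows the encoded bytes that reached the disk.
with open(path, "rb") as f:
    raw_bytes = f.read()

# codecs.open(file,'rb',encname).read(): bytes are decoded on read.
with codecs.open(path, "rb", "utf-8") as f:
    read_back = f.read()

# StreamRecoder: the caller writes UTF-8 bytes, the wrapped stream
# receives the same text recoded as Latin-1.
sink = io.BytesIO()
recoder = codecs.EncodedFile(sink, "utf-8", "latin-1")
recoder.write("Gr\u00fc\u00dfe".encode("utf-8"))
recoded = sink.getvalue()
```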
The Python parser should probably also be extended to read
encoded Python source code using some hint at the start of
the source file (perhaps only allowing a small subset of the
supported encodings, e.g. ASCII, Latin-1, UTF-8 and UTF-16).
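
The post leaves the form of the hint open. As an illustration only, the sketch below uses the `coding:` comment that later Python versions recognise at the top of a source file (PEP 263), which may differ from what was envisioned here. It compiles the same one-line program from Latin-1 and from UTF-8 source bytes under Python 3.

```python
# The same source line, stored as bytes in two different encodings.
source_latin1 = b'# -*- coding: latin-1 -*-\ntext = "caf\xe9"\n'
source_utf8 = b'# -*- coding: utf-8 -*-\ntext = "caf\xc3\xa9"\n'

def run_source(source_bytes):
    """Compile and run encoded source bytes, returning the value of `text`."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["text"]

# The parser reads the hint and decodes the literal accordingly.
text_from_latin1 = run_source(source_latin1)
text_from_utf8 = run_source(source_utf8)
```

With the hint in place, both files produce the same Unicode string even though their bytes differ.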
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/