[Python-Dev] Encoding of 8-bit strings and Python source code

M.-A. Lemburg mal@lemburg.com
Thu, 27 Apr 2000 13:23:23 +0200


Tim Peters wrote:
> 
> [Guido about going Latin-1]
> > Sorry, all this proposal does is change the default encoding on
> > conversions from UTF-8 to Latin-1.  That's very
> > western-culture-centric.
> 
> Well, if you talk with an Asian, they'll probably tell you that Unicode
> itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for
> non-Latin-1 Unicode characters).  Most everyone likes their own national
> gimmicks best.  Or, as Andy once said (paraphrasing), the virtue of UTF-8 is
> that it annoys everyone.
> 
> I do expect that the vast bulk of users would be less surprised if Latin-1
> *were* the default encoding.  Then the default would be usable as-is for
> many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans).
> The non-Euros are in for a world of pain no matter what.
> 
> just-because-some-groups-can't-win-doesn't-mean-everyone-must-
>     lose-ly y'rs  - tim

People tend to forget that UTF-8 is a lossless Unicode
encoding while Latin-1 reduces Unicode to its lower 8 bits:
conversion from non-Latin-1 Unicode to strings would simply
not work, and conversion from non-Latin-1 strings to Unicode
would only be possible via an explicit unicode() call.
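
To make the difference concrete, here is a minimal sketch. It assumes
Python 3 spelling, where str is the Unicode type and bytes is the 8-bit
string; the sample text is made up for illustration.

```python
# Python 3 spelling: str is Unicode text, bytes is an 8-bit string.
text = "Gr\u00fc\u00dfe \u20ac"   # u-umlaut and sharp s are Latin-1; the euro sign is not

# UTF-8 can encode every Unicode character and decodes back unchanged.
utf8_bytes = text.encode("utf-8")
roundtrip_ok = utf8_bytes.decode("utf-8") == text

# Latin-1 only covers code points 0-255, so the euro sign cannot be encoded.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# The Latin-1 subset of the text still converts fine.
latin1_bytes = "Gr\u00fc\u00dfe".encode("latin-1")

print(roundtrip_ok, latin1_failed, latin1_bytes)
```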

Mixing Unicode and strings would then run perfectly in all
western countries using Latin-1, while the rest of the
world would need to convert all their strings to Unicode...
giving them an advantage over the western world we couldn't
possibly accept ;-)

FYI, here's a summary of which conversions take place (going Latin-1
would disable most of the Unicode integration in favour of conversion
errors):

Python:
-------
string + unicode:	unicode(string,'utf-8') + unicode
string.method(unicode):	unicode(string,'utf-8').method(unicode)
print unicode:		print unicode.encode('utf-8'); with stdout
			redirection this can be changed to any
			other encoding
str(unicode):		unicode.encode('utf-8')
repr(unicode):		repr(unicode.encode('unicode-escape'))
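
The right-hand sides of this table can also be written out by hand. The
following is only a sketch, assuming Python 3 spelling (str is the
Unicode type, bytes is the 8-bit string, and none of these conversions
happens implicitly there); the sample values are made up.

```python
s = b"caf\xc3\xa9"          # 8-bit string holding UTF-8 encoded data
u = "\u20ac 5"              # Unicode object containing a non-Latin-1 char

# string + unicode  ->  unicode(string, 'utf-8') + unicode
combined = s.decode("utf-8") + u

# str(unicode)  ->  unicode.encode('utf-8')
as_utf8 = u.encode("utf-8")

# repr(unicode) is built from unicode.encode('unicode-escape')
escaped = u.encode("unicode-escape")

# Decoding the same bytes as Latin-1 never fails, but it yields two
# characters for the UTF-8 encoded e-acute instead of one.
misread = s.decode("latin-1")

print(ascii(combined), as_utf8, escaped, len(misread))
```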


C (PyArg_ParseTuple):
---------------------
"s" + unicode:		same as "s" + unicode.encode('utf-8')
"s#" + unicode:		same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:		same as "t" + unicode.encode('utf-8')
"t#" + unicode:		same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module
wants to receive a certain predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string 		same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
			Latin-1 chars as single-char input; using
			escape sequences any Unicode char can be
			entered (*)
codecs.open(filename,mode,encname)
			opens an encoded file for
			reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
			returns UTF-8 strings based on the input
			encoding

Hmm, perhaps a codecs.raw_input(encname) which returns Unicode
directly wouldn't be a bad idea either ?!
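
A rough sketch of what such a helper might look like, in Python 3
spelling. Note that codecs.raw_input() does not exist; the name
raw_input_unicode and its stream parameter are hypothetical, and a
BytesIO object stands in for a redirected stdin.

```python
import codecs
import io

def raw_input_unicode(encname, stream):
    """Hypothetical helper: read one line from a byte stream and
    return it as Unicode, without the trailing line break."""
    reader = codecs.getreader(encname)(stream)   # StreamReader for encname
    line = reader.readline()
    return line.rstrip("\r\n")

# Stand-in for a redirected stdin delivering Latin-1 encoded bytes.
fake_stdin = io.BytesIO("Stra\u00dfe\n".encode("latin-1"))
answer = raw_input_unicode("latin-1", fake_stdin)
print(ascii(answer))
```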

(*) This should probably be changed to be source code
encoding dependent, so that u"...data..." matches
"...data..." in appearance in the Python source code
(see below).


IO:
---
open(file,'w').write(unicode)
	same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
	same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
	same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
	same as unicode(open(file,'rb').read(),encname)
stdin + stdout
	can be redirected using StreamRecoders to handle any
	of the supported encodings
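
The codecs.open() equivalences and the recoding of a stream can be
checked with a short script. This is a sketch in Python 3 spelling; the
temporary file name and the sample text are assumptions, and the
StreamRecoder is obtained through codecs.EncodedFile() rather than by
redirecting the real stdin or stdout.

```python
import codecs
import io
import os
import tempfile

text = "\u00e4\u00f6\u00fc \u20ac"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # throwaway file

# codecs.open(file,'wb',encname).write(unicode)
with codecs.open(path, "wb", "utf-8") as f:
    f.write(text)

# ... leaves the same bytes on disk as encoding by hand.
with open(path, "rb") as f:
    raw = f.read()
same_bytes = raw == text.encode("utf-8")

# codecs.open(file,'rb',encname).read() returns Unicode directly.
with codecs.open(path, "rb", "utf-8") as f:
    decoded = f.read()

# A StreamRecoder translates between the encoding the program uses
# (UTF-8 here) and the encoding of the underlying stream (Latin-1 here).
stream = io.BytesIO()
recoder = codecs.EncodedFile(stream, data_encoding="utf-8",
                             file_encoding="latin-1")
recoder.write("\u00e4".encode("utf-8"))
recoded = stream.getvalue()

print(same_bytes, decoded == text, recoded)
```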

The Python parser should probably also be extended to read
encoded Python source code using some hint at the start of
the source file (perhaps only allowing a small subset of the
supported encodings, e.g. ASCII, Latin-1, UTF-8 and UTF-16).
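
How such a hint could be picked up is sketched below, in Python 3
spelling. Everything specific here is an assumption made for
illustration: the hint format (a first-line comment like
"# coding: latin-1"), the helper name detect_source_encoding and the
ASCII default. UTF-16 is left out because it would need byte order mark
detection rather than a comment.

```python
import codecs
import re

# Hypothetical hint format: a first-line comment such as "# coding: latin-1".
HINT_RE = re.compile(rb"^#.*?coding[:=]\s*([-\w.]+)")

# Small subset of allowed source encodings, as normalized codec names.
ALLOWED = {codecs.lookup(n).name for n in ("ascii", "latin-1", "utf-8")}

def detect_source_encoding(first_line, default="ascii"):
    """Return the normalized encoding named by the hint, or the default."""
    match = HINT_RE.match(first_line)
    if match is None:
        return codecs.lookup(default).name
    name = codecs.lookup(match.group(1).decode("ascii")).name
    if name not in ALLOWED:
        raise ValueError("unsupported source encoding: " + name)
    return name

source = b"# coding: latin-1\ns = 'Stra\xdfe'\n"
first_line = source.split(b"\n", 1)[0]
encoding = detect_source_encoding(first_line)
decoded_source = source.decode(encoding)
print(encoding, ascii(decoded_source))
```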


-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/