[I18n-sig] Re: Unicode debate

M.-A. Lemburg mal@lemburg.com
Fri, 28 Apr 2000 20:52:04 +0200

Just van Rossum wrote:
> At 2:09 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >> 1, because 2 can lead to surprises when two strings containing binary goop
> >> are added and only one was a literal in a source file with an explicit
> >> encoding.
> >
> [...]
> >I should have been more precise:
> >
> >2. provided both strings have encodings which can be converted
> >   to Unicode, coerce them to Unicode and then apply the action;
> >   otherwise proceed as in 1., i.e. the result has an undefined
> >   encoding.
> >
> >If 2. does try to convert to Unicode, conversion errors should
> >be raised (just like they are now for Unicode coercion errors).
> But that doesn't solve the binary goop problem: two binary gooplets may
> have different "encodings", which happen to be valid (ie. not raise an
> exception). Conversion to unicode is in no way what you want.

See the first line ;-) ... "provided both strings have encodings
which can be converted to Unicode" ... binary encodings would
not fall under these.

str('...data1...','binary') + str('...data2...','UTF-8')
would yield str('...data1......data2...','undefined')

Plus, we'd need to add a third case:

3. Of course, actions on strings of the same encoding should
   result in strings of that same encoding, e.g.
   str('...data1...','enc1') + str('...data2...','enc1')
   should yield str('...data1......data2...','enc1')
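
The three cases could be sketched roughly as follows. This is only
an illustration under assumptions: TaggedStr, NON_TEXT and the tag
names 'binary' and 'undefined' are hypothetical stand-ins for the
proposed str(data, encoding) objects, not an existing Python type,
and the sketch uses present-day bytes/str rather than the 8-bit
strings discussed here.

```python
# Hypothetical sketch: TaggedStr is not a real Python type.
NON_TEXT = {"binary", "undefined"}  # tags that cannot be converted to Unicode


class TaggedStr:
    """An 8-bit string that carries an encoding tag."""

    def __init__(self, data, encoding="undefined"):
        self.data = bytes(data)
        self.encoding = encoding

    def __add__(self, other):
        if not isinstance(other, TaggedStr):
            return NotImplemented
        # Case 3: same encoding on both sides, so the tag propagates.
        if self.encoding == other.encoding:
            return TaggedStr(self.data + other.data, self.encoding)
        # Case 1: at least one side is not text, so the result is undefined.
        if self.encoding in NON_TEXT or other.encoding in NON_TEXT:
            return TaggedStr(self.data + other.data, "undefined")
        # Case 2: both sides are text, so coerce to Unicode and then add.
        # A failed conversion raises UnicodeDecodeError instead of passing silently.
        return self.data.decode(self.encoding) + other.data.decode(other.encoding)


mixed = TaggedStr(b"data1", "binary") + TaggedStr(b"data2", "utf-8")
same = TaggedStr(b"data1", "latin-1") + TaggedStr(b"data2", "latin-1")
text = TaggedStr(b"\xe9", "latin-1") + TaggedStr(b"\xc3\xa9", "utf-8")
```

Here mixed ends up tagged 'undefined', same keeps 'latin-1', and
text is a plain Unicode string.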

> >Some more tricky business:
> >
> >How should str('bla', 'enc1') and str('bla', 'enc2') compare ?
> >What about the hash values of the two ?
> I proposed to *only* use the encoding attr when dealing with 8-bit
> string/unicode string combos. Just ignore it completely when there's no
> unicode string in sight.

You can't ignore it completely because that would quickly
render it useless: point 3. is very important to ensure that
strings with a known encoding propagate their encoding as they
get processed. Otherwise you'd soon only deal with strings of
undefined encoding and the whole strategy would be pointless.
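
The comparison and hashing question quoted above is tricky for a
concrete reason, shown in the small sketch below (present-day
Python, with latin-1, cp437 and UTF-8 picked only as examples; the
variable names are made up): identical bytes can stand for different
text, and different bytes can stand for the same text.

```python
raw = b"\xe4"

# Same bytes, different tags: the decoded text differs.
text_latin1 = raw.decode("latin-1")
text_cp437 = raw.decode("cp437")
same_bytes_same_text = text_latin1 == text_cp437

# Different bytes, different tags: the decoded text is identical.
a = "\u00e4".encode("latin-1")
b = "\u00e4".encode("utf-8")
diff_bytes_same_text = a != b and a.decode("latin-1") == b.decode("utf-8")

# Equal objects must hash equally, so a text-based == needs a text-based hash.
same_hash = hash(a.decode("latin-1")) == hash(b.decode("utf-8"))
```

So comparing the raw bytes and comparing the decoded text give
different answers, and whichever one == uses, the hash has to
follow it.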

Hmm, I think this road doesn't lead anywhere (but it was fun
anyway ;). As I've written a few times before: if you intend
to go Unicode, make all your strings Unicode.

Perhaps there should be an experimental command line flag which
turns "..." in source code into u"..." to be able to test this
setup ?! 

If someone is interested, I have a patch which adds
a -U flag. The Python compiler will then interpret all '...'
strings as u'...' strings. Hmm, that switch should probably
be called something like -Py3k ;-)

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/