Re: [I18n-sig] Unicode strings: an alternative
(Boy, is it quiet here all of a sudden ;-)

Sorry for the duplication of stuff, but I'd like to reiterate my points, to separate them from my implementation proposal, as that's just what it is: an implementation detail.

These things are important to me:

- get rid of the Unicode-ness of wide strings, in order to
- make narrow and wide strings as similar as possible
- implicit conversion between narrow and wide strings should happen purely on the basis of the character codes; no assumption at all should be made about the encoding, i.e. what the character code _means_
- downcasting from wide to narrow may raise OverflowError if there are characters in the wide string that are > 255
- str(s) should always return s if s is a string, whether narrow or wide
- file objects need to be responsible for handling wide strings
- the above two points should make it possible for
- if no encoding is known, Unicode is the default, whether narrow or wide

The above points seem to have the following consequences:

- the 'u' in \uXXXX notation no longer makes much sense, since it is not necessary for the character to be a Unicode code point: it's just a 2-byte int. \wXXXX might be an option.
- the u"" notation is no longer necessary: if a string literal contains a character > 255, the string should automatically become a wide string.
- narrow strings should also have an encode() method.
- the builtin unicode() function might be redundant if:
  - it is possible to specify a source encoding. I'm not sure if this is best done through an extra argument for encode() or whether it should be a new method, e.g. transcode().
  - s.encode() or s.transcode() are allowed to output a wide string, as in aNarrowString.encode("UCS-2") and s.transcode("Mac-Roman", "UCS-2").

My proposal to extend the "old" string type to be able to contain wide strings is of course largely unrelated to all this. Yet it may provide some additional C compatibility (especially now that silent conversion to UTF-8 is out) as well as a workaround for the str()-having-to-return-a-narrow-string bottleneck.

Just
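A rough model of the coercion, downcasting, and transcode() ideas above, written in present-day Python. The names widen(), narrow(), and transcode() are hypothetical (nothing by those names was proposed verbatim), with bytes standing in for narrow strings and str for wide ones:

    def widen(s):
        # Promote a narrow string purely by character code: byte
        # value N becomes character code N, with no assumption
        # about what the codes mean.
        return "".join(chr(b) for b in s)

    def narrow(s):
        # Downcast a wide string; raise OverflowError, as proposed,
        # if any character code is > 255.
        if any(ord(ch) > 255 for ch in s):
            raise OverflowError("character code > 255 in wide string")
        return bytes(ord(ch) for ch in s)

    def transcode(s, source, target):
        # The suggested transcode(): reinterpret a narrow string
        # from a source encoding into a target encoding; in today's
        # terms, decode then encode.
        return s.decode(source).encode(target)

    assert widen(b"abc") == "abc"
    assert narrow("abc") == b"abc"
    # narrow("\u20ac") would raise OverflowError (0x20AC > 255)
    assert transcode(b"caf\x8e", "mac-roman", "latin-1") == b"caf\xe9"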
Just> Sorry for the duplication of stuff, but I'd like to reiterate my
Just> points, to separate them from my implementation proposal, as
Just> that's just what it is: an implementation detail.
Just> These things are important to me: ...

For the encoding-challenged like me, does it make sense to explicitly state that you can't mix character widths within a single string, or is that just so obvious that I deserve a head slap just for mentioning it?

-- 
Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould
On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum <just@letterror.com> wrote:
> (Boy, is it quiet here all of a sudden ;-)
>
> Sorry for the duplication of stuff, but I'd like to reiterate my points, to separate them from my implementation proposal, as that's just what it is: an implementation detail.
>
> These things are important to me:
>
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should happen purely on the basis of the character codes; no assumption at all should be made about the encoding, i.e. what the character code _means_
> - downcasting from wide to narrow may raise OverflowError if there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether narrow or wide
>
> The above points seem to have the following consequences:
>
> - the 'u' in \uXXXX notation no longer makes much sense, since it is not necessary for the character to be a Unicode code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer necessary: if a string literal contains a character > 255, the string should automatically become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if this is best done through an extra argument for encode() or whether it should be a new method, e.g. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide string, as in aNarrowString.encode("UCS-2") and s.transcode("Mac-Roman", "UCS-2").
One other pleasant consequence:

- String comparisons work character by character, even if the representations of those characters have different widths.
> My proposal to extend the "old" string type to be able to contain wide strings is of course largely unrelated to all this. Yet it may provide some additional C compatibility (especially now that silent conversion to UTF-8 is out) as well as a workaround for the str()-having-to-return-a-narrow-string bottleneck.
Toby Dickenson <tdickenson@geminidataloggers.com>
At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
> One other pleasant consequence:
>
> - String comparisons work character by character, even if the representations of those characters have different widths.
Exactly. By saying "(wide) strings are not tied to Unicode", the question of whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"...

Just
Just van Rossum writes:
> At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote:
> > One other pleasant consequence:
> > - String comparisons work character by character, even if the representations of those characters have different widths.
> Exactly. By saying "(wide) strings are not tied to Unicode", the question of whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"...
Wait a second. There is nothing about Unicode that would prevent you from defining string equality as byte-level equality. This strikes me as the wrong way to deal with the complex collation issues of Unicode.

It seems to me that by default wide strings compare at the byte level (i.e., '==' is a byte-level comparison). If you want a normalized comparison, then you make an explicit function call for that. This is no different from comparing strings in a case-sensitive vs. case-insensitive manner.

-tree

-- 
Tom Emerson  Basis Technology Corp.  Language Hacker  http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
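Tom's distinction between default code-unit equality and opt-in normalization can be sketched in present-day Python; normalized_equal() here is a made-up helper, not an API anyone proposed:

    import unicodedata

    def normalized_equal(a, b, form="NFC"):
        # Opt-in normalized comparison, in contrast to the default
        # ==, which compares code unit by code unit.
        return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

    # U+00E9 (precomposed e-acute) vs. "e" followed by U+0301
    # COMBINING ACUTE ACCENT:
    assert "\u00e9" != "e\u0301"                  # default: unequal
    assert normalized_equal("\u00e9", "e\u0301")  # normalized: equal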
[Me]
> Exactly. By saying "(wide) strings are not tied to Unicode", the question of whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"...
[Tom Emerson]
> Wait a second.
>
> There is nothing about Unicode that would prevent you from defining string equality as byte-level equality.
Agreed.
> This strikes me as the wrong way to deal with the complex collation issues of Unicode.
All I was trying to say was that by looking at it this way, it is even more obvious that the builtin comparison should not deal with Unicode sorting & collation issues. It seems you're saying the exact same thing:
> It seems to me that by default wide strings compare at the byte level (i.e., '==' is a byte-level comparison). If you want a normalized comparison, then you make an explicit function call for that.
Exactly.
> This is no different from comparing strings in a case-sensitive vs. case-insensitive manner.
Good point. All this taken together still means to me that comparisons between wide and narrow strings should take place at the character level, which implies that coercion from narrow to wide is done at the character level, without looking at the encoding. (Which in my book in turn still implies that as long as we're talking about Unicode, narrow strings are effectively Latin-1.)

Just
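The Latin-1 remark follows from the numbers: character codes 0 through 255 coincide exactly with the Latin-1 repertoire, which a two-line check in present-day Python can confirm (a sketch, not anything from the thread):

    # Promoting bytes purely by character code gives exactly the
    # Latin-1 decoding, for all 256 byte values.
    s = bytes(range(256))
    assert "".join(chr(b) for b in s) == s.decode("latin-1")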
Just van Rossum writes:
> Good point. All this taken together still means to me that comparisons between wide and narrow strings should take place at the character level, which implies that coercion from narrow to wide is done at the character level, without looking at the encoding. (Which in my book in turn still implies that as long as we're talking about Unicode, narrow strings are effectively Latin-1.)
Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide characters" are Unicode, but stored in UTF-8 encoding, then you lose.

Hmmmm... how often do you expect to compare narrow vs. wide strings, using default comparison (i.e. == or !=)? What if I'm using Latin-3 and use the byte comparison? I may very well have two strings (one narrow, one wide) that compare equal, even though they're not. Not exactly what I would expect.

-tree

[I'm flying from Seattle to Boston today, so eventually I will disappear for a while]

-- 
Tom Emerson  Basis Technology Corp.  Language Hacker  http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
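Tom's worry, made concrete under the per-character-code coercion being discussed (a hypothetical illustration; the choice of Latin-3 follows his example): in ISO 8859-3, byte 0xFD is U+016D, LATIN SMALL LETTER U WITH BREVE, yet promoted by code alone it would match U+00FD, y with acute.

    narrow = b"\xfd"                                # intended as Latin-3 text
    assert narrow.decode("iso-8859-3") == "\u016d"  # what the user means: u-breve
    assert chr(narrow[0]) == "\u00fd"               # what per-code promotion gives:
                                                    # a false match with y-acute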
Just van Rossum:
> Exactly. By saying "(wide) strings are not tied to Unicode", the question of whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"...
I personally like the idea of speaking of "wide strings" containing wide character codes instead of Unicode objects. Unfortunately there are many methods which need to interpret the content of strings according to some encoding knowledge: for example upper(), lower(), swapcase(), lstrip() and so on need to know to which class certain characters belong. This problem was already somewhat visible in 1.5.2, since these methods were available as library functions from the string module, and they worked with a global state maintained by the setlocale() C-library function. Quoting from the C library man pages:

"""
The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them. In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp s is one example.
"""

I guess applying upper() to a Chinese character will not make much sense. Now these former string module functions have been moved into the Python object core, so the current Python string and Unicode object API is somewhat "western centric". ;-)

At least Marc's implementation in 'unicodectype.c' contains the hard-coded assumption that wide strings really contain Unicode characters.

    print u"äöü".upper().encode("latin1")

shows "ÄÖÜ" independent of the locale setting. This makes sense. The output from

    print u"äöü".upper().encode()

however looks ugly here on my screen... UTF-8 ... blech: Ã ÃÃ

Regards and have a nice weekend, Peter
-- 
Peter Funk, Oldenburger Str. 86, D-27777 Ganderkesee, Germany, Fax: +49 4222 950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str. 8, D-28359 Bremen)
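The behaviour Peter describes is the defining property of database-driven case mapping: it consults the Unicode character database rather than setlocale(). Written out in present-day Python, where this is simply how str.upper() works:

    # Case mapping comes from the Unicode database, not the locale;
    # umlauts uppercase correctly in any locale.
    assert "\u00e4\u00f6\u00fc".upper() == "\u00c4\u00d6\u00dc"  # "äöü" -> "ÄÖÜ"
    # The sharp s from the man-page quote: no single-character
    # uppercase form exists, so it maps to "SS".
    assert "stra\u00dfe".upper() == "STRASSE"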
> (Boy, is it quiet here all of a sudden ;-)
Maybe because (according to one report on NPR here) 80% of the world's email systems are victimized by the ILOVEYOU virus? You & I are not affected because it's Windows-specific (a Visual Basic script; I got a copy mailed to me so I could have a good look :-). Note that there are already mutations, one of which pretends to be a joke.
> Sorry for the duplication of stuff, but I'd like to reiterate my points, to separate them from my implementation proposal, as that's just what it is: an implementation detail.
>
> These things are important to me:
>
> - get rid of the Unicode-ness of wide strings, in order to
> - make narrow and wide strings as similar as possible
> - implicit conversion between narrow and wide strings should happen purely on the basis of the character codes; no assumption at all should be made about the encoding, i.e. what the character code _means_
> - downcasting from wide to narrow may raise OverflowError if there are characters in the wide string that are > 255
> - str(s) should always return s if s is a string, whether narrow or wide
> - file objects need to be responsible for handling wide strings
> - the above two points should make it possible for
> - if no encoding is known, Unicode is the default, whether narrow or wide
>
> The above points seem to have the following consequences:
>
> - the 'u' in \uXXXX notation no longer makes much sense, since it is not necessary for the character to be a Unicode code point: it's just a 2-byte int. \wXXXX might be an option.
> - the u"" notation is no longer necessary: if a string literal contains a character > 255, the string should automatically become a wide string.
> - narrow strings should also have an encode() method.
> - the builtin unicode() function might be redundant if:
>   - it is possible to specify a source encoding. I'm not sure if this is best done through an extra argument for encode() or whether it should be a new method, e.g. transcode().
>   - s.encode() or s.transcode() are allowed to output a wide string, as in aNarrowString.encode("UCS-2") and s.transcode("Mac-Roman", "UCS-2").
>
> My proposal to extend the "old" string type to be able to contain wide strings is of course largely unrelated to all this. Yet it may provide some additional C compatibility (especially now that silent conversion to UTF-8 is out) as well as a workaround for the str()-having-to-return-a-narrow-string bottleneck.
I'm not so sure that this is enough. You seem to propose wide strings as vehicles for 16-bit values (and maybe later 32-bit values) apart from their encoding. We already have a data type for that (the array module). The Unicode type does a lot more than storing 16-bit values: it knows lots of encodings to and from Unicode, and it knows things like which characters are upper or lower or title case and how to map between them, which characters are word characters, and so on. All this is highly Unicode-specific and is part of what people ask for when they request Unicode support. (Example: Unicode has 405 characters classified as numeric, according to the isnumeric() method.)

And by the way, don't worry about the comparison. I'm not changing the default comparison (==, cmp()) for Unicode strings to be anything other than per 16-bit quantity. However, a Unicode object might in addition have a method to do normalization or whatever, as long as it's language-independent and strictly defined by the Unicode standard. Language-specific operations belong in separate modules.

--Guido van Rossum (home page: http://www.python.org/~guido/)
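Guido's figure reflects the Unicode database bundled at the time; the same census can be taken against whatever database a given Python ships with (a sketch, and the count will come out far larger than 405 today):

    # Count the code points whose character the bundled Unicode
    # database classifies as numeric.
    count = sum(chr(cp).isnumeric() for cp in range(0x110000))
    print(count)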
participants (6)

- Guido van Rossum
- Just van Rossum
- pf@artcom-gmbh.de
- Skip Montanaro
- Toby Dickenson
- Tom Emerson