
comments? (for obvious reasons, I'm especially interested in comments from people using non-ASCII characters on a daily basis...)
nobody?
Hi Frederik,

I think the problem you're trying to see is not real. My guideline for using Unicode in Python 1.6 will be that people should be very careful *not* to mix byte strings and Unicode strings. If you are processing text data obtained from a narrow-string source, you'll always have to make an explicit decision about what the encoding is. If you follow this guideline, I think the Unicode type of Python 1.6 will work just fine.

If you use Unicode text *a lot*, you may find the need to combine it with plain byte text in a more convenient way. That is the time to look at the implicit conversion stuff and see which of the functionality is useful. You then don't need to memorize *all* the rules for where implicit conversion would work - just the cases you care about.

That may all look difficult - it probably is. But then, it is no more difficult than tuples vs. lists: why does
[a,b,c] = (1,2,3)
work, and
[1,2]+(3,4)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
does not?

Regards,
Martin
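Martin's analogy is easy to verify; a quick sketch (modern Python, where the asymmetry still holds):

```python
# Unpacking a tuple into a list pattern works: the assignment target
# only needs to be a sequence pattern of matching length.
[a, b, c] = (1, 2, 3)
print(a, b, c)

# Concatenation, by contrast, requires both operands to be the same
# sequence type, so list + tuple raises TypeError.
try:
    [1, 2] + (3, 4)
except TypeError as exc:
    print("TypeError:", exc)
```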

"Martin v. Loewis" wrote:
comments? (for obvious reasons, I'm especially interested in comments from people using non-ASCII characters on a daily basis...)
nobody?
Hi Frederik,
I think the problem you're trying to see is not real. My guideline for using Unicode in Python 1.6 will be that people should be very careful *not* to mix byte strings and Unicode strings. If you are processing text data obtained from a narrow-string source, you'll always have to make an explicit decision about what the encoding is.
Right, that's the way to go :-)
If you follow this guideline, I think the Unicode type of Python 1.6 will work just fine.
If you use Unicode text *a lot*, you may find the need to combine it with plain byte text in a more convenient way. That is the time to look at the implicit conversion stuff and see which of the functionality is useful. You then don't need to memorize *all* the rules for where implicit conversion would work - just the cases you care about.
One had better not rely on the implicit conversions. These are really only there to ease porting applications to Unicode, and perhaps to make some existing APIs deal with Unicode without even knowing about it -- of course this will not always work, and those places will need some extra porting effort to make them useful w/r to Unicode. open() is one such candidate.

--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
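The "convert explicitly at the boundary" guideline can be sketched like this (modern spelling; the data is invented for illustration):

```python
# Explicit decisions at the boundary: bytes from a narrow-string
# source are decoded exactly once, with a stated encoding, and
# encoded again only on the way out.
raw = b"Hell\xc3\xb6 world"          # as read from a file or socket
text = raw.decode("utf-8")           # explicit choice: this data is UTF-8
assert text == "Hell\u00f6 world"

# All further processing happens on the character string; bytes are
# produced again, explicitly, only for output.
out = text.upper().encode("utf-8")
assert out == b"HELL\xc3\x96 WORLD"
```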

Martin v. Loewis wrote:
I think the problem you're trying to see is not real.
it is real. I won't repeat the arguments one more time; please read the W3C character model note and the python-dev archives, and read up on the unicode support in Tcl and Perl.
But then, it is not more difficult than tuples vs. lists
your examples always behave the same way, no matter what's in the containers. that's not true for MAL's design. </F>

it is real. I won't repeat the arguments one more time; please read the W3C character model note and the python-dev archives, and read up on the unicode support in Tcl and Perl.
I did read all that, so there really is no point in repeating the arguments - yet I'm still not convinced. One of the causes may be that all your commentary either

- discusses an alternative solution to the existing one, merely pointing out the difference, without any strong selling point, or
- explains small examples that work counter-intuitively.

I'd like to know whether you have an example of a real-world big-application problem that could not be conveniently implemented using the new Unicode API. For all the examples I can think of where Unicode would matter (XML processing, CORBA wstring mapping, internationalized messages and GUIs), it would work just fine. So while it may not be perfect, I think it is good enough. Perhaps my problem is that I'm not a perfectionist :-)

However, one remark from http://www.w3.org/TR/charmod/ reminded me of an earlier proposal by Bill Janssen. The Character Model says

# Because encoded text cannot be interpreted and processed without
# knowing the encoding, it is vitally important that the character
# encoding is known at all times and places where text is exchanged or
# stored.

While they were considering document encodings, I think this applies in general. Bill Janssen's proposal was that each (narrow) string should have an attribute .encoding. If set, you'll know what encoding a string has. If not set, it is a byte string, subject to the default encoding. I'd still like to see that as a feature in Python.

Regards,
Martin
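Janssen's .encoding idea can be sketched as follows (the class and method names here are invented for illustration, not taken from any proposal text):

```python
# A hypothetical sketch of Bill Janssen's proposal: a narrow string
# that carries an .encoding attribute. If set, you know how to decode
# it; if None, it is a plain byte string under the default rules.
class TaggedBytes(bytes):
    def __new__(cls, data, encoding=None):
        obj = super().__new__(cls, data)
        obj.encoding = encoding
        return obj

    def as_text(self, default="ascii"):
        # Decode with the attached encoding when known, else the default.
        return self.decode(self.encoding or default)

s = TaggedBytes(b"gr\xc3\xbcn", encoding="utf-8")
assert s.as_text() == "gr\u00fcn"
assert TaggedBytes(b"plain").as_text() == "plain"
```

The point of the sketch is only that the decoding decision travels with the data instead of being a process-wide default.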

Martin v. Loewis wrote:
it is real. I won't repeat the arguments one more time; please read the W3C character model note and the python-dev archives, and read up on the unicode support in Tcl and Perl.
I did read all that, so there really is no point in repeating the arguments - yet I'm still not convinced. One of the causes may be that all your commentary either
- discusses an alternative solution to the existing one, merely pointing out the difference, without any strong selling point, or
- explains small examples that work counter-intuitively.
umm. I could have sworn that getting rid of counter-intuitive behaviour was rather important in python. maybe we're using the language in radically different ways?
I'd like to know whether you have an example of a real-world big-application problem that could not be conveniently implemented using the new Unicode API. For all the examples I can think of where Unicode would matter (XML processing, CORBA wstring mapping, internationalized messages and GUIs), it would work just fine.
of course I can kludge my way around the flaws in MAL's design, but why should I have to do that? it's broken. fixing it is easy.
Perhaps my problem is that I'm not a perfectionist :-)
perfectionist or not, I only want Python's Unicode support to be as intuitive as anything else in Python. as it stands right now, Perl and Tcl's Unicode support is intuitive. Python's not.

(it also backs us into a corner -- once you mess this one up, you cannot fix it in Py3K without breaking lots of code. that's really bad).

in contrast, Guido's compromise proposal allows us to do this the right way in 1.7/Py3K (i.e. teach python about source code encodings, system api encodings, and stream i/o encodings).

btw, I thought we'd all agreed on GvR's solution for 1.6? what did I miss?
So while it may not be perfect, I think it is good enough.
so tell me, if "good enough" is what we're aiming at, why isn't my counter-proposal good enough? if nothing else, it's much easier to document... </F>

in contrast, Guido's compromise proposal allows us to do this the right way in 1.7/Py3K (i.e. teach python about source code encodings, system api encodings, and stream i/o encodings).
btw, I thought we'd all agreed on GvR's solution for 1.6?
what did I miss?
Nothing. We are going to do that (my "ASCII" proposal). I'm just waiting for the final SRE code first.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Fredrik Lundh writes:
perfectionist or not, I only want Python's Unicode support to be as intuitive as anything else in Python. as it stands right now, Perl and Tcl's Unicode support is intuitive. Python's not.
I don't know about Tcl, but Perl 5.6's Unicode support is still considered experimental. Consider the following excerpts, for example. (And Fredrik's right; we shouldn't release a 1.6 with broken support, or we'll pay for it for *years*... But if GvR's ASCII proposal is considered OK, then great!)

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-04/msg00084.html:
Ah, yes. Unicode. But after two years of work, the one thing that users will want to do - open and read Unicode data - is still not there. Who cares if stuff's now represented internally in Unicode if they can't read the files they need to.
This is a "big" (as in "huge") disappointment for me as well. I hope we'll do better next time.

========================
http://www.egroups.com/message/perl5-porters/67906:

But given that interpretation, I'm amazed at how many operators seem to be broken with UTF8. It certainly supports Ilya's contention of "pre-alpha". Here's another example:

  DB<1> x (256.255.254 . 257.258.259) eq (256.255.254.257.258.259)
  0  ''
  DB<2>

Rummaging with Devel::Peek shows that in this case, it's the fault of the . operator. And eq is broken as well:

  DB<11> x "\x{100}" eq "\xc4\x80"
  0  1
  DB<12>

Aaaaargh!

========================
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-03/msg00971.html:

A couple problems here... passage through a hash key removes the UTF8 flag (as might be expected). Even if keys were to attempt to restore the UTF8 flag (a la Convert::UTF::decode_utf8) or hash keys were real SVs, what then do you do with $h{"\304\254"} and the like?

Suggestions:

1. Leave things as they are, but document UTF8 hash keys as experimental and subject to change.

or

2. When under use bytes, leave things as they are. Otherwise, have keys turn on the utf8 flag if appropriate. Also give a warning when using a hash key like "\304\254", since keys will in effect return a different string that just happens to have the same internal encoding.

========================
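The second debugger excerpt is exactly the byte-vs-character confusion this thread is about; the same comparison in Python terms (a modern sketch):

```python
# Perl's "\x{100}" is the character U+0100; "\xc4\x80" is its two-byte
# UTF-8 encoding. A character string and its encoded form are distinct
# values and should never compare equal without an explicit conversion.
char = "\u0100"
encoded = b"\xc4\x80"

assert char != encoded                    # text vs bytes: not equal
assert char.encode("utf-8") == encoded    # equal only after an explicit encode
assert encoded.decode("utf-8") == char    # ...or an explicit decode
```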

perfectionist or not, I only want Python's Unicode support to be as intuitive as anything else in Python. as it stands right now, Perl and Tcl's Unicode support is intuitive. Python's not.
I haven't much experience with Perl, but I don't think Tcl is intuitive in this area. I really think that they got it all wrong. They use the string type for "plain bytes", just as we do, but then have the notion of "correct" and "incorrect" UTF-8 (i.e. strings with violations of the encoding rule). For a "plain bytes" string, the following might happen:

- the string is scanned for non-UTF-8 characters
- if any are found, the string is converted into UTF-8, essentially treating the original string as Latin-1
- it then continues to use the UTF-8 "version" of the original string, and converts it back on demand

Maybe I got something wrong, but the Unicode support in Tcl makes me worry very much.
btw, I thought we'd all agreed on GvR's solution for 1.6?
what did I miss?
I like the 'only ASCII is converted' approach very much, so I'm not objecting to that solution - just as I wasn't objecting to the previous one.
so tell me, if "good enough" is what we're aiming at, why isn't my counter-proposal good enough?
Do you mean the one in

http://www.python.org/pipermail/python-dev/2000-April/005218.html

which I suppose is the same one as the "java-like approach"? AFAICT, all it does is to change the default encoding from UTF-8 to Latin-1. I can't follow why this should be *better*, but it would certainly be as good...

In comparison, restricting the "character" interpretation of the string type (in terms of your proposal) to 7-bit characters has the advantage that it is less error-prone, as Guido points out.

Regards,
Martin
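The point about the 7-bit restriction being less error-prone can be illustrated like this (a sketch in modern terms):

```python
# Why ASCII-only implicit conversion is less error-prone: a 7-bit byte
# string means the same thing under ASCII, Latin-1 and UTF-8, so no
# silent misinterpretation is possible...
safe = b"hello"
assert safe.decode("ascii") == safe.decode("latin-1") == safe.decode("utf-8")

# ...while a byte above 127 is ambiguous (Latin-1 a-umlaut? half of a
# UTF-8 sequence?), and the ASCII codec refuses to guess.
try:
    b"\xe4".decode("ascii")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```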

Martin v. Loewis wrote:
perfectionist or not, I only want Python's Unicode support to be as intuitive as anything else in Python. as it stands right now, Perl and Tcl's Unicode support is intuitive. Python's not.
I haven't much experience with Perl, but I don't think Tcl is intuitive in this area. I really think that they got it all wrong.
"all wrong"? Tcl works hard to maintain the characters are characters model (implementation level 2), just like Perl. the length of a string is always the number of characters, slicing works as it should, the internal representation is as efficient as you can make it. but yes, they have a somewhat dubious autoconversion mechanism in there. if something isn't valid UTF-8, it's assumed to be Latin-1. scary, huh? not really, if you step back and look at how UTF-8 was designed. quoting from RFC 2279: "UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e. the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length." besides, their design is based on the plan 9 rune stuff. that code was written by the inventors of UTF-8, who has this to say: "There is little a rune-oriented program can do when given bad data except exit, which is unreasonable, or carry on. Originally the conversion routines, described below, returned errors when given invalid UTF, but we found ourselves repeatedly checking for errors and ignoring them. We therefore decided to convert a bad sequence to a valid rune and continue processing. "This technique does have the unfortunate property that con- verting invalid UTF byte strings in and out of runes does not preserve the input, but this circumstance only occurs when non-textual input is given to a textual program." so let's see: they aimed for a high level of unicode support (layer 2, stream encodings, and system api encodings, etc), they've based their design on work by the inventors of UTF-8, they have several years of experience using their implementation in real life, and you seriously claim that they got it "all wrong"? that's weird.
AFAICT, all it does is to change the default encoding from UTF-8 to Latin-1.
now you're using "all" in that strange way again... check the archives for the full story (hint: a conceptual design model isn't the same thing as a C implementation)
I can't follow why this should be *better*, but it would certainly be as good... In comparison, restricting the "character" interpretation of the string type (in terms of your proposal) to 7-bit characters has the advantage that it is less error-prone, as Guido points out.
the main reason for that is that Python 1.6 doesn't have any way to specify source encodings. add that, so you no longer have to guess what a string *literal* really is, and that problem goes away. but that's something for 1.7. </F>

On Wed, 17 May 2000, Fredrik Lundh wrote:
the main reason for that is that Python 1.6 doesn't have any way to specify source encodings. add that, so you no longer have to guess what a string *literal* really is, and that problem goes away. but
You seem to be familiar with the Tcl work, so I'll ask you this question: Does Tcl have a way to specify source encoding? I'm not aware of it, but I've only had time to follow the Tcl world very lightly these past few years. ;)

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>

Fred L. Drake wrote:
On Wed, 17 May 2000, Fredrik Lundh wrote:
the main reason for that is that Python 1.6 doesn't have any way to specify source encodings. add that, so you no longer have to guess what a string *literal* really is, and that problem goes away. but
You seem to be familiar with the Tcl work, so I'll ask you this question: Does Tcl have a way to specify source encoding?
Tcl has a system encoding (which is used when passing strings through system APIs), and file/channel-specific encodings. (for info on how they initialize the system encoding, see earlier posts).

unfortunately, they're using the system encoding also for source code. for portable code, they recommend sticking to ASCII or using "bootstrap scripts", e.g:

    set fd [open "app.tcl" r]
    fconfigure $fd -encoding euc-jp
    set jpscript [read $fd]
    close $fd
    eval $jpscript

we can surely do better in 1.7... </F>
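A rough Python counterpart of that bootstrap idiom, for comparison (the file, its Latin-1 payload, and the encoding choice are invented for illustration):

```python
import os
import tempfile

# Store a script in some known encoding, read it back with that
# encoding stated explicitly, then execute the decoded text -- the
# same "configure the channel, then eval" pattern as the Tcl snippet.
payload = 'greeting = "\u00e4"'
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write(payload.encode("latin-1"))     # on disk the script is Latin-1
    path = f.name

with open(path, encoding="latin-1") as f:  # explicit source encoding
    script = f.read()

namespace = {}
exec(compile(script, path, "exec"), namespace)
os.unlink(path)
assert namespace["greeting"] == "\u00e4"
```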

You seem to be familiar with the Tcl work, so I'll ask you this question: Does Tcl have a way to specify source encoding? I'm not aware of it, but I've only had time to follow the Tcl world very lightly these past few years. ;)
To my knowledge, no. Tcl (at least 8.3) supports the \u notation for Unicode escapes, and treats all other source code as Latin-1. encoding(n) says

# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.

Regards,
Martin

Martin v. Loewis wrote:
To my knowledge, no. Tcl (at least 8.3) supports the \u notation for Unicode escapes, and treats all other source code as Latin-1. encoding(n) says
# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.
as far as I can tell from digging through the sources, the "source" command uses the system encoding. and from the look of it, it's not always iso-latin-1... </F>

# However, because the source command always reads files using the
# ISO8859-1 encoding, Tcl will treat each byte in the file as a
# separate character that maps to the 00 page in Unicode.
as far as I can tell from digging through the sources, the "source" command uses the system encoding. and from the look of it, it's not always iso-latin-1...
Indeed, this appears to be an error in the documentation. sourcing

    encoding convertto utf-8 ä

has an outcome depending on the system encoding; just try koi8-r to see the difference.

Regards,
Martin

Fredrik Lundh wrote:
of course I can kludge my way around the flaws in MAL's design, but why should I have to do that? it's broken. fixing it is easy.
Look Fredrik, it's not *my* design. All this was discussed in public and in several rounds late last year. If someone made a mistake and "broke" anything, then we all did... I still don't think so, but that's my personal opinion.

--

Now to get back to some non-flammable content: Has anyone played around with the latest sys.set_string_encoding() patches? I would really like to know what you think.

The idea behind it is that you can define what the Unicode implementation is to expect as encoding when it sees an 8-bit string. The encoding is used for coercion, str(unicode) and printing. It is currently *not* used for the "s" parser marker and hash values (mainly due to internal issues). See my patch comments for details.

--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

"MAL" == M -A Lemburg <mal@lemburg.com> writes:
MAL> Fredrik Lundh wrote:
of course I can kludge my way around the flaws in MAL's design, but why should I have to do that? it's broken. fixing it is easy.
MAL> Look Fredrik, it's not *my* design. All this was discussed in
MAL> public and in several rounds late last year. If someone made a
MAL> mistake and "broke" anything, then we all did... I still don't
MAL> think so, but that's my personal opinion.

I find it's best to avoid referring to a design as "so-and-so's design" unless you've got something specifically complimentary to say. Using the person's name in combination with some criticism of the design tends to produce a defensive reaction. Perhaps it would help make this discussion less contentious.

Jeremy

"Martin v. Loewis" wrote:
...
I'd like to know whether you have an example of a real-world big-application problem that could not be conveniently implemented using the new Unicode API. For all the examples I can think of where Unicode would matter (XML processing, CORBA wstring mapping, internationalized messages and GUIs), it would work just fine.
Of course an implicit behavior can never get in the way of big-application building. The question is about the principle of least surprise, and simplicity of explanation and understanding.

I'm-told-that-even-Perl-and-C++-can-be-used-for-big-apps-ly yrs

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
"Hardly anything more unwelcome can befall a scientific writer than
having the foundations of his edifice shaken after the work is
finished. I have been placed in this position by a letter from Mr.
Bertrand Russell..." - Frege, Appendix of Basic Laws of Arithmetic
(of Russell's Paradox)

"Martin v. Loewis" wrote:
...
I think the problem you're trying to see is not real. My guideline for using Unicode in Python 1.6 will be that people should be very careful *not* to mix byte strings and Unicode strings.
I think that as soon as we are adding admonitions to the documentation that things "probably don't behave as you expect, so be careful", we have failed. Sometimes failure is unavoidable (e.g. floats do not act rationally -- deal with it). But let's not pretend that failure is success.
If you are processing text data, obtained from a narrow-string source, you'll always have to make an explicit decision what the encoding is.
Are Python literals a "narrow string source"? It seems blatantly clear to me that the "encoding" of Python literals should be determined at compile time, not runtime. Byte arrays from a file are different.
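A compile-time source-encoding declaration is, in fact, what Python later standardized in PEP 263; the effect Paul argues for can be sketched with compile(), which honours the coding cookie in byte source:

```python
# The same source byte means different literals depending on the
# declared encoding -- the compiler decodes the file before it ever
# sees the literal. (This coding-cookie mechanism is the one Python
# eventually adopted in PEP 263.)
latin1_src = b"# -*- coding: latin-1 -*-\ns = '\xe4'\n"
ns1 = {}
exec(compile(latin1_src, "<src>", "exec"), ns1)
assert ns1["s"] == "\u00e4"        # 0xE4 read as Latin-1 a-umlaut

utf8_src = b"# -*- coding: utf-8 -*-\ns = '\xc3\xa4'\n"
ns2 = {}
exec(compile(utf8_src, "<src>", "exec"), ns2)
assert ns2["s"] == "\u00e4"        # different bytes, same character
```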
If you use Unicode text *a lot*, you may find the need to combine them with plain byte text in a more convenient way.
Unfortunately there will be many people with no interest in Unicode who will be dealing with it merely because that is the way APIs are going: XML APIs, Windows APIs, Tk, DCOM, SOAP, WebDAV, even some X/Unix APIs. Unicode is the new ASCII. I want to get a (Unicode) string from an XML document or SOAP request, compare it to a string literal, and never think about Unicode.
... why does
[a,b,c] = (1,2,3)
work, and
[1,2]+(3,4) ...
does not?
I dunno. If there is no good reason, then it is a bug that should be fixed. The __radd__ operator on lists should iterate over its argument as a sequence.

As Fredrik points out, though, this situation is not as dangerous as auto-conversions because a) the latter could be loosened later without breaking code, and b) the operation always fails. It never does the wrong thing silently, and it never succeeds for only some inputs.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
participants (9)

- Andrew M. Kuchling
- Fred L. Drake
- Fredrik Lundh
- Fredrik Lundh
- Guido van Rossum
- Jeremy Hylton
- M.-A. Lemburg
- Martin v. Loewis
- Paul Prescod