default encoding for 8-bit string literals (was Unicode and comparisons)

Hi! [me]:
From my POV (using ISO Latin-1 all the time) it would be "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a Python source file so that (u'äöü' == 'äöü') == 1. This is what I see on *my* screen, whether there is a 'u' in front of the string or not.
M.-A. Lemburg:
u"äöü" is being interpreted as Latin-1. The problem is the string 'äöü' to the right: during coercion this string is being interpreted as UTF-8 and this causes the failure.
You could say: ok, all my strings use Latin-1, but that would introduce other problems... esp. when you take different modules with different encoding assumptions and try to integrate them into an application.
Okay. This wouldn't occur here, but we have to deal with this possibility.
In dist/src/Misc/unicode.txt you wrote:
Note that you should provide some hint to the encoding you used to write your programs as a pragma line in one of the first few comment lines of the source file (e.g. '# source file encoding: latin-1').
[me]:
The upcoming 1.6 documentation should probably clarify whether the interpreter pays attention to "pragma"s or not. This is otherwise misleading.
This "pragma" is nothing more than a hint for the source code reader to switch his viewing encoding. The interpreter doesn't treat the file differently. In fact, Python source code is supposed to tbe 7-bit ASCII !
Sigh. In our company we use 'german' as our master language, so we have string literals containing iso-8859-1 umlauts all over the place. Okay, as long as we don't mix them with Unicode objects, this doesn't hurt anybody.

What I would love to see would be a well-defined way to tell the interpreter to use 'latin-1' as the default encoding instead of 'UTF-8' when dealing with string literals from our modules.

The tokenizer in Python 1.6 already contains smart logic to get the size of TABs right (pasting from tokenizer.c):

    /* Skip comment, while looking for tab-setting magic */
    if (c == '#') {
        static char *tabforms[] = {
            "tab-width:",      /* Emacs */
            ":tabstop=",       /* vim, full form */
            ":ts=",            /* vim, abbreviated form */
            "set tabsize=",    /* will vi never die? */
            /* more templates can be added here to support other editors */
        };
        ...

It wouldn't be too hard to add something there to recognize other "pragma" comments, for example:

    #content-transfer-encoding: iso-8859-1

But what to do with it? Maybe adding a default encoding to every string object? Is this bloat? Just an idea.

Regards, Peter
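
(A minimal sketch of the failure that opened this thread, written in modern Python 3 purely for illustration -- the thread itself concerns Python 1.6. The Latin-1 bytes for 'äöü' are not valid UTF-8, so interpreting them as UTF-8, as the 1.6 implicit coercion did, blows up.)

    # The byte string as it appears in a Latin-1 source file.
    latin1_bytes = 'äöü'.encode('latin-1')   # b'\xe4\xf6\xfc'
    try:
        # What the implicit string/Unicode coercion effectively attempted.
        latin1_bytes.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('coercion fails:', exc)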

Peter Funk wrote:
Hi!
[me]:
From my POV (using ISO Latin-1 all the time) it would be "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a Python source file so that (u'äöü' == 'äöü') == 1. This is what I see on *my* screen, whether there is a 'u' in front of the string or not.
M.-A. Lemburg:
u"äöü" is being interpreted as Latin-1. The problem is the string 'äöü' to the right: during coercion this string is being interpreted as UTF-8 and this causes the failure.
You could say: ok, all my strings use Latin-1, but that would introduce other problems... esp. when you take different modules with different encoding assumptions and try to integrate them into an application.
Okay. This wouldn't occur here, but we have to deal with this possibility.
In dist/src/Misc/unicode.txt you wrote:
Note that you should provide some hint to the encoding you used to write your programs as a pragma line in one of the first few comment lines of the source file (e.g. '# source file encoding: latin-1').
[me]:
The upcoming 1.6 documentation should probably clarify whether the interpreter pays attention to "pragma"s or not. This is otherwise misleading.
This "pragma" is nothing more than a hint for the source code reader to switch his viewing encoding. The interpreter doesn't treat the file differently. In fact, Python source code is supposed to tbe 7-bit ASCII !
Sigh. In our company we use 'german' as our master language, so we have string literals containing iso-8859-1 umlauts all over the place. Okay, as long as we don't mix them with Unicode objects, this doesn't hurt anybody.
What I would love to see would be a well-defined way to tell the interpreter to use 'latin-1' as the default encoding instead of 'UTF-8' when dealing with string literals from our modules.
The tokenizer in Python 1.6 already contains smart logic to get the size of TABs right (pasting from tokenizer.c):
    /* Skip comment, while looking for tab-setting magic */
    if (c == '#') {
        static char *tabforms[] = {
            "tab-width:",      /* Emacs */
            ":tabstop=",       /* vim, full form */
            ":ts=",            /* vim, abbreviated form */
            "set tabsize=",    /* will vi never die? */
            /* more templates can be added here to support other editors */
        };
        ...
It wouldn't be too hard to add something there to recognize other "pragma" comments, for example:

    #content-transfer-encoding: iso-8859-1

But what to do with it? Maybe adding a default encoding to every string object? Is this bloat? Just an idea.
As I have already indicated above, this would only solve the problem of string literals in Python source code. It would not, however, solve the problem with strings in general, since these can be built dynamically or from user input.

The only way I can see for #pragma to work here is by auto-converting all static strings in the source code to Unicode, and that would probably break more code than do good. Even worse, writing 'abc' in such a program would essentially mean the same thing as u'abc'.

I'd suggest turning your Latin-1 strings into Unicode... this will hurt at first, but in the long run, you win.

-- Marc-Andre Lemburg
Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
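
(A minimal sketch, in modern Python 3 for illustration, of the approach Marc-Andre recommends: keep text as Unicode inside the program and encode only at the I/O boundary.)

    label = 'äöü'                    # Unicode everywhere inside the program
    wire = label.encode('latin-1')   # encode at the boundary: b'\xe4\xf6\xfc'
    assert wire.decode('latin-1') == label   # round-trips losslessly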

Sigh. In our company we use 'german' as our master language, so we have string literals containing iso-8859-1 umlauts all over the place. Okay, as long as we don't mix them with Unicode objects, this doesn't hurt anybody.
What I would love to see would be a well-defined way to tell the interpreter to use 'latin-1' as the default encoding instead of 'UTF-8' when dealing with string literals from our modules.
It would be better if this was supported for u"..." literals, so that it was taken care of at the source code level completely. The running program shouldn't have to worry about what encoding its source code was!

For 8-bit literals, this would mean that if you had source code using Latin-1, the literals would be translated from Latin-1 to UTF-8 by the code generator. This would mean that len('ç') would return 2. I'm not sure this is a great idea -- but then I'm not sure that using Latin-1 in source code is a great idea either.
The tokenizer in Python 1.6 already contains smart logic to get the size of TABs right (pasting from tokenizer.c):
    /* Skip comment, while looking for tab-setting magic */
    if (c == '#') {
        static char *tabforms[] = {
            "tab-width:",      /* Emacs */
            ":tabstop=",       /* vim, full form */
            ":ts=",            /* vim, abbreviated form */
            "set tabsize=",    /* will vi never die? */
            /* more templates can be added here to support other editors */
        };
        ...
It wouldn't be too hard to add something there to recognize other "pragma" comments, for example:

    #content-transfer-encoding: iso-8859-1

But what to do with it? Maybe adding a default encoding to every string object? Is this bloat? Just an idea.
Before we go any further, we should design pragmas. The current approach is inefficient and only designed to accommodate editor-specific magical commands. I say it's a Python 1.7 issue.

--Guido van Rossum (home page: http://www.python.org/~guido/)
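
(The arithmetic behind the len('ç') remark above, shown in modern Python 3: re-encoding a Latin-1 literal as UTF-8 turns one byte into two.)

    c = 'ç'                           # U+00E7
    print(len(c.encode('latin-1')))   # 1 -- the single byte in a Latin-1 source file
    print(len(c.encode('utf-8')))     # 2 -- what the proposed code generator would emit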

Guido van Rossum wrote:
Sigh. In our company we use 'german' as our master language, so we have string literals containing iso-8859-1 umlauts all over the place. Okay, as long as we don't mix them with Unicode objects, this doesn't hurt anybody.
What I would love to see would be a well-defined way to tell the interpreter to use 'latin-1' as the default encoding instead of 'UTF-8' when dealing with string literals from our modules.
It would be better if this was supported for u"..." literals, so that it was taken care of at the source code level completely. The running program shouldn't have to worry about what encoding its source code was!
u"..." currently interprets the characters it finds as Latin-1 (this is by design, since the first 256 Unicode ordinals map to the Latin-1 characters).
For 8-bit literals, this would mean that if you had source code using Latin-1, the literals would be translated from Latin-1 to UTF-8 by the code generator. This would mean that len('ç') would return 2. I'm not sure this is a great idea -- but then I'm not sure that using Latin-1 in source code is a great idea either.
The tokenizer in Python 1.6 already contains smart logic to get the size of TABs right (pasting from tokenizer.c): ...
Before we go any further we should design pragmas. The current approach is inefficient and only designed to accommodate editor-specific magical commands.
I say it's a Python 1.7 issue.
Good idea :-)

-- Marc-Andre Lemburg
Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
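
(A quick check of the design point above, in modern Python 3: the first 256 Unicode code points coincide with Latin-1, which is why u"..." could take Latin-1 bytes at face value.)

    for ch in 'äöü':
        # The Unicode ordinal equals the single Latin-1 byte value.
        assert ord(ch) == ch.encode('latin-1')[0]
    print('Latin-1 ordinals match Unicode ordinals')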

u"..." currently interprets the characters it finds as Latin-1 (this is by design, since the first 256 Unicode ordinals map to the Latin-1 characters).
Nice, except that now we seem to be ambiguous about the source character encoding: it's Latin-1 for Unicode strings and UTF-8 for 8-bit strings...!

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum:
u"..." currently interprets the characters it finds as Latin-1 (this is by design, since the first 256 Unicode ordinals map to the Latin-1 characters).
Nice, except that now we seem to be ambiguous about the source character encoding: it's Latin-1 for Unicode strings and UTF-8 for 8-bit strings...!
This is a little bit difficult to understand and will make the task of writing the upcoming 1.6 documentation even more challenging. ;-) But I agree: changing this should go into 1.7.

BTW: Our umlaut strings are sooner or later passed through one central function. All modules usually contain something like this:

    try:
        import fintl
        _ = fintl.gettext
    except ImportError:
        def _(msg):
            return msg

    ...
    MenuEntry(_("Öffnen"), self.open),
    MenuEntry(_("Schließen"), self.close)
    ...

You get the picture. It would be easy to change the implementation of 'fintl.gettext' to coerce the resulting strings into Unicode or do whatever is required. But we currently use GNU gettext to produce the message files that are translated into English, French and Italian. AFAIK GNU gettext handles only 8-bit strings anyway. Our customers in the Far East currently live with the English version, but this has more financial than technical reasons.

Regards, Peter
-- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax: +49 4222950260, office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
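
(A hypothetical sketch, in modern Python 3, of the single choke point Peter describes: every message passes through _(), so a Latin-1-to-Unicode conversion could live in exactly one place. The CATALOG dict is an illustrative stand-in for a GNU gettext message catalog.)

    CATALOG = {b'\xd6ffnen': 'Open', b'Schlie\xdfen': 'Close'}

    def _(msg):
        # Assume the 8-bit catalogs produced by GNU gettext are Latin-1.
        translated = CATALOG.get(msg)
        return translated if translated is not None else msg.decode('latin-1')

    print(_(b'\xd6ffnen'))   # 'Open'
    print(_(b'\xdcber'))     # not in the catalog: decoded as Latin-1, 'Über'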

Guido van Rossum wrote:
u"..." currently interprets the characters it finds as Latin-1 (this is by design, since the first 256 Unicode ordinals map to the Latin-1 characters).
Nice, except that now we seem to be ambiguous about the source character encoding: it's Latin-1 for Unicode strings and UTF-8 for 8-bit strings...!
Noo... there is no definition for non-ASCII 8-bit strings in Python source code using the ordinal range 128-255. If you were to define Latin-1 as the source code encoding, then we would have to change auto-coercion to make a Latin-1 assumption instead, but... I see the picture: people are getting pretty confused about what is going on.

If you write u"xyz" then the ordinals of those characters are taken and stored directly as Unicode characters. If you live in a Latin-1 world, then you happen to be lucky: the Unicode characters match your input. If not, some totally different characters are likely to show up if the string were written to a file and displayed using a Unicode-aware editor.

The same will happen to your normal 8-bit string literals. Nothing unusual so far... if you use Latin-1 strings and write them to a file, you get Latin-1. If you happen to program on DOS, you'll get the DOS ANSI encoding for the German umlauts.

Now the key point where all this started was that u'ä' in 'äöü' will raise an error due to 'äöü' being *interpreted* as UTF-8 -- this doesn't mean that 'äöü' will be interpreted as UTF-8 elsewhere in your application. The UTF-8 assumption had to be made in order to get the two worlds to interoperate. We could just as well have chosen Latin-1, but then people currently using, say, a Russian encoding would get upset for the same reason.

One way or another somebody is not going to like whatever we choose, I'm afraid... the simplest solution is to use Unicode for all strings which contain non-ASCII characters and then call .encode() as necessary.

-- Marc-Andre Lemburg
Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

[MAL]
u"..." currently interprets the characters it finds as Latin-1 (this is by design, since the first 256 Unicode ordinals map to the Latin-1 characters).
[GvR]
Nice, except that now we seem to be ambiguous about the source character encoding: it's Latin-1 for Unicode strings and UTF-8 for 8-bit strings...!
[MAL]
Noo... there is no definition for non-ASCII 8-bit strings in Python source code using the ordinal range 128-255. If you were to define Latin-1 as the source code encoding, then we would have to change auto-coercion to make a Latin-1 assumption instead, but... I see the picture: people are getting pretty confused about what is going on.
If you write u"xyz" then the ordinals of those characters are taken and stored directly as Unicode characters. If you live in a Latin-1 world, then you happen to be lucky: the Unicode characters match your input. If not, some totally different characters are likely to show up if the string were written to a file and displayed using a Unicode-aware editor.
The same will happen to your normal 8-bit string literals. Nothing unusual so far... if you use Latin-1 strings and write them to a file, you get Latin-1. If you happen to program on DOS, you'll get the DOS ANSI encoding for the German umlauts.
Now the key point where all this started was that u'ä' in 'äöü' will raise an error due to 'äöü' being *interpreted* as UTF-8 -- this doesn't mean that 'äöü' will be interpreted as UTF-8 elsewhere in your application.
The UTF-8 assumption had to be made in order to get the two worlds to interoperate. We could just as well have chosen Latin-1, but then people currently using, say, a Russian encoding would get upset for the same reason.
One way or another somebody is not going to like whatever we choose, I'm afraid... the simplest solution is to use Unicode for all strings which contain non-ASCII characters and then call .encode() as necessary.
I have a different view on this (except that I agree that it's pretty confusing :-).

In my definition of a "source character encoding", string literals, whether Unicode or 8-bit strings, are translated from the source encoding to the corresponding run-time values. If I had a C compiler that read its source in EBCDIC but cross-compiled to a machine that used ASCII, I would expect that 'a' in the source would have the integer value 97 (ASCII 'a'), regardless of the EBCDIC value for 'a'.

If I type a non-ASCII Latin-1 character in a Unicode literal, it generates the corresponding Unicode character. This means to me that the source character encoding is Latin-1. But when I type the same character in an 8-bit character literal, that literal is interpreted as UTF-8 (e.g. when converting to Unicode using the default conversions). Thus, even though you can do whatever you want with 8-bit literals in your program, the most defensible view is that they are UTF-8 encoded.

I would be much happier if all source code was encoded in the same encoding, because otherwise there's no good way to view such code in a general Unicode-aware text viewer! My preference would be to always use UTF-8. This would mean no change for 8-bit literals, but a big change for Unicode literals... And a break with everyone who's currently typing Latin-1 source code and using strings as Latin-1. (Or Latin-7, or whatever.)

My next preference would be a pragma to define the source encoding, but that's a 1.7 issue. Maybe the whole thing is... :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)
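
(A small illustration of the ambiguity Guido describes, in modern Python 3: the same on-screen character denotes different run-time bytes depending on which source encoding the reader assumes.)

    ch = 'ä'
    print(list(ch.encode('latin-1')))   # [228]      -- the Latin-1 reading, one byte
    print(list(ch.encode('utf-8')))     # [195, 164] -- the UTF-8 reading, two bytes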

M.-A. Lemburg wrote:
The UTF-8 assumption had to be made in order to get the two worlds to interoperate. We could have just as well chosen Latin-1, but then people currently using say a Russian encoding would get upset for the same reason.
One way or another somebody is not going to like whatever we choose, I'm afraid... the simplest solution is to use Unicode for all strings which contain non-ASCII characters and then call .encode() as necessary.
just a brief heads-up:

I've been playing with this a bit, and my current view is that the current unicode design is horridly broken when it comes to mixing 8-bit and 16-bit strings. basically, if you pass a unicode string to a function slicing and dicing 8-bit strings, it will probably not work. and you will probably not understand why.

I'm working on a proposal that I think will make things simpler and less magic, and far easier to understand. to appear on sunday.

</F>

Fredrik Lundh wrote:
M.-A. Lemburg wrote:
The UTF-8 assumption had to be made in order to get the two worlds to interoperate. We could have just as well chosen Latin-1, but then people currently using say a Russian encoding would get upset for the same reason.
One way or another somebody is not going to like whatever we choose, I'm afraid... the simplest solution is to use Unicode for all strings which contain non-ASCII characters and then call .encode() as necessary.
just a brief heads-up:
I've been playing with this a bit, and my current view is that the current unicode design is horridly broken when it comes to mixing 8-bit and 16-bit strings.
Why "horribly" ? String and Unicode mix pretty well, IMHO. The magic auto-conversion of Unicode to UTF-8 in C APIs using "s" or "s#" does not always do what the user expects, but it's still better than not having Unicode objects work with these APIs at all.
basically, if you pass a unicode string to a function slicing and dicing 8-bit strings, it will probably not work. and you will probably not understand why.
I'm working on a proposal that I think will make things simpler and less magic, and far easier to understand. to appear on sunday.
Looking forward to it,

-- Marc-Andre Lemburg
Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
participants (4)
- Fredrik Lundh
- Guido van Rossum
- M.-A. Lemburg
- pf@artcom-gmbh.de