Mailman 3 repr vs. str and locales again - Python-Dev

repr vs. str and locales again

Guido van Rossum

19 May 2000

8:36 p.m.

The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.

It needs a Py_CHARMASK() call but otherwise seems to be fine.

Anybody got an opinion on this? I'm +0. I would even be +0 on a similar patch for unicode strings (once the ASCII proposal is implemented).

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date: Fri, 19 May 2000 10:48:29 +0300
From: Jyrki Kuoppala
To: guido@python.org
Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in strings

I'm not sure if this exactly is a bug, ie. whether python 1.5.2 is supposed to support locales and 8-bit characters. However, on Linux Debian "unstable" distribution the diff below makes python 1.5.2 handle printable 8-bit characters as one would expect.

Problem description:

python doesn't properly print printable 8-bit characters for the current locale.

Details:

With no locale set, 8-bit characters in quoted strings print as backslash-escapes, which I guess is OK:

    $ unset LC_ALL
    $ python
    Python 1.5.2 (#0, Apr 3 2000, 14:46:48) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'k\344\344k')

But with a locale with a printable 'ä' character (octal 344) I get:

    $ export LC_ALL=fi_FI
    $ python
    Python 1.5.2 (#0, Apr 3 2000, 14:46:48) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'k\344\344k')

I should be getting (output from python patched with the enclosed patch):

    $ export LC_ALL=fi_FI
    $ python
    Python 1.5.2 (#0, May 18 2000, 14:43:46) [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> a=('foo','kääk')
    >>> print a
    ('foo', 'kääk')

This hits for example when Zope with squishdot weblog (squishdot 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles - strings with valid Latin1 characters get indexed as backslash-escaped octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in python-1.5.2/Objects/stringobject.c be fixed to use isprint() which takes the locale into account:

--- python-1.5.2/Objects/stringobject.c.orig Thu Oct 8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
         c = op->ob_sval[i];
         if (c == quote || c == '\\')
             fprintf(fp, "\\%c", c);
-        else if (c < ' ' || c >= 0177)
+        else if (! isprint (c))
             fprintf(fp, "\\%03o", c & 0377);
         else
             fputc(c, fp);
@@ -260,7 +260,7 @@
         c = op->ob_sval[i];
         if (c == quote || c == '\\')
             *p++ = '\\', *p++ = c;
-        else if (c < ' ' || c >= 0177) {
+        else if (! isprint (c)) {
             sprintf(p, "\\%03o", c & 0377);
             while (*p != '\0')
                 p++;

//Jyrki

------- End of Forwarded Message
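For readers who would rather not trace the C, the effect of the one-line change can be sketched in Python. This is an illustration only, not the patch itself: `quote_bytes`, `ascii_printable` and `latin1_printable` are hypothetical names, and `latin1_printable` hard-codes what a Latin-1 locale's isprint() is assumed to accept instead of consulting the C library.

```python
def ascii_printable(c: int) -> bool:
    # Mirrors the original C test: escape when c < ' ' or c >= 0177.
    return not (c < 0x20 or c >= 0o177)


def latin1_printable(c: int) -> bool:
    # Assumption: under a Latin-1 locale, isprint() also accepts 0xA0..0xFF.
    return ascii_printable(c) or 0xA0 <= c <= 0xFF


def quote_bytes(data: bytes, printable=ascii_printable, quote: str = "'") -> str:
    """Quote an 8-bit string, escaping whatever `printable` rejects."""
    out = [quote]
    for c in data:
        ch = chr(c)  # byte value shown as the Latin-1 character
        if ch == quote or ch == "\\":
            out.append("\\" + ch)
        elif not printable(c):
            out.append("\\%03o" % c)  # three-digit octal escape
        else:
            out.append(ch)
    out.append(quote)
    return "".join(out)


print(quote_bytes(b"k\xe4\xe4k"))                    # old behaviour
print(quote_bytes(b"k\xe4\xe4k", latin1_printable))  # patched, Latin-1 locale
```

Only the predicate differs between the two calls, which is all the patch changes.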


M.-A. Lemburg

19 May

6 p.m.

Guido van Rossum wrote:

> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.
>
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
>
> Anybody got an opinion on this? I'm +0. I would even be +0 on a similar patch for unicode strings (once the ASCII proposal is implemented).

The subject line is a bit misleading: the patch only touches tp_print, not repr() output. And this is good, IMHO, since otherwise eval(repr(string)) wouldn't necessarily result in string.

Unicode objects don't implement a tp_print slot... perhaps they should ?

--

About the ASCII proposal:

Would you be satisfied with what

    import sys
    sys.set_string_encoding('ascii')

currently implements ?

There are several places where an encoding comes into play with the Unicode implementation. The above API currently changes str(unicode), print unicode and the assumption made by the implementation during coercion of strings to Unicode.

It does not change the encoding used to implement the "s" or "t" parser markers and also doesn't change the way the Unicode hash value is computed (these are currently still hard-coded as UTF-8).

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
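The round-trip property mentioned above, eval(repr(string)) giving back string, can be checked directly. The sketch below runs on a present-day Python 3 interpreter rather than 1.5.2, and uses `ast.literal_eval` in place of `eval` so that only literals are evaluated; `round_trips` and the sample strings are made up for the illustration.

```python
import ast


def round_trips(s: str) -> bool:
    """True if evaluating repr(s) as a literal yields s again."""
    return ast.literal_eval(repr(s)) == s


samples = ["foo", "kääk", "it's", 'say "hi"', "tab\there", "back\\slash", "\x00\x7f"]
for s in samples:
    print(repr(s), round_trips(s))
```

Any change to how repr() escapes characters has to keep this property, which is why an escaping rule that depends on the locale of the writer but not the reader is a concern.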

pf＠artcom-gmbh.de

6:14 p.m.

Guido van Rossum asks:

> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.

How portable is the locale awareness property of 'isprint' among traditional Unix environments, WinXX and MacOS? This works fine on my favorite development platform (Linux), but an accidental use of this new 'feature' might hurt the portability of my Python apps to other platforms.

If 'isprint' honors the locale in a similar way on other important platforms I would like this. Otherwise I would prefer the current behaviour so that I can deal with it during the early stages of development on my Linux boxes.

Regards, Peter
--
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)

Fredrik Lundh

6:14 p.m.

Guido van Rossum wrote:

> Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.
>
> It needs a Py_CHARMASK() call but otherwise seems to be fine.
>
> Anybody got an opinion on this? I'm +0. I would even be +0 on a similar patch for unicode strings (once the ASCII proposal is implemented).

does ctype-related locale stuff really mix well with unicode?

if yes, -0. if no, +0.

(intuitively, I'd say no -- deprecate in 1.6, remove in 1.7)

(btw, what about "eval(repr(s)) == s" ?)

</F>

Greg Ward

6:15 p.m.

On 19 May 2000, Guido van Rossum said:

> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.

For "ASCII" strings in this day and age -- which are often not necessarily plain ol' 7-bit ASCII -- I'd say that "32 <= c <= 127" is not the right way to determine printability. 'isprint()' seems much more appropriate to me.

Are there other areas of Python that should be locale-sensitive but aren't? A minor objection to this patch is that it's a creeping change that brings in a little bit of locale-sensitivity without addressing a (possibly) wider problem. However, I will immediately shoot down my own objection on the grounds that if we try to fix everything all at once, then nothing will ever get fixed.

Locale sensitivity strikes me as the sort of thing that *can* be a "creeping" change -- just fix the bits that bug people most, and eventually all the important bits will be fixed.

I have no expertise and therefore no opinion on such a change for Unicode strings.

Greg

bwarsaw＠python.org

20 May

12:21 a.m.

>>>>> "GW" == Greg Ward writes:

    GW> Locale sensitivity strikes me as the sort of thing that *can*
    GW> be a "creeping" change -- just fix the bits that bug people
    GW> most, and eventually all the important bits will be fixed.

Another decidedly ignorant Anglophone here, but one problem that I see with localizing stuff is that locale is app- (or at least thread-) global, isn't it? That would suck for applications like Mailman which are (going to be) multilingual in the sense that a single instance of the application will serve up documents in many languages, as opposed to serving up documents in just one of a choice of languages.

If it seems I don't know what I'm talking about, you're probably right. I just wanted to point out that there are applications that have to deal with many languages at the same time.

-Barry

Fredrik Lundh

19 May

10:16 p.m.

Barry Warsaw wrote:

> Another decidedly ignorant Anglophone here, but one problem that I see with localizing stuff is that locale is app- (or at least thread-) global, isn't it? That would suck for applications like Mailman which are (going to be) multilingual in the sense that a single instance of the application will serve up documents in many languages, as opposed to serving up documents in just one of a choice of languages.
>
> If it seems I don't know what I'm talking about, you're probably right. I just wanted to point out that there are applications that have to deal with many languages at the same time.

Applications may also have to deal with output devices (i.e. GUI toolkits, printers, communication links) that don't necessarily have the same restrictions as the "default console".

better do it the right way: deal with encodings at the boundaries, not inside the application.

</F>
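One way to read "deal with encodings at the boundaries" is sketched below in modern Python 3, where bytes and text are separate types. The helper names `read_text` and `write_text` are hypothetical, and the choice of `backslashreplace` for devices that cannot show a character is an assumption made for the example.

```python
def read_text(raw: bytes, encoding: str) -> str:
    # Input boundary: decode once, using the encoding of this particular source.
    return raw.decode(encoding)


def write_text(text: str, encoding: str) -> bytes:
    # Output boundary: encode for this particular device or stream.
    # Characters the target cannot represent become backslash escapes.
    return text.encode(encoding, errors="backslashreplace")


incoming = b"k\xe4\xe4k"                 # bytes from a Latin-1 source
text = read_text(incoming, "latin-1")   # inside the application: plain text
shout = text.upper()                    # no locale needed for the logic itself

for_web = write_text(shout, "utf-8")       # a UTF-8 capable consumer
for_console = write_text(shout, "ascii")   # a 7-bit-only device
print(for_web, for_console)
```

Because the encoding is a parameter of each stream and not process-global state, one instance of an application can serve several languages at once, which is the Mailman case raised above.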

bwarsaw＠python.org

20 May

1:39 a.m.

>>>>> "FL" == Fredrik Lundh writes:

    FL> better do it the right way: deal with encodings at the
    FL> boundaries, not inside the application.

Sounds good to me. :)

Ka-Ping Yee

19 May

10:26 p.m.

On Fri, 19 May 2000, Guido van Rossum wrote:

> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.

Changing the behaviour of repr() (a function that internally converts data into data) based on a fixed global system parameter makes me uncomfortable. Wouldn't it make more sense for the locale business to be a property of the stream that the string is being printed on?

This was the gist of my proposal for files having a printout method a while ago. I understand if that proposal is a bit too much of a change to swallow at once, but i'd like to ensure the door stays open to let it be possible in the future.

Surely there are other language systems that deal with the issue of "nicely" printing their own data structures for human interpretation... anyone have any experience to share? The printout/printon thing originally comes from Smalltalk, i believe.

(...which reminds me -- i played with Squeak the other day and thought to myself, it would be cool to browse and edit code in Python with a system browser like that.)

Note, however:

> This hits for example when Zope with squishdot weblog (squishdot 0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles - strings with valid Latin1 characters get indexed as backslash-escaped octal codes, and thus become unsearchable.

The above comment in particular strikes me as very fishy. How on earth can the escaping behaviour of repr() affect the indexing of text? Surely when you do a search, you search for exactly what you asked for.

And does the above mean that, with Jyrki's proposed fix, the sorting and searching behaviour of Squishdot will suddenly change, and magically differ from locale to locale? Is that something we want?

(That last is not a rhetorical question -- my gut says no, but i don't actually have enough experience working with these issues to know the answer.)

-- ?!ng

"Simple, yet complex." -- Lenore Snell

Ka-Ping Yee

10:34 p.m.

On Fri, 19 May 2000, Ka-Ping Yee wrote:

> Changing the behaviour of repr() (a function that internally converts data into data)

Clarification: what i meant by the above is, repr() is not explicitly an input or an output function. It does "some internal computation".

Here is one alternative:

    repr(obj, **kw): options specified in kw dict

        push each element in kw dict into sys.repr_options
        now do the normal conversion, referring to whatever
            options are relevant (such as "locale" if doing strings)
        for looking up any option, first check kw dict,
            then look for sys.repr_options[option]
        restore sys.repr_options

This is ugly and i still like printon/printout better, but at least it's a smaller change and won't prevent the implementation of printon/printout later.

This suggestion is not thread-safe.

-- ?!ng

"Simple, yet complex." -- Lenore Snell
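The push/convert/restore outline above could be sketched roughly as follows in modern Python. Everything here is hypothetical: `repr_options` is a module-level stand-in for the proposed sys.repr_options, `my_repr` stands in for an extended repr(), and the `escape_non_ascii` option is invented for the example (the outline itself only mentions "locale"). As the message says, a global options dict like this is not thread-safe.

```python
from contextlib import contextmanager

repr_options = {}  # stand-in for the proposed sys.repr_options


@contextmanager
def pushed_options(**kw):
    """Push kw into the global options, restore the old state on exit."""
    saved = dict(repr_options)
    repr_options.update(kw)
    try:
        yield
    finally:
        repr_options.clear()
        repr_options.update(saved)


def option(name, default=None):
    # kw values were merged into repr_options, so they are found first.
    return repr_options.get(name, default)


def my_repr(obj, **kw):
    with pushed_options(**kw):
        if isinstance(obj, str):
            if option("escape_non_ascii", True):
                return ascii(obj)   # escape everything outside ASCII
            return repr(obj)
        if isinstance(obj, tuple):
            # Nested calls see the options pushed by the outer call.
            inner = ", ".join(my_repr(item) for item in obj)
            if len(obj) == 1:
                inner += ","
            return "(" + inner + ")"
        return repr(obj)


print(my_repr(("foo", "kääk")))
print(my_repr(("foo", "kääk"), escape_non_ascii=False))
```

The point of the design is that options given at the top-level call reach nested conversions without every container type having to pass them along.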

M.-A. Lemburg

20 May

12:36 a.m.

Ka-Ping Yee wrote:

> On Fri, 19 May 2000, Guido van Rossum wrote:
>> The email below suggests a simple solution to a problem that e.g. Fran\347ois Pinard brought up long ago; repr() of a string turns all non-ASCII chars into \oct escapes. Jyrki's solution: use isprint(), which makes it locale-dependent. I can live with this.
>
> Changing the behaviour of repr() (a function that internally converts data into data) based on a fixed global system parameter makes me uncomfortable. Wouldn't it make more sense for the locale business to be a property of the stream that the string is being printed on?

Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot, so the only effect to be seen is when printing a string to a real file object (tp_print is only used by PyObject_Print() and that API is only used for writing to real PyFileObjects -- all other streams get the output of str() or repr()).

Perhaps we should drop tp_print for strings altogether and let str() and repr() decide what to do... (this is what Unicode objects do). The only good reason for implementing tp_print is to write huge amounts of data to a stream without creating intermediate objects -- not really needed for strings, since these *are* the intermediate object usually created for just this purpose ;-)

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Ka-Ping Yee

21 May

4 p.m.

On Fri, 19 May 2000, M.-A. Lemburg wrote:

> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,

Very sorry! I didn't actually look to see where the patch was being applied.

But then how can this have any effect on squishdot's indexing?

-- ?!ng

"All models are wrong; some models are useful." -- George Box

pf＠artcom-gmbh.de

9:24 p.m.

Hi!

Ka-Ping Yee:

> On Fri, 19 May 2000, M.-A. Lemburg wrote:
>> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,
>
> Very sorry! I didn't actually look to see where the patch was being applied.
>
> But then how can this have any effect on squishdot's indexing?

Sigh. Let me explain this in some detail.

What do you see here: äöüÄÖÜß? If all went well, you should see some umlauts which occur quite often in German words, like "Begrüssung", "ätzend" or "Grützkacke" and so on.

During the late 80s we here in Germany spent a lot of our free time patching open source tools like 'elm', 'B-News', 'less' and others to make them "8-bit clean", for example on ancient Unices like SCO Xenix, where the implementations of C-library functions like 'isprint' and 'islower' were out of reach.

After several years everybody seemed to agree on ISO-8859-1 as the new European standard character set, which was also often loosely called 8-bit ASCII, because ASCII is a true subset of ISO Latin-1. Even the German versions of Windows, at least, used ISO-8859-1.

As the WWW began to gain popularity nobody with a sane mind really used these splendid ASCII escapes like for example '&auml;' instead of 'ä'. The same holds true for the TeX user community, where everybody was happy to type real umlauts instead of the ugly backslash escape sequences used before: \"a\"o\"u ...

To make it short: a lot of effort has been spent to make *ALL* programs 8-bit clean, that is, to move the bytes through without translating them from or into a bunch of incompatible multi-byte sequences, which nobody can read or even wants to look at.

Now to get back to your question: there are several nice HTML indexing engines out there. I personally use HTDig. At least on Linux these programs deal fine with HTML files containing 8-bit chars.

But if for some reason umlauts end up as octal escapes ('\344' instead of 'ä') due to the use of a Python 'print some_tuple' during the creation of HTML files, a search engine will be unable to find those words with escaped umlauts.

Mit freundlichen Grüßen, Peter

P.S.: Hope you didn't find my explanation boring or off-topic.

Fredrik Lundh

9:56 p.m.

Peter Funk wrote:

> But if for some reason Umlauts end up as octal escapes ('\344' instead of 'ä') due to the use of a Python 'print some_tuple' during the creation of HTML files, a search engine will be unable to find those words with escaped umlauts.

umm. why would anyone use "print some_tuple" when generating HTML pages? what if the tuple contains something that results in a "<" character?

</F>
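The objection can be made concrete with a small sketch in modern Python 3: when building HTML, each value is formatted and escaped explicitly instead of relying on the printed form of a tuple. `render_row` is a hypothetical helper; `html.escape` is the standard-library function.

```python
import html


def render_row(values) -> str:
    """Format each value on its own and escape HTML metacharacters."""
    cells = "".join("<td>%s</td>" % html.escape(str(v)) for v in values)
    return "<tr>%s</tr>" % cells


row = ("kääk", "a < b & c")
print(render_row(row))   # umlauts pass through, "<" and "&" are escaped
print(str(row))          # the tuple's printed form: quotes, commas, raw "<"
```

Handled this way, the text a search engine sees does not depend on how repr() chooses to escape characters.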

M.-A. Lemburg

22 May

2:24 a.m.

Ka-Ping Yee wrote:

> On Fri, 19 May 2000, M.-A. Lemburg wrote:
>> Umm, Jyrki's patch does *not* affect repr(): it's a patch to the string_print API which is used for the tp_print slot,
>
> Very sorry! I didn't actually look to see where the patch was being applied.
>
> But then how can this have any effect on squishdot's indexing?

The only possible reason I can see is that this squishdot application uses 'print' to write the data -- perhaps it pipes it through some other tool ?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Fredrik Lundh

4:54 a.m.

M.-A. Lemburg wrote:

>> But then how can this have any effect on squishdot's indexing?
>
> The only possible reason I can see is that this squishdot application uses 'print' to write the data -- perhaps it pipes it through some other tool ?

but doesn't the patch only affect code that manages to call tp_print without the PRINT_RAW flag? (that is, in "repr" mode rather than "str" mode)

or to put it another way, if they manage to call tp_print without the PRINT_RAW flag, isn't that a bug in their code, rather than in Python?

or am I just totally confused?

</F>

Guido van Rossum

9:17 a.m.

Let's reboot this thread. Never mind the details of the actual patch, or why it would affect a particular index.

Obviously if we're going to patch string_print() we're also going to patch string_repr() (and vice versa) -- the former (without the Py_PRINT_RAW flag) is supposed to be an optimization of the latter. (I hadn't even read the patch that far to realize that it only did one and not the other.)

The point is simply this.

The repr() function for a string turns it into a valid string literal. There's considerable freedom allowed in this conversion, some of which is taken (e.g. it prefers single quotes but will use double quotes when the string contains single quotes).

For safety reasons, control characters are replaced by their octal escapes. This is also done for non-ASCII characters.

Lots of people, most of them living in countries where Latin-1 (or another 8-bit ASCII superset) is in actual use, would prefer that non-ASCII characters would be left alone rather than changed into octal escapes. I think it's not unreasonable to ask that what they consider printable characters aren't treated as control characters.

I think that using the locale to guide this is reasonable. If the locale is set to imply Latin-1, then we can assume that most output devices are capable of displaying those characters. What good does converting those characters to octal escapes do us then? If the input string was in fact binary goop, then the output will be unreadable goop -- but it won't screw up the output device (as control characters are wont to do, which is the main reason to turn them into octal escapes).

So I don't see how the patch can do much harm, I don't expect that it will break much code, and I see a real value for those who use Latin-1 or other 8-bit supersets of ASCII.

The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way. Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.

--Guido van Rossum (home page: http://www.python.org/~guido/)
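The freedoms repr() takes can be observed in a current interpreter. The sketch below is for Python 3, which differs from the 1.5.2 behaviour discussed here: control characters come out as hex escapes instead of octal, and printable non-ASCII characters are left alone by repr() regardless of locale, while the built-in ascii() still escapes them. `literal_forms` is a hypothetical helper name.

```python
def literal_forms(s: str):
    """Return the repr() and ascii() literal forms of s."""
    return repr(s), ascii(s)


for sample in ["spam", "it's", 'it\'s "x"', "\a", "kääk"]:
    r, a = literal_forms(sample)
    print(r, a)
```

Running it shows single quotes preferred, double quotes chosen when the string contains only single quotes, and the bell character escaped in both forms.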

pf＠artcom-gmbh.de

11:48 a.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Guido van Rossum: [...]

> The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way.

AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state, it used to work very well: on a typical single-user Linux system the user chooses his locale during the first stages of system setup and never has to think about it again. On multi-user systems the locale of individual accounts may be customized using several environment variables, which can override the default locale of the system.

> Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to control the default locale settings for a user's session. For example on my SuSE Linux system currently running in the German locale the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which automatically causes the C-library function toupper('ä') to return an 'Ä' ---you should see a lower case a-umlaut as argument and an upper case umlaut as return value--- without all applications having to call 'setlocale' explicitly.

So this simply works well as intended without having to add calls to 'setlocale' to all application programs using these C-library functions.

Regards, Peter.
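The environment variables mentioned above are consulted in a fixed order for the character-classification category: LC_ALL first, then LC_CTYPE, then LANG. A minimal sketch of that lookup follows; `effective_ctype_locale` is a hypothetical helper, and the "C" fallback is an assumption about the usual default. Whether a given program actually picks the setting up depends on whether something in it calls setlocale(LC_CTYPE, ""), which in Python is `locale.setlocale(locale.LC_CTYPE, "")`.

```python
def effective_ctype_locale(env) -> str:
    """Return the locale name the environment requests for LC_CTYPE.

    Empty values are treated as unset. The "C" fallback is an assumption
    about the usual default when nothing is configured.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


print(effective_ctype_locale({"LC_CTYPE": "de_DE", "LANG": "en_US"}))
print(effective_ctype_locale({"LC_ALL": "fi_FI", "LC_CTYPE": "de_DE"}))
print(effective_ctype_locale({}))
```

Passing `os.environ` as `env` would report what the current session requests.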

Fredrik Lundh

12:50 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Peter Funk wrote:

> AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)

you're missing the point -- now that we've added unicode support to Python, the old 8-bit locale *ctype* stuff no longer works. while some platforms implement a wctype interface, it's not widely available, and it's not always unicode.

so in order to provide platform-independent unicode support, Python 1.6 comes with unicode-aware and fully portable replacements for the ctype functions.

the code is already in there...

> On POSIX systems there are a several environment variables used to control the default locale settings for a users session. For example on my SuSE Linux system currently running in the german locale the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which causes automatically the C-library function toupper('ä') to return an 'Ä' ---you should see a lower case a-umlaut as argument and an upper case umlaut as return value--- without having all applications to call 'setlocale' explicitly.
>
> So this simply works well as intended without having to add calls to 'setlocale' to all application program using this C-library functions.

note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.

- 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.

- 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.

- wide unicode text strings. upper, strip, etc always works.

is this complexity really worth it?

</F>
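The difference between the first and last flavours in that list can be seen in present-day Python 3, which ended up with exactly two types: bytes for binary data and str for Unicode text. This is an illustration on a modern interpreter, not a description of 1.6; the variable names are made up.

```python
text = "kääk"                  # Unicode text string
raw = text.encode("latin-1")   # the same characters as 8-bit data

raw_upper = raw.upper()        # bytes.upper() only touches ASCII letters
text_upper = text.upper()      # str.upper() knows about ä -> Ä

print(raw_upper)               # the 0xe4 bytes are left as they were
print(text_upper)

# To upper-case 8-bit data correctly, decode it first, then re-encode.
via_text = raw.decode("latin-1").upper().encode("latin-1")
print(via_text)
```

Neither operation consults the locale, so the result is the same on every platform.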

Guido van Rossum

9:46 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

> From: "Fredrik Lundh"
>
> Peter Funk wrote:
>> AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)
>
> you're missing the point -- now that we've added unicode support to Python, the old 8-bit locale *ctype* stuff no longer works. while some platforms implement a wctype interface, it's not widely available, and it's not always unicode.

Huh? We were talking strictly 8-bit strings here. The locale support hasn't changed there.

> so in order to provide platform-independent unicode support, Python 1.6 comes with unicode-aware and fully portable replacements for the ctype functions.

For those who only need Latin-1 or another 8-bit ASCII superset, the Unicode stuff is overkill.

> the code is already in there...
>
>> On POSIX systems there are a several environment variables used to control the default locale settings for a users session. For example on my SuSE Linux system currently running in the german locale the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which causes automatically the C-library function toupper('ä') to return an 'Ä' ---you should see a lower case a-umlaut as argument and an upper case umlaut as return value--- without having all applications to call 'setlocale' explicitly.
>>
>> So this simply works well as intended without having to add calls to 'setlocale' to all application program using this C-library functions.
>
> note that this leaves us with four string flavours in 1.6:
>
> - 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.

These are not strings.

> - 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.
>
> - 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.

This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

> - wide unicode text strings. upper, strip, etc always works.
>
> is this complexity really worth it?

Fredrik Lundh

9:07 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Guido van Rossum wrote:

>> Peter Funk wrote:
>>> AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)
>>
>> you're missing the point -- now that we've added unicode support to Python, the old 8-bit locale *ctype* stuff no longer works. while some platforms implement a wctype interface, it's not widely available, and it's not always unicode.
>
> Huh? We were talking strictly 8-bit strings here. The locale support hasn't changed there.

I meant that the locale support, even though it's part of POSIX, isn't good enough for unicode support...

>> so in order to provide platform-independent unicode support, Python 1.6 comes with unicode-aware and fully portable replacements for the ctype functions.
>
> For those who only need Latin-1 or another 8-bit ASCII superset, the Unicode stuff is overkill.

why? besides, overkill or not:

>> the code is already in there...

>> note that this leaves us with four string flavours in 1.6:
>>
>> - 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.
>
> These are not strings.

depends on who you're asking, of course:

    >>> b = fetch_binary_goop()
    >>> type(b)
    >>> dir(b)
    ['capitalize', 'center', 'count', 'endswith', 'expandtabs', ...

>> - 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.
>>
>> - 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.
>
> This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour for unicode), maybe we should base the default unicode/string conversions on the locale too?

background:

until now, I've been convinced that the goal should be to have two "string-like" types: binary arrays for binary goop (including encoded text), and a Unicode-based string type for text. afaik, that's the solution used in Tcl and Perl, and it's also "conceptually compatible" with things like Java, Windows NT, and XML (and everything else from the web universe).

given that, it has been clear to me that anything that is not compatible with this model should be removed as soon as possible (and deprecated as soon as we understand why it won't fly under the new scheme).

but if backwards compatibility is more important than a minimalistic design, maybe we need three different "string-like" types:

-- binary arrays (still implemented by the 8-bit string type in 1.6)

-- 8-bit old-style strings (using the "system encoding", as defined by the locale. if the locale is not set, they're assumed to contain ASCII)

-- unicode strings (possibly using a "polymorphic" internal representation)

this also solves the default conversion problem: use the locale environment variables to determine the default encoding, and call sys.set_string_encoding from site.py (see my earlier post for details).

what have I missed this time?

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

    >>> sys
    ... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...

pf＠artcom-gmbh.de

10:47 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Hi!

Fredrik Lundh: [...]

>>> so in order to provide platform-independent unicode support, Python 1.6 comes with unicode-aware and fully portable replacements for the ctype functions.
>>
>> For those who only need Latin-1 or another 8-bit ASCII superset, the Unicode stuff is overkill.
>
> why?

Going from 8 bit strings to 16 bit strings doubles the memory requirements, right?

As long as we only deal with English, Spanish, French, Swedish, Italian and several other languages, 8 bit strings work out pretty well. Unicode will be neat if you can afford the additional space. People using Python on small computers in western countries probably don't want to double the size of their data structures for no reasonable benefit.
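The doubling is easy to check for a concrete word. The sketch below, in modern Python 3, measures the encoded size of one of the German words used earlier in the thread; it only counts encoded bytes and says nothing about how any particular interpreter stores strings in memory.

```python
word = "Begrüssung"   # 10 characters, one of them outside ASCII

sizes = {enc: len(word.encode(enc)) for enc in ("latin-1", "utf-8", "utf-16-le")}
for enc, size in sizes.items():
    print(enc, size)
```

A fixed 16-bit representation needs twice the bytes of Latin-1 here, while UTF-8 costs one extra byte for the single umlaut.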

...

...
This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour for unicode), maybe we should base the default unicode/string con- versions on the locale too?

Many locales effectively use Latin1 but for some other locales there is a difference: $ LANG="es_ES" python # Espanõl uses Latin-1, the same as "de_DE" Python 1.5.2 (#1, Jul 23 1999, 06:38:16) [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2 Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam

...

...
...
import string; print string.upper("äöü")
ÄÖÜ

...

...
...
import string; print string.upper("äöü")
Ä¦¬
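The two sessions above can be imitated in present-day Python 3 by treating the input as raw bytes and choosing the 8-bit charset explicitly. This is a sketch under assumptions: the locale of the second session is elided in the archive, and ISO-8859-5 is used here only because it happens to reproduce the bytes shown; the helper name upper_in is made up for the illustration.

```python
# The same three bytes that a Latin-1 terminal shows as a-, o-, u-umlaut.
raw = b"\xe4\xf6\xfc"

def upper_in(raw_bytes, encoding):
    """Uppercase raw_bytes as if they were text in the given 8-bit encoding."""
    return raw_bytes.decode(encoding).upper().encode(encoding)

latin1_result = upper_in(raw, "latin-1")
cyrillic_result = upper_in(raw, "iso8859-5")   # assumed charset, see lead-in

# Print the raw bytes so the output does not depend on the terminal charset.
print(latin1_result)
print(cyrillic_result)
```

The first result is the byte sequence for ÄÖÜ in Latin-1. The second keeps 0xC4 but maps the other two bytes to 0xA6 and 0xAC, which a Latin-1 terminal displays as the odd-looking string in the second session.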

I don't know how many people, for example in Russia, already depend on this behaviour. I suggest it should stay as is.

Regards, Peter
--
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)

M.-A. Lemburg

23 May

2:23 a.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Fredrik Lundh wrote:

...

...
...
- 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.

- 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.

This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour of unicode), maybe we should base the default unicode/string conversions on the locale too?

This was proposed by Guido some time ago... the discussion ended with the problem of extracting the encoding definition from the locale names. There are some ways to solve this problem (static mappings, fancy LANG variables etc.), but AFAIK, there is no widely used standard on this yet, so in the end you're stuck with defining the encoding by hand... e.g.

setenv LANG de_DE:latin-1

Perhaps we should help out a little and provide Python with a parser for the LANG variable with some added magic to provide useful defaults ?!
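A rough idea of what such a LANG parser could look like, written in present-day Python 3. Everything here is hypothetical: the function name parse_lang, the DEFAULT_ENCODINGS table and its entries, and the choice of "ascii" as the last-resort fallback (borrowed from the "assumed to contain ASCII" suggestion earlier in the thread) are assumptions for illustration, not anything that was actually added to Python.

```python
import re

# Hypothetical static mapping: locale name -> encoding to assume when the
# LANG value itself does not carry one. The entries are examples only.
DEFAULT_ENCODINGS = {
    "de_DE": "latin-1",
    "es_ES": "latin-1",
    "fi_FI": "latin-1",
}

# Accepts "de_DE", the POSIX-style "de_DE.ISO-8859-1", and the
# "de_DE:latin-1" spelling used in the message above. A trailing
# "@modifier" is tolerated and ignored.
_LANG_RE = re.compile(
    r"^(?P<name>[A-Za-z]+(?:_[A-Za-z]+)?)"
    r"(?:[.:](?P<enc>[^@]+))?"
    r"(?:@.*)?$"
)

def parse_lang(value, fallback="ascii"):
    """Return (locale_name, encoding) for a LANG-style string."""
    match = _LANG_RE.match(value or "")
    if match is None:
        return None, fallback
    name = match.group("name")
    # An explicit encoding wins; otherwise consult the table, then fall back.
    encoding = match.group("enc") or DEFAULT_ENCODINGS.get(name, fallback)
    return name, encoding

print(parse_lang("de_DE:latin-1"))
print(parse_lang("de_DE"))
print(parse_lang("C"))
```

The "added magic" is confined to the lookup table, so a site could extend or replace it without touching the parsing step.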

...

[...]

this also solves the default conversion problem: use the locale environment variables to determine the default encoding, and call sys.set_string_encoding from site.py (see my earlier post for details).

Right, that would indeed open up a path to consensus...

...

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

Perhaps... these were really only added as an experimental feature to test the various possibilities (and a possible implementation). My original intention was to remove them once consensus was reached -- perhaps we should keep the functionality (expanded to a per-thread setting; the global is a temporary hack) ?!

...

...
...
...
sys ... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...

True; see above for the reason why ;-)

PS: What do you think about the current internal design of sys.set_string_encoding() ? Note that hash() and the "st" parser markers still use UTF-8.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Ka-Ping Yee

22 May

9:47 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

On Mon, 22 May 2000, Guido van Rossum wrote:

...

...
note that this leaves us with four string flavours in 1.6:

- 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.

These are not strings.

Indeed -- but at the moment, we're letting people continue to use strings this way, since they already do it.

...

...
- 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.

- 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.

This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

I would like it to be only the latter, as Fred, i, and others have previously suggested, and as corresponds to your ASCII proposal for treatment of 8-bit strings. But doesn't the current locale-dependent behaviour of upper() etc. mean that strings are getting interpreted in the first way?

...

...
is this complexity really worth it?

From a backwards compatibility point of view, yes. Basically, programs that don't use Unicode should see no change in semantics.

I'm afraid i have to agree with this, because i don't see any other option that lets us escape from any of these four ways of using strings... -- ?!ng

Fred L. Drake

10:35 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

On Mon, 22 May 2000, Ka-Ping Yee wrote:

...

I would like it to be only the latter, as Fred, i, and others

Please refer to Fredrik as Fredrik or /F; I don't think anyone else refers to him as "Fred", and I got really confused when I saw this! ;) -Fred -- Fred L. Drake, Jr. <fdrake at acm.org>

Guido van Rossum

23 May

2:08 a.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

[Fredrik]

...

...
...
- 8-bit binary arrays. may contain binary goop, or text in some strange encoding. upper, strip, etc should not be used.

[Guido]

...

...
These are not strings.

[Ping]

...

Indeed -- but at the moment, we're letting people continue to use strings this way, since they already do it.

Oops, mistake. I thought that Fredrik (not Fred! that's another person in this context!) meant the array module, but upon re-reading he didn't.

...

...
...
- 8-bit text strings using the system encoding. upper, strip, etc works as long as the locale is properly configured.

- 8-bit unicode text strings. upper, strip, etc may work, as long as the system encoding is a subset of unicode -- which means US ASCII or ISO Latin 1.

This is a figment of your imagination. You can use 8-bit text strings to contain Latin-1, but you have to set your locale to match.

I would like it to be only the latter, as Fred, i, and others

Fredrik, right?

have previously suggested, and as corresponds to your ASCII proposal for treatment of 8-bit strings.

But doesn't the current locale-dependent behaviour of upper() etc. mean that strings are getting interpreted in the first way?

That's what I meant to say -- 8-bit strings use the system encoding guided by the locale.

...

...
...
is this complexity really worth it?

From a backwards compatibility point of view, yes. Basically, programs that don't use Unicode should see no change in semantics.

I'm afraid i have to agree with this, because i don't see any other option that lets us escape from any of these four ways of using strings...

Which is why I find Fredrik's attitude unproductive.

And where's the SRE release?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Fredrik Lundh

9:34 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

...

Which is why I find Fredrik's attitude unproductive.

given that locale support isn't included if you make a default build, I don't think deprecating it would hurt that many people...

but that's me; when designing libraries, I've always strived to find the *minimal* set of functions (and code) that makes it possible for a programmer to do her job well. I'm especially wary of blind alleys (sure, you can use locale, but that'll only take you this far, and you have to start all over if you want to do it right).

btw, talking about productivity, go check out the case sensitivity threads on comp.lang.python. imagine if all those people hammered away on the 1.6 alpha instead...

...

And where's the SRE release?

at the usual place:

http://w1.132.telia.com/~u13208596/sre/index.htm

still one showstopper left, which is why I haven't made the long-awaited public "now it's finished, dammit" announcement yet. but it shouldn't be that far away.

</F>

Guido van Rossum

22 May

8:39 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

...

From: pf@artcom-gmbh.de (Peter Funk)

Guido van Rossum: [...]

...
The one objection could be that the locale may be obsolescent -- but I've only heard /F vent an opinion about that; personally, I doubt that we will be able to remove the locale any time soon, even if we invent a better way.

AFAIK locale and friends conform to POSIX.1. Calling this obsolescent... hmmm... may offend a *LOT* of people. Try this on comp.os.linux.advocacy ;-)

Although I understand Barry's and Ping's objections against a global state, it used to work very well: On a typical single user Linux system the user chooses his locale during the first stages of system setup and never has to think about it again. On multi user systems the locale of individual accounts may be customized using several environment variables, which can override the default locale of the system.

...
Plus, I think that "better way" should address this issue anyway. If the locale eventually disappears, the feature automatically disappears with it, because you *have* to make a locale.setlocale() call before the behavior of repr() changes.

The last sentence is at least not the whole truth.

On POSIX systems there are several environment variables used to control the default locale settings for a user's session. For example, on my SuSE Linux system currently running in the German locale, the environment variable LC_CTYPE=de_DE is automatically set by a file /etc/profile during login, which automatically causes the C library function toupper('ä') to return an 'Ä' ---you should see a lower case a-umlaut as argument and an upper case umlaut as return value--- without requiring every application to call 'setlocale' explicitly.

So this simply works as intended, without having to add calls to 'setlocale' to every application program that uses these C library functions.

I don't believe that. According to the ANSI standard, a C program *must* call setlocale(LC_..., "") if it wants the environment variables to be honored; without this call, the locale is always the "C" locale, which should *not* honor the environment variables.

--Guido van Rossum (home page: http://www.python.org/~guido/)
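The opt-in rule described above can be sketched with the locale module of present-day Python 3, whose setlocale() wraps the C function. The helper names are made up for this illustration. One caveat, stated as an assumption: newer CPython versions may already adopt part of the environment's locale at startup, so the sketch forces the "C" locale first to get a known starting point.

```python
import locale

def current_locale(category=locale.LC_CTYPE):
    """Query the active locale for a category without changing it."""
    return locale.setlocale(category, None)

def adopt_environment_locale():
    """Opt in to LANG / LC_* the way setlocale(LC_ALL, "") does in C.

    Returns the new locale string, or None if the environment names a
    locale that is not installed on this machine.
    """
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None

# Force the portable "C" locale first so the starting point is known.
locale.setlocale(locale.LC_ALL, "C")
print("before:", current_locale())
print("after: ", adopt_environment_locale())
```

Until the second call is made, the environment variables have no effect on the process, which is the point of the paragraph above.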

pf＠artcom-gmbh.de

6:32 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Hi! [...] [me]:

...

...
So this simply works as intended, without having to add calls to 'setlocale' to every application program that uses these C library functions.

[Guido van Rossum]:

...

I don't believe that. According to the ANSI standard, a C program *must* call setlocale(LC_..., "") if it wants the environment variables to be honored; without this call, the locale is always the "C" locale, which should *not* honor the environment variables.

...

...
...
import string
print string.upper("ä")
Ä

This was the vanilla Python 1.5.2 as originally delivered by SuSE Linux. But yes, you are right. :-( My memory was confused by this practical experience. Now I'd like to quote from the man pages here:

man toupper:

[...]
BUGS
The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them.

In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp s is one example.

man setlocale:

[...]
A program may be made portable to all locales by calling setlocale(LC_ALL, "") after program initialization, by using the values returned from a localeconv() call for locale-dependent information and by using strcoll() or strxfrm() to compare strings.
[...]
CONFORMING TO
ANSI C, POSIX.1

Linux (that is, libc) supports the portable locales "C" and "POSIX". In the good old days there used to be support for the European Latin-1 "ISO-8859-1" locale (e.g. in libc-4.5.21 and libc-4.6.27), and the Russian "KOI-8" (more precisely, "koi-8r") locale (e.g. in libc-4.6.27), so that having an environment variable LC_CTYPE=ISO-8859-1 sufficed to make isprint() return the right answer. These days non-English speaking Europeans have to work a bit harder, and must install actual locale files.
[...]

In recent Linux distributions almost every Linux C-program seems to contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy to forget about it. However the core Python interpreter does not. It seems the Linux C-Library is not fully ANSI compliant in this case. It seems to honour the setting of $LANG regardless of whether a program calls 'setlocale' or not.

Regards, Peter

Guido van Rossum

10:24 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

...

pf@pefunbk> python
Python 1.5.2 (#1, Jul 23 1999, 06:38:16) [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam

...
...
...
import string
print string.upper("ä")
Ä

This threw me off too. However try this:

python -c 'print "ä".upper()'

It will print "ä". A mystery? No, the GNU readline library calls setlocale(). It is wrong, but I can't help it. But it only affects interactive use of Python.

...

In recent Linux distributions almost every Linux C-program seems to contain this obligatory 'setlocale(LC_ALL, "");' line, so it's easy to forget about it. However the core Python interpreter does not. It seems the Linux C-Library is not fully ANSI compliant in this case. It seems to honour the setting of $LANG regardless of whether a program calls 'setlocale' or not.

No, the explanation is in GNU readline. Compile this little program and see for yourself:

#include <stdio.h>
#include <ctype.h>

main()
{
    printf("toupper(%c) = %c\n", 'ä', toupper('ä'));
}

--Guido van Rossum (home page: http://www.python.org/~guido/)

pf＠artcom-gmbh.de

8:31 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Hi, Guido van Rossum:

...

...
pf@pefunbk> python
Python 1.5.2 (#1, Jul 23 1999, 06:38:16) [GCC egcs-2.91.66 19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam

...
...
...
import string
print string.upper("ä")
Ä

This threw me off too. However try this:

python -c 'print "ä".upper()'

Yes, you are right. :-(

Conclusion: If the 'locale' module would ever become deprecated then ...ummm... we poor mortals will simply have to add a line 'import readline' to our Python programs. Nifty... ;-)

Regards, Peter

Fredrik Lundh

8:43 p.m.

New subject: Some information about locale (was Re: repr vs. str and locales again)

Peter Funk wrote:

...

Conclusion: If the 'locale' module would ever become deprecated then ...ummm... we poor mortals will simply have to add a line 'import readline' to our Python programs. Nifty... ;-)

won't help if python is changed to use the *unicode* ctype functions... ...but on the other hand, if you use unicode strings for anything that is not plain ASCII, upper and friends will do the right thing even if you forget to import readline. </F>
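The contrast drawn above between 8-bit strings and unicode strings can be seen in present-day Python 3, where the two roles are played by bytes and str (an assumption about how the types map; the thread predates that split). A small sketch using the 'kääk' example from the original bug report:

```python
# bytes.upper() only maps the ASCII letters a-z; the Latin-1 byte for
# a-umlaut (0xE4) is left alone, whatever the locale is.
raw = b"k\xe4\xe4k"
upper_bytes = raw.upper()

# str.upper() follows the Unicode case mappings and needs no locale.
text = raw.decode("latin-1")
upper_text = text.upper()

# Print byte strings so the output does not depend on the terminal charset.
print(upper_bytes)
print(upper_text.encode("latin-1"))
```

Only the decoded text gets its umlauts uppercased (0xE4 becomes 0xC4); the raw bytes change in the ASCII positions alone.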

31 comments

9 participants

participants (9)

bwarsaw＠python.org
Fred L. Drake
Fredrik Lundh
Fredrik Lundh
Greg Ward
Guido van Rossum
Ka-Ping Yee
M.-A. Lemburg
pf＠artcom-gmbh.de
