[Python-Dev] repr vs. str and locales again

Guido van Rossum guido@python.org
Fri, 19 May 2000 08:06:52 -0700


The email below suggests a simple solution to a problem that
François Pinard, among others, brought up long ago: repr() of a string
turns all non-ASCII chars into octal escapes.  Jyrki's solution: use
isprint(), which makes the result locale-dependent.  I can live with this.

The patch needs a Py_CHARMASK() call around the isprint() argument,
but otherwise seems to be fine.
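
For context on the Py_CHARMASK() remark: C's isprint() is only defined
for arguments representable as unsigned char (or EOF), while a plain
char holding an 8-bit character can be negative on platforms where char
is signed.  The Python sketch below only illustrates that arithmetic.
It uses ctypes.c_byte as a stand-in for a signed char and assumes that
Py_CHARMASK() amounts to masking with 0xff; neither detail is stated in
this message.

```python
import ctypes

raw = 0xE4  # 'ä' in Latin-1, octal 344

# Stand-in for reading the byte through a signed C char:
# the value comes back negative.
as_signed_char = ctypes.c_byte(raw).value

# Masking to the low eight bits (assumed to be what Py_CHARMASK() does)
# recovers a value in the range isprint() is defined for.
masked = as_signed_char & 0xFF
```

Here as_signed_char is -28 and masked is 228 (octal 344), so the
patched test would presumably read isprint(Py_CHARMASK(c)).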

Anybody got an opinion on this?  I'm +0.  I would even be +0 on a
similar patch for unicode strings (once the ASCII proposal is
implemented).

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Fri, 19 May 2000 10:48:29 +0300
From:    Jyrki Kuoppala <jkp@kaapeli.fi>
To:      guido@python.org
Subject: python bug?: python 1.5.2 fails to print printable 8-bit characters in
	   strings

I'm not sure whether this is exactly a bug, i.e. whether python 1.5.2
is supposed to support locales and 8-bit characters.  However, on the
Debian "unstable" Linux distribution, the diff below makes python 1.5.2
handle printable 8-bit characters as one would expect.

Problem description:

python doesn't properly print printable 8-bit characters for the current
locale.

Details:

With no locale set, 8-bit characters in quoted strings print as
backslash-escapes, which I guess is OK:

$ unset LC_ALL
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

But with a locale in which 'ä' (octal 344) is a printable character, I
still get:

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, Apr  3 2000, 14:46:48)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'k\344\344k')
>>>

I should be getting (output from python patched with the enclosed patch):

$ export LC_ALL=fi_FI
$ python
Python 1.5.2 (#0, May 18 2000, 14:43:46)  [GCC 2.95.2 20000313 (Debian GNU/Linux)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a=('foo','kääk')
>>> print a
('foo', 'kääk')
>>>

This bites, for example, when Zope with the squishdot weblog (squishdot
0.3.2-3 with zope 2.1.6-1) creates a text index from posted articles:
strings with valid Latin1 characters get indexed as backslash-escaped
octal codes, and thus become unsearchable.

I am using debian unstable, kernels 2.2.15pre10 and 2.0.36, libc 2.1.3.

I suggest that the test for printability in
python-1.5.2/Objects/stringobject.c be fixed to use isprint(), which
takes the locale into account:

--- python-1.5.2/Objects/stringobject.c.orig	Thu Oct  8 05:17:48 1998
+++ python-1.5.2/Objects/stringobject.c	Thu May 18 14:36:28 2000
@@ -224,7 +224,7 @@
 		c = op->ob_sval[i];
 		if (c == quote || c == '\\')
 			fprintf(fp, "\\%c", c);
-		else if (c < ' ' || c >= 0177)
+		else if (! isprint (c))
 			fprintf(fp, "\\%03o", c & 0377);
 		else
 			fputc(c, fp);
@@ -260,7 +260,7 @@
 			c = op->ob_sval[i];
 			if (c == quote || c == '\\')
 				*p++ = '\\', *p++ = c;
-			else if (c < ' ' || c >= 0177) {
+			else if (! isprint (c)) {
 				sprintf(p, "\\%03o", c & 0377);
 				while (*p != '\0')
 					p++;



//Jyrki

------- End of Forwarded Message
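
The effect of the forwarded patch can be sketched in Python as follows.
This is only an illustration of the escaping loop, not the code in
stringobject.c: escape_string() and both predicates are hypothetical
names, the quote character is fixed rather than chosen from the string
contents, and is_printable_latin1() is a rough stand-in for what
isprint() might report under an ISO 8859-1 locale such as fi_FI.

```python
def is_printable_ascii(c):
    """The original 1.5.2 test: anything below ' ' or at/above 0177 is escaped."""
    return not (c < 0x20 or c >= 0o177)


def is_printable_latin1(c):
    """Hypothetical approximation of isprint() in an ISO 8859-1 locale."""
    return 0x20 <= c < 0x7F or 0xA0 <= c <= 0xFF


def escape_string(data, is_printable, quote=ord("'")):
    """Quote a byte string, octal-escaping bytes the predicate rejects."""
    out = bytearray([quote])
    for c in data:
        if c == quote or c == ord("\\"):
            # Quote and backslash are always backslash-escaped.
            out += b"\\" + bytes([c])
        elif not is_printable(c):
            # Same "\\%03o" format as in the patch.
            out += ("\\%03o" % c).encode("ascii")
        else:
            out.append(c)
    out.append(quote)
    return bytes(out)


word = b"k\xe4\xe4k"  # 'kääk' encoded as Latin-1

old_style = escape_string(word, is_printable_ascii)
new_style = escape_string(word, is_printable_latin1)
```

With the ASCII-only test, old_style holds the escaped form
'k\344\344k'; with the locale-style test, new_style keeps the two 0xE4
bytes unescaped, which a Latin-1 terminal would show as 'kääk'.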