[Tutor] Handling international characters

Fri May 23 08:08:02 2003

At 22:10 2003-05-22 -0300, Jorge Godoy wrote:
>Is there anyway I can solve/handle that?

It's supposed to be like that. You aren't printing strings, you know;
you are printing entire data structures. These data structures might
contain anything, from modules or functions to themselves.

Like this:

 >>> import re
 >>> a = [re, 0.1, 'Hellö']
 >>> a.append(a)
 >>> print a
[<module 're' from 'G:\Python22\lib\re.pyc'>, 0.10000000000000001, 'Hell\xf6', [...]]

There are two built-in Python functions for extracting a text string
from any object. The str() function returns something that is
hopefully pleasing to the eye. The repr() function returns something
that identifies the object as exactly as possible.
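
To see the two functions side by side on an object of your own, here is
a small sketch. The Name class is hypothetical, made up only for this
illustration. It also shows that a list builds its text from the repr()
of each element, not from str().

```python
class Name(object):
    """Hypothetical example class, not part of any library."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Meant to be pleasant for a person to read
        return self.text

    def __repr__(self):
        # Meant to identify the object as exactly as possible
        return 'Name(%r)' % (self.text,)


n = Name('Jorge')
as_str = str(n)
as_repr = repr(n)
in_list = str([n, n])   # the list asks each element for its repr()

print(as_str)    # Jorge
print(as_repr)   # Name('Jorge')
print(in_list)   # [Name('Jorge'), Name('Jorge')]
```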

 >>> f = 0.1
 >>> print str(f)
0.1
 >>> print repr(f)
0.10000000000000001

One tenth can't be described exactly as a binary fraction, any
more than a third can be described exactly as a decimal fraction.
repr() shows that, str() hides it.
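
The difference is only a matter of how many digits are shown. As a rough
sketch, and assuming (as far as I know) that str() rounds a float to
about 12 significant digits while repr() uses 17, the same two results
can be produced with explicit format strings:

```python
f = 0.1

# 12 significant digits: the rounding hides the binary approximation
short = '%.12g' % f

# 17 significant digits: enough to expose the stored value
exact = '%.17g' % f

print(short)   # 0.1
print(exact)   # 0.10000000000000001
```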

 >>> s = 'å'
 >>> print str(s)
å
 >>> print repr(s)
'\xe5'

But note that '\xe5' is *not* a Python unicode object. The
corresponding unicode object would be u'\xe5'. It's just that
the two happen to be coded the same way if we use Latin-1.

 >>> s = 'å'
 >>> l = [s, unicode(s, 'latin1')]
 >>> print l
['\xe5', u'\xe5']
 >>> for x in l: print x
...
å
å
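
That the byte string and the unicode object look alike is a coincidence
of Latin-1. A minimal sketch of the same character in two encodings,
assuming an interpreter that accepts both u'' and b'' literals (Python
2.6 or later, or 3.3 or later; the b'' prefix is not available in 2.2):

```python
u = u'\xe5'                    # unicode object: a with a ring over it

latin1 = u.encode('latin-1')   # one byte, with the same value as the code point
utf8 = u.encode('utf-8')       # two bytes, neither of them 0xe5

print(ord(u) == 0xe5)          # True
print(latin1 == b'\xe5')       # True
print(utf8 == b'\xc3\xa5')     # True
print(len(latin1))             # 1
print(len(utf8))               # 2
```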

In ISO 8859-1 (Latin-1), the encoding in use here, an a with a ring
over it is stored as a single byte whose value is e5 in hexadecimal
notation. Anything that would look odd to an American computer user ;),
that is, any byte outside seven-bit ASCII, is printed as a hexadecimal
escape by repr().

One good thing about that is that regardless of how clumsy our
computers or the email systems in between are, I'll be able to
extract "Érica" from the repr() description, since it's all seven-bit
ASCII.
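
A rough sketch of that round trip. Note that it uses the
'unicode_escape' codec as a stand-in for repr(), which is my
substitution and not quite the same thing; it produces the same kind of
backslash escapes without the surrounding quotes. It assumes u'' and
b'' literals are both accepted (Python 2.6+ or 3.3+).

```python
name = u'\xc9rica'

# Escape everything outside ASCII as backslash sequences
escaped = name.encode('unicode_escape')

# This would raise UnicodeDecodeError if any byte were outside seven-bit ASCII
escaped.decode('ascii')

# The receiver can turn the escapes back into the original text
restored = escaped.decode('unicode_escape')

print(escaped == b'\\xc9rica')   # True
print(restored == name)          # True
```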

If you just loop through your data structure and print each element,
it will come out printed properly.

If this is supposed to be read by someone, do you really want to keep
the brackets and quote marks?

 >>> r = [['Jorge', 'Godoy'], ['Juliano', 'Godoy'], ['Érica', 'Balaniuc']]
 >>> print r
[['Jorge', 'Godoy'], ['Juliano', 'Godoy'], ['\xc9rica', 'Balaniuc']]
 >>> for row in r:
...     for element in row:
...             print element,
...     print
...
Jorge Godoy
Juliano Godoy
Érica Balaniuc

(This should come out right. It might not fare well through the email 
though...)
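
If you would rather build the text first and decide later where it goes,
the nested loop can be sketched with join() instead. This variant is my
assumption about what you want: it uses unicode objects throughout
(written with escapes so the source stays plain ASCII), whereas the
session above used 8-bit strings.

```python
r = [[u'Jorge', u'Godoy'],
     [u'Juliano', u'Godoy'],
     [u'\xc9rica', u'Balaniuc']]

# One line of text per row, elements separated by a space
lines = [u' '.join(row) for row in r]

# The whole report as a single unicode object, ready to be encoded
text = u'\n'.join(lines)

print(lines[0])     # Jorge Godoy
print(len(lines))   # 3
```

How text looks when printed still depends on encoding it for the
terminal you are using.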

>This is really bugging me for a while. I've tried changing some
>parameters at site.py but I had no success (I tracked another problem
>to what I thought was an XML problem, but then with the editing tool
>that also parses the XML everything works... I don't know what's going
>on).

This is the wrong path to take. Don't mess with site.py.

If you convert data between unicode and old-fashioned 8-bit strings,
you must do that conversion explicitly, and state which encoding you
use.

string.decode(encoding) => unicode_string

unicode_string.encode(encoding) => string

 >>> s = "%s %s" % tuple(r[-1])
 >>> s
'\xc9rica Balaniuc'
 >>> repr(s)
"'\\xc9rica Balaniuc'"
 >>> print s
Érica Balaniuc
 >>> u = s.decode('latin-1')
 >>> print u
Érica Balaniuc
 >>> print u.encode('latin1') # will print right in windows/unix etc
Érica Balaniuc
 >>> print u.encode('cp850') # will print right in DOS box
?rica Balaniuc
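
The last line looks wrong here only because this window is not a DOS
box. A small sketch of what happens at the byte level, again assuming
an interpreter that accepts b'' literals (Python 2.6+ or 3.3+): the
same unicode object becomes different bytes in different encodings, so
the encoding has to match whatever displays the result.

```python
s = b'\xc9rica Balaniuc'       # 8-bit string, Latin-1 encoded

u = s.decode('latin-1')        # explicit conversion to unicode

latin1 = u.encode('latin-1')   # E with acute accent is byte 0xc9
cp850 = u.encode('cp850')      # the same letter is byte 0x90 in cp850

print(latin1 == s)                      # True
print(cp850 == b'\x90rica Balaniuc')    # True
print(cp850.decode('cp850') == u)       # True
```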

--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program