Internationalization bug?? [Python 2.2.1, RedHat 8.0, Swedish]

Urban Anjar urban.anjar at hik.se
Sat Oct 12 12:57:11 EDT 2002


Hi,
I have found something that looks like a bug, or at least a not so
pleasant feature. In Swedish we often use the characters å, ä and ö (a
with a ring, a with two dots and o with two dots), and I can't get them
to work properly in Python.

Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> S = 'abc'
>>> print S
abc
>>> print len(S)
3

That is perfectly OK, but...

>>> S = 'åäö'
>>> print S
åäö
>>> print len(S)
6

It seems that every Swedish character occupies 2 bytes, and len()
returns the number of bytes rather than the number of characters...
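
One way to check that guess is to compare the character count with the
byte count directly. The sketch below is written for a current Python 3
interpreter, not the 2.2.1 session shown above, and it assumes the
terminal delivers the text as UTF-8; both are assumptions, not something
the session confirms.

```python
# Sketch for Python 3 (an assumption; the session above is Python 2.2.1).
# Assumes the terminal encodes input as UTF-8.
text = 'åäö'                   # three characters
raw = text.encode('utf-8')     # the bytes a UTF-8 terminal would send

char_count = len(text)         # len() on text counts characters
byte_count = len(raw)          # len() on bytes counts bytes

print(char_count)              # 3
print(byte_count)              # 6
# Each of the three characters takes two bytes in UTF-8.
print(raw == b'\xc3\xa5\xc3\xa4\xc3\xb6')   # True
```

If the byte count is twice the character count, as here, the 6 reported
by len() is consistent with a byte string holding UTF-8 data.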


Look at this code snippet:

#!/usr/bin/python
# Recursively reverse a string: last element first, then the rest.
def rev(s):
    if s:
        return s[-1] + rev(s[:-1])
    else:
        return ''

text = 'abcåäö'
print rev(text)

Running it gives:
[urban@falcon urban]$ ./rev
?äå?cba

I was expecting 'öäåcba'
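
The garbled output is what reversing UTF-8 bytes one at a time would
produce: the two bytes of each Swedish character get swapped or split
apart. The following is a minimal sketch of that effect, assuming
Python 3 and UTF-8 input (neither matches the 2.2.1 setup above
exactly). The rev() here is adapted with slices so the same function
accepts both text and bytes.

```python
# Sketch for Python 3 (an assumption; the script above is Python 2).
def rev(s):
    # Slices keep the type (str or bytes), unlike s[-1] on bytes.
    if s:
        return s[-1:] + rev(s[:-1])
    return s[:0]

text = 'abcåäö'
raw = text.encode('utf-8')       # assumes the source file is UTF-8

rev_text = rev(text)             # reversed character by character
rev_bytes = rev(raw)             # reversed byte by byte

# Decode the byte-reversed data, marking undecodable bytes with U+FFFD.
garbled = rev_bytes.decode('utf-8', errors='replace')

print(rev_text == 'öäåcba')                   # True
print(garbled == '\ufffdäå\ufffdcba')         # True
```

The two replacement characters sit exactly where the two question marks
appear in the output above: a lone trailing byte at the start and a lone
leading byte just before 'cba'.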

Of course I can analyze in detail how the characters are represented
and make some kind of workaround, but I don't think this is the Python
way. In assembler or C I have to think about things like that, but do I
have to do that in Python?

Another example:

>>> L = ['Åke','Ärla','Östen']
>>> print L
['\xc3\x85ke', '\xc3\x84rla', '\xc3\x96sten']
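
The escapes come from printing the list itself, which shows the repr of
each element; the data is not damaged. A small sketch, assuming Python 3
and that the bytes shown are UTF-8, decodes them back into the names:

```python
# Sketch for Python 3 (an assumption; the session above is Python 2.2.1).
# The byte values are copied from the printed list above.
names_raw = [b'\xc3\x85ke', b'\xc3\x84rla', b'\xc3\x96sten']

# Decoding as UTF-8 (assumed encoding) recovers the original text.
names = [b.decode('utf-8') for b in names_raw]

print(names == ['Åke', 'Ärla', 'Östen'])   # True
print([len(n) for n in names])             # [3, 4, 5]
```

Printing each element on its own, rather than the whole list, avoids
the repr and shows the characters as typed.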

Please let me know if I am doing something wrong, or if you too think
of this as a bug.

There is some noise about Unicode... Does that solve my problems?
How do I use it?

Sincerely,
Urban Anjar


