length of unicode strings

Wed Aug 21 22:22:13 EDT 2002

Trond Eivind Glomsrød wrote:
> When running on a utf-8 system, python doesn't seem to take its input
> in unicode:
> 
> 
> Python 2.2.1 (#1, Aug 19 2002, 18:04:04)
> [GCC 3.2 (Red Hat Linux Rawhide 3.2-1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 

:( unicode is hard.  I won't pretend to understand, but as no other 
replies exist this may be useful.

> >>> a="å"
> >>> a
> '\xc3\xa5'

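Here we do indeed seem to have the UTF-8 representation of the
character: å (U+00E5) encodes as the two bytes \xc3 \xa5.

Indeed, decoding those two bytes as UTF-8 gives a single character: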
 >>> len(unicode('\xc3\xa5', "utf8"))
1
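
The same point as a minimal sketch, written so that it should also run on later Python versions that have the b'...' prefix. The names raw, text, byte_count and char_count are made up for illustration; the only assumption is that the terminal sent UTF-8.

```python
# The two bytes a UTF-8 terminal sends for "å" (U+00E5).
raw = b'\xc3\xa5'

# Counting before decoding counts bytes ...
byte_count = len(raw)

# ... while decoding as UTF-8 first yields one character.
text = raw.decode('utf-8')
char_count = len(text)

print("%d bytes, %d character" % (byte_count, char_count))
```

So len() is not miscounting; it counts whatever units the string holds, bytes before decoding and characters after.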

> 
> >>> len(a)
> 2
> 
> >>> b=u"å"
> >>> b

What we see here is, effectively,
b=u"\xc3\xa5"

i.e., we are creating a unicode string from a 2-byte string, with each
byte (neither of which is ASCII) becoming a character of its own.
I'm really not sure what the semantics of the default encoding are here,
but I would expect it to work if you changed the default encoding in
site.py.

That isn't generally a good idea, though. But as I don't really
understand how everything interacts in this case, I won't speculate or
advise :)
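
A sketch of what appears to be happening with b, under the assumption (not confirmed anywhere in this thread) that each byte of the literal is mapped straight to the code point with the same number, which is what a Latin-1 decode does. The names wrong, right and repaired are illustrative only.

```python
raw = b'\xc3\xa5'  # UTF-8 bytes for "å"

# Byte-for-byte (Latin-1) decoding gives two characters, like b above.
wrong = raw.decode('latin-1')

# Decoding as UTF-8 gives the single intended character.
right = raw.decode('utf-8')

# A wrongly decoded string can be repaired by undoing the Latin-1 step
# and then decoding the recovered bytes as UTF-8.
repaired = wrong.encode('latin-1').decode('utf-8')

print("%d %d %s" % (len(wrong), len(right), repaired == right))
```

If the assumption holds, decoding the input explicitly with the terminal's encoding avoids the problem without touching the default encoding.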

> u'\xc3\xa5'
> 
> >>> len(b)
> 2
> 
> >>> a.isalpha()
> 0
> 
> Any particular things to configure? Enabling the
> locale.getdefaultlocale() part in site.py doesn't help :(

At the end of the day, it seems the character you want is \xe5, and, if 
decoded properly, the len() function works correctly, e.g.:

 >>> a=u"\xe5"
 >>> a
u'\xe5'
 >>> a.isalpha()
True
 >>> len(a)
1
 >>> a.encode("utf8")
'\xc3\xa5'

Mark.