length of unicode strings

Mark Hammond mhammond at skippinet.com.au
Thu Aug 22 04:22:13 CEST 2002

Trond Eivind Glomsrød wrote:
> When running on a utf-8 system, python doesn't seem to take it input
> in unicode:
> Python 2.2.1 (#1, Aug 19 2002, 18:04:04)
> [GCC 3.2 (Red Hat Linux Rawhide 3.2-1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.

:( Unicode is hard.  I won't pretend to understand it all, but as there 
are no other replies, this may be useful.

> '\xc3\xa5'

Here we do indeed seem to have a UTF-8 representation of the character.

 >>> len(unicode('\xc3\xa5', "utf8"))

> 2

What we see here is, effectively,

i.e., we are creating a Unicode string from a 2-character 8-bit string 
(the two bytes are not ASCII, and each one ends up as its own character). 
I'm really not sure what the semantics of the default encoding are here, 
but I would expect it to work if you changed the default encoding in site.py.

That isn't generally a good idea though - but as I don't really understand 
how everything interacts in this case, I won't speculate or advise :)
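
For what it's worth, here is a minimal sketch of the difference. It assumes 
a Python 3 interpreter rather than the 2.2.1 in this thread, so a bytes 
object stands in for the old 8-bit string. Using latin-1 to mimic the 
"one byte, one character" result is my assumption about what happened 
above, not something I have confirmed:

```python
# Python 3 sketch; bytes plays the role of the 8-bit str discussed here.
raw = b"\xc3\xa5"  # the two UTF-8 bytes that encode U+00E5

# Decoded with the right codec: two bytes collapse into one code point.
as_utf8 = raw.decode("utf-8")

# Assumption: latin-1 mimics "each byte becomes its own character".
as_latin1 = raw.decode("latin-1")

print(len(raw), len(as_utf8), len(as_latin1))
print(ascii(as_utf8), ascii(as_latin1))
```

The second decode reproduces the 2-character string from the original 
report; the first gives the single character that was intended.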

> u'\xc3\xa5'
> 2
> 0
> Any particular things to configure? Enabling the
> locale.getdefaultlocale() part in site.py doesn't help :(

At the end of the day, it seems the character you want is \xe5, and, if 
decoded properly, the len() function works correctly.  e.g.:

 >>> a=u"\xe5"
 >>> a
 >>> a.isalpha()
 >>> len(a)
 >>> a.encode("utf8")
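
The same point can be made by comparing the character count with the 
encoded byte count. This is only a sketch, again assuming Python 3; the 
helper name lengths is made up for illustration and is not part of any 
library:

```python
def lengths(text):
    """Return (number of code points, number of UTF-8 bytes) for text.

    Hypothetical helper for illustration only.
    """
    return len(text), len(text.encode("utf-8"))


a = "\xe5"  # the character from the session above

print(a.isalpha())     # it is a letter
print(lengths(a))      # one character, two bytes once encoded
print(lengths("abc"))  # pure ASCII: the two counts agree
```

len() counts characters on the decoded string and bytes on the encoded 
one, which is why the two results in the original report differ.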

