length of unicode strings
Mark Hammond
mhammond at skippinet.com.au
Wed Aug 21 22:22:13 EDT 2002
Trond Eivind Glomsrød wrote:
> When running on a utf-8 system, python doesn't seem to take its input
> in unicode:
>
>
> Python 2.2.1 (#1, Aug 19 2002, 18:04:04)
> [GCC 3.2 (Red Hat Linux Rawhide 3.2-1)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
:( Unicode is hard. I won't pretend to understand it, but as there are no
other replies, this may be useful.
> >>> a="å"
> >>> a
> '\xc3\xa5'
Here we do indeed seem to have a UTF-8 representation of the character.
Indeed:
>>> len(unicode('\xc3\xa5', "utf8"))
1
>
> >>> len(a)
> 2
>
> >>> b=u"å"
> >>> b
What we see here is, effectively,
b=u"\xc3\xa5"
i.e., we are creating a Unicode string from a 2-byte string, with each byte
becoming its own character.
I'm really not sure what the semantics of the default encoding are here,
but I would expect it to work if you changed the default encoding in site.py.
That isn't generally a good idea though, and as I don't really understand
how everything interacts in this case, I won't speculate or advise :)
> u'\xc3\xa5'
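To make the difference concrete, here is a small sketch (not part of the session above) of the two ways those two bytes can be read. It builds the bytes from u"\xe5" instead of typing the character, so it does not depend on the terminal's encoding. The variable names are made up for illustration, and it should run on Python 2 as well as on Python 3.3 or later. The Latin-1 reading appears to match what the u"å" literal produced above, though I have not confirmed that this is what the interpreter does internally.

```python
# Illustrative sketch; the variable names are hypothetical.
raw = u"\xe5".encode("utf-8")        # the two bytes 0xC3 0xA5

as_utf8 = raw.decode("utf-8")        # bytes read as UTF-8: one character
as_latin1 = raw.decode("latin-1")    # each byte read as its own character

print(len(raw))                      # 2
print(len(as_utf8))                  # 1
print(len(as_latin1))                # 2
print(as_latin1 == u"\xc3\xa5")      # True
```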
>
> >>> len(b)
> 2
>
> >>> a.isalpha()
> 0
>
> Any particular things to configure? Enabling the
> locale.getdefaultlocale() part in site.py doesn't help :(
At the end of the day, it seems the character you want is \xe5 and, if it is
decoded properly, the len() function works correctly, e.g.:
>>> a=u"\xe5"
>>> a
u'\xe5'
>>> a.isalpha()
True
>>> len(a)
1
>>> a.encode("utf8")
'\xc3\xa5'
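If the input really does arrive as UTF-8 bytes, one option is to decode it before measuring it. A minimal sketch, assuming the data is valid UTF-8; char_len is a hypothetical helper name, not anything from the standard library:

```python
def char_len(data, encoding="utf-8"):
    """Return the number of characters in an encoded byte string.

    Hypothetical helper: decodes first, then counts.
    """
    return len(data.decode(encoding))

utf8_bytes = u"\xe5".encode("utf-8")

print(len(utf8_bytes))                        # 2 (bytes)
print(char_len(utf8_bytes))                   # 1 (characters)
print(utf8_bytes.decode("utf-8").isalpha())   # True
```

Decoding will raise UnicodeDecodeError (a subclass of ValueError) if the bytes are not valid in the given encoding, so this is only safe when you know what encoding the input is in.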
Mark.