length of unicode strings

Trond Eivind Glomsrød teg at redhat.com
Fri Aug 23 17:51:47 CEST 2002


Mark Hammond <mhammond at skippinet.com.au> writes:

> Trond Eivind Glomsrød wrote:
> > When running on a utf-8 system, python doesn't seem to take its input
> > in unicode:
> > Python 2.2.1 (#1, Aug 19 2002, 18:04:04)
> > [GCC 3.2 (Red Hat Linux Rawhide 3.2-1)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> >
> 
> :( unicode is hard.  I won't pretend to understand, but as no other
> replies exist this may be useful.
> 
> > >>> a="å"
> > >>> a
> > '\xc3\xa5'
> 
> Here we do indeed seem to have a UTF8 representation of the
> character.

The entire system is running a utf-8 locale... the problem is that
python doesn't treat it as such, and I don't see a way to make it do so.

What I'll probably need is a way for python to set all these strings
as unicode by default...

> 
> indeed,
>  >>> len(unicode('\xc3\xa5', "utf8"))
> 1
> 
> >
> > >>> len(a)
> > 2
> >
> > >>> b=u"å"
> > >>> b
> 
> What we see here is, effectively,
> b=u"\xc3\xa5"

Yes, I included the above to show that.
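
To illustrate the point being discussed here, the following is a minimal sketch, not code from the original exchange. It assumes modern Python 3 syntax (the session in this thread is Python 2.2), where bytes and text are separate types. It shows that the same two bytes yield two code points when each byte is taken as one character (Latin-1), and one code point when they are decoded as UTF-8:

```python
# Assumption: Python 3, where bytes and text are distinct types.
raw = b"\xc3\xa5"  # the two UTF-8 bytes a utf-8 terminal sends for a-ring

# One code point per byte: this mirrors the u'\xc3\xa5' result in the thread.
as_latin1 = raw.decode("latin-1")

# Both bytes together form the single code point U+00E5.
as_utf8 = raw.decode("utf-8")

print(len(raw))            # 2 (bytes)
print(len(as_latin1))      # 2 (code points)
print(len(as_utf8))        # 1 (code point)
print(as_utf8 == "\xe5")   # True
```

The length only comes out as 1 when the bytes are decoded with the encoding that actually produced them.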

> ie, we are creating a unicode string from a 2 character ascii
> string. I'm really not sure what the semantics of the default encoding
> are here, but I would expect it to work if you changed the default
> encoding in site.py

>>> import sys
>>> sys.getdefaultencoding()
'utf'
>>> a="å"
>>> len(a)
2
>>>

(this is what you get from enabling the locale-sensitive encoding
detection in site.py)

Hardcoding it to utf-8 doesn't help either...

> > u'\xc3\xa5'
> >
> > >>> len(b)
> > 2
> >
> > >>> a.isalpha()
> > 0
> > Any particular things to configure? Enabling the
> > locale.getdefaultlocale() part in site.py doesn't help :(
> 
> At the end of the day, it seems the character you want is \xe5, and, if
> decoded properly, the len() function works correctly.  eg:

Yes. It boils down to a need to get python to recognize the string as
unicode automatically and mark it as such.
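
One way to approximate "recognize the string as unicode automatically" is to decode input at the boundary using the locale's encoding. The sketch below is an illustration only, written for Python 3 rather than the Python 2.2 used in this thread. The helper name `to_text` is hypothetical, and `locale.getpreferredencoding` is used on the assumption that the locale reports the terminal's real encoding:

```python
import locale


def to_text(raw, encoding=None):
    """Decode raw input bytes into text (hypothetical helper).

    Falls back to the locale's preferred encoding when none is given.
    """
    if encoding is None:
        # False: query the encoding without changing the process locale.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


# The encoding is passed explicitly here so the result does not depend
# on the locale of the machine running the example.
text = to_text(b"\xc3\xa5", encoding="utf-8")

print(len(text))              # 1
print(text.isalpha())         # True
print(b"\xc3\xa5".isalpha())  # False: the bytes method only knows ASCII letters
```

Once the bytes are decoded, both `len()` and `isalpha()` operate on characters rather than on bytes.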


-- 
Trond Eivind Glomsrød
Red Hat, Inc.


