Internationalization bug?? [Python 2.2.1, RedHat 8.0, Swedish]
pwatson at mail.com
Sun Oct 13 13:26:08 CEST 2002
If len() returns the number of bytes, what can Urban Anjar use to get the
number of characters?
"Martin v. Loewis" <martin at v.loewis.de> wrote in message
news:m3of9zjm90.fsf at mira.informatik.hu-berlin.de...
> urban.anjar at hik.se (Urban Anjar) writes:
> > >>> S = 'åäö'
> > >>> print S
> > åäö
> > >>> print len(S)
> > 6
> > Seems like every Swedish character occupies 2 bytes
> > and len() returns the number of bytes, not the number of
> > characters...
> It appears you are using a UTF-8 locale. In UTF-8, accented Latin
> characters such as these take two bytes each; many characters (CJK in
> particular) take three bytes.
> You are somewhat misguided in assuming that each character takes only a
> single byte. If that were the case, you could only support 256
> characters, but UTF-8 (and Unicode) supports many more characters.
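The byte widths mentioned above can be checked with a small sketch. It assumes Python 3 (the thread itself uses Python 2.2), where str is a Unicode string and encode() returns the byte string. The helper name utf8_width is made up for illustration.

```python
def utf8_width(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Escapes are used so the example does not depend on the editor's encoding.
samples = {
    "a": "a",            # plain ASCII letter
    "a-ring": "\u00e5",  # å
    "a-umlaut": "\u00e4",  # ä
    "o-umlaut": "\u00f6",  # ö
    "cjk": "\u4e2d",     # 中
}

widths = {name: utf8_width(ch) for name, ch in samples.items()}
for name, n in widths.items():
    print(name, n)
```

The ASCII letter comes out as one byte, each Swedish letter as two, and the CJK character as three.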
> Perhaps you misinterpreted the meaning of the len function: For a byte
> string, it gives you the number of bytes, not (necessarily) the number
> of characters.
> To work with characters, you may want to try Unicode. If you do
> s = unicode(s,"utf-8")
> print len(s)
> you should see that you really have three characters only.
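As a rough illustration of the difference, here is the same check written for Python 3 (an assumption; in the Python 2.2 of this thread the call is unicode(s, "utf-8") as shown above). The variable names are only for the example.

```python
# The six bytes a UTF-8 terminal sends for the three letters å, ä, ö.
raw = "\u00e5\u00e4\u00f6".encode("utf-8")

# Decoding turns the byte string into a Unicode string; this is the
# Python 3 spelling of unicode(s, "utf-8").
text = raw.decode("utf-8")

byte_count = len(raw)    # len() on a byte string counts bytes
char_count = len(text)   # len() on a Unicode string counts characters
print(byte_count, char_count)  # prints: 6 3
```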
> > Of course I can analyze how characters are represented in detail
> > and make some kind of workaround, but I think this is not the Python
> > way. In assembler or C I have to think of things like that but do I
> > have to do that in Python?
> If you use byte strings, yes. If you use Unicode strings, you can
> reverse the string on the character level.
> Of course, to print it on your terminal, you have to convert it back
> to the encoding your terminal uses, i.e.
> s = rev(s)
> print s.encode("utf-8")
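A minimal sketch of the whole round trip, again assuming Python 3 rather than the 2.2 of the thread. The rev function below is a hypothetical stand-in, since the one used in the thread is not shown.

```python
def rev(s):
    """Reverse a string one character at a time (hypothetical stand-in)."""
    return s[::-1]

raw = "\u00e5\u00e4\u00f6".encode("utf-8")   # bytes for åäö

# Decode, reverse on the character level, then encode again for output.
reversed_text = rev(raw.decode("utf-8"))
encoded = reversed_text.encode("utf-8")

# Reversing the raw bytes instead splits the two-byte sequences apart,
# so the result is no longer valid UTF-8.
try:
    raw[::-1].decode("utf-8")
    bytes_reversal_valid = True
except UnicodeDecodeError:
    bytes_reversal_valid = False

print(reversed_text, len(encoded), bytes_reversal_valid)
```

Note that this reverses code points; text that uses combining characters would need extra care.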