How to find number of characters in a unicode string?

Leo Kislov Leo.Kislov at gmail.com
Wed Oct 11 01:50:21 EDT 2006


Lawrence D'Oliveiro wrote:
> In message <pan.2006.09.18.20.29.20.510034 at gmx.net>, Marc 'BlackJack'
> Rintsch wrote:
>
> > In <20060918221814.08625ea2.randhol+valid_for_reply_from_news at pvv.org>,
> > Preben Randhol wrote:
> >
> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
> Hmmm, for some reason
>
>     len(u"C\u0327")
>
> returns 2.

That's because len() counts code points, and u"C\u0327" is a C followed
by a combining cedilla. If Python ever provides this functionality, I
guess it would be something like u"C\u0327".width() == 1. But it's not
clear when unicode.org will provide recommended fixed-width font column
information for *all* characters. I recently stumbled upon the Tamil
language, where for example
u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
look like they have widths of 1, 2, 3 and 4 columns. To add insult to
injury, each of these 4 sequences is considered a *single* letter :) If
your email reader is able to show them, here they are in all their glory:
க், கா, கொ, கௌ.
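
For what it's worth, here is a rough sketch of counting "letters" rather
than code points, using only the standard unicodedata module. It is
written for Python 3, where every str is a unicode string. The helper
name count_base_characters is made up for illustration. It simply skips
code points whose general category is a mark (Mn, Mc or Me), which is only
an approximation of what a reader sees as one letter, and it says nothing
about how many columns the text takes up on screen.

```python
import unicodedata


def count_base_characters(text):
    """Count code points that are not combining marks.

    Assumption: a "character" is one base code point plus any marks
    (general category Mn, Mc or Me) that follow it. This is not full
    grapheme cluster segmentation and not a display width.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


c_cedilla = "C\u0327"  # C followed by a combining cedilla
tamil = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc"]

# len() counts code points, so each sample is reported as 2.
code_point_counts = [len(s) for s in [c_cedilla] + tamil]

# Skipping marks gives one "letter" per sample.
letter_counts = [count_base_characters(s) for s in [c_cedilla] + tamil]

# NFC normalization happens to help for C + cedilla, because a
# precomposed code point (U+00C7) exists for that pair.
composed = unicodedata.normalize("NFC", c_cedilla)

if __name__ == "__main__":
    print(code_point_counts)
    print(letter_counts)
    print(len(composed))
```

Normalizing to NFC only helps where a precomposed code point exists, so
it is not a general answer, and neither approach tells you the column
width of the Tamil examples.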



