How to find number of characters in a unicode string?

Fri Sep 29 04:27:05 EDT 2006

At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:

> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
>Hmmm, for some reason
>
>     len(u"C\u0327")
>
>returns 2.

That's correct: the string contains two Unicode code points,
C and COMBINING CEDILLA (U+0327), which are displayed together as Ç.
From <http://en.wikipedia.org/wiki/Unicode>:

"Unicode takes the role of providing a unique 
code point — a number, not a glyph — for each 
character. In other words, Unicode represents a 
character in an abstract way, and leaves the 
visual rendering (size, shape, font or style) to 
other software [...] This simple aim becomes 
complicated, however, by concessions made by 
Unicode's designers, in the hope of encouraging a 
more rapid adoption of Unicode. [...] A lot of 
essentially identical characters were encoded 
multiple times at different code points to 
preserve distinctions used by legacy encodings 
and therefore allow conversion from those 
encodings to Unicode (and back) without losing 
any information. [...] Also, while Unicode allows 
for combining characters, it also contains 
precomposed versions of most letter/diacritic 
combinations in normal use. These make conversion 
to and from legacy encodings simpler and allow 
applications to use Unicode as an internal text 
format without having to implement combining 
characters. For example é can be represented in 
Unicode as U+0065 (Latin small letter e) followed 
by U+0301 (combining acute) but it can also be 
represented as the precomposed character U+00E9 
(Latin small letter e with acute)."
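
If what you want is a count closer to what a reader sees, one option is to normalize the string to its composed form (NFC) before calling `len()`. The sketch below uses the standard `unicodedata` module; the helper name `count_base_characters` is mine, purely for illustration, and it is only an approximation: it skips combining marks but does not implement full grapheme cluster segmentation.

```python
import unicodedata

# C followed by COMBINING CEDILLA: two code points, one visible glyph.
decomposed = u"C\u0327"

# NFC composes base letter + combining mark into a precomposed
# character when one exists (here U+00C7).
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))         # 2
print(len(composed))           # 1
print(composed == u"\u00c7")   # True


def count_base_characters(text):
    """Hypothetical helper: count code points that are not combining marks.

    unicodedata.combining() returns 0 for non-combining characters,
    so this works even when no precomposed form exists.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(count_base_characters(decomposed))   # 1
print(count_base_characters(u"e\u0301"))   # 1
```

Normalizing is enough when a precomposed character exists; the counting helper also covers sequences that have no precomposed equivalent.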

Gabriel Genellina
Softlab SRL 
