How to find number of characters in a unicode string?
Gabriel Genellina
gagsl-py at yahoo.com.ar
Fri Sep 29 04:27:05 EDT 2006
At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:
> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
>Hmmm, for some reason
>
> len(u"C\u0327")
>
>returns 2.
That's correct: these are two Unicode characters (code points),
C and combining cedilla (U+0327), which display together as Ç. From
<http://en.wikipedia.org/wiki/Unicode>:
"Unicode takes the role of providing a unique
code point — a number, not a glyph — for each
character. In other words, Unicode represents a
character in an abstract way, and leaves the
visual rendering (size, shape, font or style) to
other software [...] This simple aim becomes
complicated, however, by concessions made by
Unicode's designers, in the hope of encouraging a
more rapid adoption of Unicode. [...] A lot of
essentially identical characters were encoded
multiple times at different code points to
preserve distinctions used by legacy encodings
and therefore allow conversion from those
encodings to Unicode (and back) without losing
any information. [...] Also, while Unicode allows
for combining characters, it also contains
precomposed versions of most letter/diacritic
combinations in normal use. These make conversion
to and from legacy encodings simpler and allow
applications to use Unicode as an internal text
format without having to implement combining
characters. For example é can be represented in
Unicode as U+0065 (Latin small letter e) followed
by U+0301 (combining acute) but it can also be
represented as the precomposed character U+00E9
(Latin small letter e with acute)."
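
To make the distinction concrete, here is a minimal sketch using the standard
`unicodedata` module. It assumes Python 3 string literals (in the Python 2 of
this thread you would write `u"C\u0327"`), and the helper names are
hypothetical, chosen only for illustration. NFC normalization composes only
the sequences that have a precomposed form, and counting non-combining
characters is an approximation, so neither is a full "user-perceived
character" count.

```python
import unicodedata


def count_composed(text):
    """Length in code points after NFC normalization.

    NFC replaces a base letter plus combining mark with the precomposed
    code point whenever Unicode defines one.
    """
    return len(unicodedata.normalize("NFC", text))


def count_base_characters(text):
    """Count code points whose canonical combining class is 0.

    Combining marks such as U+0327 have a non-zero class and are skipped.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "C\u0327"                    # C followed by combining cedilla
print(len(decomposed))                    # 2 code points
print(count_composed(decomposed))         # 1 after composing to U+00C7
print(count_base_characters(decomposed))  # 1 base character

# The example from the quote: e + combining acute composes to U+00E9.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```

Which count is the right one depends on what the number is used for:
`len()` for indexing and slicing, one of the other two for something closer
to what a reader sees on screen.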
Gabriel Genellina
Softlab SRL