accessing individual characters in unicode strings

John Machin sjmachin at lexicon.net
Sat Apr 12 05:33:58 EDT 2008


On Apr 12, 3:45 pm, Peter Robinson <pe... at sd-editions.com> wrote:
> Dear list
> I am at my wits' end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of an
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
>
> Should be simple, yes?
> turns out to be near impossible.  I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple greek string is, and of
> course it is NOT three characters long.

Your text is three characters long, but its utf8-encoded incarnation is
six bytes long, and len() of an 8-bit str counts bytes, not characters.
utf-8 is not unicode.
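
To make the byte/character distinction concrete, here is a minimal sketch. It is written so that it should run unchanged on Python 2.6+ and Python 3 (that choice, and the names data and text, are mine). It uses the .decode() method, which does the same job as the unicode() call further down.

```python
# 'kai' in Greek, as the six utf8-encoded bytes that come out of the database.
data = b'\xce\xba\xce\xb1\xce\xb9'

# Decoding turns the bytes into a real unicode string of three characters.
text = data.decode('utf8')

byte_count = len(data)   # counts bytes
char_count = len(text)   # counts characters (code points)

print("%d bytes, %d characters" % (byte_count, char_count))
```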

>
> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr  = repr(unicode(thisword, 'iso-8859-7'))

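Don't do that. It goes wrong twice: the data is utf8, not iso-8859-7, so
the wrong codec is being used; and repr() hands back a printable
description of the string (quotes, backslashes and all), not the string
itself, so indexing into it gives rubbish. If you have a utf8 string,
convert it to unicode like this: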

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')
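
Picking the codec that matches the data matters. As a rough illustration (a sketch only; the names right and wrong are mine, and the code is meant to run on both Python 2.6+ and Python 3), here is what happens when the same six utf8 bytes are decoded with each codec:

```python
# The utf8-encoded bytes for the three-letter Greek word 'kai'.
data = b'\xce\xba\xce\xb1\xce\xb9'

# Correct codec: each two-byte sequence becomes one character.
right = data.decode('utf8')

# Wrong codec: iso-8859-7 is a one-byte-per-character encoding, so every
# byte is turned into a character of its own and the result is nonsense.
wrong = data.decode('iso-8859-7')

print("utf8 gives %d characters" % len(right))
print("iso-8859-7 gives %d characters" % len(wrong))
```

No exception is raised in the second case, which is why this kind of mistake tends to show up later as "rubbish" rather than as an error.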

Then inspect it like this:
print repr(ustr)

Here's a sample interactive session:

>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
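
Once you have a unicode object, stepping through it a character at a time is just an ordinary loop. The following is only a sketch of the kind of thing you describe: the WIDTHS table, DEFAULT_WIDTH and the numbers in them are made up for illustration, since I don't know where your real widths come from. It should run on Python 2.6+ and on Python 3.3+.

```python
import unicodedata

thisword = b'\xce\xba\xce\xb1\xce\xb9'   # utf8 bytes, as in the session above
ustr = thisword.decode('utf8')

# Hypothetical per-character widths; substitute your real values.
WIDTHS = {u'\u03ba': 9, u'\u03b1': 10, u'\u03b9': 5}
DEFAULT_WIDTH = 8   # assumed fallback for characters not in the table

# Indexing works per character on a unicode object.
first = ustr[0]

widths = []
for index, char in enumerate(ustr):
    width = WIDTHS.get(char, DEFAULT_WIDTH)
    widths.append((index, unicodedata.name(char), width))

for index, name, width in widths:
    print("%d %s width=%d" % (index, name, width))
```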

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John


