accessing individual characters in unicode strings
peter at sd-editions.com
Sat Apr 12 07:45:25 CEST 2008
I am at my wits' end over what seemed a very simple task:
I have some Greek text, nicely encoded in UTF-8, going in and out of an
XML database, being passed over and beautifully displayed on the web.
For example: the most common Greek word of all, 'kai' (or και if your
mailer can see UTF-8).
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each one).
Should be simple, yes?
It turns out to be near impossible. I tried using a simple index
character routine such as ustr..ustr... and this gives rubbish.
So I use len() to find out how long my simple Greek string is, and of
course it is NOT three characters long.
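One possible reading of the len() result, offered as a sketch rather than a diagnosis: if the string in hand is still a byte string holding UTF-8, len() counts bytes, and each of these three Greek letters takes two bytes. The example below is written for Python 3, where bytes and text are separate types; in the Python 2 of this post the same distinction is between str and unicode. The names word_text and word_bytes are made up for illustration.

```python
# 'kai' written with escapes so the example does not depend on file encoding.
word_text = "\u03ba\u03b1\u03b9"          # kappa, alpha, iota as text
word_bytes = word_text.encode("utf-8")     # the same word as raw UTF-8 bytes

print(len(word_bytes))  # 6: two bytes per Greek letter
print(len(word_text))   # 3: one item per character
```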
A day of intensive searching around the lists tells me that Unicode
and Python are a moving target: so many fixes are suggested for similar
problems, none apparently working with mine.
Here is the best I can do so far:
I convert the UTF-8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
for kai this gives the following:
So now things should be simple, yes? Just go through this and identify
Not so simple at all.
k, kappa: turns out to be TWO \u strings, not one: thus \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1
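The pairs listed above can be reproduced with a short experiment, assuming the input really is UTF-8 bytes that are then decoded as iso-8859-7. Under that assumption each two-byte UTF-8 sequence is read as two separate iso-8859-7 characters, which would account for two escapes per letter. This is a Python 3 sketch; the variable names are illustrative only.

```python
# 'kai' as the raw UTF-8 bytes that would come out of the database.
raw = "\u03ba\u03b1\u03b9".encode("utf-8")

wrong = raw.decode("iso-8859-7")   # every byte becomes its own character
right = raw.decode("utf-8")        # every two-byte sequence becomes one letter

print([hex(ord(c)) for c in wrong])
# ['0x39e', '0x38a', '0x39e', '0xb1', '0x39e', '0x389']
print([hex(ord(c)) for c in right])
# ['0x3ba', '0x3b1', '0x3b9']
```

The first list matches the \u039e\u038a, \u039e\xb1 and \u039e\u0389 pairs reported for kappa, alpha and iota, which suggests the codec name rather than the data is what needs changing.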
looking elsewhere in the record,
my particular favourite is the midpoint character: this comes out as
and in the middle of all this, there are some non-unicode characters:
\u039e\u038fc is o followed by c!
Well, I don't have many characters to deal with, and I could cope with
this mess by tedious matching character by character.
But surely, there is a better way...
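A minimal sketch of one such way, assuming the word arrives as UTF-8 bytes: decode it with the utf-8 codec, without repr(), and iterate over the resulting text one character at a time. The function char_widths and its widths table are hypothetical stand-ins for the "set a width attribute" step. This is Python 3; in Python 2 the decode step would be thisword.decode('utf-8') or unicode(thisword, 'utf-8').

```python
def char_widths(raw_bytes, widths, default=1.0):
    """Decode UTF-8 bytes and pair each character with a width.

    `widths` is a hypothetical lookup table {character: width};
    characters missing from it get `default`.
    """
    text = raw_bytes.decode("utf-8")
    return [(ch, widths.get(ch, default)) for ch in text]


# Usage: 'kai' as UTF-8 bytes, with a made-up width for kappa only.
kai = b"\xce\xba\xce\xb1\xce\xb9"
for ch, width in char_widths(kai, {"\u03ba": 0.6}):
    print(hex(ord(ch)), width)
```

Iterating over the decoded text yields exactly one item per letter, so no manual matching of escape sequences is needed.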
Peter Robinson: peter at sd-editions.com