accessing individual characters in unicode strings

Peter Robinson peter at sd-editions.com
Sat Apr 12 07:45:25 CEST 2008


Dear list,
I am at my wits' end over what seemed a very simple task.
I have some Greek text, nicely encoded in UTF-8, going in and out of an
XML database, being passed over and beautifully displayed on the web.
For example: the most common Greek word of all, 'kai' (or και if your
mailer can display UTF-8).
So all I want to do is step through this string a character at a time
and do something for each character (actually, set a width attribute
somewhere else for each character).
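
As a minimal sketch of the loop described above: it assumes the word is
already held as a properly decoded Unicode string of three code points,
which is exactly the part that is not working. The names set_width and
widths, and the width value of 1, are hypothetical placeholders and not
taken from the real application.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of the intended per-character loop.
# Assumption: `word` is a decoded Unicode string, one item per letter.
word = u'\u03ba\u03b1\u03b9'  # kappa, alpha, iota: the word 'και'

widths = []  # stand-in for the "width attribute somewhere else"

def set_width(ch, width=1):
    # Placeholder: record a width for one character.
    widths.append((ch, width))

for ch in word:
    set_width(ch)

print(len(widths))
```

With a correctly decoded string the loop body runs once per letter, which
is the behaviour wanted here.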

Should be simple, yes?
It turns out to be near impossible.  I tried simple character indexing,
such as ustr[0], ustr[1], ..., and this gives rubbish.
So I used len() to find out how long my simple Greek string is, and of
course it is NOT three characters long.
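
The length symptom can be reproduced in isolation with a short sketch. It
assumes the string being indexed holds the raw UTF-8 bytes rather than
decoded characters; the name raw is illustrative, and the bytes are built
in the script itself as a stand-in for what comes out of the database.

```python
# -*- coding: utf-8 -*-
# Assumption: the database hands back UTF-8 encoded bytes, not characters.
raw = u'\u03ba\u03b1\u03b9'.encode('utf-8')  # stand-in for the stored word

# Each of these Greek letters occupies two bytes in UTF-8,
# so the byte length is twice the number of letters.
byte_count = len(raw)
print(byte_count)

# Slicing off the first item yields one byte, i.e. half of kappa.
first_item = raw[0:1]
print(repr(first_item))
```

Under that assumption, indexing walks over bytes, so each index lands on
half a letter.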

A day of intensive searching around the lists tells me that Unicode
in Python is a moving target: many fixes are suggested for similar
problems, but none of them appears to work for mine.

Here is the best I can do so far.
I convert the UTF-8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
For kai this gives the following:

u'\u039e\u038a\u039e\xb1\u039e\u0389'

So now things should be simple, yes? Just go through this and identify
each character...

Not so simple at all.
kappa (k) turns out to be TWO \u strings, not one: \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1
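
The output above can be reproduced with the sketch below. It assumes
thisword holds the UTF-8 bytes of the word; raw stands in for it, and
raw.decode('iso-8859-7') is used as the equivalent of
unicode(raw, 'iso-8859-7') so that the same lines also run on newer
Pythons. The names expected and pairs are illustrative.

```python
# -*- coding: utf-8 -*-
# Assumption: `raw` plays the role of `thisword`, i.e. UTF-8 bytes of 'και'.
raw = u'\u03ba\u03b1\u03b9'.encode('utf-8')

# Decode the UTF-8 bytes with the ISO-8859-7 codec, as in the post.
# Every byte becomes one code point, so six bytes give six items.
ustr = raw.decode('iso-8859-7')

expected = u'\u039e\u038a\u039e\xb1\u039e\u0389'
print(ustr == expected)

# Group the result two by two: one pair per original letter.
pairs = [ustr[i:i + 2] for i in range(0, len(ustr), 2)]
print(len(pairs))
```

Each pair starts with \u039e because that is what the ISO-8859-7 codec
makes of the byte 0xCE, the first UTF-8 byte of all three letters.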

Looking elsewhere in the record, my particular favourite is the
midpoint character: this comes out as \u03b1\x90\xa7 !
And in the middle of all this, there are some plain, unescaped
characters: \u039e\u038fc is o followed by c!

Well, I don't have many characters to deal with, and I could cope with
this mess by tedious matching, character by character.
But surely there is a better way...
Help, please.


Peter Robinson: peter at sd-editions.com




