Unicode characters
Diez B. Roggisch
deets at nospam.web.de
Mon Sep 4 10:07:52 EDT 2006
Paul Johnston wrote:
> Hi
> I have a string which I convert into a list then read through it
> printing its glyph and numeric representation
>
> #-*- coding: utf-8 -*-
>
> thestring = "abcd"
> thelist = list(thestring)
>
> for c in thelist:
>     print c,
>     print ord(c)
>
> Works fine for latin characters but when I put in a unicode character
> a two byte character gives me two characters. For example an arabic
> alef returns
>
> * 216
> * 167
>
> (the first asterisk is the empty set symbol, the second a double "s")
>
> Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
> sequential listings, i.e.
> 216 167
> 216 168
> 216 169
> So it is reading the correct details.
>
>
> Is there any way to get the c in the for loop to recognise that it is
> reading a multiple byte character?
> I have followed the info in PEP 0263 and am using Python 2.4.3 Build
> 12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2
Use unicode objects instead of byte strings. The string literal above is
_not_ turned into text by the coding: header.
Among literals, the header only matters for
u"some text"
literals, which it decodes into unicode objects.
Normal string literals are just bytes: because your editor saves the file
in the declared encoding, a multibyte character you type is stored as its
individual bytes, and the for loop walks over those bytes one at a time.
That is why alef shows up as 216 and 167.
In a nutshell: try the above using u"abcd".
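A minimal sketch of the difference, assuming the three characters from your
example (alef, beh, teh marbuta). The variable names are made up for
illustration, and \u escapes are used so the result does not depend on how the
file is saved. The round trip at the end assumes a byte string that really is
UTF-8:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; the names below are hypothetical, not from the original post.

# alef, beh, teh marbuta as a unicode object (one item per character)
text = u"\u0627\u0628\u0629"

# Iterating over the unicode object yields whole characters.
code_points = [ord(c) for c in text]

# The same text as UTF-8 bytes, which is what a plain "..." literal holds.
# bytearray is used so that iteration yields integers on any Python version.
utf8_bytes = list(bytearray(text.encode("utf-8")))

# If you already have the byte string, decode it to get characters back.
decoded = bytes(bytearray(utf8_bytes)).decode("utf-8")

print(code_points)  # three numbers, one per character
print(utf8_bytes)   # six numbers, two bytes per character
```

The byte values are the 216 167, 216 168, 216 169 pairs you saw; the unicode
object gives one number per character instead.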
Diez