the unicode saga continues...
doomster at knuut.de
Sat Nov 14 08:32:07 CET 2009
Ethan Furman wrote:
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
> (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'\xed'
> >>> print u'\xed'.encode('cp437')
> >>> print u'\xed'.encode('cp850')
> >>> print u'\xed'.encode('cp1252')
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
> My confusion lies in my apparent codepage (cp1252), and the discrepancy
> with character u'\xed' which is absolutely an i with an accent; yet when
> I encode with cp1252 and print it, I get an o with a line.
For the record: I read a small Greek letter phi in your posting, not an o
with a line. If I encode according to my default locale (UTF-8), I get the
letter i with the accent. If I encode with codepage 1252, I get a marker for
an invalid character on my terminal. This is using Debian though, not MS
Windows.
Try printing the repr() of that. The point is that internally, you have the
codepoint U+00ED (u'\xed'). You then encode it with various codepages, each
of which yields a string of bytes representing that character ('\xa1',
'\xa1' and '\xed'), useful for storing on disk when the file uses said
codepage, or for other forms of IO.
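To make the byte values above concrete, here is a minimal sketch. It is
written in Python 3 syntax so that it runs on a current interpreter (an
assumption on my part; in the Python 2.5 session quoted above you would
write u'\xed' and use the print statement). The dict name "encoded" is just
for illustration.

```python
# Python 3 syntax; the quoted session used Python 2.5 (u'\xed', print statement).
char = '\xed'  # U+00ED, LATIN SMALL LETTER I WITH ACUTE

# Encode the same codepoint with each codepage and look at the raw bytes.
encoded = {}
for codepage in ('cp437', 'cp850', 'cp1252'):
    encoded[codepage] = char.encode(codepage)
    # repr() shows the bytes themselves, independent of any terminal setting.
    print(codepage, repr(encoded[codepage]))
```

The two DOS codepages give the byte 0xA1, while cp1252 gives 0xED.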
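Now, with a Unicode string, the output (print) knows what to do: it encodes
it according to the encoding of stdout (normally taken from the locale or
the console codepage) and sends the resulting bytes to stdout. With a byte
string, I think it directly forwards the content to stdout, so whatever
displays those bytes interprets them in its own codepage.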
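A rough way to see what happens to forwarded bytes is to decode them the way
different terminals would. This is only a sketch in Python 3 syntax; that
the Windows console in question uses cp437 is my assumption, not something
the session confirms. The names "raw" and "shown" are illustrative.

```python
# Bytes produced by encoding U+00ED with cp1252.
raw = '\xed'.encode('cp1252')  # b'\xed'

# Simulate terminals that interpret the same byte in different encodings.
shown = {}
for terminal_encoding in ('cp1252', 'cp437', 'utf-8'):
    # errors='replace' stands in for a terminal's "invalid character" marker.
    shown[terminal_encoding] = raw.decode(terminal_encoding, errors='replace')
    # ascii() keeps the output printable whatever stdout's own encoding is.
    print(terminal_encoding, ascii(shown[terminal_encoding]))
```

Under cp437 the byte 0xED is the Greek small letter phi (U+03C6), and as
UTF-8 it is an incomplete sequence, which would be consistent with the phi
and the invalid-character marker mentioned above.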
* If you want to verify your code, rather use 'print repr(..)'.
* I could imagine that your locale is simply not set up correctly.
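To check the second point, one can compare what the locale reports with what
stdout actually uses. A minimal sketch, again in Python 3 syntax; both
attributes also exist in Python 2, where sys.stdout.encoding can be None
when the output is redirected. The variable names are illustrative.

```python
import locale
import sys

# Encoding that print uses when writing text to stdout.
stdout_encoding = sys.stdout.encoding
# Encoding the locale settings suggest for text data.
preferred_encoding = locale.getpreferredencoding()

print('stdout encoding:   ', repr(stdout_encoding))
print('preferred encoding:', repr(preferred_encoding))
```

If the two differ, bytes encoded for one may be displayed under the other.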