the unicode saga continues...

Sat Nov 14 01:52:16 EST 2009

"Ethan Furman" <ethan at stoneleaf.us> wrote in message 
news:4AFE4141.4020102 at stoneleaf.us...
> So I've added unicode support to my dbf package, but I also have some 
> rather large programs that aren't ready to make the switch over yet.  So 
> as a workaround I added a (rather lame) option to convert the 
> unicode-ified data that was decoded from the dbf table back into an 
> encoded format.
>
> Here's the fun part:  in figuring out what the option should be for use 
> with my system, I tried some tests...
>
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> print u'\xed'
> í
> >>> print u'\xed'.encode('cp437')
> í
> >>> print u'\xed'.encode('cp850')
> í
> >>> print u'\xed'.encode('cp1252')
> φ
> >>> import locale
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
>
> My confusion lies in my apparent codepage (cp1252), and the discrepancy 
> with character u'\xed' which is absolutely an i with an accent; yet when I 
> encode with cp1252 and print it, I get an o with a line.
>
> Can anybody clue me in to what's going on here?

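Yes.  Your console window actually uses cp437, not cp1252.  cp850 happens 
to encode that character as the same byte as cp437 (0xA1), so both display 
correctly.  cp1252 encodes it as byte 0xED, which a cp437 console displays 
as φ.  cp1252 is the default Windows encoding (what Notepad uses, for 
example), but it is not what the console uses: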

Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'cp1252')
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print u'\xed'.encode('cp437')
í
>>> print u'\xed'.encode('cp1252')
φ
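
To see the same thing without depending on what any particular console 
does, here is a minimal sketch that encodes the character with each codec 
and then decodes the resulting byte the way a cp437 console would.  The 
helper name as_seen_on_console is made up for illustration, and cp437 as 
the console encoding is assumed from the sys.stdout.encoding value above. 
It only prints ASCII, and should run on Python 2.6+ as well as Python 3:

```python
def as_seen_on_console(text, data_codec, console_codec='cp437'):
    """Encode text with data_codec, then decode those bytes the way a
    console using console_codec would interpret them.

    Hypothetical helper for illustration only.
    """
    raw = text.encode(data_codec)
    return raw, raw.decode(console_codec)


for codec in ('cp437', 'cp850', 'cp1252'):
    raw, shown = as_seen_on_console(u'\xed', codec)
    # bytearray() yields integers on both Python 2 and Python 3
    print('%-6s byte=0x%02X displayed as U+%04X'
          % (codec, bytearray(raw)[0], ord(shown)))
```

U+00ED is the accented i and U+03C6 is the Greek small letter phi, so only 
the byte produced by cp1252 is misread by a cp437 console.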

-Mark