incorrect upper()/lower() of UTF-8
Jim Henry
jamespp35 at yahoo.com
Fri Jun 28 15:45:25 EDT 2002
It appears that Python does not uppercase/lowercase UTF-8 strings properly:
char   latin-1   utf-16   utf-8
ñ      F1        00F1     C3B1
Ñ      D1        00D1     C391
# create unicode/UTF-16 string with ñ
>>> s = u"set description to mañana"
>>> s
u'set description to ma\xf1ana' <= correct latin-1 or UTF-16 (F1)
# encode it as utf-8
>>> s8 = s.encode("utf8")
>>> s8
'set description to ma\xc3\xb1ana' <= correct UTF-8 (C3B1)
# upshift the UTF-16 string
>>> s.upper()
u'SET DESCRIPTION TO MA\xd1ANA' <= correct latin-1 or UTF-16 (D1)
# upshift the UTF-8 string
>>> s8.upper()
'SET DESCRIPTION TO MA\xc3\xb1ANA' <= INCORRECT: should be C391
# upshift the UTF-8 string and convert back to UTF-16
>>> t = s8.upper()
>>> t
'SET DESCRIPTION TO MA\xc3\xb1ANA'
>>> u = unicode(t, "utf-8")
>>> u.encode("latin-1")
'SET DESCRIPTION TO MA\xf1ANA' <= INCORRECT, will print as "MAñANA" instead of "MAÑANA"
Am I missing something??
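For comparison, here is a minimal sketch of a round trip that does produce C391: decode the UTF-8 bytes to a Unicode string, upshift that, then encode again. The helper name upper_utf8 is hypothetical and not part of Python. The sketch assumes that upper() on a byte string changes only ASCII letters, which is what the session above appears to show.

```python
def upper_utf8(data):
    """Hypothetical helper: uppercase UTF-8 encoded bytes via a Unicode round trip."""
    # Case mapping is applied to characters, not to the encoded bytes.
    return data.decode("utf-8").upper().encode("utf-8")

s = u"set description to ma\xf1ana"
s8 = s.encode("utf-8")

naive = s8.upper()      # byte-wise: \xc3\xb1 is left untouched
fixed = upper_utf8(s8)  # character-wise: \xc3\xb1 becomes \xc3\x91

print(repr(naive))
print(repr(fixed))
```

With this approach the byte string is only ever treated as encoded data, and all case changes happen on the Unicode string.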