incorrect upper()/lower() of UTF-8

Jim Henry jamespp35 at yahoo.com
Fri Jun 28 15:45:25 EDT 2002


It appears that Python does not uppercase/lowercase UTF-8 strings properly:

      char     latin-1     utf-16     utf-8

      ñ        F1          00F1       C3B1
      Ñ        D1          00D1       C391
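
The byte values in the table can be reproduced with a short script. This is only a sketch in Python 3 syntax (an assumption, since the session below is Python 2); `encoded_hex` is a hypothetical helper name, and "utf-16-be" is used so that no byte-order mark is prepended to the UTF-16 value.

```python
# Hypothetical helper (not part of the original post): hex bytes of one
# character in each of the three encodings from the table.
ENCODINGS = ("latin-1", "utf-16-be", "utf-8")

def encoded_hex(ch):
    # bytes.hex() returns lowercase hex digits; upper() matches the table.
    return [ch.encode(enc).hex().upper() for enc in ENCODINGS]

for ch in ("\u00f1", "\u00d1"):  # ñ and Ñ
    print(ch, *encoded_hex(ch))
```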


# create unicode/UTF-16 string with ñ
>>> s = u"set description to mañana"
>>> s
u'set description to ma\xf1ana'                   <= correct latin-1 or UTF-16 (F1)

# encode it as utf-8
>>> s8 = s.encode("utf8")
>>> s8
'set description to ma\xc3\xb1ana'             <= correct UTF-8 (C3B1)

# upshift the UTF-16 string
>>> s.upper()
u'SET DESCRIPTION TO MA\xd1ANA'       <= correct latin-1 or UTF-16 (D1)

# upshift the UTF-8 string
>>> s8.upper()
'SET DESCRIPTION TO MA\xc3\xb1ANA'  <= INCORRECT: should be C391

# upshift the UTF-8 string and convert back to UTF-16

>>> t = s8.upper()
>>> t
'SET DESCRIPTION TO MA\xc3\xb1ANA'
>>> u = unicode(t, "utf8")
>>> u.encode("latin-1")
'SET DESCRIPTION TO MA\xf1ANA'         <= INCORRECT, will print as "MAñANA" instead of "MAÑANA"
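
For comparison, here is a minimal sketch of the same experiment in Python 3 syntax (an assumption; the session above is Python 2). It suggests that upper() on an encoded byte string only touches the ASCII letters, while decoding first, upshifting, and re-encoding gives C391. `upper_utf8` is a hypothetical helper name, not an existing library function.

```python
s = "set description to ma\u00f1ana"   # text string containing ñ
s8 = s.encode("utf-8")                 # the same text as UTF-8 bytes

# Upshifting the raw bytes: only ASCII a-z are mapped, so the two
# bytes C3 B1 that encode ñ pass through unchanged.
raw_upper = s8.upper()

def upper_utf8(data):
    # Hypothetical helper: decode to text, upshift, re-encode as UTF-8.
    return data.decode("utf-8").upper().encode("utf-8")

fixed = upper_utf8(s8)

print(raw_upper)
print(fixed)
```

The design choice in the sketch is to do all case conversion on the decoded text and treat the UTF-8 form purely as a storage or transport format.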

Am I missing something??




