UnicodeEncodeError in compile

jmfauth wxjmfauth at gmail.com
Tue Jan 10 04:42:21 EST 2012


1) If I copy/paste these CJK chars from Google Groups into two of my
interactive interpreters (no "dos/cmd console"), I have no problem.

>>> import unicodedata as ud
>>> ud.name('工')
'CJK UNIFIED IDEOGRAPH-5DE5'
>>> ud.name('具')
'CJK UNIFIED IDEOGRAPH-5177'
>>> hex(ord(('工')))
'0x5de5'
>>> hex(ord('具'))
'0x5177'
>>>

2) It seems the mbcs codec has some difficulties with
these chars.

>>> '\u5de5'.encode('mbcs')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position
0--1: invalid character
>>> '\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
>>> '\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'
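
A minimal sketch of the same check that also runs outside Windows. The
mbcs codec exists only on Windows, where it maps to the ANSI code page.
Here cp1252 stands in for it, which is an assumption about the machine's
locale, and the helper name can_encode is made up for illustration.

```python
def can_encode(text, codec):
    """Return True if `text` can be encoded with `codec` without error."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

ch = '\u5de5'  # CJK UNIFIED IDEOGRAPH-5DE5

# cp1252 is assumed here as a stand-in for the Windows ANSI code page
# behind 'mbcs'; the real code page depends on the system locale.
for codec in ('cp1252', 'utf-8', 'utf-32-be'):
    print(codec, can_encode(ch, codec))
```

A single-byte code page such as cp1252 has no slot for this character,
while the UTF codecs can encode any code point.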

3) Whether mbcs should be used in file I/O is a question for the core devs.
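
For what it is worth, a small sketch of how file I/O can sidestep the
platform default. As far as I know, open() without an encoding argument
falls back to the locale's preferred encoding, which on Windows is the
ANSI code page. The scratch file name below is hypothetical and utf-8 is
just one reasonable choice.

```python
import os
import tempfile

text = '\u5de5\u5177'  # the two CJK characters from point 1

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'cjk.txt')

# Passing encoding= explicitly avoids whatever default the platform picks.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, 'r', encoding='utf-8') as f:
    roundtrip = f.read()

print(roundtrip == text)
```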

My conclusion.
The bottleneck is on the mbcs side.

jmf



