How to Split Chinese Character with backslash representation?

J. Clifford Dyer jcd at sdf.lonestar.org
Fri Oct 27 23:36:37 CEST 2006


Paul McGuire wrote:
> "Wijaya Edward" <ewijaya at i2r.a-star.edu.sg> wrote in message 
> news:mailman.1319.1161920633.11739.python-list at python.org...
>> Hi all,
>>
>> I was trying to split a string that
>> represent chinese characters below:
>>
>>
>>>>> str = '\xc5\xeb\xc7\xd5\xbc'
>>>>> fields2 = split(r'\\',str)
> 
> There are no backslash characters in the string str, so split finds nothing 
> to split on.  I know it looks like there are, but the backslashes shown are 
> part of the \x escape sequence for defining characters when you can't or 
> don't want to use plain ASCII characters (such as in your example in which 
> the characters are all in the range 0x80 to 0xff). 

Moreover, look at what you are splitting on.  The raw string r'\\' 
contains TWO backslash characters; if your split is re.split, that 
pattern matches a single literal backslash, and as a plain separator it 
would need two in a row.  Either way there is nothing to match.  It looks 
like you want to treat str as a raw string to get at the slashes, but it 
isn't a raw string, and I don't think you can directly convert it to one. 
If you want the numeric values of each byte, you can do the following:

Py >>> char_values = [ ord(c) for c in str ]
Py >>> char_values
[197, 235, 199, 213, 188]
Py >>>

Note that those numbers are decimal equivalents of the hex values given 
in your string, but are now in integer format.
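
To see why there is nothing to split on, it may help to compare the 
escaped literal with a raw one.  This is only a rough sketch, written in 
Python 3 syntax (where byte strings are spelled b'...'); the names data 
and raw are mine, not from the original post:

```python
# Python 3 syntax; `data` and `raw` are illustrative names, not from the post.
data = b'\xc5\xeb\xc7\xd5\xbc'    # five bytes; the \x escapes are resolved by the parser
raw = rb'\xc5\xeb\xc7\xd5\xbc'    # raw literal: 20 bytes, including five real backslashes

print(len(data), b'\\' in data)   # 5 False
print(len(raw), b'\\' in raw)     # 20 True

# Splitting the raw form on a single backslash does give the hex digits back.
print(raw.split(b'\\'))           # [b'', b'xc5', b'xeb', b'xc7', b'xd5', b'xbc']

# The byte values of the real string, as integers and as hex text.
char_values = list(data)                  # iterating bytes yields ints in Python 3
hex_values = ['%02x' % b for b in data]   # ['c5', 'eb', 'c7', 'd5', 'bc']
```

So the backslashes only exist in the source text of the literal, never in 
the string object itself.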

On the other hand, you may want to use str.decode('gbk') (or whatever 
your encoding is) so that you're actually dealing with characters rather 
than bytes:

Py >>> str.decode('gbk')

Traceback (most recent call last):
   File "<pyshell#29>", line 1, in -toplevel-
     str.decode('gbk')
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbc in position 4: 
incomplete multibyte sequence
Py >>> str[0:4].decode('gbk')
u'\u70f9\u94a6'

Py >>> print str[0:4].decode('gbk')
烹钦
Py >>> print str[0:4]
ÅëÇÕ

OK, so gbk choked on the odd byte at the end: the first four bytes decode 
as two two-byte characters, which leaves 0xbc on its own.  Maybe you need 
a different encoding, or maybe your string got truncated somewhere along 
the line.
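
If the string really was truncated and you just want whatever decodes 
cleanly, one possible approach is to fall back to a lenient error handler. 
This is a sketch in Python 3 syntax, assuming gbk is the right codec 
(which I can't confirm from here); data is again my own name for the 
five bytes:

```python
# Python 3 syntax; `data` is an illustrative name for the five bytes in the post.
data = b'\xc5\xeb\xc7\xd5\xbc'

try:
    text = data.decode('gbk')                   # strict decoding: fails on the dangling 0xbc
except UnicodeDecodeError as exc:
    print('strict decode failed:', exc.reason)
    text = data.decode('gbk', errors='ignore')  # drops bytes that cannot be decoded

print(text)        # the two characters from the first four bytes
print(len(text))   # 2
```

Bear in mind that errors='ignore' silently hides the truncation, so it is 
only sensible once you know the trailing byte is junk rather than half of 
a real character.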

Cheers,
Cliff


