decode unicode string using 'unicode_escape' codecs

aurora aurora00 at gmail.com
Thu Jan 12 21:59:47 EST 2006


I have a unicode string in which some characters are encoded using Python
escape notation, such as '\n' for LF. I need to convert those escapes to
the actual characters. There is a 'unicode_escape' codec that seems to
suit my purpose.

>>> encoded = u'A\\nA'
>>> decoded = encoded.decode('unicode_escape')
>>> print len(decoded)
3

Note that both encoded and decoded are unicode strings. I'm trying to use
the builtin codec because I assume it performs better than a decoder I
would write in pure Python. But I'm not converting between a byte string
and a unicode string.

However, it runs into a problem in some cases.

encoded = u'€\\n€'
decoded = encoded.decode('unicode_escape')

Traceback (most recent call last):
   File "g:\bin\py_repos\mindretrieve\trunk\minds\x.py", line 9, in ?
     decoded = encoded.decode('unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in  
position 0: ordinal not in range(128)

Reading the documentation more carefully, I found out what happened.
decode('unicode_escape') takes a byte string as its operand and converts
it into a unicode string. Since encoded is already unicode, it is first
implicitly converted to a byte string using the 'ascii' encoding. In this
case that fails because of the '€' character.
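
The implicit step can be reproduced on its own. This is a minimal sketch,
assuming the implicit conversion behaves like an explicit encode('ascii')
call; the names failed and reason are only for illustration.

```python
# -*- coding: utf-8 -*-
# u'\u20ac' is the euro sign; '\\n' is a literal backslash followed by 'n'.
encoded = u'\u20ac\\n\u20ac'

try:
    # Assumed equivalent of the hidden unicode -> byte string conversion.
    encoded.encode('ascii')
    failed = False
    reason = ''
except UnicodeEncodeError as exc:
    failed = True
    reason = str(exc)

print(failed)
print(reason)
```

The error is raised before 'unicode_escape' ever sees the data, which
matches the traceback above.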

So I resigned myself to the fact that 'unicode_escape' doesn't do what I
want. But thinking about it more deeply, I came up with this Python source
code. It runs OK and outputs 3.

---------------------------------
# -*- coding: utf-8 -*-
print len(u'€\n€')  # 3
---------------------------------

Think about what happens in the second line. First the parser decodes the
bytes into a unicode string using the UTF-8 encoding. Then it applies the
syntax rules to decode the characters '\n' to LF. The second step is what
I want. There must be something available to the Python interpreter that
is not available to the user. So is there something I have overlooked?
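
To illustrate the two steps, here is a rough sketch that mimics them by
hand. It assumes ast.literal_eval (standard library, available only in
later Python versions) applies the same string-literal rules as the
parser; source_bytes, text and value are illustrative names. It is not a
general solution, since it breaks as soon as the text contains a quote
character.

```python
import ast

# The bytes of the literal u'€\n€' as they would sit in a UTF-8 source file:
# the euro sign is E2 82 AC, and '\\n' is a backslash followed by 'n'.
source_bytes = b"u'\xe2\x82\xac\\n\xe2\x82\xac'"

# Step 1: decode the source bytes with the declared encoding.
text = source_bytes.decode('utf-8')

# Step 2: apply the string-literal syntax rules to the decoded text.
value = ast.literal_eval(text)

print(len(value))
```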

Anyway I just want to leverage the builtin codecs for performance. I  
figure this would be faster than

   encoded.replace('\\n', '\n')
   ...and so on...

If there are other suggestions, they would be greatly appreciated :)
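
One possible way to stay within the builtin codecs, offered as a sketch
rather than a definitive answer: encode to a byte string with the
'backslashreplace' error handler first, so that characters outside Latin-1
become \uXXXX escapes, and then let 'unicode_escape' decode everything in
one pass. This relies on 'unicode_escape' treating non-escape bytes as
Latin-1. decode_escapes is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-

def decode_escapes(text):
    """Interpret Python-style escapes inside a unicode string (sketch)."""
    # Characters that do not fit in Latin-1 are rewritten as \uXXXX or
    # \UXXXXXXXX; existing backslash sequences are left untouched.
    raw = text.encode('latin-1', 'backslashreplace')
    # unicode_escape then decodes both the original and the generated escapes.
    return raw.decode('unicode_escape')

decoded = decode_escapes(u'\u20ac\\n\u20ac')
print(len(decoded))
```

With this, the euro sign survives the round trip and the '\n' escape
becomes a real LF. I have not measured whether it is actually faster than
chained replace() calls.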

wy
	


