converting html escape sequences to unicode characters

Craig Ringer craig at
Fri Dec 10 09:09:44 CET 2004

On Fri, 2004-12-10 at 08:36, harrelson wrote:
> I have a list of about 2500 html escape sequences (decimal) that I need
> to convert to utf-8.  Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
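
For comparison, a minimal Python 3 sketch of the same idea, assuming each input is a well-formed decimal reference of the form `&#NNNNN;`. The name `decode_decimal_entity` is made up for illustration. `chr()` maps the integer straight to a character, so no intermediate `\u` escape is needed, and code points with fewer than four hex digits (which the `"\\u%x"` trick would reject as a truncated escape) are not a problem.

```python
def decode_decimal_entity(entity):
    """Turn a decimal character reference such as '&#48708;' into its character."""
    if not (entity.startswith("&#") and entity.endswith(";")):
        raise ValueError("not a decimal character reference: %r" % entity)
    # Strip the leading '&#' and trailing ';', then parse the decimal code point.
    return chr(int(entity[2:-1]))


char = decode_decimal_entity("&#48708;")
print(char)                  # the character U+BE44
print(char.encode("utf-8"))  # its UTF-8 encoding as a bytes object
```

Hexadecimal references such as `&#xBE44;` are deliberately not handled here; `int()` raises `ValueError` for them.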

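>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']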
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
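
The same conversion applied to running text rather than a list, as a hedged Python 3 sketch: `re.sub` replaces each decimal reference in place, and the result is then encoded to UTF-8, which is what the original question asked for. The helper name `decode_decimal_entities` and the sample string are illustrative only. The standard library's `html.unescape` also handles named and hexadecimal references and is usually the simpler choice.

```python
import html
import re

# Matches decimal references only, e.g. '&#48708;' (not '&#xBE44;' or '&amp;').
_DECIMAL_REF = re.compile(r"&#(\d+);")


def decode_decimal_entities(text):
    """Replace every decimal character reference in text with its character."""
    return _DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)


# Decimal references for the first three characters of the example.
sample = "&#48708;&#54665;&#44592;"
decoded = decode_decimal_entities(sample)
utf8_bytes = decoded.encode("utf-8")

print(decoded)
print(utf8_bytes)

# html.unescape gives the same result for this input.
assert html.unescape(sample) == decoded
```

Text that contains no decimal references, including named ones like `&amp;`, passes through this helper unchanged.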

Craig Ringer
