Special chars with HTMLParser
Piet van Oostrum
piet at cs.uu.nl
Wed Aug 5 14:28:15 CEST 2009
>>>>> Fafounet <fafounet at gmail.com> (F) wrote:
>F> I am parsing a web page with special chars such as &#xe9; (which
>F> stands for é).
>F> I know I can have the unicode character é from unicode
>F> but with those extra characters I don't know.
>F> I tried to implement handle_charref within HTMLParser without success.
>F> Furthermore, if I have the data ab&#xe9;cd, handle_data will get "ab",
>F> handle_charref will get xe9 and then handle_data doesn't have the end
>F> of the string ("cd").
The character references indicate Unicode ordinals, not iso-8859-1
characters. In your example, decoding as iso-8859-1 gives the proper
character because iso-8859-1 coincides with the first 256 Unicode
ordinals, but for characters outside iso-8859-1 that approach will fail.
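
A small sketch of that point, written in Python 3 syntax (an assumption on my part; there chr() plays the role of Python 2's unichr()). The euro sign is only an illustrative choice of a character outside iso-8859-1:

```python
# 0xE9 is the Unicode ordinal of é and also its iso-8859-1 byte value,
# so both interpretations happen to agree for this character.
e_acute = chr(0xE9)
latin1_bytes = e_acute.encode('iso-8859-1')

# 0x20AC (euro sign) is a valid Unicode ordinal, as in &#x20AC;,
# but it has no representation in iso-8859-1 at all.
euro = chr(0x20AC)
try:
    euro.encode('iso-8859-1')
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(repr(e_acute), latin1_bytes, euro_fits_latin1)
```

So the number in a reference is best turned into a character directly, without going through an 8-bit encoding.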
This should give you an idea:
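from htmlentitydefs import name2codepoint

    # These are methods of your HTMLParser subclass.
    def handle_charref(self, name):
        # name is 'xe9' for &#xe9; and '233' for &#233;
        if name[0] in 'xX':
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        print 'char:', unichr(name2codepoint[name])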
If your HTML may be illegal (an unknown entity name, or a number that is
not a valid Unicode ordinal) you should add some exception handling.
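
For what it is worth, here is a minimal sketch of how the pieces could fit together, including the exception handling. It assumes Python 3, where the modules are html.parser and html.entities and where convert_charrefs=False is needed for handle_charref and handle_entityref to be called at all. The class name TextCollector, its text() method and the choice to keep illegal references as literal text are hypothetical, made up for this illustration:

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: reassembles text, charrefs and entities."""

    def __init__(self):
        # With the Python 3 default (convert_charrefs=True) the parser
        # decodes references itself and never calls the two handlers below.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of plain text, before and after a reference.
        self.parts.append(data)

    def handle_charref(self, name):
        # name is 'xe9' for &#xe9; and '233' for &#233;
        try:
            if name[:1] in ('x', 'X'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.parts.append(chr(num))
        except (ValueError, OverflowError):
            # Not a valid Unicode ordinal: keep the reference as written.
            self.parts.append('&#%s;' % name)

    def handle_entityref(self, name):
        try:
            self.parts.append(chr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the reference as written.
            self.parts.append('&%s;' % name)

    def text(self):
        return ''.join(self.parts)


parser = TextCollector()
parser.feed('ab&#xe9;cd &eacute; &#8364; &bogus;')
parser.close()
result = parser.text()
print(result)
```

In this sketch handle_data is called once with "ab" and again with "cd", so collecting the pieces in a list and joining them at the end gives back the complete string.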
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org