Special chars with HTMLParser

Fafounet fafounet at gmail.com
Wed Aug 5 09:03:46 EDT 2009


Thank you, now I can get the correct character.

Now when I have the string abécd, I get ab, then é (thanks to your
function), and then cd, each as a separate piece. But how can I tell
that cd is still part of the same word?


Fabien


> The character references indicate Unicode ordinals, not iso-8859-1
> characters. In your example it will give the proper character because
> iso-8859-1 coincides with the first part of the Unicode ordinals, but
> for characters outside of iso-8859-1 it will fail.
>
> This should give you an idea:
>
> from htmlentitydefs import name2codepoint
> ...
>     def handle_charref(self, name):
>         if name.startswith('x'):
>             num = int(name[1:], 16)
>         else:
>             num = int(name, 10)
>         print 'char:', repr(unichr(num))
>
>     def handle_entityref(self, name):
>         print 'char:', unichr(name2codepoint[name])
>
> If your HTML may be illegal you should add some exception handling.
> --
> Piet van Oostrum <p... at cs.uu.nl>
> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
> Private email: p... at vanoostrum.org
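
On the "same word" question: one possible approach, offered only as a
sketch, is to stop treating each callback as a separate piece and instead
append everything (plain data, character references, entity references) to
one buffer, flushing it only when a tag is seen. The class name
WordCollector and its buffer, runs and flush members are made up for this
illustration. The sketch assumes Python 3, where the modules are
html.parser and html.entities and chr replaces unichr; the quoted code
above is Python 2. It also assumes that whitespace is what separates words.

```python
# Python 3 sketch; the quoted code above targets Python 2.
from html.parser import HTMLParser
from html.entities import name2codepoint


class WordCollector(HTMLParser):
    """Hypothetical helper: joins data, character references and entity
    references into one buffer so a word is not split at an entity."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref
        # active; by default Python 3 decodes references before handle_data.
        super().__init__(convert_charrefs=False)
        self.buffer = []  # pieces of the current text run
        self.runs = []    # completed text runs, one per stretch between tags

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_charref(self, name):
        # name is e.g. '233' (decimal) or 'xe9' (hexadecimal)
        if name.lower().startswith('x'):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        self.buffer.append(chr(num))

    def handle_entityref(self, name):
        # Unknown entity names are kept verbatim instead of raising KeyError.
        if name in name2codepoint:
            self.buffer.append(chr(name2codepoint[name]))
        else:
            self.buffer.append('&%s;' % name)

    def flush(self):
        # Close the current text run, if there is one.
        if self.buffer:
            self.runs.append(''.join(self.buffer))
            self.buffer = []

    def handle_starttag(self, tag, attrs):
        self.flush()

    def handle_endtag(self, tag):
        self.flush()

    def close(self):
        super().close()
        self.flush()


parser = WordCollector()
parser.feed('<p>ab&eacute;cd ef&#233;g</p>')
parser.close()

# Split each run on whitespace only after the references were joined in.
words = [w for run in parser.runs for w in run.split()]
print(words)  # ['abécd', 'efég']
```

Because cd is appended to the same buffer as ab and é, it ends up in the
same string, and splitting on whitespace afterwards keeps abécd together.
Flushing on every tag is a simplification: inline tags such as <b> in the
middle of a word would still split it, so a real parser might flush only
on block-level tags.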



