Special chars with HTMLParser
Fafounet
fafounet at gmail.com
Wed Aug 5 09:03:46 EDT 2009
Thank you, now I get the correct character.
When I parse the string abécd, I get ab, then é (thanks to your
function), and then cd. But how can I tell that cd still belongs to
the same word?
Fabien
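
One possible way to keep "ab", "é" and "cd" together is to append every piece to a single buffer and split into words only at the end. The following is a minimal sketch, not a definitive answer: it assumes Python 3 (html.parser, html.entities and chr instead of HTMLParser, htmlentitydefs and unichr), and the names WordCollector and extract_words are hypothetical.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class WordCollector(HTMLParser):
    """Hypothetical helper: collects text and character references in one buffer."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref in use;
        # with the Python 3 default (True) references arrive already decoded.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between references, e.g. 'ab' and 'cd'.
        self.parts.append(data)

    def handle_charref(self, name):
        # Numeric reference: 'xe9' is hexadecimal, '233' is decimal.
        if name.startswith(('x', 'X')):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        self.parts.append(chr(num))

    def handle_entityref(self, name):
        # Named reference such as 'eacute'.
        self.parts.append(chr(name2codepoint[name]))

    def words(self):
        # Pieces are joined first, so a reference never splits a word;
        # only whitespace separates words.
        return ''.join(self.parts).split()


def extract_words(html):
    parser = WordCollector()
    parser.feed(html)
    parser.close()
    return parser.words()
```

Because the pieces are joined before splitting, a reference in the middle of a word does not break it. A real document would probably also want to treat tags such as a paragraph or line break as word boundaries, which this sketch ignores.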
> The character references indicate Unicode ordinals, not iso-8859-1
> characters. In your example it will give the proper character because
> iso-8859-1 coincides with the first part of the Unicode ordinals, but
> for characters outside of iso-8859-1 it will fail.
>
> This should give you an idea:
>
> from htmlentitydefs import name2codepoint
> ...
>     def handle_charref(self, name):
>         if name.startswith('x'):
>             num = int(name[1:], 16)
>         else:
>             num = int(name, 10)
>         print 'char:', repr(unichr(num))
>
>     def handle_entityref(self, name):
>         print 'char:', unichr(name2codepoint[name])
>
> If your HTML may be illegal you should add some exception handling.
> --
> Piet van Oostrum <p... at cs.uu.nl>
> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
> Private email: p... at vanoostrum.org
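
The exception handling suggested in the quoted reply could look roughly like this. It is only a sketch under assumptions: it targets Python 3 (chr and html.entities rather than unichr and htmlentitydefs), the function names decode_charref and decode_entityref are hypothetical, and falling back to the original reference text is just one possible policy for malformed input.

```python
from html.entities import name2codepoint


def decode_charref(name):
    """Decode a numeric reference body such as '233' or 'xe9'."""
    try:
        if name.startswith(('x', 'X')):
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        return chr(num)
    except (ValueError, OverflowError):
        # Not a number, or outside the Unicode range: keep the raw text.
        return '&#%s;' % name


def decode_entityref(name):
    """Decode a named reference body such as 'eacute'."""
    try:
        return chr(name2codepoint[name])
    except KeyError:
        # Unknown entity name: keep the raw text.
        return '&%s;' % name
```

The two handlers from the quoted code could then call these functions instead of converting the reference directly, so a broken reference in the input does not stop the parse.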