Special chars with HTMLParser
Piet van Oostrum
piet at cs.uu.nl
Wed Aug 5 14:20:47 EDT 2009
>>>>> Fafounet <fafounet at gmail.com> (F) wrote:
>F> Thank you, now I can get the correct character.
>F> Now when I have the string abécd I can get ab then é thanks to
>F> your function and then cd. But how is it possible to know that cd is
>F> still the same word ?
That depends on your definition of `word'. And that is
language-dependent.
What you normally do is collect the text in a (unicode) string variable.
This happens in handle_data, handle_charref and handle_entityref.
Then you check whether the previously collected text was a word (e.g.
consisting of Unicode letters) and whether the new text also consists of
letters; if both are true, the new text continues the same word. If your
language has additional word constituents like - or ' you have to allow
for those as well.
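
A minimal sketch of the collecting step, assuming Python 3, where the
module is html.parser and the parser must be created with
convert_charrefs=False so that handle_charref and handle_entityref are
still called. The class name TextCollector and its text() method are
made up for illustration; they are not part of the library.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers all character data into one string."""

    def __init__(self):
        # Assumption: Python 3. With convert_charrefs=False the parser
        # reports character and entity references through the handlers below.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        # Plain text between tags and references
        self.pieces.append(data)

    def handle_charref(self, name):
        # name is "233" for &#233; and "xe9" for &#xe9;
        if name[0] in "xX":
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.pieces.append(chr(code))

    def handle_entityref(self, name):
        # Known names such as "eacute" become one character;
        # unknown ones are kept verbatim rather than guessed at.
        if name in name2codepoint:
            self.pieces.append(chr(name2codepoint[name]))
        else:
            self.pieces.append("&%s;" % name)

    def text(self):
        return "".join(self.pieces)


collector = TextCollector()
collector.feed("<p>ab&eacute;cd and ab&#233;cd</p>")
collector.close()
print(collector.text())
```

Because all three handlers append to the same list, ab, é and cd end up
next to each other in one string, and the word check can be run on the
joined text instead of on the separate pieces.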
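You can do this with unicodedata.category or with a regular
expression. \w in a regular expression may be helpful if you compile it
with re.UNICODE (or with re.LOCALE, if your locale is correct), but note
that \w also matches digits and the underscore.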
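
Both approaches might look roughly like this. It is a sketch assuming
Python 3, where str patterns are Unicode-aware by default. The names
EXTRA_WORD_CHARS, is_word_char, split_words and WORD_RE are invented
for the example, and treating - and ' as word constituents is only an
assumption that you should adapt to your language.

```python
import re
import unicodedata

# Assumption: extra characters that may occur inside a word
EXTRA_WORD_CHARS = "-'"


def is_word_char(ch):
    # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo)
    return unicodedata.category(ch).startswith("L") or ch in EXTRA_WORD_CHARS


def split_words(text):
    # Group consecutive word characters; anything else ends the word
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Regular-expression variant: [^\W\d_] means "\w without digits and
# underscore", i.e. letters; - and ' are allowed only between letters.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

sample = "abécd l'été, 42 fois"
print(split_words(sample))
print(WORD_RE.findall(sample))
```

The two variants agree on this sample but differ at the edges:
split_words accepts a run consisting only of - or ' as a word, while the
regular expression requires letters on both sides of those characters.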
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org