Should HTML entity translation accept "&"?
nagle at animats.com
Mon Jan 7 02:09:48 CET 2008
Another in our ongoing series on "Parsing Real-World HTML".
It's wrong, of course. But Firefox will accept HTML escapes that
lack the closing ";" as well as the correct forms.
To be "compatible", a Python screen scraper at
has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode. (Why isn't this a standard
Python library function? Its inverse is available.)
This uses the regular expression
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
to recognize HTML escapes.
Note the ";?", which makes the closing ";" optional.
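
To see what the optional ";" lets through, here is a small sketch that runs the quoted pattern over a few strings. The helper name "find_escapes" and the sample strings are made up for illustration, and the code assumes Python 3.

```python
import re

# Pattern quoted in the post; the trailing ";?" makes the semicolon optional.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

def find_escapes(text):
    """Return every substring the pattern treats as an HTML escape.

    Hypothetical helper, not part of the scraper discussed above.
    """
    return [m.group(0) for m in charrefpat.finditer(text)]

print(find_escapes("a &lt; b"))        # well-formed escape
print(find_escapes("a &lt b"))         # semicolon missing, still matched
print(find_escapes("page?x=1&y=2"))    # a query-string "&" is matched too
```

The last case shows the cost: any "&" followed by word characters counts as an escape, so ordinary URL query strings become candidates for decoding.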
This seems fine until we hit something valid but unusual like
for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and gets a
For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
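
A minimal sketch of what such a stricter decoder might look like follows. This is not the rewritten code itself: it assumes Python 3 (so "chr" and "html.entities" rather than the Python 2 equivalents), and the names "strict_charrefpat" and "htmldecode_strict" are hypothetical. It also leaves unknown names and out-of-range numbers untouched instead of raising, which is one possible policy among several.

```python
import re
from html.entities import name2codepoint

# Same shape as the pattern above, but the closing ";" is mandatory.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')

def htmldecode_strict(text):
    """Decode only escapes that end in ';'; leave everything else as-is."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#'):
                digits = body[1:]
                if digits[0] == 'x':
                    codepoint = int(digits[1:], 16)   # hexadecimal form
                else:
                    codepoint = int(digits)           # decimal form
            else:
                codepoint = name2codepoint[body]      # named entity
            return chr(codepoint)
        except (KeyError, ValueError, OverflowError):
            # Unknown name or number outside the Unicode range:
            # keep the original text rather than fail.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)

print(htmldecode_strict("a &lt; b &amp; c"))   # decoded
print(htmldecode_strict("a &lt b"))            # no ";", left alone
print(htmldecode_strict("&#1234567;"))         # out of range, left alone
```

With this policy, a number such as 1234567 that exceeds the largest code point (0x10FFFF) is passed through unchanged, so the decoder does not fail on it.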
What's the general opinion of this behavior? Too strict, or OK?