sgmllib and entityref handling in python2.0
Petar Karafezov
petar at metamarkets.com
Mon May 21 14:44:48 EDT 2001
Hello all -
I've noticed an interesting behavior of SGMLParser, defined in sgmllib. The
issue, as I see it, is the way entityref is defined on line 23:
...
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
...
This matches anything that looks like an entityref and ends with any
non-alphanumeric character. The W3C docs say that entityrefs end with ';',
and so do the Python docs when describing how SGMLParser works.
The result is that an HTML fragment like:
<a
href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
&b=2</a>
would get parsed as:
<a
href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
&b;=2</a>
---^ (extra semicolon included)
This happens because the entityref match ends at the '=' (the same thing
happens at any character that is not a letter or a digit, a space for
instance).
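To make the match concrete, here is a minimal sketch that uses only the re module and the pattern quoted above. It does not run SGMLParser itself; the variable names text and m are just for illustration.

```python
import re

# Pattern as quoted from line 23 of sgmllib (Python 2.0).
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# The link text from the fragment above (illustrative variable name).
text = 'http://www.abc.com/page.html?a=1&b=2'

m = entityref.search(text)
print(m.group(0))  # whole match, including the terminating character
print(m.group(1))  # captured name, treated as an entity name
```

The whole match is '&b=' and the captured name is 'b', so the 'b' query parameter is taken for an entity reference even though no ';' follows it.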
I changed line 23 to:
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')
and now I am getting the correct behavior.
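As a quick check of the stricter pattern outside the parser, the sketch below again uses only the re module. The names strict_entityref, url_text and entity_text, and the 'Tom &amp; Jerry' sample, are assumptions made for the example.

```python
import re

# Stricter pattern from the changed line 23: the reference must end with ';'.
strict_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

# Illustrative inputs (hypothetical names and sample text).
url_text = 'http://www.abc.com/page.html?a=1&b=2'
entity_text = 'Tom &amp; Jerry'

no_match = strict_entityref.search(url_text)
real_ref = strict_entityref.search(entity_text)

print(no_match)            # no match: '&b' is not followed by ';'
print(real_ref.group(1))   # name of the properly terminated reference
```

With this pattern the URL text is left alone, while a properly terminated reference such as '&amp;' is still found. One trade-off to keep in mind: a reference written without the trailing ';' would no longer be recognised at all.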
Could anyone share thoughts on this issue?
Regards
Petar Karafezov
MetaMarkets.com
415-575-3015