sgmllib and entityref handling in Python 2.0

Mon May 21 14:44:48 EDT 2001

Hello all -

I've noticed an interesting behavior in SGMLParser, which is defined in
sgmllib. The issue, as I see it, is the way entityref is defined on line 23:

...
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') 
...

This matches anything that looks like an entityref and is followed by any
non-alphanumeric character. The W3C docs say that entityrefs end with ';',
and so do the Python docs when describing how SGMLParser works.
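
To make the matching rule concrete, here is a minimal sketch that uses only
the re module and the pattern quoted above. The sample strings are made up
for illustration and are not taken from sgmllib or its tests.

```python
import re

# Pattern as quoted above from sgmllib (Python 2.0).
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Hypothetical sample strings, chosen only to exercise different terminators.
samples = ["x &amp; y", "fish &chips now", "k=1&v=2"]

# Map each sample to the full text the pattern consumed (None if no match).
matched = {}
for text in samples:
    m = entityref.search(text)
    matched[text] = m.group(0) if m else None
    print(repr(text), "->", repr(matched[text]))
```

Only the first sample is terminated by ';'. The other two still match,
because a space and '=' both satisfy the trailing [^a-zA-Z0-9] class.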

The result is that an HTML fragment like:

<a
href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
&b=2</a>

Would get parsed as:

<a
href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
&b;=2</a>

---^ (extra semicolon included)

This happens because the entityref match ends at the '=' (the same thing
happens at any character that is not a letter or a digit, a space for
instance).

I changed line 23 to:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);') 

and now I get the correct behavior.
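
For comparison, here is a small sketch that runs both patterns over the URL
from the example. It exercises the regular expressions only, not SGMLParser
itself. The names loose and strict, and the trailing " &amp; done" text, are
my own additions for illustration.

```python
import re

# Original definition: any non-alphanumeric character ends the entityref.
loose = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
# Modified definition: only ';' ends the entityref.
strict = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

# URL from the example above, plus a genuine entity reference appended
# (hypothetical) so that both patterns have something to find.
text = 'http://www.abc.com/page.html?a=1&b=2 &amp; done'

loose_names = loose.findall(text)    # entity names seen by the original
strict_names = strict.findall(text)  # entity names seen by the modified one

print("loose: ", loose_names)
print("strict:", strict_names)
```

The original pattern reports 'b' as an entity name alongside 'amp', while
the modified one reports only 'amp' and leaves '&b=2' as ordinary text.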

Could anyone share their thoughts on this issue?

Regards
Petar Karafezov

MetaMarkets.com
415-575-3015