sgmllib and entityref handling in python2.0

piet at cs.uu.nl piet at cs.uu.nl
Tue May 22 06:41:45 EDT 2001


>>>>> Petar Karafezov <petar at metamarkets.com> (PK) writes:

PK> Hello all -
PK> I've noticed an interesting behavior of SGMLParser defined in sgmllib. The
PK> issue, as I see it, is the way entityref is defined on line 23:

PK> ...
PK> entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') 
PK> ...

PK> This would match anything that looks like an entityref and ends with a
PK> non-alphanumeric character. The W3C docs say that entityrefs end with
PK> ';', and so do the Python docs when describing how SGMLParser works.

In SGML the semicolon is optional in some contexts, but in the HTML spec it
is highly recommended to always use it:

        Note. In SGML, it is possible to eliminate the final ";" after a
        character reference in some cases (e.g., at a line
        break or immediately before a tag). In other circumstances it
        may not be eliminated (e.g., in the middle of a word). We
        strongly suggest using the ";" in all cases to avoid problems
        with user agents that require this character to be present.
        (http://www.w3.org/TR/html4/charset.html#h-5.3)

PK> The result of this is that a HTML fragment like:

PK> <a
PK> href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
PK> &b=2</a>

PK> Would get parsed:

PK> <a
PK> href="http://www.abc.com/page.html?a=1&b=2">http://www.abc.com/page.html?a=1
PK> &b;=2</a>

In fact the & is illegal there; it should be replaced by &amp;amp; or &amp;#38;
(http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2):

     B.2.2 Ampersands in URI attribute values

         The URI that is constructed when a form is submitted may be used
         as an anchor-style link (e.g., the href attribute for the A
         element). Unfortunately, the use of the "&" character to separate
         form fields interacts with its use in SGML attribute values to
         delimit character entity references. For example, to use the URI
         "http://host/?x=1&y=2" as a linking URI, it must be written <A
         href="http://host/?x=1&amp;#38;y=2"> or <A
         href="http://host/?x=1&amp;amp;y=2">.

         We recommend that HTTP server implementors, and in particular, CGI
         implementors support the use of ";" in place of "&" to save
         authors the trouble of escaping "&" characters in this manner.
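
As a rough illustration of the escaping the spec asks for, here is a
minimal sketch. The helper name escape_ampersands is made up for this
example, and it assumes the URL has not been escaped already (running it
twice would double-escape).

```python
def escape_ampersands(url):
    # Hypothetical helper: write every bare '&' as the entity '&amp;'.
    # Assumes the input contains no entities yet.
    return url.replace('&', '&amp;')

url = 'http://www.abc.com/page.html?a=1&b=2'
escaped = escape_ampersands(url)

# The escaped form is needed both in the href value and in the link text.
anchor = '<a href="%s">%s</a>' % (escaped, escaped)
print(anchor)
```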
	
PK> ---^ (extra semicolon included)


PK> This is happening because the entityref search ends at the '=' (the same
PK> happens at every character that's not a letter or a number, a space for
PK> instance).
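
The match itself is easy to reproduce with the re module alone. The
sketch below uses the pattern quoted from line 23; the sample string and
variable names are only illustrative, and it shows the regex match, not
what SGMLParser does with it afterwards.

```python
import re

# Pattern quoted above from sgmllib (Python 2.0), line 23.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Illustrative link text with a bare '&' between query parameters.
text = 'http://www.abc.com/page.html?a=1&b=2'

m = entityref.search(text)
matched = m.group(0)  # whole match, including the terminating character
name = m.group(1)     # the part taken as the entity name
print(repr(matched), repr(name))
```

Here the match is '&b=' with 'b' as the name: the '=' serves as the
terminator, so the bare '&b' is taken for an entity reference.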


PK> I did change line 23 to:

PK> entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);') 

PK> and now I am getting the correct behavior. 
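
For comparison, a small sketch of what the stricter pattern accepts and
rejects. The sample strings and variable names are invented for
illustration.

```python
import re

# Stricter pattern proposed above: a reference must end in ';'.
strict_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

# Bare '&' in a query string: no ';' follows, so nothing matches.
bare = strict_entityref.search('page.html?a=1&b=2')

# A properly terminated reference is still recognized.
proper = strict_entityref.search('x &lt; y')

# A reference without ';' at a line break is no longer recognized.
unterminated = strict_entityref.search('AT&T &amp\n')

print(bare, proper.group(1), unterminated)
```

The last case is the trade-off: SGML allows the ';' to be dropped in some
contexts, and the stricter pattern treats such references as plain text.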

PK> Could anyone share thoughts on this issue ??

PK> Regards
PK> Petar Karafezov

PK> MetaMarkets.com
PK> 415-575-3015
PK> -------------------------------------------
PK> Investing Out Loud at
PK> http://www.metamarkets.com
PK> -------------------------------------------


-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum at hccnet.nl


