sgmllib.py not good at handling <br/>

Alex Martelli aleaxit at yahoo.com
Wed May 16 07:50:34 EDT 2001


"Chris Withers" <chrisw at nipltd.com> writes:

> Alex Martelli wrote:
> > 
> > No!  The reverse.  But sgmllib does NOT cover all of SGML
> > (not even any _substantial_ fraction of it: SGML is really
> > huge, which is why it was subsetted to produce XML!-), just
> > what little of it is needed to parse typical HTML, as
> > the library reference manual says.
> 
> So, this is actually a bug in sgmllib? ;-)

The docs for sgmllib (Library Reference Manual, 13.1) state:
"""
In fact, it does not provide a full SGML parser -- it only 
parses SGML insofar as it is used by HTML, and the module 
only exists as a base for the htmllib module.
"""

If you want to call a module's reason for existence
"a bug" in that module, feel free to do so.  It's 
unlikely that others will understand you, if you choose
to attach arbitrary meaning to every word you use, but
that never stopped Humpty Dumpty, so why should
it stop you, after all?


> How would I go about improving it so that it does something sensible with
> XHTML-ish tags like <br/>?
> 
> What should I be reading or who should I be talking to?

D:\Python21\Lib\sgmllib.py (if that is the directory in
which you have installed Python -- modify suitably:-) is
likely to prove the most fruitful reading material for
this task.  (Don't bother with the 2.0 version -- the
2.1 one was modified quite a bit).  The start of method
parse_starttag, around line 260, is the place where a
change is most likely to pay off: it handles
"short tags" of the form <tag/data/ as equivalent to
<tag>data</tag> -- you probably don't need that for HTML
parsing, but, anyway, just before that attempt you can
try matching <tag/>, with appropriate whitespace-tolerance, 
and perform very similar (but simpler) tasks to those the
current short-tag match is doing.  Should be a patch of
a few lines, and then you can submit it to SourceForge
with some likelihood of having it accepted for Python 2.2.
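
To make the idea concrete, here is a rough standalone sketch of the kind
of match I mean.  It is NOT the actual sgmllib code: the pattern
empty_tag, the helpers match_empty_tag and handle_empty_tag, and the
Recorder class are all names made up for illustration, and attributes
are deliberately ignored.  It only shows that <tag/>, with optional
whitespace before the slash, can be recognized separately from the
<tag/data/ short-tag form and reported as a start tag immediately
followed by an end tag.

```python
import re

# Hypothetical pattern (not sgmllib's own): '<', a tag name, optional
# whitespace, then '/>'.  Attributes are not handled in this sketch.
empty_tag = re.compile(r'<([a-zA-Z][-.a-zA-Z0-9]*)\s*/>')


def match_empty_tag(rawdata, i):
    """Return (tag, end_index) if an empty-element tag starts at i, else None."""
    m = empty_tag.match(rawdata, i)
    if m is None:
        return None
    return m.group(1).lower(), m.end()


class Recorder:
    """Stand-in for a parser object; it just records the events it is sent."""

    def __init__(self):
        self.events = []

    def start(self, tag):
        self.events.append(('start', tag))

    def end(self, tag):
        self.events.append(('end', tag))


def handle_empty_tag(parser, rawdata, i):
    """Treat <tag/> as <tag></tag>; return the index just past it, or -1."""
    found = match_empty_tag(rawdata, i)
    if found is None:
        return -1  # no match: the caller falls through to its other cases
    tag, j = found
    parser.start(tag)
    parser.end(tag)
    return j
```

In a real patch the two calls on the parser object would go through
whatever the module already does for an ordinary start tag and end tag;
the point here is only the order of the checks, with the <tag/> match
tried before the short-tag one.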

If you want your patch's description to be understood, I
suggest you refer to the current lack of matching for
<tag/> as a "limitation" rather than a "bug", though:-).


Alex





