[Python-Dev] htmllib vs. HTMLParser
Guido van Rossum
guido at python.org
Mon Oct 27 11:52:53 EST 2003
> Over in the Web SIG, it was noted that the HTML parser in htmllib has
> handlers for HTML 2.0 elements, and it should really support HTML 4.01, the
> current version. I'm looking into doing this.
>
> We actually have two HTML parsers: htmllib.py and the more recent
> HTMLParser.py. The initial check-in comment for HTMLParser.py, dated
> 2001/05/18, reads:
>
> A much improved HTML parser -- a replacement for sgmllib. The API is
> derived from but not quite compatible with that of sgmllib, so it's a
> new file. I suppose it needs documentation, and htmllib needs to be
> changed to use this instead of sgmllib, and sgmllib needs to be
> declared obsolete. But that can all be done later.
>
> sgmllib only handles those bits of SGML needed for HTML, and anyone doing
> serious SGML work is going to have to use a real SGML parser, so deprecating
> sgmllib is reasonable. HTMLParser needs no changes for HTML 4.01; only
> htmllib needs to get a bunch more handler methods.
>
> Should I try to do this for 2.4?
I'm unclear on what you plan to do -- repeal sgmllib and rewrite
htmllib to use HTMLParser internally, keeping a backwards-compatible
interface?
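
For illustration only, here is a minimal sketch of what such a
compatibility layer might look like. It is not code from this thread or
from the standard library. It assumes the sgmllib-style convention of
start_<tag>/end_<tag>/do_<tag> methods with unknown_starttag and
unknown_endtag fallbacks, layered on HTMLParser's handle_starttag and
handle_endtag callbacks. The names CompatParser, LinkCollector, and
collect are hypothetical, and the import uses the Python 3 module name
html.parser (the module was called HTMLParser in the 2.x series).

```python
from html.parser import HTMLParser  # named "HTMLParser" in Python 2.x


class CompatParser(HTMLParser):
    """Hypothetical shim: sgmllib-style per-tag dispatch on top of HTMLParser."""

    def __init__(self):
        super().__init__()
        self.events = []  # records what was dispatched, for demonstration

    def handle_starttag(self, tag, attrs):
        # Look for start_<tag> first, then do_<tag>, as sgmllib-style code expects.
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_start", tag))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_end", tag))


class LinkCollector(CompatParser):
    """Old-style subclass that only knows about <a> elements."""

    def start_a(self, attrs):
        self.events.append(("a", dict(attrs).get("href")))

    def end_a(self):
        self.events.append(("/a", None))


def collect(html):
    """Feed a document through LinkCollector and return the recorded events."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.events
```

With a shim along these lines, existing subclasses written against the
per-tag method convention would keep working while the tokenizing is
done by HTMLParser. Supporting HTML 4.01 would then mostly mean adding
more start_/end_/do_ methods to the htmllib-style class.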
> (I can't find an explanation of how the API differs between the two modules
> but can figure it out by inspecting the code, and will try to keep the
> htmllib module backward-compatible.)
That would be required for a few releases, yes.
I'm okay with deprecating sgmllib faster than htmllib.
--Guido van Rossum (home page: http://www.python.org/~guido/)