[Python-Dev] htmllib vs. HTMLParser

amk at amk.ca
Mon Oct 27 11:02:21 EST 2003


Over in the Web SIG, it was noted that the HTML parser in htmllib has
handlers for HTML 2.0 elements, and it should really support HTML 4.01, the
current version.  I'm looking into doing this.

We actually have two HTML parsers: htmllib.py and the more recent
HTMLParser.py.  The initial check-in comment for HTMLParser.py, dated
2001/05/18, reads:

      A much improved HTML parser -- a replacement for sgmllib.  The API is
      derived from but not quite compatible with that of sgmllib, so it's a
      new file.  I suppose it needs documentation, and htmllib needs to be
      changed to use this instead of sgmllib, and sgmllib needs to be
      declared obsolete.  But that can all be done later.

sgmllib handles only those bits of SGML needed for HTML, and anyone doing
serious SGML work will have to use a real SGML parser anyway, so deprecating
sgmllib is reasonable.  HTMLParser needs no changes for HTML 4.01; only
htmllib needs a batch of additional handler methods.
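
To illustrate why only the per-element layer needs new methods, here is a
rough sketch, not the actual htmllib code.  It assumes the sgmllib-style
convention of looking up a start_<tag> method for each element, layered on
HTMLParser's generic handle_starttag().  The class name DispatchingParser and
its handled/unknown lists are made up for the example, and the fallback
import is only there so the sketch also runs on later Pythons, where the
module lives at html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name discussed in this thread
except ImportError:
    from html.parser import HTMLParser  # later Python versions

class DispatchingParser(HTMLParser):
    """Hypothetical parser that mimics sgmllib-style start_<tag> dispatch."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.handled = []   # tags that had a dedicated start_<tag> method
        self.unknown = []   # tags with no dedicated method

    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this for every start tag, whatever its name,
        # so the tokenizer itself needs no knowledge of HTML 4.01 elements.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown.append(tag)
        else:
            method(attrs)

    def start_p(self, attrs):
        # An element with a dedicated handler.
        self.handled.append('p')

parser = DispatchingParser()
parser.feed('<p>An <abbr title="HyperText Markup Language">HTML</abbr> test.</p>')
parser.close()
print(parser.handled)   # tags that reached a start_<tag> method
print(parser.unknown)   # tags that fell through, e.g. the HTML 4 'abbr'
```

Supporting a newer element in this style means adding one more start_<tag>
method (for example a start_abbr); the tag-level parsing stays the same.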

Should I try to do this for 2.4?

(I can't find an explanation of how the API differs between the two modules,
but I can figure it out by inspecting the code, and I will try to keep the
htmllib module backward-compatible.)

--amk


