
Over in the Web SIG, it was noted that the HTML parser in htmllib has handlers for HTML 2.0 elements, and it should really support HTML 4.01, the current version. I'm looking into doing this.

We actually have two HTML parsers: htmllib.py and the more recent HTMLParser.py. The initial check-in comment from 2001/05/18 for HTMLParser.py reads:

    A much improved HTML parser -- a replacement for sgmllib. The API is derived from but not quite compatible with that of sgmllib, so it's a new file. I suppose it needs documentation, and htmllib needs to be changed to use this instead of sgmllib, and sgmllib needs to be declared obsolete. But that can all be done later.

sgmllib only handles those bits of SGML needed for HTML, and anyone doing serious SGML work is going to have to use a real SGML parser, so deprecating sgmllib is reasonable. HTMLParser needs no changes for HTML 4.01; only htmllib needs to get a bunch more handler methods.

Should I try to do this for 2.4? (I can't find an explanation of how the API differs between the two modules, but I can figure it out by inspecting the code, and I will try to keep the htmllib module backward-compatible.) --amk

I'm unclear on what you plan to do -- repeal sgmllib and rewrite htmllib to use HTMLParser internally for a backwards-compatible interface?
That would be required for a few releases, yes. I'm okay with deprecating sgmllib faster than htmllib. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Mon, Oct 27, 2003 at 08:52:53AM -0800, Guido van Rossum wrote:
Correct; that's what your initial checkin message for HTMLParser.py suggests doing, and if I'm touching htmllib.py to add the HTML 4.01 stuff, I may as well make the other change, too.
> I'm okay with deprecating sgmllib faster than htmllib.
sgmllib gets deprecated; htmllib never gets deprecated. HTMLParser is a barebones HTML parser that provides no default handlers (handle_head, handle_title, etc.), and htmllib extends it, adding default handlers for the various things in HTML 4.01. --amk
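
The layering described above could look roughly like the following. This is only a sketch, assuming Python 3, where HTMLParser.py lives at html.parser. The class name DispatchingParser and the start_<tag>/end_<tag> method convention are illustrative (borrowed from the sgmllib style); this is not the actual htmllib code.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical htmllib-style layer on top of the bare HTMLParser.

    HTMLParser itself only reports generic events; this subclass routes
    each tag to a per-element method if one is defined.
    """

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Look for a per-element handler such as start_title().
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()

    def handle_data(self, data):
        # Collect text only while inside <title>.
        if self._in_title:
            self.title = (self.title or "") + data

    # One example of a "default handler" pair; a real module would
    # define one pair per element it cares about.
    def start_title(self, attrs):
        self._in_title = True

    def end_title(self):
        self._in_title = False


parser = DispatchingParser()
parser.feed("<html><head><title>Hello</title></head>"
            "<body><p>text</p></body></html>")
parser.close()
print(parser.title)
```

Elements without a matching start_/end_ method are silently ignored here, which keeps the base parser usable on its own.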

OK, got it. Sounds good to me! --Guido van Rossum (home page: http://www.python.org/~guido/)

Glad to see you volunteering! But IMO simply adding some handler methods won't really do it. You also need to introduce some knowledge about the semantics of the syntax. For example, a new "block"-level element should close all "in-line" elements that are currently open. Etc. It would also be handy to have a version of the parser that takes an HTML page and returns a parse tree, rather than the halfway solution we currently have, forcing the user to design and write a lot of code to get anything done. Bill
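
To make the suggestion above concrete, here is a rough sketch of a parser that returns a tree and lets a new block-level element close any open in-line elements. It assumes Python 3's html.parser; the names Node, TreeBuilder and parse_tree are made up for illustration, and the INLINE and BLOCK sets are small hand-picked subsets, not the full HTML 4.01 lists.

```python
from html.parser import HTMLParser

# Assumption: tiny illustrative subsets, not the complete HTML 4.01 lists.
INLINE = {"a", "b", "i", "em", "strong", "span", "tt"}
BLOCK = {"p", "div", "ul", "ol", "li", "h1", "h2", "h3", "pre", "blockquote"}


class Node:
    """Hypothetical tree node: a tag, its attributes, and its children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # Node instances and plain text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, root first

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # A new block-level element implicitly closes the in-line
            # elements that are open at the top of the stack.
            while len(self.stack) > 1 and self.stack[-1].tag in INLINE:
                self.stack.pop()
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse_tree(html):
    """Parse an HTML string and return the root Node of the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = parse_tree("<div><b>bold<p>para</p></div>")
div = tree.children[0]
print([child.tag for child in div.children])
```

The sketch does not treat empty elements such as br or img specially, so an unclosed one would swallow the content that follows it; real code would need that and many more rules.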

On Mon, Oct 27, 2003 at 04:53:32PM -0800, Bill Janssen wrote:
Perhaps, but it might be a mug's game. I was on the Lynx developer list for a while, and bad HTML requires many, many hacks to be processed sensibly. Given that XHTML use is slowly rising, that work may not be necessary, but I'll keep it in mind. --amk

> Perhaps, but it might be a mug's game. I was on the Lynx developer list for a while, and bad HTML requires many, many hacks to be processed sensibly.
Yes, I know what you mean. I would personally be happy to simply reject bad HTML (return None from the parser), and force the user to do what he currently has to do to handle it. Bill
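
A sketch of the "reject bad HTML" behaviour, again assuming Python 3's html.parser. What counts as "bad" is my assumption here: any unbalanced or misnested tag. That is stricter than HTML 4.01, which permits omitting some end tags (for example on p and li). The names MalformedHTML, StrictOutline and parse_or_none are hypothetical, and the VOID set is a partial list of empty elements.

```python
from html.parser import HTMLParser

# Assumption: a partial list of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class MalformedHTML(Exception):
    """Raised internally when an end tag does not match the open element."""


class StrictOutline(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of elements still waiting for an end tag
        self.seen = []       # every start tag, in document order

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.open_tags or self.open_tags[-1] != tag:
            raise MalformedHTML(tag)
        self.open_tags.pop()


def parse_or_none(html):
    """Return the list of start tags, or None if the markup is unbalanced."""
    parser = StrictOutline()
    try:
        parser.feed(html)
        parser.close()
    except MalformedHTML:
        return None
    if parser.open_tags:
        return None  # something was never closed
    return parser.seen


print(parse_or_none("<p><b>x</b></p>"))
print(parse_or_none("<p><b>x</p></b>"))
```

A caller that gets None back can then fall back to whatever lenient handling it already uses.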
