[Web-SIG] HTML parsers and DOM;
WWW::Mechanize work-alike [was: Re: Python version...]
aquarius-lists at kryogenix.org
Sun Nov 30 17:18:56 EST 2003
John J Lee spoo'd forth:
> On Sun, 30 Nov 2003, Stuart Langridge wrote:
>> > Is this aimed at the standard library? xml.dom.ext.reader.HtmlLib?
>> Um. What I was looking for was something that could parse HTML
>> (including invalid HTML) and give me a DOM tree. I tried Twisted's
> Fine, but what we're talking about here is what should go into Python's
> standard library.
True enough. I fear, though, that without *something* that can cope
with invalid HTML, a WWW::Mechanize-style thing is going to be pretty
>> I think
>> that a DOM parser for HTML is pretty important, even if that parser
>> *actually* just does "convert broken HTML to valid XHTML and then feed
>> it to minidom" or something similar. Are there any others?
> There are lots of XML DOM implementations for Python (only one HTML DOM
> implementation, though: 4DOM -- and that's out of date), including the one
> that's already in the standard library. Parsing arbitrary HTML is hard,
> though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
> DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
> the wild ;-). tidylib is the only sane way I know of. See below.
*nod* Your notes on tidylib are useful -- I didn't know about it. That
said, though, without it in the stdlib, it's no better than HtmlLib
(well, it's maintained, true, but it's still not available to the
>> > Why isn't it a subclass of urllib2.OpenerDirector (or, better, from
>> Because I didn't know about it. This is because "urllib.urlopen" is
>> hardwired into my fingers, and then I just overrode it with
>> ClientCookie when I needed cookie handling. I'm entirely happy to have
>> it work totally differently; this was really a proof-of-concept to get
>> the ball rolling rather than a submission for direct inclusion.
> Sure (you don't mean proof-of-concept, but I know what you mean).
Very true, yes, and thanks :)
> Should tidylib be in the standard library? On one hand, I lean towards
> "no", because HTML is (in theory) on the way out. OTOH, if it's going to
> take another thirty years for HTML to completely go away, that may be a
> silly attitude to take! Opinions? If it were to be in the std. lib., I
> guess somebody would need to write a non-ctypes wrapper.
I really think that HTML is not going away any time soon. Moreover,
there are still issues with XHTML (like which content-type to serve it
as). It's certainly reasonable to make tools only *produce* newer
variants, but you have to be able to consume all kinds of invalid
rubbish or you'll never be able to look at the web at all :)
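
To make the "consume invalid HTML and still get a DOM" idea concrete, here is
a rough sketch using only the standard library as it exists in current
Python 3. It is not tidylib and does not go via XHTML: it builds a minidom
tree directly from a tolerant tokenizer (html.parser). The names
LenientDOMBuilder and parse_html, the synthetic "root" element and the VOID
set are all made up for illustration, and the repair strategy (ignore stray
close tags, leave unclosed elements open) is far cruder than what tidy does.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that never take a closing tag in HTML (partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col"}


class LenientDOMBuilder(HTMLParser):
    """Hypothetical helper: builds a minidom tree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        # Synthetic wrapper so that fragments with several top-level
        # elements still yield a single-rooted document.
        self.root = self.doc.createElement("root")
        self.doc.appendChild(self.root)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # a close tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Return a minidom Document for possibly invalid HTML."""
    builder = LenientDOMBuilder()
    builder.feed(text)
    builder.close()
    return builder.doc
```

Something like parse_html("<p>Hello <b>world<p>Next</i>") then gives a tree
you can walk with the usual getElementsByTagName() calls, although the
nesting will not match what a browser would produce.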
>> > No .forward() / .backward() methods?
>> Didn't think of them until after I sent the message out. They'd be
>> pretty trivial to implement, though, although I don't know what you'd
>> do about the "This page contains POSTDATA" issue that browsers get.
> You're allowed to do whatever you like, really (RFC 2616 section 13.13).
Re-posting and not re-posting are both iffy, though, hence the
choice. Admittedly, you could have backward() and forward() take a
repostData parameter, but you'd have to know beforehand whether you'd
want to do it, since use isn't interactive. Hm.
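
A minimal sketch of that repostData idea, in present-day Python. Everything
here is hypothetical (the History and Request names, the fetch callable, the
repost_data spelling); it is not the API of my module or of WWW::Mechanize.
The assumption is that going backward or forward returns the stored response
unless the entry was a POST and the caller explicitly asks to send it again.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Request:
    url: str
    data: Optional[bytes] = None  # None means GET, anything else is a POST body


class History:
    """Hypothetical browsing history with an explicit re-POST switch."""

    def __init__(self, fetch: Callable[[Request], str]):
        self._fetch = fetch  # caller-supplied function that does the real I/O
        self._entries: List[Tuple[Request, str]] = []
        self._pos = -1

    def open(self, url: str, data: Optional[bytes] = None) -> str:
        req = Request(url, data)
        body = self._fetch(req)
        del self._entries[self._pos + 1:]  # a new page discards forward history
        self._entries.append((req, body))
        self._pos += 1
        return body

    def _go(self, step: int, repost_data: bool) -> str:
        new = self._pos + step
        if not 0 <= new < len(self._entries):
            raise IndexError("no history entry in that direction")
        self._pos = new
        req, body = self._entries[new]
        if req.data is not None and repost_data:
            # Caller decided beforehand that re-sending the POST is acceptable.
            body = self._fetch(req)
            self._entries[new] = (req, body)
        return body  # otherwise the stored response is returned unchanged

    def backward(self, repost_data: bool = False) -> str:
        return self._go(-1, repost_data)

    def forward(self, repost_data: bool = False) -> str:
        return self._go(+1, repost_data)
```

Defaulting repost_data to False means a script never re-submits a form by
accident; the cost is that the caller has to know in advance which steps
cross a POST.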
Medio tutissimus ibis.
(You will travel safest in a middle course)
-- family motto