[Web-SIG] Python version of WWW::Mechanize

Stuart Langridge aquarius-lists at kryogenix.org
Sun Nov 30 15:26:24 EST 2003


John J Lee spoo'd forth:
> On Sun, 30 Nov 2003, Stuart Langridge wrote:
> [...]
>> I've put together a first cut of something that works like
>> WWW::Mechanize at http://www.kryogenix.org/days/2003/11/30/pybrowser.
>> Obviously it'll need a little more work on it, but it seems to work OK
>> initially. Do let me know if it doesn't seem to work!
> 
> Good, some code!
> 
> Some comments:
> 
> Is this aimed at the standard library?  xml.dom.ext.reader.HtmlLib?
> Unless I'm confused about it (quite likely actually, thanks to PyXML
> insisting on fiddling with the xml package instead of creating its own),
> that's not part of the standard library.  Is PyXML going to be in by 2.4,
> perhaps?  Even then, would 4DOM go in?  The original maintainers have
> dropped it, it's slow, and it's not up-to-date with the DOM level 2 spec.
> Personally, if I were going to depend on DOM outside the standard library,
> I'd want a forms interface that was higher level -- but I've already done
> that in DOMForm (though no browser class yet), and I guess it's a matter
> of taste whether you like a higher-level forms interface.  What do other
> people think?

Um. What I was looking for was something that could parse HTML
(including invalid HTML) and give me a DOM tree. I tried Twisted's
microdom, but settled on HtmlLib. Unfortunately, my selection criterion
was the intersection of "what do I have installed on my machine" and
"what comes up in a Google search for 'python html dom'" :-) I think
that a DOM parser for HTML is pretty important, even if that parser
*actually* just does "convert broken HTML to valid XHTML and then feed
it to minidom" or something similar. Are there any others?
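
For illustration, here is a minimal sketch of the "tolerant parser in front
of minidom" idea, written against current Python 3 (html.parser and
xml.dom.minidom) rather than the PyXML-era modules discussed above. The
names DomBuilder and parse_html, and the synthetic "document" root element,
are assumptions made for this example. It does not implement HTML's
implied-end-tag rules, so an unclosed <p> simply stays open and nests.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Build a minidom tree from (possibly broken) HTML. Hypothetical helper."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        # Synthetic root so that stray text and multiple top-level
        # elements have somewhere legal to live.
        root = self.doc.createElement("document")
        self.doc.appendChild(root)
        self.stack = [root]  # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (<input disabled>) get their own name.
            node.setAttribute(name, value if value is not None else name)
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, implicitly closing
        # anything left open inside it. Never pop the synthetic root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return
        # No match: a stray closing tag, which is ignored.

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Parse text and return a minidom Document."""
    builder = DomBuilder()
    builder.feed(text)
    builder.close()  # flush any trailing text
    return builder.doc
```

With this, parse_html(page).getElementsByTagName("a") gives the link
elements even when the markup has stray or missing closing tags.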
 
> Why isn't it a subclass of urllib2.OpenerDirector (or, better, from
> something like my (untested sketch of a) UserAgent in
> http://wwwsearch.sf.net/bits/ua.py)? Certainly the interface of
> OpenerDirector needs to be exposed by Browser (appropriately overridden).
> I see no reason why it shouldn't be a subclass, in fact: composition seems
> like needless complication.  WWW::Mechanize is a subclass of
> LWP::UserAgent, and the author doesn't seem to have run into any problems.
> And why is the method analogous to OpenerDirector.open() named .get(),
> when the URL might be POST, or even some completely different scheme
> (ftp:, file:...)?
> 
> It uses urlopen, which means Browser state (eg. cookies) is global.  This
> problem goes away if you subclass from OpenerDirector.

Because I didn't know about it. This is because "urllib.urlopen" is
hardwired into my fingers, and then I just overrode it with
ClientCookie when I needed cookie handling. I'm entirely happy to have
it work totally differently; this was really a proof-of-concept to get
the ball rolling rather than a submission for direct inclusion.
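
As a rough sketch of the subclassing approach John suggests, the following
keeps cookie state on each Browser instance instead of in a module-level
opener. It targets Python 3, where OpenerDirector and the cookie handler
live in urllib.request and http.cookiejar; the Browser class, its cookiejar
attribute, and the particular handler list are assumptions for this
example, not the pybrowser code.

```python
import http.cookiejar
import urllib.request


class Browser(urllib.request.OpenerDirector):
    """Hypothetical Browser that owns its cookie jar."""

    def __init__(self):
        super().__init__()
        # One jar per Browser, so two browsers never share cookies.
        self.cookiejar = http.cookiejar.CookieJar()
        for handler in (
            urllib.request.HTTPHandler(),
            urllib.request.HTTPSHandler(),
            urllib.request.HTTPDefaultErrorHandler(),
            urllib.request.HTTPRedirectHandler(),
            urllib.request.HTTPErrorProcessor(),
            urllib.request.FileHandler(),
            urllib.request.DataHandler(),
            urllib.request.HTTPCookieProcessor(self.cookiejar),
        ):
            self.add_handler(handler)
```

Because Browser inherits open(), the same method serves GET, POST (by
passing data) and non-HTTP schemes such as file:, which is the point made
about naming the method .get().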
 
> No multipart/form-data encoding?

Oops.
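
For reference, a minimal sketch of what multipart/form-data encoding
involves. The function name encode_multipart and its argument layout are
assumptions for this example. It does no escaping of quotes or newlines in
field names and filenames, which a real implementation would need.

```python
import uuid


def encode_multipart(fields, files=(), boundary=None):
    """Return (content_type, body) for a multipart/form-data POST.

    fields: iterable of (name, value) string pairs.
    files:  iterable of (name, filename, content_type, data_bytes).
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields:
        lines += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",  # blank line separates part headers from part body
            value.encode("utf-8"),
        ]
    for name, filename, ctype, data in files:
        lines += [
            delimiter,
            (f'Content-Disposition: form-data; name="{name}"; '
             f'filename="{filename}"').encode("utf-8"),
            f"Content-Type: {ctype}".encode("ascii"),
            b"",
            data,
        ]
    # The closing delimiter carries a trailing "--".
    lines += [delimiter + b"--", b""]
    body = b"\r\n".join(lines)
    return f"multipart/form-data; boundary={boundary}", body
```

The returned content type goes in the request's Content-Type header and
the body is sent as the POST data.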
 
> I think there has to be some way of (optionally) linking up any browser
> class to tidylib.

I agree; tidylib is nice. AFAIK, though (and I probably am wrong) the
only interface to Tidy is mxTidy, and I can never get it to install...

> Any tests?

Um, um, unit testing, I'm sure it says that on a Post-it note somewhere
around here...
 
> No .forward() / .backward() methods?

Didn't think of them until after I sent the message out. They'd be
pretty trivial to implement, although I don't know what you'd do about
the "This page contains POSTDATA" issue that browsers get.
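
One possible shape for this, as a sketch only: keep a list of
(method, url, data) entries and refuse to step onto a POST entry unless
the caller explicitly asks to resend it. The names History, RepostRequired
and resend_post are all assumptions for this example, and raising an
exception is just one way to handle the POSTDATA question.

```python
class RepostRequired(Exception):
    """Raised when moving onto an entry would resend POST data."""


class History:
    """Linear back/forward history of (method, url, data) entries."""

    def __init__(self):
        self._entries = []
        self._pos = -1  # index of the current page, -1 when empty

    def visit(self, url, method="GET", data=None):
        # A fresh navigation discards anything "ahead" of the current page.
        del self._entries[self._pos + 1:]
        self._entries.append((method, url, data))
        self._pos += 1

    def _move(self, step, resend_post):
        target = self._pos + step
        if not 0 <= target < len(self._entries):
            raise IndexError("no history entry in that direction")
        method, url, data = self._entries[target]
        if method == "POST" and not resend_post:
            # Leave the position unchanged so the caller can decide.
            raise RepostRequired(url)
        self._pos = target
        return method, url, data

    def back(self, resend_post=False):
        return self._move(-1, resend_post)

    def forward(self, resend_post=False):
        return self._move(+1, resend_post)
```

A browser class would call visit() after each successful request and
replay whatever back() or forward() returns.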
 
> I think it's useful to have a separate nr argument for follow_link so you
> can do (as in WWW::Mechanize):
> 
>  browser.follow_link("download", nr=3)

Ha! Yes, that would be clever. I'd also like to be able to pass a
compiled regex to follow_link() and form(), as well as a string.
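
A sketch of how the string-or-regex lookup with an nr argument might work.
The function find_link and the (text, url) pair representation are
assumptions for this example. Plain strings are treated as case-sensitive
substrings and nr counts from 0 here, which may not match what
WWW::Mechanize does.

```python
import re


def find_link(links, pattern, nr=0):
    """Return the nr-th (text, url) pair whose text matches pattern.

    pattern may be a plain string (substring match) or a compiled regex.
    """
    if isinstance(pattern, str):
        def matches(text):
            return pattern in text
    else:
        def matches(text):
            return pattern.search(text) is not None

    hits = [link for link in links if matches(link[0])]
    try:
        return hits[nr]
    except IndexError:
        raise LookupError(f"no link number {nr} matching {pattern!r}") from None


# Usage with made-up links:
links = [
    ("Download source", "/src.tgz"),
    ("About", "/about"),
    ("download binary", "/bin.zip"),
    ("Download docs", "/docs.zip"),
]
first_lowercase = find_link(links, "download")
third_any_case = find_link(links, re.compile("download", re.I), nr=2)
```

The same matcher could back form() as well, with form names in place of
link text.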

sil

-- 
Don't panic (even if your terminals start printing "all your dialup
accounts are belong to us" repeatedly)
	   -- bambam


