Access to database other web sites
John J. Lee
jjl at pobox.com
Fri Sep 26 12:57:00 EDT 2003
tibi87 at hanmail.net (Jenny) writes:
> I am doing research about the relationship between sales rates and
> discounted prices or recommendation frequency. To do this, I need to
> access the database of commercial web sites via the internet. I think this
> is possible because it is similar to the work of price comparison
> sites and web robots.
IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).
> I am studying Python these days because I think it is a good language
> for the work.
[...]
I think so too.
> I welcome any information about this problem. Thanks in advance.
In the standard library, you'll want to look at these modules: httplib
(low level HTTP -- you probably don't want to use this), urllib2
(opens URLs as if they were files, handles redirections, proxies
etc. for you) and HTMLParser. The standard library also includes
sgmllib & htmllib, but you'll probably want to use HTMLParser instead
if you want that kind of event-driven parsing at all. Regular
expressions (re module) can also come in handy.
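To make the event-driven style concrete, here is a minimal sketch. It is written for Python 3, where urllib2 became urllib.request and HTMLParser lives in html.parser; on the Python of this post you would import from urllib2 and HTMLParser instead. The class name LinkParser, the helper names, the price regular expression and the SAMPLE markup are all made up for illustration, not taken from any real site.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect (href, link text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page as text; the UTF-8 encoding is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed price format: a dollar sign, digits, optional two decimals.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(html):
    return [float(match) for match in PRICE_RE.findall(html)]


# Made-up markup standing in for a downloaded page.
SAMPLE = '<p><a href="/item/1">Widget</a> now $9.99 (was $12.50)</p>'
links = extract_links(SAMPLE)
prices = extract_prices(SAMPLE)
```

In real use you would pass the result of fetch(some_url) to the two extract functions instead of SAMPLE. The regular expression is only good for quick jobs on pages whose layout you already know.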
Personally, I've decided that I prefer the DOM style of parsing for
anything complicated -- it's just less work than the event-driven
style (though I don't much like the DOM API). PyXML has an HTML DOM
implementation called 4DOM. Use that together with mxTidy or
uTidylib: they will clean up the horrid HTML you'll find on the web to
the point where 4DOM can make sense of it. Another option is to use
mxTidy/uTidylib to output XHTML, which allows you to use any XML DOM
implementation -- e.g. pxdom, minidom, libxml...
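As a rough illustration of the DOM style, here is a sketch using minidom from the standard library on a small XHTML document. It assumes the page has already been cleaned up into well-formed XHTML (for instance by one of the Tidy wrappers above); the XHTML string, the table layout and the function names are invented for the example.

```python
from xml.dom import minidom

# Made-up, already well-formed XHTML standing in for tidied output.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table id="products">
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
</body>
</html>"""


def cell_text(node):
    """Join the direct text children of a node (hypothetical helper)."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def table_rows(xhtml):
    """Return one list of cell strings per <tr> in the document."""
    doc = minidom.parseString(xhtml)
    rows = []
    for tr in doc.getElementsByTagName("tr"):
        cells = tr.getElementsByTagName("td")
        rows.append([cell_text(td) for td in cells])
    return rows


rows = table_rows(XHTML)
```

The point of the DOM style shows in table_rows: you walk a tree and ask for what you want, rather than tracking state across start-tag and end-tag callbacks. Note that minidom will refuse input that is not well-formed XML, which is why the clean-up step matters.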
You might find my modules useful too. ClientCookie has an interface
just like urllib2 (and uses it to do its work), but handles cookies
and some other stuff too. ClientForm makes it easier to work with
HTML forms. ClientTable is currently a heap of junk, don't use it ;-)
I've just rewritten ClientForm on top of the DOM, which lets you
switch back and forth between the two APIs (and also lets you handle
JavaScript, rather badly ATM) -- coming RSN...
http://wwwsearch.sourceforge.net/
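For readers on a current Python: the sketch below shows the same idea of a urllib2-like interface with cookie handling, but using http.cookiejar and urllib.request from the Python 3 standard library rather than ClientCookie itself. That substitution is mine, and the ClientCookie API is not shown here. The function name make_cookie_opener is hypothetical.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_cookie_opener():
    """Build an opener that stores and resends cookies automatically."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# opener.open(url) then behaves like urlopen(url), except that cookies
# set by the server are kept in `jar` and sent back on later requests.
```

This matters for scraping sites that only show prices after a login or a session cookie has been set: every request made through the same opener shares the one jar.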
The other, completely different, way of web scraping is to use the
"automation" capabilities of the various big web browsers: Microsoft
Internet Explorer, KDE's Konqueror and Mozilla are all scriptable from
Python. You need the Python for Windows extensions, PyKDE or PyXPCOM
respectively to control those browsers. Advantages: easy handling of
JavaScript and other assorted nonsense, and they're generally
reasonably well-tested and stable pieces of software (not to mention
de-facto standards). Disadvantages: poor portability in some cases,
and they're rather big, complicated, closed applications that are hard
to modify (compared to the pure Python approach) and to distribute
(which last, I guess, isn't a problem for you, since you'll be the
only one using your software). Other problems: COM (for MSIE) is a
bit of a headache for newbies, PyXPCOM last time I looked seemed a
pain to install (Brendan Eich mentioned in a newsgroup post that that
has been changing recently, though), and PyKDE might not be that well
tested (it's a very big wrapper!).
One other bunch of software worthy of mention: you can use Jython to
access various Java libraries. HTTPClient and httpunit look like they
might be useful. In particular, the latter has some JavaScript
support.
John