parsing complex web pages

John J. Lee jjl at pobox.com
Wed Jun 18 19:47:50 EDT 2003


John Hunter <jdhunter at ace.bsd.uchicago.edu> writes:

> I sometimes parse very complex web pages like
> 
>   http://biz.yahoo.com/z/a/x/xom.html
> 
> to get all the quantitative information out of them.  I find that
> parsing the text is easier than parsing the html, because there is so
> much extra syntax in the html that requires writing complex regexs or
> parsers, whereas with the text version simple string splits and finds
> often suffice.

I think a mixture of both is often best for maintainability.  In fact,
everything you can throw at the problem is probably useful in some
way.  Of course, maintainability is always a bit of a joke when
web-scraping, but it can be approached, and isn't intrinsically at
odds with writing simple code.

<vapourware>

I wrote some code to parse HTML tables and forms into a specialised
object model useful for web testing and scraping (the tables code is
'very alpha').  After I stabilise the code I already have, I plan to
rewrite that high-level code on top of a sloppy HTML DOM parser.  I
think that would drastically simplify a lot of HTML parsing: you could
move back and forth between the high-level HTMLForm, HTMLTable (etc.)
objects and the lower-level DOM (perhaps with pointers into the
original HTML, too).  You could then say things like 'find the first
table that has a TH element containing the string "widget production",
and give me an HTMLTable object for that'.  Going in the other
direction, once you'd found the cell you wanted in that table, you
could then drop back to the DOM to grab a link from it, for example.
DOM would also help to enable javascript interpretation, which is a
major pain ATM.
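
To make that concrete, here's roughly what "find the first table whose
TH mentions some string" costs you today with just the stdlib
HTMLParser -- a crude sketch that ignores nested tables and text split
across entity references, and still doesn't hand back anything like an
HTMLTable object ('report.html' is just a placeholder):

  import HTMLParser  # Python 2.x stdlib; chokes on really broken HTML

  class TableFinder(HTMLParser.HTMLParser):
      # Record the index of the first <table> whose <th> text contains
      # a target string.  Nested tables and entity references are
      # ignored to keep the sketch short.
      def __init__(self, target):
          HTMLParser.HTMLParser.__init__(self)
          self.target = target
          self.table_index = -1
          self.in_th = 0
          self.found_index = None
      def handle_starttag(self, tag, attrs):
          if tag == 'table':
              self.table_index = self.table_index + 1
          elif tag == 'th':
              self.in_th = 1
      def handle_endtag(self, tag):
          if tag == 'th':
              self.in_th = 0
      def handle_data(self, data):
          if self.in_th and self.found_index is None:
              if data.find(self.target) != -1:
                  self.found_index = self.table_index

  finder = TableFinder('widget production')
  finder.feed(open('report.html').read())
  finder.close()
  print finder.found_index  # ...and that's still not an HTMLTable object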

What's really needed to generate DOM trees is either something like
Perl's HTML::TreeBuilder (which is good and sloppy, but unfortunately
IIRC doesn't produce DOM trees useful for javascript) or a wrapper
around Mozilla's code (or another portable open-source browser).  HTMLtidy +
a parser like 4Suite's might also work, but I'm not yet sure whether
that's the best way to do it -- one wants to preserve as much
information as possible.
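
FWIW, the tidy route can be sketched with nothing but the command-line
tidy binary and the stdlib minidom standing in for 4Suite -- though, as
I say, comments, original whitespace and other lexical detail get
thrown away along the way ('page.html' is a placeholder):

  import os, xml.dom.minidom

  # Let HTML Tidy turn the tag soup into well-formed XHTML, then hand it
  # to a real XML parser.  --numeric-entities stops expat tripping over
  # named entities like &nbsp;.  Assumes the tidy binary is installed.
  xhtml = os.popen('tidy -q -asxml --numeric-entities yes page.html').read()
  dom = xml.dom.minidom.parseString(xhtml)
  print len(dom.getElementsByTagName('table')), 'tables'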

</vapourware>


> The current strategy I use is to call lynx via popen to convert the
> page to text.  If you do:
> 
>   lynx -dump http://biz.yahoo.com/z/a/x/xom.html
> 
> you'll see that the text comes back for the most part in a nicely
> formatted, readily parsable layout; eg, the layout of many of the
> tables is preserved.

Nice idea!  Full marks for code reuse and simplicity :-)
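
For anyone who wants to try it, the lynx trick is only a couple of
lines from Python -- a rough sketch, assuming the lynx binary is on the
PATH (the 'Earnings' search is just illustrative):

  import os

  # Have lynx render the page and dump the formatted text.
  url = 'http://biz.yahoo.com/z/a/x/xom.html'
  text = os.popen('lynx -dump %s' % url).read()

  # From there, simple string finds and splits often do the job.
  for line in text.splitlines():
      if line.find('Earnings') != -1:
          print line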


> If you do one of the standard html2txt conversions using HTMLParser
> and a DumbWriter, the text layout is not as well preserved (albeit
> still parsable)
[...]

I suppose you're lucky if that preserves any useful table structure
information.
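
For comparison, the DumbWriter route you describe is roughly this
(Python 2.x standard library; DumbWriter just re-flows the text, which
is why the table alignment mostly disappears):

  import htmllib, formatter, urllib, StringIO

  def html2text(url):
      # The standard library html -> text route: HTMLParser feeding an
      # AbstractFormatter which writes through a DumbWriter.
      out = StringIO.StringIO()
      writer = formatter.DumbWriter(out)
      parser = htmllib.HTMLParser(formatter.AbstractFormatter(writer))
      parser.feed(urllib.urlopen(url).read())
      parser.close()
      return out.getvalue()

  print html2text('http://biz.yahoo.com/z/a/x/xom.html')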


> I am more or less happy with the lynx solution, but would be happier
> with a pure python solution.

If it works well for you, why not stick with it?


> Is there a good python -> text solution that preserves table layouts
> and other visual formatting?
> 
> What do people recommend for quick parsing of complicated web pages?

  - regexps, string methods &c.: stripping tags and parsing text (or
    stripping text and parsing tags, I guess...)
  - htmllib / sgmllib / HTMLParser
  - 4Suite (HTML DOM)
  - HTMLtidy
  - XML parsers
  - reusing browsers: your excellent simple way, or complicated stuff
    like IE's MSHTML -- anybody here tried reusing Mozilla / XPCOM or
    Konqueror / KParts?
  - something like HTML::TokeParser (hope I remember the name right),
    if it existed in Python
  - Perl modules using pyperl (including a table parser with bizarrely
    complicated declarative matching options)
  - Java code using Jython or JPE (I notice
    http://httpunit.sourceforge.net/ already does some of the things I
    want to do in Python, and http://maxq.tigris.org/ is interesting
    and itself uses Jython)
  - kitchen sinks...

...and my own modules of course :-)


John



