[Chicago] web page content scraper

Warren Lindsey warren.lindsey at gmail.com
Thu Aug 28 23:07:39 CEST 2008


That looks like you've got half of your talk laid out already. Good job!

-----Original Message-----
From: Ian Bicking <ianb at colorstudy.com>
Sent: Thursday, August 28, 2008 3:34 PM
To: The Chicago Python Users Group <chicago at python.org>
Subject: Re: [Chicago] web page content scraper

Pete wrote:
> On Apr 9, 2008, at 11:27 AM, Adrian Holovaty wrote:
> 
>> On Tue, Apr 8, 2008 at 9:25 AM, Tom Printy 
>> <tprinty at mail.edisonave.net> wrote:
>>> Wow this library is super cool. Anyone got slides or notes from the
>>> talk?
>>
>> Hey, that's my library and was my talk. Note that the current version
>> of templatemaker (on Google Code) is pretty "dumb" when dealing with
>> HTML.
>>
>> Since that talk, I've developed a new one, based on lxml, that
>> analyzes differences in the HTML trees. It's a *lot* better (I'd even
>> call it *awesome*), but I haven't released it open-source yet. Stay
>> tuned.
> 
> Ian Bicking wrote something similar IIRC, also based on lxml.  If you're 
> both gonna be there, would you like to talk about them briefly?  Anyone 
> want to speak for BeautifulSoup?  I'm thinking just 5-10 minutes on each.

I think that Adrian's and my difference finders have very different 
motivations.  Mine (in lxml.html.diff) is primarily for viewing changes 
to content, while trying to ignore most changes to the structure of a 
page.  It is really intended to be used with content written by hand, 
typically in a WYSIWYG editor, where the text is intentional but other 
parts of the structure might not be entirely intentional (or at least 
not interesting).  Adrian's is focused on machine-generated content, 
detecting interesting changes in generated pages so the underlying 
information can be extracted.
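
For reference, a minimal sketch of the content-oriented diff described 
above, assuming lxml is installed.  htmldiff is the function exposed by 
lxml.html.diff; the two sample strings are made up for illustration.

```python
from lxml.html.diff import htmldiff

# Two versions of a hand-written fragment (illustrative sample data).
old = "<p>Hello world</p>"
new = "<p>Hello there world</p>"

# htmldiff returns HTML in which inserted text is wrapped in <ins>
# and removed text in <del>.
result = htmldiff(old, new)
print(result)
```

The result is itself HTML, so it can be dropped into a page and styled 
to show the added and removed text.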

Whether it's written in BeautifulSoup or lxml probably wouldn't be 
terribly interesting -- both parse the HTML into some structure, and 
then we both deal with the structured data.  In lxml.html.diff I 
actually invert the structure: where lxml (and etree) has text as 
attributes of the elements, I make the elements an attribute of the 
text.  So it's hardly lxml, except for the fact that it is parsed by 
lxml.  The same thing written with BeautifulSoup would look very similar.
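
The inversion can be sketched roughly as follows.  This is not the 
lxml.html.diff code; it is a simplified illustration using the standard 
library's ElementTree, and the names Token, pre_tags, post_tags and 
tokenize are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


class Token(str):
    """A word that carries the tags around it (hypothetical helper)."""

    def __new__(cls, text, pre_tags=(), post_tags=()):
        obj = super().__new__(cls, text)
        obj.pre_tags = list(pre_tags)    # tags opened just before the word
        obj.post_tags = list(post_tags)  # tags closed just after the word
        return obj


def tokenize(root):
    """Flatten an element tree into words that own their markup."""
    tokens = []
    pending = []  # tags waiting to be attached to the next word

    def add_words(text):
        for word in (text or "").split():
            tokens.append(Token(word, pre_tags=pending))
            pending.clear()

    def close(tag):
        # Keep an empty element's tags together; otherwise attach the
        # closing tag to the most recent word.
        if pending or not tokens:
            pending.append(f"</{tag}>")
        else:
            tokens[-1].post_tags.append(f"</{tag}>")

    def walk(elem):
        pending.append(f"<{elem.tag}>")
        add_words(elem.text)
        for child in elem:
            walk(child)
            add_words(child.tail)
        close(elem.tag)

    walk(root)
    return tokens


root = ET.fromstring("<p>Hello <b>big</b> world</p>")

# Tree view: text hangs off the elements as .text and .tail.
print(repr(root.text), repr(root[0].text), repr(root[0].tail))

# Inverted view: elements hang off the text.
tokens = tokenize(root)
for tok in tokens:
    print(tok.pre_tags, str(tok), tok.post_tags)
```

Because each Token is a plain string subclass, the token lists for two 
documents can be compared word by word with an ordinary sequence differ, 
and the markup comes along without taking part in the comparison.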

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
_______________________________________________
Chicago mailing list
Chicago at python.org
http://mail.python.org/mailman/listinfo/chicago