[Chicago] web page content scraper
Massimo Di Pierro
mdipierro at cs.depaul.edu
Thu Apr 10 00:18:58 CEST 2008
I posted my toy screen scraper on
http://mdp.cti.depaul.edu/examples/static/scraper.py
There seems to be a lot of expertise on the list on this topic, so
perhaps you can help make it better or just use it to make yours
better. Currently it correctly scrapes a Wikipedia page given two
examples, extracts all repeated tags, removes all text and replaces
it with "text", finds all links, and handles JavaScript properly.
It uses the LCS applied to symbols (tags) instead of characters,
something I suggested when Adrian gave us an excellent presentation
on this topic.
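To make the idea concrete, here is a minimal sketch of LCS over tag symbols.
It is not the code in scraper.py. It assumes "LCS" means longest common
subsequence (it could also be read as longest common substring), it is
written for Python 3, and the names TagTokenizer, tokenize and lcs are
made up for illustration.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Turn HTML into a list of symbols: tags are kept, text becomes "text"."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self.symbols.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.symbols.append("</%s>" % tag)

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.symbols.append("text")


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.symbols


def lcs(a, b):
    """Longest common subsequence of two symbol lists (dynamic programming)."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


page_a = "<html><body><h1>Title A</h1><p>Hello</p></body></html>"
page_b = ("<html><body><h1>Title B</h1><ul><li>x</li></ul>"
          "<p>Bye</p></body></html>")
template = lcs(tokenize(page_a), tokenize(page_b))
print("".join(template))
```

Because the comparison is on symbols rather than characters, two pages with
different text but the same markup produce the same sequence, and the LCS
is the shared skeleton of the two pages.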
What's missing:
1) after it finds all links, which it does (except for links in CSS
and JavaScript), it should loop over them, download the images, rename
them and rename the links.
2) It may introduce some spurious close tags. They have to be removed.
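For item 1, something along these lines might work. This is only a sketch
under assumptions, not what scraper.py does: it looks only at <img src="...">
attributes written with double quotes, and the names ImageFinder,
plan_renames, rewrite_links and download_images, as well as the
"image_000.png" naming scheme, are invented for the example (Python 3).

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag, without duplicates."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.sources:
                self.sources.append(src)


def plan_renames(html, prefix="image"):
    """Map each image src found in html to a new local file name."""
    finder = ImageFinder()
    finder.feed(html)
    finder.close()
    renames = {}
    for index, src in enumerate(finder.sources):
        # Keep the original extension, ignoring any query string.
        ext = os.path.splitext(urlparse(src).path)[1]
        renames[src] = "%s_%03d%s" % (prefix, index, ext)
    return renames


def rewrite_links(html, renames):
    """Point every double-quoted src attribute at its new local name."""
    for src, local in renames.items():
        html = html.replace('src="%s"' % src, 'src="%s"' % local)
    return html


def download_images(base_url, renames, folder):
    """Fetch each image (resolved against base_url) into folder."""
    os.makedirs(folder, exist_ok=True)
    for src, local in renames.items():
        urlretrieve(urljoin(base_url, src), os.path.join(folder, local))


page = ('<p><img src="/static/logo.png">'
        '<img src="http://example.com/a/photo.jpg?x=1"></p>')
renames = plan_renames(page)
print(rewrite_links(page, renames))
```

Planning the renames separately from downloading keeps the network step
optional, so the rewritten page can be checked before anything is fetched.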
Eventually I would like to add a button to the web2py admin interface
that says: "make my app look like that one" and it will scrape the
other one, download images and build a new web2py template. I am
close but I could use some help.
Massimo
On Apr 9, 2008, at 2:23 PM, Christopher Allan Webber wrote:
> It sounds interesting. I'm interested in seeing the technical reasons
> for the change to lxml, and possibly how that benefitted you. Maybe
> do another talk (or at least a lightning talk) at another ChiPy
> meeting once you're ready to open it?
>
> "Adrian Holovaty" <web at holovaty.com> writes:
>
>> On Tue, Apr 8, 2008 at 9:25 AM, Tom Printy
>> <tprinty at mail.edisonave.net> wrote:
>>> Wow this library is super cool. Anyone got slides or notes from the
>>> talk?
>>
>> Hey, that's my library and was my talk. Note that the current version
>> of templatemaker (on Google Code) is pretty "dumb" when dealing with
>> HTML.
>>
>> Since that talk, I've developed a new one, based on lxml, that
>> analyzes differences in the HTML trees. It's a *lot* better (I'd even
>> call it *awesome*), but I haven't released it open-source yet. Stay
>> tuned.
>>
>> Adrian
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> http://mail.python.org/mailman/listinfo/chicago