[Chicago] web page content scraper
Cosmin Stejerean
cstejerean at gmail.com
Thu Apr 10 00:26:52 CEST 2008
I like the idea of using web2py to clone the look and feel of a
website (but be aware of
http://www.37signals.com/svn/posts/575-but-theres-only-so-many-ways-to-do-something-right).
Also, you might find some buyers for such a project:
http://mattmaroon.com/?p=337
- Cosmin
On Wed, Apr 9, 2008 at 5:18 PM, Massimo Di Pierro
<mdipierro at cs.depaul.edu> wrote:
> I posted my toy screen scraper on
>
> http://mdp.cti.depaul.edu/examples/static/scraper.py
>
> There seems to be a lot of expertise on the list on this topic, so
> perhaps you can help make it better or just use it to make yours
> better. Currently it correctly scrapes a Wikipedia page given two
> examples, extracts all repeated tags, removes all text and replaces
> it with "text", finds all links, and handles JavaScript properly.
>
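For readers following along, the "removes all text and replaces it with
"text"" step can be sketched roughly as below. This is only an
illustration, assuming the standard library's html.parser; it is not the
code in scraper.py, and the names SkeletonBuilder and skeleton are
hypothetical.

```python
from html.parser import HTMLParser


class SkeletonBuilder(HTMLParser):
    """Rebuild a page, keeping the tags but replacing text runs with "text".

    Hypothetical helper for illustration; not taken from scraper.py.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped to keep the sketch short.
        self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Whitespace-only runs are skipped; any other text becomes "text".
        if data.strip():
            self.parts.append("text")


def skeleton(html):
    """Return the tag skeleton of an HTML string."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)
```

Calling skeleton() on two pages from the same site should give two
strings that differ only where the page structure differs.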
> It uses the longest common subsequence (LCS) applied to symbols
> (tags) instead of characters. This is something I suggested when
> Adrian gave us an excellent presentation on this topic.
>
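A minimal sketch of LCS over tags rather than characters, under the
assumption that each tag is reduced to its name (with a leading "/" for
close tags). The regex tokenizer and the function names tag_symbols and
lcs are assumptions made for illustration, not what scraper.py does.

```python
import re

# Matches an open or close tag and captures the slash and the tag name.
# Comments and doctype declarations are ignored by this simple pattern.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_symbols(html):
    """Turn a page into a list of symbols such as "div" and "/div"."""
    return [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]


def lcs(a, b):
    """Longest common subsequence of two symbol lists (dynamic programming)."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one LCS.
    common = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common
```

The tags that survive in the LCS of two example pages are the shared
layout; whatever falls outside it is page-specific content. The table
is quadratic in the number of tags, which is far smaller than the
number of characters.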
> What's missing:
> 1) After it finds all links, which it does (except for links in CSS
> and JavaScript), it should loop over them, download the images, rename
> them, and rename the links.
> 2) It may introduce some spurious close tags. They have to be removed.
>
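Item 1 could be approached along these lines. This is a hedged sketch
only: LinkCollector, plan_renames, rewrite_links and the "static/asset"
prefix are hypothetical names, and the download itself is left out so
the example runs offline (urllib.request.urlretrieve on each absolute
URL would be one way to fetch the files).

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href and src attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def plan_renames(html, base_url, prefix="static/asset"):
    """Map each original link to (absolute URL, new local name)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    mapping = {}
    for link in collector.links:
        if link in mapping:
            continue
        # Keep the file extension so the renamed file is still recognizable.
        ext = os.path.splitext(urlparse(link).path)[1]
        local = "%s%d%s" % (prefix, len(mapping), ext)
        mapping[link] = (urljoin(base_url, link), local)
    return mapping


def rewrite_links(html, mapping):
    """Replace double-quoted link values with their new local names."""
    for link, (_absolute, local) in mapping.items():
        html = html.replace('"%s"' % link, '"%s"' % local)
    return html
```

Like the scraper as described, this does not look inside CSS or
JavaScript, and it only rewrites attribute values written in double
quotes.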
> Eventually I would like to add a button to the web2py admin interface
> that says: "make my app look like that one" and it will scrape the
> other one, download images and build a new web2py template. I am
> close but I could use some help.
>
> Massimo
>
>
>
> On Apr 9, 2008, at 2:23 PM, Christopher Allan Webber wrote:
>
> > It sounds interesting. I'm interested in seeing the technical reasons
> > for the change to lxml, and possibly how that benefitted you. Maybe
> > do another talk (or at least a lightning talk) at another ChiPy
> > meeting once you're ready to open it?
> >
> > "Adrian Holovaty" <web at holovaty.com> writes:
> >
> >> On Tue, Apr 8, 2008 at 9:25 AM, Tom Printy
> >> <tprinty at mail.edisonave.net> wrote:
> >>> Wow this library is super cool. Anyone got slides or notes from the
> >>> talk?
> >>
> >> Hey, that's my library and was my talk. Note that the current version
> >> of templatemaker (on Google Code) is pretty "dumb" when dealing with
> >> HTML.
> >>
> >> Since that talk, I've developed a new one, based on lxml, that
> >> analyzes differences in the HTML trees. It's a *lot* better (I'd even
> >> call it *awesome*), but I haven't released it open-source yet. Stay
> >> tuned.
> >>
> >> Adrian
> >> _______________________________________________
> >> Chicago mailing list
> >> Chicago at python.org
> >> http://mail.python.org/mailman/listinfo/chicago
>
--
Cosmin Stejerean
http://blog.offbytwo.com