[Chicago] web page content scraper

Cosmin Stejerean cstejerean at gmail.com
Thu Apr 10 00:26:52 CEST 2008


I like the idea of using web2py to clone the look and feel of a
website (but be aware of
http://www.37signals.com/svn/posts/575-but-theres-only-so-many-ways-to-do-something-right).
Also, you might find some buyers for such a project:
http://mattmaroon.com/?p=337

- Cosmin

On Wed, Apr 9, 2008 at 5:18 PM, Massimo Di Pierro
<mdipierro at cs.depaul.edu> wrote:
> I posted my toy screen scraper on
>
>     http://mdp.cti.depaul.edu/examples/static/scraper.py
>
>  There seems to be a lot of expertise on the list on this topic, so
>  perhaps you can help make it better or just use it to make yours
>  better. Currently it correctly scrapes Wikipedia pages given two
>  examples, extracts all repeated tags, removes all text and replaces
>  it with "text", finds all links, and handles JavaScript properly.
>
>  It uses the LCS applied to symbols (tags) instead of characters.
>  Something I suggested when Adrian gave us an excellent presentation
>  on this topic.
>
>  What's missing:
>  1) After it finds all links, which it does (except for links in CSS
>  and JavaScript), it should loop over them, download the images, rename
>  them, and rename the links.
>  2) It may introduce some spurious close tags. They have to be removed.
>
>  Eventually I would like to add a button to the web2py admin interface
>  that says: "make my app look like that one" and it will scrape the
>  other one, download images and build a new web2py template. I am
>  close but I could use some help.
>
>  Massimo
>
>
>
>  On Apr 9, 2008, at 2:23 PM, Christopher Allan Webber wrote:
>
>  > It sounds interesting.  I'm interested in seeing the technical reasons
>  > for the change to lxml, and possibly how that benefitted you.  Maybe
>  > do another talk (or at least a lightning talk) at another ChiPy
>  > meeting once you're ready to open it?
>  >
>  > "Adrian Holovaty" <web at holovaty.com> writes:
>  >
>  >> On Tue, Apr 8, 2008 at 9:25 AM, Tom Printy
>  >> <tprinty at mail.edisonave.net> wrote:
>  >>> Wow this library is super cool. Anyone got slides or notes from the
>  >>>  talk?
>  >>
>  >> Hey, that's my library and was my talk. Note that the current version
>  >> of templatemaker (on Google Code) is pretty "dumb" when dealing with
>  >> HTML.
>  >>
>  >> Since that talk, I've developed a new one, based on lxml, that
>  >> analyzes differences in the HTML trees. It's a *lot* better (I'd even
>  >> call it *awesome*), but I haven't released it open-source yet. Stay
>  >> tuned.
>  >>
>  >> Adrian
>  >> _______________________________________________
>  >> Chicago mailing list
>  >> Chicago at python.org
>  >> http://mail.python.org/mailman/listinfo/chicago
>
>
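For anyone following along, here is a minimal sketch of the "LCS applied to symbols (tags)" idea from Massimo's message. It is not taken from scraper.py. It assumes LCS means longest common subsequence (it could also be read as longest common substring), and the names `tokenize`, `lcs`, `page_one` and `page_two` are made up for illustration. Each page is split into tags, every text run is replaced by the placeholder "text", and the symbols common to both pages are kept as the template.

```python
import re

# Naive split into tags and text runs (assumption: good enough for a demo;
# a real scraper would use a proper HTML parser).
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Return a list of symbols: each tag as-is, each non-blank text run as "text"."""
    symbols = []
    for token in TOKEN_RE.findall(html):
        if token.startswith("<"):
            symbols.append(token)
        elif token.strip():
            symbols.append("text")
    return symbols


def lcs(a, b):
    """Longest common subsequence of two symbol lists (dynamic programming)."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one subsequence.
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Two hypothetical pages sharing the same layout.
page_one = "<html><body><h1>Chicago</h1><p>A city.</p><ul><li>one</li></ul></body></html>"
page_two = "<html><body><h1>Boston</h1><p>Another city.</p><p>More.</p></body></html>"

template = lcs(tokenize(page_one), tokenize(page_two))
print("".join(template))
# prints <html><body><h1>text</h1><p>text</p>text</body></html>
```

Note the stray "text" outside any tag near the end: comparing flat symbol lists knows nothing about nesting, so the result is not guaranteed to be well-formed. That is probably related to the spurious close tags mentioned in point 2.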



-- 
Cosmin Stejerean
http://blog.offbytwo.com

