[BangPypers] Website change tracker

Tue Jun 19 11:04:34 CEST 2012

On Fri, Jun 8, 2012 at 10:36 PM, vid <vid at svaksha.com> wrote:

> On Fri, Jun 8, 2012 at 4:09 PM, kracethekingmaker
> <kracethekingmaker at gmail.com> wrote:
> >
> >> Hello,
> >>
> >> I am a newbie to Python coding, and I had a question. I want to write a
> >> script which will check content changes in websites & send e-mail to an
> >> admin whenever there are changes.
> >
> > How many times in a day or how often will this check be performed ?
> >
> > You must look into how to use md5 and diff utilities; for web scraping,
> > the scrapy library is advised.
> >
> >> Ideally this script/program should be scalable for, say, about 1000
> >> websites at a time.
>
> 1000 sites at a time? Wow, that's huge. Scraping that many sites is
> resource intensive, would need a nice big stable server that can
> handle the huge data dumps. Fwiw, Scrapy will only dump the data in
> the json files so check out a little about the database you want to
> use, the frontend to serve it, a queueing system to scale 1000 sites,
> etc... Also, some sites instantly ban scrapers. Watch out for that,
> and good luck :)
>

 This is much easier than you think. It looks big because
 you are solving it as a full-scale scraping problem. This is in fact
 more along the lines of an "incremental crawler".

 Write a simple crawler that keeps track of a few key entry-point
 URLs on every site. You can typically get them from the sitemap
 or from querying Google. The crawler can be hand-written or use
 existing libraries and frameworks like pycurl, scrapy, etc.
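
 To illustrate the sitemap route, here is a minimal sketch that pulls
 entry-point URLs out of a sitemap document. It assumes the site
 publishes a standard sitemaps.org <urlset> file; the function name,
 the limit parameter and the sample data are mine, purely for
 illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def entry_urls_from_sitemap(xml_text, limit=10):
    """Return up to `limit` URLs listed in <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return urls[:limit]


# Made-up sample; a real crawler would download the site's sitemap instead.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/news</loc></url>
  <url><loc>http://example.com/blog</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in entry_urls_from_sitemap(SAMPLE, limit=2):
        print(url)
```

 Sites without a sitemap would need the entry points collected some
 other way, for example by hand or from a search engine.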

 1. When crawling, use a HEAD request to fetch the page. This
 ensures you only get the headers of the page, not the data. Store
the metadata of interest in a file: use an MD5 hash of the URL as
a unique name, and use a two-level directory scheme like Squid's.
The fields of interest would be last-modified-time, etag (if any)
and content-length.

2. Recrawl at fixed intervals. Before requesting a URL, load its
metadata from the cache if it exists. Fill in the "If-Modified-Since"
header with the stored last-modified-time. You can also optionally
add "If-None-Match" with the etag, if found.

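3. If the page is not modified, the server returns HTTP 304 (Not
Modified). This is a status rather than an error, though some HTTP
client libraries raise it as an exception. Handle it. Otherwise
download the page or take whatever other action you need, and update
the cache.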
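
Steps 1 to 3 can be sketched roughly as below, using only the Python 3
standard library. Treat it as an illustration under assumptions, not
the exact system I run: the cache root name, the JSON file format, the
function names and the use of the first hash characters as the two
directory levels (in the spirit of Squid's layout, not a copy of it)
are all choices made for this example.

```python
import hashlib
import json
import os
import urllib.error
import urllib.request

CACHE_ROOT = "crawl-cache"  # hypothetical cache directory


def cache_path(url, root=CACHE_ROOT):
    """Map a URL to root/ab/cd/<md5>, a two-level layout keyed by MD5."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest)


def load_meta(url, root=CACHE_ROOT):
    """Return cached metadata for a URL, or {} if none is stored."""
    try:
        with open(cache_path(url, root)) as fh:
            return json.load(fh)
    except (OSError, ValueError):
        return {}


def save_meta(url, meta, root=CACHE_ROOT):
    path = cache_path(url, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        json.dump(meta, fh)


def conditional_headers(meta):
    """Build If-Modified-Since / If-None-Match from cached metadata."""
    headers = {}
    if meta.get("last-modified"):
        headers["If-Modified-Since"] = meta["last-modified"]
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    return headers


def check(url, root=CACHE_ROOT):
    """Return False on HTTP 304, True if the metadata is new or changed."""
    meta = load_meta(url, root)
    req = urllib.request.Request(
        url, headers=conditional_headers(meta), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            new_meta = {
                "last-modified": resp.headers.get("Last-Modified"),
                "etag": resp.headers.get("ETag"),
                "content-length": resp.headers.get("Content-Length"),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an exception
            return False
        raise
    changed = new_meta != meta
    save_meta(url, new_meta, root)  # update the cache
    return changed
```

When check() returns True, that is the point to download the page, diff
it, or mail the admin. Servers that send neither Last-Modified nor ETag
give you only content-length to compare, so for those you may need to
fetch the body and hash it instead.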

For 1000 sites, partition the sites into multiple sets and do such
incremental crawls frequently. Use random selection to pick the sites
per set.

Use random selection of starting URLs to ensure you visit most parts
of a site over subsequent crawls.
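
A small sketch of the partitioning and random selection, again only an
illustration: the function names, the number of sets and the number of
starting URLs per crawl are assumptions to tune for your own setup.

```python
import random


def partition(sites, n_sets, rng=random):
    """Shuffle the sites and deal them into n_sets roughly equal sets."""
    shuffled = list(sites)
    rng.shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]


def pick_start_urls(entry_urls, k, rng=random):
    """Pick up to k distinct starting URLs at random for one crawl."""
    return rng.sample(entry_urls, min(k, len(entry_urls)))


if __name__ == "__main__":
    sites = ["site%d.example" % i for i in range(1000)]
    sets = partition(sites, 10)
    print(len(sets), "sets of about", len(sets[0]), "sites each")
    print(pick_start_urls(["/", "/news", "/blog", "/about"], 2))
```

Each set can then be handed to its own process or scheduled run, and
because the starting URLs change from crawl to crawl, different parts
of a site get visited over time.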

I have written such systems before and still maintain them. It is an
interesting area. Ask if you have specific questions.


-- 
Regards,

--Anand