philip at semanchuk.com
Wed Nov 19 15:41:38 CET 2008
On Nov 19, 2008, at 7:12 AM, Mr.SpOOn wrote:
> On Wed, Nov 19, 2008 at 2:39 AM, Mensanator <mensanator at aol.com> wrote:
>> Another hobby I have is tracking movie box-office receipts
>> (where you can make interesting graphs comparing Titanic
>> to Harry Potter or how well the various sequels do, if Pierce
>> Brosnan saved the James Bond franchise, what can you say about
>> Daniel Craig?). Lots of potential database problems there.
>> Not to mention automating the data collection from the Internet
>> Movie Database by writing a web page scraper that can grab
>> six months worth of data in a single session (you probably
>> wouldn't need this if you cough up a subscription fee for
>> professional access, but I'm not THAT serious about it).
> This is really interesting. What would one need to do such a thing?
> The only web-related program I've written in Python generated an RSS
> feed from a local newspaper's static site, using BeautifulSoup. But I
> never put it on an online host, and I'm not even sure I could run it
> there. What would a host need in order to run Python code?
I'm not sure why you'd need to host the Python code anywhere other
than your home computer. If you wanted to pull thousands of pages from
a site like that, you'd need to respect their robots.txt file. Don't
forget to look for a crawl-delay specification. Even if they don't
specify one, you shouldn't let your bot hammer their servers at full
speed -- give it a delay and let it run in the background. It might take
you three days instead of an hour to collect the data you need, but
that's not too big a deal in the service of good manners, is it?
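The stdlib already covers the robots.txt side of this. Here's a minimal
politeness sketch using urllib.robotparser: the sample rules and URLs are
hypothetical, and against a real site you'd load the live file with
rp.set_url(".../robots.txt") followed by rp.read() instead of parsing an
inline string.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content for illustration; in practice use
# rp.set_url("https://example.com/robots.txt"); rp.read()
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines())

# Honor the site's Crawl-delay if it specifies one, otherwise fall
# back to a conservative default so the bot never runs at full speed.
DELAY = rp.crawl_delay("*") or 5

def allowed(url):
    """True if robots.txt permits fetching this URL."""
    return rp.can_fetch("*", url)

# Before each request: check allowed(url), fetch, then time.sleep(DELAY).
```

The crawl loop itself is then just: skip anything `allowed()` rejects, and
sleep for `DELAY` seconds between fetches.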
You might also want to change the user-agent string that you send out.
Some sites serve up different content to bots than to browsers.
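With urllib that's just an extra header on the Request; the user-agent
string below is only an example value, not a recommendation of what to
send, and the URL is a placeholder.

```python
from urllib.request import Request, urlopen

# Example browser-like user-agent string; substitute your own.
UA = "Mozilla/5.0 (X11; Linux x86_64) MyScraper/0.1"

req = Request("https://www.example.com/", headers={"User-Agent": UA})
# html = urlopen(req).read()   # the actual network call, omitted here
```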
You could even use wget to scrape the site instead of rolling your own
bot if you're more interested in the data manipulation aspect of the
project than the bot writing.
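For the wget route, something along these lines would combine the points
above (delay, user-agent) in one command; the URL is a placeholder and
you'd tune the depth to the site's layout:

```shell
# Mirror part of a site politely: wait between requests, identify
# yourself, stay below the starting directory.
wget --wait=5 --random-wait \
     --user-agent="Mozilla/5.0 (compatible; my-scraper)" \
     --recursive --level=2 --no-parent \
     https://www.example.com/boxoffice/
```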