[Tutor] Question about scraping

Alan Gauld alan.gauld at btinternet.com
Fri May 30 20:20:50 CEST 2014


On 30/05/14 18:25, Matthew Ngaha wrote:
> Hey all. I've been meaning to get into web scraping and was pointed to
> the directions of lxml (library) and scrapy (framework). Can I ask in
> terms of web scraping, what's the difference between a library and a
> framework?

I don't know of anything web specific. A framework tends to be a much 
bigger thing than a library. It dictates the architecture of the 
solution rather than just providing a few functions/classes.
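
As a rough illustration of that difference, here is a minimal sketch. The
names extract_title and MiniFramework are made up for this example and are
not the API of lxml, scrapy or any other real package. With a library your
code makes the calls; with a framework you hand over your function and the
framework decides when to call it.

```python
# Library style: your code is in charge and calls the helper when it wants.
def extract_title(html):
    """Hypothetical library function: return the text inside <title>."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end]


# Framework style: the framework owns the main loop and calls your code.
class MiniFramework:
    """Hypothetical toy framework, invented purely for illustration."""

    def __init__(self):
        self.handlers = []

    def register(self, handler):
        # You plug your function in; you do not decide when it runs.
        self.handlers.append(handler)
        return handler

    def run(self, pages):
        # The framework dictates the control flow: every page, every handler.
        results = []
        for page in pages:
            for handler in self.handlers:
                results.append(handler(page))
        return results


pages = [
    "<html><title>First</title></html>",
    "<html><title>Second</title></html>",
]

# Library usage: we write the loop ourselves.
library_results = [extract_title(p) for p in pages]

# Framework usage: we register a callback and let the framework loop.
app = MiniFramework()
app.register(extract_title)
framework_results = app.run(pages)

print(library_results)
print(framework_results)
```

Both approaches give the same titles here. The difference is only in who
owns the control flow, and that is what I mean by dictating the architecture.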

> Surely everyone should use a framework

Why?
A framework is usually the fastest way to get started from zero, but if 
you are integrating with an existing solution then a framework can add 
layers of unneeded complexity. As always, the correct solution depends on 
the problem.

> I also have another question due to reading this: "[Tutor] HTML
> Parsing" . It seems some experienced coders don't find scraping as
> useful since web sites offer apis for their data. Is the idea/concept
> here the same as scraping?

No, it's completely different. Scraping means trying to decipher a public 
web page that is designed for display in a browser. Web pages are prone 
to frequent change and the data often moves around within the page, 
meaning constant updates to your scraper. Also, web pages are 
increasingly dynamically generated, which makes scraping much harder.

An API is relatively stable and returns just the data elements of
the page. As such it's usually easier to use, more secure,
more stable, faster (lower bandwidth required) and has much
less impact on the provider's network/servers, thus improving
performance for everyone.
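
To make the contrast concrete, here is a small sketch using only the
standard library. The two page layouts, the "price" class name and the JSON
reply are all invented for the example; a real site and a real API will
look different. The scraper depends on the page markup, so a redesign
silently breaks it, while the API reply is plain data.

```python
import json
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collects the text of the first <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False


def scrape_price(html):
    """Return the price text, or None if the expected markup is missing."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.price


# Hypothetical page before and after a redesign (same data, new markup).
old_page = '<html><body><h1>Widget</h1><span class="price">9.99</span></body></html>'
new_page = '<html><body><h1>Widget</h1><div class="cost"><b>9.99</b></div></body></html>'

# Hypothetical API reply carrying just the data.
api_reply = '{"name": "Widget", "price": 9.99}'

scraped_old = scrape_price(old_page)         # finds the price
scraped_new = scrape_price(new_page)         # markup changed, nothing found
api_price = json.loads(api_reply)["price"]   # unaffected by page layout

print(scraped_old, scraped_new, api_price)
```

The scraper has to be rewritten every time the layout changes, whereas
the API client keeps working as long as the provider keeps the reply
format stable.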

> And is there any use of scraping anymore
> when sites are now offering their data?

If a site offers an API that returns the data you need then use it.
If not, you have few alternatives to scraping (although scraping
may be 'illegal' anyway due to the impact on other users). But scraping,
whether of a web page, a GUI or an old mainframe terminal,
is always a fragile and unsatisfactory solution. An API will
always be better in the long term if it exists.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos


