[Tutor] Request review: A DSL for scraping a web page

Alan Gauld alan.gauld at btinternet.com
Thu Apr 2 10:22:02 CEST 2015


On 02/04/15 04:18, Joe Farro wrote:
> Hello,
>
> I recently wrote a python package and was wondering if anyone might have
> time to review it?

This list is for people learning Python and for questions
about the core language and standard library. I suspect this is
more appropriate to the main Python list.

However, to make any meaningful comments we would probably need a bit
more of a specification of what your module does.

> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)

DSL?

> I generally find my scraping code ends up being rather chaotic with
> querying, regex manipulations, conditional processing, conversions, etc.,
> ending up being too close together and sometimes interwoven. It's stressful.

Have you looked at the existing web scraping tools in Python?
There are several to pick from. They all avoid the kind of mess
you describe.

> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing
> is left to be done down the pipeline. It's almost just
> a configuration file.

> Here is an example that would get the text and URL for every link in a page:
>
>      $ a
>          save each: links
>              | [href]
>                  save: url
>              | text
>                  save: link_text

And how is that run?
What is the syntax for the config file?
It is not self-evident. The other example on GitHub is no less obscure.
I'm sure it means something to you but it is not obvious.

OK, I see there is much more on GitHub. Sadly, too much for me to
plough through just now.

> The result would be something along these lines:
>
>      {
>          'links': [
>              {
>                  'url': 'http://www.something.com/hm',
>                  'link_text': 'The text in the link'
>              },
>              # etc... another dict for each <a> tag
>          ]
>      }

This seems straightforward.
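
For comparison, here is a minimal sketch of how a result of that shape
could be produced using only the standard library's html.parser module.
It assumes Python 3, and the names LinkCollector and scrape_links are
made up for illustration; they are not part of your package, and this
is not a claim about how your DSL works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and text of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # dict for the <a> tag being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._current = {"url": dict(attrs).get("href"), "link_text": ""}

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag
        if self._current is not None:
            self._current["link_text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["link_text"] = self._current["link_text"].strip()
            self.links.append(self._current)
            self._current = None


def scrape_links(html):
    """Return {'links': [...]} with one dict per <a> tag in the page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return {"links": parser.links}
```

Calling scrape_links(page_source) on a page's HTML returns a dict with a
'links' list like the one quoted above. Even this small case mixes the
finding and the saving together, which I take to be the problem you are
trying to solve.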

> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
>
> This is my first go at something along these lines, so any feedback is
> welcomed.

I think the main python list is a better bet for feedback on something 
of this size.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



