[Tutor] Request review: A DSL for scraping a web page
Alan Gauld
alan.gauld at btinternet.com
Thu Apr 2 10:22:02 CEST 2015
On 02/04/15 04:18, Joe Farro wrote:
> Hello,
>
> I recently wrote a python package and was wondering if anyone might have
> time to review it?
This list is for people learning Python and for answering questions
about the core language and standard library. I suspect this is
more appropriate to the main python list.
However, to make any meaningful comments we would probably need a bit
more of a specification to know what your module does.
> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)
DSL?
> I generally find my scraping code ends up being rather chaotic with
> querying, regex manipulations, conditional processing, conversions, etc.,
> ending up being too close together and sometimes interwoven. It's stressful.
Have you looked at the existing web scraping tools in Python?
There are several to pick from. They all avoid the kind of mess
you describe.
> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing
> is left to be done down the pipeline. It's almost just
> a configuration file.
> Here is an example that would get the text and URL for every link in a page:
>
> $ a
> save each: links
> | [href]
> save: url
> | text
> save: link_text
And how is that run?
What is the syntax for the config file?
It is not self-evident. The other example on GitHub is no less obscure.
I'm sure it means something to you but it is not obvious.
OK, I see there is much more on GitHub. Sadly, too much for me to
plough through just now.
> The result would be something along these lines:
>
> {
> 'links': [
> {
> 'url': 'http://www.something.com/hm',
> 'link_text': 'The text in the link'
> },
> # etc... another dict for each <a> tag
> ]
> }
This seems straightforward.
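For comparison, here is a rough sketch of how the same structure could be
built with nothing but the standard library's html.parser. It is not how
your package works, and the names LinkCollector and scrape_links are made
up for illustration; it only shows the shape of the result above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and text of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # dict for the <a> currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self._current = {"url": dict(attrs).get("href"), "link_text": ""}

    def handle_data(self, data):
        # Accumulate text only while inside an <a> element
        if self._current is not None:
            self._current["link_text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["link_text"] = self._current["link_text"].strip()
            self.links.append(self._current)
            self._current = None


def scrape_links(html):
    """Return {'links': [...]} with one dict per <a> tag in the page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return {"links": parser.links}


sample = '<p><a href="http://www.something.com/hm">The text in the link</a></p>'
result = scrape_links(sample)
print(result)
```

Any post-processing (regexes, conversions and so on) would then operate on
the returned dict, which is the separation you describe.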
> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
>
> This is my first go at something along these lines, so any feedback is
> welcomed.
I think the main python list is a better bet for feedback on something
of this size.
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos