[Tutor] Request review: A DSL for scraping a web page
Joe Farro
joe.farro at gmail.com
Thu Apr 2 05:18:21 CEST 2015
Hello,
I recently wrote a python package and was wondering if anyone might have
time to review it?
I'm fairly new to python - it's been about 1/2 of my workload at work for
the past year. Any suggestions would be super appreciated.
https://github.com/tiffon/take
https://pypi.python.org/pypi/take
The package implements a DSL that is intended to make web-scraping a bit
more maintainable :)
I generally find my scraping code ends up rather chaotic: querying, regex
manipulation, conditional processing, conversions, etc. all end up too close
together and sometimes interwoven. It's stressful.
The DSL attempts to mitigate this by doing only two things: finding stuff
and saving it as a string. The post-processing is left to be done down the
pipeline. It's almost just a configuration file.
Here is an example that would get the text and URL for every link in a page:
$ a
    save each: links
        | [href]
            save: url
        | text
            save: link_text
The result would be something along these lines:
{
    'links': [
        {
            'url': 'http://www.something.com/hm',
            'link_text': 'The text in the link'
        },
        # etc... another dict for each <a> tag
    ]
}
The hope is that having all the selectors in one place will make them more
manageable and possibly simplify the post-processing.
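For contrast, here's a rough sketch of what the same "get every link's URL
and text" job looks like when written by hand with only the standard
library (the class name LinkScraper is just for illustration) - this is the
sort of ad-hoc code the DSL is meant to replace:

```python
# Hand-rolled link extraction using only the stdlib, for comparison
# with the DSL example above.
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []      # collected {'url': ..., 'link_text': ...} dicts
        self._in_a = False   # are we currently inside an <a> tag?

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self.links.append({'url': dict(attrs).get('href', ''),
                               'link_text': ''})

    def handle_data(self, data):
        if self._in_a and self.links:
            self.links[-1]['link_text'] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

scraper = LinkScraper()
scraper.feed('<a href="http://www.something.com/hm">The text in the link</a>')
print(scraper.links)
```

Even this minimal version mixes traversal state, querying, and result
building in one class, which is the "interwoven" problem mentioned above.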
This is my first go at something along these lines, so any feedback is
welcome.
Thanks!
Joe