[Tutor] Request review: A DSL for scraping a web page

Joe Farro joe.farro at gmail.com
Thu Apr 2 05:18:21 CEST 2015


Hello,

I recently wrote a Python package and was wondering if anyone might have
time to review it?

I'm fairly new to Python - it's been about half of my workload at work
for the past year. Any suggestions would be super appreciated.

https://github.com/tiffon/take

https://pypi.python.org/pypi/take


The package implements a DSL that is intended to make web-scraping a bit
more maintainable :)

I generally find my scraping code ends up rather chaotic: the querying,
regex manipulations, conditional processing, conversions, etc., all sit
too close together and sometimes get interwoven. It's stressful.
The DSL attempts to mitigate this by doing only two things: finding stuff
and saving it as a string. The post-processing is left to be done down the
pipeline. It's almost just a configuration file.

Here is an example that would get the text and URL for every link in a page:

    $ a
        save each: links
            | [href]
                save: url
            | text
                save: link_text


The result would be something along these lines:

    {
        'links': [
            {
                'url': 'http://www.something.com/hm',
                'link_text': 'The text in the link'
            },
            # etc... another dict for each <a> tag
        ]
    }
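
For comparison, doing the same thing by hand with BeautifulSoup might
look something like this (a rough sketch; the helper name and parser
choice are just for illustration, not part of the package):

    from bs4 import BeautifulSoup

    def scrape_links(html):
        # Roughly what the template above describes: find every <a>
        # tag, save its href attribute as 'url' and its text as
        # 'link_text'.
        links = []
        for a in BeautifulSoup(html, 'html.parser').select('a'):
            links.append({
                'url': a.get('href'),
                'link_text': a.get_text(),
            })
        return {'links': links}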


The hope is that having all the selectors in one place will make them more
manageable and possibly simplify the post-processing.
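
Since the DSL only ever saves strings, the conversions can all live in
one downstream function. A minimal sketch of what that might look like
(the function name and the URL-normalization step are just illustrative):

    from urllib.parse import urljoin

    def post_process(result, base_url):
        # One place, after the scrape, for all the conversions the
        # template deliberately doesn't do.
        for link in result['links']:
            link['url'] = urljoin(base_url, link['url'] or '')
            link['link_text'] = link['link_text'].strip()
        return result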

This is my first go at something along these lines, so any feedback is
welcome.

Thanks!

Joe

