[Python-ideas] A service to crawl +1s and URLs out of mailman archives

Tue Dec 2 11:51:30 CET 2014

On Mon, Dec 1, 2014 at 12:51 PM, Steven D'Aprano <steve at pearwood.info>
wrote:

> On Mon, Dec 01, 2014 at 09:52:41AM -0600, Wes Turner wrote:
>
> > In context to building a PEP or similar, I don't know how many times I've
> > trawled looking for:
> >
> > * Docs links
> > * Source links
> > * Patch links
> > * THREAD POST LINKS
> > * Consensus
> >
> > A tool to crawl structured and natural language data from the forums could
> > be very useful for preparing PEPs.
>
> Yes it would be. Do you have any idea how to write such a tool?
>
> Do you think such a tool would be of enough interest to enough people
> that it should be distributed in the Python standard library?
>

Such a module would undoubtedly rely upon external libraries like:
requests, celery, beautifulsoup, and NLTK.

And whatever is necessary to poll mailman without asyncio (e.g. channels,
websockets).
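
As a rough sketch of the easy part (pulling URLs and +1/-1 votes out of one
message body), here is something that uses only the standard library rather
than the libraries named above. The regular expressions, the function names,
and the rule that quoted lines (those starting with ">") are skipped when
counting votes are all my assumptions, not anything mailman provides:

```python
import re
from collections import Counter

# Assumption: a URL runs until whitespace, an angle bracket, a quote or ")".
URL_RE = re.compile(r"https?://[^\s<>\"')]+")
# Assumption: a vote is +1, -1, +0 or -0 at the start of a line.
VOTE_RE = re.compile(r"^\s*([+-][01])(?!\d)", re.MULTILINE)


def extract_urls(body):
    """Return the unique URLs in a message body, in order of first appearance."""
    seen = []
    for url in URL_RE.findall(body):
        url = url.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def count_votes(body):
    """Count votes on unquoted lines only, so quoted replies are not re-counted."""
    unquoted = "\n".join(
        line for line in body.splitlines() if not line.lstrip().startswith(">")
    )
    return Counter(VOTE_RE.findall(unquoted))
```

This only tallies explicit votes; it says nothing about "consensus", which is
the hard natural language part.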

This is sort of in scope for python-ideas, as a general observation:
*linked* development artifacts are traceable, reproducible, and focused
on the task: docs, code, and tests (the build).

> I think that this would make a great project on PyPI, especially since
> it may take a long, long time for it to develop enough intelligence to
> be able to do the job you're suggesting. Finding links to documentation
> and source code is fairly straightforward, but building in the
> intelligence to find "consensus" is a non-trivial application of natural
> language processing and an impressive feat of artificial intelligence.
> It certainly doesn't sound like something that somebody could write over
> a weekend and add to the 3.5 standard library, it's more like an
> on-going project that will see continual development for many years.
>
>
https://github.com/wrdrd/docs/blob/master/wrdrd/tools/crawl.py

Issues:

* Too many HTTP requests
* Inefficient
* A real live queue could be helpful
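
For the request-count issue, one possible direction is a plain in-process
queue plus a "seen" set, so each archive URL is fetched at most once. This is
a minimal sketch and not what crawl.py currently does; `crawl`, `fetch`, and
`max_pages` are hypothetical names, and `fetch` is any callable that maps a
URL to HTML text (for example a thin wrapper around requests.get):

```python
import re
from collections import deque
from urllib.parse import urljoin

# Assumption: links are double-quoted href attributes; fragments are dropped.
HREF_RE = re.compile(r'href="([^"#]+)', re.IGNORECASE)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl below start_url, fetching each URL at most once."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            # Stay inside the archive and skip anything already queued.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Swapping the deque for a real task queue (celery or similar) would keep the
same shape: the "seen" set is what avoids the duplicate requests.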

Thank you for your feedback!