On Mon, Dec 1, 2014 at 12:51 PM, Steven D'Aprano wrote:
On Mon, Dec 01, 2014 at 09:52:41AM -0600, Wes Turner wrote:
In the context of building a PEP or similar, I can't count how many times I've trawled looking for:
* Docs links
* Source links
* Patch links
* THREAD POST LINKS
* Consensus
A tool to crawl structured and natural language data from the forums could be very useful for preparing PEPs.
Yes it would be. Do you have any idea how to write such a tool?
Do you think such a tool would be of enough interest to enough people that it should be distributed in the Python standard library?
Such a module would undoubtedly rely upon external libraries like requests, celery, BeautifulSoup, and NLTK, plus whatever is necessary to poll Mailman without asyncio (e.g. channels, websockets). This is sort of in scope for python-ideas, as a general observation: *linked* development artifacts are traceable, reproducible, and keep tasks focused on docs, code, and tests (the build).
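To make the "straightforward" part concrete, here is a minimal sketch of the link-extraction step using only the standard library (html.parser) rather than the third-party libraries named above. The classify_link() heuristics and the sample HTML are illustrative assumptions, not code from any actual tool:

```python
# Hypothetical sketch: pull <a href> values out of an archive page and
# bucket them by artifact type (docs / source / patch / thread).
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_link(url):
    """Rough, illustrative buckets for the artifact types listed above."""
    if "docs.python.org" in url:
        return "docs"
    if "bugs.python.org" in url:
        return "patch"
    if "github.com" in url or "hg.python.org" in url:
        return "source"
    if "/pipermail/" in url:
        return "thread"
    return "other"


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return [(classify_link(url), url) for url in parser.links]


sample = """
<a href="https://docs.python.org/3/library/queue.html">queue docs</a>
<a href="https://mail.python.org/pipermail/python-ideas/">thread</a>
"""
print(extract_links(sample))
```

Finding "consensus" in the surrounding prose is, as you say, the genuinely hard NLP problem; this only covers the easy, mechanical part.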
I think that this would make a great project on PyPI, especially since it may take a long, long time for it to develop enough intelligence to be able to do the job you're suggesting. Finding links to documentation and source code is fairly straightforward, but building in the intelligence to find "consensus" is a non-trivial application of natural language processing and an impressive feat of artificial intelligence. It certainly doesn't sound like something that somebody could write over a weekend and add to the 3.5 standard library; it's more like an ongoing project that will see continual development for many years.
https://github.com/wrdrd/docs/blob/master/wrdrd/tools/crawl.py

Issues:
* Too many HTTP requests
* Inefficient
* A real live queue could be helpful

Thank you for your feedback!
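On the "real queue" point, a sketch of the usual shape: a worker pool draining a queue.Queue, with a seen-set so each URL is fetched at most once. fetch() here is a stub standing in for the real HTTP request, and the whole thing is an illustrative assumption, not the crawl.py implementation:

```python
# Sketch: bounded worker pool over queue.Queue, deduplicating URLs.
import queue
import threading


def fetch(url):
    """Stub standing in for an HTTP request: return the links on a page."""
    fake_graph = {
        "index": ["page1", "page2"],
        "page1": ["page2"],
        "page2": [],
    }
    return fake_graph.get(url, [])


def crawl(start, num_workers=4):
    seen = {start}          # URLs already queued, to avoid duplicate requests
    lock = threading.Lock()
    work = queue.Queue()
    work.put(start)
    results = []

    def worker():
        while True:
            url = work.get()
            results.append(url)
            for link in fetch(url):
                with lock:
                    if link not in seen:
                        seen.add(link)
                        work.put(link)
            work.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    work.join()             # returns once every queued URL is processed
    return sorted(results)


print(crawl("index"))
```

The same structure maps onto celery (the queue becomes a broker, each worker a task consumer) once it needs to scale past one process; the in-process version above already fixes the duplicate-request problem.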