On Mon, Dec 1, 2014 at 12:51 PM, Steven D'Aprano wrote:
On Mon, Dec 01, 2014 at 09:52:41AM -0600, Wes Turner wrote:
In the context of building a PEP or similar, I can't count how many times I've trawled looking for:
* Docs links
* Source links
* Patch links
* THREAD POST LINKS
* Consensus
A tool to crawl structured and natural language data from the forums could be very useful for preparing PEPs.
Yes it would be. Do you have any idea how to write such a tool?
Do you think such a tool would be of enough interest to enough people that it should be distributed in the Python standard library?
Such a module would undoubtedly rely upon external libraries like requests, celery, BeautifulSoup, and NLTK, plus whatever is necessary to poll Mailman without asyncio (e.g. channels, websockets). This is sort of in scope for python-ideas, as a general observation: *linked* development artifacts are traceable, reproducible, and keep tasks focused on docs, code, and tests (the build).
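To make the "straightforward" part concrete, here is a minimal sketch of the link-extraction step using only the standard library (html.parser) rather than the third-party libraries named above. The classify_link() heuristics and the sample HTML are illustrative assumptions, not code from any actual tool:

```python
# Hypothetical sketch: pull <a href> values out of an archive page and
# bucket them by artifact type (docs / source / patch / thread).
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_link(url):
    """Rough, illustrative buckets for the artifact types listed above."""
    if "docs.python.org" in url:
        return "docs"
    if "bugs.python.org" in url:
        return "patch"
    if "github.com" in url or "hg.python.org" in url:
        return "source"
    if "/pipermail/" in url:
        return "thread"
    return "other"


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return [(classify_link(url), url) for url in parser.links]


sample = """
<a href="https://docs.python.org/3/library/queue.html">queue docs</a>
<a href="https://mail.python.org/pipermail/python-ideas/">thread</a>
"""
print(extract_links(sample))
```

Finding "consensus" in the surrounding prose is, as you say, the genuinely hard NLP problem; this only covers the easy, mechanical part.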
I think that this would make a great project on PyPI, especially since it may take a long, long time for it to develop enough intelligence to be able to do the job you're suggesting. Finding links to documentation and source code is fairly straightforward, but building in the intelligence to find "consensus" is a non-trivial application of natural language processing and an impressive feat of artificial intelligence. It certainly doesn't sound like something that somebody could write over a weekend and add to the 3.5 standard library; it's more like an ongoing project that will see continual development for many years.
https://github.com/wrdrd/docs/blob/master/wrdrd/tools/crawl.py

Issues:
* Too many HTTP requests
* Inefficient
* A real live queue could be helpful

Thank you for your feedback!
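On the "real queue" point, a sketch of the usual shape: a worker pool draining a queue.Queue, with a seen-set so each URL is fetched at most once. fetch() here is a stub standing in for the real HTTP request, and the whole thing is an illustrative assumption, not the crawl.py implementation:

```python
# Sketch: bounded worker pool over queue.Queue, deduplicating URLs.
import queue
import threading


def fetch(url):
    """Stub standing in for an HTTP request: return the links on a page."""
    fake_graph = {
        "index": ["page1", "page2"],
        "page1": ["page2"],
        "page2": [],
    }
    return fake_graph.get(url, [])


def crawl(start, num_workers=4):
    seen = {start}          # URLs already queued, to avoid duplicate requests
    lock = threading.Lock()
    work = queue.Queue()
    work.put(start)
    results = []

    def worker():
        while True:
            url = work.get()
            results.append(url)
            for link in fetch(url):
                with lock:
                    if link not in seen:
                        seen.add(link)
                        work.put(link)
            work.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    work.join()             # returns once every queued URL is processed
    return sorted(results)


print(crawl("index"))
```

The same structure maps onto celery (the queue becomes a broker, each worker a task consumer) once it needs to scale past one process; the in-process version above already fixes the duplicate-request problem.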