[Python-ideas] A service to crawl +1s and URLs out of mailman archives
wes.turner at gmail.com
Tue Dec 2 11:51:30 CET 2014
On Mon, Dec 1, 2014 at 12:51 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, Dec 01, 2014 at 09:52:41AM -0600, Wes Turner wrote:
> > In context to building a PEP or similar, I don't know how many times I've
> > trawled looking for:
> > * Docs links
> > * Source links
> > * Patch links
> > * THREAD POST LINKS
> > * Consensus
> > A tool to crawl structured and natural-language data from the forums could
> > be very useful for preparing PEPs.
> Yes it would be. Do you have any idea how to write such a tool?
> Do you think such a tool would be of enough interest to enough people
> that it should be distributed in the Python standard library?
Such a module would undoubtedly rely upon external libraries like:
requests, celery, beautifulsoup, and NLTK.
And whatever is necessary to poll mailman without asyncio (e.g. channels,
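As a rough sketch of the link-crawling part (stdlib-only here, where in practice requests + beautifulsoup would do the fetching and parsing; the helper names and URL patterns are illustrative assumptions, not a fixed design):

```python
from html.parser import HTMLParser

# Sketch only: a stdlib stand-in for beautifulsoup's tag traversal.
class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in an archive page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def classify_links(html_text):
    """Bucket links into the categories listed above (docs, source,
    patches, thread posts).  The domain heuristics are assumptions."""
    parser = LinkExtractor()
    parser.feed(html_text)
    buckets = {"docs": [], "source": [], "patches": [], "threads": [], "other": []}
    for url in parser.links:
        if "docs.python.org" in url:
            buckets["docs"].append(url)
        elif "hg.python.org" in url or "github.com" in url:
            buckets["source"].append(url)
        elif "bugs.python.org" in url:
            buckets["patches"].append(url)
        elif "pipermail" in url or "mail.python.org" in url:
            buckets["threads"].append(url)
        else:
            buckets["other"].append(url)
    return buckets
```

Fetching each month's pipermail page (with requests, say) and feeding it through classify_links would produce the raw material a PEP author currently trawls for by hand.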
This is sort of in scope for python-ideas,
as a general observation that *linked* development artifacts
are traceable, reproducible, and task-focused on docs, code, and tests.
> I think that this would make a great project on PyPI, especially since
> it may take a long, long time for it to develop enough intelligence to
> be able to do the job you're suggesting. Finding links to documentation
> and source code is fairly straightforward, but building in the
> intelligence to find "consensus" is a non-trivial application of natural
> language processing and an impressive feat of artificial intelligence.
> It certainly doesn't sound like something that somebody could write over
> a weekend and add to the 3.5 standard library, it's more like an
> on-going project that will see continual development for many years.
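The easy end of that spectrum -- tallying explicit +1/-1 votes, as opposed to genuinely detecting consensus -- can be sketched with a regular expression. This is a deliberately naive sketch: it ignores quoted replies, sarcasm, and conditional votes, which is exactly the hard NLP part described above.

```python
import re

# Match an explicit vote (+1, -1, +0, -0) at the start of a line.
VOTE_RE = re.compile(r"^\s*([+-][01])\b", re.MULTILINE)

def tally_votes(messages):
    """Return a dict mapping vote strings to counts across message bodies."""
    counts = {}
    for body in messages:
        for vote in VOTE_RE.findall(body):
            counts[vote] = counts.get(vote, 0) + 1
    return counts
```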
* Naive crawling would generate too many HTTP requests
* A real task queue could be helpful for rate-limiting and retries
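A minimal illustration of that second point: serializing fetches through a queue with a crude politeness delay, rather than firing requests ad hoc. Celery would be the real tool here; `fetch` is a hypothetical stand-in for whatever performs the HTTP request.

```python
import queue
import time

def drain(url_queue, fetch, delay=0.01):
    """Process queued URLs one at a time, pausing between requests.

    `fetch` is a stand-in callable (requests.get, a celery task, ...);
    a real queue such as celery would also provide retries and
    distribution across workers.
    """
    results = []
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return results
        results.append(fetch(url))
        time.sleep(delay)  # crude politeness delay between requests
```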
Thank you for your feedback!