As poet Gene Fowler used to remind me occasionally, sometimes you *do*
want to reinvent the wheel, because that's a learning experience. As
we "climb the ladder of learning" (old metaphor), we sometimes cut our
teeth on time-tested exercises.
Textbooks set the standard in the pre-millennium era, and we're still
calling them textbooks, even as we phase out wood pulp in favor of
Kindles or what have you.
What I'm thinking would be a fun FOSS project is a "buzzbot".
Companies are making bucks off these things in-house. My associate
Patrick Barton has written one for a company in Chicago. I've
encouraged him to approach WebTrends here in Portland to see if there's
overlap between silos, but he already has potential clients on the
hook in Hollywood, and wonders if I want to manage his code. I came
back with: as a team effort, it sounds great. Do something in Mercurial?
The idea would be something like:
>>> import buzzbot
>>> search = buzzbot.Search("Britney Spears", engines=['Google', 'Yahoo', 'Technorati'], filter=['blogs'], target='SQL_base')
>>> search.run()
5000 blog posts downloaded to SQL_base (3.12 mins)
>>>
What you'd get are screen-scraped texts, purged (we hope) of most of
the JavaScript and extraneous XHTML.
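
Here's a minimal sketch of what that purging step might look like,
using only the standard library's html.parser. The names TextExtractor
and strip_markup are made up for illustration; they are not from
Patrick's code, and a real scraper would need to cope with far messier
pages than this does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping anything inside <script> or <style>."""

    SKIP = {"script", "style"}  # hypothetical choice of tags to purge

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while we are inside a skipped tag
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Return the visible text of an HTML string (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_markup() on a downloaded page would hand back one plain
string per post, which is roughly what the scoring step wants as input.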
The final step is scoring and reporting, which is where we might only
supply some skeletal algorithms. Patrick has a list of cuss words in
English and it's easy to write a scoring algorithm that scans for
these with the bias that some expletives have a negative connotation,
others positive. It's like SpamAssassin in terms of having all these
rules.
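
To make that concrete, here's a skeletal rule-based scorer of the kind
I have in mind. The RULES table, its weights, and the score() function
are all placeholders I invented for this example (tame stand-ins, not
Patrick's list); the point is only that each rule carries a signed
weight and the weights get summed, SpamAssassin style.

```python
import re

# Hypothetical rule table: word -> weight. Negative weights mark words we
# assume carry a negative connotation, positive weights the opposite.
RULES = {
    "awesome": 2.0,
    "love": 1.5,
    "sucks": -2.0,
    "awful": -1.5,
    "damn": -0.5,
}

WORD_RE = re.compile(r"[a-z']+")


def score(text, rules=RULES):
    """Return (total, hits): the summed weights and the matched words."""
    total = 0.0
    hits = []
    for word in WORD_RE.findall(text.lower()):
        if word in rules:
            total += rules[word]
            hits.append(word)
    return total, hits
```

Students could start by swapping in their own rule tables and arguing
about the weights, which is most of the fun.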
What I need to find out before going too far with this is:
(a) do we already have exactly what I'm seeking as a public project?
(b) will I be able to find schools that want to turn students loose on
this scaffolding?
(c) are we looking at a real project like on Sourceforge, or just a
set of interconnected examples that feature aspects of Python?
This proposal is in part a response to Laura's CP4E work on the
Diversity list. She's been suggesting that a primary barrier to
increasing diversity is simply the daunting time demands, along with
the somewhat austere culture of half-moribund bug trackers and dev
lists (some of these are more like sunken shipwrecks), which scares
people away, especially people who "have a life" outside of being
uber-nerds.
I think many worthy NGOs would benefit from having a buzzbot in-house
or from contracting with an inexpensive PR company specializing in
buzzbot robotics. The secret sauce is in scoring, whereas the search
engines already have public APIs.
The fancier buzzbots fork into multiple Python worker bee processes
that each harvest and file to SQL independently, with a Dispatcher
(queen bee) keeping track of open searches, handing them off to
available processes. I published some manga code to the Chicago user
list a while back, but haven't tried to dig for it yet...
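
In the meantime, here's a rough sketch of the queen bee / worker bee
pattern using multiprocessing and sqlite3 from the standard library.
Everything named here (fake_harvest, worker, dispatch, demo, the posts
table) is invented for the example, the harvesting is faked, and I'm
assuming a Unix-like system where the "fork" start method is available.
It is not the code from the Chicago list.

```python
import multiprocessing as mp
import os
import sqlite3
import tempfile


def fake_harvest(query):
    """Stand-in for real scraping: returns three made-up 'posts' per query."""
    return [f"{query} post {i}" for i in range(3)]


def worker(db_path, tasks):
    """Worker bee: take queries until a None sentinel arrives, file rows to SQL."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait out other writers' locks
    for query in iter(tasks.get, None):
        rows = [(query, body, os.getpid()) for body in fake_harvest(query)]
        with conn:  # one committed transaction per query
            conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)
    conn.close()


def dispatch(queries, db_path, n_workers=2):
    """Queen bee: hand open searches to workers, return the stored row count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (query TEXT, body TEXT, pid INTEGER)"
    )
    conn.commit()
    conn.close()

    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    tasks = ctx.Queue()
    bees = [ctx.Process(target=worker, args=(db_path, tasks))
            for _ in range(n_workers)]
    for bee in bees:
        bee.start()
    for query in queries:
        tasks.put(query)
    for _ in bees:
        tasks.put(None)  # one stop sentinel per worker
    for bee in bees:
        bee.join()

    conn = sqlite3.connect(db_path)
    count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    conn.close()
    return count


def demo():
    """Run three fake searches against a throwaway database file."""
    db_path = os.path.join(tempfile.mkdtemp(), "buzz.db")
    return dispatch(["Britney Spears", "Python", "Portland"], db_path)
```

A real Dispatcher would also track which searches are still open and
retry failed ones; this version only queues work and waits.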
Kirby