Word frequencies -- Python or Perl for performance?

Bill Tate tatebll at aol.com
Sun Mar 17 07:03:57 EST 2002


"Nick Arnett" <narnett at mccmedia.com> wrote in message news:<mailman.1016223990.19235.python-list at python.org>...
> I'll be processing a fairly large number of short (1-6K or so) documents at
> a time, so I'll be able to batch up things quite a bit.  I'm thinking that
> the database might help me avoid loading up a lot of useless data.  Since
> word frequencies follow a Zipf distribution, I'm guessing that I can spot
> unusual words (my goal here) by loading up the top 80 percent or so of words
> in the database (by occurrences) and focusing on the words that are in the
> docs but not in the set retrieved from the database.

Nick,
Suggestion: you might also check out Metakit (see www.equi4.com).
There is a Python binding for this embedded database.  MK is extremely
fast, very flexible in terms of designing a suitable schema, and uses
a very straightforward query syntax.  Gordon McMillan added an SQL
engine on top of it, so you can use SQL-like syntax as well, but it is
not required.
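The Zipf-based filtering you describe could be sketched in plain Python
along these lines (all names here are hypothetical, and a Counter stands
in for whatever database-backed frequency table you end up using):

```python
from collections import Counter

def common_words(freqs, coverage=0.80):
    """Return the set of words that together account for roughly the
    top `coverage` fraction of all occurrences (the Zipf 'head')."""
    total = sum(freqs.values())
    seen = 0
    common = set()
    for word, count in freqs.most_common():
        if seen / total >= coverage:
            break
        common.add(word)
        seen += count
    return common

def unusual_words(doc_words, common):
    """Words appearing in the document but not in the common set."""
    return set(doc_words) - common

# Toy corpus frequencies; in practice these would be loaded from the DB.
corpus = Counter({'the': 50, 'of': 30, 'and': 20, 'metakit': 1, 'zipf': 1})
head = common_words(corpus, coverage=0.80)
print(unusual_words(['the', 'zipf', 'distribution'], head))
```

Because the head of a Zipf distribution is small, the `common` set stays
cheap to hold in memory even for a large corpus, and each batch of
documents only needs one set-difference per document.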
