Word frequencies -- Python or Perl for performance?
Bill Tate
tatebll at aol.com
Sun Mar 17 07:03:57 EST 2002
"Nick Arnett" <narnett at mccmedia.com> wrote in message news:<mailman.1016223990.19235.python-list at python.org>...
> I'll be processing a fairly large number of short (1-6K or so) documents at
> a time, so I'll be able to batch up things quite a bit. I'm thinking that
> the database might help me avoid loading up a lot of useless data. Since
> word frequencies follow a Zipf distribution, I'm guessing that I can spot
> unusual words (my goal here) by loading up the top 80 percent or so of words
> in the database (by occurrences) and focusing on the words that are in the
> docs but not in the set retrieved from the database.
Nick,
Suggestion: you might also check out Metakit (see www.equi4.com).
There is a Python binding for this embedded database. Metakit is
extremely fast, very flexible in terms of designing a suitable schema,
and uses a very straightforward query syntax. Gordon McMillan added a
SQL engine on top of it, so you can use SQL-like syntax as well, but
it is not required.
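Incidentally, the frequency-based filtering Nick describes in the quoted
text can be sketched with the standard library alone, before any database
enters the picture: count words across the corpus, keep the head of the
Zipf distribution covering roughly 80 percent of all occurrences, and
flag anything in a new document outside that common set. The toy corpus
and the 0.8 threshold below are illustrative assumptions, not from the
original thread.

```python
from collections import Counter

def common_word_set(counts, mass=0.8):
    """Return the words whose combined occurrences cover at least
    `mass` of all occurrences (the head of the Zipf distribution)."""
    total = sum(counts.values())
    covered = 0
    common = set()
    for word, n in counts.most_common():
        if covered >= mass * total:
            break
        common.add(word)
        covered += n
    return common

def unusual_words(doc_words, common):
    """Words in the document that fall outside the common set."""
    return sorted(set(doc_words) - common)

# Toy corpus standing in for the database of accumulated word counts.
corpus_counts = Counter(
    "the cat sat on the mat the dog sat on the log".split()
)
common = common_word_set(corpus_counts, mass=0.8)
print(unusual_words("the cat chased a quark".split(), common))
# → ['a', 'chased', 'quark']
```

With real data the counts would come out of the database instead of a
literal string, but the filtering step itself stays this small.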