Word frequencies -- Python or Perl for performance?
narnett at mccmedia.com
Fri Mar 15 21:26:30 CET 2002
Anybody have any experience generating word frequencies from short documents
with Python and Perl? Given a choice between the two, I'm wondering what
will be faster. And a related question... any idea if there will be a
significant performance hit (or advantage?) from storing the data in MySQL
vs. my own file-based data structures?
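For the Python side, a minimal sketch of the counting step using only the standard library (the tokenizer regex and sample text below are illustrative, not a benchmark):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word occurrences in one document, case-insensitively."""
    # \w+ is a crude tokenizer; adjust it to match your definition of a "word".
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

freqs = word_frequencies("The cat sat on the mat, the cat slept")
print(freqs.most_common(2))  # [('the', 3), ('cat', 2)]
```

`Counter` is a dict subclass, so merging counts across a batch of documents is just repeated `update()` calls.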
I'll be processing a fairly large number of short (1-6K or so) documents at
a time, so I'll be able to batch things up quite a bit. I'm thinking that
the database might help me avoid loading up a lot of useless data. Since
word frequencies follow a Zipf distribution, I'm guessing that I can spot
unusual words (my goal here) by loading up the top 80 percent or so of words
in the database (by occurrences) and focusing on the words that are in the
docs but not in the set retrieved from the database.
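The filtering step described above could be sketched like this (the common-word set and document counts here are made up for illustration; in practice the set would come from the top ~80% of words by occurrence in the database):

```python
from collections import Counter

def unusual_words(doc_counts, common_words):
    """Return words from a document's counts that are absent from the
    common-word set, i.e. candidates for "unusual" words."""
    return {w: n for w, n in doc_counts.items() if w not in common_words}

# Hypothetical data: a tiny common-word set and one document's counts.
common = {"the", "of", "and", "a", "to"}
doc = Counter({"the": 4, "zipf": 1, "frequencies": 2, "a": 3})
print(unusual_words(doc, common))  # {'zipf': 1, 'frequencies': 2}
```

Because the common-word set is small relative to the full vocabulary, it can stay in memory as a plain `set`, which sidesteps a database round-trip per document.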
Thanks for any thoughts on this and pointers to helpful examples or modules.