Word frequencies -- Python or Perl for performance?
Jim Dennis
jimd at vega.starshine.org
Fri Mar 22 06:02:02 EST 2002
In article <a7dmuk$21j0$1 at news.idiom.com>, Jim Dennis wrote:
>In article <mailman.1016724875.29984.python-list at python.org>,
Nick Arnett wrote:
>>> I don't know what you're really trying to do, but I decided to
>>> code up a quickie "word counter" for the hell of it.
>>Wow -- thanks. I'm going to ask questions more often now!
>>Nick
> I deliberately made it a class so you could instantiate
> multiple word counts on different blocks of text to compare them
> or whatever, and so you could import it into other programs and
> use other methods (i.e. a web spider with urllib and a "text" extractor
> with htmllib) to get your text.
So, in my ongoing effort to bolster my Python programming
skills, I decided to add PostgreSQL support to my word frequency
counter.
This is my first bit of SQL RDBMS programming. In this case
it was pretty simple (the hardest part was convincing Debian to
re-install PostgreSQL cleanly in my user chroot jail, but that's a
wrinkle that need not be elaborated upon here).
I only had to add five lines of code (including the import directive)
to make this dump the words and word frequencies into a database table.
(The import is the first line in "__main__" and the work is all
in the last four lines. A comment shows the DDL to create the
table).
(I also had to re-write a few lines where I was flagging the "known
words" in my output). However, my approach is pretty crude; I
should be querying the db to update words that are already in the
table, and only inserting new rows for new words. Maybe I'll do that
for version 0.3.
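The "update existing words, insert only new ones" idea could be sketched roughly as follows. This is only an illustration in present-day Python, not the posted script: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL (so the placeholders are `?` rather than psycopg's `%s`), the function name store_counts is hypothetical, and it assumes that "update" means adding the new count to the stored one.

```python
import sqlite3

def store_counts(conn, counts, known_words):
    """Update rows for words already in the table; insert rows for new words.

    Hypothetical helper: assumes a word_frequencies(word, count, known) table
    and that an existing word's stored count should grow by the new count.
    """
    cur = conn.cursor()
    for word, count in counts.items():
        cur.execute("SELECT count FROM word_frequencies WHERE word = ?", (word,))
        row = cur.fetchone()
        if row is None:
            # New word: insert a fresh row.
            cur.execute(
                "INSERT INTO word_frequencies (word, count, known) VALUES (?, ?, ?)",
                (word, count, word in known_words),
            )
        else:
            # Word already present: add to the stored count.
            cur.execute(
                "UPDATE word_frequencies SET count = ? WHERE word = ?",
                (row[0] + count, word),
            )
    conn.commit()

# In-memory database so the sketch runs without any server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_frequencies (word TEXT, count INTEGER, known BOOLEAN)")
store_counts(conn, {"python": 3, "perl": 1}, {"python"})
store_counts(conn, {"python": 2, "ruby": 4}, {"python"})
rows = conn.execute(
    "SELECT word, count FROM word_frequencies ORDER BY word"
).fetchall()
```

Running two batches through it leaves one row per word instead of duplicates. One query per word is slow for large word lists; reading the existing rows into a dictionary first would be the obvious next step.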
Here's the updated version:
#!/usr/bin/env python2.2
""" Word Frequency Counter """
import sys, string
author="James T. Dennis <jimd at starshine.org>"
version="0.2"
changelog="""
Fri Mar 22 02:52:28 PST 2002
added support for dumping results into database
"""
bugs="""
   Will create duplicate words in the table,
   I should query the list into a dictionary,
"""
class Wordcount:
    """Keep a count of all unique "words" in text
       Maintain a dictionary of words, each with a
       count of the number of occurrences,
       Add arbitrary text to it, return the dictionary on demand
       Generator for most/least frequent words
       ??? Options for allowed charset, and case sensitivity ???"""
    # for words like isn't and O'Holloran and fiddle-faddle
    # what should we do about contractions?
    # ditto possessive forms?
    tr = string.maketrans('','')
    rm = string.punctuation + string.digits
    rm = string.translate(rm, tr, "'-")
    # Don't remove apostrophes and hyphens from "words"
    knownWords = {}
    knownWordsRead = 0

    def __init__(self):
        self.words = {}
        self.count = 0   # total words processed
        self.nword = 0   # number of words in our instance dictionary
        self.known = 0   # number of our words found in the class dict.
        # Each instance gets its own word list and total count
        if not Wordcount.knownWordsRead:
            # We'll try to create the "known words" dictionary
            # But we only do that on first instantiation
            # since all instances share this one dictionary
            try:
                Wordcount.knownWordsRead = 1
                wlist = open('/usr/share/dict/words','r')
                for i in wlist:
                    i = i.lower().strip()
                    if not i in Wordcount.knownWords:
                        Wordcount.knownWords[i] = 0
            except IOError:
                # but we won't try very hard
                pass
        # print "debug: ", word

    def add(self, text):
        for each in text.split():
            word = string.translate(each, Wordcount.tr, Wordcount.rm).lower()
            word = word.strip()
            while word.endswith("'"): word = word[:-1]    # strip quotes
            while word.startswith("'"): word = word[1:]   #
            if word.startswith('-'): continue
            if word.endswith('-'): continue
            if word.endswith("n't"): word = word[:-3]   # can't include these
            if word.endswith("'ll"): word = word[:-3]   # or you'll wonder
            if word.endswith("'s"): word = word[:-2]    # who's
            if word == '-' or word == "'" or len(word) < 1: continue
            self.count += 1
            if not word in self.words:
                self.words[word] = 1
                self.nword += 1
            else:
                self.words[word] += 1
            if word in Wordcount.knownWords: self.known += 1

    def dump(self):
        # (count, word) pairs, most frequent first
        self.items = [ (y,x) for x,y in self.words.items() ]
        self.items.sort()
        self.items.reverse()
        return self.items

if __name__ == '__main__':
    import psycopg
    wcount = Wordcount()
    for i in sys.argv[1:]:
        file = open(i,"r")
        for line in file:
            wcount.add(line)
            # handle hyphenation?
            ## poss. by cutting last word IFF ends in hyphen
            ## and prepending to next line.
    print wcount.count, wcount.known, wcount.nword, \
        wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
    l = []
    for count, word in wcount.dump():
        if count > 1:   # Skip "unique" words
            if word in Wordcount.knownWords:
                f = '*'
                t = (word,count,'true')
            else:
                f = ''
                t = (word,count,'false')
            l.append(t)
            print "%7d %s%s" % (count, word, f)
    print wcount.count, wcount.known, wcount.nword, \
        wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
    ### Dump into database:
    db = psycopg.connect('dbname=test user=jimd')
    ## db.execute('''CREATE TABLE word_frequencies
    ##    (word text, count integer, known boolean)''')
    cursor = db.cursor()
    cursor.executemany("insert into word_frequencies values (%s, %d, %s);",
        l )
    db.commit()

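The hyphenation idea noted in the comments of the main block (cut the last word if and only if it ends in a hyphen, and prepend it to the next line) might look something like the sketch below. It is written in present-day Python rather than 2.2, the name join_hyphenated is hypothetical, and it assumes every line-final hyphen is a line-break hyphen, so a genuine compound such as fiddle-faddle split across lines would be fused into one word.

```python
def join_hyphenated(lines):
    """Yield lines with a trailing hyphenated fragment moved onto the next line."""
    carry = ""  # word fragment cut from the end of the previous line
    for line in lines:
        # Prepend any fragment held over from the previous line.
        line = carry + line.lstrip() if carry else line
        carry = ""
        stripped = line.rstrip()
        if stripped.endswith("-"):
            # Cut the last word (minus its hyphen) and hold it for the next line.
            head, _, tail = stripped.rpartition(" ")
            carry = tail[:-1]
            line = head
        yield line
    if carry:
        yield carry  # input ended on a hyphen; emit the fragment as-is

fixed = list(join_hyphenated(["a quick imple-", "mentation of word", "counting"]))
```

The result could then be fed to Wordcount.add() line by line in place of the raw file lines.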
One thing confuses me a bit. I'd always heard that database
cursors are an extremely limited resource and that their use
is best avoided, yet the DB-API docs(*) seem to suggest that you
usually use cursors with Python db connections, but that one could
call .execute() on the db connection *WITHOUT* a cursor.
"""
Next, you should create a cursor object. A cursor object acts as a
handle for a given SQL query; it allows retrieval of one or
more rows of the result, until all the matching rows have been
processed. For simple applications that do not need more than
one query at a time, it's not necessary to use a cursor object since
database objects support all the same methods as cursor
objects. We'll deliberately use cursor objects in the following
example. (For more on beginning SQL, see At The Forge by
Reuven Lerner in LJ, October, 1997.)
"""
* ( http://www.amk.ca/python/writing/DB-API.html )
Yet I can't seem to find those methods on psycopg's connection
objects. Is psycopg DB-API 2 compliant? Is this feature there? Or
does DB-API not require this feature?
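One way to look into this from the interpreter is to ask the module what it advertises and check which methods the connection actually has. For reference, the DB-API 2.0 spec (PEP 249) lists only close(), commit(), rollback() and cursor() on connection objects, so an execute() on the connection is an extra that a module may or may not provide. The sketch below is present-day Python and uses the standard-library sqlite3 module purely as a stand-in; the helper name describe_dbapi is hypothetical, and what the same probe reports for psycopg is not something verified here.

```python
import sqlite3

def describe_dbapi(module, conn):
    """Report what a DB-API module advertises and what its connection exposes."""
    return {
        # Module-level attributes defined by the DB-API spec.
        "apilevel": getattr(module, "apilevel", None),
        "paramstyle": getattr(module, "paramstyle", None),
        # Whether execute() is available without going through a cursor.
        "connection_has_execute": hasattr(conn, "execute"),
        "cursor_has_execute": hasattr(conn.cursor(), "execute"),
    }

conn = sqlite3.connect(":memory:")
info = describe_dbapi(sqlite3, conn)
for key in sorted(info):
    print(key, "=", info[key])
```

With psycopg one would pass the psycopg module and the object returned by psycopg.connect() instead.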