Word frequencies -- Python or Perl for performance?

Fri Mar 22 06:02:02 EST 2002

In article <a7dmuk$21j0$1 at news.idiom.com>, Jim Dennis wrote:
>In article <mailman.1016724875.29984.python-list at python.org>, 
	Nick Arnett wrote:

>>>  I don't know what you're really trying to do, but I decided to 
>>>  code up a quickie "word counter" for the hell of it.

>>Wow -- thanks.  I'm going to ask questions more often now!
>>Nick

> I deliberately made it a class so you could instantiate 
> multiple word counts on different blocks of text to compare them
> or whatever, and so you could import it into other programs and
> use other methods (i.e. a web spider with urllib and a "text" extractor
> with htmllib) to get your text.

 So, in my ongoing effort to bolster my Python programming
 skills, I decided to add PostgreSQL support to my word frequency
 counter. 

 This is my first bit of SQL RDBMS programming.  In this case 
 it was pretty simple (the hardest part was convincing Debian to
 re-install postgresql cleanly in my user chroot jail; that's a 
 wrinkle that need not be elaborated upon here).

 I only had to add five lines of code (including the import directive)
 to make this dump the words and word frequencies into a database table.
 (The import is the first line in "__main__" and the work is all
 in the last four lines.  A comment shows the DDL to create the 
 table).

 (I also had to re-write a few lines where I was flagging the "known
 words" in my output).  However, my approach is pretty crude; I 
 should be querying the db to update words that are already in the 
 table, and only inserting new rows for new words.  Maybe I'll do that
 for version 0.3.
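
 For the record, a rough sketch of that update-or-insert idea.  This is
 not in the version below.  Assumptions: it uses the standard-library
 sqlite3 module as a stand-in for psycopg so it runs anywhere (sqlite3
 wants '?' placeholders where psycopg wants '%s'), it is written for a
 current Python rather than 2.2, it adds the new count to the stored
 one, and save_counts() and demo() are names made up for illustration.

```python
import sqlite3  # stand-in for psycopg so the sketch is self-contained


def save_counts(conn, rows):
    """Update rows for words already in the table; insert the rest.

    rows is a list of (word, count, known) tuples, like the list the
    script builds before its executemany() call.
    """
    cur = conn.cursor()
    for word, count, known in rows:
        # Assumption: an existing word has the new count added to it.
        cur.execute(
            "UPDATE word_frequencies SET count = count + ? WHERE word = ?",
            (count, word))
        if cur.rowcount == 0:  # no existing row was touched
            cur.execute(
                "INSERT INTO word_frequencies VALUES (?, ?, ?)",
                (word, count, known))
    conn.commit()


def demo():
    """Load two batches that share one word and return the table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE word_frequencies "
                "(word text, count integer, known boolean)")
    save_counts(conn, [("python", 3, True), ("perl", 2, True)])
    save_counts(conn, [("python", 4, True), ("psycopg", 2, False)])
    cur.execute("SELECT word, count FROM word_frequencies ORDER BY word")
    return cur.fetchall()
```

 One round trip per word is slow for a big word list; reading the
 existing words into a dictionary first would cut that down.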

 Here's the updated version:

#!/usr/bin/env python2.2
""" Word Frequency Counter """
import sys, string

author="James T. Dennis <jimd at starshine.org>"
version="0.2"
changelog="""
	Fri Mar 22 02:52:28 PST 2002
		added support for dumping results into database
		"""
bugs="""
	Will create duplicate words in the table, 
	I should query the list into a dictionary, 
	"""

class Wordcount:
	"""Keep a count of all unique "words" in text
	   Maintain a dictionary of words, each with a 
	   count of the number of occurrences, 
	   Add arbitrary text to it, return the dictionary on demand
	   Generator for most/least frequent words
	   ??? Options for allowed charset, and case sensitivity ???"""
	# for words like isn't and O'Holloran and fiddle-faddle
	# what should we do about contractions?
	# ditto possessive forms?
	tr = string.maketrans('','')
	rm = string.punctuation + string.digits
	rm = string.translate(rm, tr, "'-")  
		# Don't remove apostrophes and hyphens from "words"
	knownWords = {}
	knownWordsRead = 0

	def __init__(self):
		self.words = {}
		self.count = 0  # total words processed
		self.nword = 0  # number of words in our instance dictionary
		self.known = 0  # number of our words found in the class dict.
		# Each instance gets its own word list and total count
		if not Wordcount.knownWordsRead:
			# We'll try to create the "known words" dictionary
			# But we only do that on first instantiation 
			# since all instances share this one dictionary
			try:
				Wordcount.knownWordsRead = 1
				wlist = open('/usr/share/dict/words','r')
				for i in wlist:
					i = i.lower().strip()
					if not i in Wordcount.knownWords:
						Wordcount.knownWords[i] = 0
			except IOError: pass
			# but we won't try very hard
				# print "debug: ", word 
	def add (self,text):
		for each in text.split():
			word = string.translate(each, Wordcount.tr, Wordcount.rm).lower()
			word = word.strip()
			while word.endswith("'"):   word = word[:-1] # strip quotes
			while word.startswith("'"): word = word[1:]  #
			if word.startswith('-'): continue 
			if word.endswith('-'): continue 
			if word.endswith("n't"): word = word[:-3] # can't include these
			if word.endswith("'ll"): word = word[:-3] # or you'll wonder
			if word.endswith("'s"):  word = word[:-2] # who's
			if word == '-' or word == "'" or len(word) < 1 : continue 
			self.count += 1
			if not word in self.words:
				self.words[word] = 1
				self.nword += 1
			else:
				self.words[word] += 1
			if word in Wordcount.knownWords: self.known += 1
	def dump (self):
		self.items = [ (y,x) for x,y in self.words.items() ] 
		self.items.sort()
		self.items.reverse()
		return self.items

if __name__ == '__main__':
	import psycopg
	wcount = Wordcount()
	for i in sys.argv[1:]:
		infile = open(i,"r")
		for line in infile:
			wcount.add(line)
			# handle hyphenation?
			## poss. by cutting last word IFF ends in hyphen
			## and prepending to next line.
		infile.close()
	print wcount.count, wcount.known, wcount.nword, \
		wcount.known/float(wcount.count), wcount.nword/float(wcount.known)
	l = []
	for count, word in wcount.dump():
		if count > 1:		# Skip "unique" words
			if word in Wordcount.knownWords: 
				f = '*'
				t = (word,count,'true')
			else:
				f = ''
				t = (word,count,'false')
			l.append(t)
			print "%7d %s%s" % (count, word, f)
	print wcount.count, wcount.known, wcount.nword, \
		wcount.known/float(wcount.count), wcount.nword/float(wcount.known)

	### Dump into database:
	db = psycopg.connect('dbname=test user=jimd')
	## db.execute('''CREATE TABLE word_frequencies 
	##    (word text, count integer, known boolean)''')
	cursor = db.cursor()
	cursor.executemany("insert into word_frequencies values (%s, %d, %s);", 
		l )
	db.commit()
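
 The hyphenation note in the main loop above (cut the last word if it
 ends in a hyphen and prepend it to the next line) could be sketched
 roughly like this.  It is written for a current Python rather than 2.2,
 and join_hyphenated() is a made-up name, not part of the script.  It
 will also glue together real compounds such as fiddle-faddle when they
 happen to break across lines, so treat it as a starting point only.

```python
def join_hyphenated(lines):
    """Yield lines with a word split by a trailing hyphen rejoined."""
    carry = ""  # word fragment held over from the previous line
    for line in lines:
        if carry:
            line = carry + line.lstrip()
            carry = ""
        stripped = line.rstrip()
        # A single trailing hyphen marks a split word; "--" is a dash.
        if stripped.endswith("-") and not stripped.endswith("--"):
            head, _, last = stripped.rpartition(" ")
            carry = last[:-1]  # hold back the last word, minus its hyphen
            line = head
        yield line
    if carry:  # input ended in the middle of a word
        yield carry
```

 In the script it would wrap the file object, along the lines of
 "for line in join_hyphenated(infile): wcount.add(line)".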

 One thing confuses me a bit.  I'd always heard that database
 cursors are an extremely limited resource and that the use of them
 is best avoided, yet the DB-API docs(*) seem to suggest that you 
 usually use cursors with Python db connections, but that one could
 call .execute() directly on the db connection *WITHOUT* a cursor.

 """
 Next, you should create a cursor object. A cursor object acts as a
 handle for a given SQL query; it allows retrieval of one or
 more rows of the result, until all the matching rows have been
 processed. For simple applications that do not need more than
 one query at a time, it's not necessary to use a cursor object since
 database objects support all the same methods as cursor
 objects. We'll deliberately use cursor objects in the following
 example. (For more on beginning SQL, see At The Forge by
 Reuven Lerner in LJ, October, 1997.)
 """
 	* ( http://www.amk.ca/python/writing/DB-API.html )

 Yet, I can't seem to find those methods on the connection object in
 psycopg.  Is psycopg DB-API 2 compliant?  Is this feature there?  Or
 does DB-API not require this feature?
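
 For reference, here is the explicit-cursor pattern from the quoted
 passage, plus a quick way to see whether a given module's connection
 objects offer their own .execute().  As far as I can tell, the DB-API
 2.0 spec itself only lists close(), commit(), rollback() and cursor()
 for connections, so a connection-level .execute() would be a module's
 own extension.  Assumptions: the standard-library sqlite3 module stands
 in for psycopg (so '?' placeholders instead of '%s'), the code is for a
 current Python, and top_words(), has_execute_shortcut() and demo() are
 names invented for this sketch.

```python
import sqlite3  # a DB-API 2.0 module from the standard library


def top_words(conn, limit):
    """Fetch the most frequent words through an explicit cursor."""
    cur = conn.cursor()
    cur.execute(
        "SELECT word, count FROM word_frequencies "
        "ORDER BY count DESC, word LIMIT ?", (limit,))
    rows = cur.fetchall()
    cur.close()
    return rows


def has_execute_shortcut(conn):
    """True if this connection object has its own execute() method."""
    return hasattr(conn, "execute")


def demo():
    """Build a tiny table, then query it and probe the connection."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE word_frequencies "
                "(word text, count integer, known boolean)")
    cur.executemany(
        "INSERT INTO word_frequencies VALUES (?, ?, ?)",
        [("the", 40, True), ("python", 7, True), ("perl", 2, True)])
    conn.commit()
    return top_words(conn, 2), has_execute_shortcut(conn)
```

 sqlite3 happens to provide the shortcut; running the same hasattr()
 check against a psycopg connection would show whether it does too.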