[Tutor] How can I make this run faster?

Mon Dec 21 19:23:56 CET 2009

Emad Nawfal (عمـ نوفل ـاد) wrote:
> Dear Tutors,
> The purpose of this script is to see how many vocalized forms map to a 
> single consonantal form. For example, the form "fn" could be fan, fin, 
> fun.
>
> The input is a large list (taken from a file) that has ordinary words. 

What does "ordinary" mean.

Are all the letteers lower case?

> The script creates a devocalized list, then compares the two lists.
>
> The problem: It takes over an hour to process 1 file. The average file 
> size is 400,000 words.
>
> Question: How can I make it run faster? I have a large number of files.

The major consumer of time is the "outer product:" (for loop with list 
comprehension). That is 400,000 squared operations!
And it is not necessary!

Also the program repeatedly call devocalize for the same word!

>
> Note: I'm not a programmer, so please avoid very technical terms.

You are writing a program. Does not that make you a programmer? How can 
I tell what terms are too technical for you?

>
> Thank you in anticipation.

You are welcome in retrospect. Observations follow:

>
> def devocalize(word):
>     vowels = "aiou"
>     return "".join([letter for letter in word if letter not in vowels])

What happenex to"e"?

Use translate to delete the vowels. That will improve speed.

"translate( table[, deletechars])

Return a copy of the string where all characters occurring in the 
optional argument deletechars are removed, and the remaining characters 
have been mapped through the given translation table, which must be a 
string of length 256."

Here is a program that will be much faster. I changed names to be more 
meaningful (to me), and created a main function (standard operating 
procedure in Python).

import collections

def devowel(word, table=''.join(chr(i) for i in range(256)), vowels="aiou"):
  return word.translate(table, vowels)

def main():
  allOriginalWords = ['him', 'ham', 'hum', 'fun', 'fan']
  uniqueOriginalWords = set(allOriginalWords)
  wordDict = collections.defaultdict(list)
  for word in uniqueOriginalWords:
    wordDict[devowel(word)].append(word)
  for lex, wordList in wordDict.iteritems():
    print lex, " ".join(wordList)

main()

-- 
Bob Gailer
Chapel Hill NC
919-636-4239