[Tutor] How can I make this run faster?
spir
denis.spir at free.fr
Mon Dec 21 20:40:53 CET 2009
Emad Nawfal (عمـ نوفل ـاد) <emadnawfal at gmail.com> dixit:
> Dear Tutors,
> The purpose of this script is to see how many vocalized forms map to a
> single consonantal form. For example, the form "fn" could be fan, fin, fun.
>
> The input is a large list (taken from a file) that has ordinary words. The
> script creates a devocalized list, then compares the two lists.
>
> The problem: It takes over an hour to process 1 file. The average file size
> is 400,000 words.
>
> Question: How can I make it run faster? I have a large number of files.
>
> Note: I'm not a programmer, so please avoid very technical terms.
>
> Thank you in anticipation.
>
>
>
>
>
> def devocalize(word):
>     vowels = "aiou"
>     return "".join([letter for letter in word if letter not in vowels])
>
>
> vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # input, usually a large list of around 500,000 items
>
> vowelled = set(vowelled)
>
> unvowelled = set([devocalize(word) for word in vowelled])
>
Your problem is algorithmic: the last part of your script (below) is unnecessary, and it is the part that consumes most of the time (a loop over all words nested inside a loop over all words). Instead, as you produce each unvowelled version of a lexeme, feed a dictionary whose keys are the unvowelled forms and whose values are lists of the original vowelled lexemes. So, replacing the list comprehension and the final loop (untested):
wordmap = {}
for lexem in vowelled:
    unvowelled = devocalize(lexem)
    # add lexem to the list if unvowelled is already registered
    if unvowelled in wordmap:
        wordmap[unvowelled].append(lexem)
    # else register unvowelled with its first lexem
    else:
        wordmap[unvowelled] = [lexem]
for (unvowelled, lexems) in wordmap.items():  # items = list of (key, value) pairs
    print unvowelled, " ".join(lexems)
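For reference, the same grouping idea can be written a little more compactly with collections.defaultdict, which removes the "already registered?" test entirely. This is a minimal, self-contained sketch (Python 3 syntax, so print is a function here):

```python
from collections import defaultdict

def devocalize(word):
    vowels = "aiou"
    return "".join(letter for letter in word if letter not in vowels)

vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # normally ~500,000 items

# defaultdict(list) creates an empty list the first time a key is seen,
# so every lexeme is a single append: one pass over the whole lexicon.
wordmap = defaultdict(list)
for lexem in vowelled:
    wordmap[devocalize(lexem)].append(lexem)

for unvowelled, lexems in wordmap.items():
    print(unvowelled, " ".join(lexems))
```

Either way, the work is one pass over the input instead of one pass per unvowelled form, which is where the hour of runtime was going.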
> for lex in unvowelled:
>     d = {}
>     d[lex] = [word for word in vowelled if devocalize(word) == lex]
>
>     print lex, " ".join(d[lex])
>
Note: if you really had to double-loop over a whole lexicon, the number-one trick to speed things up greatly is to first split the list into parts according to (at least) the first letter (or whichever script character you use), and then do all the processing on those smaller parts.
1 list of 100 words --> 10,000 inner loops
10 lists of 10 words --> 10 x 100 = 1,000 inner loops
(In your case, it would be more clever to distinguish words on first _consonant_ char!)
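The bucketing trick above can be sketched as follows. This is an untested illustration (Python 3 syntax); the function names are mine, not from the original script, and the bucket key is the first consonant as suggested:

```python
from collections import defaultdict

VOWELS = set("aiou")

def first_consonant(word):
    # first character that is not a vowel; "" for an all-vowel word
    for letter in word:
        if letter not in VOWELS:
            return letter
    return ""

def bucket_by_first_consonant(words):
    # split the lexicon into sub-lists keyed by first consonant, so any
    # pairwise comparison only runs within each (much smaller) bucket
    buckets = defaultdict(list)
    for word in words:
        buckets[first_consonant(word)].append(word)
    return buckets

buckets = bucket_by_first_consonant(['him', 'ham', 'fun', 'fan', 'sun'])
# 'him'/'ham' share the 'h' bucket, 'fun'/'fan' the 'f' bucket, 'sun' is alone
```

With k roughly equal buckets, a double loop over n words drops from n * n comparisons to about k * (n/k) * (n/k) = n * n / k.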
Denis
________________________________
la vita e estrany
http://spir.wikidot.com/