[Tutor] How can I make this run faster?
spir
denis.spir at free.fr
Mon Dec 21 20:40:53 CET 2009
Emad Nawfal (عمـ نوفل ـاد) <emadnawfal at gmail.com> dixit:
> Dear Tutors,
> The purpose of this script is to see how many vocalized forms map to a
> single consonantal form. For example, the form "fn" could be fan, fin, fun.
>
> The input is a large list (taken from a file) that has ordinary words. The
> script creates a devocalized list, then compares the two lists.
>
> The problem: It takes over an hour to process 1 file. The average file size
> is 400,000 words.
>
> Question: How can I make it run faster? I have a large number of files.
>
> Note: I'm not a programmer, so please avoid very technical terms.
>
> Thank you in anticipation.
>
>
>
>
>
> def devocalize(word):
>     vowels = "aiou"
>     return "".join([letter for letter in word if letter not in vowels])
>
>
> vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # input, usually a large list of around 500,000 items
>
> vowelled = set(vowelled)
>
> unvowelled = set([devocalize(word) for word in vowelled])
>
Your problem is algorithmic: the last part of your script (below) is unnecessary, and it is the part that consumes most of the time (a loop over all words nested inside a loop over all words). Instead, as you produce each unvowelled version of a lexeme, feed a dictionary whose keys are the unvowelled forms and whose values are lists of the original vowelled lexemes. So, replacing the list comprehension and the final loop (untested):
wordmap = {}
for lexem in vowelled:
    unvowelled = devocalize(lexem)
    # add lexem to the list if unvowelled is already registered
    if unvowelled in wordmap:
        wordmap[unvowelled].append(lexem)
    # else register unvowelled with its first lexem
    else:
        wordmap[unvowelled] = [lexem]
for (unvowelled, lexems) in wordmap.items():  # items = list of (key, value) pairs
    print unvowelled, " ".join(lexems)
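For reference, the same grouping idea can be written a little more compactly with collections.defaultdict, which removes the "already registered?" test entirely. This is a minimal, self-contained sketch (Python 3 syntax, so print is a function here):

```python
from collections import defaultdict

def devocalize(word):
    vowels = "aiou"
    return "".join(letter for letter in word if letter not in vowels)

vowelled = ['him', 'ham', 'hum', 'fun', 'fan']  # normally ~500,000 items

# defaultdict(list) creates an empty list the first time a key is seen,
# so every lexeme is a single append: one pass over the whole lexicon.
wordmap = defaultdict(list)
for lexem in vowelled:
    wordmap[devocalize(lexem)].append(lexem)

for unvowelled, lexems in wordmap.items():
    print(unvowelled, " ".join(lexems))
```

Either way, the work is one pass over the input instead of one pass per unvowelled form, which is where the hour of runtime was going.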
> for lex in unvowelled:
>     d = {}
>     d[lex] = [word for word in vowelled if devocalize(word) == lex]
>
>     print lex, " ".join(d[lex])
>
Note: if you really had to double-loop over a whole lexicon, the number-one trick to speed things up greatly is to first split the list into parts according to (at least) the first letter (or whichever script character you use), and then do all the processing on those smaller parts.
1 list of 100 words --> 10,000 inner loops
10 lists of 10 words --> 10 x 100 = 1,000 inner loops
(In your case, it would be more clever to distinguish words on first _consonant_ char!)
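The bucketing trick above can be sketched as follows. This is an untested illustration (Python 3 syntax); the function names are mine, not from the original script, and the bucket key is the first consonant as suggested:

```python
from collections import defaultdict

VOWELS = set("aiou")

def first_consonant(word):
    # first character that is not a vowel; "" for an all-vowel word
    for letter in word:
        if letter not in VOWELS:
            return letter
    return ""

def bucket_by_first_consonant(words):
    # split the lexicon into sub-lists keyed by first consonant, so any
    # pairwise comparison only runs within each (much smaller) bucket
    buckets = defaultdict(list)
    for word in words:
        buckets[first_consonant(word)].append(word)
    return buckets

buckets = bucket_by_first_consonant(['him', 'ham', 'fun', 'fan', 'sun'])
# 'him'/'ham' share the 'h' bucket, 'fun'/'fan' the 'f' bucket, 'sun' is alone
```

With k roughly equal buckets, a double loop over n words drops from n * n comparisons to about k * (n/k) * (n/k) = n * n / k.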
Denis
________________________________
la vita e estrany
http://spir.wikidot.com/