help make it faster please
Ron Adam
rrr at ronadam.com
Sun Nov 13 17:03:13 EST 2005
Fredrik Lundh wrote:
> Ron Adam wrote:
>
>
>>The \w does make a small difference, but not as much as I expected.
>
>
> that's probably because your benchmark has a lot of dubious overhead:
I think it does what the OP described, but that may not be what he
really needs.
Although the test to find best of n, instead was finding worse of n.
Which explains why I was getting a larger variance than I thought I
should have been getting.
>>word_finder = re.compile('[\w@]+', re.I)
>
>
> no need to force case-insensitive search here; \w looks for both lower-
> and uppercase characters.
But the dictionary keys need to be either upper or lower otherwise you
count 'The' separately from 'the'.
>> for match in word_finder.finditer(string.lower()):
>
>
> since you're using a case-insensitive RE, that lower() call is not necessary.
>
>
>> word = match.group(0)
>
>
> and findall() is of course faster than finditer() + m.group().
Cool, I don't use re that often so I just used what was posted to test
against.
>
>> t = time.clock()
>> for line in lines.splitlines():
>> countDict = foo(line)
>> tt = time.clock()-t
>
>
> and if you want performance, why are you creating a new dictionary for
> each line in the sample?
Because that's what the OP apparently wanted. A line by line word
count. I originally did it to get an the over all count and then change
it so it matched the re version that was posted.
> here's a more optimized RE word finder:
>
> word_finder_2 = re.compile('[\w@]+').findall
>
> def count_words_2(string, word_finder=word_finder_2):
> # avoid global lookups
> countDict = {}
> for word in word_finder(string):
> countDict[word] = countDict.get(word,0) + 1
> return countDict
>
> with your original test on a slow machine, I get
>
> count_words: 0.29868684 (best of 3)
> count_words_2: 0.17244873 (best of 3)
>
> if I call the function once, on the entire sample string, I get
>
> count_words: 0.23096036 (best of 3)
> count_words_2: 0.11690620 (best of 3)
>
> </F>
Wow, a lot bigger difference than on my machine. <curious> An athlon
64 3000+ on winxp. I'm not sure how much difference that would make?
This is what I get after adding the above version to it, with the
lower(). There's not quite as big a difference as you get, but the find
all version is still faster than both the others.
Cheers,
Ron
Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.06245989 (best of 3)
count_words: 0.07309812 (best of 3)
count_words_2: 0.04981024 (best of 3)
And as count all words...
Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.05325006 (best of 3)
count_words: 0.05910528 (best of 3)
count_words_2: 0.03748158 (best of 3)
They all improve, but the re find all version is clearly better.
#####################
import string
import re
import time
import random
# Create a really ugly n length string to test with.
# The word length are
n = 100000
random.seed(1)
lines = ''.join([ random.choice(string.ascii_letters * 2
+ '_@$&*()#/<>' + ' \n' * 6) for x in range(n) ])
print 'Character count:', n
print 'Word count:', len(lines.split())
print 'Average word size:', float(n)/len(lines.split())
letters = string.lowercase + string.digits + '_@'
def word_iter(text, letters=letters):
wd = ''
for c in text + ' ':
if c in letters:
wd += c
elif wd != '':
yield wd
wd = ''
def word_counter(text):
countDict={}
for wd in word_iter(text.lower()):
if wd in countDict:
countDict[wd] += 1
else:
countDict[wd] = 1
return countDict
word_finder = re.compile('[\w@]+', re.I).finditer
def count_words(string, word_finder=word_finder):
# avoid global lookups
countDict = {}
for match in word_finder(string.lower()):
word = match.group(0)
countDict[word] = countDict.get(word,0) + 1
return countDict
word_finder_2 = re.compile('[\w@]+').findall
def count_words_2(string, word_finder=word_finder_2):
# avoid global lookups
countDict = {}
for word in word_finder(string.lower()):
countDict[word] = countDict.get(word,0) + 1
return countDict
foos = [word_counter, count_words, count_words_2]
r1 = r2 = None
for foo in foos:
best_time = 1000000 # too large to be useful on purpose
for n in range(3):
t = time.clock()
#for line in lines.splitlines():
countDict = foo(lines)
tt = time.clock()-t
best_time = min(tt, best_time)
r1 = r2
r2 = countDict
if r1 != None:
# change to 1 if assert fails to find problem
if 0:
for k in r1.keys():
if r1[k] != r2[k]:
print k,r1[k],r2[k]
assert r1 == r2
print '%s: %.8f (best of %d)' \
% (foo.__name__, best_time, n+1)
More information about the Python-list
mailing list