Most Effective Way to Build Up a Histogram of Words?
Darrell Gallion
darrell at dorb.com
Thu Oct 12 22:50:14 EDT 2000
The many-small-buffers approach avoids the long delay at the end and runs fast
as well. But it's a pain to locate spaces to split the buffer on. I also wish
string.split took "start, end" params; then we could avoid copying slices out
of a large buffer just to split them.
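For what it's worth, compiled regex objects do take start/end positions: pattern.finditer(string, pos, endpos) scans just a window of the big string without slicing a copy out of it first. A small sketch of the idea (the filedata string and the window bounds here are toy values, not the real file):

```python
import re

# Toy stand-in for the big buffer read from the file.
filedata = "the quick brown fox jumps over a lazy dog"
word_re = re.compile(r'\S+')

# Scan only the window [4, 25) of the string -- no slice copy is made.
words = [m.group() for m in word_re.finditer(filedata, 4, 25)]
# words is ['quick', 'brown', 'fox', 'jumps']
```

One caveat: endpos can cut a word in half, so in practice you would still align the window boundary on a space first, just as the chunking loop below does.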
################
Words#: 1773
Total Words Seen: 844032
Time: 10.5939999819
Data Len: K 8042
##################
import string, time

t1 = time.time()
file = open('b', "r")
totalLen = 0
totalWordsSeen = 0
histogram = {}
if 1:
    filedata = file.read()
    last = 0
    res = []
    sz = 25000
    # Carve the buffer into roughly 25K chunks, cutting only at spaces
    # so no word is split across two chunks.
    for x in range(len(filedata)/sz):
        i = string.find(filedata, ' ', sz*x)
        if i == -1:
            break
        else:
            buf = filedata[last:i]
            last = i
            res.append(buf)
            totalLen += len(buf)
    buf = filedata[last:]       # remainder after the last space found
    totalLen += len(buf)
    res.append(buf)
    for d in res:
        words = string.split(d)
        for word in words:
            histogram[word] = histogram.get(word, 0) + 1
            totalWordsSeen += 1
flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()        # sorts [count, word] pairs; reverse gives descending counts
flist.reverse()
print 'Words#:', len(flist)
print 'Total Words Seen:', totalWordsSeen
print 'Time:', time.time()-t1
print 'Data Len: K', totalLen/1000
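(Hindsight note from later Pythons: collections.Counter wraps up both the get(word, 0) + 1 tallying and the sort-by-count step. A rough modern sketch of the same histogram logic, with a toy string standing in for the file contents:)

```python
from collections import Counter

# Toy stand-in for filedata = open('b').read() in the script above.
filedata = "the quick brown fox jumps over a lazy fox"

histogram = Counter(filedata.split())   # one pass; no manual get(word, 0) + 1
flist = histogram.most_common()         # (word, count) pairs, highest count first

total_words_seen = sum(histogram.values())
```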
----- Original Message -----
From: "June Kim"
>
> Using lambda and mapping, it took about 8 minutes to process a 6.5 MB text
> file, but very oddly, after all the commands processed, it took 4 mins more
> to come back to the command line. What had Python been doing after putting
> out the result? Returning system resources for 4 mins???