Most Effective Way to Build Up a Histogram of Words?

Darrell Gallion darrell at dorb.com
Thu Oct 12 22:50:14 EDT 2000


The many-small-buffers approach avoids the long delay at the end and runs fast
too. But it's a pain to locate spaces to split the buffer on.
I also wish string.split took "start, end" parameters; then we could avoid
slicing out a large buffer just to split it.
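Here is a sketch of that chunk-splitting idea in modern Python, using str.rfind to back up from each chunk boundary to the last space so no word is cut in half (chunks_on_spaces is a hypothetical helper name, not from the original post):

```python
def chunks_on_spaces(text, size=25000):
    """Yield successive slices of text, each ending at a space,
    so no word straddles two chunks (illustrative sketch)."""
    start = 0
    while start < len(text):
        end = start + size
        if end >= len(text):
            yield text[start:]      # the tail of the text
            break
        # Back up to the last space inside the chunk; start + 1
        # guarantees the cut point advances past `start`.
        cut = text.rfind(' ', start + 1, end)
        if cut == -1:               # no space found: fall back to a hard cut
            cut = end
        yield text[start:cut]
        start = cut

data = "one two three four five six"
parts = list(chunks_on_spaces(data, size=10))
assert ''.join(parts) == data               # nothing lost
assert ''.join(parts).split() == data.split()
```

Each chunk after the first begins with the space it was cut at, which is harmless because a whitespace split discards it anyway.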

################
Words#: 1773
Total Words Seen: 844032
Time: 10.5939999819
Data Len: K 8042
##################

import string, time
t1 = time.time()
file = open('b', 'r')
totalLen = 0
totalWordsSeen = 0
histogram = {}

if 1:
    filedata = file.read()
    last = 0
    res = []
    sz = 25000
    # Carve the file into roughly sz-byte buffers, each ending at a
    # space, so no word is split across two buffers.
    for x in range(len(filedata) / sz):
        i = string.find(filedata, ' ', sz * x)
        if i == -1:
            break
        else:
            buf = filedata[last:i]
            last = i
        res.append(buf)
        totalLen += len(buf)

    # Don't forget the tail end of the file after the last space found.
    buf = filedata[last:]
    totalLen += len(buf)
    res.append(buf)
    for d in res:
        words = string.split(d)
        for word in words:
            histogram[word] = histogram.get(word, 0) + 1
            totalWordsSeen += 1

flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()
flist.reverse()
print 'Words#:', len(flist)
print 'Total Words Seen:', totalWordsSeen
print 'Time:', time.time()-t1
print 'Data Len: K', totalLen/1000
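For comparison, in modern Python the whole histogram step collapses into collections.Counter, which replaces both the manual get-and-increment dict and the sort-by-count report. This is a sketch of the same technique, not the original poster's code:

```python
from collections import Counter

def word_histogram(text):
    """Count whitespace-separated words.  Counter stands in for the
    manual histogram dict, and most_common() for the sort/reverse."""
    return Counter(text.split())

hist = word_histogram("the cat sat on the mat the end")
assert hist['the'] == 3
assert hist.most_common(1) == [('the', 3)]
```

most_common() returns (word, count) pairs sorted by descending count, which is exactly what the flist sort/reverse above produces by hand.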

----- Original Message -----
From: "June Kim"
>
> Using lambda and mapping, it took about 8 minutes to process a 6.5 MB text
> file, but very oddly, after all the commands processed, it took 4 minutes
> more to come back to the command line. What had Python been doing after
> putting out the result? Returning system resources for 4 minutes???
