Most Effective Way to Build Up a Histogram of Words?

Thu Oct 12 11:07:59 EDT 2000

Simon Brunning wrote:
> 
> > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> > What is the most effective way, in terms of the execution speed,
> > to build up a histogram of words from multiple huge text
> > files?
> 
> June,
> How huge? As a first cut, I'd try something like this (untested) -
> 
> file = open('yourfile.txt', r)
> filedata = file.read()
> words=filedata.split()
> histogram {}
> for word in words:
>         histogram[word] = histogram.get(word, 0) + 1
> for word in histogram.keys():
>         print 'Word: %s - count %s' % (word, str(histogram[word])
> 
> This should work unless the file is *really* huge, in which case you'll need
> to read the file a chunk at a time. But if you can squeeze the file in
> one gulp, do so.
> 
> Cheers,
> Simon Brunning
> TriSystems Ltd.
> sbrunning at trisystems.co.uk
> 
Try the following with some sample files to see whether you have enough
memory.

I removed a couple of syntax errors, and re-cast it to be 1.5.2 compatible
(since I don't run 2.0c1 on my laptop yet).  Tested -- seems to work OK.

If you want the most frequent words last, remove the reverse() call.
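
To illustrate the ordering: sorting a list of [count, word] pairs puts the
smallest counts first, so it is the reverse() call that brings the most
frequent words to the top. A small sketch with made-up counts (written for
a current Python 3, not 1.5.2, so treat it as an illustration only):

```python
# Made-up counts, purely for illustration.
histogram = {"the": 5, "histogram": 2, "of": 3}

# Build [count, word] pairs so that sorting orders by count first.
flist = [[count, word] for word, count in histogram.items()]

flist.sort()                  # ascending: least frequent first
ascending = [pair[1] for pair in flist]

flist.reverse()               # descending: most frequent first
descending = [pair[1] for pair in flist]
```

Words with equal counts are ordered by the word itself, since the word is
the second element of each pair.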

regards
 Steve
---------------------------------------------------------------------------
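import string

# Read the whole file in one gulp; substitute your own file name.
file = open('histo.py', "r")
filedata = file.read()
file.close()

# Split on whitespace and count each word.
words = string.split(filedata)
histogram = {}
for word in words:
    histogram[word] = histogram.get(word, 0) + 1
#for word in histogram.keys():
#    print 'Word: %s - count %s' % (word, str(histogram[word]))

# Build [count, word] pairs so that the list sorts by frequency.
flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()
flist.reverse()    # most frequent words first
for pair in flist:
    print "%30s: %4d" % (pair[1], pair[0])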
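
If the files really are too big to read in one gulp, the chunk-at-a-time
approach Simon mentions might look something like the sketch below. It is
not 1.5.2 compatible: it assumes a current Python 3 with collections.Counter,
and the function name count_words and its chunk_size parameter are made up
for this example. The one subtlety is that a word can straddle two chunks, so
the last partial word of each chunk is held back and glued onto the next one.

```python
from collections import Counter

def count_words(paths, chunk_size=1 << 20):
    """Count whitespace-separated words across several files,
    reading each file chunk_size characters at a time."""
    histogram = Counter()
    for path in paths:
        with open(path, "r") as f:
            leftover = ""  # partial word carried over from the previous chunk
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunk = leftover + chunk
                words = chunk.split()
                # If the chunk does not end in whitespace, its last word may
                # be incomplete; hold it back until the next chunk arrives.
                if words and not chunk[-1].isspace():
                    leftover = words.pop()
                else:
                    leftover = ""
                histogram.update(words)
            # Whatever is still held back at end of file is a complete word.
            if leftover:
                histogram[leftover] += 1
    return histogram
```

Something like count_words(['a.txt', 'b.txt']).most_common(10) would then
give the ten most frequent words across both (hypothetical) files. Memory
use is bounded by the chunk size plus the histogram itself, not by the size
of the files.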
-- 
Helping people meet their information needs with training and technology.
703 967 0887      sholden at bellatlantic.net      http://www.holdenweb.com/