Most Effective Way to Build Up a Histogram of Words?
Steve Holden
sholden at holdenweb.com
Thu Oct 12 11:07:59 EDT 2000
Simon Brunning wrote:
>
> > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
> > What is the most effective way, in terms of the execution speed,
> > to build up a histogram of words from a multiple of huge text
> > files?
>
> June,
> How huge? As a first cut, I'd try something like this (untested) -
>
> file = open('yourfile.txt', r)
> filedata = file.read()
> words=filedata.split()
> histogram {}
> for word in words:
>     histogram[word] = histogram.get(word, 0) + 1
> for word in histogram.keys():
>     print 'Word: %s - count %s' % (word, str(histogram[word])
>
> This should work unless the file is *really* huge, in which case you'll need
> to read the file in a chunk at a time. But if you can squeeze the file in
> one gulp, do so.
>
> Cheers,
> Simon Brunning
> TriSystems Ltd.
> sbrunning at trisystems.co.uk
>
Try the following with some sample files to see whether you have enough
memory.
I removed a couple of syntax errors, and re-cast it to be 1.5.2 compatible
(since I don't run 2.0c1 on my laptop yet). Tested -- seems to work OK.
If you want the most frequent words last, remove the reverse() call.
regards
Steve
---------------------------------------------------------------------------
import string

# Read the whole file in one gulp.
file = open('histo.py', "r")
filedata = file.read()
file.close()

# Count each whitespace-separated word.
words = string.split(filedata)
histogram = {}
for word in words:
    histogram[word] = histogram.get(word, 0) + 1

#for word in histogram.keys():
#    print 'Word: %s - count %s' % (word, str(histogram[word]))

# Sort [count, word] pairs so the most frequent words come first.
flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()
flist.reverse()
for pair in flist:
    print "%30s: %4d" % (pair[1], pair[0])
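
If the files are too big to read in one gulp, the same counting idea can
be applied a line at a time across several files. The following is only a
rough sketch, and it assumes a much newer Python (3.x) than the 1.5.2 code
above: collections.Counter and the "with" statement do not exist in 1.5.2,
and the names word_histogram and print_histogram are made up for this
illustration.

```python
from collections import Counter


def word_histogram(paths):
    """Count whitespace-separated words across several files.

    Each file is read one line at a time, so only the current line and
    the histogram itself have to fit in memory.
    """
    histogram = Counter()
    for path in paths:
        # errors="replace" is an assumption: it keeps odd bytes from
        # aborting the whole run.
        with open(path, "r", errors="replace") as handle:
            for line in handle:
                histogram.update(line.split())
    return histogram


def print_histogram(histogram):
    """Print the words with the most frequent first."""
    for word, count in histogram.most_common():
        print("%30s: %4d" % (word, count))
```

Calling print_histogram(word_histogram(["a.txt", "b.txt"])) with your own
file names (the two here are placeholders) would print the combined counts.
Note that a file consisting of one enormous line is still read in a single
piece, so this helps only when the text actually has line breaks.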
--
Helping people meet their information needs with training and technology.
703 967 0887 sholden at bellatlantic.net http://www.holdenweb.com/