Most Effective Way to Build Up a Histogram of Words?

Thu Oct 12 15:58:29 EDT 2000

On Thu, 12 Oct 2000 23:20:43 +0900, June Kim <junaftnoon at nospamplzyahoo.com>
wrote:
>Thank you Mr. Brunning.
>
>"Simon Brunning" <SBrunning at trisystems.co.uk> wrote in message
>news:mailman.971358210.26293.python-list at python.org...
>> > From: June Kim [SMTP:junaftnoon at nospamplzyahoo.com]
>> > What is the most effective way, in terms of the execution speed,
>> > to build up a histogram of words from a multiple of huge text
>> > files?
>>
>> June,
>> How huge? As a first cut, I'd try something like this (untested) -
>
>The files are of a few MBs.

Provided you have more than a few MBs of free memory, slurping the whole file
oughta work fine.

>> file = open('yourfile.txt', r)
>> filedata = file.read()
>> words=filedata.split()
>> histogram {}
>> for word in words:
>> histogram[word] = histogram.get(word, 0) + 1
>> for word in histogram.keys():
>> print 'Word: %s - count %s' % (word, str(histogram[word])
>>
>> This should work unless the file is *really* huge, in which case you'll
>need
>> to read the file in a chunk at a time. But if you can squeeze the file in
>> one gulp, do so.
>>
>
>and then how could I sort the dictionary according to the
>frequency order?

a = map(lambda t: (t[1], t[0]), historgram.items()
a.sort()
for freq, word in a:
    print word, freq

Also, if you're using python 2.0, you could use a slightly more efficient

historgram.setdefault(word, 1)

instead of:

histogram[word] = histogram.get(word, 0) + 1

and a clearer list comprehension replaces the map:

a = [(freq, word) for (word, freq) in historgram.items()]

... but as to which one is faster, I believe the last argument concluded that
sometimes the function call overhead hurts map more than the "can't
precalculate size" thing hurts a comprehension or explicit loop, but sometimes
it's the reverse :)  I wonder if

a = [None] * len(historgram)
i = 0
for word, freq in historgram.items():
    a[i] = freq, word
    i = i + 1

would really be any faster.