"Newbie" questions - "unique" sorting ?

Anton Vredegoor anton at vredegoor.doge.nl
Thu Jun 26 21:24:06 CEST 2003


"Cousin Stanley" <CousinStanley at hotmail.com> wrote:
>| Thank you for such excellent programming.
>
>You're welcome ....
>
>Thanks also to ....
>
>    Erik Max Francis for suggesting
>    the lambda sort for Mixed-Case sorting ....
>
>    Kim Petersen for suggesting usage of ....
>
>        dict_words.has_key( this_word ) instead of
>
>        this_word in dict_words.keys()
>
>    which made an incredible difference in processing time ....
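
For anyone wondering why that change matters so much, here is a minimal sketch. It is written for Python 3, where has_key() no longer exists and `in` on the dict does the same job; the sample dictionary and its counts are made up for illustration.

```python
# Python 3 sketch; dict.has_key() was removed there, so `in` is used instead.
# The words and counts below are hypothetical sample data.
dict_words = {'sancho': 3, 'rocinante': 1}

# Hash lookup on the dict itself: cost does not grow with the number of keys.
fast = 'sancho' in dict_words

# Builds a list of every key and scans it item by item, which is what
# `this_word in dict_words.keys()` did in Python 2: cost grows with dict size.
slow = 'sancho' in list(dict_words.keys())

print(fast, slow)
```

Both tests give the same answer; only the amount of work per lookup differs, and that work is repeated for every word in the book.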

I've been playing with the script a little and managed to double its
speed using a trick that was posted here some time ago by someone
called "Lulu". The trick originally came from someone else, but I lost
the attribution somewhere. I'm afraid this bumps Erik's idea from the
list, because the trick translates all letters to lowercase and every
other character to a space, so the mixed-case lambda sort is no longer
needed. This speeds up both sorting and splitting.
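
As a rough illustration of the translation-table idea, here is a sketch in Python 3 (the script in this post is Python 2). The sample line is invented, and the table only covers the first 256 code points, so it assumes plain ASCII or Latin-1 input.

```python
import string

# Python 3 sketch: a 256-entry table that maps A-Z and a-z to lowercase
# and every other character in that range to a space.
table = [' '] * 256
for ch in string.ascii_lowercase:
    table[ord(ch)] = ch
    table[ord(ch.upper())] = ch
trans = ''.join(table)

# Hypothetical sample line; punctuation and digits become word separators.
line = "Don Quixote's lance, 1605!"
words = line.translate(trans).split()
print(words)  # ['don', 'quixote', 's', 'lance']
```

One pass of translate() does the lowercasing and the punctuation stripping together, so a plain split() and a plain sort() are enough afterwards.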

Probably it's possible to shave off a few percent more, but I think
doubling the speed once again would cost four times the programmer
effort, or maybe twice as much money for computer equipment.

It's all in the script below; I hope I didn't introduce any new
errors. By the way, I don't like an empty line after every other line,
as in your script, and using "\n" is easier than what you did. Other
than that, nice job!

Anton


import sys
import time

time_in = time.time()
module_name = sys.argv[ 0 ]
print '\n    %s '  % (module_name )
#to get the file below:
#http://sailor.gutenberg.org/etext97/1donq10.zip
path_in    = '1donq10.txt'
path_out   = 'words.out'
file_in    = file( path_in   , 'r' )
file_out   = file( path_out  , 'w' )
word_total = 0
dict_words = {}
#start lulu magic
i_r = map(chr, range(256))
trans = [' '] * 256
o_a, o_z = ord('a'), (ord('z')+1)
trans[ord('A'):(ord('Z')+1)] = i_r[o_a:o_z]
trans[o_a:o_z] = i_r[o_a:o_z]
trans = ''.join(trans)
#end lulu magic
print
print '        Indexing Words ....\n ' ,
for iLine in file_in :
    #use lulu magic in the line below here:
    list_words = iLine.translate(trans).split()
    for this_word in list_words :
        if not dict_words.has_key(this_word) :
            dict_words[this_word] = 1
        else :
            dict_words[this_word] += 1
        word_total += 1
        #print a progress dot for every 10000 words counted
        if word_total % 10000  ==  0 :
            sys.stdout.write('.')
list_words = dict_words.keys()
#lulu magic turned all words into lowercase, so standard sort is
#possible:
list_words.sort()
print '\n\n        Writing Output File ....' ,
for this_word in list_words : 
    word_count = dict_words[this_word]
    str_out    = '%6d %s\n' % (word_count ,this_word)
    file_out.write(str_out)
word_str   = '\n     Total  Words .... %d\n' % (word_total)
keys_total = len(dict_words)
keys_str   = '\n     Unique Words .... %d\n' % (keys_total)
file_out.write(word_str)
file_out.write(keys_str)
print '\n        Complete .................\n' 
print '            Total  Words ....' , word_total
print 
print '            Unique Words ....' , keys_total
file_in.close()
file_out.close()
time_out  = time.time()
time_diff = time_out - time_in
print '\n        Process Time ........ %-6.2f Seconds' % (time_diff)
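
For readers on a current interpreter, the counting and report steps can be sketched in Python 3 roughly as follows. The helper names count_words and report_lines and the sample sentence are made up for this illustration; only the '%6d %s' output format is taken from the script above.

```python
def count_words(words):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for w in words:
        # get() replaces the has_key() test: a missing word starts at 0.
        counts[w] = counts.get(w, 0) + 1
    return counts

def report_lines(counts):
    """Format one 'count word' line per word, in plain sorted order."""
    return ['%6d %s' % (counts[w], w) for w in sorted(counts)]

# Hypothetical, already lowercased sample input.
sample = 'the knight and the squire and the horse'.split()
counts = count_words(sample)
for line in report_lines(counts):
    print(line)
print('Total  Words ....', sum(counts.values()))
print('Unique Words ....', len(counts))
```

Because every word is already lowercase at this point, sorted() needs no comparison function.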



