"Newbie" questions - "unique" sorting ?

Cousin Stanley CousinStanley at hotmail.com
Tue Jun 24 10:55:30 CEST 2003


| How about the simple approach?
| ...

Kim ...

The approach I used is fairly simple and similar
to the one you posted, basically just
stuffing words from split lines into
a dictionary ...

The following script produces an indexed list
fairly quickly for relatively small files,
but it dogged out on the large input file John supplied,
which yielded ...

  Total  Words .... 467381
  Unique Words .... 47122

Perhaps skipping the dictionary word count update
in the following line might speed things up ...

      else :

            dict_words[ this_word ] += 1
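
A possibly bigger cost than the count update is the membership
test itself. In the script below, dict_words.keys() builds a fresh
list of every key, and "not in" then scans that list, so each word
pays for a pass over all the unique words seen so far. Testing the
dictionary directly is a hashed lookup instead. A minimal sketch of
that change, not timed against John's file; count_words is a
hypothetical helper name and is not part of the script below ...

```python
def count_words(lines):
    """Return (counts, total) for the whitespace-separated words in lines."""
    counts = {}
    total = 0
    for line in lines:
        for word in line.split():
            # Hashed lookup on the dict itself; no list of keys is built
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
            total += 1
    return counts, total
```

Passing an open file as lines should give the same counts
as the indexing loop in the script.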

-- 
Cousin Stanley
Human Being
Phoenix, Arizona

-------------------------------------------------------------------

'''

    Module ........... word_list.py

    Usage ............ python word_list.py File_In.txt File_Out.txt

    NewsGroup ........ comp.lang.python

    Date ............. 2003-06-18

    Posted_By ........ John Fitzsimmons

    Replies_From ..... [ kpop , Erik Max Francis ]

    Coded_By ......... Stanley C. Kitching

'''

import math
import sys
import time

time_in = time.time()

NL = '\n'

module_name = sys.argv[ 0 ]

print '%s    %s '  % ( NL , module_name )

path_in    = sys.argv[ 1 ]
path_out   = sys.argv[ 2 ]

file_in    = file( path_in   , 'r' )
file_out   = file( path_out  , 'w' )

word_total = 0


dict_words = {}

print
print '        Indexing Words .... ' ,

for iLine in file_in :

    if math.fmod( word_total , 1000 ) == 0 :

       print '.' ,

    list_words = iLine.strip().split()

    for this_word in list_words :

        if this_word not in dict_words.keys() :

            dict_words[ this_word ] = 1

        else :

            dict_words[ this_word ] += 1

        word_total += 1


list_words = dict_words.keys()

list_words.sort( lambda x , y :  cmp( x.lower() , y.lower() )  )


print NL
print '        Writing Output File ....' ,

for this_word in list_words :

    word_count = dict_words[ this_word ]

    str_out    = '%6d %s %s' % ( word_count , this_word , NL )

    file_out.write( str_out )


word_str   = '%s     Total  Words .... %d %s' % ( NL , word_total , NL )

keys_total = len( dict_words.keys() )

keys_str   = '%s     Unique Words .... %d %s' % ( NL , keys_total , NL )


file_out.write( word_str )

file_out.write( keys_str )

print NL
print '        Complete .................'
print
print '            Total  Words ....' , word_total
print
print '            Unique Words ....' , keys_total


file_in.close()

file_out.close()


time_out  = time.time()

time_diff = time_out - time_in

print NL
print '        Process Time ........ %-6.2f Seconds' % ( time_diff )
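
-------------------------------------------------------------------

For the sort step, the comparison lambda in the script calls
lower() on both words for every comparison. On interpreters whose
sort accepts a key function (an assumption; the script as posted
does not rely on it), the lowercase form can be computed once per
word instead. A sketch only; sort_words is a hypothetical name, and
the tie-break on the original spelling is an added choice so that
words differing only in case come out in a fixed order ...

```python
def sort_words(words):
    """Return words sorted case-insensitively, ties broken by exact spelling."""
    # The key is computed once per word, not once per comparison
    return sorted(words, key=lambda w: (w.lower(), w))
```

Called as sort_words( dict_words.keys() ), it would stand in for
the list_words.sort( ... ) line.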







