CAN You help Re: Writing dictionary output to a file

Sat Mar 6 23:41:15 EST 2004

Hey,
I did a little debugging on the code you sent. Here's the
code:

# python code for creating dictionary of words from an input file
import string, StringIO
import mailbox, email, re
import os
import sys
import email.Parser
import email.Message
import getopt

fp=open(sys.argv[1], 'r')

msg=email.message_from_file(fp)

msg=msg.get_payload()

dictpos={}
wordcount={}
#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')

#words_in_body = fixed_body.split()

msg = fixed_body.split()

for i, w in enumerate(file('dictionary_index')):
	dictpos[w.strip()]=i
	#print i
	#print w

for w in msg:
	try:
		wordcount[w]+=1
		#print wordcount
	except KeyError:
		wordcount[w]=1
		#print wordcount

for w, c in wordcount.iteritems():
	try:
		print dictpos[w],':',c
	except KeyError:
		pass

#print wordcount
#print dictpos
#print '\n'

But this does not give me anything; I get no output at
all. I don't really understand why. If this matches the
words in the email message against the words in the
dictionary, then for every match it should give me the
corresponding index.
I have a piece of code that does check for matches,
but the problem, as I mentioned, is that I need the index
of the word in the dictionary, not the index of the word
in the message.
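
One way to narrow down the missing output is to check whether any
message word is a key in dictpos at all. Below is a minimal sketch,
assuming Python 3 and made-up sample data; overlap_report is a
hypothetical helper, not part of the code above. The capitalised
dictionary entry is only a guess at one possible cause.

```python
def overlap_report(message_words, dictionary_lines):
    """Return (matched, unmatched) message words for a quick sanity check."""
    # Build word -> line index the same way dictpos is filled above.
    dictpos = {}
    for i, line in enumerate(dictionary_lines):
        dictpos[line.strip()] = i
    unique = set(message_words)
    matched = sorted(w for w in unique if w in dictpos)
    unmatched = sorted(w for w in unique if w not in dictpos)
    return matched, unmatched


# Hypothetical sample data, not taken from the real files.
dictionary_lines = ["Hello\n", "world\n", "python\n"]
message_words = "hello world hello spam".split()

matched, unmatched = overlap_report(message_words, dictionary_lines)
print("matched:", matched)
print("unmatched:", unmatched)
```

If matched comes back empty for the real files, the final loop has
nothing to print, and the unmatched list shows what the message words
look like compared with the dictionary entries.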

Here's the code that gives me the vector, by comparing the
words in the email message with the words in the
dictionary:

import string, StringIO
import mailbox, email, re
import os
import sys
import email.Parser
import email.Message
import getopt

#load up external dictionary:
words = open('dictionary_index', 'r').read().split()
dct = {}
for i in xrange(len(words)):
     dct[words[i]] = i

print dct.values()

#make vector:
vector = {}

fp=open(sys.argv[1], 'r')

msg=email.message_from_file(fp)

msg=msg.get_payload()

#a = float(len(fp))

#a = float(len(words_in_body))

#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = msg.translate(65*' '+lower+6*' '+lower+133*' ')

#words_in_body = fixed_body.split()

msg = fixed_body.split()

a = float(len(msg))
print a

for i in msg:
     if i in dct:
         try:
             vector[i] += 1

         except KeyError:
             vector[i] = 1

for v,i in enumerate(vector):
    vector[i] /= a
    print v,i, vector[i]
    # (use the print above if you want to see the word that was common too)
    #print v, ":",vector[i]

    #print "\n"

#1.write(s)
#1.close()
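
In the last loop above, enumerate(vector) only numbers the entries of
vector, so v is a running counter and not the position of the word in
the dictionary. Below is a minimal sketch, assuming Python 3 and
made-up sample words, that keys the result by dct[word] instead;
feature_vector is a hypothetical helper name.

```python
def feature_vector(message_words, dictionary_words):
    """Map dictionary index -> relative frequency of that word in the message."""
    # word -> position in the dictionary (0-based, same as the line number)
    dct = {}
    for i, word in enumerate(dictionary_words):
        dct[word] = i

    total = float(len(message_words))
    counts = {}
    for word in message_words:
        if word in dct:
            idx = dct[word]
            counts[idx] = counts.get(idx, 0) + 1

    # Divide each count by the total number of words in the message.
    return dict((idx, n / total) for idx, n in counts.items())


# Hypothetical sample data.
dictionary_words = ["apple", "banana", "cherry"]
message_words = ["banana", "kiwi", "banana", "apple"]

vec = feature_vector(message_words, dictionary_words)
for idx in sorted(vec):
    print("%d: %s" % (idx, vec[idx]))
```

Here the keys of vec are dictionary positions, so each printed line
has the [index: value] shape the other program expects.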

--- Ruud de Jong <ruud.de.jong at consunet.nl> wrote:
> Small but essential correction on my previous post
> 
> Ruud de Jong wrote:
> 
> > dont bother wrote:
> > 
> >> Hi Jong,
> >> Yes, I really want the location of the number matching
> >> in the dictionary.
> >> This is because I have to input these feature vectors
> >> to another program which takes [index: value] where index is the
> >> value specific to the dictionary.
> >> I don't care about the addition/extension of the words
> >> in the dictionary but for now, I really want the index
> >> of the word in the dictionary. This is also equivalent
> >> to the line number of the word in the dictionary.
> > 
> > 
> > OK. Back to basics. You have:
> > 
> > - a dictionary with one word per line
> > - a message with words that may or may not be
> >   words from the dictionary
> > - another program that takes [index: value] as input,
> >   and presumably does something useful.
> > 
> > So, you want to have a program that does the following:
> > 
> >   for each dictionary word that is present in the message,
> >   output the "index: count", where index is the position of the
> >   word in the dictionary, and count is the number of times
> >   the word occurs in the message.
> > 
> > Side note: your original program divides the count by the
> > total number of words in the message. Since both are integers,
> > this division will always give 0. I will ignore this division
> > for now, but in the actual program you'll need to address that.
> > 
> > Assuming your dictionary is too large to do a search to find the
> > position of an individual word, you basically need two mappings,
> > both keyed by actual words:
> > 
> > dictpos = {}, which maps dictionary words to dictionary positions
> > wordcount = {}, which maps message words to frequency counts
> > dictpos you can fill from your dictionary file:
> > 
> > for i, w in file('dictionary')
> >     dictpos[w.strip()] = i
> 
> This should obviously be:
> 
> for i, w in enumerate(file('dictionary')):
>      dictpos[w.strip()] = i
> 
> > 
> > (strip removes the trailing newline)
> > 
> > wordcount can be filled from the message, like:
> > 
> > for w in msg.split():
> >     try:
> >         wordcount[w] += 1
> >     except KeyError:
> >         wordcount[w] = 1
> > 
> > Now the output can be generated by:
> > 
> > for w, c in wordcount.iteritems():
> >     try:
> >         print dictpos[w], ':', c
> >     except KeyError:
> >         pass
> > 
> > This output is not sorted according to dictionary position.
> > If you need such sorting, then you'll have to capture everything
> > in a list first, and sort that list before printing.
> > 
> > Hope this helps.
> > 
> > Ruud.
> > 
> 
> -- 
> http://mail.python.org/mailman/listinfo/python-list
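
The sorting mentioned at the end of the quoted message can be sketched
as follows: collect the (index, count) pairs in a list, sort it, then
print. This assumes Python 3 and made-up sample mappings; sorted_pairs
is a hypothetical helper name.

```python
def sorted_pairs(wordcount, dictpos):
    """Return (dictionary index, count) pairs ordered by dictionary index."""
    pairs = []
    for word, count in wordcount.items():
        # Skip message words that are not in the dictionary.
        if word in dictpos:
            pairs.append((dictpos[word], count))
    pairs.sort()
    return pairs


# Hypothetical sample mappings.
dictpos = {"apple": 0, "banana": 1, "cherry": 2}
wordcount = {"cherry": 2, "kiwi": 5, "apple": 1}

pairs = sorted_pairs(wordcount, dictpos)
for index, count in pairs:
    print("%d : %d" % (index, count))
```

Tuples sort by their first element, so the list ends up in dictionary
order without a custom comparison.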
