Almost Done: Need some Help in Generating FEATURE VECTORS

dont bother dontbotherworld at yahoo.com
Fri Mar 5 04:36:09 EST 2004


Hey,

I want to map my document vectors to an n-dimensional
feature space.

Here is what I have to do:

Part 1. Take emails, parse them, and generate a dictionary
of words from the spam messages.

Part 2. When a new email arrives, parse it and look up
each of its words in my dictionary.

For example, suppose my dictionary contains these words:

1. Hi
2. Bye
3. you 
4. cola
5. pepsi
6. viagra
7. weight


Suppose a new email arrives, and after parsing it I have the
following body in a file:

"Hi How is viagra. Loose weight"
Total number of words = 6

I have to compare this body with the words in the
dictionary and create a feature vector like this:

Index --> word in the email

1 --> 'Hi'
6 --> 'viagra'
7 --> 'weight'

Feature Vector for the email:

[1:1/6 6:1/6 7:1/6]

since each of these words appears only once and the total
number of words is 6.
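
The mapping described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a finished implementation: the function name `feature_vector`, the letters-only tokenising rule, and the case-insensitive matching are all assumptions of the sketch, and it is written for a current Python 3 interpreter rather than the Python 2 used in the scripts further down.

```python
import re
from collections import Counter
from fractions import Fraction


def feature_vector(dictionary_words, body):
    """Return {index: relative frequency} for dictionary words found in body.

    Indexes are the 1-based positions of the words in dictionary_words.
    """
    # Assumption: matching ignores case, so 'Hi' and 'hi' are the same word.
    index_of = {}
    for i, word in enumerate(dictionary_words, start=1):
        index_of.setdefault(word.lower(), i)

    # Assumption: a "word" is a run of letters; punctuation is dropped.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", body)]
    total = len(tokens)
    if total == 0:
        return {}

    # Count only the tokens that occur in the dictionary.
    counts = Counter(t for t in tokens if t in index_of)
    return {index_of[w]: Fraction(n, total) for w, n in counts.items()}


dictionary_words = ["Hi", "Bye", "you", "cola", "pepsi", "viagra", "weight"]
vec = feature_vector(dictionary_words, "Hi How is viagra. Loose weight")
print(" ".join("%d:%s" % (i, vec[i]) for i in sorted(vec)))
# prints: 1:1/6 6:1/6 7:1/6
```

Keeping the values as `Fraction` objects reproduces the 1/6 notation exactly; converting them with `float()` would be the likely choice if the vector is fed to a numeric classifier.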

-----------------------------------------------------

I have been able to do Part 1: get an email, parse it,
remove HTML headers, get the payload, and generate a
dictionary.

What I don't know is this:


a)- How to strip blank spaces and characters like
^M from my dictionary
b)- How to remove numbers from my dictionary
c)- How to remove the To, From and Message-ID headers from
my dictionary
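
For reference, one possible shape for the cleanup in (a) to (c) is sketched below. It is a rough sketch under assumptions rather than a tested solution: the name `clean_words`, the `HEADER_WORDS` set, the rule that any token containing a digit counts as a "number", and the removal of duplicates are all hypothetical choices, and the code targets Python 3.

```python
# Assumption: header names show up in the dictionary as standalone tokens
# such as "From:"; extend this set as needed.
HEADER_WORDS = {"to:", "from:", "message-id:"}


def clean_words(raw_lines):
    """Return the unique cleaned words, in first-seen order."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # strip() removes surrounding whitespace, including '\r',
        # which many editors display as ^M.
        word = line.strip()
        if not word:
            continue  # (a) skip blank entries
        if any(ch.isdigit() for ch in word):
            continue  # (b) skip tokens that contain a digit
        if word.lower() in HEADER_WORDS:
            continue  # (c) skip header names
        if word not in seen:  # assumption: each word is kept only once
            seen.add(word)
            cleaned.append(word)
    return cleaned


raw = ["Hi\r\n", "  \n", "2004\n", "From:\n", "viagra\r\n", "Hi\n", "win4free\n"]
print(clean_words(raw))
# prints: ['Hi', 'viagra']
```

The same function could be applied to the lines of the 'dictionary' file after it has been written, or to the words before they are written.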

The really important ones, where I most need help, are:

d)- How to compare the words from the payload of the
new email message, which I write to another file,
with the dictionary indexes

e)- How to create the feature vector I described in
Part 2 above

I know these are not difficult, but it is a matter of
ignorance because I am a newbie with Python. I chose
Python instead of Java because I heard that parsing emails
is really easy in it, which is true.

I would really appreciate it if someone could give me a
hand with this.

Thanks

Doon't


-------------------------------------------------
My code for parsing emails and generating the dictionary
is below: emailparser.py and dictionary.py
--------------------------------------------------
#emailparser.py

#!/usr/local/bin/py


import string, StringIO, sys
import mailbox, email, re


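def parse_mail(msg):
    # Multipart messages are skipped by the caller; return an empty body
    # for them so the function always returns a string.
    if msg.is_multipart():
        return ''

    # Get the body (payload) of the message
    body = msg.get_payload()

    # Delete unwanted headers from the message object. The payload was
    # extracted separately, so this does not change the returned body.
    for hdr in msg.keys():
        for prefix in ('From', 'To', 'Received', 'X-'):
            if hdr.startswith(prefix):
                del msg[hdr]

    # Process the body to remove HTML tags such as <b>
    body = re.sub(r'<[^>]*>', '', body)

    return body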

if __name__ == '__main__':

    if len(sys.argv) == 1:
        print """
        emailparser.py MBOX_FILE

        """
        sys.exit(0)

    f = open(sys.argv[1], 'r')
    mbox = mailbox.UnixMailbox(f, email.message_from_file)
    f1 = open('output', 'w')

    num = 0
    while 1:
        num = num + 1
        try:
            msg = mbox.next()
        except email.Errors.HeaderParseError:
            print 'Current mail (num = ' + str(num) + ') seems to have a parse error. Skipping'
            continue

        if not msg: break

        if msg.is_multipart():
            print 'Skipping a multipart email (num ' + str(num) + ')'
            continue
        s = parse_mail(msg)
        # Write each body as it is parsed; writing once after the loop
        # would save only the last message.
        f1.write(s)

    f1.close()
    f.close()


#------------------------------------------------------


#dictionary.py

# python code for creating dictionary of words

import os
import sys
import re

try:
    fread = open(sys.argv[1], 'r')
except IOError:
    print "Can't open file for reading"
    sys.exit(1)
print 'Okay reading the file'

# Open the output file once, instead of once per word
fwrite = open('dictionary', 'a')

for line in fread:
    # Split on single blanks only, as before; runs of blanks and
    # characters such as ^M are not cleaned up here.
    for word in line.rstrip('\n').split(' '):
        # One word per line
        fwrite.write(word + '\n')

print 'Wrote to Dictionary\n'
fwrite.close()
fread.close()

#------------------------------------------------------





