Almost Done: Need some Help in Generating FEATURE VECTORS

Josiah Carlson jcarlson at nospam.uci.edu
Fri Mar 5 18:22:56 EST 2004


#First, normalize the line breaks:
email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')

#toss the headers:
pos = email_source.find('\n\n')
if pos != -1:
    email_body = email_source[pos:]
else:
    email_body = email_source

#clean out html, using the method given at
#http://flangy.com/dev/python/striphtml.html
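
If the linked recipe is not at hand, a rough stand-in can be built on the standard library. The following is only a sketch for Python 3, not the method from the link; TextExtractor and strip_html are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def strip_html(source):
    """Return the text of source, with the pieces between tags joined by spaces."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(parser.chunks)
```

Something like email_body = strip_html(email_body) would then slot into this step. Text inside script and style elements is kept as well, so this is cruder than a dedicated cleaner.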

#get rid of anything that isn't a letter, and make it all lowercase.
#The 256-character table maps A-Z and a-z to a-z and every other
#character to a space:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')

words_in_body = fixed_body.split()
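
The 256-character translate table relies on byte strings. On Python 3, where str.translate takes a mapping instead, roughly the same effect can be had with a regular expression. This is a swapped-in technique rather than the code above, and tokenize is a hypothetical name:

```python
import re


def tokenize(text):
    """Turn every character that is not an ASCII letter into a space,
    lowercase the result, and split it on whitespace."""
    return re.sub(r'[^A-Za-z]', ' ', text).lower().split()
```

Called as words_in_body = tokenize(email_body), it covers the translate and split steps in one go.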

#load up external dictionary, mapping each word to its position:
words = open('dictionary', 'r').read().split()
dct = {}
for i, word in enumerate(words):
    dct[word] = i

#make vector of relative word frequencies:
vector = {}
a = float(len(words_in_body))
for i in words_in_body:
    if i in dct:
        try:
            vector[i] += 1
        except KeyError:
            vector[i] = 1

for i in vector:
    vector[i] /= a
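
The vector above is sparse and keyed by word. If a fixed-length vector with one slot per dictionary word is wanted instead, the positions stored in the dictionary can serve as indices. A minimal sketch for Python 3, assuming the word list and the dictionary are already in memory; feature_vector and its parameter names are hypothetical:

```python
from collections import Counter


def feature_vector(words_in_body, dictionary_words):
    """Return a list of relative frequencies, one slot per dictionary word."""
    # Map each dictionary word to its position, as dct does above.
    index = {word: i for i, word in enumerate(dictionary_words)}
    # Count only the words that appear in the dictionary.
    counts = Counter(w for w in words_in_body if w in index)
    total = float(len(words_in_body))
    dense = [0.0] * len(dictionary_words)
    for word, count in counts.items():
        dense[index[word]] = count / total
    return dense
```

For a body of ['spam', 'eggs', 'spam', 'foo'] and a dictionary of ['eggs', 'ham', 'spam'], this gives [0.25, 0.0, 0.5]. As in the code above, the frequencies are relative to all words in the body, not just the ones found in the dictionary.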



I know the above doesn't fit with what you have, but you should be able 
to adapt it.

  - Josiah


