Building a word list from multiple files

Larry Bates lbates at syscononline.com
Fri Nov 19 18:40:49 CET 2004


Email messages should be small enough that reading them
into memory isn't an issue, so line-by-line processing
isn't needed here.
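
A minimal sketch of what I mean by reading the whole message
at once, assuming a current Python 3 and the standard email
package. The function name text_parts is made up for
illustration, and stripping the html tags is left out:

```python
import email
from email import policy


def text_parts(raw_bytes):
    """Return the decoded text of every text/plain or text/html part."""
    # The whole message is parsed in memory; no line-by-line reading.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    texts = []
    for part in msg.walk():
        # Multipart containers have a multipart/* type and are skipped.
        if part.get_content_type() in ("text/plain", "text/html"):
            texts.append(part.get_content())
    return texts
```

You would call it with the bytes of one message file, e.g.
text_parts(open(path, "rb").read()), where path is whatever
file you are processing.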

Email messages have LOTS of punctuation, other than
whitespace, between words.  Just look at your email message
below.  It contains:

 > greater-than symbols
( ) parentheses
. periods
? question marks
, commas

Even text like "html.So no line.." is a problem: periods
with no whitespace after them mean that string split()
would return "html.So" as a word.
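
To see the problem, here is a small sketch that runs the
default split() on a line taken from your message (the
variable names are only for illustration):

```python
# A line from the quoted message, with no space after the period.
text = "type which is plain text or in html.So no line by line processing"

# split() with no arguments splits on runs of whitespace only,
# so punctuation stays glued to the neighbouring words.
words = text.split()

# "html.So" comes back as a single token instead of "html" and "So".
glued = [w for w in words if "." in w]
```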

I really think you are going to need to use a regex to
split this into "words", and even then the words may
be of questionable origin.  See another response for
an example regular expression that might work.  Constructs
like "e.g." will return two words, "e" and "g" (which
might be ok for your application).
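
As a rough sketch only (this is NOT the expression from the
other response, just an assumed pattern that treats any run
of ASCII letters as a word; count_words and WORD_RE are
names I made up), it could look like this:

```python
import re
from collections import Counter

# Assumption: a "word" is any run of ASCII letters. Digits,
# apostrophes and non-ASCII letters are all treated as separators.
WORD_RE = re.compile(r"[A-Za-z]+")


def count_words(text):
    """Count case-folded words found by WORD_RE in text."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


counts = count_words("in html.So no line.. e.g. No")
# "html.So" is now split into "html" and "so", and "e.g."
# turns into the two one-letter words "e" and "g".
```

If one-letter words like "e" and "g" are not wanted, you
could drop anything shorter than two letters before counting.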

Hope this feedback at least helps.

Larry Bates


Manu wrote:
> hi,
> 
>>1) How large are the files you are reading (e.g. can they
>>fit in memory)?
> 
> 
> The files are email messages.
> I will be using the builtin email module to extract only the content
> type which is plain text or in html.So no line by line processing is
> possible unless
> i write my own parser for email.
> 
> 
>>2) Are the words in the file separated with some consistent
>>character (e.g. space, tab, csv, etc).
> 
> 
> in the case of html mail i only extract the text and strip off the
> tags.
> Since this is regular text i expect no special separators and as i
> understand split() by default takes any whitespace character as
> delimiter.This will work fine for my purposes.
> 
> 
> 
>>If not, preprocess the files and use shelve to save a
>>dictionary that has already been processed.  When you
> 
> 
> This is what i was planning to do.Once the processing is done for a
> set of files they are never processed again.I was going to store the
> dict as a string in a file and then use eval() to get it back.
> 
> 
> Thanks
> Manu


