Building a word list from multiple files
manu.1982 at gmail.com
Fri Nov 19 05:16:30 CET 2004
> 1) How large are the files you are reading (e.g. can they
> fit in memory)?
The files are email messages.
I will be using the builtin email module to extract only the content
whose type is plain text or html. So no line by line processing is
i write my own parser for email.
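
A rough sketch of the extraction step described above, using only the standard email module. The function name extract_text_parts and the utf-8 fallback charset are assumptions made for illustration, not something settled in this thread.

```python
import email


def extract_text_parts(raw_message):
    """Return (content_type, text) pairs for the text/plain and text/html parts.

    extract_text_parts is a hypothetical helper name; raw_message is the
    full message source as a string.
    """
    msg = email.message_from_string(raw_message)
    parts = []
    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumption: fall back to utf-8 when no (or an unknown) charset is declared.
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            text = payload.decode("utf-8", errors="replace")
        parts.append((ctype, text))
    return parts
```

Since the whole message is parsed in one go, nothing here reads the file line by line.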
> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).
In the case of html mail i only extract the text and strip off the
Since this is regular text i expect no special separators, and as i
understand it split() by default takes any whitespace character as a
delimiter. This will work fine for my purposes.
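
To illustrate the split() behaviour mentioned above: called with no arguments it splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings. The sketch below assumes the word list is kept as a dict mapping each word to a count; the name add_words and the choice to store counts are only for illustration.

```python
def add_words(text, counts):
    """Add every whitespace-separated word in text to the counts dict.

    add_words is a hypothetical helper; counts maps word -> occurrences.
    """
    # split() with no argument treats any run of whitespace as one delimiter.
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = {}
add_words("spam  eggs\tspam\nham", counts)
print(counts)
```

The same dict can be passed in again for each message, so the list grows across files.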
> If not, preprocess the files and use shelve to save a
> dictionary that has already been processed. When you
This is what i was planning to do. Once the processing is done for a
set of files they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.
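
A small sketch of the store-as-string idea described above. Note that it swaps eval() for ast.literal_eval(), which only accepts Python literals (dicts, strings, numbers and so on) and so will not run arbitrary code if the file is tampered with. The names save_dict and load_dict are made up for this example, and it assumes the dict holds only plain strings and numbers.

```python
import ast


def save_dict(d, path):
    """Write the repr() of a dict of plain literals to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(repr(d))


def load_dict(path):
    """Read the dict back; literal_eval refuses anything that is not a literal."""
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())
```

The shelve module suggested in the quoted reply is another option; it avoids rewriting and reparsing the whole dict every time.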