Building a word list from multiple files

Steven Bethard steven.bethard at
Thu Nov 18 20:56:55 CET 2004

Larry Bates wrote:
> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).
> If not, you will probably need to use regular expressions
> to handle all different punctuations that might separate
> the words.  Things like quotes, commas, periods, colons,
> semi-colons, etc.  Simple string split won't handle these
> properly.

If you go this way, you probably ought to read this thread:

which suggests finding words with a regexp something like r'[^\W\d_]+',
i.e. runs of "word" characters excluding digits and underscores.
(If you're not concerned about internationalization, it could be
simpler, e.g. r'[a-zA-Z]+'.)
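
For what it's worth, here is a rough sketch of how that regexp might be applied across several files. The function names (words_in_text, build_word_list), the lowercasing, the UTF-8 default, and the choice of a sorted list of distinct words are all my own assumptions about what "word list" means here, not something taken from the thread:

```python
import re

# Runs of "word" characters (\w) with decimal digits and the underscore
# excluded, which leaves essentially letters. With str patterns in
# Python 3 this is Unicode-aware by default.
WORD_RE = re.compile(r'[^\W\d_]+')


def words_in_text(text):
    """Return the words found in text, in order of appearance."""
    return WORD_RE.findall(text)


def build_word_list(paths, encoding='utf-8'):
    """Return a sorted list of the distinct lowercased words in all files.

    Hypothetical helper: lowercasing and de-duplication are assumptions.
    """
    words = set()
    for path in paths:
        with open(path, encoding=encoding) as f:
            for line in f:
                words.update(w.lower() for w in words_in_text(line))
    return sorted(words)
```

Note that this pattern treats the apostrophe as a separator, so "It's" comes back as "It" and "s". If that matters for your data, the pattern would need adjusting.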


