Building a word list from multiple files

Steven Bethard steven.bethard at gmail.com
Thu Nov 18 14:56:55 EST 2004


Larry Bates wrote:
> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).
> 
> If not, you will probably need to use regular expressions
> to handle all different punctuations that might separate
> the words.  Things like quotes, commas, periods, colons,
> semi-colons, etc.  Simple string split won't handle these
> properly.

If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/python-list/2004-November/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+',
i.e. runs of word characters that are neither digits nor underscores.
(If you're not concerned about internationalization, it could be simpler.)
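
A rough sketch of how that pattern might be applied across several files, in current Python 3. The helper names (words_in_text, build_word_list), the lowercasing, and the UTF-8 default are my own assumptions for illustration, not anything taken from the linked thread:

```python
import re
from pathlib import Path

# [^\W\d_]+ matches runs of word characters that are neither digits nor
# underscores, i.e. letters only. In Python 3, str patterns are
# Unicode-aware by default, so accented letters are matched as well.
WORD_RE = re.compile(r"[^\W\d_]+")


def words_in_text(text):
    """Return every letter-only run found in text, in order of appearance."""
    return WORD_RE.findall(text)


def build_word_list(paths, encoding="utf-8"):
    """Collect the distinct lowercased words from all files in paths.

    The encoding default and the lowercasing are assumptions; adjust
    them to match the files being read.
    """
    words = set()
    for path in paths:
        text = Path(path).read_text(encoding=encoding)
        words.update(word.lower() for word in words_in_text(text))
    return sorted(words)
```

Note that an apostrophe counts as a separator here, so "It's" comes back as "It" and "s". If that matters for your data, the pattern would need to be extended.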

STeve


