Building a word list from multiple files
Steven Bethard
steven.bethard at gmail.com
Thu Nov 18 14:56:55 EST 2004
Larry Bates wrote:
> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).
>
> If not, you will probably need to use regular expressions
> to handle all different punctuations that might separate
> the words. Things like quotes, commas, periods, colons,
> semi-colons, etc. Simple string split won't handle these
> properly.
If you go this way, you probably ought to read this thread:
http://mail.python.org/pipermail/python-list/2004-November/250520.html
which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationalization, a simpler pattern
such as r'[a-zA-Z]+' would do.)
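Here is a minimal sketch of that approach. It assumes Python 3, where str patterns are Unicode-aware by default (on Python 2 you would pass re.UNICODE), and files encoded as UTF-8. The names words_in_text and word_list are made up for illustration, and lowercasing the words is my own choice, not something the thread prescribes.

```python
import re

# Runs of letters only: any character that is NOT a non-word character,
# a digit, or an underscore. With str patterns this covers non-ASCII letters.
WORD_RE = re.compile(r'[^\W\d_]+')


def words_in_text(text):
    """Return the words found in a string, in order of appearance."""
    return WORD_RE.findall(text)


def word_list(filenames, encoding='utf-8'):
    """Collect the distinct lowercased words from several files, sorted.

    The encoding default is an assumption; adjust it to match your files.
    """
    words = set()
    for name in filenames:
        with open(name, encoding=encoding) as f:
            for line in f:
                words.update(w.lower() for w in words_in_text(line))
    return sorted(words)
```

Note that this pattern treats the apostrophe as a separator, so "It's" comes out as "It" and "s". If that matters for your data, the pattern needs adjusting.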
STeve