regex-strategy for finding *similar* words?
Thomas Guettler
guettli at thomas-guettler.de
Thu Nov 18 10:35:17 EST 2004
On Thu, 18 Nov 2004 13:20:08 +0100, Christoph Pingel wrote:
> Hi all,
>
> an interesting problem for regex nerds.
> I've got a thesaurus of some hundred words and a moderately large
> dataset of about 1 million words in some thousand small texts. Words
> from the thesaurus appear at many places in my texts, but they are
> often misspelled, just slightly different from the thesaurus.
Hi,
You can write a function that takes a single word
and returns a normalized version:
normalize("...ies") --> "...y"
normalize("running") --> "run"
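A minimal sketch of what such a normalize() could look like. The two
suffix rules and the length thresholds are only illustrative assumptions;
a real rule set would have to be tuned to the thesaurus and the texts.

```python
def normalize(word):
    """Return a crude normalized form of a single word.

    The rules below are illustrative assumptions, not a complete stemmer.
    """
    w = word.lower()
    # "...ies" --> "...y"; skip very short words such as "ties"
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    # "...ing" --> stem; skip short words such as "sing" or "bring"
    if w.endswith("ing") and len(w) > 5:
        stem = w[:-3]
        # Undo a doubled final consonant ("runn" --> "run"), but leave
        # endings like "ll" and "ss" alone ("tell", "pass").
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    return w
```

Rules this simple will get some words wrong ("making" becomes "mak"), so
treat the result as a lookup key rather than as a correct word.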
Then build a big dictionary which maps each normalized word
to the list of files in which it occurs. Only
add normalized words to the dictionary (or database), and
normalize the thesaurus words the same way before looking them up.
bigdict={"foo": ["file1.txt", "file2.txt", ...]}
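Building such a dictionary could be sketched like this. The name
build_index, the in-memory texts argument (filename to text) and the
word-splitting regex are assumptions made for the example; the default
normalizer only lowercases, and any function that maps a word to its
normalized form can be passed in instead.

```python
import re
from collections import defaultdict

def build_index(texts, normalize=str.lower):
    """Map each normalized word to the sorted list of files containing it.

    texts: dict of filename -> text (hypothetical input format).
    normalize: callable applied to every word before it is stored.
    """
    index = defaultdict(set)
    for filename, text in texts.items():
        # [^\W\d_]+ matches runs of letters, including non-ASCII ones
        for word in re.findall(r"[^\W\d_]+", text):
            index[normalize(word)].add(filename)
    return {word: sorted(files) for word, files in index.items()}

bigdict = build_index({
    "file1.txt": "Foo bar",
    "file2.txt": "foo baz",
})
print(bigdict["foo"])  # ['file1.txt', 'file2.txt']
```

Collecting the filenames in a set first means a file is listed only once
per word, no matter how often the word appears in it.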
HTH,
Thomas