regex-strategy for finding *similar* words?
Peter Maas
peter at somewhere.com
Thu Nov 18 07:46:21 EST 2004
Christoph Pingel wrote:
> Hi all,
>
> an interesting problem for regex nerds.
> I've got a thesaurus of some hundred words and a moderately large
> dataset of about 1 million words in some thousand small texts. Words
> from the thesaurus appear at many places in my texts, but they are often
> misspelled, just slightly different from the thesaurus.
You could set up a list of misspelling cases, scan a word for them
(e.g. citti), and turn it into a regex by applying the suitable cases.
But this is cumbersome. It is probably better to use a string distance,
defined as the least number of operations (add, delete, replace,
exchange) needed to map one string onto another.
Search for '"Levenshtein distance" python' and find e.g.
http://trific.ath.cx/resources/python/levenshtein/
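
To make the string distance idea concrete, here is a minimal pure-Python
sketch. It is not the API of the module linked above. It computes the
"optimal string alignment" variant of the Levenshtein distance, which
counts an exchange of two adjacent characters as one operation in
addition to add, delete and replace. The names edit_distance,
closest_entry and max_distance, as well as the sample words, are
assumptions made for illustration only.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Counts insertions, deletions, replacements and exchanges of two
    adjacent characters, each with cost 1.
    """
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i          # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j          # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # add
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # exchange of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def closest_entry(word, thesaurus, max_distance=2):
    """Return the nearest thesaurus entry within max_distance, else None.

    Hypothetical helper; assumes thesaurus is a non-empty list of strings.
    """
    best = min(thesaurus, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else None
```

With this sketch, edit_distance("citti", "city") is 2 and
closest_entry("citti", ["city", "country"]) returns "city". For a
million words a C implementation is likely to be much faster, but the
matching logic would stay the same.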
--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
-------------------------------------------------------------------