regex-strategy for finding *similar* words?

Denis S. Otkidach ods at strana.ru
Thu Nov 18 07:49:57 EST 2004


On Thu, 18 Nov 2004 13:20:08 +0100
Christoph Pingel <ch.pingel at web.de> wrote:

> an interesting problem for regex nerds.
> I've got a thesaurus of some hundred words and a moderately large 
> dataset of about 1 million words in some thousand small texts. Words 
> from the thesaurus appear at many places in my texts, but they are 
> often misspelled, just slightly different from the thesaurus.
> 
> Now I'm looking for the best strategy to match the appearence of my 
> thesaurus items in the texts. Do I have to build patterns from all my 
> thesaurus items for the expected misspellings, or is there a more 
> general approach possible? I tried to add '?.?' to each letter in a 
> thesaurus item, but this is much too weak (I get a lot of false 
> positives because this expression for example matches any possible 
> substring). Adding word boundries helps a little, but sometimes a 
> concept spans two words, and this could be spelled with a space char 
> *or* a dash. In this case, I'm back to matching almose everything. 
> Any ideas?
> BTW, efficiency is not absolutely required, it's not meant to work 
> for realtime requests.

Similar words can be determiner with difflib.SequenceMatcher.ratio()
method or by computing Levenshtein distance.  You have too many
combinations here to compare, so probably some filter (e.g. by word
length difference) will be required to proceed in reasonable time.

Certainly, you can generate regex for each word in thesaurus, but I
doubt it will run faster.

-- 
Denis S. Otkidach
http://www.python.ru/      [ru]



More information about the Python-list mailing list