approximate string matching library - any interest?

Tim Churches tchur at optushome.com.au
Thu Aug 28 09:27:39 EDT 2003


Istvan Albert writes:
> I'm working on a project that needs approximate string 
> matching such as the String::Approx module in Perl:
> 
> http://search.cpan.org/author/JHI/String-Approx-3.20/Approx.pm
> 
> Unlike exact matches, approximate (fuzzy) matches can match 
> words having small differences in them: typos, errors or 
> similar spellings.
> 
> I was unable to find a similar implementation in Python right 
> away so I tried wrapping the Perl module's underlying C 
> library into Python calls. It turned out to be fairly easy, 
> man is SWIG an awesome product or what ... in just a few 
> hours I managed to create a quite functional version (see below).
> 
> In the meantime I have also discovered that there is a 
> similar project, Agrepy.py, available but I have no idea how 
> well it works. I'm trying to gauge the interest in this 
> library. Right now it serves my needs, yet I wouldn't 
> mind polishing it up and making it public if it appears to be 
> useful for others too.

There are a number of approximate string matching functions, implemented
in pure Python, included in the Febrl project (Febrl=Freely-extensible
biomedical record linkage). See under "Prototype software" at
http://datamining.anu.edu.au/projects/linkage.html or for details of the
approximate string comparators implemented, see
http://cs.anu.edu.au/~Peter.Christen/febrl-0.2/febrldoc-0.2.1/node37.html

Note that approximate string comparator functions are different from
(although related to) phonetic encoders, such as Soundex (see
http://cs.anu.edu.au/~Peter.Christen/febrl-0.2/febrldoc-0.2.1/node38.html
for some examples of the latter).
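
To make the comparator/encoder distinction concrete, here is a rough
sketch. It is not Febrl's code: similarity() is a hypothetical helper
that uses difflib from the standard library as a stand-in comparator,
and soundex() is a plain textbook American Soundex written for this
example. A comparator takes two strings and returns a score; an encoder
takes one string and returns a code, and two names "match" only if their
codes are equal.

```python
import difflib

# Letter groups for textbook American Soundex (not taken from Febrl).
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Phonetic encoder: map ONE string to a four-character code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


def similarity(a, b):
    """Approximate comparator (stand-in): score TWO strings in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))   # the two codes are equal
    print(similarity("Robert", "Rupert"))         # a graded score, not yes/no
```

The encoder gives an all-or-nothing answer that is cheap to index on,
while the comparator gives a graded score that needs a threshold.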

Wrapped C implementations of any or all of these comparators would be
welcome. In practice we haven't found them to be a major bottleneck,
although calculating the Levenshtein distance on long strings can be
rather expensive. Oh, there are also a number of
interesting vector-space comparison techniques which can be applied to
strings, but we haven't implemented any of these yet in Febrl. Then
there are various language- or culture-specific comparators. And then
there is the whole issue of name comparison in pictographic and
ideographic languages...
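
On the Levenshtein cost mentioned above: a minimal pure-Python sketch
of the usual dynamic-programming formulation shows where the time goes.
This is an illustration written for this message, not the Febrl
implementation, and the function name levenshtein() is just a label I
picked. The nested loops do work proportional to len(a) * len(b), which
is why long strings hurt.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute, cost 1 each)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Only two rows are kept at a time, so memory stays proportional to the
shorter string, but the running time is still quadratic. That is the
part a wrapped C version would speed up.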

And I didn't mention guns once...

Tim C






