approximate string matching library - any interest?
Tim Churches
tchur at optushome.com.au
Thu Aug 28 09:27:39 EDT 2003
Istvan Albert writes:
> I'm working on a project that needs approximate string
> matching such as the String::Approx module in Perl:
>
> http://search.cpan.org/author/JHI/String-Approx-3.20/Approx.pm
>
> Unlike exact matches, approximate (fuzzy) matches can match
> words having small differences in them: typos, errors or
> similar spellings.
>
> I was unable to find a similar implementation in Python right
> away, so I tried wrapping the Perl module's underlying C
> library into Python calls. It turned out to be fairly easy,
> man is SWIG an awesome product or what ... in just a few
> hours I managed to create a quite functional version (see below).
>
> In the meantime I have also discovered that there is a
> similar project, Agrepy.py, available, but I have no idea how
> well it works. I'm trying to gauge the interest in this
> library; right now it serves my needs, yet I wouldn't
> mind polishing it up and making it public if it appears to be
> useful for others too.
There are a number of approximate string matching functions, implemented
in pure Python, included in the Febrl project (Febrl=Freely-extensible
biomedical record linkage). See under "Prototype software" at
http://datamining.anu.edu.au/projects/linkage.html or, for details of the
approximate string comparators implemented, see
http://cs.anu.edu.au/~Peter.Christen/febrl-0.2/febrldoc-0.2.1/node37.html
Note that approximate string comparator functions are different from
(although related to) phonetic encoders, such as Soundex (see
http://cs.anu.edu.au/~Peter.Christen/febrl-0.2/febrldoc-0.2.1/node38.html
for some examples of the latter).
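
To make that distinction concrete, here is a minimal sketch. It is not
taken from Febrl: similarity() is a hypothetical comparator built on the
standard library's difflib.SequenceMatcher, and soundex() is a simplified
American Soundex written only for illustration. A comparator takes two
strings and returns a score, whereas an encoder takes one string and
returns a code that is then compared exactly.

```python
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def similarity(a, b):
    """Comparator (illustrative): score in [0.0, 1.0] for a pair of strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soundex(name):
    """Encoder (simplified American Soundex): one string in, one code out."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != last:      # collapse runs of the same digit
                result += code
            last = code
        elif ch not in "hw":      # a vowel breaks a run; h and w do not
            last = ""
    return (result + "000")[:4]   # pad or truncate to letter + 3 digits


if __name__ == "__main__":
    print(similarity("Robert", "Rupert"))        # a float between 0 and 1
    print(soundex("Robert"), soundex("Rupert"))  # both encode to R163
```

The comparator gives a graded answer for any pair, while the encoder can
only say whether two codes are equal or not.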
Wrapped C implementations of any or all of these comparators would be
welcome, though in practice we haven't found them to be a major
bottleneck (that said, calculating the Levenshtein distance on long
strings can be rather expensive). Oh, there are also a number of
interesting vector-space comparison techniques which can be applied to
strings, but we haven't implemented any of these yet in Febrl. Then
there are various language- or culture-specific comparators. And then
there is the whole issue of name comparison in pictographic and
ideographic languages...
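
For reference, a minimal pure-Python sketch of the Levenshtein distance
(again an illustration, not the Febrl code) shows where the cost comes
from: the nested loops visit every pair of character positions, so the
work grows with len(a) * len(b). The function name and the two-row
layout are my own choices here.

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```

Keeping two rows limits memory to the length of the shorter string, but
the running time is still quadratic, which is the part a wrapped C
implementation would help with.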
And I didn't mention guns once...
Tim C