Fuzzy string matching?
Magnus L. Hetland
mlh at idt.ntnu.no
Thu Aug 26 20:52:19 CEST 1999
"Hans Nowak" <hnowak at cuci.nl> writes:
> ...Anyone know of such a beast in Python?
> Is such a thing done in Python already? If
> not, would it be difficult to do? (I obviously don't how to do it
> myself otherwise I would have done so already... :-/ )
I did some of this a while back, in a simple CGI-script trying to
merge two web pages... The point was - my local cinema had a web-page
showing which films were running, and when they were on etc. If you
clicked on the movie names, you invariably ended up on another page
which listed all the movie names - and if you clicked on one of
*them*, you would get a description of the film.
Idiotic design IMO.
So - I tried to make a simple script that extracted the URLs from the
second page and inserted them into the first, instead of the one
leading to the second... The problem was that the names weren't the
There were several differences, among them: Different case,
abbreviations (& instead of "and"), different punctuation, different
cutoffs (length of strings), different use of whitespace, different
encoding of scandinavian letters (8-bit ASCII or HTML entities) etc.,
etc. And it was all unsystematic, as it seemed that the pages were
maintaned by different people...
So - I had a problem similar to yours... I didn't do anything
revolutionary (and the script is far from foolproof yet...), but...
What I did was remove all the things that were different, creating a
"standard form"... I removed all whitespace, made all letters
lowercase, changed all HTML entities into their ISO8859-1 equivalents,
cut the string after 20 characters, etc.
It worked to some degree. Of course, when you get misspellings etc.,
you have a problem... But you could use the (old) soundex module to
overcome some problems...
Other than that - fuzzy matching is a non-trivial problem. It is
possible to use dynamic programming to make a quadratic time distance
function that you can use to compare words or strings - to see how
similar they are. (If you like, I can post a Python variant of it.)
Magnus Making no sound / Yet smouldering with passion
Lie The firefly is still sadder / Than the moaning insect
Hetland : Minamoto Shigeyuki
More information about the Python-list