Fuzzy matching of postal addresses
spam-trap-095 at at-andros.demon.co.uk
Tue Jan 18 01:02:07 CET 2005
I have a problem that I suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.
Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.
I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases
could be a pain to deal with!
It's probably not relevant, but I'm using ZODB to store the databases.
The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically fewer than 50). The database has a mapping that lets me do
this efficiently. Then I look for 'good' matches. If there is exactly
one I declare a success. This isn't as efficient as it could be: it's
O(n^2) for each postcode, because I end up comparing all possible pairs.
But it's fast enough for my application.
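The loop described above can be sketched roughly as follows. This is only an illustration: the function names, the (postcode, address) tuple layout and the is_good_match callback are assumptions made for the example, not the actual ZODB-backed code.

```python
from collections import defaultdict


def index_by_postcode(records):
    """Group (postcode, address) pairs into a dict keyed by postcode."""
    index = defaultdict(list)
    for postcode, address in records:
        index[postcode].append(address)
    return index


def match_all(db_a, db_b, is_good_match):
    """For each address in db_a, look for exactly one good match in db_b.

    db_a and db_b are iterables of (postcode, address) pairs (an assumed
    layout). is_good_match is a caller-supplied predicate taking two
    address strings.
    """
    index_b = index_by_postcode(db_b)
    matched, unmatched = {}, []
    for postcode, address in db_a:
        # Only addresses sharing the postcode are compared.
        good = [c for c in index_b.get(postcode, [])
                if is_good_match(address, c)]
        if len(good) == 1:
            matched[address] = good[0]
        else:
            # Zero matches or an ambiguous result: leave for review.
            unmatched.append(address)
    return matched, unmatched
```

Keeping the ambiguous and unmatched addresses in a separate list makes it easy to inspect the remaining 10% by hand.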
The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation,
but there are other issues.
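A minimal sketch of that kind of normalisation, using only the standard re module. The exact rules here (dropping apostrophes, turning every other non-alphanumeric run into a space) are assumptions for illustration and may differ from what my code really does.

```python
import re


def normalise(address):
    """Fold case and strip punctuation from an address string."""
    text = address.upper()
    # Drop apostrophes so that "John's" becomes "JOHNS", not "JOHN S".
    text = text.replace("'", "")
    # Replace every other run of non-alphanumeric characters with a space.
    text = re.sub(r"[^A-Z0-9]+", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(text.split())


print(normalise("St John's Presbytery, Shortmoor"))
# prints: ST JOHNS PRESBYTERY SHORTMOOR
```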
Here are just some examples where the software didn't declare a match:
1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
The challenge is to fix some of the false negatives above without
introducing false positives!
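One possible direction, sketched with the standard library's difflib: score each candidate by string similarity, but refuse any pair whose numeric tokens differ, and accept the best candidate only if it clearly beats the runner-up. The noise-word list, the threshold and the margin are untested guesses that would need tuning against real data, and all the names are made up for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words; extend after inspecting real mismatches.
NOISE_WORDS = {"THE", "FLAT"}


def tokens(address):
    """Upper-case, strip punctuation, split, and drop noise words."""
    cleaned = re.sub(r"[^A-Z0-9]+", " ", address.upper().replace("'", ""))
    return [t for t in cleaned.split() if t not in NOISE_WORDS]


def similarity(a, b):
    """Return 0.0 if the numeric tokens differ, else a ratio in 0..1."""
    ta, tb = tokens(a), tokens(b)
    numbers_a = {t for t in ta if t[0].isdigit()}
    numbers_b = {t for t in tb if t[0].isdigit()}
    if numbers_a != numbers_b:
        # Guard against false positives such as flat 1 versus flat 2.
        return 0.0
    # Join without spaces so WHITE SHEET and WHITESHEET compare equal.
    return SequenceMatcher(None, "".join(ta), "".join(tb)).ratio()


def best_match(address, candidates, threshold=0.8, margin=0.1):
    """Return the single clear winner among candidates, or None."""
    scored = sorted(((similarity(address, c), c) for c in candidates),
                    reverse=True)
    if not scored or scored[0][0] < threshold:
        return None
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return None  # too close to call
    return scored[0][1]
```

The number guard is deliberately strict: it would rather miss a match than pair up two different flats in the same building.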
Any pointers gratefully received.