Fuzzy matching of postal addresses
Jeff Shannon
jeff at ccvcorp.com
Mon Jan 17 19:50:21 EST 2005
Andrew McLean wrote:
> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation,
> but there are other issues.
I'd do a bit more extensive normalization. First, strip off the city
through postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your
examples). In the remaining string, remove any punctuation and words
like "the", "flat", etc.
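
A rough sketch of that munging, under some assumptions: the function name
munge, the STOP_WORDS set, and the town parameter are all made up for
illustration, and the town is assumed to be a single known word that
marks where the city-through-postcode tail begins. It also lower-cases
everything, which the transformed examples below do not.

```python
import re

# Assumed stop words; only "the" and "flat" are named above, the rest is up to you.
STOP_WORDS = {"the", "flat"}

def munge(address, town="beaminster"):
    """Return a simplified, lower-case form of an address for comparison."""
    # Drop apostrophes first so "John's" becomes "johns" rather than "john s".
    text = address.lower().replace("'", "")
    # Any other run of punctuation or whitespace becomes a word break.
    words = re.sub(r"[^a-z0-9]+", " ", text).split()
    # Cut everything from the town name onwards (town, county, postcode).
    if town in words:
        words = words[:words.index(town)]
    return " ".join(w for w in words if w not in STOP_WORDS)
```

Passing the town in explicitly keeps the sketch short; real data would
need a list of known towns or a postcode-based cut instead.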
> Here are just some examples where the software didn't declare a match:
And how they'd look after the transformation I suggest above:
> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
1 Brantwood
BEECHES 1 BRANTWOOD
> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
2 Bethany House Broadwindsor Road
2 BETHANY HOUSE
> Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE
> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR
> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL
Obviously, this is not perfect, but it's closer. At this point, you
could perhaps say that if either string is a substring of the other,
you have a match. That should work with all of these examples except
the last one. You could either do this munging for all address
lookups, or you could do it only for those that don't find a match in
the simplistic way. Either way, you can store Database B's pre-munged
addresses so that you don't need to constantly recompute them. I can't
say for certain how this will perform in terms of false positives, but
I'd expect that it wouldn't be too bad.
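
The substring test and the stored pre-munged addresses might look
something like the sketch below. The names is_substring_match, db_b and
find_matches are hypothetical, as are the record ids; the addresses are
the already-munged forms from the examples above, lower-cased. One
refinement not in the suggestion itself: both strings are padded with
spaces so that only whole words match, which stops "1 brantwood" from
matching "11 brantwood".

```python
def is_substring_match(a, b):
    """True if either munged address contains the other as a run of whole words."""
    a_pad, b_pad = f" {a} ", f" {b} "
    return a_pad in b_pad or b_pad in a_pad

# Hypothetical Database B records, munged once up front and keyed by a made-up id.
db_b = {
    101: "beeches 1 brantwood",
    102: "2 bethany house",
    103: "pinnacles whitesheet hill",
}

def find_matches(munged_query):
    """Return the ids of all Database B records that substring-match the query."""
    return [rid for rid, addr in db_b.items()
            if is_substring_match(munged_query, addr)]
```

With this data, find_matches("1 brantwood") picks out record 101, while
the "white sheet" versus "whitesheet" pair finds nothing, as noted above.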
For more detailed matching, you might look into finding an algorithm
to determine the "distance" between two strings and using that to
score possible matches.
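
As one possible way to score matches, the standard library's
difflib.SequenceMatcher gives a similarity ratio between 0 and 1. That
is a similarity rather than a true edit distance, and it is only one
choice among several. The helper names and the 0.8 threshold below are
assumptions picked for illustration, not tuned values.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two munged addresses: 1.0 means identical, 0.0 nothing shared."""
    return SequenceMatcher(None, a, b).ratio()

def best_match(query, candidates, threshold=0.8):
    """Return the highest-scoring candidate, or None if none reaches the threshold."""
    scored = [(similarity(query, c), c) for c in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None
```

This should catch the "white sheet hill" versus "whitesheet hill" case
that the substring test misses, since the two strings differ by a single
space.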
Jeff Shannon
Technician/Programmer
Credit International