Fuzzy matching of postal addresses

Jeff Shannon jeff at ccvcorp.com
Mon Jan 17 19:50:21 EST 2005


Andrew McLean wrote:

> The problem is looking for good matches. I currently normalise the 
> addresses to ignore some irrelevant issues like case and punctuation, 
> but there are other issues.


I'd normalize a bit more extensively.  First, strip off everything 
from the city through the postal code (e.g. 'Beaminster, Dorset, 
DT8 3SS' in your examples).  In the remaining string, remove any 
punctuation and filler words like "the", "flat", etc.
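
A rough sketch of that normalization, in case it helps.  The names 
here (munge, NOISE_WORDS, POSTCODE_RE) are my own inventions, the 
postcode pattern is a simplified guess at the UK format rather than a 
full validator, and I'm assuming the town and county are always the 
last two comma-separated fields once the postcode is gone:

```python
import re

# Assumed filler words; only "the" and "flat" are named above, so
# extend this set to suit the real data.
NOISE_WORDS = {"the", "flat"}

# Simplified trailing-postcode pattern (an assumption, not a validator).
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\s*$",
                         re.IGNORECASE)

def munge(address):
    """Strip town/county/postcode, punctuation and filler words."""
    # Drop the postcode, then any comma or space left dangling.
    head = POSTCODE_RE.sub("", address).rstrip(" ,")
    # Assume the last two comma-separated fields are town and county.
    fields = head.split(",")[:-2]
    # Delete apostrophes outright so "John's" becomes "Johns".
    text = " ".join(fields).replace("'", "")
    # Turn all remaining punctuation into spaces and split into words.
    words = re.sub(r"[^A-Za-z0-9]+", " ", text).split()
    return " ".join(w for w in words if w.lower() not in NOISE_WORDS)

if __name__ == "__main__":
    print(munge("1 Brantwood, BEAMINSTER, DORSET, DT8 3SS"))
    print(munge("THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS"))
```

Addresses with fewer than three comma-separated fields come out 
empty with this approach, so real data would need a fallback.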

> Here are just some examples where the software didn't declare a match:

And how they'd look after the transformation I suggest above:

> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

1 Brantwood
BEECHES 1 BRANTWOOD

> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

2 Bethany House Broadwindsor Road
2 BETHANY HOUSE

> Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE

> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR

> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL


Obviously, this is not perfect, but it's closer.  At this point, you 
could perhaps say that if either string is a substring of the other 
(ignoring case), you have a match.  That should work with all of these 
examples except the last one, where "White Sheet" and "WHITESHEET" 
differ by a space.  You could either do this munging for all address 
lookups, or you could do it only for those that don't find a match 
with the simple comparison.  Either way, you can store Database B's 
pre-munged addresses so that you don't need to constantly recompute 
them.  I can't say for certain how this will perform in the false 
positives department, but I'd expect that it wouldn't be too bad.
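
The substring test could look something like the sketch below.  The 
function name is made up for illustration, and the inputs are the 
already-munged strings from the examples above:

```python
def is_substring_match(a, b):
    """True if either munged address contains the other, ignoring case."""
    a, b = a.casefold(), b.casefold()
    # Reject empty strings, which are trivially "in" everything.
    return bool(a) and bool(b) and (a in b or b in a)

# Munged pairs copied from the examples in this post.
PAIRS = [
    ("1 Brantwood", "BEECHES 1 BRANTWOOD"),
    ("2 Bethany House Broadwindsor Road", "2 BETHANY HOUSE"),
    ("Penthouse Old Vicarage 1 Clay Lane",
     "PENTHOUSE OLD VICARAGE 1 CLAY LANE"),
    ("St Johns Presbytery Shortmoor", "PRESBYTERY SHORTMOOR"),
    ("Pinnacles White Sheet Hill", "PINNACLES WHITESHEET HILL"),
]

if __name__ == "__main__":
    for a, b in PAIRS:
        print(is_substring_match(a, b), "|", a, "|", b)
```

One caveat: a plain substring test would also accept "1 Brantwood" 
inside "11 Brantwood", so comparing whole words rather than raw 
characters might be a worthwhile refinement.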

For more detailed matching, you might look into finding an algorithm 
to determine the "distance" between two strings and using that to 
score possible matches.
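
One option that needs nothing outside the standard library is 
difflib.SequenceMatcher, which gives a similarity ratio between 0 and 
1 rather than a true edit distance.  This is only a sketch: the 
similarity() and best_match() helpers and the 0.9 cutoff are my own 
assumptions, and the cutoff would need tuning against real data.

```python
import difflib

def similarity(a, b):
    """Similarity ratio in [0, 1] between two munged addresses."""
    return difflib.SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def best_match(query, candidates, cutoff=0.9):
    """Return the best-scoring candidate, or None if below the cutoff.

    The 0.9 cutoff is an arbitrary starting point, not a tested value.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(query, c))
    return best if similarity(query, best) >= cutoff else None

if __name__ == "__main__":
    candidates = ["BEECHES 1 BRANTWOOD", "PINNACLES WHITESHEET HILL"]
    print(best_match("Pinnacles White Sheet Hill", candidates))
```

This picks up the "White Sheet" / "WHITESHEET" pair that the 
substring test misses, since the two strings differ by a single 
space.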

Jeff Shannon
Technician/Programmer
Credit International



