Fuzzy matching of postal addresses
newsgroups at jhrothjr.com
Tue Jan 18 01:59:44 CET 2005
"Andrew McLean" <spam-trap-095 at at-andros.demon.co.uk> wrote in message
news:96B2w2E$HF7BFwSq at at-andros.demon.co.uk...
>I have a problem that I suspect isn't unusual and I'm looking to see if
>there is any code available to help. I've Googled without success.
There isn't any publicly available code that I'm aware of.
Companies that do a good job of address matching regard
that code as a competitive advantage on a par with the
> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More precisely,
> for each address in database A I want to find a single matching address in
> database B.
> I'm 90% of the way there, in the sense that I have a simplistic approach
> that matches 90% of the addresses in database A. But the extra cases could
> be a pain to deal with!
From a purely pragmatic viewpoint, is this a one-off, and how many
non-matches do you have to deal with? If the answers are yes,
and not all that many, I'd do the rest by hand.
> It's probably not relevant, but I'm using ZODB to store the databases.
I doubt if it's relevant.
> The current approach is to loop over addresses in database A. I then
> identify all addresses in database B that share the same postal code
> (typically less than 50). The database has a mapping that lets me do this
> efficiently. Then I look for 'good' matches. If there is exactly one I
> declare a success. This isn't as efficient as it could be, it's O(n^2) for
> each postcode, because I end up comparing all possible pairs. But it's
> fast enough for my application.
> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation, but
> there are other issues.
I used to work on a system that had a reasonably decent address
matching routine. The critical issue is, as you suspected, normalization.
You're not going far enough. You've also got an issue here that doesn't
exist in the States - named buildings.
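To make the starting point concrete, here is a minimal sketch of the approach described in the quoted text: group database B by postal code, normalise case and punctuation, and accept a match only when exactly one candidate agrees. It is not the original poster's code; the function names and the (postcode, address) tuple layout are assumptions made for illustration.

```python
import string
from collections import defaultdict

# Map every punctuation character to a space so "1, BRANTWOOD" and
# "1 Brantwood" normalise to the same string.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(address):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(address.lower().translate(_PUNCT_TO_SPACE).split())


def match_by_postcode(db_a, db_b):
    """Return {address_in_a: address_in_b} for unambiguous matches.

    db_a and db_b are iterables of (postcode, address) pairs
    (a hypothetical layout, not taken from the original post).
    """
    by_postcode = defaultdict(list)
    for postcode, address in db_b:
        by_postcode[postcode].append(address)

    matches = {}
    for postcode, address in db_a:
        key = normalise(address)
        candidates = [b for b in by_postcode.get(postcode, [])
                      if normalise(b) == key]
        # Declare success only when exactly one candidate agrees.
        if len(candidates) == 1:
            matches[address] = candidates[0]
    return matches
```

Everything that follows is about making normalise() smarter; the surrounding loop can stay as simple as this.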
> Here are just some examples where the software didn't declare a match:
> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
The first line is a street address, the second is a named building and a
street without a house number. There's no way of matching this unless you know
that The Beeches doesn't have flat (or room, etc.) numbers and can move the
1 to being the street address. On the other hand, this seems to be a
consistent problem in your data base - in the US, the street address must
be associated with the street name. No comma is allowed between the two.
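A rough sketch of that repair, assuming the misplaced house number always sits at the end of the component just before the street name. The function name and the regular expression are mine, and the rule is only safe if you know the building has no flat numbers of its own.

```python
import re

# A component such as "THE BEECHES 1": some text, whitespace, then a
# number with an optional single letter suffix (e.g. "12A").
TRAILING_NUMBER = re.compile(r"^(?P<name>.*\S)\s+(?P<number>\d+[A-Za-z]?)$")


def move_trailing_number(address):
    """Split on commas and move a trailing number onto the next component.

    The last component is never examined, so a postcode at the end of
    the address is left alone.
    """
    parts = [p.strip() for p in address.split(",") if p.strip()]
    for i in range(len(parts) - 1):
        m = TRAILING_NUMBER.match(parts[i])
        if m:
            parts[i] = m.group("name")
            parts[i + 1] = "%s %s" % (m.group("number"), parts[i + 1])
    return parts
```

For the second address in the example this gives "THE BEECHES" and "1 BRANTWOOD" as separate components, and the "1 BRANTWOOD" part can then be compared directly with the first address.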
> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
The first is a flat, house name and street name, the second is a number
and a house name. Assuming that UK postal standards don't allow
more than one named building in a postal code, this is easily matchable
if you do a good job of normalization.
> Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
The issue here is to use the words "flat" and "the" to split the flat
name and the house name. Then the house number is in the wrong
part - it should go with the street name. See the comment above.
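As an illustration of splitting on "flat" and "the", here is a small sketch. The pattern and the function name are assumptions, it only handles the "<unit> FLAT [THE] <building>" shape seen in this example, and any trailing house number is left in the building part to be dealt with separately.

```python
import re

# "<unit> FLAT [THE] <building>", case-insensitive. The unit part is
# non-greedy so the split happens at the first standalone "flat".
FLAT_SPLIT = re.compile(
    r"^(?P<unit>.+?)\s+flat\s+(?:the\s+)?(?P<building>.+)$",
    re.IGNORECASE,
)


def split_unit_and_building(component):
    """Return (unit, building), or None if the pattern does not apply."""
    m = FLAT_SPLIT.match(component.strip())
    if m is None:
        return None
    return m.group("unit"), m.group("building")
```

A component like "Flat 2" returns None, because nothing precedes the word "flat"; that form needs its own rule.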
> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
This one may not be resolvable, unless there is only one house name
with "presbytery" in it within the postal code. Notice that "the" should
probably be dropped when normalizing.
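One way to express that rule, as a sketch only: drop "the", reduce each house name to a set of words, and accept a shorter name when exactly one candidate in the same postal code contains all of its words. The function names and the stop-word list are assumptions.

```python
import re

STOPWORDS = {"the"}


def name_tokens(name):
    """Lower-cased words of a house name, minus apostrophes and stop words."""
    words = re.findall(r"[a-z0-9']+", name.lower())
    tokens = {w.replace("'", "") for w in words}
    tokens.discard("")  # a stray apostrophe on its own leaves an empty token
    return tokens - STOPWORDS


def unique_superset_match(short_name, candidate_names):
    """Return the single candidate containing every word of short_name.

    candidate_names should be the house names within one postal code.
    Returns None when there is no such candidate or more than one.
    """
    wanted = name_tokens(short_name)
    if not wanted:
        return None
    hits = [c for c in candidate_names if wanted <= name_tokens(c)]
    return hits[0] if len(hits) == 1 else None
```

With two presbyteries in the same postal code this deliberately returns None, which is the "may not be resolvable" case.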
> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
Spelling correction needed.
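For this particular pair, removing spaces before comparing is enough; for real misspellings the standard library's difflib gives a similarity ratio. The sketch below combines the two. The function names are mine and the 0.9 threshold is an arbitrary assumption that would need tuning against real data to keep false positives out.

```python
import difflib


def squash(text):
    """Lower-case and keep only letters and digits (drops spaces too)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similar(a, b, threshold=0.9):
    """True if the squashed strings are at least `threshold` similar."""
    ratio = difflib.SequenceMatcher(None, squash(a), squash(b)).ratio()
    return ratio >= threshold
```

Here "White Sheet Hill" and "WHITESHEET HILL" squash to the same string, so they compare as identical.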
> The challenge is to fix some of the false negatives above without
> introducing false positives!
> Any pointers gratefully received.
If, on the other hand, this is a repeating problem that's simply going
to be an ongoing headache, I'd look into commercial address correction
software. Here in the US, there are a number of vendors that have
such software to correct addresses to the standards of the USPS.
They also have data bases of all the legitimate addresses in each
postal code. They're adjuncts of mass mailers, and they exist
because the USPS gives a mass mailing discount based on the
number of "good" addresses you give them.
I don't know what the situation is in the UK, but I'd be surprised
if there wasn't some available address data base, either commercial
or free, possibly as an adjunct of the postal service.
The latter, by the way, is probably the first place I'd look. The
postal service has a major interest in having addresses that they
can deliver without a lot of hassle.
Another place is Google. The first two pages using "Address
Matching software" gave two UK references, and several
> Andrew McLean