Fuzzy matching of postal addresses

James Keasley james.keasley at gmail.invalid
Mon Jan 17 19:43:24 EST 2005


On 2005-01-18, Andrew McLean <spam-trap-095 at at-andros.demon.co.uk> wrote:
> I have a problem that I suspect isn't unusual and I'm looking to see if 
> there is any code available to help. I've Googled without success.

I have done something very similar (well, near as dammit identical,
actually), unfortunately, I did this in Perl[1].

I can give you some general pointers on how to go about this 
type of thing, but can't actually provide any code, as it is at work.

> Basically, I have two databases containing lists of postal addresses and 
> need to look for matching addresses in the two databases. More 
> precisely, for each address in database A I want to find a single 
> matching address in database B.

My implementation had the exact same setup: database A 
(OS/RM addresspoint data) contained a metric shitload of addresses,
with the addresses to be matched against them coming from client-supplied
data.

> I'm 90% of the way there, in the sense that I have a simplistic approach 
> that matches 90% of the addresses in database A. But the extra cases 
> could be a pain to deal with!

There is no way to guarantee 100% accuracy if one (or both) of the 
datasets is collated from a non-authoritative source, i.e. the actual
homeowners. ;)

> It's probably not relevant, but I'm using ZODB to store the databases.

> The current approach is to loop over addresses in database A. I then 
> identify all addresses in database B that share the same postal code 
> (typically less than 50). The database has a mapping that lets me do 
> this efficiently. Then I look for 'good' matches. If there is exactly 
> one I declare a success. This isn't as efficient as it could be, it's 
> O(n^2) for each postcode, because I end up comparing all possible pairs. 
> But it's fast enough for my application.

OK, this is a good start. The first thing to do is to clean the data,
especially the postcodes.

Something along the lines of this (reconstructed from memory; I can't
remember the exact Python implementation I did):

  postcode = resultarray["postcode"]
  clean_postcode = ""
  for i, ch in enumerate(postcode):
      # a digit or a space is allowed in the fourth position
      if i == 3 and (ch.isdigit() or ch == " "):
          clean_postcode += ch
      # everything else is kept only if it is alphanumeric
      elif ch.isalnum():
          clean_postcode += ch

That puts all the postcodes into the format that the Royal Mail uses.

Then search on postcode ;)
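
That lookup step might be sketched like this (hypothetical names and
sample data throughout; in your setup the ZODB mapping would play the
part of the dict):

```python
from collections import defaultdict

def norm(pc):
    # minimal normalisation for lookup: uppercase, alphanumerics only
    return "".join(ch for ch in pc.upper() if ch.isalnum())

# stand-in for database B; yours is ZODB-backed
database_b = [
    {"postcode": "AB1 2CD", "address": "1 High Street"},
    {"postcode": "ab12cd",  "address": "2 High Street"},
    {"postcode": "XY9 8ZW", "address": "5 Low Road"},
]

# build the postcode -> addresses index once...
index = defaultdict(list)
for rec in database_b:
    index[norm(rec["postcode"])].append(rec)

# ...then each address from database A is only compared against the
# handful of candidates that share its cleaned postcode
candidates = index[norm("AB1 2CD")]
```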

The next thing I did was to split each individual word out from
its field, and matched them in a specific order (I found starting
with elements such as house name, number, and street name
was the best approach). 

If a word matched, it was assigned a score (1 for an exact match, 0.7 for
a metaphone match and 0.6 for a soundex match, IIRC), and when the searching
was finished I took the resulting score and divided it by the number of
word elements.
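
A sketch of that scoring scheme (my own reconstruction, not the actual
script: Soundex is implemented inline, and since metaphone needs a
third-party library, only the exact and Soundex scores appear here):

```python
def soundex(word):
    # classic four-character Soundex code, e.g. "Robert" -> "R163"
    codes = {c: str(d) for d, group in enumerate(
        ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1) for c in group}
    word = word.upper()
    out = word[0]
    prev = codes.get(word[0])
    for ch in word[1:]:
        code = codes.get(ch)
        if code is not None and code != prev:
            out += code
        if ch not in "HW":  # H and W do not separate same-coded letters
            prev = code
    return (out + "000")[:4]

def score(query_words, cand_words):
    # 1.0 for an exact word match, 0.6 for a Soundex match,
    # normalised by the number of words in the query
    total = 0.0
    for qw in query_words:
        if qw in cand_words:
            total += 1.0
        elif any(soundex(qw) == soundex(cw) for cw in cand_words):
            total += 0.6
    return total / len(query_words)
```

So "22 ACACIA AVENUE" against "22 AKACIA AVENUE" scores
(1.0 + 0.6 + 1.0) / 3, or about 0.87.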

If that score was higher than any of the previous scores, it was stored
in a given variable. If there were a number of equally good (bad?) matches,
they were appended onto an array, and if there was no clear winner
by the time the last of the records for that postcode had been processed,
it spat out a multiple-choice list.

The trick is picking a threshold level below which no matches are
put into the DB, even if they are the best scoring. (I used a threshold 
of 0.3.)
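
The best-so-far bookkeeping plus threshold might look like this (again
hypothetical; for brevity the scorer below credits exact word matches
only, where the real script also scored metaphone/soundex near-misses):

```python
THRESHOLD = 0.3  # nothing scoring below this ever reaches the DB

def word_score(query_words, cand_words):
    # exact word matches only, normalised by query length
    hits = sum(1.0 for w in query_words if w in cand_words)
    return hits / len(query_words)

def best_matches(query_words, candidates):
    # returns (best_score, winners): an empty list means nothing cleared
    # the threshold; several entries mean a multiple-choice list
    best, winners = 0.0, []
    for cand in candidates:
        s = word_score(query_words, cand.split())
        if s < THRESHOLD or s < best:
            continue
        if s > best:
            best, winners = s, [cand]
        else:
            winners.append(cand)
    return best, winners
```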

This can be refined: the current, extremely baroque, perl script 
drops certain values out of the data array if there is an exact 
match on certain fields, such as house name. It doesn't reduce the 
integer that the result is divided by, though, thus favouring results 
that return an exact match on a couple of given fields.

> The challenge is to fix some of the false negatives above without 
> introducing false positives!
>
> Any pointers gratefully received.

Hope this is a) understandable, and b) useful ;)

FWIW, the perl script (and I would expect a similarly implemented python
script to perform about as well), running a somewhat flaky set of
user-collated data against the Royal Mail Addresspoint data, managed
a 75% hit rate, with an additional 5% requiring user intervention,
and, as near as I have been able to ascertain, a <1% false positive 
count, from a dataset of nearly 17,000 addresses.

With cleaner and more up-to-date data I would expect the results
to be noticeably better.

[1] It is still my main language; I don't use python enough to
think in it as easily as I think in perl ;)

-- 
James					jamesk[at]homeric[dot]co[dot]uk

"Consistency is the last resort of the unimaginative."
 -- Bob Hope


