Fuzzy matching of postal addresses

Tue Jan 18 02:36:05 EST 2005

Ermmm ... only remove "the" when you are sure it is a whole word. Even
then it's a dodgy idea. In the first 1000 lines of the nearest address
file I had to hand, I found these: Catherine, Matthew, Rotherwood,
Weatherall (all with "the" buried inside them), and "The Avenue" (a
whole word, but strip it and the street name is just "Avenue").
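If you do go ahead with it anyway, a minimal sketch of the whole-word
version might look like the following. The function name strip_the is
made up for illustration; the only real machinery is the \b word-boundary
anchor in the re module. Note that it still mangles "The Avenue", which is
exactly why the idea is dodgy.

```python
import re

# \b matches a word boundary, so the "the" inside "Catherine" or
# "Rotherwood" is left alone; only a standalone "the" is matched.
THE_WORD = re.compile(r"\bthe\b", re.IGNORECASE)

def strip_the(address):
    """Remove standalone 'the' and collapse the whitespace left behind.

    Hypothetical helper, not the OP's code.
    """
    without = THE_WORD.sub(" ", address)
    return " ".join(without.split())
```

Something like strip_the("Catherine Weatherall") comes back unchanged,
while strip_the("The Avenue") comes back as plain "Avenue".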

Ermmm... don't rip out commas (or other punctuation); replace them with
spaces. That way "SHORTMOOR,BEAMINSTER" doesn't become one word
"SHORTMOORBEAMINSTER".
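A rough sketch of what I mean, assuming the address arrives as one
string. The name tokenise and the decision to upper-case everything are
my own choices, not anything from the OP's code.

```python
import re

# Anything that is neither a word character (letter, digit, underscore)
# nor whitespace is treated as punctuation here.
PUNCT = re.compile(r"[^\w\s]")

def tokenise(address):
    """Replace punctuation with spaces, then split on whitespace."""
    return PUNCT.sub(" ", address.upper()).split()
```

With this, tokenise("SHORTMOOR,BEAMINSTER") gives two words rather than
one. Substituting "" instead of " " is what produces the run-together
"SHORTMOORBEAMINSTER".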

A not-unreasonable similarity metric would be float(len(sa1 & sa2)) /
len(sa1 | sa2), i.e. the size of the intersection of the two word sets
divided by the size of their union (the Jaccard coefficient). Even more
reasonable would be to use trigrams instead of words -- more robust in
the presence of erroneous insertion or deletion of spaces (e.g. Short
Moor and Bea Minster are plausible variations), spelling errors and
typos. BTW, the OP's samples look astonishingly clean to me, quite
unlike real-world data.
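To make the words-versus-trigrams point concrete, here is a small sketch.
The helper names (jaccard, words, trigrams) are mine, and so is the
decision to throw away all whitespace before cutting trigrams; keeping the
spaces in is an equally defensible choice that gives slightly lower scores
when spaces have gone astray.

```python
def jaccard(a, b):
    """Size of intersection over size of union; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return float(len(a & b)) / len(union)

def words(text):
    """Set of upper-cased, whitespace-separated words."""
    return set(text.upper().split())

def trigrams(text):
    """Set of 3-character substrings, ignoring case and all whitespace.

    Dropping whitespace is an assumption made for this sketch.
    """
    squeezed = "".join(text.upper().split())
    return {squeezed[i:i + 3] for i in range(len(squeezed) - 2)}
```

Comparing "Short Moor Bea Minster" with "Shortmoor Beaminster", the word
sets share nothing, so the word-based score is 0.0; the trigram sets are
identical, so the trigram score is 1.0. A one-letter typo such as
"SHORTMOR" for "SHORTMOOR" likewise scores zero on words but keeps most of
its trigrams.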

Two general comments addressed to the OP:
(1) Your solution doesn't handle the case where the postal code has
been butchered, e.g. "DT8 BEL" or "OT8 3EL".
(2) I endorse John Roth's comments. Validation against an address
database that is provided by the postal authority, using either an
out-sourced bureau service or licensed
validation/standardisation/repair software, is IMHO the way to go. In
Australia the postal authority assigns a unique ID to each delivery
point. This "DPID" has to be barcoded onto the mail article to get bulk
postage discounts. Storing the DPID on your database makes duplicate
detection a snap. You can license s/w (from several vendors) that is
certified by the postal authority and has batch and/or online APIs. I
believe the situation in the UK is similar. At least one of the vendors
in Australia is a British company. Google "address deduplication
site:.uk".
Actually (3): If you are constrained by budget, pointy-haired boss or
hubris to write your own software:
(a) lots of luck;
(b) you need to do a bit more research -- look at the links on the
febrl website; Google for "Monge Elkan", read their initial paper, and
look at the papers referencing that on citeseer; also Google for "merge
purge" and for "record linkage" (what the statistical and medical
fraternity call the problem);
(c) have a damn good look at your data [like I said, it looks too
clean to be true]; and
(d) when you add a nice new wrinkle like "strip out 'the'", do make
sure to run your regression tests :-)
Would you believe (4): you are talking about cross-matching two
databases -- don't forget the possibility of duplicates _within_ each
database.
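
To illustrate points (2) and (4) together: once every record carries a
delivery-point ID, finding duplicates is just a group-by. The sketch below
assumes records are dicts with a "dpid" field; that field name, the
function name and the record layout are all hypothetical, not any vendor's
API.

```python
from collections import defaultdict

def find_duplicates(records, key="dpid"):
    """Group records by key; return only groups with more than one record.

    'records' is assumed to be an iterable of dicts (hypothetical layout).
    """
    groups = defaultdict(list)
    for rec in records:
        k = rec.get(key)
        if k is not None:  # records without an ID can't be matched this way
            groups[k].append(rec)
    return {k: recs for k, recs in groups.items() if len(recs) > 1}
```

Run it on each database separately first to catch the internal
duplicates, then on the two concatenated to get the cross-matches.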

HTH, 
John