Fuzzy matching of postal addresses
spam-trap-095 at at-andros.demon.co.uk
Tue Jan 18 22:26:40 CET 2005
Thanks for all the suggestions. There were some really useful pointers.
A few random points:
1. Spending money is not an option; this is a 'volunteer' project. I'll
try out some of the ideas over the weekend.
2. Someone commented that the data was suspiciously good quality. Both
data sources are ones that you might expect to be authoritative. Taking
a correctly formatted and valid postcode as the metric, 100% of the
records in one database pass, and 99.96% in the other.
3. I've already noticed duplicate addresses in one of the databases.
4. You need to be careful doing an endswith search. It was actually my
first approach to the house name issue. The problem is you end up
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
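To make point 4 concrete, here is a minimal sketch of the pitfall and one
possible guard. The town name and the function names are made up for
illustration, and the boundary check is only an assumption about what a
fix might look like, not something I have tested on the real data.

```python
def naive_match(short, full):
    # Plain suffix test: true whenever `full` ends with `short`.
    return full.endswith(short)


def boundary_match(short, full):
    # Accept the suffix only if it starts at the very beginning of `full`
    # or directly after a space or comma, so "2 ..." cannot match inside
    # "12 ...". (Hypothetical guard, for illustration only.)
    if not full.endswith(short):
        return False
    prefix = full[:len(full) - len(short)]
    return prefix == "" or prefix[-1] in " ,"


plain = "2 Acacia Avenue, Anytown"                   # made-up address
other = "12 Acacia Avenue, Anytown"                  # a different house
named = "Rose Cottage, 2 Acacia Avenue, Anytown"     # same house, with a name

print(naive_match(plain, other))     # the false positive described above
print(boundary_match(plain, other))
print(boundary_match(plain, named))
```

The naive test accepts the "12"/"2" pair, while the boundary version
rejects it and still accepts the house-name case.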
I am tempted to try an approach based on splitting each address into a
sequence of normalised tokens, then working with a metric based on the
differences between the two sequences. The simple case would look at
deleting tokens, and perhaps concatenating adjacent tokens, to make a
match.
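A rough sketch of what that token idea might look like, using difflib
from the standard library to compare the two token sequences. The
abbreviation table, the function names and the cost rule (one point per
unmatched token, nothing for a run of tokens that concatenates to the
same string) are all assumptions for illustration, not a worked-out
design.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be much longer and
# would have to deal with ambiguous cases such as "st" for "saint".
ABBREVIATIONS = {"ave": "avenue", "rd": "road", "st": "street"}


def tokens(address):
    # Normalise: lower-case, drop punctuation, expand known abbreviations.
    words = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(w, w) for w in words]


def token_distance(a, b):
    # Number of tokens that differ between the two sequences.
    ta, tb = tokens(a), tokens(b)
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    cost = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left, right = ta[i1:i2], tb[j1:j2]
        # Differing runs that concatenate to the same string are treated
        # as a match, e.g. ["sea", "view"] against ["seaview"].
        if left and right and "".join(left) == "".join(right):
            continue
        cost += len(left) + len(right)
    return cost
```

With this, "Sea View, High Street" against "Seaview, High St" scores 0,
"Rose Cottage, 2 Acacia Ave" against "2 Acacia Avenue" scores 2 (the two
house-name tokens are deleted), and "2 Acacia Avenue" against
"12 Acacia Avenue" also scores 2, so the house number no longer matches
by accident. Telling the last two cases apart would need some weighting,
for example treating numeric tokens as more significant than words.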