Fuzzy matching of postal addresses
skip at pobox.com
Tue Jan 18 04:11:36 CET 2005
Andrew> I'm 90% of the way there, in the sense that I have a simplistic
Andrew> approach that matches 90% of the addresses in database A. But
Andrew> the extra cases could be a pain to deal with!
Based upon the examples you gave, here are a couple of things you might try to
reduce the number of difficult comparisons:
* Remove "the" and commas as part of your normalization process
* Split each address on white space and convert the resulting list to a
set, then consider the size of the intersection with other addresses
with the same postal code:
>>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
>>> a1
"ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
>>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
>>> a2
'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
>>> a1 == a2
False
>>> sa1 = set(a1.split())
>>> sa2 = set(a2.split())
>>> len(sa1 & sa2)
6
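
Putting the two suggestions together, something along these lines might work as a starting point. It is only a rough sketch: the function names (normalize, postcode, best_match) and the postcode regular expression are my own inventions for illustration, and the pattern assumes UK-style postcodes at the end of each address, as in your examples. It also drops "THE" as a whole word rather than with replace("THE ", ""), since the latter would also eat the tail of a word such as "BLYTHE".

```python
import re

# Assumption: each address ends in a UK-style postcode such as "DT8 3EL".
POSTCODE_RE = re.compile(r"([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$")


def normalize(address):
    """Upper-case the address, drop commas, and remove the word THE."""
    words = address.upper().replace(",", " ").split()
    return " ".join(w for w in words if w != "THE")


def postcode(normalized):
    """Return the trailing postcode, or None if the pattern does not match."""
    m = POSTCODE_RE.search(normalized)
    return m.group(1) + " " + m.group(2) if m else None


def best_match(address, candidates):
    """Return (score, candidate) for the candidate that shares the most
    words with `address`, looking only at candidates whose postcode is
    the same. Returns (0, None) if no candidate qualifies."""
    norm = normalize(address)
    words = set(norm.split())
    code = postcode(norm)
    best = (0, None)
    for cand in candidates:
        cnorm = normalize(cand)
        if postcode(cnorm) != code:
            continue  # different postal code, not worth comparing
        score = len(words & set(cnorm.split()))
        if score > best[0]:
            best = (score, cand)
    return best
```

You would still want to pick a threshold on the score (or on the score relative to the number of words) below which a match gets flagged for manual review.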