Fuzzy matching of postal addresses [1/1]

Sun Jan 23 16:59:31 EST 2005

Andrew McLean wrote:
> In case anyone is interested, here is the latest.

> def insCost(tokenList, indx, pos):
>     """The cost of inserting a specific token at a specific
normalised position along the sequence."""
>     if containsNumber(tokenList[indx]):
>         return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
>     elif indx > 0 and containsNumber(tokenList[indx-1]):
>         return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 -
pos)
>     elif tokenList[indx][0] in minorTokenList:
>         return INSERT_MINOR_TOKEN
>     else:
>         return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)
>
> def delCost(tokenList, indx, pos):
>     """The cost of deleting a specific token at a specific normalised
position along the sequence.
>     This is exactly the same cost as inserting a token."""
>     return insCost(tokenList, indx, pos)

Functions are first-class citizens of Pythonia -- so just do this:

delCost = insCost

Re speed generally: (1) How many addresses in each list and how long is
it taking? On what sort of configuration? (2) Have you considered using
pysco -- if not running on x86 architecture, consider exporting your
files to a grunty PC and doing the match there. (3) Have you considered
some relatively fast filter to pre-qualify pairs of addresses before
you pass the pair to your relatively slow routine?

Soundex?? To put it bluntly, the _only_ problem to which soundex is the
preferred solution is genealogy searching in the US census records, and
even then one needs to know what varieties of the algorithm were in use
at what times. I thought you said your addresses came from
authoritative sources. You have phonetic errors? Can you give some
examples of pairs of tokens that illustrate the problem you are trying
to overcome with soundex?

Back to speed again: When you look carefully at the dynamic programming
algorithm for edit distance, you will note that it is _not_ necessary
to instantiate the whole NxM matrix -- it only ever refers to the
current row and the previous row. What does space saving have to do
with speed, you ask? Well, Python is not FORTRAN; it takes considerable
effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
and swap (the pointers to) them each time around the outer loop. At the
expense of a little more complexity, one can reduce this to one row and
3 variables (north, northwest, and west) corresponding to d[i-1][j],
d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
Hope some of this helps,

John