Fuzzy matching of postal addresses [1/1]

Andrew McLean spam-trap-095 at at-andros.demon.co.uk
Sun Jan 23 19:45:34 EST 2005


In article <1106517571.768306.23430 at f14g2000cwb.googlegroups.com>, John 
Machin <sjmachin at lexicon.net> writes
>Andrew McLean wrote:
>> In case anyone is interested, here is the latest.
>
>> def insCost(tokenList, indx, pos):
>>     """The cost of inserting a specific token at a specific
>>     normalised position along the sequence."""
>>     if containsNumber(tokenList[indx]):
>>         return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
>>     elif indx > 0 and containsNumber(tokenList[indx-1]):
>>         return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos)
>>     elif tokenList[indx][0] in minorTokenList:
>>         return INSERT_MINOR_TOKEN
>>     else:
>>         return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)
>>
>> def delCost(tokenList, indx, pos):
>>     """The cost of deleting a specific token at a specific
>>     normalised position along the sequence.
>>     This is exactly the same cost as inserting a token."""
>>     return insCost(tokenList, indx, pos)
>
>Functions are first-class citizens of Pythonia -- so just do this:
>
>delCost = insCost
>

Actually, the code used to look like that. I think I changed it so that 
it would look clearer. But perhaps that was a bad idea.

>Re speed generally: (1) How many addresses in each list and how long is
>it taking? On what sort of configuration? (2) Have you considered using
>psyco -- if not running on x86 architecture, consider exporting your
>files to a grunty PC and doing the match there. (3) Have you considered
>some relatively fast filter to pre-qualify pairs of addresses before
>you pass the pair to your relatively slow routine?

There are approx. 50,000 addresses in each list.

At the moment the processing assumes all the postcodes are correct, and 
only compares addresses with matching postcodes. This makes it a lot 
faster, but it may miss cases where the postcodes themselves are mismatched.
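To give a rough idea, the blocking step amounts to something like this 
(a simplified sketch rather than the actual code; the .postcode 
attribute and compare() are just placeholder names):

    def groupByPostcode(addressList):
        """Index addresses by postcode so that only records sharing
        a postcode are ever compared."""
        groups = {}
        for addr in addressList:
            groups.setdefault(addr.postcode, []).append(addr)
        return groups

    # groupsA = groupByPostcode(listA)
    # groupsB = groupByPostcode(listB)
    # for pc in groupsA:
    #     for a in groupsA[pc]:
    #         for b in groupsB.get(pc, []):
    #             compare(a, b)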

Also it does two passes. The first looks for exact matches of token 
sequences, which deals with about half the cases. Only then do I employ 
the more expensive edit-distance technique.
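In outline the two passes look something like this (again a sketch 
rather than the real code; tokenise() is a placeholder, and editDistance() 
is the routine posted earlier):

    def matchAddresses(listA, listB):
        # First pass: exact token-sequence matches via a dictionary lookup.
        exactIndex = {}
        for b in listB:
            exactIndex.setdefault(tuple(tokenise(b)), []).append(b)
        matches, unmatched = [], []
        for a in listA:
            hits = exactIndex.get(tuple(tokenise(a)))
            if hits:
                matches.append((a, hits[0]))
            else:
                unmatched.append(a)
        # Second pass: the expensive edit-distance comparison on the rest.
        # (In reality only addresses sharing a postcode are candidates,
        # which is why so few editDistance calls are needed per address.)
        for a in unmatched:
            best, bestScore = None, None
            for b in listB:
                score = editDistance(tokenise(a), tokenise(b))
                if bestScore is None or score < bestScore:
                    best, bestScore = b, score
            matches.append((a, best))
        return matches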

Overall, the program runs in less than half an hour. Specifically, it 
takes about 60s per thousand addresses, with an average of about 8 calls 
to editDistance per address. Psyco.full() reduced the 60s to 45s.
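For what it's worth, enabling it was just the standard two lines:

    import psyco
    psyco.full()   # compile every function Psyco can handle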

I'll only try optimisation if I need to use it much more.

>Soundex?? To put it bluntly, the _only_ problem to which soundex is the
>preferred solution is genealogy searching in the US census records, and
>even then one needs to know what varieties of the algorithm were in use
>at what times. I thought you said your addresses came from
>authoritative sources. You have phonetic errors? Can you give some
>examples of pairs of tokens that illustrate the problem you are trying
>to overcome with soundex?

In retrospect, Soundex probably isn't the best choice. The misspellings 
tend to be minor, e.g.
Kitwhistle and KITTWHISTLE
Tythe and TITHE

I was tempted by an edit-distance technique on the tokens, but would 
prefer a hash-based method for efficiency reasons.
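For example, something as crude as lower-casing, treating y as i, and 
collapsing doubled letters would hash both pairs above to the same key 
(purely an illustration; it isn't what I'm actually doing):

    import re

    def fuzzyKey(token):
        """A crude hash key for slightly misspelled tokens."""
        s = token.lower().replace('y', 'i')
        return re.sub(r'(.)\1+', r'\1', s)   # collapse repeated letters

    # fuzzyKey('Kitwhistle') == fuzzyKey('KITTWHISTLE') == 'kitwhistle'
    # fuzzyKey('Tythe')      == fuzzyKey('TITHE')       == 'tithe'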

>Back to speed again: When you look carefully at the dynamic programming
>algorithm for edit distance, you will note that it is _not_ necessary
>to instantiate the whole NxM matrix -- it only ever refers to the
>current row and the previous row. What does space saving have to do
>with speed, you ask? Well, Python is not FORTRAN; it takes considerable
>effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
>and swap (the pointers to) them each time around the outer loop. At the
>expense of a little more complexity, one can reduce this to one row and
>3 variables (north, northwest, and west) corresponding to d[i-1][j],
>d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
>Hope some of this helps,

Thanks for that. The first edit distance algorithm I looked at did it 
that way, but I based my code on a version of the algorithm I could 
actually follow ;-). Getting the concatenation bit right was non-trivial, 
and it was useful to store all of "d" for debugging purposes.
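For reference, the two-row version of the plain algorithm comes out 
something like this (a minimal sketch with unit costs; it leaves out the 
position-dependent costs and the concatenation handling that complicate 
my real routine):

    def editDistance2(seqA, seqB):
        """Levenshtein distance between two token sequences, keeping
        only the previous and current rows of the matrix."""
        prev = list(range(len(seqB) + 1))
        for i in range(1, len(seqA) + 1):
            curr = [i] + [0] * len(seqB)
            for j in range(1, len(seqB) + 1):
                subst = prev[j-1] + (seqA[i-1] != seqB[j-1])
                curr[j] = min(prev[j] + 1,    # deletion
                              curr[j-1] + 1,  # insertion
                              subst)          # substitution or match
            prev = curr                       # swap rows
        return prev[-1]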

As to Python not being Fortran: you've found me out. The three languages 
I am most comfortable with are Fortran, Matlab and Python. It did occur 
to me that numarray might be a more efficient way of dealing with a 
4-dimensional array, but the arrays aren't very big, so the overhead in 
setting them up might be significant.

The simplest optimisation would be to replace the two indices used to 
deal with concatenation with four explicit variables. And then, as you 
said, I could just store the last three rows and avoid any multiple 
indexing.

As with all these potential optimisations, you don't know until you try 
them.

-- 
Andrew McLean


