Fuzzy matching of postal addresses [1/1]
spam-trap-095 at at-andros.demon.co.uk
Mon Jan 24 01:45:34 CET 2005
In article <1106517571.768306.23430 at f14g2000cwb.googlegroups.com>, John
Machin <sjmachin at lexicon.net> writes
>Andrew McLean wrote:
>> In case anyone is interested, here is the latest.
>> def insCost(tokenList, indx, pos):
>> """The cost of inserting a specific token at a specific
>normalised position along the sequence."""
>> if containsNumber(tokenList[indx]):
>> return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
>> elif indx > 0 and containsNumber(tokenList[indx-1]):
>> return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 -
>> elif tokenList[indx] in minorTokenList:
>> return INSERT_MINOR_TOKEN
>> return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)
>> def delCost(tokenList, indx, pos):
>> """The cost of deleting a specific token at a specific normalised
>position along the sequence.
>> This is exactly the same cost as inserting a token."""
>> return insCost(tokenList, indx, pos)
>Functions are first-class citizens of Pythonia -- so just do this:
>delCost = insCost
Actually, the code used to look like that. I think I changed it so that
it would look clearer. But perhaps that was a bad idea.
>Re speed generally: (1) How many addresses in each list and how long is
>it taking? On what sort of configuration? (2) Have you considered using
>pysco -- if not running on x86 architecture, consider exporting your
>files to a grunty PC and doing the match there. (3) Have you considered
>some relatively fast filter to pre-qualify pairs of addresses before
>you pass the pair to your relatively slow routine?
There are approx. 50,000 addresses in each list.
At the moment the processing assumes all the postcodes are correct, and
only compares addresses with matching postcodes. This makes it a lot
faster. But may miss some cases of mismatched postcodes.
Also it does two passes. One looking for exact matches of token
sequences. This deals with about half the cases. Only then do I employ
the, more expensive, edit distance technique.
Overall, the program runs in less than half an hour. Specifically it
takes about 60s per thousand addresses, which requires an average of
about 8 calls to editDistance per address. Psyco.full() reduced the 60s
I'll only try optimisation if I need to use it much more.
>Soundex?? To put it bluntly, the _only_ problem to which soundex is the
>preferred solution is genealogy searching in the US census records, and
>even then one needs to know what varieties of the algorithm were in use
>at what times. I thought you said your addresses came from
>authoritative sources. You have phonetic errors? Can you give some
>examples of pairs of tokens that illustrate the problem you are trying
>to overcome with soundex?
I'm sure that in retrospect Soundex might not be a good choice. The
misspellings tend to be minor, e.g.
Kitwhistle and KITTWHISTLE
Tythe and TITHE
I was tempted by an edit distance technique on the tokens, but would
prefer a hash based method for efficiency reasons.
>Back to speed again: When you look carefully at the dynamic programming
>algorithm for edit distance, you will note that it is _not_ necessary
>to instantiate the whole NxM matrix -- it only ever refers to the
>current row and the previous row. What does space saving have to do
>with speed, you ask? Well, Python is not FORTRAN; it takes considerable
>effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
>and swap (the pointers to) them each time around the outer loop. At the
>expense of a little more complexity, one can reduce this to one row and
>3 variables (north, northwest, and west) corresponding to d[i-1][j],
>d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
>Hope some of this helps,
Thanks for that. The first edit distance algorithm I looked at did it
that way. But I based my code on a version of the algorithm I could
actually follow ;-). Getting the concatenation bit right was non-trivial
and it was useful to store all of "d" for debugging purposes.
As to Python not being Fortran. You've found me out. The three languages
I am most comfortable with are Fortran, Matlab and Python. It did occur
to me that numarray might be a more efficient way of dealing with a
4-dimensional array, but the arrays aren't very big, so the overhead in
setting them up might be significant.
The simplest optimisation would be to replace the two indices used to
deal with concatenation by four explicit variables. And then, as you
said, I could just store the three last rows, and avoid any multiple
As with all these potential optimisations, you don't know until you try
More information about the Python-list