Usable street address parser in Python?

Albert van der Horst albert at spenarnc.xs4all.nl
Wed Apr 21 21:47:09 CEST 2010


In article <4bcddc5a$0$1630$742ec2ed at news.sonic.net>,
John Nagle  <nagle at animats.com> wrote:
>Iain King wrote:
>> Not sure on the volume of addresses you're working with, but as an
>> alternative you could try grabbing the zip code, looking up all
>> addresses in that zip code, and then finding whatever one of those
>> address strings most closely resembles your address string (smallest
>> Levenshtein distance?).
>
>    The parser doesn't have to be perfect, but it should
>reliably report when it fails.  Then I can run the hard cases through
>one of the commercial online address standardizers.  I'd like to
>be able to knock off the easy cases cheaply.

In a similar situation (analysing assembler code sequences for their
stack effect) I did the exact reverse.
I made a list of all exceptions and checked against that first.
If an input is not an exception, the rule should apply.
If it doesn't, call Houston.
(Of course one starts by making the input canonical: all upper case,
maybe reordering, etc.)
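
A minimal sketch of that order of checks, applied to addresses rather
than assembler. The exception table, the single regular-expression rule
and the names canonical() and parse() are all made up for illustration;
a real rule set would be much larger.

```python
import re

# Hypothetical exception table: canonical input -> known-good answer.
EXCEPTIONS = {
    "ONE MAIN PLAZA": ("1", "MAIN"),
}

# Hypothetical general rule: leading digits, one space, then the rest.
RULE = re.compile(r"^(\d+) (.+)$")


def canonical(text):
    """Upper-case the input and collapse runs of whitespace."""
    return " ".join(text.upper().split())


def parse(text):
    """Return (number, street); raise ValueError if nothing applies."""
    key = canonical(text)
    if key in EXCEPTIONS:            # exceptions are checked first
        return EXCEPTIONS[key]
    match = RULE.match(key)
    if match:                        # otherwise the rule should apply
        return match.group(1), match.group(2)
    raise ValueError("cannot parse: %r" % (text,))   # "call Houston"
```

Raising an exception instead of guessing is what keeps the failure
visible, so the hard cases can be handed to something else.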

>
>    What I want to do is to first extract the street number and
>undecorated street name only, match that to a large database of US businesses
>stored in MySQL, and then find the best match from the database
>hits.  So I need reliable extraction of undecorated street name and number.  The
>other fields are less important.

This kind of problem remains very tricky ...
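
To show where it gets tricky, here is a deliberately conservative
sketch of the extraction step. The word lists and the function name
extract() are assumptions and far from complete. Anything that does not
look like "number, optional direction, name, street type" is refused
with None, so that it can go to the expensive standardizer.

```python
import re

# Assumed, incomplete word lists; real data needs many more entries.
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
              "NORTH", "SOUTH", "EAST", "WEST"}
SUFFIXES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
            "BLVD", "BOULEVARD", "DR", "DRIVE", "LN", "LANE"}


def extract(address):
    """Return (number, undecorated_name), or None when unsure."""
    # Canonicalize: upper case, dots and commas become spaces.
    words = re.sub(r"[.,]", " ", address.upper()).split()
    # Require a numeric house number, a name, and a known street type.
    if len(words) < 3 or not words[0].isdigit():
        return None
    if words[-1] not in SUFFIXES:
        return None
    number, name = words[0], words[1:-1]
    # Drop a leading direction only if a name is left afterwards,
    # so that "45 West St" keeps WEST as the street name.
    if len(name) > 1 and name[0] in DIRECTIONS:
        name = name[1:]
    return number, " ".join(name)
```

With these lists, "123 N. Main St." gives ("123", "MAIN"), while
"12 Broadway" and "100 Main St Suite 5" are refused rather than guessed.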

At least in the Netherlands we have a book that specifies how a street
name should officially be spelled using a limited number of
characters.

>
>                               John Nagle

Groetjes Albert

-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert at spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst



