Usable street address parser in Python?

John Roth johnroth1 at gmail.com
Sun Apr 18 17:08:17 EDT 2010


On Apr 17, 1:23 pm, John Nagle <na... at animats.com> wrote:
>    Is there a usable street address parser available?  There are some
> bad ones out there, but nothing good that I've found other than commercial
> products with large databases.  I don't need 100% accuracy, but I'd like
> to be able to extract street name and street number for at least 98% of
> US mailing addresses.
>
>    There's pyparsing, of course. There's a street address parser as an
> example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
> It's not very good.  It gets all of the following wrong:
>
>         1500 Deer Creek Lane    (Parses "Creek" as a street type)
>         186 Avenue A            (NYC street)
>         2081 N Webb Rd          (Parses N Webb as a street name)
>         2081 N. Webb Rd         (Parses N as street name)
>         1515 West 22nd Street   (Parses "West" as name)
>         2029 Stierlin Court     (Street names starting with "St" misparse.)
>
> Some special cases that don't work, unsurprisingly.
>         P.O. Box 33170
>         The Landmark @ One Market, Suite 200
>         One Market, Suite 200
>         One Market
>
> Much of the problem is that this parser starts at the beginning of the string.
> US street addresses are best parsed from the end, says the USPS.  That's why
> things like "Deer Creek Lane" are mis-parsed.  It's not clear that regular
> expressions are the right tool for this job.
>
> There must be something out there a little better than this.
>
>                                         John Nagle

You have my sympathy. I used to work on the address parser module at
Trans Union, and I've never seen another piece of code that had as
many special cases, odd rules and stuff that absolutely didn't make
any sense until one of the old hands showed you the situation it was
supposed to handle.

And most of the files we processed were supposed to be up to USPS
mass mailing standards.

When the USPS says that addresses are best parsed from the end, they
aren't talking about the street address; they're talking about the
address as a whole, where it's easiest to look for the ZIP code first,
then the state, and so on. The best approach I know of for the street
address is simply to tokenize it and then do some pattern
matching. Trying to use any kind of deterministic parser is going to
fail big time.
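
To make the "from the end" point concrete, here is a rough sketch of peeling the ZIP code and then the state off the end of a whole address. The function name split_from_end and the tiny STATES set are made up for illustration; this is not anything the USPS publishes, and a real version would need the full list of state abbreviations.

```python
import re

# Hypothetical, deliberately tiny state list (assumption for this sketch);
# a real one would hold every USPS state and territory abbreviation.
STATES = {"CA", "IL", "NY", "TX"}

# Five-digit ZIP code with an optional +4 extension.
ZIP_RE = re.compile(r"^\d{5}(?:-\d{4})?$")

def split_from_end(address):
    """Peel the ZIP code, then the state, off the end of an address.

    Returns (rest, state, zip_code); state and zip_code are None when absent.
    """
    tokens = address.replace(",", " ").split()
    zip_code = state = None
    if tokens and ZIP_RE.match(tokens[-1]):
        zip_code = tokens.pop()
    if tokens and tokens[-1].upper() in STATES:
        state = tokens.pop().upper()
    return " ".join(tokens), state, zip_code
```

Whatever is left over (street line plus city) still has to be split by other means; this only shows why working backwards is the easy direction for the address as a whole.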

IMO, 98% is way too high a target for any module except one that's been
given a lot of love by a company that does this as part of their core
business. There's a reason why commercial products come with huge
databases -- it's impossible to parse everything correctly with a single
set of rules. Those databases also contain the actual street names
and address ranges by ZIP code, so that direct marketing files can be
cleansed to USPS standards.

That said, I don't see any reason why any of the examples in your
first group should be misparsed by a competent parser.
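
As a minimal sketch of the tokenize-then-pattern-match idea (my own toy illustration, not the Trans Union code and not the pyparsing example), each token gets a one-letter tag and a single pattern is matched against the tag string. The names DIRECTIONS, STREET_TYPES, tag and parse_street are hypothetical, and the word lists are far shorter than the real USPS ones.

```python
import re

# Hypothetical, deliberately small vocabularies (assumptions for this sketch);
# the real USPS directional and suffix lists are much longer.
DIRECTIONS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"AVENUE", "AVE", "COURT", "CT", "CREEK", "LANE", "LN",
                "RD", "ROAD", "ST", "STREET"}

# One tag letter per token: N=number, D=direction, T=street type, W=other word.
# The name group is lazy, so only the final token can be taken as the type.
PATTERN = re.compile(r"^(N)(D)?(.+?)(T)?$")

def tag(token):
    """Classify a single token as N, D, T or W."""
    word = token.rstrip(".").upper()
    if word.isdigit():
        return "N"
    if word in DIRECTIONS:
        return "D"
    if word in STREET_TYPES:
        return "T"
    return "W"

def parse_street(line):
    """Return a dict of address parts, or None if the pattern does not fit."""
    tokens = line.split()
    tags = "".join(tag(t) for t in tokens)
    m = PATTERN.match(tags)
    if m is None:
        return None

    def pick(group):
        # Tag positions map one-to-one onto token positions.
        start, end = m.span(group)
        return " ".join(t.rstrip(".") for t in tokens[start:end]) or None

    return {"number": pick(1), "predir": pick(2),
            "name": pick(3), "type": pick(4)}
```

With these toy lists, "1500 Deer Creek Lane" comes out with name "Deer Creek" and type "Lane", and "186 Avenue A" keeps "Avenue A" as the name. Anything that doesn't start with a number, such as "P.O. Box 33170" or "One Market", returns None and would need its own pattern.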

Sorry I don't have any real help for you.

John Roth


