Usable street address parser in Python?

John Yeung gallium.arsenide at gmail.com
Tue Apr 20 03:24:07 EDT 2010


My response is similar to John Roth's.  It's mainly just sympathy. ;)

I deal with addresses a lot, and I know that a really good parser is
both rare/expensive to find and difficult to write yourself.  We have
commercial, USPS-certified products where I work, and even with those
I've written a good deal of pre-processing and post-processing code,
consisting almost entirely of very silly-looking fixes for special
cases.

I don't have any experience whatsoever with pyparsing, but I will say
I agree that you should try to get the street type from the end of the
line.  Just be aware that it can be valid to leave off the street type
completely (think "100 Broadway").  And of course it's a plus if you
can handle suites that are on the same line as the street (which is
where the USPS prefers them to be).
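
To make the "work from the end of the line" idea concrete, here is a
minimal sketch in plain Python (no pyparsing).  Everything in it is an
assumption for illustration: the function name split_street_line, the
tiny STREET_TYPES and UNIT_WORDS tables, and the output dict are all
made up, and the real USPS lists are far longer.  It peels a trailing
suite off first, then takes the street type from the end only if one
is actually there.

```python
# Hypothetical, deliberately tiny lookup tables (not the USPS lists).
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
                "BLVD", "DR", "DRIVE", "LN", "LANE"}
UNIT_WORDS = {"STE", "SUITE", "APT", "UNIT"}


def split_street_line(line):
    """Split one street line into number, name, type and unit.

    A rough sketch: it works from the END of the line and treats
    both the street type and the unit as optional.
    """
    # Normalize: drop periods, treat commas as spaces, upper-case.
    tokens = line.replace(",", " ").replace(".", "").upper().split()
    result = {"number": None, "name": None, "type": None, "unit": None}

    # 1. Peel a trailing unit ("SUITE 200") off the end, if present.
    #    Index 0 is skipped so the first token is never taken as a unit.
    for i in range(len(tokens) - 1, 0, -1):
        if tokens[i] in UNIT_WORDS:
            result["unit"] = " ".join(tokens[i:])
            tokens = tokens[:i]
            break

    # 2. A leading token that starts with a digit is the house number.
    if tokens and tokens[0][0].isdigit():
        result["number"] = tokens.pop(0)

    # 3. Take the street type from the end, but only if something is
    #    left over to serve as the name ("100 Broadway" has no type).
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        result["type"] = tokens.pop()

    result["name"] = " ".join(tokens) or None
    return result
```

Calling split_street_line("123 N. Main St. Suite 200") gives a number
of "123", a name of "N MAIN", a type of "ST" and a unit of "SUITE 200",
while "100 Broadway" comes back with no type at all.  It falls over on
plenty of real addresses, which is rather the point of this thread.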

I would take the approach which John R. seems to be suggesting, which
is to tokenize and then write a whole bunch of very hairy, special-
case-laden logic. ;)  I'm almost positive this is what all the
commercial packages are doing, and I have a tough time imagining what
else you could do.  Addresses inherently have a high degree of
irregularity.
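
A rough sketch of what "tokenize, then pile on special cases" can look
like in Python follows.  I have no idea what the commercial packages
really do internally, so treat this purely as an illustration: the
function names, the RULES list and the two fixes shown are all
hypothetical.  The only structure being suggested is a tokenizer
followed by an ordered list of small fix-up functions, each handling
one oddity.

```python
import re

# Hypothetical short list of unit words, for illustration only.
UNIT_WORDS = {"APT", "STE", "SUITE", "UNIT"}


def tokenize(line):
    """Upper-case the line and split it into word-like tokens.

    Periods and commas are dropped; '#' is kept as its own token.
    """
    return re.findall(r"#|[A-Z0-9/\-]+", line.upper())


def fix_po_box(tokens):
    """'P O BOX 12' or 'POST OFFICE BOX 12' -> 'PO BOX 12'."""
    for prefix in (["P", "O", "BOX"], ["POST", "OFFICE", "BOX"]):
        if tokens[:3] == prefix:
            return ["PO", "BOX"] + tokens[3:]
    return tokens


def drop_redundant_pound(tokens):
    """'APT # 5' -> 'APT 5' (the '#' adds nothing after a unit word)."""
    out = []
    for tok in tokens:
        if tok == "#" and out and out[-1] in UNIT_WORDS:
            continue
        out.append(tok)
    return out


# Order matters: each rule sees the output of the one before it.
RULES = [fix_po_box, drop_redundant_pound]


def preprocess(line):
    """Tokenize a line, then run every special-case rule in order."""
    tokens = tokenize(line)
    for rule in RULES:
        tokens = rule(tokens)
    return tokens
```

With this, preprocess("P.O. Box 12") yields ['PO', 'BOX', '12'] and
preprocess("55 Elm St, Apt #5") yields ['55', 'ELM', 'ST', 'APT', '5'].
In practice the RULES list just keeps growing as new oddities turn up
in the data.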

Good luck!

John Y.


