Stuck on a three word street name regex

Brian D briandenzer at gmail.com
Thu Jan 28 04:39:44 CET 2010


On Jan 27, 6:35 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> Brian D <brianden... at gmail.com> writes:
> > I've tackled this kind of problem before by looping through a patterns
> > dictionary, but there must be a smarter approach.
> >
> > Two addresses. Note that the first has incorrectly transposed the
> > direction and street name. ....
>
> If you're really serious about it (e.g. you are the post office trying
> to program automatic mail sorting machines) there is no simple regex
> trick anything like what you want.  A lot of addresses will be
> ambiguous.  You have to use all the info you have about your entire address
> corpus (e.g. you need a complete street directory of the whole US) and
> do a bunch of Bayesian inference.  As a very simple example, for an
> address like "1000 RAMPART S ST" you'd use the zip code to identify the
> address's geographic neighborhood, and then use your street directory to
> find candidate correct addresses within that zip code.  The USPS does
> an amazing job of delivering mail to completely mangled addresses
> based on methods like that.

Paul,

That's a sound methodology. I actually have a routine that will
compare an address to a list of all streets in the city using a Short
Distance function. I have used that in circumstances when there are a
lot of problems with addresses. In this case, however, the streets are
actually structured very well -- except for the transposed street
directions. I was really hoping for a single regex that handles
one-, two-, and three-word street names, followed by an optional
single-character direction and then a two-character suffix. I'm still
hoping for that kind of solution if it exists. The reason? It's actually a
very small number of addresses that aren't being captured with the
current regex. I don't see the need for overkill, and I'm always
stretching to learn something I haven't already succeeded at
accomplishing. I may just make a second pass at the data with a
different regex.
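
For what it's worth, here is a rough sketch of the kind of single-pass
pattern I have in mind. It is only an illustration under assumptions that
are mine, not something confirmed against the real data: addresses are
upper case, start with a numeric house number, directions are one of
N/S/E/W, and the suffix is exactly two letters. The names ADDRESS_RE and
normalize are made up for this example.

```python
import re

# Assumed layout (hypothetical, not verified against the real data):
#   house number, optional leading direction, 1-3 word street name,
#   optional trailing (transposed) direction, two-letter suffix.
ADDRESS_RE = re.compile(
    r"""
    ^(?P<number>\d+)\s+
    (?:(?P<pre>[NSEW])\s+)?               # direction in its normal position
    (?P<name>[A-Z]+(?:\s+[A-Z]+){0,2}?)   # one to three words, as few as possible
    (?:\s+(?P<post>[NSEW])\b)?            # direction transposed after the name
    \s+(?P<suffix>[A-Z]{2})$              # two-letter suffix such as ST or AV
    """,
    re.VERBOSE,
)


def normalize(address):
    """Return the address with any direction moved in front of the street
    name, or None if the address does not fit the assumed layout."""
    m = ADDRESS_RE.match(address.strip().upper())
    if m is None:
        return None
    direction = m.group("pre") or m.group("post")
    parts = [m.group("number"), direction, m.group("name"), m.group("suffix")]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    for raw in ("1000 RAMPART S ST", "1000 S RAMPART ST", "1000 RAMPART ST"):
        print(raw, "->", normalize(raw))
```

The lazy {0,2}? quantifier lets the name group grow only as far as needed
for the rest of the pattern to match, which is what allows one expression
to cover one-, two-, and three-word names. A street whose name really ends
in a single letter N, S, E, or W would be misread as a transposed
direction, so a second pass or a street-list lookup is probably still
needed for those.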


