matching a street address with regular expressions

Tim Chase python.list at tim.thechases.com
Thu Oct 11 20:24:03 CEST 2007


> Don't forget to write test cases. If you have a series of addresses,
> and confirm they are parsed correctly, you are in a good position to
> refine the pattern. You will instantly know if a change in pattern has
> broken another pattern.
> 
> The reason I'm saying this, is because I think your pattern is
> incomplete. I suggest you add a test case for the following street
> address:
> 
> 221B Baker Street

There are a number of weird street names and addresses that one
may need to address.  The police applications I have worked with
usually break an address into the BLOCK, DIRECTION, STREET, SUFFIX
and APARTMENT/SUITE.

However, there are complications...

Block can include things like

  1234 1/2 (an actual street format from one of our test cases,
    where two block numbers were divided to make room)
  221B (though this might be a block + apartment)
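
As a rough sketch of how those two block formats could be picked
apart, the pattern below accepts digits, an optional fraction and an
optional single trailing letter.  The names BLOCK_RE and split_block
are made up for illustration, and treating the "B" in 221B as a
separate piece is only one possible reading.

```python
import re

# Hypothetical pattern: digits, an optional " 1/2"-style fraction,
# and an optional single trailing letter (as in "221B").
BLOCK_RE = re.compile(r"""
    ^\s*
    (?P<number>\d+)                  # 1234, 221
    (?:\s+(?P<fraction>\d+/\d+))?    # optional "1/2"
    (?P<letter>[A-Za-z])?            # optional letter glued to the number
    \b
    \s*
    (?P<rest>.*)$                    # everything after the block
""", re.VERBOSE)

def split_block(address):
    """Return (number, fraction, letter, rest), or None if no block is found."""
    m = BLOCK_RE.match(address)
    if m is None:
        return None
    return (m.group("number"), m.group("fraction"),
            m.group("letter"), m.group("rest"))
```

Whether the letter ends up as part of the block or as an apartment is
a decision for the caller; the sketch just keeps it separate so either
choice stays possible.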

Directions can include not only your cardinal N/S/E/W directions
(written out or abbreviated, with or without punctuation), but
can include 8-point directions or more, such as NW, Northwest,
north-west, etc.  It wouldn't even surprise me if locations with
16-point directions exist (NNW).
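
One way to cope with that variety might be to reduce every spelling
to a compass abbreviation before comparing.  This is only a sketch:
normalize_direction and the tables it uses are illustrative names, and
the set of accepted spellings is an assumption.

```python
import re

# Assumed canonical letters for the written-out words.
_WORDS = {"north": "n", "south": "s", "east": "e", "west": "w"}

# The 16 compass points, in case NNW-style directions do turn up.
_VALID = {"N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
          "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"}

def normalize_direction(token):
    """Return an abbreviation such as 'N', 'NW' or 'NNW', or None."""
    # Drop punctuation, hyphens and spaces: "N.W." -> "nw"
    cleaned = re.sub(r"[^a-z]", "", token.lower())
    # Collapse written-out words: "northwest" -> "nw"
    for word, letter in _WORDS.items():
        cleaned = cleaned.replace(word, letter)
    result = cleaned.upper()
    return result if result in _VALID else None
```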

The Street portion is often whatever is left over once the rest
has been parsed out.

The Suffix would be "Rd", "Road", "St", "Ave", "Cir", "Bvd",
"Blvd", "Row", "Hwy", "Highway", etc.  There are about 30 of them
that we used by default, but I'm sure there are some unusual ones
as well.
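
A lookup table is probably the simplest way to fold the variants
together.  The mapping below covers only the suffixes named above plus
a few written-out forms; the canonical spellings on the right-hand
side are my own choice, not any standard list.

```python
# Partial, illustrative table: variant (lower-case) -> assumed canonical form.
SUFFIXES = {
    "rd": "RD", "road": "RD",
    "st": "ST", "street": "ST",
    "ave": "AVE", "avenue": "AVE",
    "cir": "CIR", "circle": "CIR",
    "bvd": "BLVD", "blvd": "BLVD", "boulevard": "BLVD",
    "row": "ROW",
    "hwy": "HWY", "highway": "HWY",
}

def normalize_suffix(token):
    """Return the canonical suffix for token, or None if it is not a suffix."""
    # Ignore surrounding spaces, a trailing period and letter case.
    return SUFFIXES.get(token.strip().rstrip(".").lower())
```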

Even the above has wrinkles: here in the Dallas area we have a
"Northwest Highway", where Northwest is the name of the street,
not the Direction portion.
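
One heuristic that seems to handle this case is to peel the suffix
off first and only treat the leading word as a direction if something
is still left over to be the street name.  The function and tables
below are hypothetical and deliberately small; this is a sketch of the
idea rather than a complete parser.

```python
# Small illustrative tables (lower-case); real lists would be longer.
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
              "north", "south", "east", "west",
              "northeast", "northwest", "southeast", "southwest"}
SUFFIXES = {"rd", "road", "st", "street", "ave", "avenue",
            "hwy", "highway", "blvd"}

def split_street(tokens):
    """Split tokens (block already removed) into (direction, street, suffix)."""
    words = list(tokens)
    suffix = None
    # Take the suffix only if at least one word remains in front of it.
    if len(words) > 1 and words[-1].rstrip(".").lower() in SUFFIXES:
        suffix = words.pop()
    direction = None
    # Take the direction only if a street name remains after it.
    if len(words) > 1 and words[0].rstrip(".").lower() in DIRECTIONS:
        direction = words.pop(0)
    return direction, " ".join(words), suffix
```

With this ordering, "Northwest Highway" keeps Northwest as the street
name, while "W Northwest Hwy" still gets W as its direction.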

I second Goldfish's suggestion for making a suite of both normal
and abnormal addresses along with their expected breakdowns.
Depending on how normalized you want them to be, you may have to
deal with punctuation and spacing abnormalities as well.
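
Such a suite can be as simple as a table of inputs and expected
results that gets re-run after every change to the pattern.  In the
sketch below a small normalize function stands in for the real parser;
the function names and the sample cases are made up for illustration.

```python
import re

def normalize(address):
    """Upper-case, drop periods and commas, and collapse runs of whitespace."""
    no_punct = re.sub(r"[.,]", "", address.upper())
    return " ".join(no_punct.split())

# (raw input, expected result); these cases are illustrative only.
CASES = [
    ("221B Baker Street", "221B BAKER STREET"),
    ("  1234  1/2   Main St. ", "1234 1/2 MAIN ST"),
    ("100 N.W. Highway", "100 NW HIGHWAY"),
    ("100 north-west hwy", "100 NORTH-WEST HWY"),
]

def run_cases(func, cases):
    """Return (raw, expected, actual) for every case that func gets wrong."""
    failures = []
    for raw, expected in cases:
        actual = func(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures
```

An empty list from run_cases(normalize, CASES) means nothing broke;
anything else shows exactly which address regressed and how.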

-tkc





