[Chicago] Address parser?

Massimo Di Pierro mdipierro at cs.depaul.edu
Mon Feb 25 18:35:45 CET 2008


You may want to look into this

    http://exogen.case.edu/projects/geopy/

I also have my own, which I use for

    http://www.appealmypropertytaxes.com

Mine performs normalization based on the USPS specifications. They
have a very long document that says you should use AVE and not AVENUE
or AV., you should use N and not NORTH, etc. My parser works in most
cases and is specifically designed to translate addresses into
web2py database queries. It is not freely available, but I can make it
available to web2py users if there is a need.
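
To give a rough idea of what this kind of normalization looks like, here
is a minimal sketch. It is not the parser described above (that one is
not public); the function name, the tiny lookup tables, and the rule for
when a direction word counts as a prefix are all assumptions made for
illustration. The USPS tables (Publication 28) are much longer.

```python
import re

# Illustrative subset only; the real USPS abbreviation tables are far longer.
SUFFIXES = {"AVENUE": "AVE", "AV": "AVE", "STREET": "ST",
            "ROAD": "RD", "BOULEVARD": "BLVD"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def normalize_address(raw):
    """Uppercase, strip periods/commas, and abbreviate suffix and prefix."""
    tokens = re.sub(r"[.,]", " ", raw.upper()).split()
    last = len(tokens) - 1
    out = []
    for i, tok in enumerate(tokens):
        if i == last and tok in SUFFIXES:
            # Only the final token is treated as the street suffix.
            out.append(SUFFIXES[tok])
        elif i == 1 and i < last - 1 and tok in DIRECTIONS:
            # Heuristic (an assumption): a direction word right after the
            # house number is a prefix only if a street name and a suffix
            # still follow it. This keeps "NORTH" in "1600 North Avenue".
            out.append(DIRECTIONS[tok])
        else:
            out.append(tok)
    return " ".join(out)


print(normalize_address("320 North Michigan Avenue"))
print(normalize_address("1600 West North Av."))
```

The positional rule is deliberately crude: it handles a street that is
itself named after a direction, but it knows nothing about direction
suffixes, unit numbers, or city/state/zip, which a real parser must handle.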

Massimo

On Feb 25, 2008, at 11:02 AM, Phil Robare wrote:

> Address parsing is a hard problem.  Not in the theoretical NP sense,
> but in that it requires a lot of knowledge of special cases.
> Addresses can be ambiguous or not depending upon information that the
> application 'just has to know'.  For instance an address in Chicago of
> 320 Randolph is ambiguous - It could be east or west.  But an address
> of 1320 Randolph is merely incomplete, needing West as part of the
> street name. If the user dropped the space you could figure out where
> 1320 westrandolph street was.  But a Westmont Street would just be a
> street named after a suburb.  It would probably be the same as an
> address on Westmont Ave.  But Atlanta is (in)famous for having
> multiple different roads all named Peachtree but having different
> suffixes, e.g. Road, Avenue, Boulevard.  Usually digits are part of
> the address and words are part of the street name.  Detroit, for
> example, confounds things with "8 Mile Road". In many places a street
> has multiple names, bearing both the local name and the highway route
> name, so you get an address like 185 Rt 45.  There are people with the
> last name of "Street" that have had a road named after them.  While
> the block number might be useful for figuring out west or east in
> Chicago, in the suburbs it can be a mess.  Arlington Heights Road goes
> through a number of suburbs, many of them having their own numbering
> system and their own east/west dividing point.  These addresses can be
> ambiguous because no one knows which suburb they are in as they drive
> along it. Most addresses are whole numbers but within the US there are
> a number of places that use fractions (like 1/2) to specify part of a
> duplex, and there are even places that use decimals in the address in
> place of apartment numbers.  Another problem is that there are
> multiple towns with the same name in some states, so the county has to
> be part of the address (or the zip code has to be checked).
>
> So, as far as I know, there are no good public domain address parsers
> because of the amount of work it takes to create one and the
> dependence of the parsing upon an underlying map.  If you are a direct
> marketer mailing hundreds of pieces, the post office parser may be a
> good choice.  But if you are working for a web retailer who would just
> like to make sure the user typed an address that can be mailed to, I
> think the Google API would be an option (depending upon terms of use;
> I don't know how restricted they are with regard to businesses using
> it).  Navteq and Teleatlas have commercial offerings that I am not
> very familiar with.
>
> Asking the person entering the data to put in a house number field, a
> street name, a street type, direction suffix/prefix, etc. can make the
> job of the coder easier but will frustrate those who have to enter an
> address that doesn't fit the model.
>
> Phil
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
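
Several of the special cases in the quoted message (digits inside the
street name as in "8 Mile Road", fractional numbers like 1/2, decimal
house numbers, and route names like "Rt 45") already show up at the very
first step of splitting the house number from the street. A minimal
sketch of that one step follows; the regular expression, the function
name, and the sample house numbers are assumptions made for illustration,
and it does nothing about the map-dependent ambiguities described above.

```python
import re

# House number: digits with an optional decimal part, optionally followed
# by a separate fraction such as "1/2". Everything after that is the street.
HOUSE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)(?:\s+(\d+/\d+))?\s+(.+?)\s*$")


def split_house_number(address):
    """Return (number, fraction_or_None, street), or None if no number leads."""
    m = HOUSE_RE.match(address)
    if m is None:
        return None
    return m.groups()


# Hypothetical sample addresses, chosen to exercise the cases above.
for sample in ["20500 8 Mile Road", "123 1/2 Main St", "185 Rt 45", "Randolph"]:
    print(sample, "->", split_house_number(sample))
```

Only the first run of digits is taken as the house number, so the "8" in
"8 Mile Road" stays with the street name. Deciding whether "320 Randolph"
is east or west still needs knowledge of the city that no regular
expression can supply.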


