matching a street address with regular expressions

Paul McGuire ptmcg at austin.rr.com
Fri Oct 12 21:04:23 CEST 2007


On Oct 12, 8:19 am, John Machin <sjmac... at lexicon.net> wrote:
> "... most of the developed world" was the [very optimistic] request.
> How does it go with "JAPAN 112-0001 TOKYO Bunkyo-Ku Hakusan 4-Chome 3-
> 2" and will it give the same result for "4-3-2 HAKUSAN BUNKYO-KU TOKYO
> 112-00001 JAPAN"? OK, a little exotic ... closer to "home", what about
> addresses in Quebec? People often write addresses in formats that you
> won't find on the postal service website, but the local postal workers
> will still deliver. Rural addresses can be quaintly medieval e.g. "Lot
> 123, Hundred of Foughbarre" [South Australia]. Etc etc ...

John -

As I'm sure you already know, the answer is "not too well".  I've been
to Japan and Europe too, and I can't even figure out how many digits a
phone number is supposed to have!

I once lived in a small town in Virginia, and received a Christmas
card from a friend (who wanted to remind me just how small a town I
was living in) addressed simply as "The McGuire's, <name-of-town>,
Virginia" - I happened to mention it the next time I was at the post
office, and they said "yep, you're the only McGuires we got!"

In the spectrum of tools from string.split, to re, to pyparsing and
its ilk, to natural language toolkits, the broader the scope of your
"address space", the further you find yourself moving toward NLTK.  I
knew my pyparsing script wasn't in the "anywhere in the developed
world" category; that's why I posted the test cases, too.

If you've got an re that can handle everything from "123 Main" to
"221B Baker Street" to "Hollywood and Vine" to "Lot 123, Hundred of
Foughbarre", now THAT would be something.
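
To make the gap concrete, here is a minimal sketch of a naive re
applied to those four examples. It is not the pyparsing script
mentioned above; the names STREET_RE and parse_street are made up for
illustration, and the pattern assumes an address is just a house
number (with an optional letter suffix) followed by a street name.

```python
import re

# Hypothetical pattern: a house number with an optional letter suffix,
# then whitespace, then everything else as the street name.
STREET_RE = re.compile(r"^(?P<number>\d+[A-Za-z]?)\s+(?P<street>\S.*)$")

def parse_street(address):
    """Return (number, street) if the address fits the simple pattern, else None."""
    match = STREET_RE.match(address.strip())
    if match is None:
        return None
    return match.group("number"), match.group("street")

samples = [
    "123 Main",
    "221B Baker Street",
    "Hollywood and Vine",
    "Lot 123, Hundred of Foughbarre",
]

for sample in samples:
    print(repr(sample), "->", parse_street(sample))
```

The first two samples split into a number and a street; the
intersection and the rural lot description both come back as None,
since neither starts with a house number.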

-- Paul



