Simple distributed example for learning purposes?

Mon Dec 28 11:28:49 EST 2009

On 12/28/2009 11:59 PM, Shawn Milochik wrote:
> With address data:
> 	one address may have suite data and the other might not
> 	the same city may have multiple zip codes

Why is that even a problem? You do put suite data and zip code into 
different database fields, right?

> 	incoming addresses may be missing information
> 	typos are common

See below.

> 	sometimes "Route 35" is the same road as "Convery Boulevard"

If you have access to a road-names index, you can theoretically normalize 
"Route 35" into "Convery Boulevard" (or vice versa). But for most 
practical purposes, that's a human issue: tell users to stick to one 
form of address in the form entry. Tell them that once they 
register using "Route 35", they have to refer to it as "Route 35" in the 
future.
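
As a rough illustration of the road-name normalization, here is a minimal 
sketch. The alias table and the normalize_road() helper are made up for 
this example; a real table would have to come from whatever road-names 
index you have access to.

```python
# Hypothetical alias table; real data would come from a road-names index.
ROAD_ALIASES = {
    "route 35": "convery boulevard",
    "rt 35": "convery boulevard",
}

def normalize_road(name):
    """Return the canonical road name, or the cleaned-up input if unknown."""
    # Lowercase, drop periods, and collapse runs of whitespace.
    key = " ".join(name.lower().replace(".", "").split())
    return ROAD_ALIASES.get(key, key)
```

Running both the stored address and the incoming address through the same 
function means "Route 35" and "Convery Boulevard" compare equal.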

> With names:
> 	you have to compare with and without the middle name
> 	compare with and without the title (Mrs., Dr., Mr., Ms.)
> 	compare with and without the suffix (PhD., Sr., Junior, III, etc.)

These are never a problem. Names should be separated into at least two 
fields, firstname and lastname, and titles and suffixes should have their 
own fields:

Mrs. John Doe -> titles: (mrs,); first: john; last: doe
Doe, John -> titles: (); first: john; last: doe
John Doe -> titles: (); first: john; last: doe
John Foo Doe -> titles: (); first: john middle: foo; last: doe
          or: -> titles: (); first: john; middle: foo; last: doe
Doe, John Foo -> titles: (); first: john; middle: foo; last: doe
Prof. John Doe -> titles: (professor,); first: john; last: doe
dr. John Doe, PhD -> titles: (doctor, PhD); first: john; last: doe
Lady John Doe III -> titles: (lady, III); first: john; last: doe
Lady John Doe The Third -> titles: (lady, III); first: john; last: doe
John Doe Jr. -> titles: (junior,); first: john; last: doe

If both the "query" and the "index" are normalized with the same (or 
similar) algorithm, that would significantly reduce the need for fuzzy 
search.
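
A minimal sketch of such a normalizer, assuming only the name forms listed 
above. The TITLES and SUFFIXES tables and the normalize_name() function are 
hypothetical and would need extending for real data (it does not handle 
"Doe, John, PhD", for instance).

```python
# Hypothetical lookup tables; extend them for real data.
TITLES = {"mr": "mr", "mrs": "mrs", "ms": "ms", "dr": "doctor",
          "prof": "professor", "lady": "lady"}
SUFFIXES = {"phd": "PhD", "jr": "junior", "junior": "junior",
            "sr": "senior", "iii": "III"}

def normalize_name(raw):
    """Split a raw name into titles, first, middle and last fields."""
    text = raw.lower().replace(".", "").replace("the third", "iii")
    head, sep, tail = text.partition(",")
    if sep and not all(w in SUFFIXES for w in tail.split()):
        # "Doe, John Foo" form: move the last name to the end.
        text = tail + " " + head
    else:
        # No comma, or "John Doe, PhD" form: just drop the comma.
        text = head + " " + tail
    words = text.split()
    titles = []
    while words and words[0] in TITLES:       # leading titles
        titles.append(TITLES[words.pop(0)])
    while words and words[-1] in SUFFIXES:    # trailing suffixes
        titles.append(SUFFIXES[words.pop()])
    return {
        "titles": tuple(titles),
        "first": words[0] if words else "",
        "middle": " ".join(words[1:-1]),
        "last": words[-1] if len(words) > 1 else "",
    }
```

The same function is meant to be applied to the stored records and to the 
incoming query, so that both sides end up in the same form before any 
comparison happens.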

> 	typos are VERY common

That's where fuzzy search comes in, but the database entries themselves 
should be normalized long before fuzzy search kicks in.
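
For the typo case, a minimal sketch using the standard library's difflib 
as the fuzzy matcher. This is just one possible choice, not a 
recommendation of a particular algorithm; the fuzzy_lookup() helper, the 
sample index, and the 0.8 cutoff are assumptions for illustration.

```python
import difflib

def fuzzy_lookup(query, index, cutoff=0.8):
    """Return indexed names close to `query`, best match first."""
    # Normalize the query the same way the index entries were normalized.
    key = " ".join(query.lower().split())
    return difflib.get_close_matches(key, index, n=3, cutoff=cutoff)

# Hypothetical index of already-normalized names.
index = ["john doe", "jane doe", "henry smith"]
matches = fuzzy_lookup("Jonh Doe", index)
```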

> 	what if John Henry Smith goes by "Henry Smith"?

What's wrong with that? Your "name search" algorithm can combine the 
firstname, middlename, and lastname fields into one "superview" for 
searching purposes.
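
A minimal sketch of the "superview" idea, assuming each record is a dict 
with first, middle and last keys. The field names and both helper 
functions are made up for this example.

```python
def superview(record):
    """Join the name fields into one lowercase string of words."""
    parts = (record.get("first", ""), record.get("middle", ""),
             record.get("last", ""))
    return " ".join(p.lower() for p in parts if p)

def name_matches(query, record):
    """True if every word of the query appears in the record's superview."""
    words = set(superview(record).split())
    return all(w in words for w in query.lower().split())

record = {"first": "John", "middle": "Henry", "last": "Smith"}
```

With this, a search for "Henry Smith" and a search for "John Smith" both 
find the John Henry Smith record.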

> 	what if Xu Wang goes by "John Wang" (happens all the time)
> 	maiden name versus married name

Your search query should be normalized as well. Search "Xu" first, then 
search "Wang", then find the intersection. Show the intersection to the 
user; if they can't find the correct name in it, then 
offer to search for "Xu" only or "Wang" only. However, if 
John Wang goes by Jack Black, then indeed it is an unsolvable problem.
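
The search-then-intersect step could be sketched like this, assuming the 
records are already normalized strings. search_word() and search_name() are 
hypothetical helpers, and a real system would query the database instead 
of scanning a list.

```python
def search_word(word, records):
    """Indexes of the records that contain `word` as a whole word."""
    word = word.lower()
    return {i for i, rec in enumerate(records) if word in rec.lower().split()}

def search_name(query, records):
    """Return (records matching every word, records matching only some)."""
    hits = [search_word(w, records) for w in query.split()]
    if not hits:
        return [], []
    both = set.intersection(*hits)
    either = set.union(*hits) - both
    return sorted(both), sorted(either)

records = ["xu wang", "john wang", "xu li"]
exact, fallback = search_name("Xu Wang", records)
```

Here exact holds the intersection to show first, and fallback holds the 
"Xu"-only and "Wang"-only hits to offer if the user doesn't find the 
right person.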

> 	etc. etc. etc.

> This is a major, real-world issue that remains unsolved, and companies that do a decent job at it make millions of dollars a year from their clients. One of my old jobs made tens of millions a year (and growing FAST) in the  medical industry alone.

I agree that fuzzy search is indispensable in certain cases, but from the 
way you're describing the issue, it appears that half of your "unsolved" 
problems come from poor database design. I agree that the 
other half (e.g. typos, multiple names/addresses) is indeed unsolvable.