Simple distributed example for learning purposes?
Lie Ryan
lie.1296 at gmail.com
Mon Dec 28 11:28:49 EST 2009
On 12/28/2009 11:59 PM, Shawn Milochik wrote:
> With address data:
> one address may have suite data and the other might not
> the same city may have multiple zip codes
Why is that even a problem? You do put suite data and zip code into
different database fields, right?
> incoming addresses may be missing information
> typos are common
See below.
> sometimes "Route 35" is the same road as "Convery Boulevard"
If you have access to a road-name index, you can theoretically normalize
"Route 35" into "Convery Boulevard" (or vice versa). But for most
practical purposes, that's a human issue: tell the user to use a certain
form of address for the form entry. Tell the user that once they
register using "Route 35" they have to refer to it as "Route 35" in the
future.
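
If you do have such an index, the normalization could look roughly like
the sketch below. The alias table and the function name are made up for
illustration; a real table would be loaded from whatever road-name index
you have access to.

```python
# Hypothetical alias table: maps a known alternative name to one canonical
# form. In practice this would be built from a road-name index.
ROAD_ALIASES = {
    "route 35": "convery boulevard",
}

def normalize_road(name):
    """Return the canonical road name, or the cleaned-up input if no alias is known."""
    # Lowercase and collapse runs of whitespace so "Route  35" == "route 35".
    key = " ".join(name.lower().split())
    return ROAD_ALIASES.get(key, key)
```

Both the stored address and the incoming one would go through the same
function before being compared.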
> With names:
> you have to compare with and without the middle name
> compare with and without the title (Mrs., Dr., Mr., Ms.)
> compare with and without the suffix (PhD., Sr., Junior, III, etc.)
They're never a problem; names should be separated into at least two
fields, firstname and lastname, and titles and suffixes should have their
own fields:
Mrs. John Doe           -> titles: (mrs,); first: john; last: doe
Doe, John               -> titles: (); first: john; last: doe
John Doe                -> titles: (); first: john; last: doe
John Foo Doe            -> titles: (); first: john middle: foo; last: doe
                    or: -> titles: (); first: john; middle: foo; last: doe
Doe, John Foo           -> titles: (); first: john; middle: foo; last: doe
Prof. John Doe          -> titles: (professor,); first: john; last: doe
dr. John Doe, PhD       -> titles: (doctor, PhD); first: john; last: doe
Lady John Doe III       -> titles: (lady, III); first: john; last: doe
Lady John Doe The Third -> titles: (lady, III); first: john; last: doe
John Doe Jr.            -> titles: (junior,); first: john; last: doe
If both the "query" and the "index" are normalized with the same (or
similar) algorithm, that would significantly reduce the need for fuzzy
search.
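
A minimal sketch of such a normalizer, covering most of the forms in the
table above. The TITLES table and the function name are my own
assumptions, not a complete list, and spelled-out suffixes such as
"The Third" are not handled here.

```python
# Hypothetical lookup table: maps a title or suffix, lowercased and with
# periods removed, to its normalized form. Extend it for real data.
TITLES = {
    "mrs": "mrs", "mr": "mr", "ms": "ms", "lady": "lady",
    "dr": "doctor", "prof": "professor",
    "phd": "PhD", "jr": "junior", "junior": "junior",
    "sr": "senior", "iii": "III",
}

def normalize_name(raw):
    """Split a raw name into titles, first, middle, and last name."""
    titles = []
    parts = []  # one list of name tokens per comma-separated chunk
    for chunk in raw.split(","):
        tokens = []
        for word in chunk.split():
            key = word.replace(".", "").lower()
            if key in TITLES:
                titles.append(TITLES[key])
            else:
                tokens.append(key)
        if tokens:
            parts.append(tokens)
    if len(parts) == 2:
        # "Doe, John Foo": the chunk before the comma is the last name.
        last_tokens, given = parts
    else:
        # "John Foo Doe": the final token is the last name.
        flat = [token for part in parts for token in part]
        given, last_tokens = flat[:-1], flat[-1:]
    return {
        "titles": tuple(titles),
        "first": given[0] if given else "",
        "middle": " ".join(given[1:]),
        "last": " ".join(last_tokens),
    }
```

Run the same function over the stored names when building the index and
over each incoming query, and "Doe, John" and "Mr. John Doe" end up with
identical first and last fields.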
> typos are VERY common
That's where fuzzy search comes in, but the database entries themselves
should be normalized long before fuzzy search kicks in.
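
For the typo case, the standard library's difflib is one possible
starting point. This is only a sketch: the function name and the cutoff
value are arbitrary choices of mine, and a large database would need
something faster than comparing the query against every entry.

```python
import difflib

def fuzzy_lookup(query, normalized_names, cutoff=0.6):
    """Return up to five stored names that closely resemble the query.

    `normalized_names` is assumed to be a list of already-normalized
    (lowercase) names.
    """
    return difflib.get_close_matches(
        query.lower(), normalized_names, n=5, cutoff=cutoff
    )
```

With the names ["smith", "wang", "doe"], looking up "Smtih" returns
["smith"].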
> what if John Henry Smith goes by "Henry Smith"?
What's wrong with that? Your "name search" algorithm can combine the
firstname, middlename, and lastname fields into one "superview" for
searching purposes.
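
One way to read "superview", as a rough sketch: pool the words from all
the name fields into one set and check that every word of the query is
in it. The record layout (a dict with "first", "middle", and "last"
keys) and the function names are assumptions for the example.

```python
def superview(record):
    """Combine the name fields of a record into one set of lowercase words."""
    fields = (
        record.get("first", ""),
        record.get("middle", ""),
        record.get("last", ""),
    )
    return {word for field in fields for word in field.lower().split()}

def matches(query, record):
    """True if every word of the query appears in one of the name fields."""
    return set(query.lower().split()) <= superview(record)
```

A record with first "john", middle "henry", and last "smith" then
matches both "Henry Smith" and "John Smith".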
> what if Xu Wang goes by "John Wang" (happens all the time)
> maiden name versus married name
Your search query should be normalized as well. Search "Xu" first, then
search "Wang", then find the intersection. Show the intersection to the
user; if they can't find the correct name in it, then offer them the
option to search for "Xu"-only or "Wang"-only. However, if John Wang
goes by Jack Black, then it is indeed an unsolvable problem.
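
The search-then-intersect step might look like this. It is a sketch
that assumes a hypothetical `index` dict mapping each normalized name
word to the set of record ids containing it; the per-term results are
returned too, so they can be offered as the fallback.

```python
def search(terms, index):
    """Return (intersection, per_term) for a list of query terms.

    `index` maps a lowercase name word to a set of record ids.
    `per_term` holds the individual result sets, keyed by the term as
    typed, for use when the intersection does not contain the right record.
    """
    per_term = {term: index.get(term.lower(), set()) for term in terms}
    if per_term:
        hits = set.intersection(*per_term.values())
    else:
        hits = set()
    return hits, per_term
```

With index = {"xu": {1, 2}, "wang": {2, 3}}, searching for
["Xu", "Wang"] gives the intersection {2}, plus {1, 2} for "Xu" and
{2, 3} for "Wang" as the wider fallbacks.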
> etc. etc. etc.
> This is a major, real-world issue that remains unsolved, and companies that do a decent job at it make millions of dollars a year from their clients. One of my old jobs made tens of millions a year (and growing FAST) in the medical industry alone.
I agree fuzzy search is indispensable in certain cases, but from the
way you're describing the issue, it appears that half of your "unsolved"
problems come from poor database design. I agree that the
other half (e.g. typos, multiple names/addresses) is indeed unsolvable.