Simple distributed example for learning purposes?
Shawn Milochik
shawn at milochik.com
Mon Dec 28 07:59:54 EST 2009
On Dec 27, 2009, at 1:23 PM, Lie Ryan wrote:
>
> IMHO, that's a poor example. Rather than writing a fuzzy search algorithm, it's easier to write a normalizer before entering data to the index (or before comparing the search string with the corpus' string).
It does seem like that at first, but it turns out that you can't normalize this data, for many reasons.
With address data:
    one address may have suite data and the other might not
    the same city may have multiple zip codes
    incoming addresses may be missing information
    typos are common
    sometimes "Route 35" is the same road as "Convery Boulevard"
    etc. etc. etc.
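To make the address points concrete, here is a rough sketch (not what any real matching product does) that normalizes two addresses and then scores them with difflib from the standard library. The ABBREVIATIONS table and the function names are made up for illustration, and a real table would be far larger.

```python
import difflib
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"boulevard": "blvd", "street": "st", "route": "rt", "suite": "ste"}

def normalize_address(address):
    """Lowercase, replace punctuation with spaces, and abbreviate common words."""
    words = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def address_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two normalized addresses."""
    matcher = difflib.SequenceMatcher(None, normalize_address(a), normalize_address(b))
    return matcher.ratio()

# A typo survives normalization, so some fuzzy scoring is still needed.
typo = address_similarity("123 Convery Boulevard", "123 Convrey Blvd.")
# One record has suite data and the other does not.
missing_suite = address_similarity("123 Convery Blvd Suite 4", "123 Convery Blvd")
# Two names for the same road: neither normalizing nor string similarity helps.
alias = address_similarity("123 Route 35", "123 Convery Boulevard")

print(typo, missing_suite, alias)
```

The typo and missing-suite pairs score high, while the Route 35 / Convery Boulevard pair scores low even though it is the same road. That last case needs outside knowledge (an alias table or geocoding), not a better normalizer.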
With names:
    you have to compare with and without the middle name
    compare with and without the title (Mrs., Dr., Mr., Ms.)
    compare with and without the suffix (PhD., Sr., Junior, III, etc.)
    typos are VERY common
    what if John Henry Smith goes by "Henry Smith"?
    what if Xu Wang goes by "John Wang"? (happens all the time)
    maiden name versus married name
    etc. etc. etc.
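The name comparisons above could be sketched roughly like this: build several variants of each name (with and without title, suffix, and middle name), then take the best fuzzy score across all pairs. TITLES, SUFFIXES, and both function names are assumptions for the example, and the lookup sets are deliberately tiny.

```python
import difflib

# Hypothetical lookup sets, for illustration only.
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"phd", "sr", "jr", "junior", "ii", "iii"}

def name_variants(full_name):
    """Return comparison forms of a name: as given, without title/suffix,
    without middle names, and without the first name."""
    tokens = [t.strip(".,").lower() for t in full_name.split()]
    tokens = [t for t in tokens if t]
    core = [t for t in tokens if t not in TITLES and t not in SUFFIXES]
    variants = {" ".join(tokens), " ".join(core)}
    if len(core) > 2:
        variants.add(core[0] + " " + core[-1])  # drop the middle name(s)
        variants.add(" ".join(core[1:]))        # person goes by the middle name
    return variants

def best_name_score(a, b):
    """Return the highest similarity ratio over all variant pairs."""
    return max(
        difflib.SequenceMatcher(None, x, y).ratio()
        for x in name_variants(a)
        for y in name_variants(b)
    )

print(best_name_score("Dr. John Henry Smith, PhD.", "John Smith"))
print(best_name_score("John Henry Smith", "Henry Smith"))
print(best_name_score("Jon Smith", "John Smith"))
print(best_name_score("Xu Wang", "John Wang"))
```

The first two pairs match exactly through a variant and the typo pair scores high, but "Xu Wang" versus "John Wang" still scores poorly, and a maiden versus married name would too. Those cases need other evidence, such as address or date of birth, which is part of why this can't be reduced to normalization.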
This is a major, real-world issue that remains unsolved, and companies that do a decent job at it make millions of dollars a year from their clients. One of my former employers made tens of millions a year (and was growing FAST) in the medical industry alone.
Shawn
More information about the Python-list mailing list