Simple distributed example for learning purposes?

Thu Jan 14 22:54:09 EST 2010

Ethan Furman wrote:
> CM wrote:
>> On Dec 26 2009, 3:46 pm, Shawn Milochik <sh... at milochik.com> wrote:
>>> The special features of the Shrek DVD showed how the rendering took
>>> so much processing power that everyone's workstation was used
>>> overnight as a rendering farm. Some kind of video rendering would
>>> make a great example. However, it might be a lot of overhead for you
>>> to set up, unless you can find someone with expertise in the area.
>>> The nice thing about this is that it would be relevant to the
>>> audience. Also, if you describe what goes into processing a single
>>> frame in enough depth that they appreciate it, they'll really "get"
>>> the power of distributed processing.
>>>
>>> Something else incredibly time-expensive but much easier to set up
>>> would be matching of names and addresses. I worked at a company where
>>> this was, at its very core, the primary function of the business
>>> model. Considering the different ways of entering simple data, many
>>> comparisons must be made. This takes a lot of time, and even then the
>>> match rates aren't necessarily going to be very high.
>>>
>>> Here are some problems with matching:
>>>
>>> 'Bill' | 'William'
>>> '52 10th Street' | '52 tenth street'
>>> 'E. Smith street' | 'E smith street' | 'east smith street'
>>> 'Bill Smith' | 'Smith, Bill'
>>> 'William Smith Jr' | 'William Smith Junior'
>>> 'Dr. W. Smith' | 'William Smith'
>>> 'Michael Norman Smith' | 'Michael N. Smith' | 'Michael Smith' |
>>> 'Smith, Michael' | 'Smith, Michael N.' | 'Smith, Michael Norman'
>>>
>>> The list goes on and on, ad nauseam. Not to mention geocoding,
>>> married and maiden names, and scoring partial name matches with
>>> distance proximity matches.
>>
>> I'm not sure I understand the task.  Based on another comment in this
>> thread, the idea seems to be that the company that does this matching
>> work is handed a big data set, and then these sorts of matching
>> comparisons are made on it--but what is the goal?
>>
>> I can understand with address matches, in that 'east smith street'
>> should be recognized as the official postal address 'E. Smith Street',
>> so that the company can send correspondence to the correct address.
>> Is that the idea?
> 
> Yes, and...
> 
>> But what about the names?  If it says 'Michael Norman Smith' as the
>> name, or 'Michael N. Smith' or 'Smith, Michael', can't one then just
>> use that on the correspondence?  Why do you need to match 'Michael
>> Norman Smith' to 'Michael N. Smith' to discover they are the same
>> person?  Is it that those two variations appear in the same data set
>> and you want to make sure you don't mail twice to the same person?
> 
> Yes.  We have one customer who gives us *everything* they have -- about
> 20,000 records, and after the duplicate purge process they mail to about
> 3,000.  In other cases we'll be combining different data sets from the
> same customer, and again we don't want duplicate mailings.
> 
And I can tell you from long experience that one thing the purchasers of
(for example) expensive training do not like is to see vendors wasting
money on multiple (expensive) catalog mailings.
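
For anyone who wants something concrete to play with, here is a minimal
sketch of the duplicate purge discussed above. It is not how any real
matching shop does it. It assumes exact matching on a normalized
(name, address) key, and the lookup tables and function names
(NICKNAMES, SUFFIXES, STREET_WORDS, normalize_name, normalize_address,
purge_duplicates) are all made up for illustration.

```python
# Hypothetical lookup tables; a real system would need far larger ones.
NICKNAMES = {"bill": "william"}
SUFFIXES = {"junior": "jr"}
# Naive assumption: a lone 'e' in an address always means 'east'.
STREET_WORDS = {"e": "east", "tenth": "10th"}


def _tokens(text):
    """Lowercase the text, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()


def normalize_name(name):
    """Reduce a name to a canonical form, e.g. 'Smith, Bill' -> 'william smith'."""
    if "," in name:
        # 'Last, First' becomes 'First Last'
        last, first = name.split(",", 1)
        name = first + " " + last
    words = [NICKNAMES.get(w, w) for w in _tokens(name)]
    words = [SUFFIXES.get(w, w) for w in words]
    return " ".join(words)


def normalize_address(address):
    """Reduce an address to a canonical form, e.g. 'E. Smith street' -> 'east smith street'."""
    return " ".join(STREET_WORDS.get(w, w) for w in _tokens(address))


def purge_duplicates(records):
    """Keep the first (name, address) record seen for each normalized key."""
    seen = set()
    kept = []
    for name, address in records:
        key = (normalize_name(name), normalize_address(address))
        if key not in seen:
            seen.add(key)
            kept.append((name, address))
    return kept
```

Feed purge_duplicates() a list of (name, address) tuples and it returns
the records you would actually mail to. Exact keys will not catch
'Dr. W. Smith' against 'William Smith', or a missing middle name; those
need the scored partial matches mentioned earlier, and that pairwise
scoring is the expensive part worth farming out to several machines.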

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
Holden Web LLC                 http://www.holdenweb.com/