clean data [was: Simple distributed example for learning purposes?]
Ethan Furman
ethan at stoneleaf.us
Mon Dec 28 17:14:09 EST 2009
Lie Ryan wrote:
> On 12/28/2009 11:59 PM, Shawn Milochik wrote:
>> With address data:
>> one address may have suite data and the other might not
>> the same city may have multiple zip codes
>
> why is that even a problem? You do put suite data and zipcode into
> different database fields right?

The issue here is not proper database design; the issue is users -- not
one user, not two users, but millions of users, with no consistency
amongst them. They bring you their nice tiny list of 10,000 names and
addresses and want you to correct/normalize/mail them, and you have to
be able to break down what they gave you into something usable.

To rephrase, the issue that Shawn is referring to is the huge amount of
data *already out there*, not brand new data.
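
As a rough illustration of what "breaking down" means in practice, here is
a minimal sketch that splits one free-form address line into separate
fields. The function name split_address, the two patterns, and the short
list of suite markers are all assumptions made up for this example; real
lists are far messier than anything this handles.

```python
import re

# Trailing "ST 12345" or "ST 12345-6789"; the ZIP part is optional.
STATE_ZIP_RE = re.compile(r"\b([A-Za-z]{2})\.?(?:\s+(\d{5}(?:-\d{4})?))?\s*$")

# Suite-like markers at the end of the street part. This short list is an
# assumption for illustration, not any kind of standard.
SUITE_RE = re.compile(
    r"(?:\b(?:suite|ste|apt|unit)\b\.?|#)\s*([\w-]+)\s*$", re.IGNORECASE
)


def split_address(line):
    """Break one free-form address line into named fields (best effort)."""
    fields = {"street": "", "suite": "", "city": "", "state": "", "zip": ""}
    rest = " ".join(line.split())  # collapse runs of whitespace

    # Peel the state and optional ZIP off the end of the line.
    m = STATE_ZIP_RE.search(rest)
    if m:
        fields["state"] = m.group(1).upper()
        fields["zip"] = m.group(2) or ""
        rest = rest[:m.start()].rstrip(" ,")

    # Whatever follows the last remaining comma is taken to be the city.
    if "," in rest:
        rest, city = rest.rsplit(",", 1)
        fields["city"] = city.strip()

    # A suite marker may or may not be present; leave the field empty if not.
    m = SUITE_RE.search(rest)
    if m:
        fields["suite"] = m.group(1)
        rest = rest[:m.start()].rstrip(" ,")

    fields["street"] = rest.strip()
    return fields
```

Calling split_address("123 Main St, Suite 4, Springfield, IL 62704") gives
separate street, suite, city, state and zip values, and an address with no
suite simply comes back with an empty suite field. It is easy to fool:
given just "123 Main St", it will take "St" for a state, which is exactly
the kind of inconsistency that makes existing data hard to clean.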
~Ethan~