People's names (was Re: sqlite3 error)

John Machin sjmachin at lexicon.net
Mon Oct 9 20:51:23 EDT 2006


John J. Lee wrote:
> "John Machin" <sjmachin at lexicon.net> writes:
> [...]
> > This is all a bit OT.  Before we close the thread down
>
> Do you have a warrant for that?

I have some signed-but-otherwise-blank warrants, but I'm saving them
for other threads :-)

>
> > , let me leave
> > you with one warning:
> > Beware of enthusiastic maintenance programmers on a mission to clean up
> > the dirty names in your database:
> > E.g. (1) "Karim bin Md" may not appreciate getting a letter addressed
> > to "Dr Karim Bin" (Md is an abbreviation of Muhammad).
> > E.g. (2) Billing job barfs on a customer who has no given names and no
> > family name. Inspection reveals that he is over-endowed in the title
> > department: "Mr Earl King".
> [...]
>
> Heh.

Heh indeed. This behaviour seems to be endemic. Another true story from
a 3rd post-cleanup cleanup assignment: Looking at the "country"
component of addresses:  WALES? Users suggested it be changed to "UK"
to conform with the ISO standard, UPU conventions, etc. However, glancing
at other address components, one found intriguing things like "C/o Prince
of  Hospital". The same "algorithm" had migrated a handful of clients
from Coromandel Valley to Oman, and a considerable number from the
Melbourne suburb of Chadstone to Chad.
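
I never saw the offending code, so the following is only a guess at the
kind of test that produces results like those: a substring search for
country names across every address component. The country list and both
function names are invented for illustration. The stricter variant
accepts a country only when it is the whole field.

```python
# Hypothetical country list; the real cleanup job's list is not known.
COUNTRIES = ["WALES", "OMAN", "CHAD"]


def naive_find_country(field):
    """Return the first country name found anywhere inside the field.

    This is the substring test that misfires on suburbs and hospitals.
    """
    upper = field.upper()
    for country in COUNTRIES:
        if country in upper:
            return country
    return None


def strict_find_country(field):
    """Return a country only if the whole field, trimmed, is that country."""
    candidate = field.strip().upper()
    return candidate if candidate in COUNTRIES else None


if __name__ == "__main__":
    for field in ["Coromandel Valley", "Chadstone",
                  "C/o Prince of Wales Hospital", "Wales"]:
        print(repr(field), naive_find_country(field),
              strict_find_country(field))
```

With the naive version, "Coromandel Valley" matches OMAN and "Chadstone"
matches CHAD; the strict version leaves both alone and still accepts a
field that contains nothing but "Wales".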

>
> I guess the people who really know about that kind of thing are the
> "record linkage" people (this one is a project worked on by c.l.py's
> own Tim Churches, and has produced some Python code):
>
> http://datamining.anu.edu.au/projects/linkage.html

The project is heavily into probabilistic methods. Given enough
correctly tagged data to work on, "Earl" and "King" are much more
likely to drop into a name slot than a title slot.
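
To make that concrete, here is a toy relative-frequency tagger. It is
not the project's code or method, just a minimal sketch of the idea that
counts from tagged data decide which slot a token lands in. The training
pairs, slot names and function names are all invented.

```python
from collections import Counter, defaultdict

# Invented (token, slot) pairs standing in for correctly tagged data.
TAGGED = [
    ("mr", "title"), ("earl", "given"), ("king", "family"),
    ("dr", "title"), ("earl", "given"), ("jones", "family"),
    ("mr", "title"), ("john", "given"), ("king", "family"),
    ("earl", "title"), ("spencer", "family"),
]


def slot_probabilities(tagged):
    """Estimate P(slot | token) by relative frequency in the tagged data."""
    counts = defaultdict(Counter)
    for token, slot in tagged:
        counts[token.lower()][slot] += 1
    probs = {}
    for token, slot_counts in counts.items():
        total = sum(slot_counts.values())
        probs[token] = {slot: n / total for slot, n in slot_counts.items()}
    return probs


def most_likely_slot(token, probs):
    """Return the highest-probability slot, or None for an unseen token."""
    dist = probs.get(token.lower())
    if dist is None:
        return None
    return max(dist, key=dist.get)


if __name__ == "__main__":
    probs = slot_probabilities(TAGGED)
    for token in ["Mr", "Earl", "King"]:
        print(token, most_likely_slot(token, probs))
```

In this made-up sample "earl" is tagged as a given name twice and as a
title once, so it goes to the given-name slot, and "king" has only ever
been seen as a family name.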

Cheers,
John



