[Tutor] Randomize SSN value in field

GTXY20 gtxy20 at gmail.com
Thu May 22 19:56:22 CEST 2008


Thanks all;

Basically, I will be given an address list of about 50 million lines, which
will boil down to approximately 15 million unique people when SSN is used as
the primary key. I was concerned that by using the random module I would
eventually produce the same value for two different SSNs. After assigning a
unique reference to each SSN, I need to substitute that reference for the
SSN and forward the complete address file (all 50 million records) to a
larger group for analysis.
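
The worry about duplicates can be put in rough numbers with the standard
birthday-problem approximation. The sketch below assumes each replacement
value is drawn independently and uniformly from a fixed space; the 9-digit
space is an assumption for illustration (the post does not say how large the
random values would be), and the function names are made up for this example.

```python
import math

def expected_colliding_pairs(n_items, space_size):
    """Birthday-problem estimate: expected number of pairs sharing a value."""
    return n_items * (n_items - 1) / (2 * space_size)

def prob_at_least_one_collision(n_items, space_size):
    """Approximate chance that n uniform random draws contain a duplicate."""
    return 1.0 - math.exp(-expected_colliding_pairs(n_items, space_size))

n_people = 15_000_000   # unique SSNs, figure taken from the post
nine_digit = 10 ** 9    # assumption: replacement is a random 9-digit number

print(expected_colliding_pairs(n_people, nine_digit))
print(prob_at_least_one_collision(n_people, nine_digit))
```

Under those assumptions the estimate is on the order of a hundred thousand
colliding pairs, so random 9-digit values would need a duplicate check
against everything already issued.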

I will take a look at the various options I have with sha and md5, along with
the information regarding cryptography, to see what I can come up with.
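
One caveat with hashing: a 9-digit SSN has only 10**9 possible values, so a
plain sha or md5 digest can be reversed by hashing every candidate and
comparing. A keyed hash avoids that as long as the key stays secret. The
sketch below uses HMAC-SHA256 from the standard library; the function name,
the normalization step, and the key are placeholders, not anything from the
thread. The token cannot be turned back into the SSN, so a cross-reference
is still needed to relate results back.

```python
import hashlib
import hmac

def pseudonymize(ssn, secret_key):
    """Return a hex token derived from the SSN with a keyed hash (HMAC-SHA256)."""
    # Assumption: dashes and surrounding whitespace are not significant.
    normalized = ssn.replace("-", "").strip()
    return hmac.new(secret_key, normalized.encode("ascii"), hashlib.sha256).hexdigest()

# Placeholder key; a real one would be long, random, and kept private.
key = b"replace-with-a-long-random-secret"

print(pseudonymize("111111111", key))
print(pseudonymize("111-11-1111", key))  # same token as the line above
```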

Alternatively, I guess I could parse the address list and build a dictionary
where the key is the SSN and the value starts at 1 and is incremented as I
add additional SSN keys to the dictionary. I would hold onto this dictionary
for reference as information is fed back to me.
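
A minimal sketch of that dictionary approach, using the sample rows from the
original question. The function name assign_ids is made up for this example.

```python
def assign_ids(ssns):
    """Map each distinct SSN to 1, 2, 3, ... in order of first appearance."""
    ids = {}
    for ssn in ssns:
        if ssn not in ids:
            # The next number is simply one more than the count so far.
            ids[ssn] = len(ids) + 1
    return ids

rows = [
    ("John", "111111111"),
    ("John", "111111111"),
    ("Jane", "222222222"),
    ("Jill", "333333333"),
]

ids = assign_ids(ssn for _name, ssn in rows)
masked = [(name, ids[ssn]) for name, ssn in rows]
print(masked)  # [('John', 1), ('John', 1), ('Jane', 2), ('Jill', 3)]
```

The ids dictionary is the cross-reference to keep; inverting it gives the
lookup from reference number back to SSN.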

With respect to a potentially large dictionary, can you suggest efficient
ways of handling memory when working with objects of that size?
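
Whether 15 million entries fit comfortably in RAM depends on the machine. If
they do not, one option is to keep the cross-reference on disk instead of in
a dict. The sketch below uses the standard library sqlite3 module; the table
and function names are invented for this example, and ":memory:" is used only
so the snippet runs without creating a file (a real run would pass a file
path).

```python
import sqlite3

def open_xref(path):
    """Open (or create) the SSN cross-reference table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS xref ("
        " id INTEGER PRIMARY KEY,"
        " ssn TEXT NOT NULL UNIQUE)"
    )
    return conn

def get_id(conn, ssn):
    """Return the integer reference for an SSN, assigning one if it is new."""
    # The UNIQUE constraint makes a repeat insert a no-op.
    conn.execute("INSERT OR IGNORE INTO xref (ssn) VALUES (?)", (ssn,))
    row = conn.execute("SELECT id FROM xref WHERE ssn = ?", (ssn,)).fetchone()
    return row[0]

conn = open_xref(":memory:")
for ssn in ["111111111", "111111111", "222222222", "333333333"]:
    print(ssn, get_id(conn, ssn))
conn.commit()
```

The UNIQUE index on ssn also serves as the lookup when information is fed
back later, so nothing has to be loaded into memory up front.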

As always, your help is much appreciated.

G.

On Thu, May 22, 2008 at 1:39 PM, Kent Johnson <kent37 at tds.net> wrote:

> On Thu, May 22, 2008 at 12:14 PM, GTXY20 <gtxy20 at gmail.com> wrote:
> > Hello all,
> >
> > I will be dealing with an address list where I might have the following:
> >
> > Name SSN
> > John 111111111
> > John 111111111
> > Jane 222222222
> > Jill 333333333
> >
> > What I need to do is parse the address list and then create a unique
> > random unidentifiable value for the SSN field
>
> > The unique random value does not have to follow this convention but it
> > needs to be unique so that I can relate it back to the original SSN when
> > needed. As opposed to using the random module I was thinking that it
> > would be better to use either sha or md5. Just curious as to thoughts on
> > the correct approach.
>
> How are you relating back to the SSN? Are you keeping a
> cross-reference? If so, you might just assign sequence numbers for the
> unidentifiable value. If you want the key itself to be convertible
> back to the SSN (which wouldn't work with random values) you will need
> some cryptography. If you want a unique key that won't collide with
> other keys then sha or md5 is a better bet than random.
>
> Kent
>