Is there a faster way to do this?
Tomasz Rola
rtomek at ceti.com.pl
Tue Aug 5 22:57:08 EDT 2008
On Tue, 5 Aug 2008, ronald.johnson at gmail.com wrote:
> I have a csv file containing product information that is 700+ MB in
> size. I'm trying to go through and pull out unique product IDs only,
> as there are a lot of duplicates. My problem is that I am appending the
> ProductID to an array and then searching through that array each time
> to see if I've seen the product ID before, so each search takes longer
> and longer. I let the script run for 2 hours before killing it, and by
> then it had run through less than 1/10 of the file.
My take:
I assume about 80 bytes per line, which makes roughly 9.2 million lines
for a 700M file (700*1024*1024 / 80). To be quite sure, let's round it up
to 20 million. Next, I don't want to trash my disk with 700M+ files, so I
assume that reading a line, splitting it and extracting the product id
takes roughly as long as generating a random id in my code does. So, I:
1. read all records line by line (or, for this test, just generate random
ids) and append each product id to a list (actually, I preallocate the
list with empty space and fill it up) - for the real file, see the sketch
after this list
2. sort() the list
3. iterate over the sorted list, counting the unique ids and optionally
writing them out to a file
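Applied to the real file, step 1 would look something like the sketch
below. It is untested and makes assumptions on my part: that the csv
module can split your lines, and that the product id sits in the first
column - adjust to taste.

import csv

def read_product_ids ( fname ) :
    # Step 1: collect every product id into a plain list. No searching -
    # appending to a list is cheap, unlike scanning it on every record.
    ids = []
    f = open(fname, "rb")
    for row in csv.reader(f) :
        ids.append(row[0])   # assumption: product id is column 0
    f.close()
    return ids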
The code (most of it is just making random names, which was real fun):
import string

RECORD_NUM = 20*1024*1024 # 700M/80-bytes-per-line = ca. 10M+ records

ALPHA = string.digits + string.ascii_letters

RAND = None

def random_iter ( ) :
    # An ad-hoc multiply-and-fold pseudorandom generator - just a cheap
    # source of id numbers for the dry run, not a serious PRNG.
    x = 12345678910
    y = 234567891011
    M = 2**61 - 1
    M0 = 2**31 - 1
    pot10 = [ 1, 10, 100, 1000, 10000, 100000 ]
    while 1 :
        p = x * y
        # the index may go negative - Python then indexes from the end,
        # which is fine here, it only adds to the chaos
        l = pot10[5 - (p % 10)]
        n = (p / l) % M
        d = l * (n % 10)
        p = p % M0
        x1 = y - d + p
        y1 = x + d + p
        x, y = x1, y1
        yield n
        pass
    pass

def next_random ( ) :
    global RAND
    if RAND is None :
        RAND = random_iter()
    return RAND.next()

def num_to_name ( n ) :
    # Turn a number into an alphanumeric "product id" - effectively
    # base-62 digits drawn from ALPHA, least significant first.
    s = []
    len_a = len(ALPHA)
    while n > 0 :
        n1, r = divmod(n, len_a)
        s.append(ALPHA[r])
        n = n1
        pass
    return "".join(s)

def get_product_id ( ) :
    return num_to_name(next_random())

def dry_run ( n ) :
    # Preallocate the whole list, then fill it backwards - no repeated
    # append/resize, and certainly no searching.
    r = [ 0 ] * n
    while n > 0 :
        n -= 1
        r[n] = get_product_id()
    return r

###

if __name__ == "__main__":
    print "hello"
    for i in range(10) : print get_product_id()
    print
    print "RECORD_NUM: %d" % RECORD_NUM
    products = dry_run(RECORD_NUM)
    print "RECORDS PRODUCED: %d" % len(products)
    products.sort()
    # One linear pass over the sorted list: equal ids are now adjacent,
    # so counting unique ids means counting changes of value.
    i = 0
    lastp = ""
    count = 0
    while i < len(products) :
        if lastp != products[i] :
            lastp = products[i]
            count += 1
        i += 1
        pass
    print "UNIQUE RECORDS: %d" % count
I may have made some bugs, but it works on my computer. Or seems to ;-/.
For 20 mln products, on my Athlon XP @1800 / 1.5G ram, Debian Linux box,
it takes about 13 minutes to generate the list, about 3 minutes to sort
it, and a few more seconds to skim it (writing should not take much
longer). All summed up, about 18 minutes of real time, with some other
programs computing a little in the background - so, much less than 2
hours.
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola at bigfoot.com **