Is there a faster way to do this?
Gary Herron
gherron at islandtraining.com
Tue Aug 5 12:45:42 EDT 2008
ronald.johnson at gmail.com wrote:
> I have a csv file containing product information that is 700+ MB in
> size. I'm trying to go through and pull out unique product ID's only
> as there are a lot of multiples. My problem is that I am appending the
> ProductID to an array and then searching through that array each time
> to see if I've seen the product ID before. So each search takes longer
> and longer. I let the script run for 2 hours before killing it and had
> only run through less than 1/10 of the file.
>
Store your IDs in a dictionary or a set. Then test for existence
of a new ID in that set. That test will be *much* more efficient than
searching a list. (It uses a hashing scheme.)
    IDs = set()
    for row in ...:
        ID = extractIdFromRow(row)
        if ID not in IDs:
            IDs.add(ID)
            ... whatever ...
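As a minimal runnable sketch of the pattern above (written for Python 3; the sample rows and the helper name `first_rows_with_new_id` are made up for illustration, and the ID position 2 is taken from the `break_down[2]` in the quoted script):

```python
def first_rows_with_new_id(rows, id_index=2):
    """Return the rows whose ID (field id_index) has not been seen before."""
    seen = set()   # hash-based, so "in" does not scan every stored ID
    kept = []
    for row in rows:
        product_id = row[id_index]
        if product_id not in seen:
            seen.add(product_id)
            kept.append(row)
    return kept


# Hypothetical sample data: the third field is the product ID.
sample_rows = [
    ["widget", "red", "A100"],
    ["widget", "blue", "A100"],
    ["gadget", "red", "B200"],
]
unique_rows = first_rows_with_new_id(sample_rows)
```

Here `unique_rows` keeps only the first row seen for each ID, in input order.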
In fact if *all* you are doing is trying to identify all product IDs
that occur in the file (no matter how many times they occur)
    IDs = set()
    for row in ...:
        ID = extractIdFromRow(row)
        IDs.add(ID)
and your set will contain *one* copy of each ID added, no matter how
many times it was added.
Better yet, if you can write your ID extraction as a generator or list
comprehension...
    IDs = set(extractIdFromRow(row) for row in rowsOfTable)
or some such would be most efficient.
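Applied to the quoted script, the same idea might look like the sketch below. It assumes Python 3, that the ID is the third comma-separated field as in the quoted code, that it is not the last field on the line, and that a plain `split(",")` is adequate (no quoted commas in the data). The name `copy_first_occurrences` is hypothetical. It accepts any iterable of lines, so real file objects from `open()` could be passed in place of the `io.StringIO` stand-ins.

```python
import io


def copy_first_occurrences(lines, out, id_index=2):
    """Write each line whose ID field is new; return (lines read, lines written)."""
    seen = set()
    input_count = 0
    output_count = 0
    for line in lines:
        input_count += 1
        product_id = line.split(",")[id_index]
        if product_id not in seen:
            seen.add(product_id)
            out.write(line)
            output_count += 1
    return input_count, output_count


# Stand-ins for the real input and output files (made-up data).
source = io.StringIO("x,y,A100,1\nx,y,A100,2\nx,y,B200,3\n")
target = io.StringIO()
counts = copy_first_occurrences(source, target)
```

Each line is read once and each membership test is a hash lookup, so the work grows roughly in proportion to the file size rather than with the square of the number of IDs.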
Gary Herron
> Here's the code:
> import string
>
> def checkForProduct(product_id, product_list):
>     for product in product_list:
>         if product == product_id:
>             return 1
>     return 0
>
>
> input_file="c:\\input.txt"
> output_file="c:\\output.txt"
> product_info = []
> input_count = 0
>
> input = open(input_file,"r")
> output = open(output_file, "w")
>
> for line in input:
>     break_down = line.split(",")
>     product_number = break_down[2]
>     input_count+=1
>     if input_count == 1:
>         product_info.append(product_number)
>         output.write(line)
>         output_count = 1
>     if not checkForProduct(product_number,product_info):
>         product_info.append(product_number)
>         output.write(line)
>         output_count+=1
>
> output.close()
> input.close()
> print input_count
> print output_count