Is there a faster way to do this?
Gary Herron
gherron at islandtraining.com
Tue Aug 5 13:02:34 EDT 2008
Avinash Vora wrote:
> On Aug 5, 2008, at 10:00 PM, ronald.johnson at gmail.com wrote:
>
>> I have a csv file containing product information that is 700+ MB in
>> size. I'm trying to go through and pull out unique product ID's only
>> as there are a lot of multiples. My problem is that I am appending the
>> ProductID to an array and then searching through that array each time
>> to see if I've seen the product ID before. So each search takes longer
>> and longer. I let the script run for 2 hours before killing it and had
>> only run through less than 1/10 of the file.
>
> Why not split the file into more manageable chunks, especially as it's
> just what seems like plaintext?
>
>> Here's the code:
>> import string
>>
>> def checkForProduct(product_id, product_list):
>>     for product in product_list:
>>         if product == product_id:
>>             return 1
>>     return 0
>>
>>
>> input_file="c:\\input.txt"
>> output_file="c:\\output.txt"
>> product_info = []
>> input_count = 0
>>
>> input = open(input_file,"r")
>> output = open(output_file, "w")
>>
>> for line in input:
>>     break_down = line.split(",")
>>     product_number = break_down[2]
>>     input_count += 1
>>     if input_count == 1:
>>         product_info.append(product_number)
>>         output.write(line)
>>         output_count = 1
>
> This seems redundant.
>
>>     if not checkForProduct(product_number, product_info):
>>         product_info.append(product_number)
>>         output.write(line)
>>         output_count += 1
>
> File writing is extremely expensive. In fact, so is reading. Think
> about reading the file in whole chunks. Put those chunks into Python
> data structures, and make your output information in Python data
> structures.
Don't bother yourself with this suggestion about reading in chunks --
Python already does this for you, and does so more efficiently than you
could. The code

    for line in open(input_file, "r"):

reads in large chunks (efficiently) and then serves up the contents
line-by-line.
Gary Herron
> If you use a dictionary and search the ID's there, you'll notice some
> speed improvements as Python does a dictionary lookup far quicker than
> searching a list. Then, output your data all at once at the end.
>
> --
> Avi
>
> --
> http://mail.python.org/mailman/listinfo/python-list