python compare and process a csv file

Wed Nov 4 05:22:04 EST 2009

Chris Rebert wrote:

> On Tue, Nov 3, 2009 at 7:43 AM, Siva Subramanian
> <elpostssv at rediffmail.com> wrote:
>> Hello all,
>>
>> I am new on this list and computer programming
>>
>> I have two distinct statistical files (both csv)
>>
>> 1.       Report_2_5 – this is a report dump containing over a 10 million
>> records and is different every day
>>
>> 2.       Customer_id dump – this is a daily dump of customers who have
>> made payments. This is generally a million record
>>
>> I need to extract past history depending on customers who make regular
>> payments
>>
>> For example,
>>
>> Report_2_5
>>
>> Customer ID, Plan_NO, stat1, vol2, amount3
>> 2134, Ins1, 10000, 20000, 10
>> 2112, Ins3, 30000, 20000, 10
>> 2121, Ins3, 30000, 20000, 10
>> 2145, Ins2, 15000, 10000, 5
>> 2245, Ins2, 15000, 10000, 5
>> 0987, Ins1, 10000, 20000, 10
>>
>> 4546, Ins1, 10020, 21000, 10
>>
>> 6757, Ins1, 10200, 22000, 10
>> …
>>
>> customer_id dump
>>
>>
>> 0987
>>
>> 4546
>>
>> 6757
>>
>> 2134
>>
>> I need to process the Report_2_5 and extract the following output
>>
>> Stat1: 40220
>> Vol2 : 83000
>> Amount3 : 40
>>
>> I am new to programming and I have been extracting this data using MS –
>> Access and I badly need a better solution.
> 
> Have you considered using a proper SQL database? (See
> http://en.wikipedia.org/wiki/SQL ; MySQL is one example:
> http://en.wikipedia.org/wiki/MySQL)
> Mucking around with CSV files like this is basically doing the work of
> some simple SQL queries, only in an ad-hoc, inefficient manner. (MS
> Access is essentially a non-industrial-strength SQL for
> non-programmers.)

Industrial strength or not, Access should be capable of solving the OP's 
problem. So it would be interesting what's so bad about it in this case.

Anyway, here's a database-free python solution:

import csv

REPORT = "report.csv"
CUSTOMERS = "customers.csv"

with open(CUSTOMERS) as instream:
    next(instream) # skip header

    # put customer ids into a set for fast lookup
    customer_ids = set(line.strip() for line in instream)

with open(REPORT) as instream:
    rows = csv.reader(instream)

    # find columns
    headers = [column.strip() for column in rows.next()]
    customer_column = headers.index("Customer ID")
    sum_over_columns = [headers.index(s) for s in "stat1 vol2 
amount3".split()]

    # initialize totals
    sigma = [0] * len(headers)

    # calculate totals
    for row in rows:
        if row[customer_column] in customer_ids:
            for index in sum_over_columns:
                sigma[index] += int(row[index])
# print totals
for index in sum_over_columns:
    print "%-10s %10d" % (headers[index] + ":", sigma[index])

The limiting factor for this approach is the customer_ids set which at some 
point may no longer fit into memory.

Peter