python compare and process a csv file
Peter Otten
__peter__ at web.de
Wed Nov 4 05:22:04 EST 2009
Chris Rebert wrote:
> On Tue, Nov 3, 2009 at 7:43 AM, Siva Subramanian
> <elpostssv at rediffmail.com> wrote:
>> Hello all,
>>
>> I am new to this list and to computer programming
>>
>> I have two distinct statistical files (both csv)
>>
>> 1. Report_2_5 – this is a report dump containing over 10 million
>> records and is different every day
>>
>> 2. Customer_id dump – this is a daily dump of customers who have
>> made payments. This is generally about a million records
>>
>> I need to extract past history for customers who make regular
>> payments
>>
>> For example,
>>
>> Report_2_5
>>
>> Customer ID, Plan_NO, stat1, vol2, amount3
>> 2134, Ins1, 10000, 20000, 10
>> 2112, Ins3, 30000, 20000, 10
>> 2121, Ins3, 30000, 20000, 10
>> 2145, Ins2, 15000, 10000, 5
>> 2245, Ins2, 15000, 10000, 5
>> 0987, Ins1, 10000, 20000, 10
>>
>> 4546, Ins1, 10020, 21000, 10
>>
>> 6757, Ins1, 10200, 22000, 10
>> …
>>
>> customer_id dump
>>
>>
>> 0987
>>
>> 4546
>>
>> 6757
>>
>> 2134
>>
>> I need to process the Report_2_5 and extract the following output
>>
>> Stat1: 40220
>> Vol2 : 83000
>> Amount3 : 40
>>
>> I am new to programming. I have been extracting this data using MS
>> Access, and I badly need a better solution.
>
> Have you considered using a proper SQL database? (See
> http://en.wikipedia.org/wiki/SQL ; MySQL is one example:
> http://en.wikipedia.org/wiki/MySQL)
> Mucking around with CSV files like this is basically doing the work of
> some simple SQL queries, only in an ad-hoc, inefficient manner. (MS
> Access is essentially a non-industrial-strength SQL database for
> non-programmers.)
Industrial strength or not, Access should be capable of solving the OP's
problem, so it would be interesting to hear what's so bad about it in this
case.
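For comparison, here's a rough, untested sketch of what the SQL route could
look like with the sqlite3 module from the standard library. The file names,
column names and the in-memory database are my assumptions, not something
from the original post:

import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table report (customer_id text, plan_no text,"
           " stat1 integer, vol2 integer, amount3 integer)")
db.execute("create table customers (customer_id text primary key)")

# load the report dump
with open("report.csv") as instream:
    rows = csv.reader(instream)
    next(rows) # skip header
    db.executemany("insert into report values (?, ?, ?, ?, ?)",
                   ([field.strip() for field in row] for row in rows if row))

# load the customer ids
with open("customers.csv") as instream:
    next(instream) # skip header
    db.executemany("insert into customers values (?)",
                   ((line.strip(),) for line in instream if line.strip()))

# let SQL do the filtering and the sums
stat1, vol2, amount3 = db.execute(
    "select sum(stat1), sum(vol2), sum(amount3) from report"
    " where customer_id in (select customer_id from customers)").fetchone()
print "stat1: %d, vol2: %d, amount3: %d" % (stat1, vol2, amount3)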
Anyway, here's a database-free python solution:
import csv

REPORT = "report.csv"
CUSTOMERS = "customers.csv"

with open(CUSTOMERS) as instream:
    next(instream) # skip header
    # put customer ids into a set for fast lookup
    customer_ids = set(line.strip() for line in instream)

with open(REPORT) as instream:
    rows = csv.reader(instream)
    # find columns
    headers = [column.strip() for column in rows.next()]
    customer_column = headers.index("Customer ID")
    sum_over_columns = [headers.index(s)
                        for s in "stat1 vol2 amount3".split()]
    # initialize totals
    sigma = [0] * len(headers)
    # calculate totals
    for row in rows:
        if row[customer_column] in customer_ids:
            for index in sum_over_columns:
                sigma[index] += int(row[index])

# print totals
for index in sum_over_columns:
    print "%-10s %10d" % (headers[index] + ":", sigma[index])
The limiting factor for this approach is the customer_ids set, which at some
point may no longer fit into memory.
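Should you ever hit that limit, one untested workaround is to keep the
customer ids in an on-disk sqlite3 database instead of a Python set. The
file name customers.db and the helper is_customer() below are made up for
the sketch; everything else mirrors the script above:

import sqlite3

db = sqlite3.connect("customers.db")
db.execute("create table if not exists customers"
           " (customer_id text primary key)")

with open(CUSTOMERS) as instream:
    next(instream) # skip header
    db.executemany("insert or ignore into customers values (?)",
                   ((line.strip(),) for line in instream if line.strip()))
db.commit()

def is_customer(customer_id, db=db):
    # primary-key lookup; sqlite keeps the index on disk, not in RAM
    return db.execute("select 1 from customers where customer_id = ?",
                      (customer_id,)).fetchone() is not None

The main loop would then call is_customer(row[customer_column]) instead of
testing set membership; slower per row, but memory use stays flat.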
Peter