[Tutor] Processing CSV files

Wed Oct 9 01:45:32 CEST 2013

Hello there,

 : We have a box with 16GB RAM so RAM should not be an issue 
 : hopefully.
 : 
 : The datastore is Cassandra and I'm hoping to use the pycassa 
 : library for interaction.
 : 
 : I do have an additional question related to Cassandra & Python. 
 : As part of data processing, I need to fetch slices of data from 
 : Cassandra and run computations like sum and percentile 
 : calculation on it. The sum along with other attributes needs to 
 : be stored back in another Cassandra table that will be queried by 
 : end users of a reporting system.  This is because Cassandra does 
 : not provide any aggregation functions, so we will precompute the 
 : aggregations and store them in Cassandra.
 : 
 : So for calculating the sum & percentile in Python, some of the 
 : data slices on Cassandra could fetch a lot of rows (e.g. 750,000 
 : to 1 million rows) … And since I need to compute a sum and 
 : percentile, I need to consider all the rows. I am planning to do 
 : this in Python. Do you foresee any issues with this approach? Any 
 : advice on this will be greatly appreciated.

Python's built-in sum() will handle a million values just fine.  
Computation won't be the bottleneck; retrieving the rows from 
Cassandra is more likely to be your problem.  To simulate summing 
1 million rows, which runs pretty darned quickly, try this out:

  >>> # Python 2 session; on Python 3 use range() instead of xrange()
  >>> import random
  >>> sum(random.sample(xrange(10000000),1000000))
  5002880911167
  >>> len(random.sample(xrange(10000000),1000000))
  1000000

For percentiles, do you know about numpy and pandas?  You may want 
to look into these good, mature, third-party Python libraries.

  numpy:   http://www.numpy.org/
  pandas:  http://pandas.pydata.org/

They are both nice tools for working with data in Python.  The 
pandas DataFrame bears a clear resemblance to R's data frames.  I 
think numpy is one of the oldest scientific computing libraries 
available for Python.
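
To make clear what a percentile calculation involves, here is a 
rough standard-library sketch that uses linear interpolation 
between the two nearest ranks.  The function name percentile and 
the sample values are hypothetical, and this is only one of several 
percentile definitions, so check which one your reports need.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) of values.

    Uses linear interpolation between the two closest ranks.
    """
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)             # every row must be in memory at once
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100.0  # fractional index into sorted data
    lower = int(math.floor(pos))
    upper = int(math.ceil(pos))
    if lower == upper:
        return float(ordered[lower])
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


rows = [15, 20, 35, 40, 50]              # hypothetical slice values
print(sum(rows), percentile(rows, 50), percentile(rows, 90))
```

As far as I know, numpy.percentile(rows, 90) uses the same linear 
interpolation by default, and on a million rows it will be a good 
deal faster than sorting a plain list.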

Good luck,

-Martin

-- 
Martin A. Brown
http://linux-ip.net/