[Tutor] Processing CSV files
Martin A. Brown
martin at linux-ip.net
Wed Oct 9 01:45:32 CEST 2013
Hello there,
: We have a box with 16GB RAM so RAM should not be an issue
: hopefully.
:
: The datastore is Cassandra and I'm hoping to use the pycassa
: library for interaction.
:
: I do have an additional question related to Cassandra & Python.
: As part of data processing, I need to fetch slices of data from
: Cassandra and run computations like sum and percentile
: calculation on it. The sum along with other attributes needs to
: be stored back in another Cassandra table that will be queried by
: end users of a reporting system. This is because Cassandra does
: not provide any aggregation functions, so we will precompute the
: aggregations and store them in Cassandra.
:
: So for calculating the sum & percentile in Python, some of the
: data slices on Cassandra could fetch a lot of rows (e.g. 750,000
: to 1 million rows) … And since I need to compute a sum and
: percentile, I need to consider all the rows. I am planning to do
: this in Python. Do you foresee any issues with this approach? Any
: advice on this will be greatly appreciated.
The built-in sum() will handle this just fine. Computation won't be
the bottleneck; retrieval is more likely to be your problem. Here is
a simulation with 1 million rows, which runs pretty darned quickly.
Try this out:
>>> import random
>>> # xrange is Python 2; on Python 3, use range instead
>>> sum(random.sample(xrange(10000000),1000000))
5002880911167
>>> len(random.sample(xrange(10000000),1000000))
1000000
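Since retrieval is the likely bottleneck, one option is to add up the rows as they arrive instead of collecting them all first. Here is a rough sketch of that idea. Note that fetch_rows() is a hypothetical stand-in that just generates random integers in chunks; it is not a pycassa call, and the chunk size is an arbitrary assumption.

```python
import random


def fetch_rows(n, chunk_size=10000):
    """Hypothetical stand-in for a datastore query.

    Yields lists of random integers, chunk_size at a time, until n
    values have been produced. A real version would page through
    the datastore instead.
    """
    remaining = n
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield [random.randrange(10000000) for _ in range(size)]
        remaining -= size


def streaming_sum(chunks):
    """Return (total, count) over an iterable of lists of numbers."""
    total = 0
    count = 0
    for chunk in chunks:
        total += sum(chunk)   # built-in sum() per chunk
        count += len(chunk)
    return total, count


total, count = streaming_sum(fetch_rows(1000000))
```

Only one chunk is held in memory at a time, so this works for a sum. A percentile still needs all the values (or an approximation), so it does not help there.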
If you want to do percentiles, then ... do you know about numpy and
pandas? You may want to look into these good, mature, third-party
Python libraries.
numpy: http://www.numpy.org/
pandas: http://pandas.pydata.org/
They are both nice tools for working with data in Python. Pandas
bears a clear resemblance to R. I think numpy is one of the oldest
scientific computational libraries available for Python.
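To give a feel for what a percentile calculation involves, here is a small standard-library sketch using the nearest-rank definition. The function name percentile_nearest_rank is my own, made up for illustration. numpy.percentile covers the same ground, but it interpolates linearly between values by default, so its answers can differ slightly from this one.

```python
import math
import random


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values (nearest-rank method).

    pct must satisfy 0 < pct <= 100. The whole data set is sorted,
    so every value has to be in memory.
    """
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: smallest rank such that at least pct percent
    # of the values are at or below the value at that rank.
    rank = int(math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Simulated data: 1 million distinct integers, as a stand-in for real rows.
data = random.sample(range(10000000), 1000000)
p95 = percentile_nearest_rank(data, 95)
```

Sorting 1 million integers like this is quick on ordinary hardware, so the percentile should not be a bottleneck either.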
Good luck,
-Martin
--
Martin A. Brown
http://linux-ip.net/