Numpy and Terabyte data
Albert-Jan Roskam
sjeik_appie at hotmail.com
Wed Jan 3 15:03:30 EST 2018
On Jan 2, 2018 18:27, Rustom Mody <rustompmody at gmail.com> wrote:
>
> Someone who works in hadoop asked me:
>
> If our data is in terabytes, can we do statistical analysis
> (i.e. numpy, pandas etc.) on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas etc.)
> not to work if the data does not fit in memory.
>
> Well, sure, *python* can handle (streams of) terabyte data, I guess, but
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is a just a figure of speech for "too large for main memory"]
Have a look at PySpark and pyspark.ml. PySpark has its own DataFrame implementation, which is evaluated lazily and distributed across a cluster rather than held in one process's memory. Very, very cool stuff.
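A minimal sketch of what that can look like (the CSV path is made up, and this assumes a local Spark install; on a real cluster only the session setup changes):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("terabyte-stats").getOrCreate()

    # "data.csv" is a hypothetical path; Spark reads it lazily,
    # so the file does not need to fit in memory.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # Basic descriptive statistics, computed in a distributed fashion.
    df.describe().show()

    spark.stop()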
Dask DataFrames have been mentioned already.
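Dask gives you a pandas-like API over chunked, out-of-core data. A rough sketch (the glob pattern and column name are assumptions):

    import dask.dataframe as dd

    # Many CSV files presented as one logical DataFrame; nothing is loaded eagerly.
    df = dd.read_csv("data-*.csv")  # hypothetical file pattern

    # Familiar pandas-style operations; .compute() triggers the chunked work.
    print(df["value"].mean().compute())  # "value" is an assumed column name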
numpy has memmapped arrays: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
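With a memmapped array the data is paged in from disk on access instead of being loaded up front, so reductions over files larger than RAM still work (if slowly). A small sketch (file name, dtype and shape are assumptions):

    import numpy as np

    # Map an existing binary file of float64 values as a read-only array.
    arr = np.memmap("big_data.dat", dtype="float64", mode="r", shape=(10000000,))

    # Streams through the whole array without holding it all in memory at once.
    print(arr.mean())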