[scikit-learn] Need help in dealing with large dataset

CHETHAN MURALI chethanmuralisv at gmail.com
Mon Mar 5 11:19:39 EST 2018


Dear All,

I am working on building a CNN model for an image classification problem.
As part of it, I have converted all my images to a NumPy array.

Now, when I try to split the array into training and test sets, I get a
MemoryError.
Details are below:

import numpy as np

# Load the saved array memory-mapped so the 20GB file is not read into RAM at once
X = np.load("./data/X_train.npy", mmap_mode='r')

# 80/20 train/test split by index
train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)

# This cast is the line that raises the MemoryError below
X_train = X_train.astype('float32')
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-46-9180807e01dc> in <module>()
      2 print("Normalizing Data")
      3
----> 4 X_train = X_train.astype('float32')

MemoryError:
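
If my back-of-the-envelope arithmetic is right (my own estimate, assuming
the standard 4 bytes per float32 element), the copy that astype() tries to
allocate is roughly twice my RAM, which would explain the error:

elements = 85108 * 256 * 256 * 3      # ~1.67e10 values
bytes_needed = elements * 4           # float32 is 4 bytes per element
print(bytes_needed / 1024**3)         # ~62.3 GiB, far above my 32GB RAM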

More information:

1. My Python version:

python --version
Python 3.6.4 :: Anaconda custom (64-bit)

2. I am running the code on Ubuntu 16.04.

3. I have 32GB of RAM.

4. The X_train.npy file that I load into the array is 20GB:

print("X_train Shape: ", X_train.shape)
X_train Shape:  (85108, 256, 256, 3)
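
One workaround I have been considering, though I have not tested it: do
the cast (and the usual divide-by-255 normalization, which I am assuming
here) in chunks, writing the float32 result to a second memory-mapped
.npy file so that only one small slice is in RAM at a time. The output
path and chunk size below are placeholders I made up:

import numpy as np

X = np.load("./data/X_train.npy", mmap_mode='r')

# Create a float32 .npy file on disk with the same shape, memory-mapped
out = np.lib.format.open_memmap("./data/X_train_f32.npy",  # placeholder path
                                mode='w+', dtype='float32', shape=X.shape)

chunk = 1024  # placeholder chunk size; tune to available RAM
for start in range(0, len(X), chunk):
    stop = min(start + chunk, len(X))
    # Only this slice is read into RAM, cast, scaled, and written to disk
    out[start:stop] = X[start:stop].astype('float32') / 255.0
out.flush()

If something like this is a sane approach I would still appreciate
confirmation, since I have not run it against the full 20GB file.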

I would be really glad if you could help me overcome this problem.

Regards,
-
Chethan