[scikit-learn] Need help in dealing with large dataset

Sebastian Raschka se.raschka at gmail.com
Mon Mar 5 12:28:27 EST 2018


Like Guillaume suggested, you don't want to load the whole array into memory if it's that large. There are many ways to deal with this. The most naive would be to break your NumPy array up into smaller arrays and load them iteratively, keeping a running accuracy calculation. My suggestion would be to create an HDF5 file from the NumPy array, with one image per entry. If these are just the test images, you can also store a whole batch per entry, since you don't need to shuffle them anyway.
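For instance, here is a minimal sketch of the conversion (assuming the array lives in ./data/X_train.npy; the HDF5 file name and batch size are made up):

import h5py
import numpy as np

# Memory-map the array so slices are read from disk lazily rather than all at once.
X = np.load("./data/X_train.npy", mmap_mode='r')

with h5py.File("./data/images.h5", "w") as f:
    # One image per HDF5 chunk, so single entries can later be read cheaply.
    dset = f.create_dataset("images", shape=X.shape, dtype=X.dtype,
                            chunks=(1,) + X.shape[1:])
    batch_size = 1000
    for start in range(0, len(X), batch_size):
        # Copy in batches so only ~batch_size images sit in RAM at a time.
        dset[start:start + batch_size] = X[start:start + batch_size]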

Ultimately, the sweet spot between performance and convenience depends on which DL framework you use. Since this is a scikit-learn list, I suppose you are using sklearn objects (although I am not aware that sklearn has CNNs). PyTorch's DataLoader is universally useful, though, and comes in handy no matter which CNN implementation you use. I have some examples here if that helps, plus a minimal sketch after the links:

- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb
- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb
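And roughly what a custom Dataset along those lines can look like, reading from an HDF5 file like the one sketched above (file and dataset names are placeholders):

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5 = None  # opened lazily so each DataLoader worker gets its own handle
        with h5py.File(h5_path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.h5 is None:
            self.h5 = h5py.File(self.h5_path, "r")
        img = self.h5["images"][idx]                  # reads one chunk from disk
        img = torch.from_numpy(img).permute(2, 0, 1)  # HWC -> CHW
        return img.float() / 255.0                    # scale to [0, 1]

loader = DataLoader(H5ImageDataset("./data/images.h5"),
                    batch_size=64, shuffle=True, num_workers=2)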

Best,
Sebastian


> On Mar 5, 2018, at 12:13 PM, Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:
> 
> If you work with a deep net, you need to check the utilities of that deep net library.
> For instance, in Keras you should create a batch generator if you need to deal with a large dataset.
> In PyTorch you can use the DataLoader together with ImageFolder from torchvision, which manage
> the loading for you; a rough sketch follows.
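> For instance (the folder layout and parameters are made up):
> 
> from torch.utils.data import DataLoader
> from torchvision import datasets, transforms
> 
> # ImageFolder expects one subdirectory per class under the root path.
> dataset = datasets.ImageFolder("./data/train",
>                                transform=transforms.ToTensor())
> loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)
> 
> # Images are read from disk batch by batch while iterating.
> for images, labels in loader:
>     pass  # training step goes here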
> 
> On 5 March 2018 at 17:19, CHETHAN MURALI <chethanmuralisv at gmail.com> wrote:
> Dear All,
> 
> I am working on building a CNN model for an image classification problem.
> As part of it I have converted all my test images to a NumPy array.
> 
> Now when I am trying to split the array into training and test sets, I am getting a memory error.
> Details are below:
> 
> X = np.load("./data/X_train.npy", mmap_mode='r')
> 
> train_pct_index = int(0.8 * len(X))
> 
> X_train, X_test = X[:train_pct_index], X[train_pct_index:]
> 
> X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)
> 
> X_train = X_train.astype('float32')
> 
> -------------------------------------------------
> MemoryError                               Traceback (most recent call last)
> <ipython-input-46-9180807e01dc> in <module>()
>       2 print("Normalizing Data")
>       3
> ----> 4 X_train = X_train.astype('float32')
> 
> More information:
> 
> 1. My Python version:
> 
>    python --version
>    Python 3.6.4 :: Anaconda custom (64-bit)
> 
> 2. I am running the code on Ubuntu 16.04.
> 
> 3. I have 32 GB RAM.
> 
> 4. The X_train.npy file that I loaded into the NumPy array is 20 GB:
> 
>    print("X_train Shape: ", X_train.shape)
>    X_train Shape:  (85108, 256, 256, 3)
> I would be really glad if you could help me overcome this problem.
> 
> Regards,
> -
> Chethan
> 
> 
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


