[scikit-learn] IPython Jupyter Kernel Dies when I fit an SGDClassifier

Stuart Reynolds stuart at stuartreynolds.net
Fri Jun 2 13:39:48 EDT 2017


Hmmm... is it possible to place your original data into a memmap?
(perhaps that would free up the 8 GB, depending on SGDClassifier's internals?)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas
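
A rough sketch of what I mean -- the filename, dtype, and chunk size here are
made up, and whether SGDClassifier still materializes an in-memory copy during
fit is exactly the open question:

import numpy as np

n_samples, n_features = 1000000, 1000

# Build a disk-backed array once, filling it in chunks so the full matrix
# never has to live in RAM as a regular ndarray. float32 also halves the
# footprint relative to float64.
X_mm = np.memmap('X.dat', dtype=np.float32, mode='w+',
                 shape=(n_samples, n_features))
for start in range(0, n_samples, 10000):
    stop = start + 10000
    X_mm[start:stop] = np.random.random((stop - start, n_features))
X_mm.flush()

# Later sessions can reopen the file without loading it into memory:
X = np.memmap('X.dat', dtype=np.float32, mode='r',
              shape=(n_samples, n_features))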

- Stuart

On Fri, Jun 2, 2017 at 10:30 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> I also think this is likely a memory-related issue. I just ran the following snippet in a Jupyter notebook:
>
> import numpy as np
> from sklearn.linear_model import SGDClassifier
>
> model = SGDClassifier(loss='log', penalty=None, alpha=0.0, l1_ratio=0.0,
>                        fit_intercept=False, n_iter=1, shuffle=False,
>                        learning_rate='constant', eta0=1.0)
>
> X = np.random.random((1000000, 1000))
> y = np.zeros(1000000)
> y[:1000] = 1
>
> model.fit(X, y)
>
>
>
> The dataset takes approx. 8 GB (1,000,000 x 1,000 float64 values at 8 bytes each), but the model fitting consumes ~16 GB -- probably because a copy of the X array is made somewhere in the code. The notebook didn't crash for me, but on machines with less RAM this could well be the issue. One workaround you could try is to fit the model iteratively using partial_fit, for example 1000 samples at a time or so:
>
>
> indices = np.arange(y.shape[0])
> batch_size = 1000
>
> for start_idx in range(0, indices.shape[0] - batch_size + 1,
>                        batch_size):
>     index_slice = indices[start_idx:start_idx + batch_size]
>     model.partial_fit(X[index_slice], y[index_slice], classes=[0, 1])
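>
> (A loop written like this would silently skip a final, smaller batch if
> batch_size didn't divide the number of samples -- here 1,000,000 is an
> exact multiple of 1,000, so nothing is lost. A sketch of the same idea
> using sklearn.utils.gen_batches, which also yields the trailing slice:)
>
> from sklearn.utils import gen_batches
>
> # gen_batches yields slice objects covering all samples, including the
> # last partial batch, so no rows are dropped.
> for batch in gen_batches(y.shape[0], batch_size):
>     model.partial_fit(X[batch], y[batch], classes=[0, 1])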
>
>
>
> Best,
> Sebastian
>
>
>> On Jun 2, 2017, at 6:50 AM, Iván Vallés Pérez <ivanvallesperez at gmail.com> wrote:
>>
>> Are you monitoring your RAM consumption? I would say that running out of memory is the cause of the majority of kernel crashes.
>> On Fri, Jun 2, 2017 at 12:45, Aymen J <ay.j at hotmail.fr> wrote:
>> Hey Guys,
>>
>>
>> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
>>
>>
>> Here is my model:
>>
>>
>> from sklearn.linear_model import SGDClassifier
>>
>> model = SGDClassifier(loss='log', penalty=None, alpha=0.0, l1_ratio=0.0,
>>                        fit_intercept=False, n_iter=1, shuffle=False,
>>                        learning_rate='constant', eta0=1.0)
>>
>> When I run the model.fit function, the program runs for about 5 minutes and then I receive the message "the kernel has died" from Jupyter.
>>
>> Any idea what may cause that? Is my training data too big (in terms of features)? Is there anything I can change (parameters) so that training finishes?
>>
>> Thanks in advance for your help!
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>

