[scikit-learn] combining arrays of features to train an MLP

Sebastian Raschka se.raschka at gmail.com
Tue Dec 20 14:00:17 EST 2016


Hi, Thomas,
I haven’t looked into what RandomizedLasso does exactly, but as you said, it is probably not ideal to combine it with an MLP. In terms of regularization, I was thinking more of L1 and L2 penalties on the hidden layers, or dropout. However, given such a small sample size (and the small sample-to-feature ratio), I think there are far too many (hyper)parameters to fit in an MLP to get good results. You might be better off with a kernel SVM (if linear models don’t work well) or with ensemble learning.
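To make the suggestion concrete, here is a minimal sketch (on synthetic stand-in data with the shapes from this thread, 31 samples and 15 features) of cross-validating an RBF-kernel SVR, which has only a handful of hyperparameters to tune compared with an MLP’s many weights:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Hypothetical small dataset: 31 samples, 15 selected features
rng = np.random.RandomState(0)
X = rng.randn(31, 15)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(31)

# An RBF-kernel SVR exposes only C, gamma, and epsilon, which is far
# easier to fit reliably on 31 samples than an MLP's weight matrices.
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, gamma='scale'))
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean())
```

The hyperparameter values here are placeholders; on a real dataset this small they would typically be chosen with a grid search inside the cross-validation loop.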

Best,
Sebastian

> On Dec 19, 2016, at 6:51 PM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> 
> Thank you, these articles discuss ML applications of the types of fingerprints I am working with! I will read them thoroughly to get some hints.
> 
> In the meantime I tried to eliminate some features using RandomizedLasso, and the performance jumped from R=0.067 using all 615 features to R=0.524 using only the 15 top-ranked features. Naive question: does it make sense to use RandomizedLasso to select good features for training an MLP? I had the impression that RandomizedLasso uses multivariate linear regression to fit the observed values to the experimental ones and ranks the features accordingly.
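RandomizedLasso has since been deprecated and removed from scikit-learn, but a comparable Lasso-based selection feeding an MLP can be sketched with SelectFromModel (a stand-in, not the original stability-selection procedure). Putting the selector inside a Pipeline matters: the selection is then refit on each training fold, so the cross-validated score is not inflated by features chosen on the full dataset. The data below is synthetic, with the shapes from this thread:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(42)
X = rng.randn(31, 615)                      # 31 observations, 615 features
y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(31)

# Lasso-based selection as a stand-in for RandomizedLasso (deprecated);
# threshold=-inf with max_features keeps the 15 largest coefficients.
selector = SelectFromModel(Lasso(alpha=0.1), max_features=15,
                           threshold=-np.inf)
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
pipe = make_pipeline(selector, mlp)

# Selection is refit inside every fold, so the CV estimate is not
# biased by features chosen on the full dataset.
y_pred = cross_val_predict(pipe, X, y, cv=5)
r = np.corrcoef(y, y_pred)[0, 1]
```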
> 
> Another question: this dataset consists of 31 observations. The Pearson's R values I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used to train the MLP is much larger than the number of observations?
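For reference, one common way to compute such a cross-validated Pearson's R on a dataset this small is leave-one-out, which gives every sample exactly one held-out prediction. A minimal sketch on synthetic data of the shapes mentioned above:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(1)
X = rng.randn(31, 15)                  # 31 observations, 15 features
y = X[:, 0] + 0.1 * rng.randn(31)

mlp = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
# One held-out prediction per sample; Pearson's r is then computed only
# on predictions the model never saw during fitting.
y_pred = cross_val_predict(mlp, X, y, cv=LeaveOneOut())
r = np.corrcoef(y, y_pred)[0, 1]
```

Note that this estimate is only honest if any feature selection happens inside the folds as well; selecting the 15 features on the full dataset beforehand would bias it.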
>  
> 
> On 19 December 2016 at 23:42, Sebastian Raschka <se.raschka at gmail.com> wrote:
> Oh, sorry, I just noticed that I was in the wrong thread; I meant to answer a different Thomas :P.
> 
> Regarding the fingerprints: scikit-learn’s estimators expect feature vectors as samples, so you can’t pass a 3D array. Think of image classification, for example: there you also unroll each n_pixels x m_pixels array into a 1D array.
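The unrolling can be sketched in one line with NumPy (hypothetical 8x8 “images” as the per-sample 2D arrays):

```python
import numpy as np

# Hypothetical batch of ten 8x8 arrays. scikit-learn estimators expect a
# 2D matrix of shape (n_samples, n_features), so each n x m array is
# unrolled into a single row.
images = np.random.RandomState(0).rand(10, 8, 8)   # (n_samples, n, m)
X = images.reshape(len(images), -1)                # (10, 64)
print(X.shape)   # (10, 64)
```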
> 
> The low performance can have multiple causes. In case dimensionality is the issue, I’d first try stronger regularization, or feature selection.
> If you are working with molecular structures, and you have enough of them, you might also consider alternative feature representations, e.g., learning from the graphs directly:
> 
> http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
> http://pubs.acs.org/doi/abs/10.1021/ci400187y
> 
> Best,
> Sebastian
> 
> 
> > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> >
> > Does this mean that both are feasible?
> >
> > On 19 December 2016 at 18:17, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring.
> >
> > Best,
> > Sebastian
> >
> >
> > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <tevang3 at gmail.com> wrote:
> > >
> > > Greetings,
> > >
> > > My dataset consists of objects characterised by structural features that are encoded in a so-called "fingerprint" form. There are several different types of fingerprints, each encapsulating a different type of information. I want to combine two specific types of fingerprints to train an MLP regressor. The first fingerprint is a 2048-bit array of the form:
> > >
> > > FP1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)
> > >
> > > The second is an array of 60 floats of the form:
> > >
> > > FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,  1.31473857,
> > >        -0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
> > >        ...
> > >         0.        ,  0.        ,  5.89652792,  0.        ,  0.        ])
> > >
> > > At first I tried to fuse them into a single 1D array of 2048+60 columns, but the predictions of the MLP were worse than those of the two separate MLP models trained on each of the two fingerprint types individually. My question: is there a more effective way to combine the two fingerprints in order to indicate that they represent different types of information?
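One way to combine the two blocks into the single (n_samples, n_features) matrix scikit-learn expects, while keeping their different natures in mind, is to scale the real-valued block before concatenating it with the bits so one block does not dominate the other. A sketch on synthetic data with the shapes described above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples = 31
fp1 = rng.randint(0, 2, size=(n_samples, 2048)).astype(np.float32)  # bit fingerprint
fp2 = rng.randn(n_samples, 60)                                      # float fingerprint

# Scale the real-valued block so its features are on a footing comparable
# to the 0/1 bits before concatenation; a further option is to weight each
# block, e.g. np.hstack([w1 * fp1, w2 * fp2_scaled]) for chosen w1, w2.
fp2_scaled = StandardScaler().fit_transform(fp2)
X = np.hstack([fp1, fp2_scaled])   # shape (31, 2108)
```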
> > >
> > > To this end, I tried to create a 2-row array (1st row with 2048 elements, 2nd row with 60 elements), but sklearn complained:
> > >
> > >     mlp.fit(x_train, y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array
> > >     array = array.astype(np.float64)
> > > ValueError: setting an array element with a sequence.
> > >
> > > Then I tried to create, for each object of the dataset, a 2D array of size 2x2048 by padding the second row with 1988 zeros so that both rows are of equal size. However, sklearn complained again:
> > >
> > >
> > >     mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
> > >     return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
> > >     X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
> > >     multi_output=True, y_numeric=True)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
> > >     ensure_min_features, warn_on_dtype, estimator)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array
> > >     % (array.ndim, estimator_name))
> > > ValueError: Found array with dim 3. Estimator expected <= 2.
> > >
> > >
> > > In another case of fingerprints, let's call them FP3 and FP4, I observed that the MLP regressor built from FP3 yields better results when trained and evaluated on logarithmically transformed experimental values (the values in the y_train and y_test 1D arrays), while the MLP regressor built from FP4 yields better results on the original experimental values. So my second question is: when combining FP3 and FP4 into a single array, is there any way to indicate to the MLP that the features corresponding to FP3 should reproduce the logarithmic transform of the experimental values, while the features of FP4 should reproduce the original untransformed values?
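For context, the setup described can at least be reproduced with two separate models, each fit against its own target scale, whose predictions are combined after back-transforming; this is only one possible arrangement, sketched on synthetic stand-ins for FP3 and FP4:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
fp3 = rng.randn(31, 100)               # hypothetical FP3 block
fp4 = rng.randn(31, 60)                # hypothetical FP4 block
y = np.abs(fp3[:, 0]) + 1.0            # positive targets so log is defined

# One model per fingerprint, each using the target scale it works best on:
# FP3 is fit against log-transformed values, FP4 against the raw values.
m3 = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
m4 = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
m3.fit(fp3, np.log(y))
m4.fit(fp4, y)

# Back-transform FP3's predictions before combining, here by averaging.
y_pred = 0.5 * (np.exp(m3.predict(fp3)) + m4.predict(fp4))
```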
> > >
> > >
> > > I would greatly appreciate any advice on either of my two queries.
> > > Thomas
> > >
> > > --
> > > ======================================================================
> > > Thomas Evangelidis
> > > Research Specialist
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/1S081,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tevang at pharm.uoa.gr
> > >               tevang3 at gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > >
> > >
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >
> >
> >
> 
> 
> 
> 
> 
> 


