[scikit-learn] suggested classification algorithm

Thomas Evangelidis tevang3 at gmail.com
Thu Nov 17 09:00:33 EST 2016


Thank you all for your hints! Practical experience is irreplaceable, which
is why I posted this query here. I could spend all week reading the mailing
list archives and the relevant internet resources and still not find the
key information that someone here could give me.

I did PCA on my training set (this one has 24 positive and 1278 negative
observations) and projected the 19 features onto the first 2 principal
components, which together explain 87.6% of the variance in the data. Does
this plot help in deciding which classification algorithms and/or over- or
under-sampling strategies would be most suitable?

https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png
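For completeness, the projection was computed roughly like this (a minimal sketch; the array X below is a random placeholder with the same shape as the real feature matrix, which is not included here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data with the same shape as the real set:
# 24 positives + 1278 negatives = 1302 samples, 19 features each.
rng = np.random.RandomState(0)
X = rng.randn(1302, 19)

# Standardizing first is usually advisable so that no single
# feature dominates the principal components.
X_std = StandardScaler().fit_transform(X)

# Project onto the first 2 principal components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (1302, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance captured
```

With the real data the two components captured 87.6% of the variance; with the random placeholder above the fraction will of course be much lower.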

thanks for your advice
Thomas
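
P.S. In case it is useful to anyone searching the archives later, this is
the kind of baseline I am starting from: a class-weighted random forest
evaluated with stratified cross-validation and average precision, which is
more informative than accuracy at this imbalance. The data arrays below are
random placeholders with my class ratio, so the score itself is meaningless:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data mimicking the real class ratio:
# 24 positives, 1278 negatives, 19 features.
rng = np.random.RandomState(42)
X = rng.randn(1302, 19)
y = np.zeros(1302, dtype=int)
y[:24] = 1

# class_weight='balanced' re-weights each class inversely to its
# frequency, so the 24 positives are not drowned out by the negatives.
clf = RandomForestClassifier(n_estimators=200,
                             class_weight='balanced',
                             random_state=0)

# Stratified folds keep some positives in every test split;
# 'average_precision' summarizes the precision-recall curve.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring='average_precision')
print(scores.mean())
```

With only 24 positives, using few folds (or leave-one-out on the positives) seems safer than the default splitting, so that each test fold still contains several positive observations.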


On 16 November 2016 at 22:20, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> Yeah, there are many useful resources and implementations scattered around
> the web. However, a good, brief overview of the general ideas and concepts
> would be this one, for example:
> http://www.svds.com/learning-imbalanced-classes/
>
>
> > On Nov 16, 2016, at 3:54 PM, Dale T Smith <Dale.T.Smith at macys.com>
> wrote:
> >
> > Unbalanced class classification has been a topic here in past years, and
> there are posts if you search the archives. There are also plenty of
> resources available to help you, from actual code on Stackoverflow, to
> papers that address various ideas. I don’t think it’s necessary to repeat
> any of this on the mailing list.
> >
> >
> > ____________________________________________________________
> > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
> >
> > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
> macys.com at python.org] On Behalf Of Fernando Marcos Wittmann
> > Sent: Wednesday, November 16, 2016 3:11 PM
> > To: Scikit-learn user and developer mailing list
> > Subject: Re: [scikit-learn] suggested classification algorithm
> >
> > Tree-based algorithms (like Random Forest) usually work well for
> imbalanced datasets. You can also take a look at the SMOTE technique (
> http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to
> over-sample the positive observations.
> >
> > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <tevang3 at gmail.com>
> wrote:
> > Greetings,
> >
> > I want to design a program that can deal with classification problems of
> the same type, where the number of positive observations is small but the
> number of negative ones much larger. In concrete numbers, the number of
> positive observations would usually range between 2 and 20, and the number
> of negative ones would be at least 30 times larger. The number of features
> could also be between 2 and 20, though it could be reduced using feature
> selection and elimination algorithms. I've read in the documentation that
> some algorithms like the SVM remain effective when the number of dimensions
> is greater than the number of samples, but I am not sure whether they are
> suitable for my case. Moreover, according to this figure, the Nearest
> Neighbors classifier performs best, with the RBF SVM second:
> >
> > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
> >
> > However, I assume that Nearest Neighbors would not be effective in my
> case, where the number of positive observations is very low. For these
> reasons I would like to know your expert opinion on which classification
> algorithm I should try first.
> >
> > thanks in advance
> > Thomas
> >
> >
> > --
> > ======================================================================
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> >
> > email: tevang at pharm.uoa.gr
> >           tevang3 at gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> >
> >
> > --
> >
> > Fernando Marcos Wittmann
> > MS Student - Energy Systems Dept.
> > School of Electrical and Computer Engineering, FEEC
> > University of Campinas, UNICAMP, Brazil
> > +55 (19) 987-211302
> >
>
>



-- 

======================================================================

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr

          tevang3 at gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/