[scikit-learn] scikit-learn Digest, Vol 43, Issue 8

Mike Smith javaeurusd at gmail.com
Sat Oct 5 12:20:37 EDT 2019


Are nearest-neighbor models better than decision trees as base estimators for AdaBoost?

On Sat, Oct 5, 2019 at 9:02 AM <scikit-learn-request at python.org> wrote:

> Send scikit-learn mailing list submissions to
>         scikit-learn at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-request at python.org
>
> You can reach the person managing the list at
>         scikit-learn-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
>
>
> Today's Topics:
>
>    1. Re: Can Scikit-learn decision tree (CART) have both
>       continuous and categorical features? (Sebastian Raschka)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 4 Oct 2019 22:28:46 -0500
> From: Sebastian Raschka <mail at sebastianraschka.com>
> To: Scikit-learn mailing list <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have
>         both continuous and categorical features?
> Message-ID:
>         <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20 at sebastianraschka.com>
> Content-Type: text/plain;       charset=utf-8
>
> The docs show a way to do it without writing a png file, using
> tree.plot_tree:
> https://scikit-learn.org/stable/modules/tree.html#classification
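>
> E.g., a minimal sketch (the iris data here is just a placeholder):
>
> import matplotlib.pyplot as plt
> from sklearn import datasets, tree
>
> # fit a small tree on a toy dataset
> iris = datasets.load_iris()
> clf = tree.DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
>
> # plot_tree draws directly onto a matplotlib figure -- no png file needed
> tree.plot_tree(clf, feature_names=iris.feature_names)
> plt.show()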
>
> I don't remember exactly why, but I had problems with that in the past (I
> think the output didn't look as nice visually), which is why I still stick
> to graphviz. For my use cases, it's not much hassle -- it used to be a bit
> of a pain to get GraphViz working, but now you can just do
>
> conda install pydotplus
> conda install graphviz
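>
> With those installed, the graphviz route is roughly this (a sketch, not
> the exact code from my notebook):
>
> import pydotplus
> from sklearn import datasets, tree
>
> iris = datasets.load_iris()
> clf = tree.DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
>
> # export the fitted tree to DOT format, then render it to a png
> dot_data = tree.export_graphviz(clf, out_file=None,
>                                 feature_names=iris.feature_names)
> pydotplus.graph_from_dot_data(dot_data).write_png("tree.png")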
>
> Coincidentally, I just made an example for a lecture I was teaching on
> Tuesday:
> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>
> Best,
> Sebastian
>
>
> > On Oct 4, 2019, at 10:09 PM, C W <tmrsg11 at gmail.com> wrote:
> >
> > On a separate note, what do you use for plotting?
> >
> > I found graphviz, but you have to first save the plot as a png on your
> computer. That's a lot of work for just one plot. Is there something like
> matplotlib?
> >
> > Thanks!
> >
> > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > Yeah, think of it more as a computational workaround for achieving the
> same thing more efficiently (although it looks inelegant/weird) -- something
> like that wouldn't be mentioned in textbooks.
> >
> > Best,
> > Sebastian
> >
> > > On Oct 4, 2019, at 6:33 PM, C W <tmrsg11 at gmail.com> wrote:
> > >
> > > Thanks Sebastian, I think I get it.
> > >
> > > It's just that I have never seen it presented this way. Quite different
> from what I'm used to from Elements of Statistical Learning.
> > >
> > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > > Not sure if there's a website for that. In any case, to explain this
> differently: as discussed earlier, sklearn assumes continuous features for
> decision trees. So, it will use a binary threshold for splitting along a
> feature attribute. In other words, it cannot do something like
> > >
> > > if x == 1 then right child node
> > > else left child node
> > >
> > > Instead, what it does is
> > >
> > > if x >= 0.5 then right child node
> > > else left child node
> > >
> > > These are basically equivalent as you can see when you just plug in
> values 0 and 1 for x.
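> > >
> > > You can verify this on a toy example (a minimal sketch, data made up):
> > >
> > > import numpy as np
> > > from sklearn.tree import DecisionTreeClassifier
> > >
> > > # one binary feature; the label simply copies the feature
> > > X = np.array([[0], [0], [1], [1]])
> > > y = np.array([0, 0, 1, 1])
> > >
> > > clf = DecisionTreeClassifier().fit(X, y)
> > > # the root split threshold is 0.5; leaves are marked with -2
> > > print(clf.tree_.threshold)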
> > >
> > > Best,
> > > Sebastian
> > >
> > > > On Oct 4, 2019, at 5:34 PM, C W <tmrsg11 at gmail.com> wrote:
> > > >
> > > > I don't understand your answer.
> > > >
> > > > Why, after one-hot-encoding, does it still split at greater/less than
> 0.5? Does the sklearn website have a working example with categorical input?
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > > > Like Nicolas said, the 0.5 is just a workaround, but it will do the
> right thing on the one-hot encoded variables here. You will find that the
> threshold is always at 0.5 for these variables. I.e., what it will do is
> use the following conversion:
> > > >
> > > > treat as car_Audi=1 if car_Audi >= 0.5
> > > > treat as car_Audi=0 if car_Audi < 0.5
> > > >
> > > > or, it may be
> > > >
> > > > treat as car_Audi=1 if car_Audi > 0.5
> > > > treat as car_Audi=0 if car_Audi <= 0.5
> > > >
> > > > (Forgot which one sklearn is using, but either way, it will be fine.)
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >
> > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <niourf at gmail.com> wrote:
> > > >>
> > > >>
> > > >>> But the decision tree is still treating the one-hot encoding as
> numerical input and splitting at 0.5. This is not right. Perhaps I'm doing
> something wrong?
> > > >>
> > > >> You're not doing anything wrong, and neither is the tree. Trees
> don't support categorical variables in sklearn, so everything is treated as
> numerical.
> > > >>
> > > >> This is why we do one-hot-encoding: so that a set of numerical (one
> hot encoded) features can be treated as if they were just one categorical
> feature.
> > > >>
> > > >>
> > > >>
> > > >> Nicolas
> > > >>
> > > >> On 10/4/19 2:01 PM, C W wrote:
> > > >>> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So,
> a typo on my part.
> > > >>>
> > > >>> Looks like I did one-hot-encoding correctly. My new variable names
> are: car_Audi, car_BMW, etc.
> > > >>>
> > > >>> But the decision tree is still treating the one-hot encoding as
> numerical input and splitting at 0.5. This is not right. Perhaps I'm doing
> something wrong?
> > > >>>
> > > >>> Is there a good toy example on the sklearn website? I only see
> this:
> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > > >>> Hi,
> > > >>>
> > > >>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0,
> Toyota=1, Audi=2) as numerical values, not categories. The tree splits at
> 0.5 and 1.5
> > > >>>
> > > >>> that's not a one-hot encoding then.
> > > >>>
> > > >>> For an Audi datapoint, it should be
> > > >>>
> > > >>> BMW=0
> > > >>> Toyota=0
> > > >>> Audi=1
> > > >>>
> > > >>> for BMW
> > > >>>
> > > >>> BMW=1
> > > >>> Toyota=0
> > > >>> Audi=0
> > > >>>
> > > >>> and for Toyota
> > > >>>
> > > >>> BMW=0
> > > >>> Toyota=1
> > > >>> Audi=0
> > > >>>
> > > >>> The split threshold should then be at 0.5 for any of these
> features.
> > > >>>
> > > >>> Based on your email, I think you were assuming that the DT does
> the one-hot encoding internally, which it doesn't. In practice, it is hard
> to guess what is a nominal and what is an ordinal variable, so you have to
> do the one-hot encoding yourself before you give the data to the decision
> tree; see the sketch below.
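> > > >>>
> > > >>> (A minimal sketch; the column values and incomes are made up:)
> > > >>>
> > > >>> import numpy as np
> > > >>> from sklearn.preprocessing import OneHotEncoder
> > > >>> from sklearn.tree import DecisionTreeRegressor
> > > >>>
> > > >>> cars = np.array([["BMW"], ["Toyota"], ["Audi"], ["BMW"]])
> > > >>> income = np.array([10000, 9000, 12000, 11000])
> > > >>>
> > > >>> # one-hot encode the nominal feature *before* fitting the tree
> > > >>> X = OneHotEncoder().fit_transform(cars).toarray()  # Audi, BMW, Toyota
> > > >>>
> > > >>> reg = DecisionTreeRegressor().fit(X, income)
> > > >>> print(reg.tree_.threshold)  # all real splits sit at 0.5; leaves show -2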
> > > >>>
> > > >>> Best,
> > > >>> Sebastian
> > > >>>
> > > >>>> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
> > > >>>>
> > > >>>> I'm getting some funny results. I am fitting a regression decision
> tree; the response variables are assigned to levels.
> > > >>>>
> > > >>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0,
> Toyota=1, Audi=2) as numerical values, not categories.
> > > >>>>
> > > >>>> The tree splits at 0.5 and 1.5. Am I doing the one-hot-encoding
> wrong? How does sklearn know internally that 0 vs. 1 is categorical, not
> numerical?
> > > >>>>
> > > >>>> In R for instance, you do as.factor(), which explicitly states
> the data type.
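> > > >>>>
> > > >>>> (The closest pandas analogue I've found is the categorical dtype,
> > > >>>> df["car"] = df["car"].astype("category"), but as far as I can tell
> > > >>>> sklearn's trees don't use that dtype information, so you still have
> > > >>>> to one-hot encode, e.g. with pd.get_dummies(df).)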
> > > >>>>
> > > >>>> Thank you!
> > > >>>>
> > > >>>>
> > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <
> t3kcit at gmail.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
> > > >>>>>
> > > >>>>>
> > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
> > > >>>>> Thanks, Guillaume.
> > > >>>>> ColumnTransformer looks pretty neat. I've also heard, though,
> that this pipeline can be tedious to set up? Specifying what you want for
> every feature is a pain.
> > > >>>>>
> > > >>>>> It would be interesting for us to know which part of the pipeline
> is tedious to set up, so we can see if something there can be improved.
> > > >>>>> Do you mean that you would like to automatically detect the type
> of each feature (categorical/numerical) and apply a
> > > >>>>> default encoder/scaler, as discussed here:
> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
> > > >>>>>
> > > >>>>> IMO, from a user perspective, it would be cleaner in some cases,
> at the cost of blindly applying a black box,
> > > >>>>> which might be dangerous.
> > > >>>> Also see
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> > > >>>> which basically does that.
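> > > >>>>
> > > >>>> In plain sklearn, the auto-detection part can be approximated with
> > > >>>> make_column_selector (available from sklearn 0.22 on); a rough sketch:
> > > >>>>
> > > >>>> from sklearn.compose import ColumnTransformer, make_column_selector
> > > >>>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
> > > >>>>
> > > >>>> # route DataFrame columns by dtype instead of listing each feature
> > > >>>> pre = ColumnTransformer([
> > > >>>>     ("num", StandardScaler(),
> > > >>>>      make_column_selector(dtype_include="number")),
> > > >>>>     ("cat", OneHotEncoder(handle_unknown="ignore"),
> > > >>>>      make_column_selector(dtype_include=object)),
> > > >>>> ])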
> > > >>>>
> > > >>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Javier,
> > > >>>>> Actually, you guessed right. My real data has only one numerical
> variable; it looks more like this:
> > > >>>>>
> > > >>>>> Gender   Date       Income   Car      Attendance
> > > >>>>> Male     2019/3/01  10000    BMW      Yes
> > > >>>>> Female   2019/5/02  9000     Toyota   No
> > > >>>>> Male     2019/7/15  12000    Audi     Yes
> > > >>>>>
> > > >>>>> I am predicting income using all the other (categorical) variables.
> Maybe catboost is the answer!
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> M
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc>
> wrote:
> > > >>>>> If you have datasets with many categorical features, and perhaps
> many categories, the tools in sklearn are quite limited,
> > > >>>>> but there are alternative implementations of boosted trees that
> are designed with categorical features in mind. Take a look
> > > >>>>> at catboost [1], which has an sklearn-compatible API.
> > > >>>>>
> > > >>>>> J
> > > >>>>>
> > > >>>>> [1] https://catboost.ai/
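> > > >>>>>
> > > >>>>> A rough sketch of what that looks like (tiny made-up data):
> > > >>>>>
> > > >>>>> import pandas as pd
> > > >>>>> from catboost import CatBoostRegressor
> > > >>>>>
> > > >>>>> df = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
> > > >>>>>                    "Car": ["BMW", "Toyota", "Audi"],
> > > >>>>>                    "Income": [10000, 9000, 12000]})
> > > >>>>>
> > > >>>>> # tell catboost which columns are categorical -- no one-hot needed
> > > >>>>> model = CatBoostRegressor(iterations=10, verbose=0)
> > > >>>>> model.fit(df[["Gender", "Car"]], df["Income"],
> > > >>>>>           cat_features=["Gender", "Car"])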
> > > >>>>>
> > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
> > > >>>>> Hello all,
> > > >>>>> I'm very confused. Can the decision tree module handle both
> continuous and categorical features in the dataset? In this case, it's just
> CART (Classification and Regression Trees).
> > > >>>>>
> > > >>>>> For example,
> > > >>>>> Gender   Age   Income   Car      Attendance
> > > >>>>> Male     30    10000    BMW      Yes
> > > >>>>> Female   35    9000     Toyota   No
> > > >>>>> Male     50    12000    Audi     Yes
> > > >>>>>
> > > >>>>> According to the documentation
> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
> it cannot!
> > > >>>>>
> > > >>>>> It says: "scikit-learn implementation does not support
> categorical variables for now".
> > > >>>>>
> > > >>>>> Is this true? If not, can someone point me to an example? If
> yes, what do people do?
> > > >>>>>
> > > >>>>> Thank you very much!
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>>> Guillaume Lemaitre
> > > >>>>> INRIA Saclay - Parietal team
> > > >>>>> Center for Data Science Paris-Saclay
> > > >>>>> https://glemaitre.github.io/
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >
> > >
> >
>
>
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 8
> *******************************************
>

