[scikit-learn] scikit-learn Digest, Vol 43, Issue 11

Sun Oct 6 18:16:59 EDT 2019

Pandas has a read_excel function that can load data from an Excel
spreadsheet into a DataFrame, which you can then pass to model.predict():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
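
A minimal sketch of that idea, assuming the inputs sit in a rectangular
block of cells with no header row. The helper name predict_from_excel, the
toy model, and the file name and range in the commented call are made up
for illustration. read_excel also needs an Excel engine such as openpyxl
installed to read .xlsx files.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_from_excel(model, path, sheet_name=0, usecols=None,
                       skiprows=None, nrows=None):
    """Read a rectangular block of cells and pass it to model.predict()."""
    # usecols accepts Excel-style column letters such as "B:D";
    # skiprows and nrows pick the rows, so together they describe a range.
    cells = pd.read_excel(path, sheet_name=sheet_name, header=None,
                          usecols=usecols, skiprows=skiprows, nrows=nrows)
    return model.predict(cells.to_numpy())


# Hypothetical stand-in model with one input feature (y = 2 * x).
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])

# Placeholder call: cells B2:B11 on the first sheet of a made-up file.
# predictions = predict_from_excel(model, "inputs.xlsx",
#                                  usecols="B", skiprows=1, nrows=10)
```

The column letters in usecols plus skiprows/nrows stand in for an Excel
range; the number of columns read has to match the number of features the
model was trained on.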

On Sun, Oct 6, 2019 at 1:57 AM Mike Smith <javaeurusd at gmail.com> wrote:

> Can I call an MSExcel cell range in a function such as model.predict(),
> instead of typing the data in for each element?
>
> On Sat, Oct 5, 2019 at 11:58 AM <scikit-learn-request at python.org> wrote:
>
>> Send scikit-learn mailing list submissions to
>>         scikit-learn at python.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>>         scikit-learn-request at python.org
>>
>> You can reach the person managing the list at
>>         scikit-learn-owner at python.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of scikit-learn digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Re: scikit-learn Digest, Vol 43, Issue 10 (Mike Smith)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Sat, 5 Oct 2019 11:55:33 -0700
>> From: Mike Smith <javaeurusd at gmail.com>
>> To: scikit-learn at python.org
>> Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 10
>> Message-ID:
>>         <CAEWZffDWv8mOUVaKSSBzpiEebjcVrRD-t8zxuBSFCKxqTGi3=A at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>>  1. Re: Can Scikit-learn decision tree (CART) have both
>>       continuous and categorical features? (C W)
>>
>> What I'd ask in reply to this is if regression and classification module
>> results can be entered into an input for one resultant output.
>>
>>
>>
>> On Sat, Oct 5, 2019, 11:50 AM , <scikit-learn-request at python.org> wrote:
>>
>> > Today's Topics:
>> >
>> >    1. Re: Can Scikit-learn decision tree (CART) have both
>> >       continuous and categorical features? (C W)
>> >
>> >
>> > ----------------------------------------------------------------------
>> >
>> > Message: 1
>> > Date: Sat, 5 Oct 2019 14:50:09 -0400
>> > From: C W <tmrsg11 at gmail.com>
>> > To: Scikit-learn mailing list <scikit-learn at python.org>
>> > Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have
>> >         both continuous and categorical features?
>> > Message-ID:
>> >         <CAE2FW2nHDJGNky2VWk-U8fU3gqwBqWEgidzTAWnUq+NzAK68VA at mail.gmail.com>
>> > Content-Type: text/plain; charset="utf-8"
>> >
>> > Thanks, great material! I got pydotplus with graphviz to work.
>> >
>> > Using the code on the sklearn website [1], tree.plot_tree(clf.fit(iris.data,
>> > iris.target)) gives an error:
>> > AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'
>> >
>> > Both my colleague and I got the same error message. Per this post
>> > https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is
>> > needed.
>> >
>> > [1] sklearn link:
>> > https://scikit-learn.org/stable/modules/tree.html#classification
>> >
>> >
>> > On Fri, Oct 4, 2019 at 11:52 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> >
>> > > The docs show a way such that you don't need to write it as a png
>> > > file, using tree.plot_tree:
>> > > https://scikit-learn.org/stable/modules/tree.html#classification
>> > >
>> > > I don't remember why, but I think I had problems with that in the
>> > > past (I think it didn't look so nice visually, but don't remember),
>> > > which is why I still stick to graphviz. For my use cases, it's not
>> > > much hassle -- it used to be a bit of a hassle to get GraphViz
>> > > working, but now you can do
>> > >
>> > > conda install pydotplus
>> > > conda install graphviz
>> > >
>> > > Coincidentally, I just made an example for a lecture I was teaching on
>> > > Tue:
>> > > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>> > >
>> > > Best,
>> > > Sebastian
>> > >
>> > >
>> > > > On Oct 4, 2019, at 10:09 PM, C W <tmrsg11 at gmail.com> wrote:
>> > > >
>> > > > On a separate note, what do you use for plotting?
>> > > >
>> > > > I found graphviz, but you have to first save it as a png on your
>> > > > computer. That's a lot of work for just one plot. Is there something
>> > > > like matplotlib?
>> > > >
>> > > > Thanks!
>> > > >
>> > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> > > > Yeah, think of it more as a computational workaround for achieving
>> > > > the same thing more efficiently (although it looks inelegant/weird) --
>> > > > something like that wouldn't be mentioned in textbooks.
>> > > >
>> > > > Best,
>> > > > Sebastian
>> > > >
>> > > > > On Oct 4, 2019, at 6:33 PM, C W <tmrsg11 at gmail.com> wrote:
>> > > > >
>> > > > > Thanks Sebastian, I think I get it.
>> > > > >
>> > > > > I've just never seen it this way. It's quite different from what
>> > > > > I'm used to in Elements of Statistical Learning.
>> > > > >
>> > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> > > > > Not sure if there's a website for that. In any case, to explain
>> > > > > this differently, as discussed earlier sklearn assumes continuous
>> > > > > features for decision trees. So, it will use a binary threshold for
>> > > > > splitting along a feature attribute. In other words, it cannot do
>> > > > > something like
>> > > > >
>> > > > > if x == 1 then right child node
>> > > > > else left child node
>> > > > >
>> > > > > Instead, what it does is
>> > > > >
>> > > > > if x >= 0.5 then right child node
>> > > > > else left child node
>> > > > >
>> > > > > These are basically equivalent, as you can see when you just plug
>> > > > > in values 0 and 1 for x.
>> > > > >
>> > > > > Best,
>> > > > > Sebastian
>> > > > >
>> > > > > > On Oct 4, 2019, at 5:34 PM, C W <tmrsg11 at gmail.com> wrote:
>> > > > > >
>> > > > > > I don't understand your answer.
>> > > > > >
>> > > > > > Why, after one-hot encoding, does it still output "greater than
>> > > > > > 0.5" or "less than 0.5"? Does the sklearn website have a working
>> > > > > > example with categorical input?
>> > > > > >
>> > > > > > Thanks!
>> > > > > >
>> > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> > > > > > Like Nicolas said, the 0.5 is just a workaround but will do the
>> > > > > > right thing on the one-hot encoded variables here. You will find
>> > > > > > that the threshold is always at 0.5 for these variables. I.e., what
>> > > > > > it will do is use the following conversion:
>> > > > > >
>> > > > > > treat as car_Audi=1 if car_Audi >= 0.5
>> > > > > > treat as car_Audi=0 if car_Audi < 0.5
>> > > > > >
>> > > > > > or, it may be
>> > > > > >
>> > > > > > treat as car_Audi=1 if car_Audi > 0.5
>> > > > > > treat as car_Audi=0 if car_Audi <= 0.5
>> > > > > >
>> > > > > > (Forgot which one sklearn is using, but either way, it will be fine.)
>> > > > > >
>> > > > > > Best,
>> > > > > > Sebastian
>> > > > > >
>> > > > > >
>> > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug <niourf at gmail.com> wrote:
>> > > > > >>
>> > > > > >>
>> > > > > >>> But, decision tree is still mistaking one-hot-encoding as
>> > > > > >>> numerical input and split at 0.5. This is not right. Perhaps, I'm
>> > > > > >>> doing something wrong?
>> > > > > >>
>> > > > > >> You're not doing anything wrong, and neither is the tree. Trees
>> > > > > >> don't support categorical variables in sklearn, so everything is
>> > > > > >> treated as numerical.
>> > > > > >>
>> > > > > >> This is why we do one-hot-encoding: so that a set of numerical
>> > > > > >> (one-hot encoded) features can be treated as if they were just one
>> > > > > >> categorical feature.
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> Nicolas
>> > > > > >>
>> > > > > >> On 10/4/19 2:01 PM, C W wrote:
>> > > > > >>> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So,
>> > > > > >>> typo on my part.
>> > > > > >>>
>> > > > > >>> Looks like I did one-hot-encoding correctly. My new variable names
>> > > > > >>> are: car_Audi, car_BMW, etc.
>> > > > > >>>
>> > > > > >>> But, decision tree is still mistaking one-hot-encoding as
>> > > > > >>> numerical input and split at 0.5. This is not right. Perhaps, I'm
>> > > > > >>> doing something wrong?
>> > > > > >>>
>> > > > > >>> Is there a good toy example on the sklearn website? I only see
>> > > > > >>> this:
>> > > > > >>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>> > > > > >>>
>> > > > > >>> Thanks!
>> > > > > >>>
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> > > > > >>> Hi,
>> > > > > >>>
>> > > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0,
>> > > > > >>>> Toyota=1, Audi=2) as numerical values, not category. The tree
>> > > > > >>>> splits at 0.5 and 1.5
>> > > > > >>>
>> > > > > >>> that's not a onehot encoding then.
>> > > > > >>>
>> > > > > >>> For an Audi datapoint, it should be
>> > > > > >>>
>> > > > > >>> BMW=0
>> > > > > >>> Toyota=0
>> > > > > >>> Audi=1
>> > > > > >>>
>> > > > > >>> for BMW
>> > > > > >>>
>> > > > > >>> BMW=1
>> > > > > >>> Toyota=0
>> > > > > >>> Audi=0
>> > > > > >>>
>> > > > > >>> and for Toyota
>> > > > > >>>
>> > > > > >>> BMW=0
>> > > > > >>> Toyota=1
>> > > > > >>> Audi=0
>> > > > > >>>
>> > > > > >>> The split threshold should then be at 0.5 for any of these features.
>> > > > > >>>
>> > > > > >>> Based on your email, I think you were assuming that the DT does
>> > > > > >>> the one-hot encoding internally, which it doesn't. In practice, it
>> > > > > >>> is hard to guess what is a nominal and what is an ordinal variable,
>> > > > > >>> so you have to do the one-hot encoding before you give the data to
>> > > > > >>> the decision tree.
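
To make the point above concrete, here is a small sketch using the three
Car/Income rows from the table earlier in this thread. Using pd.get_dummies
for the encoding is my own choice for brevity (OneHotEncoder would work
too), and the variable names are illustrative only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The Car and Income columns of the three example rows in this thread.
df = pd.DataFrame({
    "Car": ["BMW", "Toyota", "Audi"],
    "Income": [10000, 9000, 12000],
})

# One-hot encode BEFORE fitting: one 0/1 column per car brand
# (car_Audi, car_BMW, car_Toyota).
X = pd.get_dummies(df[["Car"]], prefix="car", dtype=float)
y = df["Income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Internal (split) nodes have a feature index >= 0; leaves use a negative one.
is_split = tree.tree_.feature >= 0
split_thresholds = tree.tree_.threshold[is_split]
split_columns = [X.columns[i] for i in tree.tree_.feature[is_split]]

# Each entry pairs a one-hot column with the threshold the tree chose for it.
print(list(zip(split_columns, split_thresholds)))
```

Every split it reports should sit at 0.5, the midpoint between the only two
values a one-hot column takes. sklearn sends samples with value <= threshold
to the left child, so 0 goes left and 1 goes right.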
>> > > > > >>>
>> > > > > >>> Best,
>> > > > > >>> Sebastian
>> > > > > >>>
>> > > > > >>>> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
>> > > > > >>>>
>> > > > > >>>> I'm getting some funny results. I am doing a regression decision
>> > > > > >>>> tree, the response variables are assigned to levels.
>> > > > > >>>>
>> > > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0,
>> > > > > >>>> Toyota=1, Audi=2) as numerical values, not category.
>> > > > > >>>>
>> > > > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding
>> > > > > >>>> wrong? How does sklearn know internally that 0 vs. 1 is
>> > > > > >>>> categorical, not numerical?
>> > > > > >>>>
>> > > > > >>>> In R, for instance, you do as.factor(), which explicitly states
>> > > > > >>>> the data type.
>> > > > > >>>>
>> > > > > >>>> Thank you!
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com> wrote:
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>> > > > > >>>>> Thanks, Guillaume.
>> > > > > >>>>> Column transformer looks pretty neat. I've also heard, though,
>> > > > > >>>>> that this pipeline can be tedious to set up? Specifying what you
>> > > > > >>>>> want for every feature is a pain.
>> > > > > >>>>>
>> > > > > >>>>> It would be interesting for us to know which part of the pipeline
>> > > > > >>>>> is tedious to set up, to see if we can improve something there.
>> > > > > >>>>> Do you mean that you would like to automatically detect the type
>> > > > > >>>>> of each feature (categorical/numerical) and apply a default
>> > > > > >>>>> encoder/scaling, as discussed there:
>> > > > > >>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>> > > > > >>>>>
>> > > > > >>>>> IMO, from a user perspective, it would be cleaner in some cases,
>> > > > > >>>>> at the cost of blindly applying a black box, which might be
>> > > > > >>>>> dangerous.
>> > > > > >>>> Also see
>> > > > > >>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>> > > > > >>>> which basically does that.
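
For comparison, a rough sketch of the explicit ColumnTransformer setup
being discussed, using the three example rows from the first message in
this thread. The column grouping, step name, and variable names are my own
illustrative choices, not something prescribed by sklearn or by the posters.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# The example table from the original question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 35, 50],
    "Income": [10000, 9000, 12000],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
})
X = df.drop(columns="Income")
y = df["Income"]

# One-hot encode the categorical columns; pass numeric ones (Age) through.
categorical = ["Gender", "Car", "Attendance"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# Chaining the encoder and the tree applies the same encoding at predict time.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

The only per-dataset decision here is the list of categorical column names;
everything else is passed through unchanged.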
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>> Javier,
>> > > > > >>>>> Actually, you guessed right. My real data has only one numerical
>> > > > > >>>>> variable, looks more like this:
>> > > > > >>>>>
>> > > > > >>>>> Gender  Date       Income  Car     Attendance
>> > > > > >>>>> Male    2019/3/01  10000   BMW     Yes
>> > > > > >>>>> Female  2019/5/02   9000   Toyota  No
>> > > > > >>>>> Male    2019/7/15  12000   Audi    Yes
>> > > > > >>>>>
>> > > > > >>>>> I am predicting income using all other categorical variables.
>> > > > > >>>>> Maybe it is catboost!
>> > > > > >>>>>
>> > > > > >>>>> Thanks,
>> > > > > >>>>>
>> > > > > >>>>> M
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>> > > > > >>>>> If you have datasets with many categorical features, and perhaps
>> > > > > >>>>> many categories, the tools in sklearn are quite limited, but there
>> > > > > >>>>> are alternative implementations of boosted trees that are designed
>> > > > > >>>>> with categorical features in mind. Take a look at catboost [1],
>> > > > > >>>>> which has an sklearn-compatible API.
>> > > > > >>>>>
>> > > > > >>>>> J
>> > > > > >>>>>
>> > > > > >>>>> [1] https://catboost.ai/
>> > > > > >>>>>
>> > > > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>> > > > > >>>>> Hello all,
>> > > > > >>>>> I'm very confused. Can the decision tree module handle both
>> > > > > >>>>> continuous and categorical features in the dataset? In this case,
>> > > > > >>>>> it's just CART (Classification and Regression Trees).
>> > > > > >>>>>
>> > > > > >>>>> For example,
>> > > > > >>>>> Gender  Age  Income  Car     Attendance
>> > > > > >>>>> Male    30   10000   BMW     Yes
>> > > > > >>>>> Female  35    9000   Toyota  No
>> > > > > >>>>> Male    50   12000   Audi    Yes
>> > > > > >>>>>
>> > > > > >>>>> According to the documentation at
>> > > > > >>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
>> > > > > >>>>> it cannot!
>> > > > > >>>>>
>> > > > > >>>>> It says: "scikit-learn implementation does not support
>> > > > > >>>>> categorical variables for now".
>> > > > > >>>>>
>> > > > > >>>>> Is this true? If not, can someone point me to an example? If
>> > > > > >>>>> yes, what do people do?
>> > > > > >>>>>
>> > > > > >>>>> Thank you very much!
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>> _______________________________________________
>> > > > > >>>>> scikit-learn mailing list
>> > > > > >>>>> scikit-learn at python.org
>> > > > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>> --
>> > > > > >>>>> Guillaume Lemaitre
>> > > > > >>>>> INRIA Saclay - Parietal team
>> > > > > >>>>> Center for Data Science Paris-Saclay
>> > > > > >>>>> https://glemaitre.github.io/
>> > > > > >>>>>
>> > > > > >>>>>
>> > ------------------------------
>> >
>> > End of scikit-learn Digest, Vol 43, Issue 10
>> > ********************************************
>> >
>> ------------------------------
>>
>> End of scikit-learn Digest, Vol 43, Issue 11
>> ********************************************
>>