[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

Andreas Mueller t3kcit at gmail.com
Thu Jun 13 10:41:39 EDT 2019


He doesn't only talk about black box vs statistical, he talks about 
model based vs prediction based.
He says that if you validate predictions, you don't need to 
(necessarily) worry about model misspecification.

A linear regression model can be misspecified, and it can be overfit. 
Just fitting the model will not inform you whether either of these is 
the case.
Because the model is simple and well understood, there are several ways to check 
for model misspecification and overfitting.
A train-test split doesn't exactly tell you whether the model is 
misspecified (the errors could be non-normal while prediction is still 
good),
but it does give you an idea of whether the model is "useful".
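To make that concrete, here is a minimal sketch (on purely illustrative synthetic data) of the two checks side by side: a held-out score for predictive usefulness, and a look at the training residuals for hints of misspecification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
# A truly linear process with Gaussian noise, so the model is well specified.
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Prediction-based check: held-out R^2 tells you whether the model is useful,
# regardless of whether its assumptions hold.
print("test R^2:", r2_score(y_test, model.predict(X_test)))

# Model-based check: structure in the residuals (curvature, changing spread)
# would hint at misspecification; here they should look like plain noise.
residuals = y_train - model.predict(X_train)
print("residual mean:", residuals.mean())
```

A non-normal error distribution could leave the first number high while the second check fails, which is exactly the gap described above.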

Basically: you need to validate whatever you did. There are model-based 
approaches and there are prediction-based approaches.
Prediction-based approaches are always applicable; model-based 
approaches are usually more limited and harder to carry out (but if you 
find a good model, you've got a model of the process, which is great!). 
Either way, you need to pick at least one of the two approaches.
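The contrast between the two kinds of validation can be sketched as follows (again on assumed synthetic data): cross-validation as the always-applicable prediction-based check, and a residual normality test as one example of a model-based check tied to the linear-Gaussian assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.normal(size=(300, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=1.0, size=300)

# Prediction-based: works for any estimator, answers "does it predict well?"
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("mean CV R^2:", scores.mean())

# Model-based: specific to the linear-Gaussian model, answers
# "are the assumptions plausible?" -- here, normality of the residuals.
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
stat, pvalue = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", pvalue)
```

If the cross-validated score is good, the model is useful for prediction even if the normality test rejects; if both checks pass, you have some grounds to read the fitted coefficients as a model of the process.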


On 6/12/19 2:36 PM, C W wrote:
> Thank you both for the papers references.
>
> @ Andreas,
> What is your take? And what are you implying?
>
> The Breiman (2001) paper points out the black-box vs. statistical 
> approaches. I call them black box vs. open box. He advocates the black 
> box in the paper.
> Black box:
> y <--- nature <--- x
>
> Open box:
> y <--- linear regression <---- x
>
> Decision trees and neural nets are black-box models. They require a large 
> amount of data to train, and skip the part where one tries to 
> understand nature.
>
> Because it is a black box, you can't open up to see what's inside. 
> Linear regression is a very simple model that you can use to 
> approximate nature, but the key thing is that you need to know how the 
> data are generated.
>
> @ Brown,
> I know nothing about molecular modeling. The "Beware of q2!" paper you 
> linked raises some interesting points; as far as I can see, in sklearn 
> linear regression the score is R^2.
>
> On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller <t3kcit at gmail.com 
> <mailto:t3kcit at gmail.com>> wrote:
>
>
>     On 6/4/19 8:44 PM, C W wrote:
>     > Thank you all for the replies.
>     >
>     > I agree that prediction accuracy is great for evaluating
>     black-box ML
>     > models. Especially advanced models like neural networks, or
>     > not-so-black models like LASSO, because they are NP-hard to solve.
>     >
>     > Linear regression is not a black-box. I view prediction accuracy
>     as an
>     > overkill on interpretable models. Especially when you can use
>     > R-squared, coefficient significance, etc.
>     >
>     > Prediction accuracy also does not tell you which feature is
>     important.
>     >
>     > What do you guys think? Thank you!
>     >
>     Did you read the paper that I sent? ;)
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org <mailto:scikit-learn at python.org>
>     https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
