[scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem
Joel Nothman
joel.nothman at gmail.com
Wed Mar 15 23:57:35 EDT 2017
sklearn's (and hence liblinear's) intercept is not being used here, but a
feature is added in Python to represent the bias, so it's being regularised
in any case.
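
To make this concrete, here is a minimal sketch in plain Python (no scikit-learn) that solves the ridge normal equations for the data in the original post. It assumes the objective ||y - Xw||^2 + alpha * ||w||^2 with the constant column treated as an ordinary feature; the helper names `solve` and `ridge` are made up for this illustration. Penalising all three coefficients should reproduce the Ridge coefficients quoted below, while leaving the constant column unpenalised gives a different bias.

```python
# Data copied from the original post.
X = [
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824],
]
y = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def ridge(rows, targets, penalties):
    """Minimise ||y - Xw||^2 + sum_j penalties[j] * w_j^2 (normal equations)."""
    d = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) + (penalties[i] if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(A, b)


alpha = 0.3
X1 = [row + [1.0] for row in X]  # constant column appended, as in the post

# Bias column penalised like any other feature (what the posted code does).
w_all = ridge(X1, y, [alpha, alpha, alpha])
# Bias column left unpenalised (what a separately handled intercept would do).
w_free = ridge(X1, y, [alpha, alpha, 0.0])

print("bias penalised:  ", w_all)
print("bias unpenalised:", w_free)
```

The last entry of each list is the bias; the gap between the two shows how much the penalty alone moves it on this data.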
On 16 March 2017 at 14:27, Sebastian Raschka <se.raschka at gmail.com> wrote:
> I think the liblinear solver (default in LogisticRegression) does
> regularize the bias. So, even if both solutions (sklearn, spark) converge to
> the global cost optimum, the model parameters would be different.
> Maybe a better way to make that comparison would be to turn off
> regularization completely for now. And when you run the LogisticRegression,
> maybe run it multiple times with different random seeds to see if your
> solutions are generally stable.
>
> Best,
> Sebastian
>
> > On Mar 13, 2017, at 1:06 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:
> >
> > Both libraries are heavily parameterized. You should check what the
> > defaults are for both.
> >
> > Some ideas:
> > - Which regularization is being used: L1 or L2?
> > - Does the regularization parameter have the same interpretation
> >   (1/C = lambda)? Some libraries use C, some use lambda.
> > - Also, some libraries regularize the intercept (scikit); others do not.
> >   (It doesn't seem like a particularly good idea to regularize the
> >   intercept if your optimizer permits not doing it.)
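
On the C versus lambda point, here is a sketch of one possible mapping. It assumes, as far as I understand the two libraries, that scikit-learn's L2 LogisticRegression minimises 0.5 * ||w||^2 + C * sum(loss), while spark.ml minimises mean(loss) + regParam * 0.5 * ||w||^2 (with elasticNetParam = 0). Under that assumption, regParam = 1 / (C * n). The function names are hypothetical, no intercept is modelled, and spark.ml's default feature standardization is ignored; it would presumably also need to be switched off for a like-for-like comparison.

```python
import math

# First four rows of the data from the original post.
X = [
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
]
y = [0.0, 1.0, 1.0, 0.0]


def total_log_loss(w, rows, labels):
    """Sum of logistic losses log(1 + exp(-s * w.x)), with s = +1 or -1."""
    total = 0.0
    for x, t in zip(rows, labels):
        z = sum(wi * xi for wi, xi in zip(w, x))
        sign = 1.0 if t == 1.0 else -1.0
        total += math.log1p(math.exp(-sign * z))
    return total


def sklearn_style_objective(w, rows, labels, C):
    # Assumed form: 0.5 * ||w||^2 + C * sum(loss)
    return 0.5 * sum(wi * wi for wi in w) + C * total_log_loss(w, rows, labels)


def spark_style_objective(w, rows, labels, reg_param):
    # Assumed form: mean(loss) + reg_param * 0.5 * ||w||^2
    n = len(rows)
    l2 = 0.5 * sum(wi * wi for wi in w)
    return total_log_loss(w, rows, labels) / n + reg_param * l2


C = 1 / 0.3
n = len(X)
reg_param = 1.0 / (C * n)  # the mapping under the assumptions above

# The two objectives then differ only by the constant factor C * n,
# so they share the same minimiser. Check at an arbitrary point:
w = [0.4, -0.7]
lhs = sklearn_style_objective(w, X, y, C)
rhs = C * n * spark_style_objective(w, X, y, reg_param)
print(lhs, rhs)
```

If this mapping holds, the ten rows in the thread with C = 1/0.3 would correspond to regParam = 0.03 rather than 0.3.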
> >
> >
> >
> > On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn <scikit-learn at python.org> wrote:
> > (this was also posted to stackoverflow on 03/10)
> >
> > I am setting up a very simple logistic regression problem in
> > scikit-learn and in spark.ml, and the results diverge: the models they
> > learn are different, but I can't figure out why (data is the same, model
> > type is the same, regularization is the same...).
> >
> > No doubt I am missing some setting on one side or the other. Which
> > setting? How should I set up either scikit or spark.ml to find the same
> > model as its counterpart?
> >
> > I give the sklearn code and spark.ml code below. Both should be ready
> > to cut-and-paste and run.
> >
> > scikit-learn code:
> > ----------------------
> >
> > import numpy as np
> > from sklearn.linear_model import LogisticRegression, Ridge
> >
> > X = np.array([
> > [-0.7306653538519616, 0.0],
> > [0.6750417712898752, -0.4232874171873786],
> > [0.1863463229359709, -0.8163423997075965],
> > [-0.6719842051493347, 0.0],
> > [0.9699938346531928, 0.0],
> > [0.22759406190283604, 0.0],
> > [0.9688721028330911, 0.0],
> > [0.5993795346650845, 0.0],
> > [0.9219423508390701, -0.8972778242305388],
> > [0.7006904841584055, -0.5607635619919824]
> > ])
> >
> > y = np.array([
> > 0.0,
> > 1.0,
> > 1.0,
> > 0.0,
> > 1.0,
> > 1.0,
> > 1.0,
> > 0.0,
> > 0.0,
> > 0.0
> > ])
> >
> > m, n = X.shape
> >
> > # Add intercept term to simulate inputs to GameEstimator
> > X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))
> >
> > l = 0.3
> > e = LogisticRegression(
> > fit_intercept=False,
> > penalty='l2',
> > C=1/l,
> > max_iter=100,
> > tol=1e-11)
> >
> > e.fit(X_with_intercept, y)
> >
> > print(e.coef_)
> > # => [[ 0.98662189 0.45571052 -0.23467255]]
> >
> > # L2-regularized linear regression is called Ridge in sklearn
> > e = Ridge(
> > fit_intercept=False,
> > alpha=l,
> > max_iter=100,
> > tol=1e-11)
> >
> > e.fit(X_with_intercept, y)
> >
> > print(e.coef_)
> > # => [ 0.32155545 0.17904355 0.41222418]
> >
> > spark.ml code:
> > -------------------
> >
> > import org.apache.spark.{SparkConf, SparkContext}
> > import org.apache.spark.ml.classification.LogisticRegression
> > import org.apache.spark.ml.regression.LinearRegression
> > import org.apache.spark.mllib.linalg.Vectors
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.sql.SQLContext
> >
> > object TestSparkRegression {
> > def main(args: Array[String]): Unit = {
> > import org.apache.log4j.{Level, Logger}
> >
> > Logger.getLogger("org").setLevel(Level.OFF)
> > Logger.getLogger("akka").setLevel(Level.OFF)
> >
> > val conf = new SparkConf().setAppName("test").setMaster("local")
> > val sc = new SparkContext(conf)
> >
> > val sparkTrainingData = new SQLContext(sc)
> > .createDataFrame(Seq(
> > LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)),
> > LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)),
> > LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)),
> > LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)),
> > LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)),
> > LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)),
> > LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)),
> > LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)),
> > LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)),
> > LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824))))
> > .toDF("label", "features")
> >
> > val logisticModel = new LogisticRegression()
> > .setRegParam(0.3)
> > .setLabelCol("label")
> > .setFeaturesCol("features")
> > .fit(sparkTrainingData)
> >
> > println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")
> > // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987
> >
> > val linearModel = new LinearRegression()
> > .setRegParam(0.3)
> > .setLabelCol("label")
> > .setFeaturesCol("features")
> > .setSolver("l-bfgs")
> > .fit(sparkTrainingData)
> >
> > println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}")
> > // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323
> >
> > sc.stop()
> > }
> > }
> >
> > Thanks,
> >
> > Frank
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn