<div dir="ltr"><div>(This was also posted to Stack Overflow on 03/10.)</div><div><br></div><div><div>I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the two libraries learn different models, and I can't figure out why (the data is the same, the model type is the same, the regularization is the same...). </div><div><br></div><div>No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit-learn or spark.ml to find the same model as its counterpart?</div><div><br></div><div>The scikit-learn code and the spark.ml code are below. Both should be ready to cut-and-paste and run.</div><div><br></div><div>scikit-learn code:</div><div>----------------------</div><div><br></div><div> import numpy as np</div><div> from sklearn.linear_model import LogisticRegression, Ridge</div><div> </div><div> X = np.array([</div><div> [-0.7306653538519616, 0.0],</div><div> [0.6750417712898752, -0.4232874171873786],</div><div> [0.1863463229359709, -0.8163423997075965],</div><div> [-0.6719842051493347, 0.0],</div><div> [0.9699938346531928, 0.0],</div><div> [0.22759406190283604, 0.0],</div><div> [0.9688721028330911, 0.0],</div><div> [0.5993795346650845, 0.0],</div><div> [0.9219423508390701, -0.8972778242305388],</div><div> [0.7006904841584055, -0.5607635619919824]</div><div> ])</div><div> </div><div> y = np.array([</div><div> 0.0,</div><div> 1.0,</div><div> 1.0,</div><div> 0.0,</div><div> 1.0,</div><div> 1.0,</div><div> 1.0,</div><div> 0.0,</div><div> 0.0,</div><div> 0.0</div><div> ])</div><div> </div><div> m, n = X.shape</div><div> </div><div> # Add intercept term to simulate inputs to GameEstimator</div><div> X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))</div><div> </div><div> # Regularization weight (sklearn's C is its inverse)</div><div> l = 0.3</div><div> e = LogisticRegression(</div><div> fit_intercept=False,</div><div> penalty='l2',</div><div> C=1/l,</div><div> 
max_iter=100,</div><div> tol=1e-11)</div><div> </div><div> e.fit(X_with_intercept, y)</div><div> </div><div> print(e.coef_)</div><div> # => [[ 0.98662189 0.45571052 -0.23467255]]</div><div> </div><div> # Ridge is sklearn's L2-regularized linear regression</div><div> e = Ridge(</div><div> fit_intercept=False,</div><div> alpha=l,</div><div> max_iter=100,</div><div> tol=1e-11)</div><div> </div><div> e.fit(X_with_intercept, y)</div><div> </div><div> print(e.coef_)</div><div> # => [ 0.32155545 0.17904355 0.41222418]</div><div><br></div><div>spark.ml code:</div><div>-------------------</div><div><br></div><div> import org.apache.spark.{SparkConf, SparkContext}</div><div> import org.apache.spark.ml.classification.LogisticRegression</div><div> import org.apache.spark.ml.regression.LinearRegression</div><div> import org.apache.spark.mllib.linalg.Vectors</div><div> import org.apache.spark.mllib.regression.LabeledPoint</div><div> import org.apache.spark.sql.SQLContext</div><div> </div><div> object TestSparkRegression {</div><div> def main(args: Array[String]): Unit = {</div><div> import org.apache.log4j.{Level, Logger}</div><div> </div><div> Logger.getLogger("org").setLevel(Level.OFF)</div><div> Logger.getLogger("akka").setLevel(Level.OFF)</div><div> </div><div> val conf = new SparkConf().setAppName("test").setMaster("local")</div><div> val sc = new SparkContext(conf)</div><div> </div><div> val sparkTrainingData = new SQLContext(sc)</div><div> .createDataFrame(Seq(</div><div> LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)),</div><div> LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)),</div><div> LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)),</div><div> LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)),</div><div> LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)),</div><div> LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)),</div><div> LabeledPoint(1.0, 
Vectors.dense(0.9688721028330911, 0.0)),</div><div> LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)),</div><div> LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)),</div><div> LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824))))</div><div> .toDF("label", "features")</div><div> </div><div> val logisticModel = new LogisticRegression()</div><div> .setRegParam(0.3)</div><div> .setLabelCol("label")</div><div> .setFeaturesCol("features")</div><div> .fit(sparkTrainingData)</div><div> </div><div> println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}")</div><div> // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987</div><div> </div><div> val linearModel = new LinearRegression()</div><div> .setRegParam(0.3)</div><div> .setLabelCol("label")</div><div> .setFeaturesCol("features")</div><div> .setSolver("l-bfgs")</div><div> .fit(sparkTrainingData)</div><div> </div><div> println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}")</div><div> // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323</div><div> </div><div> sc.stop()</div><div> }</div><div> }</div><div><br></div><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Thanks,</div><div dir="ltr"><br></div><div>Frank</div></div></div></div></div></div></div>
</div></div>