Differences between scikit-learn and Spark.ml for regression toy problem
(this was also posted to stackoverflow on 03/10) I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...). No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart? I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run. scikit-learn code: ---------------------- import numpy as np from sklearn.linear_model import LogisticRegression, Ridge X = np.array([ [-0.7306653538519616, 0.0], [0.6750417712898752, -0.4232874171873786], [0.1863463229359709, -0.8163423997075965], [-0.6719842051493347, 0.0], [0.9699938346531928, 0.0], [0.22759406190283604, 0.0], [0.9688721028330911, 0.0], [0.5993795346650845, 0.0], [0.9219423508390701, -0.8972778242305388], [0.7006904841584055, -0.5607635619919824] ]) y = np.array([ 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0 ]) m, n = X.shape # Add intercept term to simulate inputs to GameEstimator X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis])) l = 0.3 e = LogisticRegression( fit_intercept=False, penalty='l2', C=1/l, max_iter=100, tol=1e-11) e.fit(X_with_intercept, y) print e.coef_ # => [[ 0.98662189 0.45571052 -0.23467255]] # Linear regression is called Ridge in sklearn e = Ridge( fit_intercept=False, alpha=l, max_iter=100, tol=1e-11) e.fit(X_with_intercept, y) print e.coef_ # =>[ 0.32155545 0.17904355 0.41222418] spark.ml code: ------------------- import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.sql.SQLContext object TestSparkRegression { def main(args: Array[String]): Unit = { import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) val conf = new SparkConf().setAppName("test").setMaster("local") val sc = new SparkContext(conf) val sparkTrainingData = new SQLContext(sc) .createDataFrame(Seq( LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) .toDF("label", "features") val logisticModel = new LogisticRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .fit(sparkTrainingData) println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987 val linearModel = new LinearRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .setSolver("l-bfgs") .fit(sparkTrainingData) println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323 sc.stop() } } Thanks, Frank
Both libraries are heavily parameterized. You should check what the defaults are for both. Some ideas: - What regularization is being used. L1/L2? - Does the regularization parameter have the same interpretation? 1/C = lambda? Some libraries use C. Some use lambda. - Also, some libraries regularize the intercept (scikit), other do not. (It doesn't seem like a particularly good idea to regularize the intercept if your optimizer permits not doing it). On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn < scikit-learn@python.org> wrote:
(this was also posted to stackoverflow on 03/10)
I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...).
No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart?
I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run.
scikit-learn code: ----------------------
import numpy as np from sklearn.linear_model import LogisticRegression, Ridge
X = np.array([ [-0.7306653538519616, 0.0], [0.6750417712898752, -0.4232874171873786], [0.1863463229359709, -0.8163423997075965], [-0.6719842051493347, 0.0], [0.9699938346531928, 0.0], [0.22759406190283604, 0.0], [0.9688721028330911, 0.0], [0.5993795346650845, 0.0], [0.9219423508390701, -0.8972778242305388], [0.7006904841584055, -0.5607635619919824] ])
y = np.array([ 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0 ])
m, n = X.shape
# Add intercept term to simulate inputs to GameEstimator X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))
l = 0.3 e = LogisticRegression( fit_intercept=False, penalty='l2', C=1/l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # => [[ 0.98662189 0.45571052 -0.23467255]]
# Linear regression is called Ridge in sklearn e = Ridge( fit_intercept=False, alpha=l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # =>[ 0.32155545 0.17904355 0.41222418]
spark.ml code: -------------------
import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.sql.SQLContext
object TestSparkRegression { def main(args: Array[String]): Unit = { import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("test").setMaster("local") val sc = new SparkContext(conf)
val sparkTrainingData = new SQLContext(sc) .createDataFrame(Seq( LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) .toDF("label", "features")
val logisticModel = new LogisticRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .fit(sparkTrainingData)
println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987
val linearModel = new LinearRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .setSolver("l-bfgs") .fit(sparkTrainingData)
println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323
sc.stop() } }
Thanks,
Frank
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
I think the liblinear solver (default in LogisticRegression) does regularize the bias. So, even if both solutions (sklearn, spark) anneal to the global cost optimum, the model parameters would be different. Maybe a better way to make that comparison would be to turn off regularization completely for now. And when you run the LogisticRegression, maybe run it multiple times with different random seeds to see if your solutions are generally stable. Best, Sebastian
On Mar 13, 2017, at 1:06 PM, Stuart Reynolds <stuart@stuartreynolds.net> wrote:
Both libraries are heavily parameterized. You should check what the defaults are for both.
Some ideas: - What regularization is being used. L1/L2? - Does the regularization parameter have the same interpretation? 1/C = lambda? Some libraries use C. Some use lambda. - Also, some libraries regularize the intercept (scikit), other do not. (It doesn't seem like a particularly good idea to regularize the intercept if your optimizer permits not doing it).
On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn <scikit-learn@python.org> wrote: (this was also posted to stackoverflow on 03/10)
I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...).
No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart?
I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run.
scikit-learn code: ----------------------
import numpy as np from sklearn.linear_model import LogisticRegression, Ridge
X = np.array([ [-0.7306653538519616, 0.0], [0.6750417712898752, -0.4232874171873786], [0.1863463229359709, -0.8163423997075965], [-0.6719842051493347, 0.0], [0.9699938346531928, 0.0], [0.22759406190283604, 0.0], [0.9688721028330911, 0.0], [0.5993795346650845, 0.0], [0.9219423508390701, -0.8972778242305388], [0.7006904841584055, -0.5607635619919824] ])
y = np.array([ 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0 ])
m, n = X.shape
# Add intercept term to simulate inputs to GameEstimator X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))
l = 0.3 e = LogisticRegression( fit_intercept=False, penalty='l2', C=1/l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # => [[ 0.98662189 0.45571052 -0.23467255]]
# Linear regression is called Ridge in sklearn e = Ridge( fit_intercept=False, alpha=l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # =>[ 0.32155545 0.17904355 0.41222418]
spark.ml code: -------------------
import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.sql.SQLContext
object TestSparkRegression { def main(args: Array[String]): Unit = { import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("test").setMaster("local") val sc = new SparkContext(conf)
val sparkTrainingData = new SQLContext(sc) .createDataFrame(Seq( LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) .toDF("label", "features")
val logisticModel = new LogisticRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .fit(sparkTrainingData)
println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987
val linearModel = new LinearRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .setSolver("l-bfgs") .fit(sparkTrainingData)
println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323
sc.stop() } }
Thanks,
Frank
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
sklearn's (and hence liblinear's) intercept is not being used here, but a feature is added in Python to represent the bias, so it's being regularised in any case. On 16 March 2017 at 14:27, Sebastian Raschka <se.raschka@gmail.com> wrote:
I think the liblinear solver (default in LogisticRegression) does regularize the bias. So, even if both solutions (sklearn, spark) anneal to the global cost optimum, the model parameters would be different. Maybe a better way to make that comparison would be to turn off regularization completely for now. And when you run the LogisticRegression, maybe run it multiple times with different random seeds to see if your solutions are generally stable.
Best, Sebastian
On Mar 13, 2017, at 1:06 PM, Stuart Reynolds <stuart@stuartreynolds.net> wrote:
Both libraries are heavily parameterized. You should check what the defaults are for both.
Some ideas: - What regularization is being used. L1/L2? - Does the regularization parameter have the same interpretation? 1/C = lambda? Some libraries use C. Some use lambda. - Also, some libraries regularize the intercept (scikit), other do not. (It doesn't seem like a particularly good idea to regularize the intercept if your optimizer permits not doing it).
On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn < scikit-learn@python.org> wrote: (this was also posted to stackoverflow on 03/10)
I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...).
No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart?
I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run.
scikit-learn code: ----------------------
import numpy as np from sklearn.linear_model import LogisticRegression, Ridge
X = np.array([ [-0.7306653538519616, 0.0], [0.6750417712898752, -0.4232874171873786], [0.1863463229359709, -0.8163423997075965], [-0.6719842051493347, 0.0], [0.9699938346531928, 0.0], [0.22759406190283604, 0.0], [0.9688721028330911, 0.0], [0.5993795346650845, 0.0], [0.9219423508390701, -0.8972778242305388], [0.7006904841584055, -0.5607635619919824] ])
y = np.array([ 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0 ])
m, n = X.shape
# Add intercept term to simulate inputs to GameEstimator X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis]))
l = 0.3 e = LogisticRegression( fit_intercept=False, penalty='l2', C=1/l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # => [[ 0.98662189 0.45571052 -0.23467255]]
# Linear regression is called Ridge in sklearn e = Ridge( fit_intercept=False, alpha=l, max_iter=100, tol=1e-11)
e.fit(X_with_intercept, y)
print e.coef_ # =>[ 0.32155545 0.17904355 0.41222418]
spark.ml code: -------------------
import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.sql.SQLContext
object TestSparkRegression { def main(args: Array[String]): Unit = { import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("test").setMaster("local") val sc = new SparkContext(conf)
val sparkTrainingData = new SQLContext(sc) .createDataFrame(Seq( LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) .toDF("label", "features")
val logisticModel = new LogisticRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .fit(sparkTrainingData)
println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987
val linearModel = new LinearRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .setSolver("l-bfgs") .fit(sparkTrainingData)
println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323
sc.stop() } }
Thanks,
Frank
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
participants (4)
-
Frank Astier -
Joel Nothman -
Sebastian Raschka -
Stuart Reynolds