[scikit-learn] Smoke and Metamorphic Testing of scikit-learn

Steffen Herbold herbold at cs.uni-goettingen.de
Thu Aug 23 07:39:01 EDT 2018

Hi Andy,

thanks for your detailed feedback.

The random states are fixed, and set immediately before calling the fit 
function. Here is a gist with the code for one smoke test and one 
metamorphic test [1].
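
To illustrate what fixing the random state right before fit means in 
practice, here is a minimal sketch of a smoke test. It is not the code 
from the gist: the data size, the seed, the choice of 
DecisionTreeClassifier, and the helper name smoke_test_uniform are 
assumptions made only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def smoke_test_uniform(seed=42, n_samples=100, n_features=10):
    """Fit and predict twice on uniform data in [0, 1] with a fixed seed."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    y = rng.randint(0, 2, size=n_samples)  # random binary labels (assumption)

    predictions = []
    for _ in range(2):
        # The random state is fixed right before fit, here via the constructor.
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(X, y)
        predictions.append(clf.predict(X))
    return y, predictions[0], predictions[1]


y, first, second = smoke_test_uniform()
print("identical predictions across runs:", np.array_equal(first, second))
```

With the seed fixed, both runs should give identical predictions, so a 
difference observed in a metamorphic test can be attributed to the 
modified training data and not to randomness.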

I will run the tests for LinearDiscriminantAnalysis and the 
SGDClassifier. I somehow missed them when I scanned the documentation.
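
For these two classifiers, a check for the reordering of instances 
could look roughly like the sketch below. This is only an illustration 
and not the generated test code: the data, the seed, and the helper 
name changed_fraction_after_instance_reorder are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier


def changed_fraction_after_instance_reorder(make_classifier, seed=42):
    """Fraction of predictions that change when training rows are shuffled."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(0.0, 1.0, size=(200, 5))
    y = (X[:, 0] > 0.5).astype(int)  # simple separable labels (assumption)
    order = rng.permutation(X.shape[0])

    # Same classifier configuration, trained on original and shuffled rows.
    original = make_classifier().fit(X, y)
    reordered = make_classifier().fit(X[order], y[order])
    return float(np.mean(original.predict(X) != reordered.predict(X)))


results = {
    "LinearDiscriminantAnalysis": changed_fraction_after_instance_reorder(
        LinearDiscriminantAnalysis
    ),
    "SGDClassifier": changed_fraction_after_instance_reorder(
        lambda: SGDClassifier(random_state=42)
    ),
}
for name, fraction in results.items():
    print(f"{name}: {fraction:.3f} of predictions changed")
```

Reporting a fraction instead of a pass/fail result also helps to judge 
how large a difference is.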

I know that these problems should sometimes be expected. However, I was 
actually not sure what to expect, especially after I started to look at 
the results for different ML libraries in comparison. The random forests 
you brought up are a good example. I also expected them to be dependent on 
feature/instance order. However, they are not in Weka, only in 
scikit-learn and Spark MLlib. There are more such examples, like 
logistic regression, which exhibits different behavior in all three 
libraries.
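
The feature reordering relation can be sketched as follows. Again, this 
is only an illustration and not the generated test code: the data, the 
seed, the forest size, and the helper name are assumptions. The sketch 
trains two forests with the same random state, one on the original 
columns and one on permuted columns, and reports the fraction of test 
instances whose prediction changes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def changed_fraction_after_feature_reorder(seed=42):
    """Fraction of test predictions that change when columns are permuted."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(0.0, 1.0, size=(200, 10))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # labels from two features
    X_test = rng.uniform(0.0, 1.0, size=(200, 10))
    perm = rng.permutation(X.shape[1])

    # Same hyperparameters and random state; only the column order differs.
    original = RandomForestClassifier(n_estimators=10, random_state=seed)
    reordered = RandomForestClassifier(n_estimators=10, random_state=seed)
    original.fit(X, y)
    reordered.fit(X[:, perm], y)

    pred_original = original.predict(X_test)
    pred_reordered = reordered.predict(X_test[:, perm])
    return float(np.mean(pred_original != pred_reordered))


fraction = changed_fraction_after_feature_reorder()
print(f"fraction of changed predictions: {fraction:.3f}")
```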

I already have a comparison regarding expected differences between 
machine learning frameworks planned as a topic for future work.


[1] https://gist.github.com/sherbold/570c9399e9bc39dd980d6c2bdbf3b64a

On 22.08.2018 at 17:49, Andreas Mueller wrote:
> Hi Steffen.
> Thanks for sharing your analysis. We really need more work in this 
> direction.
> I assume you fixed the random states everywhere?
> I consider these tests helpful but not all your expectations are 
> warranted depending on the model.
> If you add one to each feature, there is no expectation that results 
> will be the same, except for the tree models.
> For tree-based models with fixed random states, however, it's expected 
> that reordering features will change the result.
> For non-convex optimization it's expected that results are not 
> symmetric (i.e. the MLPClassifier will not flip
> the decision function because the optimization is initialized in an 
> asymmetric way), and reordering features will
> also change the result. If using mini-batches (the default) the 
> results will also change when instances are reordered.
> I assume you didn't test SGDClassifier or any of its derivatives 
> because it doesn't show up here. Did you test LinearDiscriminantAnalysis?
> For the invariance tests it would be interesting to know if they are 
> due to tie-breaking or numerical issues.
> There are some numerical issues that are very hard to control, and I'm 
> pretty sure we have asymmetric tie-breaking
> (multiclass libsvm is "always predict the first class" 
> https://github.com/scikit-learn/scikit-learn/issues/8276 )
> I would look at QuadraticDiscriminantAnalysis a bit more closely as a 
> consequence of your tests.
> Maybe check if the SVM, RF and KNN issues are due to tie-breaking.
> We could try and document all the cases where the result will not 
> fulfill these invariances, but I think that might be too much.
> At some point we need the users to understand what's going on. If you 
> look at the random forest algorithm and you fix
> the random state it's obvious that feature order matters.
> A big question here is how big the differences are. Some algorithms 
> are randomized (I think the coordinate descent in
> some of the linear models uses random orders), but the results are 
> expected to be near-identical, independent of the ordering.
> Cheers,
> Andy
> On 8/22/18 7:12 AM, Steffen Herbold wrote:
>> Dear developers,
>> I am writing to you because I applied an approach for the automated 
>> testing of classification algorithms to scikit-learn and would like 
>> to forward the results to you.
>> The approach is a combination of smoke testing and metamorphic 
>> testing. The smoke tests try to find problems by executing the 
>> training and prediction functions of classifiers with different data. 
>> These smoke tests should ensure the basic functioning of classifiers. 
>> I defined 20 different data sets, some very simple (uniform features 
>> in [0,1]), some with extreme distributions, e.g., data close to 
>> machine precision. The metamorphic tests determine if classification 
>> results change as expected if the training data is modified, e.g., by 
>> reordering features, flipping class labels, or reordering instances.
>> I generated 70 different Python unittest tests for eleven different 
>> scikit-learn classifiers. In summary, I found the following potential 
>> problems:
>> - Two errors due to possibly infinite loops for 
>> LogisticRegression for data that approaches MAXDOUBLE.
>> - The classification of LogisticRegression, MLPClassifier, 
>> QuadraticDiscriminantAnalysis, and SVM with a polynomial kernel 
>> changed if one is added to each feature value.
>> - The classification of DecisionTreeClassifier, LogisticRegression, 
>> MLPClassifier, QuadraticDiscriminantAnalysis, RandomForestClassifier, 
>> and SVM with a linear and a polynomial kernel was not inverted when 
>> all binary class labels are flipped.
>> - The classification of LogisticRegression, MLPClassifier, 
>> QuadraticDiscriminantAnalysis, and RandomForestClassifier sometimes 
>> changed when the features are reordered.
>> - The classification of KNeighborsClassifier, MLPClassifier, 
>> QuadraticDiscriminantAnalysis, RandomForestClassifier, and SVM with a 
>> linear kernel sometimes changed when the instances are reordered.
>> You can find details of our results online [1]. The provided 
>> resources include the current draft of the paper that describes the 
>> tests as well as the detailed results. Moreover, we provide an 
>> executable test suite with all tests we executed, as well as the 
>> export of our test results as an XML file that contains all details of 
>> the test execution, including stack traces in case of exceptions. The 
>> preprint and online materials also contain the results for two other 
>> machine learning libraries, i.e., Weka and Spark MLlib. Additionally, 
>> you can find the atoml tool used to generate the tests on GitHub [2].
>> I hope that these tests may help with the future development of 
>> scikit-learn. You could help me a lot by answering the following 
>> questions:
>> - Do you consider the tests helpful?
>> - Do you consider any source code or documentation changes due to our 
>> findings?
>> - Would you be interested in a pull request or any other type of 
>> integration of (a subset of) the tests into your project?
>> - Would you be interested in more such tests, e.g., for the 
>> consideration of hyperparameters, other algorithm types like 
>> clustering, or more complex algorithm-specific metamorphic tests?
>> I am looking forward to your feedback.
>> Best regards,
>> Steffen Herbold
>> [1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
>> [2] https://github.com/sherbold/atoml
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herbold at cs.uni-goettingen.de
tel. +49 551 39-172037
