<div dir="ltr">another reason is that we take as threshold the mid point between sample values<div>which is not invariant to arbitrary scaling of the features</div><div><br></div><div>Alex</div><div><br><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître <<a href="mailto:g.lemaitre58@gmail.com">g.lemaitre58@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="background-color:rgb(255,255,255);background-image:initial;line-height:initial"><div id="gmail-m_-1128988392826413229response_container_BBPPID" style="outline:none" dir="auto"> <div name="BB10" id="gmail-m_-1128988392826413229BB10_response_div_BBPPID" dir="auto" style="width:100%">Even with the same random state, it can happen that several features will lead to a best split and this split is chosen randomly (even with the seed fixed - this is reported as an issue I think). Therefore, the rest of the tree could be different leading to different prediction. </div><div name="BB10" id="gmail-m_-1128988392826413229BB10_response_div_BBPPID" dir="auto" style="width:100%"><br></div><div name="BB10" id="gmail-m_-1128988392826413229BB10_response_div_BBPPID" dir="auto" style="width:100%">Another possibility is that we compute the difference between the current threshold and the next to be tried and only check the entropy if it is larger than a specific value (I would need to check the source code). After scaling, it could happen that 2 feature values become too closed to be considered as a potential split which will make a difference between scaled and scaled features. But this diff should be really small. </div><div name="BB10" id="gmail-m_-1128988392826413229BB10_response_div_BBPPID" dir="auto" style="width:100%"><br></div><div name="BB10" id="gmail-m_-1128988392826413229BB10_response_div_BBPPID" dir="auto" style="width:100%">This is the what I can think on the top of the head. </div> <div name="BB10" id="gmail-m_-1128988392826413229response_div_spacer_BBPPID" dir="auto" style="width:100%"> <br style="display:initial"></div> <div id="gmail-m_-1128988392826413229blackberry_signature_BBPPID" name="BB10" dir="auto"> <div id="gmail-m_-1128988392826413229_signaturePlaceholder_BBPPID" name="BB10" dir="auto"><p dir="ltr">Sent from my phone - sorry to be brief and potential misspell. 
From: geoffrey.bolmier@gmail.com
Sent: 22 October 2019 11:34
To: scikit-learn@python.org
Reply to: scikit-learn@python.org
Subject: [scikit-learn] Decision tree results sometimes different with scaled data

Hi all,

First, let me thank you for the great job you guys are doing developing and
maintaining such a popular library!

As we all know, decision trees are not affected by scaling the data, because
splits do not take into account the distance between two values within a
feature.

However, I ran into a strange behaviour with the sklearn decision tree
algorithm: sometimes the model's results differ depending on whether the input
data has been scaled or not.

To illustrate my point I ran experiments on the iris dataset consisting of:

- performing a train/test split
- fitting on the training set and predicting on the test set
- fitting and predicting again with standardized inputs (removing the mean and
  scaling to unit variance)
- comparing both models' predictions

The experiment was run 10,000 times with different random seeds (cf. the
traceback and the code to reproduce it at the end).
The results showed that a bit more than 10% of the time, at least one
prediction differs. Fortunately, when that happens only a few predictions
differ, 1 or 2 most of the time.
I checked the inputs causing different predictions and they are not the same
from run to run.

I'm worried that the rate of different predictions could be larger for other
datasets...
Do you have an idea where it comes from, maybe floating point errors, or am I
doing something wrong?

Cheers,
Geoffrey

------------------------------------------------------------
Traceback:
------------------------------------------------------------
Error rate: 12.22%

Seed: 241862
All pred equal: False
Unscaled data confusion matrix:
[[16  0  0]
 [ 0 17  0]
 [ 0  4 13]]
Scaled data confusion matrix:
[[16  0  0]
 [ 0 15  2]
 [ 0  4 13]]
------------------------------------------------------------
Code:
------------------------------------------------------------
import numpy as np

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


X, y = load_iris(return_X_y=True)


def run_experiment(X, y, seed):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        stratify=y,
        test_size=0.33,
        random_state=seed
    )

    scaler = StandardScaler()

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    clf = DecisionTreeClassifier(random_state=seed)
    clf_scaled = DecisionTreeClassifier(random_state=seed)

    clf.fit(X_train, y_train)
    clf_scaled.fit(X_train_scaled, y_train)

    pred = clf.predict(X_test)
    pred_scaled = clf_scaled.predict(X_test_scaled)

    err = 0 if all(pred == pred_scaled) else 1

    return err, y_test, pred, pred_scaled


n_err, n_run, seed_err = 0, 10000, None

for _ in range(n_run):
    seed = np.random.randint(10000000)
    err, _, _, _ = run_experiment(X, y, seed)
    n_err += err

    # keep aside the last seed causing an error
    seed_err = seed if err == 1 else seed_err


print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')

_, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)

print(f'Seed: {seed_err}')
print(f'All pred equal: {all(pred == pred_scaled)}')
print(f'Unscaled data confusion matrix:\n{confusion_matrix(y_test, pred)}')
print(f'Scaled data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')
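As a possible way to see where the two fitted models actually diverge, the
sketch below inspects the trees directly. It assumes the clf, clf_scaled and
scaler objects from the script above are still in scope, and it only uses the
public tree_, mean_ and scale_ attributes; it is an illustration, not a full
diagnosis:

import numpy as np

tree, tree_s = clf.tree_, clf_scaled.tree_

# The two trees may already differ in shape or in the features they split on
# (e.g. when several features give an equally good split).
print('Same number of nodes:', tree.node_count == tree_s.node_count)
print('Same split features: ', np.array_equal(tree.feature, tree_s.feature))

if tree.node_count == tree_s.node_count and np.array_equal(tree.feature, tree_s.feature):
    # Undo the standardization on the scaled tree's thresholds; leaf nodes
    # (feature == -2) are excluded.
    split = tree.feature >= 0
    back = (tree_s.threshold[split] * scaler.scale_[tree.feature[split]]
            + scaler.mean_[tree.feature[split]])
    # Any non-zero gap here is floating-point rounding of the midpoint thresholds.
    print('Max threshold gap:', np.max(np.abs(back - tree.threshold[split])))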
Mailspring" width="0" height="0" style="border: 0px; width: 0px; min-height: 0px;" src="https://link.getmailspring.com/open/04CA54E6-6894-4104-BB66-9A9FE89EED7F@getmailspring.com?me=2e3ab824&recipient=c2Npa2l0LWxlYXJuQHB5dGhvbi5vcmc%3D"></div></div>_______________________________________________<br>
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn