Query about use of standard deviation on tree feature_importances_ in demo plot_forest_importances.html
Hi all. I'm looking at the code behind one of the tree ensemble demos: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importance... and I'm unsure about the error bars. They are calculated using the standard deviation of the feature_importances_ attribute across trees. Can we depend on this being a Normal distribution? I'm wondering whether the plot tells enough of the story to be genuinely useful.

I don't have a strong belief about the likely distribution of feature_importances_, and I haven't dug into how the feature importances are calculated (frankly, I'm a bit lost here). I know that in an RF regression case I'm working on I can see unimodal and bimodal feature importance distributions; this came up in a discussion on the yellowbrick sklearn visualisation package: https://github.com/DistrictDataLabs/yellowbrick/pull/195

I don't know what is "normal" for feature importances, or whether they look different between classification tasks (as in the plot_forest_importances demo) and regression tasks. Maybe I've got an outlier in my task? If I use the provided demo code then my error bars can go negative, which feels unhelpful. Does anyone have an opinion?

Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic value?

I saw Sebastian Raschka's reference to Gilles Louppe et al.'s NIPS paper (in here, 2016-05-17) on variable importances; I'll dig into that if nobody has a strong opinion. BTW Sebastian - thanks for writing your book.

Cheers, Ian.

--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian@IanOzsvald.com
http://IanOzsvald.com
http://ModelInsight.io
http://twitter.com/IanOzsvald
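[For context, here is a minimal sketch of the kind of computation the demo performs - not its exact code, and the dataset and parameters below are made up for illustration. It shows how the std-based lower error bar can dip below zero even though importances are non-negative:]

```python
# Minimal sketch (illustrative, not the demo's exact code) of std-based
# error bars on per-tree feature importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# One row per tree: each tree's normalised importance vector.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

# Symmetric error bars mean the lower end can be negative even though
# the importances themselves are always >= 0:
print("any negative lower bar?", (mean_imp - std_imp < 0).any())
```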
+1 for changing this example to have error bars represent the 5th & 95th percentiles or the 25th and 75th percentiles (quartiles). Or even bootstrapped confidence intervals of the mean feature importance for each variable. This might be a bit too verbose for an example, though.
> Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic value?
Yes. Otherwise people might be over-confident in the stability of those feature importances.

--
Olivier
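[The percentile variant suggested above could be sketched roughly as follows - a hedged illustration on made-up data, not a proposed patch. Percentile-based bars are asymmetric and can never extend below zero:]

```python
# Sketch of percentile-based error bars (5th/95th) on per-tree feature
# importances; dataset and parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=250, random_state=0).fit(X, y)
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

median = np.percentile(per_tree, 50, axis=0)
lower = np.percentile(per_tree, 5, axis=0)
upper = np.percentile(per_tree, 95, axis=0)

# Asymmetric error bars in the shape matplotlib's bar() expects, e.g.
#   plt.bar(range(len(median)), median, yerr=yerr)
yerr = np.vstack([median - lower, upper - median])
```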
Good. I'd suggested a box plot or use of the IQR (on a bar chart) on the yellowbrick list. I was assuming that if the distribution of feature importances contains many '0's, that might indeed be worth highlighting as a diagnostic. Cheers, Ian.

On 23 June 2017 at 18:51, Olivier Grisel <olivier.grisel@ensta.org> wrote:
> +1 for changing this example to have error bars represent the 5th & 95th percentiles or the 25th and 75th percentiles (quartiles).
> Or even bootstrapped confidence intervals of the mean feature importance for each variable. This might be a bit too verbose for an example, though.
>
>> Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic value?
>
> Yes. Otherwise people might be over-confident in the stability of those feature importances.
> --
> Olivier
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian@IanOzsvald.com
http://IanOzsvald.com
http://ModelInsight.io
http://twitter.com/IanOzsvald
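[The "many zeros" diagnostic mentioned above might look something like this sketch - made-up data, and the near-zero threshold of 1e-12 is an arbitrary assumption. A feature gets zero importance in a tree when it is never chosen for a split in that tree:]

```python
# Sketch of a 'many zeros' diagnostic: the fraction of trees in which
# each feature receives (near-)zero importance. Threshold is assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=250, random_state=0).fit(X, y)
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

zero_frac = (per_tree < 1e-12).mean(axis=0)
for i, f in enumerate(zero_frac):
    print("feature %d: zero importance in %.0f%% of trees" % (i, 100 * f))
# A box plot of the same per-tree data would be e.g. plt.boxplot(per_tree)
```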