[scikit-learn] Confidence interval estimation for probability estimators
Stuart Reynolds
stuart at stuartreynolds.net
Tue Oct 3 13:48:16 EDT 2017
Let's say I have a base estimator that predicts the probability of a
binary (Bernoulli) outcome:
model.fit(X, y) where y contains [0 or 1]
P = model.predict(X) or model.predict_proba(X) gives values in the range [0, 1]
(model here might be a calibrated LogisticRegression model).
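For concreteness, a minimal sketch of the kind of setup I mean
(CalibratedClassifierCV over LogisticRegression is just one possible choice):

from sklearn.datasets import make_classification
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Stand-in data and model; y contains 0/1 labels.
X, y = make_classification(n_samples=2000, random_state=0)
model = CalibratedClassifierCV(LogisticRegression(), cv=5)
model.fit(X, y)
P = model.predict_proba(X)[:, 1]  # estimated P(y=1 | x), each value in [0, 1]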
Is there a way to estimate confidences for the rows in P?
It seems like this can be done with Gaussian Process Regression for
regression tasks:
https://stats.stackexchange.com/questions/169995/why-does-my-train-data-not-fall-in-confidence-interval-with-scikit-learn-gaussia
For regression tasks, I think this method could be used to wrap other
models and estimate their confidence.
For example, it looks like we can do:
gp = GaussianProcessRegressor(...)
gp.fit(model.predict(X).reshape(-1, 1), y)
ypred, sigma = gp.predict(model.predict(X).reshape(-1, 1), return_std=True)
to give us an estimate of the confidence in the output of model, *for
regression*.
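Spelled out end to end for a plain regression target (synthetic data, default
GP kernel, and the 1.96 z-value are only placeholder choices), it would look
something like:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
base = LinearRegression().fit(X, y)          # any base regression model

# Fit a GP on the base model's (1-D) predictions vs. the true targets.
gp = GaussianProcessRegressor(alpha=1e-3)    # small diagonal jitter for stability
gp.fit(base.predict(X).reshape(-1, 1), y)
ypred, sigma = gp.predict(base.predict(X).reshape(-1, 1), return_std=True)

# Symmetric normal-approximation 95% interval per row -- exactly the kind of
# interval that becomes questionable when the target is a probability near 0 or 1.
lower, upper = ypred - 1.96 * sigma, ypred + 1.96 * sigma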
I'd like the same for probability estimates. However, I don't think
the above works directly:
- my outcomes are constrained to [0, 1] (the GP regressor's predictions are not)
- using a normal approximation to obtain confidence intervals for
Bernoulli processes can lead to some pretty awful estimates,
particularly for probabilities close to 0 or 1
- the above example gives a single sigma value. For constrained
outputs, the CI is not symmetric (the bound closer to 0.5 should be
further from the probability prediction than the bound closer to 0 or 1).
I was hoping that GaussianProcessClassifier might be able to generate
intervals, but I don't see how.
My current approach is:
- for some prediction p,
- pick y_p from y: the rows whose predictions are close to p,
- for this sample, estimate the CI with
statsmodels.stats.proportion.proportion_confint(
    sum(y_p), len(y_p), alpha=1 - ciwidth,
    method="wilson")  # or "jeffreys" -- "normal" and "beta" are broken for p close to 0 or 1
This works OK, but is quite slow and not very data efficient.
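In case it helps, roughly what that looks like in code (the window width,
function name, and CI level here are arbitrary illustrative choices):

import numpy as np
from statsmodels.stats.proportion import proportion_confint

def local_wilson_ci(p, preds, y, width=0.02, alpha=0.05):
    """Wilson CI from the outcomes of rows whose prediction is near p."""
    mask = np.abs(preds - p) <= width   # rows predicted close to p
    y_p = y[mask]
    # proportion_confint returns (lower, upper); note nobs is 0 if the window is empty.
    return proportion_confint(y_p.sum(), len(y_p), alpha=alpha, method="wilson")

# e.g., preds = model.predict_proba(X)[:, 1]
# lo, hi = local_wilson_ci(preds[0], preds, y)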
Any thoughts?
Thanks,
- Stuart