[scikit-learn] Confidence interval estimation for probability estimators

Stuart Reynolds stuart at stuartreynolds.net
Tue Oct 3 13:48:16 EDT 2017


Let's say I have a base estimator that predicts the likelihood of a
binary (Bernoulli) outcome:
  model.fit(X, y)  # y contains 0 or 1
  P = model.predict(X)  # or predict_proba(X); gives values in the range [0, 1]
(model here might be a calibrated LogisticRegression model).

Is there a way to estimate confidences for the rows in P?

It seems like this can be done with Gaussian Process Regression for
regression tasks:
https://stats.stackexchange.com/questions/169995/why-does-my-train-data-not-fall-in-confidence-interval-with-scikit-learn-gaussia
For regression tasks I think this method could be used to wrap other
models and estimate the confidence.
For example, it looks like we can do:
  gp = GaussianProcessRegressor(..)
  gp.fit(model.predict(X).reshape(-1, 1), y)  # GP expects 2-D input
  ypred, sigma = gp.predict(model.predict(X).reshape(-1, 1), return_std=True)
to give us an estimate of the confidence in the output of model, *for
regression*.

I'd like the same for probability estimates. However, I don't think
the above works directly:
 - my outcome is constrained to the range 0..1 (the GP Regressor's
output is not)
 - using the normal approximation to obtain confidence intervals for
Bernoulli processes can lead to some pretty awful estimates,
particularly for probabilities close to 0 or 1.
 - the above example gives a single sigma value. For constrained
outputs, the CI is not symmetric (the bound closer to 0.5 should be
further from the probability prediction than the bound closer to 0 or
1).
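To make the last two points concrete, here is a minimal sketch comparing
the normal-approximation (Wald) interval with the Wilson score interval
for a small sample. The function names and the ci_width parameter are
made up for illustration, and the formulas are the textbook ones written
out with the standard library only, not a call into scikit-learn or
statsmodels.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, ci_width=0.95):
    """Normal-approximation interval: symmetric around p_hat, may leave [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + ci_width / 2)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(successes, n, ci_width=0.95):
    """Wilson score interval: stays inside [0, 1], asymmetric near the edges."""
    z = NormalDist().inv_cdf(0.5 + ci_width / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z / denom * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp only to guard against floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# One success in 50 trials (p_hat = 0.02) versus 25 in 50 (p_hat = 0.5).
for k, n in [(1, 50), (25, 50)]:
    print(k, n, "wald:", wald_interval(k, n), "wilson:", wilson_interval(k, n))
```

For 1 success in 50 trials the Wald lower bound falls below 0, while the
Wilson interval stays inside [0, 1] and extends further on the side
facing 0.5.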

I was hoping that GaussianProcessClassifier might be able to generate
intervals, but I don't see how.

My current approach is:

 - for some prediction p,
     - pick y_p from y, the rows whose predictions are close to p
     - for this sample, estimate the CI with
       statsmodels.stats.proportion.proportion_confint(
           sum(y_p), len(y_p), alpha=1-ciwidth, method="wilson")
       # or method="jeffreys"; "normal" and "beta" are broken for p
       # close to 0 or 1

Which works OK, but is quite slow and not very data efficient.
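Roughly, the approach above can be sketched as follows. This is only an
illustration under assumptions: "close to p" is taken to mean a window
of k rows around p after sorting by prediction, the names
local_interval, k and ci_width are hypothetical, and the Wilson interval
is written out by hand as a stand-in for proportion_confint so the
sketch needs only the standard library. Sorting once and using a binary
search per query avoids rescanning all rows for every p.

```python
from bisect import bisect_left
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, ci_width=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + ci_width / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z / denom * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp only to guard against floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def local_interval(p, sorted_preds, sorted_outcomes, k=200, ci_width=0.95):
    """Interval for prediction p from a window of up to k rows around it.

    sorted_preds must be ascending, with sorted_outcomes in the same order.
    """
    i = bisect_left(sorted_preds, p)
    # Center the window on p, shifting it inward at either end of the data.
    lo = max(0, min(i - k // 2, len(sorted_preds) - k))
    hi = min(len(sorted_preds), lo + k)
    y_p = sorted_outcomes[lo:hi]
    return wilson_interval(sum(y_p), len(y_p), ci_width)


# Toy data (made up): predictions on a grid, outcome is 1 iff prediction >= 0.5.
preds = [i / 1000 for i in range(1000)]
outcomes = [1 if q >= 0.5 else 0 for q in preds]

print(local_interval(0.1, preds, outcomes, k=100))
print(local_interval(0.9, preds, outcomes, k=100))
```

The window size k trades off locality against interval width, which is
the data-efficiency problem mentioned above: every query only uses k
rows.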


Any thoughts?

Thanks,
- Stuart

