[scikit-learn] [ANN] Scikit-learn 0.20.0

Javier López jlopez at ende.cc
Fri Sep 28 16:45:16 EDT 2018

On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller <t3kcit at gmail.com> wrote:

> Basically what you're saying is that you're fine with versioning the
> models and having the model break loudly if anything changes.
> That's not actually what most people want. They want to be able to make
> predictions with a given model for ever into the future.

Are we talking about "(the new version of) the old model can still make
predictions" or "the old model makes exactly the same predictions as
before"? I'd like the first to hold, don't care that much about the second.

> Your use-case is similar, but if retraining the model is not an issue,
> why don't you want to retrain every time scikit-learn releases a new
> version?

Thousands of models. I don't want to retrain ALL of them unless it's actually needed.

> We're now storing the version of scikit-learn that was used in the
> pickle and warn if you're trying to load with a different version.

This is not the whole truth. Yes, you store the sklearn version in the
pickle and raise a warning; I am mostly OK with that, but the pickles are
brittle and often stop loading when the versions of other dependencies
change. I am not talking about "Warning: wrong version", but rather
"UnpicklingError: expected bytes, found tuple" errors that prevent the file
from loading entirely.
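To be concrete, the versioning scheme I am fine with looks roughly like this (a stdlib-only sketch; `LIBRARY_VERSION` stands in for `sklearn.__version__`, and the function names are mine, not sklearn's actual implementation):

```python
import pickle
import warnings

# Stand-in for sklearn.__version__ (hypothetical, for illustration only).
LIBRARY_VERSION = "0.20.0"

def dump_with_version(model, path, version=LIBRARY_VERSION):
    # Store the fitted model together with the library version used to fit it.
    with open(path, "wb") as f:
        pickle.dump({"library_version": version, "model": model}, f)

def load_with_version(path, current_version=LIBRARY_VERSION):
    # Note: pickle.load itself can still fail hard (UnpicklingError) if
    # unrelated dependencies changed -- that is the brittleness I mean.
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["library_version"] != current_version:
        warnings.warn(
            "Model pickled with version %s, now running %s"
            % (payload["library_version"], current_version)
        )
    return payload["model"]
```

The version-mismatch warning only fires when unpickling succeeds at all; nothing here protects against the low-level failures described above.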

> That's basically a stricter test than what you wanted. Yes, there are
> false positives, but given that this release took a year,
> this doesn't seem that big an issue?

1. Things in the current state break when something else changes, not only
when scikit-learn itself does.
2. Sharing pickles is a bad practice for a number of reasons.
3. We might want to explore model parameters without having to load the
entire runtime.

Also, in order to retrain the model we need to keep the whole model
description with its parameters. This needs to be saved somewhere, which in
the current state would force us to keep two files: one with the parameters
(in a text format, to avoid the "non-loading" problems from above) and the
pkl with the fitted model. My proposal would keep both in a single file.
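A minimal sketch of that single-file idea, using only the stdlib (the layout and function names are mine, not a concrete API proposal): the first line of the file holds the parameters as JSON, readable without unpickling anything, and the rest holds the pickled fitted model.

```python
import json
import pickle

def save_bundle(params, fitted_model, path):
    # First line: human-readable parameters (survive version changes).
    # Remainder of the file: the pickled fitted model.
    with open(path, "wb") as f:
        f.write(json.dumps(params).encode("utf-8") + b"\n")
        f.write(pickle.dumps(fitted_model))

def read_params(path):
    # Inspect the parameters without touching the pickle at all.
    with open(path, "rb") as f:
        return json.loads(f.readline().decode("utf-8"))

def load_model(path):
    # Full load: skip the JSON header, then unpickle the fitted state.
    with open(path, "rb") as f:
        f.readline()
        return pickle.load(f)
```

With something like this, `read_params` still works even when the pickled part no longer loads, which covers point 3 above.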

As mentioned in previous emails, we already have our own solution that
kind-of-works for our needs, but we have to do a few hackish things to keep
things running. If sklearn estimators simply included a text serialization
method (similar in spirit to the one used for __display__ or __repr__) it
would make things easier.
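What I have in mind is something like the following toy stand-in for an estimator (`to_text`/`from_text` are hypothetical names, not an existing sklearn API; a real version would build on `get_params`/`set_params`):

```python
import json

class TinyEstimator:
    # Toy stand-in for a sklearn-style estimator with get_params().
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

    def get_params(self):
        return {"alpha": self.alpha, "fit_intercept": self.fit_intercept}

    def to_text(self):
        # Text serialization of class name + parameters,
        # similar in spirit to __repr__.
        return json.dumps({"class": type(self).__name__,
                           "params": self.get_params()})

    @classmethod
    def from_text(cls, text):
        # Rebuild an unfitted estimator from the text form.
        spec = json.loads(text)
        return cls(**spec["params"])
```

The round-trip through `to_text`/`from_text` is exactly the kind of stable, human-readable description that would let us re-instantiate and retrain a model years later, independent of pickle.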

But I understand that not everyone's needs are the same, so if you guys
don't consider this type of thing a priority, we can live with that :) I
mostly mentioned it since "Backwards-compatible de/serialization of some
estimators" is listed in the roadmap as a desirable goal for version 1.0,
and feedback on that roadmap was requested.
