[scikit-learn] [ANN] Scikit-learn 0.20.0
Javier López
jlopez at ende.cc
Fri Sep 28 15:20:04 EDT 2018
On Fri, Sep 28, 2018 at 6:41 PM Andreas Mueller <t3kcit at gmail.com> wrote:
> Javier:
> The problem is not so much storing the "model" but storing how to make
> predictions. Different versions could act differently
> on the same data structure - and the data structure could change. Both
> happen in scikit-learn.
> So if you want to make sure the right thing happens across versions, you
> either need to provide serialization and deserialization for
> every version and conversion between those or you need to provide a way
> to store the prediction function,
> which basically means you need a Turing-complete language (that's what
> ONNX does).
>
I understand the difficulty of the situation, but an approximate solution
is to save the predictions on a large enough validation set. If
the predictions of the newly created model are "close enough" to the old
ones, we deem the unserialized model to be the same and move forward; if
there are serious discrepancies, we dive deep to see what's going on
and, if needed, refit the offending submodels with the newer version.
Since we only want to compare the predictions here, we don't need a ground
truth, and thus the validation set doesn't even need to be a real dataset;
it can consist of synthetic data points created via SMOTE, Caruana's MUNGE
algorithm, or any other method, and can be made arbitrarily large in
advance.
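A minimal sketch of that check, standard library only. Every name in it
(make_synthetic_points, save_reference, is_same_model, the JSON layout, the
tolerances) is hypothetical and made up for illustration; none of it is
scikit-learn API. The uniform random points are only a stand-in for
SMOTE/MUNGE-style synthetic data, and `predict` is assumed to be any callable
that maps a list of rows to a list of numeric predictions.

```python
import json
import random


def make_synthetic_points(n_points, n_features, seed=0):
    """Draw uniform random points (a stand-in for SMOTE/MUNGE-style data)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
            for _ in range(n_points)]


def save_reference(path, X, predictions):
    """Store the validation inputs and the old model's predictions as JSON."""
    with open(path, "w") as fh:
        json.dump({"X": X, "predictions": list(predictions)}, fh)


def load_reference(path):
    """Read back the inputs and reference predictions written above."""
    with open(path) as fh:
        ref = json.load(fh)
    return ref["X"], ref["predictions"]


def fraction_disagreeing(old, new, atol=1e-6):
    """Fraction of points whose predictions differ by more than atol."""
    if len(old) != len(new):
        raise ValueError("prediction lists differ in length")
    if not old:
        return 0.0
    n_bad = sum(1 for a, b in zip(old, new) if abs(a - b) > atol)
    return n_bad / len(old)


def is_same_model(predict, path, atol=1e-6, max_fraction=0.0):
    """Re-predict on the stored inputs and compare with the stored outputs.

    Returns True when at most max_fraction of the points disagree, i.e. the
    reloaded or refitted model is "close enough" to the old one.
    """
    X, old = load_reference(path)
    new = predict(X)
    return fraction_disagreeing(old, new, atol) <= max_fraction
```

With the old version you would call save_reference(path, X, model.predict(X))
once; after upgrading, is_same_model(new_model.predict, path) tells you
whether to move on or to investigate. For a real estimator the arrays would
need converting to plain lists first, and what counts as "close enough"
(atol, max_fraction) depends on the model.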
This method has worked reasonably well for us in practice; we deal with
ensembles containing hundreds or thousands of models, and this technique
saves us from having to refit the many that don't change very often. If
something changes a lot, we want to know either way, so we can ascertain
what was amiss (with the old version or with the new one).
What I am proposing is no worse than what we have right now,
which is to save a pickle and then hope that it can be read later on;
sometimes it can, sometimes it cannot, depending on what changed. Stuff
unrelated to the models themselves, such as changes in the joblib dump
method, broke several of our pickle files in the past. What I would like to
have is a text-based representation of the fitted model that can always be
read, stored in a database, or sent over the wire through simple methods.
J