[scikit-learn] [ANN] Scikit-learn 0.20.0

Gael Varoquaux gael.varoquaux at normalesup.org
Tue Oct 2 12:01:41 EDT 2018

On Fri, Sep 28, 2018 at 09:45:16PM +0100, Javier López wrote:
> This is not the whole truth. Yes, you store the sklearn version on the pickle
> and raise a warning; I am mostly ok with that, but the pickles are brittle and
> oftentimes they stop loading when other versions of other stuff change. I am
> not talking about "Warning: wrong version", but rather "Unpickling error:
> expected bytes, found tuple" that prevent the file from loading entirely.
> [...]
> 1. Things in the current state break when something else changes, not only
> sklearn.
> 2. Sharing pickles is a bad practice due to a number of reasons.

The reason that pickles are brittle, and that sharing pickles is bad
practice, is that pickle uses an implicitly defined data model, one that
is defined by the internals of the objects.
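
As a minimal illustration of what "implicitly defined" means here: by
default, the state that pickle saves for an instance is simply its
attribute dictionary, and on load that dictionary is written back without
calling __init__. The classes ModelV1 and ModelV2 below are hypothetical
stand-ins for "the same estimator before and after a refactoring"; they
are not scikit-learn classes, and the load step is reproduced by hand
rather than through pickle.loads.

```python
class ModelV1:
    """Hypothetical estimator as shipped in an 'old' release."""

    def __init__(self, coef):
        self.coef = coef


class ModelV2:
    """The same hypothetical estimator after an attribute was renamed."""

    def __init__(self, weights):
        self.weights = weights

    def total(self):
        return sum(self.weights)


old = ModelV1([1.0, 2.0])

# __reduce_ex__ is the hook pickle calls to decide what to store.
# The third item of the returned tuple is the saved state, which by
# default is just the instance's __dict__: the object's internals.
state = old.__reduce_ex__(2)[2]

# Reproduce pickle's default load behaviour by hand: create the object
# without calling __init__, then update its __dict__ with the state.
restored = ModelV2.__new__(ModelV2)
restored.__dict__.update(state)

# The restored object carries the old attribute name, so the new code
# that expects 'weights' fails when it is used.
try:
    restored.total()
    load_error = None
except AttributeError as exc:
    load_error = exc
```

Nothing in the stored state says that "coef" was renamed to "weights";
the only schema is whatever the class happened to look like when the
object was saved.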

The "right" solution is to use an explicit data model. This is for
instance what is done with an object database. However, this comes at the
cost of making it very hard to change objects. First, all objects must be
stored with a schema (or language) that is rich enough to represent them,
and yet defined somewhat explicitly (to avoid running into the problems
of pickle). Second, if the internal representation of an object changes,
there needs to be explicit conversion code to go from one version to the
next. Typically, upgrades of websites that use an object database require
maintainers to write this conversion code.
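
The two costs described above can be sketched in a few lines. This is only
an illustration under assumptions: the JSON layout, the field names
"coef" and "weights", and the functions dump_model, load_model and
_upgrade_v1_to_v2 are all made up for the example and do not correspond
to anything in scikit-learn.

```python
import json

SCHEMA_VERSION = 2  # version of the explicit schema written by this code


def dump_model(weights):
    """Serialize using an explicit, versioned schema instead of object internals."""
    return json.dumps({"schema_version": SCHEMA_VERSION, "weights": list(weights)})


def _upgrade_v1_to_v2(doc):
    # Hypothetical change: schema version 1 called the field "coef".
    return {"schema_version": 2, "weights": doc["coef"]}


# One hand-written conversion function per schema change.
_UPGRADES = {1: _upgrade_v1_to_v2}


def load_model(text):
    """Load a document, converting it step by step to the current schema."""
    doc = json.loads(text)
    while doc["schema_version"] < SCHEMA_VERSION:
        doc = _UPGRADES[doc["schema_version"]](doc)
    if doc["schema_version"] != SCHEMA_VERSION:
        raise ValueError("document was written by a newer schema version")
    return doc["weights"]
```

Every change to the internal representation means bumping SCHEMA_VERSION
and adding an entry to _UPGRADES, which is exactly the maintenance burden
in question.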

So, the problems of pickle are not specific to pickle, but rather
intrinsic to any generic persistence code [*]. Writing persistence code that
does not run into these problems is very costly in terms of developer time
and makes it harder to add new methods or improve existing ones. I am not
excited about it.

Rather, the good practice is that if you want to deploy models, you deploy
them in the exact same environment in which you trained them. The web world
is very used to doing that (because they keep running into these problems),
and has developed technology to do this, such as docker containers. I
know that it is clunky technology. I don't like it myself, but I don't
see a way out of it with our resources.
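
Short of a full container, one lightweight way to at least detect an
environment mismatch is to store a fingerprint of the training
environment next to the pickle and refuse to load when it differs. The
sketch below is an assumption-laden illustration, not a scikit-learn
feature: the helper names and the default package list are made up, and
it only compares the Python version and a few package versions, which is
far less than what a container pins down.

```python
import pickle
import platform
from importlib import metadata

# Hypothetical default list of packages whose versions are recorded.
DEFAULT_PACKAGES = ("scikit-learn", "numpy", "scipy")


def environment_fingerprint(packages=DEFAULT_PACKAGES):
    """Return the Python version and the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed here
    return versions


def dump_with_environment(obj, packages=DEFAULT_PACKAGES):
    """Pickle obj together with the fingerprint of the current environment."""
    wrapper = {
        "environment": environment_fingerprint(packages),
        # Nested bytes: the model itself is only unpickled after the check.
        "payload": pickle.dumps(obj),
    }
    return pickle.dumps(wrapper)


def load_with_environment(blob, packages=DEFAULT_PACKAGES):
    """Unpickle obj only if the current environment matches the stored one."""
    wrapper = pickle.loads(blob)
    current = environment_fingerprint(packages)
    if wrapper["environment"] != current:
        raise RuntimeError(
            "environment mismatch: saved with %r, loading with %r"
            % (wrapper["environment"], current)
        )
    return pickle.loads(wrapper["payload"])
```

This turns an obscure unpickling error into an explicit message, but it
does not make the pickle loadable elsewhere; reproducing the training
environment remains the actual fix.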


[*] Back in the day, when I was working on Mayavi, we developed our own
persistence code, because we were not happy with pickle. It was not
pleasant to maintain, and had the same "smell" as pickle. I don't think
that it was a great use of our time.
