[scikit-learn] [ANN] Scikit-learn 0.20.0

Sebastian Raschka mail at sebastianraschka.com
Wed Oct 3 13:14:13 EDT 2018


The ONNX approach sounds most promising, especially because it would also allow library interoperability, but I wonder whether it covers only parametric models and not nonparametric ones like KNN, tree-based classifiers, etc.
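
As a rough illustration of the ONNX route (this assumes the separate skl2onnx converter package plus onnxruntime; how far its estimator coverage goes is exactly what I am unsure about):

# Sketch: export a fitted scikit-learn model to ONNX with skl2onnx and score
# it with onnxruntime. Illustrative only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

# Declare the input signature and convert the fitted model.
onnx_model = convert_sklearn(
    clf, initial_types=[("float_input", FloatTensorType([None, X.shape[1]]))]
)
with open("logreg_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Run the exported model and compare against sklearn's predictions.
sess = rt.InferenceSession("logreg_iris.onnx", providers=["CPUExecutionProvider"])
pred_onnx = sess.run(None, {"float_input": X.astype(np.float32)})[0]
print((pred_onnx == clf.predict(X)).mean())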

All in all, I can definitely see the appeal of having a way to export sklearn estimators in a text-based format (e.g., via JSON), since it would make sharing models easier. This wouldn't even have to be compatible across multiple sklearn versions. A typical use case would be to include these JSON exports as, e.g., supplemental files of a research paper for other people to run the models (one can simply specify which sklearn version they require; of course, one could also share pickle files, but I am personally always hesitant about running/trusting other people's pickle files).
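
For illustration, a bare-bones sketch of what such a JSON export could look like (hyperparameters only, via get_params(); fitted attributes such as numpy arrays would need extra handling, and none of the names below are an actual sklearn API):

# Dump an estimator's hyperparameters plus some metadata to JSON and rebuild
# an unfitted copy from that file. Illustrative only.
import json
import sklearn
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=0.5, max_iter=200)

spec = {
    "sklearn_version": sklearn.__version__,
    "class_name": clf.__class__.__name__,
    "module": clf.__module__,
    "init_params": clf.get_params(),
}
with open("logreg_spec.json", "w") as f:
    json.dump(spec, f, indent=2)

# Rebuilding the (unfitted) estimator from the JSON file:
with open("logreg_spec.json") as f:
    loaded = json.load(f)
rebuilt = LogisticRegression(**loaded["init_params"])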

Unfortunately though, as Gael pointed out, this "feature" would be a huge burden for the devs, and it would probably also negatively impact the development of scikit-learn itself because it imposes another design constraint.

However, I do think this sounds like an excellent case for a contrib project, something like scikit-export, scikit-serialize, or the like.

Best,
Sebastian



> On Oct 3, 2018, at 5:49 AM, Javier López <jlopez at ende.cc> wrote:
> 
> 
> On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <gael.varoquaux at normalesup.org> wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> defined via the internals of objects.
> 
> Plus the fact that loading a pickle can execute arbitrary code, and there is no way to know
> if any malicious code is in there in advance because the contents of the pickle cannot
> be easily inspected without loading/executing it.
>  
> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code that
> does not run into these problems is very costly in terms of developer time
> and makes it harder to add new methods or improve existing ones. I am not
> excited about it.
> 
> My "text-based serialization" suggestion was nowhere near as ambitious as that,
> as I have already explained, and wasn't aiming at solving the versioning issues, but
> rather at having something which is "about as good" as pickle but in a human-readable
> format. I am not asking for a Turing-complete language to reproduce the prediction
> function, but rather something simple in the spirit of the output produced by the gist code I linked above, just for the model families where it is reasonable:
> 
> https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
> 
> The code I posted mostly works (specific cases of nested models need to be addressed 
> separately, as well as pipelines), and we have been using (a version of) it in production
> for quite some time. But there are hackish aspects to it that we are not happy with,
> such as the manual separation of init and fitted parameters by checking if the name ends with "_", having to infer class name and location using 
> "model.__class__.__name__" and "model.__module__", and the wacky use of "__import__".
> 
> My suggestion was more along the lines of adding some metadata to sklearn estimators so
> that code in a similar style would be nicer to write; little things like having `init_parameters` and `fit_parameters` properties that would return the lists of named parameters, 
> or a `model_info` method that would return data like sklearn version, class name and location, or a package-level dictionary pointing at the estimator classes by a string name, like
> 
> from sklearn.linear_model import LogisticRegression
> estimator_classes = {"LogisticRegression": LogisticRegression, ...}
> 
> so that one can load the appropriate class from the string description without calling __import__ or eval; that sort of stuff.
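> 
> With such a registry, loading could then look roughly like this (hypothetical
> names, just to illustrate the idea):
> 
> # Rebuild an estimator from a serialized spec using a dictionary of classes
> # instead of __import__ or eval.
> def load_estimator(spec, estimator_classes):
>     cls = estimator_classes[spec["class_name"]]
>     model = cls(**spec["init_params"])
>     for name, value in spec["fitted_params"].items():
>         setattr(model, name, value)  # restore fitted attributes
>     return model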
> 
> I am aware this would not address the common complaint of "perfect prediction reproducibility"
> across versions, but I think we can all agree that this utopia of perfect reproducibility is not 
> feasible.
> 
> And in the long, long run, I agree that PFA/ONNX, or whichever similar format emerges, is
> the way to go.
> 
> J
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


