[scikit-learn] [Scikit-learn-general] Estimator serialisability

Dale T Smith Dale.T.Smith at macys.com
Thu Jul 14 08:35:27 EDT 2016


Hello,

I investigated this subject last year, and have tried to keep up, so I can perhaps offer some alternatives.


· The only packages I know of that read PMML in Python are proprietary. There are several alternatives for writing to PMML, as you can easily find.

I also found

https://code.google.com/archive/p/augustus/

and

https://github.com/ctrl-alt-d/lightpmmlpredictor

Depending on your project, sklearn-compiledtrees may be an option.

https://github.com/ajtulloch/sklearn-compiledtrees

Py2PMML (https://support.zementis.com/entries/37092748-Introducing-Py2PMML) is by Zementis. It’s a commercial product, meaning you pay for a license.


· Another option is what we planned to do at an old job of mine: read the model characteristics out of the scikit-learn object after fit, and generate C code ourselves. This is a viable option for decision trees. You can adapt print_decision_trees() from this Stack Overflow answer.

http://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
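As a rough illustration of the code-generation idea (a sketch, not the function from the Stack Overflow answer), the snippet below turns the arrays that describe a fitted tree into a nested C if/else function. It assumes the arrays come from a fitted DecisionTreeClassifier's tree_ attribute (children_left, children_right, feature, threshold), that leaves are marked by a left child of -1, and that samples with x[feature] <= threshold go left. The function name tree_to_c and the leaf_class argument (for example, the argmax of tree_.value at each node) are hypothetical names chosen for this example.

```python
def tree_to_c(children_left, children_right, feature, threshold, leaf_class,
              func_name="predict"):
    """Emit C source for a decision tree described by parallel arrays.

    Node i is a leaf when children_left[i] == -1; otherwise it tests
    x[feature[i]] <= threshold[i] and descends left when the test holds.
    """
    lines = [f"int {func_name}(const double *x) {{"]

    def emit(node, depth):
        pad = "    " * depth
        if children_left[node] == -1:
            # Leaf: return the class stored for this node.
            lines.append(f"{pad}return {int(leaf_class[node])};")
            return
        # Internal node: plain Python float/int conversion keeps NumPy
        # scalar reprs out of the generated source.
        lines.append(
            f"{pad}if (x[{int(feature[node])}] <= {float(threshold[node])!r}) {{"
        )
        emit(children_left[node], depth + 1)
        lines.append(f"{pad}}} else {{")
        emit(children_right[node], depth + 1)
        lines.append(f"{pad}}}")

    emit(0, 1)
    lines.append("}")
    return "\n".join(lines)


# Hand-written three-node tree: the root splits on feature 0 at 0.5.
c_source = tree_to_c(
    children_left=[1, -1, -1],
    children_right=[2, -1, -1],
    feature=[0, -2, -2],
    threshold=[0.5, -2.0, -2.0],
    leaf_class=[0, 0, 1],
)
print(c_source)
```

The generated function has no dependency on Python or scikit-learn, which is the point of the approach; the cost is that the generator has to be written and tested per model family.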


· You can also reconsider using joblib.dump. I’m aware that it has problems, but you can include enough versioning information in the objects you dump to let your code check that scikit-learn versions are compatible, etc. I know this is a pain in the neck, but it’s a viable alternative to creating your own PMML reader, writing a code generator of some kind, or buying a license.
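One possible shape for the versioning check is sketched below. It writes the version information to a JSON sidecar file so it can be checked before anything is unpickled. The helper names (dump_with_versions, load_with_checks), the ".meta.json" suffix, and the strict equality test are assumptions made for this example. The sketch uses the standard-library pickle module to stay self-contained; joblib.dump(model, path) and joblib.load(path) could take the place of the pickle calls.

```python
import json
import pickle


def dump_with_versions(model, model_path, versions):
    """Persist the model plus a JSON sidecar recording library versions."""
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)  # joblib.dump(model, model_path) could go here
    with open(model_path + ".meta.json", "w") as fh:
        json.dump({"versions": dict(versions)}, fh)


def load_with_checks(model_path, current_versions):
    """Load the model only if the recorded versions match the current ones."""
    with open(model_path + ".meta.json") as fh:
        saved = json.load(fh)["versions"]
    # Collect every library whose recorded version differs from the current one.
    mismatched = {
        name: (version, current_versions.get(name))
        for name, version in saved.items()
        if current_versions.get(name) != version
    }
    if mismatched:
        raise RuntimeError(f"Version mismatch (saved, current): {mismatched}")
    with open(model_path, "rb") as fh:
        return pickle.load(fh)  # joblib.load(model_path) could go here
```

In real use the versions dictionary would be built at dump time from something like {"scikit-learn": sklearn.__version__}. Exact matching is the most conservative policy; a project could relax it to compare only major and minor versions.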


__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Joel Nothman
Sent: Thursday, July 14, 2016 4:18 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] [Scikit-learn-general] Estimator serialisability

This has been discussed numerous times. I suppose no one thinks supporting pickle only is great, but a custom dict is unmaintainable. The best we've got AFAIK (and it looks like it's getting better all the time; see https://github.com/jpmml/jpmml-sklearn/graphs/contributors) is a tool to convert one-way to PMML, which is portable to production environments. See https://github.com/jpmml/sklearn2pmml (Python interface) and https://github.com/jpmml/jpmml-sklearn (command-line interface and guts of the thing).

I hope that helps; and thanks to Villu Ruusmann: that list of supported estimators is awesome!

PS: please write to the new list at scikit-learn at python.org

On 14 July 2016 at 17:24, Miroslav Zoričák <miroslav.zoricak at gmail.com> wrote:
Hi everybody,

I have been using scikit-learn for a while, but I have run into a problem that does not seem to have any good solutions.

Basically I would like to:
- build my pipeline in a Jupyter Notebook
- persist it (to json or hdf5)
- load it in production and execute the prediction there

The problem is that the recommended way to persist estimators such as RobustScaler is to pickle them. I don't want to do this, for three reasons:

- Security: unpickling data from an untrusted source can execute arbitrary code
- Portability: I can't unpickle it in Scala, for example
- Fragility: pickle stores a lot of detail that is not strictly necessary to reconstruct the RobustScaler, and that might prevent it from being reconstructed correctly if a different version is used

Another option I seem to have is to access the private members of each estimator that I want to use and store them on my own, but this is inconvenient, because:

- It forces me as a user to understand how the robust scaler works and how it stores its internal state, which is generally bad for usability
- The internal implementation could change, leaving me to fix my serialisers (see #1)
- I would need to do this for each new Estimator I decide to use

Now, to me it seems the solution is quite obvious:
Write a Mixin or update the BaseEstimator class to include two additional methods:

to_dict() - returns a dictionary such that, when it is passed to
from_dict(dictionary) - the original object is reconstructed

These dictionaries could be passed to the JSON module or the YAML module or stored elsewhere. We could provide more convenience methods to do this for the user.

In case of the RobustScaler the dict would look something like:
{ "center": "0.0", "scale": "1.0"}

Now the bulk of the work is writing these serialisers and deserialisers for all of the estimators, but that can be simplified by adding a method that could do that automatically via reflection and the estimator would only need to specify which fields to serialise.
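To make the proposal concrete, here is a minimal sketch of such a mixin. Nothing in it is scikit-learn API: DictSerialisableMixin, the _serialised_fields attribute, and ToyScaler are hypothetical names, and the toy scaler (median centring, range scaling) only stands in for a real estimator such as RobustScaler. Each class lists the fields to serialise, and the mixin reads and restores them via getattr/setattr.

```python
import json
import statistics


class DictSerialisableMixin:
    """Hypothetical mixin: subclasses list the attributes to persist."""

    _serialised_fields = ()

    def to_dict(self):
        # Reflection: read each declared field off the instance.
        return {name: getattr(self, name) for name in self._serialised_fields}

    @classmethod
    def from_dict(cls, state):
        # Assumes the class can be constructed without arguments.
        obj = cls()
        for name in cls._serialised_fields:
            setattr(obj, name, state[name])
        return obj


class ToyScaler(DictSerialisableMixin):
    """Stand-in estimator: centres on the median, scales by the range."""

    _serialised_fields = ("center_", "scale_")

    def __init__(self):
        self.center_ = None
        self.scale_ = None

    def fit(self, values):
        self.center_ = statistics.median(values)
        self.scale_ = (max(values) - min(values)) or 1.0
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


scaler = ToyScaler().fit([1.0, 2.0, 3.0, 4.0, 5.0])
payload = json.dumps(scaler.to_dict())            # plain JSON text
restored = ToyScaler.from_dict(json.loads(payload))
```

This works for scalar fields as shown; array-valued state (the usual case for fitted estimators) would need an explicit conversion to and from lists, and constructor parameters would need handling alongside the fitted attributes.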

I am happy to start working on this and create a pull request on Github, but before I do that I wanted to get some initial thoughts and reactions from the community, so please let me know what you think.

Best Regards,
Miroslav Zoricak
