Rationale behind *_gen and *_frozen in _multivariate.py

Hello Scipy,
I would like to implement additional distributions (at least locally for now).
To do so, I looked at scipy/stats/_multivariate.py and would like to understand the rationale behind the *_gen and *_frozen classes.
- Are the frozen classes used to avoid parameter checks at run time?
- Why, for example, is there dirichlet = dirichlet_gen() in line 1360 [1]? It seems like an object is created during the import, although it appears to the user as if scipy.stats.dirichlet were a module and scipy.stats.dirichlet.pdf() a function of just that module.
I do not want to change the scipy code. I would just like to know what the benefits are.
With best regards
Lukas
[1] https://github.com/scipy/scipy/blob/ffaebc9e684e5bd23bbd3d5234c27a71369990b7...

On Wed, Jul 27, 2016 at 7:09 AM, Lukas Drude <mail@lukas-drude.de> wrote:
Hello Scipy,
I would like to implement additional distributions (at least locally for now).
To do so, I looked at scipy/stats/_multivariate.py and would like to understand the rationale behind the *_gen and *_frozen classes.
- Are the frozen classes used to avoid parameter checks at run time?
- Why, for example, is there dirichlet = dirichlet_gen() in line 1360 [1]? It seems like an object is created during the import, although it appears to the user as if scipy.stats.dirichlet were a module and scipy.stats.dirichlet.pdf() a function of just that module.
I do not want to change the scipy code. I would just like to know what the benefits are.
With best regards Lukas
[1] https://github.com/scipy/scipy/blob/ffaebc9e684e5bd23bbd3d5234c27a71369990b7...
some history in the following, Evgeni knows better the recent changes
The original implementation of the distributions was mostly "functional". Classes are used as namespace and to make implementation easier, but users only used a single global instance of the distribution classes.
Because it is only a single global instance it cannot keep state, i.e. store intermediate results and parameters as attributes. This was a headache and source of bugs when state spilled over in the global instance from one use to the next.
The better design would have been to have users use the classes to create new instances for each use. `frozen` distributions were the way to create a new instance that stores the parameters of the distribution.
The use of `frozen` distributions has been improved and expanded, mostly by Evgeni. It is also used more extensively in the multivariate distributions, which have only been added in the last few years.
The main advantage of frozen distributions is that by having a new instance each time, it is possible to store intermediate results to improve performance.
Josef
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@scipy.org https://mail.scipy.org/mailman/listinfo/scipy-dev
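As a rough illustration of the pattern asked about above, here is a minimal pure-Python sketch. It is not SciPy's actual code: the names expo_gen, expo_frozen, expo and _check_rate are hypothetical and chosen only for this example. It shows a "generator" class whose methods take the parameters on every call, a module-level instance that makes expo.pdf(...) look like a function in a module, and a frozen class that validates the parameters once and can cache an intermediate value.

```python
import math


class expo_gen:
    """Hypothetical 'generator' class: its methods keep no state and take
    the distribution parameters explicitly on every call."""

    def _check_rate(self, rate):
        # Parameter validation, repeated on every unfrozen call.
        if rate <= 0:
            raise ValueError("rate must be positive")
        return float(rate)

    def pdf(self, x, rate):
        rate = self._check_rate(rate)
        return rate * math.exp(-rate * x) if x >= 0 else 0.0

    def __call__(self, rate):
        # "Freezing": bind the parameters and hand back a new object.
        return expo_frozen(rate)


class expo_frozen:
    """Hypothetical frozen distribution: parameters are validated once and
    stored, together with an intermediate result worth caching."""

    def __init__(self, rate):
        self._dist = expo_gen()              # private instance, not the global one
        self.rate = self._dist._check_rate(rate)
        self.log_rate = math.log(self.rate)  # cached intermediate result

    def pdf(self, x):
        return self._dist.pdf(x, self.rate)

    def logpdf(self, x):
        # Reuses the cached log(rate) instead of recomputing it.
        return self.log_rate - self.rate * x if x >= 0 else -math.inf


# Module-level instance, created at import time. This is why `expo.pdf(...)`
# reads like a module function although `expo` is an object.
expo = expo_gen()

functional = expo.pdf(1.0, rate=2.0)   # parameters passed on every call
frozen = expo(rate=2.0)                # parameters bound once
```

In this sketch, frozen.pdf(1.0) and expo.pdf(1.0, rate=2.0) give the same value; the frozen object only saves the caller from passing rate again.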

On Wed, Jul 27, 2016 at 12:36 PM, <josef.pktd@gmail.com> wrote:
some history in the following, Evgeni knows better the recent changes
The original implementation of the distributions was mostly "functional". Classes are used as namespace and to make implementation easier, but users only used a single global instance of the distribution classes.
Because it is only a single global instance it cannot keep state, i.e. store intermediate results and parameters as attributes. This was a headache and source of bugs when state spilled over in the global instance from one use to the next.
That's not quite right, I don't think. Only the multivariate distributions, which are quite new, store intermediate results. No global state was ever stored in the "unfrozen" distribution instances. Storing intermediate results was not a consideration in adding frozen distributions.
When the distributions were first designed, Python did not have classmethods. So the API `norm.pdf(x, loc, scale)` would not have been possible if `norm` were a class. You had to make an instance of a class to get callable methods. At first, this was the only API provided. At the time, scipy definitely had a bias against using objects in its API (i.e. forcing users to construct objects, not just using pre-existing instances). Object-orientation was seen as an unnecessary complication for scientific programmers. Things are different now.
However, this API was sometimes inconvenient because one would always have to pass around the distribution and the arguments separately, making it hard to write generic code. Frozen distributions were added to bind the parameters to the distribution so that one could just pass around a single object. Now, you can write generic code that just accepts a single frozen distribution object and call `dist.pdf(x)` without the code needing to know anything about which distribution is being used or its parameters.
-- Robert Kern
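The API argument above can be sketched in plain Python. This is only an illustration under stated assumptions: normal_gen, frozen_sketch, normal and the two log-likelihood helpers are hypothetical names, not SciPy objects. The first helper has to receive the distribution and its arguments separately; the second accepts a single frozen object and only calls pdf(x).

```python
import math


class normal_gen:
    """Hypothetical stand-in for a distribution class."""

    def pdf(self, x, loc=0.0, scale=1.0):
        z = (x - loc) / scale
        return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

    def __call__(self, *args):
        # Freezing binds the arguments to this distribution.
        return frozen_sketch(self, *args)


class frozen_sketch:
    """Hypothetical frozen wrapper: one object carrying dist and arguments."""

    def __init__(self, dist, *args):
        self.dist = dist
        self.args = args

    def pdf(self, x):
        return self.dist.pdf(x, *self.args)


# Pre-built instance, so users never have to construct an object themselves.
normal = normal_gen()


def log_likelihood_unfrozen(dist, args, data):
    # Without freezing, distribution and arguments travel separately.
    return sum(math.log(dist.pdf(x, *args)) for x in data)


def log_likelihood(frozen_dist, data):
    # With freezing, generic code needs one object and knows nothing
    # about which distribution it is or what its parameters are.
    return sum(math.log(frozen_dist.pdf(x)) for x in data)


data = [0.1, -0.4, 1.2]
a = log_likelihood_unfrozen(normal, (0.5, 2.0), data)
b = log_likelihood(normal(0.5, 2.0), data)
```

Both calls compute the same quantity; the difference is only in how much the generic function has to be told.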

On Wed, Jul 27, 2016 at 8:44 AM, Robert Kern <robert.kern@gmail.com> wrote:
That's not quite right, I don't think. Only the multivariate distributions, which are quite new, store intermediate results. No global state was ever stored in the "unfrozen" distribution instances. Storing intermediate results was not a consideration in adding frozen distributions.
It took me a few months to figure out why the distributions sometimes produced different, i.e. wrong, results, and to fix those bugs. Using attributes and state might not have been the plan, but it was and is in the actual implementation.
(And it's the source of my allergy to the possibility of stale state in statsmodels.)
When the distributions were first designed, Python did not have classmethods. So the API `norm.pdf(x, loc, scale)` would not have been possible if `norm` were a class. You had to make an instance of a class to get callable methods. At first, this was the only API provided. At the time, scipy definitely had a bias against using objects in its API (i.e. forcing users to construct objects, not just using pre-existing instances). Object-orientation was seen as an unnecessary complication for scientific programmers. Things are different now.
However, this API was sometimes inconvenient because one would always have to pass around the distribution and the arguments separately, making it hard to write generic code. Frozen distributions were added to bind the parameters to the distribution so that one could just pass around a single object. Now, you can write generic code that just accepts a single frozen distribution object and call `dist.pdf(x)` without the code needing to know anything about which distribution is being used or its parameters.
This clarifies but doesn't contradict my comments. "frozen" was then the object-oriented backdoor for users who didn't want to know about objects.
Times have fortunately changed away from the MATLAB/Fortran tradition, except for Julia, where developers and users still prefer Greek and one-letter names and no classes. :)
Norm(loc, scale).pdf(x) is much more work than norm.pdf(x, loc, scale)
Josef

On Wed, Jul 27, 2016 at 2:04 PM, <josef.pktd@gmail.com> wrote:
It took me a few months to figure out why the distributions sometimes produced different, i.e. wrong, results, and to fix those bugs. Using attributes and state might not have been the plan, but it was and is in the actual implementation.
(And it's the source of my allergy to the possibility of stale state in statsmodels.)
I'm pretty sure that we had frozen distributions before you encountered those bugs. We didn't add frozen distributions to get rid of those bugs. We added them for the API reasons I described.
https://mail.scipy.org/pipermail/scipy-user/2003-October/002278.html
-- Robert Kern

On Wed, Jul 27, 2016 at 9:26 AM, Robert Kern <robert.kern@gmail.com> wrote:
I'm pretty sure that we had frozen distributions before you encountered those bugs. We didn't add frozen distributions to get rid of those bugs. We added them for the API reasons I described.
Frozen distributions were there when I started and there were no specific problems with them, AFAIR.
The API reasons that you described required that a distribution store the parameters as attributes. So it required new instances/objects for each new set of parameters.
That sounds like adding object orientation as an API convenience, while I would have preferred to additionally drop the global instances to make the implementation simpler and less error-prone.
Josef

On Wed, Jul 27, 2016 at 9:43 AM, <josef.pktd@gmail.com> wrote:
Frozen distributions were there when I started and there were no specific problems with them, AFAIR.
The API reasons that you described required that a distribution store the parameters as attributes. So it required new instances/objects for each new set of parameters.
Partial correction, as part of my memory comes back:
IIRC, frozen distributions still used the global instance, like standalone functions. That wasn't fully object-oriented; frozen was just a wrapper class. Evgeni changed it a while ago so that freeze creates a new instance each time, which allows more flexibility in holding state besides the parameters.
Josef
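The difference between the two designs described in this correction can be sketched as follows. This is a toy illustration only, assuming a single piece of mutable state (a seed); toy_gen, toy, frozen_wrapper and frozen_own_instance are hypothetical names and do not reproduce SciPy's implementation.

```python
class toy_gen:
    """Hypothetical distribution class with one piece of mutable state."""

    def __init__(self, seed=None):
        self.seed = seed


toy = toy_gen()  # the shared, module-level instance


class frozen_wrapper:
    """Wrapper style: the frozen object only holds a reference to the
    shared global instance."""

    def __init__(self, dist):
        self.dist = dist


class frozen_own_instance:
    """New-instance style: each freeze builds a fresh object of the same
    class, so state set on it stays private."""

    def __init__(self, dist):
        self.dist = dist.__class__(seed=dist.seed)


old = frozen_wrapper(toy)
old.dist.seed = 123          # leaks: this mutates the global instance
leaked = toy.seed            # the global now carries the frozen object's seed

toy.seed = None              # reset the global for the second experiment
new = frozen_own_instance(toy)
new.dist.seed = 123          # isolated: only the private copy changes
```

With the wrapper, state set through the frozen object shows up on the global instance; with a fresh instance per freeze, the global stays untouched.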

On Wed, Jul 27, 2016 at 1:44 PM, Robert Kern <robert.kern@gmail.com> wrote:
That's not quite right, I don't think. Only the multivariate distributions, which are quite new, store intermediate results. No global state was ever stored in the "unfrozen" distribution instances. Storing intermediate results was not a consideration in adding frozen distributions.
When the distributions were first designed, Python did not have classmethods. So the API `norm.pdf(x, loc, scale)` would not have been possible if `norm` were a class. You had to make an instance of a class to get callable methods. At first, this was the only API provided. At the time, scipy definitely had a bias against using objects in its API (i.e. forcing users to construct objects, not just using pre-existing instances). Object-orientation was seen as an unnecessary complication for scientific programmers. Things are different now.
However, this API was sometimes inconvenient because one would always have to pass around the distribution and the arguments separately, making it hard to write generic code. Frozen distributions were added to bind the parameters to the distribution so that one could just pass around a single object. Now, you can write generic code that just accepts a single frozen distribution object and call `dist.pdf(x)` without the code needing to know anything about which distribution is being used or its parameters.
-- Robert Kern
Re: storing intermediate results.
There was some discussion a while ago, https://github.com/scipy/scipy/issues/2823
At least my conclusion was that it's not worth it, not generically at least: https://github.com/scipy/scipy/issues/2823#issuecomment-23806104
It's perfectly possible one can do better and there's a way :-).
At the moment, a frozen distribution holds an instance separate from the global one. (https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure...) E.g.
    In [32]: from scipy.stats import gamma
    In [33]: gamma.shapes
    Out[33]: 'a'
    In [34]: rv = gamma(a=1)
    In [35]: rv.dist
    Out[35]: <scipy.stats._continuous_distns.gamma_gen at 0x7f9706647710>
    In [36]: gamma
    Out[36]: <scipy.stats._continuous_distns.gamma_gen at 0x7f9706d11e50>
    In [37]: rv.dist is gamma
    Out[37]: False
So that one can
* use a separate random_state for drawing variates:
    In [38]: gamma.random_state
    Out[38]: <mtrand.RandomState at 0x7f970d834e10>
    In [39]: rv.random_state
    Out[39]: <mtrand.RandomState at 0x7f970d834e10>  # same!
    In [40]: rv.random_state = 123
    In [41]: rv.random_state
    Out[41]: <mtrand.RandomState at 0x7f97141084d0>  # different
    In [42]: gamma.random_state
    Out[42]: <mtrand.RandomState at 0x7f970d834e10>  # intact
* monkey-patch the instance methods to store intermediates if desired.
participants (4)
- Evgeni Burovski
- josef.pktd@gmail.com
- Lukas Drude
- Robert Kern