Tweedie distributions in scipy.stats

Dear Scipy Developers and mailing list Readers I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats. Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc. Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as |rv_generic|. As such, it would be the first one, all others are either |rv_discrete| or |rv_continuous|. Without a template, I would need guidance how to implement a new rv_generic. References: [1] https://github.com/scipy/scipy/issues/11291 [2] https://en.wikipedia.org/wiki/Tweedie_distribution [3] https://en.wikipedia.org/wiki/Compound_Poisson_distribution [4] https://github.com/scipy/scipy/pull/11313 I'm looking forward to your feeback, thoughts and insights. Kind regards, Christian

On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen <lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution. I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next. -- Robert Kern

On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d... (until I gave up on distributions) An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those. Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

Thank you for your feedback. If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options: 1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3? 2. Ask PyMC3 developers for this distribution. 3. Ask Statsmodel developers for this distribution. @Josef: I hereby ask you;-) The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case. [1] https://docs.pymc.io/api/distributions/mixture.html [2] https://github.com/scipy/scipy/pull/11313 Kind regards Christian On 11.03.20 22:41, josef.pktd@gmail.com wrote:
On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com <mailto:robert.kern@gmail.com>> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen <lorentzen.ch@gmail.com <mailto:lorentzen.ch@gmail.com>> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as |rv_generic|. As such, it would be the first one, all others are either |rv_discrete| or |rv_continuous|. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing /either/ a continuous /or/ a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d...
(until I gave up on distributions)
An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.
Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org <mailto:SciPy-Dev@python.org> https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

On Fri, Mar 13, 2020 at 11:16 AM Christian Lorentzen <lorentzen.ch@gmail.com> wrote:
Thank you for your feedback.
If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options:
1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3? 2. Ask PyMC3 developers for this distribution. 3. Ask Statsmodel developers for this distribution. @Josef: I hereby ask you;-)
The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case.
[1] https://docs.pymc.io/api/distributions/mixture.html [2] https://github.com/scipy/scipy/pull/11313
IMO, scipy would be a good central location for logpdf and associated wright. All packages like sklearn and statsmodels depend on scipy but not on each other. If wright in special works out, then adding just logpdf would be a good short term solution. Adding a new base distribution class e.g. for distribution mixtures will be a lot more work and won't happen fast. Because statsmodels has currently only an approximate tweedie logpdf, we would add any improved version if it doesn't go into scipy. We can add it also to compat.scipy until it is in all scipy versions that we support. scipy.stats has now multivariate distributions that don't fit into the old univariate distribution setup. Similarly, I think it would be possible to add other distributions that don't fit into the existing class hierarchy. That would be an intermediate step without adding full support for generic mixture distributions, or discrete-continuous combinations. Josef Josef
Kind regards Christian On 11.03.20 22:41, josef.pktd@gmail.com wrote:
On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d...
(until I gave up on distributions)
An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.
Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing listSciPy-Dev@python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards. I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples. Josef On Fri, Mar 13, 2020 at 11:46 AM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 11:16 AM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Thank you for your feedback.
If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options:
1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3? 2. Ask PyMC3 developers for this distribution. 3. Ask Statsmodel developers for this distribution. @Josef: I hereby ask you;-)
The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case.
[1] https://docs.pymc.io/api/distributions/mixture.html [2] https://github.com/scipy/scipy/pull/11313
IMO, scipy would be a good central location for logpdf and associated wright. All packages like sklearn and statsmodels depend on scipy but not on each other.
If wright in special works out, then adding just logpdf would be a good short term solution. Adding a new base distribution class e.g. for distribution mixtures will be a lot more work and won't happen fast.
Because statsmodels has currently only an approximate tweedie logpdf, we would add any improved version if it doesn't go into scipy. We can add it also to compat.scipy until it is in all scipy versions that we support.
scipy.stats has now multivariate distributions that don't fit into the old univariate distribution setup. Similarly, I think it would be possible to add other distributions that don't fit into the existing class hierarchy. That would be an intermediate step without adding full support for generic mixture distributions, or discrete-continuous combinations.
Josef
Josef
Kind regards Christian On 11.03.20 22:41, josef.pktd@gmail.com wrote:
On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d...
(until I gave up on distributions)
An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.
Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing listSciPy-Dev@python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com> wrote:
Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards.
I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples.
I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately. Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie. The same idea is used in hurdle models for count data, which have a mass point at zero and a zero-truncated Poisson or other count distributions for points > 0.
Josef
On Fri, Mar 13, 2020 at 11:46 AM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 11:16 AM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Thank you for your feedback.
If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options:
1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3? 2. Ask PyMC3 developers for this distribution. 3. Ask Statsmodel developers for this distribution. @Josef: I hereby ask you;-)
The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case.
[1] https://docs.pymc.io/api/distributions/mixture.html [2] https://github.com/scipy/scipy/pull/11313
IMO, scipy would be a good central location for logpdf and associated wright. All packages like sklearn and statsmodels depend on scipy but not on each other.
If wright in special works out, then adding just logpdf would be a good short term solution. Adding a new base distribution class e.g. for distribution mixtures will be a lot more work and won't happen fast.
Because statsmodels has currently only an approximate tweedie logpdf, we would add any improved version if it doesn't go into scipy. We can add it also to compat.scipy until it is in all scipy versions that we support.
scipy.stats has now multivariate distributions that don't fit into the old univariate distribution setup. Similarly, I think it would be possible to add other distributions that don't fit into the existing class hierarchy. That would be an intermediate step without adding full support for generic mixture distributions, or discrete-continuous combinations.
Josef
Josef
Kind regards Christian On 11.03.20 22:41, josef.pktd@gmail.com wrote:
On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d...
(until I gave up on distributions)
An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.
Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing listSciPy-Dev@python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

On Fri, Mar 13, 2020 at 12:14 PM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com> wrote:
Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards.
I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples.
I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately.
Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie. The same idea is used in hurdle models for count data, which have a mass point at zero and a zero-truncated Poisson or other count distributions for points > 0.
I haven't looked at Tweedie in a few years. AFAIK/AFAIR statsmodels only supports power parameter p in the open interval (1, 2), where it has the nice Poisson-Gamma distribution.
Josef
On Fri, Mar 13, 2020 at 11:46 AM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 11:16 AM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Thank you for your feedback.
If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options:
1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3? 2. Ask PyMC3 developers for this distribution. 3. Ask Statsmodel developers for this distribution. @Josef: I hereby ask you;-)
The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case.
[1] https://docs.pymc.io/api/distributions/mixture.html [2] https://github.com/scipy/scipy/pull/11313
IMO, scipy would be a good central location for logpdf and associated wright. All packages like sklearn and statsmodels depend on scipy but not on each other.
If wright in special works out, then adding just logpdf would be a good short term solution. Adding a new base distribution class e.g. for distribution mixtures will be a lot more work and won't happen fast.
Because statsmodels has currently only an approximate tweedie logpdf, we would add any improved version if it doesn't go into scipy. We can add it also to compat.scipy until it is in all scipy versions that we support.
scipy.stats has now multivariate distributions that don't fit into the old univariate distribution setup. Similarly, I think it would be possible to add other distributions that don't fit into the existing class hierarchy. That would be an intermediate step without adding full support for generic mixture distributions, or discrete-continuous combinations.
Josef
Josef
Kind regards Christian On 11.03.20 22:41, josef.pktd@gmail.com wrote:
On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen < lorentzen.ch@gmail.com> wrote:
Dear Scipy Developers and mailing list Readers
I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.
Purpose The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount. The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.
Implementation I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special. Once this is ready, the pdf follows immediately. For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0. As already discussed in the issue [1], Tweedie might best fit as rv_generic. As such, it would be the first one, all others are either rv_discrete or rv_continuous. Without a template, I would need guidance how to implement a new rv_generic.
FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing *either* a continuous *or* a discrete distribution.
I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.
a long long time ago, I started a ParametricMixture model for this
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/d...
(until I gave up on distributions)
An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.
Josef
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing listSciPy-Dev@python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

On Fri, Mar 13, 2020 at 12:15 PM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com> wrote:
Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards.
I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples.
I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately.
Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie.
That could certainly work. It seems like handling that smoothly may be a pain for the user; you'd have to coordinate the effect of the parameters on both the size of the point mass and the continuous part, as well as the mixture. My recommendation is to implement this in its own package, using whatever frameworks you find help you solve your data analysis problems. Then we can figure out where it ought to finally live and how to extend the existing frameworks to handle this case best. The code doesn't have to start out in scipy.stats in order to make use of the scipy.stats framework. Please do continue to put the necessary special functions into scipy.special; that framework is a little harder to use outside of scipy.special. If you need my vote of support for that on that PR, you have it. -- Robert Kern

On Mar 13, 2020, at 12:46 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Fri, Mar 13, 2020 at 12:15 PM <josef.pktd@gmail.com> wrote:
On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com> wrote: Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards.
I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples.
I think I also looked at that at my previous employer, I think the reference I had used is this one https://eprints.usq.edu.au/3888/1/Dunn_Smyth_Stats_and_Comp_v18n1.pdf Hopefully that helps.
I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately.
I came to the same conclusion after thinking about this a bit over the last few days.
Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie.
The difference in zero inflated poisson is that it can be handled directly within the rv_discrete framework (I think). An rv_continuous with a zero point mass would handle a tweedie with 1<p<2 and any other point mass mixture I thought of that weren’t artificial. E.g. a normal and 0 point mass mixture (used in stochastic search variable selection) and a 0 point mass mixture with a moment normal used in Valen Johnson’s work. These seem to be the common applications (0 point mass).
That could certainly work. It seems like handling that smoothly may be a pain for the user; you'd have to coordinate the effect of the parameters on both the size of the point mass and the continuous part, as well as the mixture.
My recommendation is to implement this in its own package, using whatever frameworks you find help you solve your data analysis problems. Then we can figure out where it ought to finally live and how to extend the existing frameworks to handle this case best.
Thanks for suggesting this Robert, this is a wise strategy, this will enable to work out something that would generalize outside of the specifics of only the tweedie distribution.
The code doesn't have to start out in scipy.stats in order to make use of the scipy.stats framework. Please do continue to put the necessary special functions into scipy.special; that framework is a little harder to use outside of scipy.special. If you need my vote of support for that on that PR, you have it.
I found this from another statsModels developer that may be helpful to use as reference https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py Hopefully you find it helpful.
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev

Hi So to summarize: The plan is to put Wright's generalized Bessel function (for some parameter ranges) as a *private function* into scipy.special. This way, other libraries can use it under the hood, e.g. to compute likelihoods or provide distributions (rv, pdf, ...). Just as info, PR [1] for Wright's generalized Bessel function is meanwhile ready for a round of review, in my point of view. Side note: I deliberately chose to implement this special function and did not consider the various ways of computing the pdf of a Tweedie distribution as in [2], because I think an additional special function might have value for other purposes as well, and an independent approach to Tweedie distributions provides a good double check (I expect advantages and shortcomings of this approach). All the best, Christian [1] https://github.com/scipy/scipy/pull/11313 [2] https://cran.r-project.org/package=tweedie On 16.03.20 00:00, rlucas7@vt.edu wrote:
On Mar 13, 2020, at 12:46 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Fri, Mar 13, 2020 at 12:15 PM <josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>> wrote:
On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>> wrote:
Aside: compound Poisson is a convolution of distributions and not a finite mixture. Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general. In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision. My guess is that this would be a lot of work to get it to scipy standards.
I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are. fft was fast, but I only eyeballed the truncation threshold for my examples.
I think I also looked at that at my previous employer, I think the reference I had used is this one https://eprints.usq.edu.au/3888/1/Dunn_Smyth_Stats_and_Comp_v18n1.pdf Hopefully that helps.
I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately.
I came to the same conclusion after thinking about this a bit over the last few days.
Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie.
The difference in zero inflated poisson is that it can be handled directly within the rv_discrete framework (I think). An rv_continuous with a zero point mass would handle a tweedie with 1<p<2 and any other point mass mixture I thought of that weren’t artificial. E.g. a normal and 0 point mass mixture (used in stochastic search variable selection) and a 0 point mass mixture with a moment normal used in Valen Johnson’s work. These seem to be the common applications (0 point mass).
That could certainly work. It seems like handling that smoothly may be a pain for the user; you'd have to coordinate the effect of the parameters on both the size of the point mass and the continuous part, as well as the mixture.
My recommendation is to implement this in its own package, using whatever frameworks you find help you solve your data analysis problems. Then we can figure out where it ought to finally live and how to extend the existing frameworks to handle this case best.
Thanks for suggesting this Robert, this is a wise strategy, this will enable to work out something that would generalize outside of the specifics of only the tweedie distribution.
The code doesn't have to start out in scipy.stats in order to make use of the scipy.stats framework. Please do continue to put the necessary special functions into scipy.special; that framework is a little harder to use outside of scipy.special. If you need my vote of support for that on that PR, you have it.
I found this from another statsModels developer that may be helpful to use as reference
https://github.com/thequackdaddy/tweedie/blob/master/tweedie/tweedie_dist.py Hopefully you find it helpful.
-- Robert Kern _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
participants (4)
-
Christian Lorentzen
-
josef.pktd@gmail.com
-
rlucas7@vt.edu
-
Robert Kern