On Fri, Mar 13, 2020 at 12:14 PM <josef.pktd@gmail.com> wrote:


On Fri, Mar 13, 2020 at 12:04 PM <josef.pktd@gmail.com> wrote:
Aside:
compound Poisson is a convolution of distributions and not a finite mixture.
Allowing an infinite mixing distribution like Poisson creates numerical problems in the upper tail that are not easy to solve in general.
In most cases, computation have to be truncated at the upper tail, but then the problem is to figure out the truncation threshold for a required precision.
My guess is that this would be a lot of work to get it to scipy standards.

I was looking at the general case for convolution and compound poisson a long time ago, mainly using fft to get the pdf and cdf from the characteristic function,, the cf is relatively simple to compute for convolutions. The references in extreme value and risk applications that I looked at, was emphasizing tail precision and ways how to work around it, or comparing different methods in how precise they are.
fft was fast, but I only eyeballed the truncation threshold for my examples.

I thought of representing tweedie for the computation as a mixture between a mass point/discrete distribution and a distribution for the continuous part, so we can handle the two parts separately.

Following this, it might be possible to add a zero-truncated tweedie distribution as a continuous distribution subclass in scipy. 
Then we could just add a simple mixture of the mass point at zero and the zero-truncated tweedie.
The same idea is used in hurdle models for count data, which have a mass point at zero and a zero-truncated Poisson or other count distributions for points > 0.



I haven't looked at Tweedie in a few years.
AFAIK/AFAIR statsmodels only supports power parameter p in the open interval (1, 2), where it has the nice Poisson-Gamma distribution.
 
 

Josef


On Fri, Mar 13, 2020 at 11:46 AM <josef.pktd@gmail.com> wrote:


On Fri, Mar 13, 2020 at 11:16 AM Christian Lorentzen <lorentzen.ch@gmail.com> wrote:

Thank you for your feedback.

If it were possible to mix distributions together, e.g. rv1 + rv2, the compound poisson could be represented. AFAIK, PyMC3 supports that with "Mixture" distributions [1]. So I see three options:

  1. Implement only a log likelihood function as Josef suggests. In which package to put it? Scipy, Statsmodels, PyMC3?
  2. Ask PyMC3 developers for this distribution.
  3. Ask Statsmodel developers for this distribution.
    @Josef: I hereby ask you;-)

The tricky part: As I intend to calculate the log likelihood via Wrights generalized Bessel function and PR [2] implements this as a private special function, can this function stay in scipy or should it move to the other packages in that case.

[1] https://docs.pymc.io/api/distributions/mixture.html
[2] https://github.com/scipy/scipy/pull/11313


IMO, scipy would be a good central location for logpdf and associated wright.
All packages like sklearn and statsmodels depend on scipy but not on each other.

If wright in special works out, then adding just logpdf would be a good short term solution.
Adding a new base distribution class e.g. for distribution mixtures will be a lot more work and won't happen fast.

Because statsmodels has currently only an approximate tweedie logpdf, we would add any improved version if it doesn't go into scipy.
We can add it also to compat.scipy until it is in all scipy versions that we support.


scipy.stats has now multivariate distributions that don't fit into the old univariate distribution setup.
Similarly, I think it would be possible to add other distributions that don't fit into the existing class hierarchy.
That would be an intermediate step without adding full support for generic mixture distributions, or discrete-continuous combinations.

Josef


Josef



Kind regards
Christian

On 11.03.20 22:41, josef.pktd@gmail.com wrote:


On Wed, Mar 11, 2020 at 5:02 PM Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Mar 11, 2020 at 3:35 PM Christian Lorentzen <lorentzen.ch@gmail.com> wrote:

Dear Scipy Developers and mailing list Readers

I'd like to address the issue [1] to implement Tweedie distributions [2] in scipy.stats.

Purpose
The family of Tweedie distributions contains many known distributions like the Poisson and the Gamma distribution, but also distributions between them, aka compound poisson gamma distribution, see [3]. These are often appropriate for insurance claims and other fields, where one has a (Poisson) random count process of events and every event has a (Gamma) random size/amount.
The distribution would enable simulations, maximum likelihood estimation of all parameters, choice and visualization of distributions, etc.

Implementation
I started PR [4] for Wrights generalized Bessel functions as a private function in scipy.special.
Once this is ready, the pdf follows immediately.
For the range of interest of Y ~ compound poisson gamma distribution, the distribution of Y has a point mass at zero and is otherwise continuous for Y>0.
As already discussed in the issue [1], Tweedie might best fit as
rv_generic.
As such, it would be the first one, all others are either
rv_discrete or rv_continuous.
Without a template, I would need guidance how to implement a new rv_generic.

FWIW, `rv_generic` isn't really intended to be a concrete class. It was only intended to be a base class implementing the common parts needed by `rv_continuous` and `rv_discrete`. Nothing "fits into" `rv_generic`, per se. The Tweedie distributions, for some parameters at least, may not fit into the `scipy.stats` infrastructure at all. We have no infrastructure for continuous-with-point-mass distributions. `rv_generic` is still built under the assumption that it's going to be implementing either a continuous or a discrete distribution.

I recommend implementing the functionality that you need outside of scipy following whatever API solves your problems best. Then we can evaluate if there is infrastructure that can be built that would help the second continuous-with-point-mass distribution that we might want next.

a long long time ago, I started a ParametricMixture model for this
(until I gave up on distributions)

An alternative as temporary solution would be to add some methods/functions like logpdf to scipy.stats, so statsmodels and sklearn can reuse those.

Josef 

 
 
--
Robert Kern
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@python.org
https://mail.python.org/mailman/listinfo/scipy-dev

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@python.org
https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@python.org
https://mail.python.org/mailman/listinfo/scipy-dev