[SciPy-Dev] Fitting distributions [Was Re: Warnings raised (from fit in scipy.stats)]

Fri Jun 11 17:58:26 EDT 2010

On Fri, Jun 11, 2010 at 5:47 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Fri, Jun 11, 2010 at 5:42 PM,  <josef.pktd at gmail.com> wrote:
>> On Fri, Jun 11, 2010 at 5:11 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>> On Fri, Jun 11, 2010 at 2:49 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>> On Fri, Jun 11, 2010 at 12:20 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>> On Fri, Jun 11, 2010 at 2:17 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>>>>> On Fri, Jun 11, 2010 at 11:34 AM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>> On Fri, Jun 11, 2010 at 1:07 PM,  <josef.pktd at gmail.com> wrote:
>>>>>>>> On Fri, Jun 11, 2010 at 12:45 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
>>>>>>>>> Since the raising of warning behavior has been changed (I believe), I
>>>>>>>>> have been running into a lot of warnings in my code when say I do
>>>>>>>>> something like
>>>>>>>>>
>>>>>>>>> In [120]: from scipy import stats
>>>>>>>>>
>>>>>>>>> In [121]: y = [-45, -3, 1, 0, 1, 3]
>>>>>>>>>
>>>>>>>>> In [122]: v = stats.norm.pdf(y)/stats.norm.cdf(y)
>>>>>>>>> Warning: invalid value encountered in divide
>>>>>>>>>
>>>>>>>>> Sometimes, this is useful to know.  Sometimes, though, it's very
>>>>>>>>> disturbing when it's encountered in some kind of iteration or
>>>>>>>>> optimization.  I have been using numpy.clip to get around this in my
>>>>>>>>> own code, but when it's buried a bit deeper, it's not quite so simple.
>>>>>>>>>
>>>>>>>>> Take this example.
>>>>>>>>>
>>>>>>>>> In [123]: import numpy as np
>>>>>>>>>
>>>>>>>>> In [124]: np.random.seed(12345)
>>>>>>>>>
>>>>>>>>> In [125]: B = 6.0
>>>>>>>>>
>>>>>>>>> In [126]: x = np.random.exponential(scale=B, size=5000)
>>>>>>>>>
>>>>>>>>> In [127]: from scipy.stats import expon
>>>>>>>>>
>>>>>>>>> In [128]: expon.fit(x)
>>>>>>>>>
>>>>>>>>> <dozens of warnings clipped>
>>>>>>>>>
>>>>>>>>> Out[128]: (0.21874043533906118, 5.7122829778172939)
>>>>>>>>>
>>>>>>>>> The fit is achieved by fmin (as far as I know, since disp=0 in the
>>>>>>>>> rv_continuous.fit...), but there are a number of warnings emitted.  Is
>>>>>>>>> there any middle ground to be had in these type of situations via
>>>>>>>>> context management perhaps?
>>>>>>>>>
>>>>>>>>> Should I file a ticket?
>>>>>>>>
>>>>>>>> Which numpy scipy versions are you using?
>>>>>>
>>>>>> I get know warnings
>>>>>>>>> import numpy as np
>>>>>>>>> np.random.seed(12345)
>>>>>>>>> B = 6.0
>>>>>>>>> x = np.random.exponential(scale=B, size=5000)
>>>>>>>>> from scipy.stats import expon
>>>>>>>>> expon.fit(x)
>>>>>> array([  6.43573559e-04,   5.93058867e+00])
>>>>>>
>>>>>
>>>>> You also get different values than I do, which shouldn't be the case.
>>>>>
>>>>> I just discovered that my expon.fit(x) just returns the first and
>>>>> second moments of the data (even when I set floc = 0, I still get the
>>>>> second moment), and I am trying to figure out why.  It looks like
>>>>> something is amiss.
>>>>
>>>
>>> So maybe I am missing something (quite likely), but the reason that
>>> the expon.fit(x), (silently) doesn't work in the above is that
>>> expon.nnlf returns inf for the default start loc.
>>>
>>> Should fit_loc_scale be overwritten for expon?
>>>
>>> In [60]: expon.fit_loc_scale(x)
>>> Out[60]: (0.21874043533906118, 5.7122829778172939)
>>>
>>> In [61]: expon.nnlf(expon.fit_loc_scale(x),x)
>>> Out[61]: inf
>>>
>>> Changing fmin to disp=1, gives
>>>
>>> In[62]: expon.fit(x)
>>>
>>> <snip all the warnings, except from the solver>
>>>
>>> Warning: Maximum number of function evaluations has been exceeded.
>>> Out[62]: (0.21874043533906118, 5.7122829778172939)
>>>
>>> The default loc is defined (for this case as)
>>>
>>> loc = x.mean() - x.std()
>>>
>>> so for any x <= loc it is outside of the domain of the exponential
>>> distribution when it gets centered in nnlf.
>>>
>>> But
>>>
>>> In [63]: expon.fit(x,loc=0)
>>> Optimization terminated successfully.
>>>         Current function value: 13900.441325
>>>         Iterations: 59
>>>         Function evaluations: 107
>>> Out[63]: (0.00064357307755842945, 5.9303783425183241)
>>
>> I haven't looked yet in details at the new fit method.
>>
>> But for many distributions, especially those with a finite boundary,
>> fit works only if the starting values are well chosen. In this case, I
>> would set the starting value for loc at (x.min() - a little bit),
>> unless loc is frozen at zero.
>>
>
> Yeah that's what I settled on, though it wasn't obvious (to me) and
> took some digging to get to the bottom of.
>
>> I didn't check where the default starting value for expon has been changed.
>>
>
> It's inherited from rv_continuous.

If you need starting values for more distributions, I have a script
where I categorize all distributions, by whether the support is open,
left-bounded, right-bounded, bounded or ambiguous, and choose
appropriate starting values.

But this is not yet merged with a semi-frozen fit method.

fit for distributions that have support on the entire real line work
pretty well.

Josef

>
> Skipper
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>