[scikit-learn] Heisenbug?

Fri Jan 17 15:38:50 EST 2020

At this point, it looks like:
1) The NaNs are real.
2) They're coming from some XGBoost native code, or perhaps from the
Python<->native boundary, which uses ctypes.
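
For anyone hitting the same ValueError, here is a minimal sketch of how one
might check an array for the three conditions the message names (NaN,
infinity, or a value too large for float32) before it reaches fit(). The
helper name summarize_nonfinite and the sample data are hypothetical, not
part of scikit-learn or XGBoost, and the sketch assumes a 2-D numeric
feature matrix.

```python
import numpy as np


def summarize_nonfinite(X, dtype=np.float32):
    """Count values that would fail a finiteness check after casting to dtype.

    Hypothetical helper for illustration; assumes X is a 2-D numeric matrix.
    """
    arr = np.atleast_2d(np.asarray(X, dtype=np.float64))
    # Casting an out-of-range float64 may emit an overflow warning; silence it here.
    with np.errstate(over="ignore", invalid="ignore"):
        cast = arr.astype(dtype)
    nan_mask = np.isnan(arr)
    inf_mask = np.isinf(arr)
    # Finite as float64 but infinite after the cast: too large for dtype.
    overflow_mask = np.isinf(cast) & ~inf_mask
    bad = nan_mask | inf_mask | overflow_mask
    return {
        "nan": int(nan_mask.sum()),
        "inf": int(inf_mask.sum()),
        "overflow": int(overflow_mask.sum()),
        "bad_rows": np.flatnonzero(bad.any(axis=1)).tolist(),
    }


# Made-up sample data: one NaN, one float32 overflow, one infinity.
sample = [[1.0, 2.0], [np.nan, 1e39], [np.inf, 0.0]]
report = summarize_nonfinite(sample)
print(report)
```

Calling something like this on the data just before and just after the
suspect native call can help narrow down which side of the boundary
introduces the bad values.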

The prints that didn't print were probably caused by a misplaced flush.

The debugger that didn't debug was probably caused by pytest capturing
stdout, combined with async Python code.
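
A minimal sketch of diagnostic logging that does not depend on a later
flush, in case it is useful to others: each call appends one line to a file
and pushes it to disk before returning, so the line should survive an
exception raised right afterwards. The function name debug_log and the
default file name are hypothetical.

```python
import os


def debug_log(message, path="heisenbug-debug.log"):
    """Append one line to a log file and push it to disk before returning.

    Hypothetical helper for illustration; the default path is arbitrary.
    """
    with open(path, "a", encoding="utf-8") as fh:
        # Tag each line with the process ID, since worker processes may log too.
        fh.write(f"[pid {os.getpid()}] {message}\n")
        fh.flush()             # move Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to write it to disk
```

For output to the terminal, print(..., flush=True) serves a similar purpose.
Running pytest with -s (short for --capture=no) turns off output capturing,
which may also help when a debugger needs the real terminal.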

Thanks.

On Wed, Dec 18, 2019 at 4:09 PM Dan Stromberg <dstromberg at grokstream.com>
wrote:

>
> Any (further) suggestions folks?
>
> BTW, when I say pudb fails to start, I mean it raises a traceback while
> trying to call None.fileno().  In other pieces of (C)Python code I've
> tried it in, pudb.set_trace() worked nicely.
>
> On Tue, Dec 17, 2019 at 7:50 AM Dan Stromberg <dstromberg at grokstream.com>
> wrote:
>
>>
>> Hi.
>>
>> Overflow does sound kind of possible.  We're sending semi-random values
>> to the test.
>>
>> I believe our systems are all x86_64, Linux.  Some are Ubuntu 16.04, some
>> are Mint 19.2.
>>
>> I realized on the way to work this morning, that I left out some
>> important information; I suspect a heisenbug for 3 reasons:
>>
>> 1) If I try to look at it with print functions, I get a traceback after
>> the prints, but no print output.  This happens both when writing to a
>> disk-based file and when printing to stdout.
>>
>> 2) If I try to look at it with pudb (a debugger) via pudb.set_trace(), I
>> get a failure to start pudb.
>>
>> 3) If I create a small test program that sends the same inputs to the
>> function in question, the function works fine.
>>
>> Thanks.
>>
>> On Mon, Dec 16, 2019 at 11:20 PM Joel Nothman <joel.nothman at gmail.com>
>> wrote:
>>
>>> Hi Dan, this kind of error can come from overflow. Are all of your test
>>> systems the same architecture?
>>>
>>> On Tue., 17 Dec. 2019, 12:03 pm Dan Stromberg, <
>>> dstromberg at grokstream.com> wrote:
>>>
>>>> Hi folks.
>>>>
>>>> I'm new to Scikit-learn.
>>>>
>>>> I have a very large Python project that seems to have a heisenbug which
>>>> is manifesting in scikit-learn code.
>>>>
>>>> Short of constructing an SSCCE, are there any magical techniques I
>>>> should try for pinning down the precise cause?  Like valgrind or something?
>>>>
>>>> An SSCCE will most likely be pretty painful: the project has copious
>>>> shared, mutable state, and I've already tried a largish test program that
>>>> calls into the same code path with the error manifesting 0 times in 100.
>>>>
>>>> It's quite possible the root cause will turn out to be some other part
>>>> of the software stack.
>>>>
>>>> The traceback from pytest looks like:
>>>> sequential/test_training.py:101:
>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>> ../rt/classifier/coach.py:146: in train
>>>>     **self.classifier_section
>>>> ../domain/classifier/factories/classifier_academy.py:115: in create_classifier
>>>>     **kwargs)
>>>> ../domain/classifier/factories/imp/xgb_factory.py:164: in create
>>>>     clf_random.fit(X_train, y_train)
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:722: in fit
>>>>     self._run_search(evaluate_candidates)
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:1515: in _run_search
>>>>     random_state=self.random_state))
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/model_selection/_search.py:711: in evaluate_candidates
>>>>     cv.split(X, y, groups)))
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:996: in __call__
>>>>     self.retrieve()
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py:899: in retrieve
>>>>     self._output.extend(job.get(timeout=self.timeout))
>>>> ../../../../.local/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py:517: in wrap_future_result
>>>>     return future.result(timeout=timeout)
>>>> /usr/lib/python3.6/concurrent/futures/_base.py:425: in result
>>>>     return self.__get_result()
>>>> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>
>>>> self = <Future at 0x7f15571ec7f0 state=finished raised ValueError>
>>>>
>>>>     def __get_result(self):
>>>>         if self._exception:
>>>> >           raise self._exception
>>>> E           ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
>>>>
>>>> /usr/lib/python3.6/concurrent/futures/_base.py:384: ValueError
>>>>
>>>>
>>>> The above exception is raised about 12 to 14 times in 100 in full-blown
>>>> automated testing.
>>>>
>>>> Thanks for the cool software.
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>
>>