Proposal to associate thread-local data/context with a faulthandler traceback
The faulthandler module is invaluable for tracking down segfaults in native code, but it really lacks the ability to attach some kind of useful breadcrumb to aid debugging. Imagine you are running a large-scale distributed job over tens of millions of images and a single one causes opencv to segfault reliably. This can be very difficult to track down.

It would be very useful to be able to add a string (limited in size) via the faulthandler module that is output along with the traceback. For example:

```
for image in millions_of_images():
    faulthandler.context(f'{image.id=}')
    segfaulty_function(image)

# Or maybe:
for image in millions_of_images():
    with faulthandler.context(f'{image.id=}'):
        segfaulty_function(image)
```

The traceback might be something like:

```
Fatal Python error: Segmentation fault

Current thread 0x00007fb899f39700 (most recent call first):
.... (traceback)
Segmentation fault

Context: image_id=foo
```

You could of course add logging to this function to print out the image ID before you run it through `segfaulty_function`, but if you're running this at high volume on a huge number of machines, this becomes a bit of an overhead, and it doesn't handle multiple threads running the function well.
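To make the intended thread-local semantics concrete, here is a minimal pure-Python stand-in, a sketch only: the names `context` and `current_context` are illustrative, and this is not crash-safe; a real implementation would have to live in C inside faulthandler so the string survives a SIGSEGV and can be written by the fault handler itself.

```
# Hypothetical illustration of the proposed thread-local context -- NOT an existing API.
import contextlib
import threading

_context = threading.local()

@contextlib.contextmanager
def context(breadcrumb: str):
    """Associate a short breadcrumb with the current thread for the duration of the block."""
    previous = getattr(_context, "value", None)
    _context.value = breadcrumb[:120]  # the proposal suggests a small, fixed size limit
    try:
        yield
    finally:
        _context.value = previous

def current_context():
    """What a crash handler would read back for the crashing thread."""
    return getattr(_context, "value", None)
```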
On 2 Mar 2022, at 14:22, tom@tomforb.es wrote:
The faulthandler module is invaluable for tracking down segfaults in native code, but it really lacks the ability to attach some kind of useful breadcrumb to aid debugging. Imagine you are running a large-scale distributed job over tens of millions of images and a single one causes opencv to segfault reliably. This can be very difficult to track down.

It would be very useful to be able to add a string (limited in size) via the faulthandler module that is output along with the traceback. For example:

```
for image in millions_of_images():
    faulthandler.context(f'{image.id=}')
    segfaulty_function(image)

# Or maybe:
for image in millions_of_images():
    with faulthandler.context(f'{image.id=}'):
        segfaulty_function(image)
```
The traceback might be something like:
```
Fatal Python error: Segmentation fault

Current thread 0x00007fb899f39700 (most recent call first):
.... (traceback)
Segmentation fault

Context: image_id=foo
```
You could of course add logging to this function to print out the image ID before you run it through `segfaulty_function`, but if you're running this at high volume on a huge number of machines, this becomes a bit of an overhead, and it doesn't handle multiple threads running the function well.
This does sound useful. I use this pattern of setting a context that is only logged if there is an error in my own code.

Until you have this, have you considered turning on core dumps? Then you will be able to see the image id in the dump file. Alternatively you could write the context to a file, then log the contents of the file in a wrapper script that handles Python exiting on SEGV.

I understand what you are getting at about not wanting to log every attempt. But you could have another program that is sent a message (over UDS) saying "about to process {context}" and, after completion, a message saying "processing of {context} completed". You would log only the {context}s that are not completed.

Barry
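A rough sketch of the file-based workaround described above. The sidecar path and the wrapper logic are illustrative assumptions, and `millions_of_images` / `segfaulty_function` are the placeholders from the original example.

```
# worker.py -- overwrite a small sidecar file with the item currently in flight.
import faulthandler
import pathlib

CONTEXT_FILE = pathlib.Path("/tmp/current_context.txt")

def process_all():
    faulthandler.enable()  # dump a traceback to stderr on SIGSEGV
    for image in millions_of_images():
        # Overwrite rather than append, so the file only ever names the current item.
        CONTEXT_FILE.write_text(f"image_id={image.id}\n")
        segfaulty_function(image)
    CONTEXT_FILE.unlink(missing_ok=True)  # clean exit: nothing left to blame
```

```
# wrapper.py -- run the worker and, if it was killed by SIGSEGV, log the leftover context.
import pathlib
import signal
import subprocess

CONTEXT_FILE = pathlib.Path("/tmp/current_context.txt")

result = subprocess.run(["python", "worker.py"])
if result.returncode == -signal.SIGSEGV:  # a negative return code means "killed by that signal"
    context = CONTEXT_FILE.read_text() if CONTEXT_FILE.exists() else "<unknown>"
    print(f"worker segfaulted while processing: {context}")
```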
Until you have this, have you considered turning on core dumps? Then you will be able to see the image id in the dump file. Alternatively you could write the context to a file, then log the contents of the file in a wrapper script that handles Python exiting on SEGV.

I understand what you are getting at about not wanting to log every attempt. But you could have another program that is sent a message (over UDS) saying "about to process {context}" and, after completion, a message saying "processing of {context} completed". You would log only the {context}s that are not completed.
There are a number of ways to do this, but quite often the framework you're using is purely involved with processing batches of data, which makes the "poison pill" harder to find at that level.

My suggestion is merely a quality-of-life one: it would be very, very handy to include _some_ information in the faulthandler traceback. Even a fixed 120-character, ASCII-only string would be really useful. Imagine your job finishes and you have 3 faulthandler tracebacks: being able to tell right away whether they are caused by the same input (be that a batch, task, file, etc.) or by separate ones is invaluable. Right now you have 3 identical tracebacks with no idea what input actually triggered them.

For reference, this request comes from running Dask[1] jobs. Dask handles retrying and tracking tasks across machines, but if you're dealing with a batch of inputs that reliably kills a worker it is really hard to debug, more so if it only happens ~12 hours into your job. At certain scales it's quite hard to log every processing event reliably, and the overhead may not be worth it for a 1-in-10,000,000 failure.

1. https://dask.org/
If anyone is interested, I had a play around with this and came up with a pretty simple-ish implementation: https://github.com/orf/cpython/pull/1/files.
On 6 Mar 2022, at 07:19, tom@tomforb.es wrote:
For reference, this request comes from running Dask[1] jobs. Dask handles retrying and tracking tasks across machines, but if you're dealing with a batch of inputs that reliably kills a worker it is really hard to debug, more so if it only happens ~12 hours into your job. At certain scales it's quite hard to log every processing event reliably, and the overhead may not be worth it for a 1-in-10,000,000 failure.
I think that the core dump will get you an answer now. With your example you will have only 1 core dump to look at.
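For the core-dump route, the only in-process step is making sure the soft core-file size limit is not zero before doing any work; where the dump lands (e.g. kernel.core_pattern on Linux) is operating-system configuration outside of Python. A minimal sketch:

```
# Raise the soft core-file size limit to the hard limit so a SIGSEGV can produce a core dump.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```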
On 6 Mar 2022, at 08:29, tom@tomforb.es wrote:
If anyone is interested, I had a play around with this and came up with a pretty simple-ish implementation: https://github.com/orf/cpython/pull/1/files.
I wonder if you should just raise a bug against python for this and provide your PR for the implementation.

Barry
On Mar 6, 2022, at 5:05 AM, Barry Scott <barry@barrys-emacs.org> wrote:
On 6 Mar 2022, at 07:19, tom@tomforb.es wrote:
For reference, this request comes from running Dask[1] jobs. Dask handles retrying and tracking tasks across machines, but if you're dealing with a batch of inputs that reliably kills a worker it is really hard to debug, more so if it only happens ~12 hours into your job. At certain scales it's quite hard to log every processing event reliably, and the overhead may not be worth it for a 1-in-10,000,000 failure.
I think that the core dump will get you an answer now. With your example you will have only 1 core dump to look at.
On 6 Mar 2022, at 08:29, tom@tomforb.es wrote:
If anyone is interested, I had a play around with this and came up with a pretty simple-ish implementation: https://github.com/orf/cpython/pull/1/files.
I wonder if you should just raise a bug against python for this and provide your PR for the implementation.
I think that’s a good idea. You should make Victor Stinner nosy on the issue.

Eric
Ok, thank you. I will make the issue and pull request once the switch to GitHub issues is done.

On Sun, Mar 6, 2022, at 10:20 AM, Eric V. Smith wrote:
On Mar 6, 2022, at 5:05 AM, Barry Scott <barry@barrys-emacs.org> wrote:
On 6 Mar 2022, at 07:19, tom@tomforb.es wrote:
For reference, this request comes from running Dask[1] jobs. Dask handles retrying and tracking tasks across machines, but if you're dealing with a batch of inputs that reliably kills a worker it is really hard to debug, more so if it only happens ~12 hours into your job. At certain scales it's quite hard to log every processing event reliably, and the overhead may not be worth it for a 1-in-10,000,000 failure.
I think that the core dump will get you an answer now. With your example you will have only 1 core dump to look at.
On 6 Mar 2022, at 08:29, tom@tomforb.es wrote:
If anyone is interested, I had a play around with this and came up with a pretty simple-ish implementation: https://github.com/orf/cpython/pull/1/files.
I wonder if you should just raise a bug against python for this and provide your PR for the implementation.
I think that’s a good idea. You should make Victor Stinner nosy on the issue.
Eric
participants (4):

- Barry Scott
- Eric V. Smith
- Tom Forbes
- tom@tomforb.es