2015-02-06 6:04 GMT-05:00 Armin Rigo <arigo@tunes.org>:

On 6 February 2015 at 08:24, Maciej Fijalkowski <fijall@gmail.com> wrote:
> I don't think it's safe to assume f_code is properly filled by the
> time you might read it, depending a bit where you find the frame
> object. Are you sure it's not full of garbage?

Yes, before discussing how to do the utf8 decoding, we should realize
that it is really unsafe code starting from the line before.  From a
signal handler you're only supposed to read data that was written to
"volatile" fields.  So even PyEval_GetFrame(), which is done by
reading the thread state's "frame" field, is not safe: this is not a
volatile.  This means that the compiler is free to do crazy things
like *first* write into this field and *then* initialize the actual
content of the frame.  The uninitialized content may be garbage, not
just NULLs.

Thanks for these comments. Of course, accessing frames within a signal handler is racy. I confirm that non-ASCII text is not accessible through the utf8 buffer pointer. However, a call to PyUnicode_AsUTF8() encodes the data and caches it in the unicode object. Later accesses return the cached byte buffer without memory allocation or re-encoding.

I think it is possible to solve both safety problems by registering a handler with PyPyEval_SetProfile(). On function entry, the handler will call PyUnicode_AsUTF8() on the required frame members to make sure the utf8-encoded string is available. Then, we increment the refcount of the frame and assign it to a thread-local pointer. On function return, the refcount is decremented. These operations occur in the normal context, so they are not racy. The signal handler will use the thread-local frame pointer instead of calling PyEval_GetFrame(). Does that sound good?

Thanks again for your feedback!
