[Python-Dev] Encoding of PyFrameObject members

Fri Feb 6 01:56:36 CET 2015

On Fri, Feb 6, 2015 at 10:27 AM, Francis Giraldeau
<francis.giraldeau at gmail.com> wrote:
> Instead, I access members directly:
> char *str = PyUnicode_DATA(frame->f_code->co_filename);
> size_t len = PyUnicode_GET_DATA_SIZE(frame->f_code->co_filename);
>
> Is it safe to assume that unicode objects co_filename and co_name are always
> UTF-8 data for loaded code? I looked at the PyTokenizer_FromString() and it
> seems to convert everything to UTF-8 upfront, and I would like to make sure
> this assumption is valid.

I don't think you should be using _GET_DATA_SIZE with _DATA - that
mixes the old and new APIs (_GET_DATA_SIZE belongs to the deprecated
Py_UNICODE API, _DATA to the new one). The internal buffer isn't UTF-8
in general, either. If you want a raw, no-allocation look at the data,
you'd need to check PyUnicode_KIND and then read Latin-1, UCS-2, or
UCS-4 data:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_1BYTE_DATA

(By the way, I don't think the name "UCS-1", as in the Py_UCS1 type
those docs use, is part of the Unicode spec. But it's an obvious
parallel to UCS-2 and UCS-4.)
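
To make the kind check concrete, here is a rough sketch of reading a
fixed-width buffer and writing UTF-8 into a caller-supplied buffer, with
no heap allocation. It is only an illustration, not anything CPython
ships: read_code_point() and to_utf8() are made-up names, and kind, data
and len are plain parameters standing in for what PyUnicode_KIND,
PyUnicode_DATA and PyUnicode_GET_LENGTH would report, so that it builds
without Python.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch code point i from a fixed-width buffer.
 * kind is the code-unit width in bytes (1 = Latin-1, 2 = UCS-2,
 * 4 = UCS-4), assumed to match the value PyUnicode_KIND() reports. */
static uint32_t read_code_point(int kind, const void *data, size_t i)
{
    switch (kind) {
    case 1:  return ((const uint8_t *)data)[i];
    case 2:  return ((const uint16_t *)data)[i];
    default: return ((const uint32_t *)data)[i];
    }
}

/* Hypothetical helper: encode len code points as NUL-terminated UTF-8
 * into out, which has room for cap bytes. Returns the number of bytes
 * written (not counting the NUL), or (size_t)-1 if out is too small.
 * No validation is done: lone surrogates are encoded as-is. */
static size_t to_utf8(int kind, const void *data, size_t len,
                      char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = read_code_point(kind, data, i);
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

        /* Keep one byte free for the terminating NUL. */
        if (n + need >= cap)
            return (size_t)-1;

        if (need == 1) {
            out[n++] = (char)c;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (c >> 18));
            out[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

In an extension the call would presumably look something like
to_utf8(PyUnicode_KIND(s), PyUnicode_DATA(s), PyUnicode_GET_LENGTH(s),
buf, sizeof buf), once PyUnicode_READY(s) has succeeded. Note that on
failure buf may hold a partial result.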

Getting UTF-8 data out of the structure, if it had indeed been cached,
ought to be possible. But I can't see a documented function or macro
for doing it. Is there a way? Reaching into the structure and grabbing
the utf8 and utf8_length members seems like a bad idea.

ChrisA