[Python-Dev] PEP 393: Flexible String Representation

Thu Jan 27 22:16:54 CET 2011

> Repetition of "11"; I'm guessing that the 2byte/UCS-2 should read "10",
> so that they give the width of the char representation.

Thanks, fixed.

>>   00 => null pointer
> 
> Naturally this assumes that all pointers are at least 4-byte aligned (so
> that they can be masked off).  I assume that this is sane on every
> platform that Python supports, but should it be spelled out explicitly
> somewhere in the PEP?

I'll change the PEP to move the type indicator into the state field, so
that issue becomes irrelevant.

>>   The string is null-terminated (in its respective representation).
>> - hash, state: same as in Python 3.2
>> - utf8_length, utf8: UTF-8 representation (null-terminated)
> If this is to share its buffer with the "str" representation for the
> Latin-1 case, then I take it this ptr will typically be (str & ~4) ?
> i.e. only "str" has the low-order-bit type info.

Yes, the other pointers are aligned. Notice that the case in which
sharing occurs is only ASCII, though (for Latin-1, some characters
require two bytes in UTF-8).

> Spelling out the meaning of "optional":
>   does this mean that the relevant ptr is NULL; if so, if utf8 is null,
> is utf8_length undefined, or is it some dummy value?

I've clarified this: I propose length is undefined (unless there is a
good reason to clear it).

>> If the string is created directly with the canonical representation
>> (see below), this representation doesn't take a separate memory block,
>> but is allocated right after the PyUnicodeObject struct.
> 
> Is the idea to do pointer arithmentic when deleting the PyUnicodeObject
> to determine if the ptr is in that location, and not delete it if it is,
> or is there some other way of determining whether the pointers need
> deallocating?

Correct.

> If the former, is this embedding an assumption that the
> underlying allocator couldn't have allocated a buffer directly adjacent
> to the PyUnicodeObject.  I know that GNU libc's malloc/free
> implementation has gaps of two machine words between each allocation;
> off the top of my head I'm not sure if the optimized Object/obmalloc.c
> allocator enforces such gaps.

No, it doesn't... So I guess I reserve another bit in the state for that.

> GDB Debugging Hooks
> -------------------
> Tools/gdb/libpython.py contains debugging hooks that embed knowledge
> about the internals of CPython's data types, include PyUnicodeObject
> instances.  It will need to be slightly updated to track the change.

Thanks, added.

Regards,
Martin