[Python-Dev] PyUnicodeObject / PyASCIIObject questions

Tue Dec 13 08:55:02 CET 2011

> (1)  Why is PyObject_HEAD used instead of PyObject_VAR_HEAD?  Is it
> because of the names (.length vs .size), or a holdover from when
> unicode (as opposed to str) did not expect to be compact, or is there
> a deeper reason?

The unicode object is not a var object. In a var object, tp_itemsize
gives the element size; that is not possible for unicode objects,
since the element size (1, 2, or 4 bytes) varies by instance. In
addition, not all instances store their items directly after the base
object, and tp_basicsize does not always give the correct size of the
base object either.
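
The per-instance element size can be observed from Python. The
following is only a sketch, assuming a CPython build that implements
PEP 393 (3.3 or later); bytes_per_code_point is a hypothetical helper,
not part of any API. It estimates the storage per code point by
comparing the sizes of two strings that differ only in length:

```python
import sys

def bytes_per_code_point(ch, n=100, extra=50):
    """Estimate storage per code point (hypothetical helper, CPython only)."""
    short = ch * n
    longer = ch * (n + extra)
    # The fixed header cancels out; what remains is the per-element cost.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // extra

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u0394"),
    ("non-BMP", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```

A single tp_itemsize on the type cannot describe all four cases at
once, which is the point made above.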

> (2)  Why does PyASCIIObject have a wstr member, and why does
> PyCompactUnicodeObject have wstr_length?  As best I can tell from the
> PEP or header file, wstr is only meaningful when either:

No. wstr matters mainly when someone calls PyUnicode_AsUnicode or
PyUnicode_AsUnicodeAndSize; any unicode object might get its wstr
pointer filled in at some point. wstr can share the canonical buffer
only if sizeof(Py_UNICODE) matches the canonical width of the string.

wstr_length is only relevant if wstr is not NULL. For a pure ASCII
string (and also for Latin-1 and other BMP strings), the wstr length
will always equal the canonical length (number of code points). The
optimization of dropping wstr_length from the representation was
made only for ASCII objects.
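
To see why the two lengths can differ at all, consider a build where
Py_UNICODE is 16 bits wide. The sketch below uses the UTF-16 encoding
as a stand-in for the wstr buffer on such a build (an assumption made
for illustration; it does not inspect the real wstr member), and
utf16_units is a hypothetical helper:

```python
def utf16_units(s):
    """Number of 16-bit code units needed for s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2

for s in ["abc", "caf\xe9", "\u0394x", "a\U0001F600"]:
    # Columns: the string, its code point count, its UTF-16 unit count.
    print(ascii(s), len(s), utf16_units(s))
```

For the ASCII, Latin-1 and BMP samples the two counts agree; only the
non-BMP sample needs a surrogate pair and therefore more code units
than code points.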

> I'm also not sure why wstr can't be stored in the existing .data
> member -- once PyUnicode_READY is called, it will either be there
> (shared) or be discarded.

Most objects won't have the .data member. For those that do, .data
holds the canonical representation (and *only* after PyUnicode_READY
has been called).

> (3)  I would feel much less nervous if the remaining 4 values of
> PyUnicode_Kind were explicitly reserved, and the macros raised an
> error when they showed up.  (Better still would be to allow other
> values, and to have the macros delegate to some attribute on the (sub)
> type object.)
> 
> Discussion on py-ideas strongly suggested that people should not be
> rolling their own string representations, and that it won't
> really save as much as people think it will, etc ... but I'm not sure
> that saying "do it without inheritance" is the best solution -- and
> that is what treating kind as an exhaustive list does.

If people use C, they can construct all kinds of "illegal"
representations, for any object (e.g. lists where the stored length
differs from the actual length, dictionaries where key and value are
switched, and so on). If they do that, they likely get crashes and
other failures, so they quickly stop doing it. In the specific case
of kind values: many places will either work incorrectly, or have
an assertion in debug mode already if an unexpected kind is
encountered. I don't mind adding such checks to more places, but I
also don't see a need to explicitly care about this specific class
of bugs where people would have to deliberately try to "cheat".
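
For illustration only, here is roughly what such a check amounts to,
written in Python rather than C. The kind values 1, 2 and 4 are the
ones PEP 393 defines for the 1-, 2- and 4-byte representations;
read_code_point is a hypothetical analogue of a kind-dispatching read,
and the little-endian byte order is an assumption of the sketch:

```python
# Kind values as defined by PEP 393; each equals the element width in bytes.
ONE_BYTE_KIND, TWO_BYTE_KIND, FOUR_BYTE_KIND = 1, 2, 4

def read_code_point(kind, data, index):
    """Read one element from a raw buffer, rejecting unknown kinds.

    Hypothetical helper; assumes little-endian storage.
    """
    if kind not in (ONE_BYTE_KIND, TWO_BYTE_KIND, FOUR_BYTE_KIND):
        raise ValueError("unexpected kind: %r" % (kind,))
    start = index * kind
    return int.from_bytes(data[start:start + kind], "little")

buf = "\u0394x".encode("utf-16-le")
print(hex(read_code_point(TWO_BYTE_KIND, buf, 0)))
```

Passing a kind outside the three known values raises immediately
instead of reading the buffer with a wrong element width.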

Regards,
Martin