[Python-Dev] PyUnicodeObject / PyASCIIObject questions

Tue Dec 13 22:17:13 CET 2011

On Tue, Dec 13, 2011 at 2:55 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> (1)  Why is PyObject_HEAD used instead of PyObject_VAR_HEAD?

> The unicode object is not a var object. In a var object, tp_itemsize
> gives the element size, which is not possible for unicode objects,
> since the itemsize may vary by instance. In addition, not all instances
> have the items after the base object (plus the size of the base object
> in tp_basicsize is also not always correct).

That makes perfect sense.

Any chance of adding the rationale to the code?  Either inline, such
as changing unicodeobject.h line 291 from

    PyObject_HEAD
to something like:
    PyObject_HEAD               /* Not VAR_HEAD, because tp_itemsize
                                   varies, and data may be elsewhere. */

or in the large comments around line 288:

    Note that strings use PyObject_HEAD and a length field instead of
    PyObject_VAR_HEAD, because the item size varies by instance, and the
    actual data is not always immediately after the PyASCIIObject header.

>> (2)  Why does PyASCIIObject have a wstr member, and why does
>> PyCompactUnicodeObject have wstr_length?  As best I can tell from the
>> PEP or header file, wstr is only meaningful when either:

> No. wstr is most of all relevant if someone calls
> PyUnicode_AsUnicode(AndSize); any unicode object might get the
> wstr pointer filled out at some point.

I am willing to believe that requests for a wchar_t (or UTF-8 or
system locale charset) representation are common enough to justify
caching the data after the first request.

But then why throw it away in the first place?  Wouldn't programs that
create unicode from wchar_t data also be the most likely to request
wchar_t data back?

> wstr_length is only relevant if wstr is not NULL. For a pure ASCII
> string (and also for Latin-1 and other BMP strings), the wstr length
> will always equal the canonical length (number of code points).

wstr_length != length exactly when:

    2==sizeof(wchar_t) &&
    PyUnicode_4BYTE_KIND == PyUnicode_KIND( str )

a condition that can sometimes be ruled out at compile time (whenever
wchar_t is 4 bytes wide), and can always be decided by the time the
string is created.

In all other cases, (wstr_length == length), and wstr can be generated
by widening the data without having to inspect it.  Is it worth
eliminating wstr_length (or even wstr) in those cases, or is that too
much complexity?
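
To make that condition concrete, here is a minimal standalone sketch. It
is not CPython code: the sketch_* names are hypothetical stand-ins, the
data is always passed as 32-bit code points for simplicity, and the
wchar_t width is a run-time parameter only so that both kinds of
platform can be modeled. The point is that the data needs to be
inspected in exactly one case; everywhere else the wchar_t length is
just the code point count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the string kinds; each value is the number
   of bytes per code point in the canonical representation. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Number of wchar_t units needed to hold `length` code points.
   wchar_size models sizeof(wchar_t) and must be 2 or 4. */
size_t
sketch_wstr_length(const uint32_t *data, size_t length,
                   enum sketch_kind kind, size_t wchar_size)
{
    size_t units, i;

    assert(wchar_size == 2 || wchar_size == 4);

    /* Every case but one: plain widening, no need to look at the data. */
    if (wchar_size != 2 || kind != SKETCH_4BYTE)
        return length;

    /* 2-byte wchar_t and 4-byte kind: each code point above U+FFFF
       needs a surrogate pair, i.e. one extra unit. */
    units = length;
    for (i = 0; i < length; i++) {
        if (data[i] > 0xFFFF)
            units++;
    }
    return units;
}
```

For example, the two code points U+0041 U+1F600 need three units with a
2-byte wchar_t but only two with a 4-byte wchar_t.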

>> (3)  I would feel much less nervous if the remaining 4 values of
>> PyUnicode_Kind were explicitly reserved, and the macros raised an
>> error when they showed up. ...

> If people use C, they can construct all kinds of "illegal" ...
> kind values: many places will either work incorrectly, or have
> an assertion in debug mode already if an unexpected kind is
> encountered.

What I'm asking is that
(1)  The other values be documented as reserved, rather than as illegal.
(2)  The macros produce an error rather than silently corrupting data.

This allows at least the possibility of a later change such that

(3)  The macros handle the new values correctly, if only by delegating
back to type-supplied functions.
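
A rough sketch of what (2) could look like, written as a standalone
function rather than as the real macro. Everything named sketch_* or
SKETCH_* is hypothetical, the kind values 1, 2 and 4 are my assumption
about the current layout, and the error sentinel is just one possible
way to report the problem. The only point is that an unexpected kind
falls through to an explicit error branch instead of being read as if
it were one of the known widths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed kind values; anything else is treated as reserved. */
enum sketch_kind {
    SKETCH_1BYTE_KIND = 1,
    SKETCH_2BYTE_KIND = 2,
    SKETCH_4BYTE_KIND = 4
};

/* Hypothetical sentinel; it is not a valid code point. */
#define SKETCH_READ_ERROR UINT32_MAX

/* Read the code point at `index`, or return SKETCH_READ_ERROR when the
   kind is not one of the three known values. */
uint32_t
sketch_read(unsigned int kind, const void *data, size_t index)
{
    switch (kind) {
    case SKETCH_1BYTE_KIND:
        return ((const uint8_t *)data)[index];
    case SKETCH_2BYTE_KIND:
        return ((const uint16_t *)data)[index];
    case SKETCH_4BYTE_KIND:
        return ((const uint32_t *)data)[index];
    default:
        /* Reserved kind: report it instead of guessing a width.  A later
           version could delegate to a type-supplied function here. */
        return SKETCH_READ_ERROR;
    }
}
```

The default branch is also where (3) would hook in, if the reserved
values are ever given a meaning.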

-jJ