PyUnicodeObject / PyASCIIObject questions
(see http://www.python.org/dev/peps/pep-0393/ and http://hg.python.org/cpython/file/6f097ff9ac04/Include/unicodeobject.h )

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        struct {
            unsigned int interned:2;
            unsigned int kind:2;  /* now 3 in implementation */
            unsigned int compact:1;
            unsigned int ascii:1;
            unsigned int ready:1;
        } state;
        wchar_t *wstr;
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyCompactUnicodeObject;

    typedef struct {
        PyCompactUnicodeObject _base;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;
    } PyUnicodeObject;

(1) Why is PyObject_HEAD used instead of PyObject_VAR_HEAD? Is it because of the names (.length vs .size), or a holdover from when unicode (as opposed to str) did not expect to be compact, or is there a deeper reason?

(2) Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length? As best I can tell from the PEP or header file, wstr is only meaningful when either:

(2a) wstr is shared with (and redundant to) the canonical representation -- which will therefore not be ASCII. So wstr (and wstr_length) shouldn't need to be represented explicitly, and certainly not in the PyASCIIObject base.

or

(2b) The string is a "Legacy String" (and PyUnicode_READY has not been called). Because it is a Legacy String, the object header must already be a full PyUnicodeObject, and the wstr fields could at least be stored there.

I'm also not sure why wstr can't be stored in the existing .data member -- once PyUnicode_READY is called, it will either be there (shared) or be discarded. Are there other times when the wstr will be explicitly re-filled and cached?

(3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up.
(Better still would be to allow other values, and to have the macros delegate to some attribute on the (sub) type object.)

Discussion on py-ideas strongly suggested that people should not be rolling their own string representations, that it won't really save as much as people think it will, etc ... but I'm not sure that saying "do it without inheritance" is the best solution -- and that is what treating kind as an exhaustive list does.

-jJ
(1) Why is PyObject_HEAD used instead of PyObject_VAR_HEAD? Is it because of the names (.length vs .size), or a holdover from when unicode (as opposed to str) did not expect to be compact, or is there a deeper reason?
The unicode object is not a var object. In a var object, tp_itemsize gives the element size, which is not possible for unicode objects, since the itemsize may vary by instance. In addition, not all instances have the items after the base object (plus the size of the base object in tp_basicsize is also not always correct).
(2) Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length? As best I can tell from the PEP or header file, wstr is only meaningful when either:
No. wstr is primarily relevant if someone calls PyUnicode_AsUnicode(AndSize); any unicode object might get the wstr pointer filled out at some point. It can be shared only if sizeof(Py_UNICODE) matches the canonical width of the string. wstr_length is only relevant if wstr is not NULL. For a pure ASCII string (and also for Latin-1 and other BMP strings), the wstr length will always equal the canonical length (number of code points). Only for ASCII objects was the optimization made to drop wstr_length from the representation.
I'm also not sure why wstr can't be stored in the existing .data member -- once PyUnicode_READY is called, it will either be there (shared) or be discarded.
Most objects won't have the .data member. For those that do, .data holds the canonical representation (and *only* after PyUnicode_READY has been called).
(3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. (Better still would be to allow other values, and to have the macros delegate to some attribute on the (sub) type object.)
Discussion on py-ideas strongly suggested that people should not be rolling their own string string representations, and that it won't really save as much as people think it will, etc ... but I'm not sure that saying "do it without inheritance" is the best solution -- and that is what treating kind as an exhaustive list does.
If people use C, they can construct all kinds of "illegal" representations, for any object (e.g. lists where the stored length differs from the actual length, dictionaries where key and value are switched, and so on). If they do that, they likely get crashes and other failures, so they quickly stop doing it. In the specific case of kind values: many places will either work incorrectly, or have an assertion in debug mode already if an unexpected kind is encountered. I don't mind adding such checks to more places, but I also don't see a need to explicitly care about this specific class of bugs where people would have to deliberately try to "cheat".

Regards, Martin
On Tue, Dec 13, 2011 at 2:55 AM, "Martin v. Löwis" wrote:
(1) Why is PyObject_HEAD used instead of PyObject_VAR_HEAD?
The unicode object is not a var object. In a var object, tp_itemsize gives the element size, which is not possible for unicode objects, since the itemsize may vary by instance. In addition, not all instances have the items after the base object (plus the size of the base object in tp_basicsize is also not always correct).
That makes perfect sense. Any chance of adding the rationale to the code? Either inline, such as changing unicodeobject.h line 291 from:

    PyObject_HEAD

to something like:

    PyObject_HEAD  /* Not VAR_HEAD, because tp_itemsize varies, and data may be elsewhere. */

or in the large comments around line 288:

    Note that Strings use PyObject_HEAD and a length field instead of
    PyObject_VAR_HEAD, because the tp_itemsize varies by instance, and
    the actual data is not always immediately after the PyASCIIObject
    header.
(2) Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length? As best I can tell from the PEP or header file, wstr is only meaningful when either:
No. wstr is most of all relevant if someone calls PyUnicode_AsUnicode(AndSize); any unicode object might get the wstr pointer filled out at some point.
I am willing to believe that requests for a wchar_t (or utf-8 or System Locale charset) representation are common enough to justify caching the data after the first request. But then why throw it away in the first place? Wouldn't programs that create unicode from wchar_t data also be the most likely to request wchar_t data back?
wstr_length is only relevant if wstr is not NULL. For a pure ASCII string (and also for Latin-1 and other BMP strings), the wstr length will always equal the canonical length (number of code points).
wstr_length != length exactly when:

    2 == sizeof(wchar_t) && PyUnicode_4BYTE_KIND == PyUnicode_KIND(str)

which can sometimes be eliminated at compile-time, and always by string creation time. In all other cases, (wstr_length == length), and wstr can be generated by widening the data without having to inspect it. Is it worth eliminating wstr_length (or even wstr) in those cases, or is that too much complexity?
(3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. ...
If people use C, they can construct all kinds of "illegal" ... kind values: many places will either work incorrectly, or have an assertion in debug mode already if an unexpected kind is encountered.
What I'm asking is that (1) The other values be documented as reserved, rather than as illegal. (2) The macros produce an error rather than silently corrupting data. This allows at least the possibility of a later change such that (3) The macros handle the new values correctly, if only by delegating back to type-supplied functions. -jJ
Any chance of adding the rationale to the code?
I'm really short of time right now, so you need to find somebody else to make such a change.
I am willing to believe that requests for a wchar_t (or utf-8 or System Locale charset) representation are common enough to justify caching the data after the first request.
That's not the issue; the real issue is memory management.
But then why throw it away in the first place? Wouldn't programs that create unicode from wchar_t data also be the most likely to request wchar_t data back?
Perhaps. But are they likely to access the string they just created again at all? They know what's in it, so why look at it again?
In all other cases, (wstr_length == length), and wstr can be generated by widening the data without having to inspect it. Is it worth eliminating wstr_length (or even wstr) in those cases, or is that too much complexity?
It's too little saving.
What I'm asking is that (1) The other values be documented as reserved, rather than as illegal.
How is that different?
(2) The macros produce an error rather than silently corrupting data.
In debug mode, or release mode? -1 on release mode. Regards, Martin
On 12/13/2011 7:01 PM, "Martin v. Löwis" wrote:
What I'm asking is that (1) The other values be documented as reserved, rather than as illegal.

How is that different?

(2) The macros produce an error rather than silently corrupting data.

In debug mode, or release mode? -1 on release mode.
These two requests seem slightly contradictory. Non-official __xxx__ names are reserved for future use but are not illegal now for user use, and user-generated examples do not raise an exception. They simply do not get any special attention unless and until given an official meaning. Then too bad if that breaks code. So by analogy, reserved kind values would be ignored, neither corrupting data nor raising errors, until put in use. But I don't know how easy/practical that would be. Or maybe more to the point, how expensive a check would be. Not checking names for reservedness is the easiest thing to do.

-- Terry Jan Reedy
On Tuesday, 13 December 2011 at 02:09:02, Jim Jewett wrote:
(3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. (Better still would be to allow other values, and to have the macros delegate to some attribute on the (sub) type object.)
A macro is not supposed to raise an error. In debug mode, _PyUnicode_CheckConsistency() ensures that the kind is valid, and PyUnicode_KIND() fails with an assertion error if the kind is PyUnicode_WCHAR_KIND. Python cannot create a string with a kind other than PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND (the legacy API creates strings with a temporary PyUnicode_WCHAR_KIND kind, which is quickly replaced by PyUnicode_READY). If you write your own extension generating an invalid string, I don't think that Python can help you. Python cannot check all data; it would be too slow.

If we change something, I would suggest removing PyUnicode_WCHAR_KIND from the PyUnicode_Kind enum, so you can be sure that the PyUnicode_KIND() result is an enum with 3 possible values (PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND). It would also help to quiet the compiler on switch/case ;-)

Victor
participants (5)
- "Martin v. Löwis"
- Antoine Pitrou
- Jim Jewett
- Terry Reedy
- Victor Stinner