
In benchmarking PEP 393, I noticed that many UTF-8 decode calls originate from C code with static strings, in particular PyObject_CallMethod. Many of such calls already have been optimized to cache a string object, however, PyObject_CallMethod remains unoptimized since it requires a char*. I find the ad-hoc approach of declaring and initializing variables inadequate, in particular since it is difficult to clean up all those string objects at interpreter shutdown. I propose to add an explicit API to deal with such identifiers. With this API, tmp = PyObject_CallMethod(result, "update", "O", other); would be replaced with PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result, &PyId_update, "O", other); Py_identifier expands to a struct typedef struct Py_Identifier { struct Py_Identifier *next; const char* string; PyObject *object; } Py_Identifier; string will be initialized by the compiler, next and object on first use. The new API for that will be PyObject* PyUnicode_FromId(Py_Identifier*); PyObject* PyObject_CallMethodId(PyObject*, Py_Identifier*, char*, ...); PyObject* PyObject_GetAttrId(PyObject*, Py_Identifier*); int PyObject_SetAttrId(PyObject*, Py_Identifier*, PyObject*); int PyObject_HasAttrId(PyObject*, Py_Identifier*); I have micro-benchmarked this; for import time d={} i=d.items() t=time.time() for _ in range(10**6): i | d print(time.time()-t) I get a speed-up of 30% (notice that "i | d" invokes the above PyObject_CallMethod call). Regards, Martin

On Sat, 08 Oct 2011 16:54:06 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
I find the ad-hoc approach of declaring and initializing variables inadequate, in particular since it is difficult to clean up all those string objects at interpreter shutdown.
I propose to add an explicit API to deal with such identifiers.
That sounds like a good idea.
With this API,
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result, &PyId_update, "O", other);
Surely there is something missing to initialize the "const char *" in the structure? Or is "Py_identifier()" actually a macro?
string will be initialized by the compiler, next and object on first use. The new API for that will be
PyObject* PyUnicode_FromId(Py_Identifier*); PyObject* PyObject_CallMethodId(PyObject*, Py_Identifier*, char*, ...); PyObject* PyObject_GetAttrId(PyObject*, Py_Identifier*); int PyObject_SetAttrId(PyObject*, Py_Identifier*, PyObject*); int PyObject_HasAttrId(PyObject*, Py_Identifier*);
Do we want to make these APIs public? Regards Antoine.

Am 08.10.2011 17:43, schrieb Antoine Pitrou:
On Sat, 08 Oct 2011 16:54:06 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
I find the ad-hoc approach of declaring and initializing variables inadequate, in particular since it is difficult to clean up all those string objects at interpreter shutdown.
I propose to add an explicit API to deal with such identifiers.
That sounds like a good idea.
With this API,
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result, &PyId_update, "O", other);
Surely there is something missing to initialize the "const char *" in the structure? Or is "Py_identifier()" actually a macro?
Yes (note the parenthesized usage).
string will be initialized by the compiler, next and object on first use. The new API for that will be
PyObject* PyUnicode_FromId(Py_Identifier*); PyObject* PyObject_CallMethodId(PyObject*, Py_Identifier*, char*, ...); PyObject* PyObject_GetAttrId(PyObject*, Py_Identifier*); int PyObject_SetAttrId(PyObject*, Py_Identifier*, PyObject*); int PyObject_HasAttrId(PyObject*, Py_Identifier*);
Do we want to make these APIs public?
Probably not at first. I'd suggest making them private for Python 3.3, and if the approach proves satisfying, we can think about adding them to the public API in Python 3.4. Georg

On 10/08/2011 04:54 PM, "Martin v. Löwis" wrote:
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result,&PyId_update, "O", other);
An alternative I am fond of is to to avoid introducing a new type, and simply initialize a PyObject * and register its address. For example: PyObject *tmp; static PyObject *s_update; // pick a naming convention PY_IDENTIFIER_INIT(update); tmp = PyObject_CallMethodObj(result, s_update, "O", other); (but also PyObject_GetAttr(o, s_update), etc.) PY_IDENTIFIER_INIT(update) might expand to something like: if (!s_update) { s_update = PyUnicode_InternFromString("update"); _Py_IdentifierAdd(&s_update); } _PyIdentifierAdd adds the address of the variable to a global set of C variables that need to be decreffed and zeroed-out at interpreted shutdown. The benefits of this approach is: * you don't need special "identifier" versions of functions such as PyObject_CallMethod. In my example I invented a PyObject_CallMethodObj, but adding that might be useful anyway. * a lot of Python/C code implements similar caching, often leaking strings. Hrvoje

2011/10/11 Hrvoje Niksic <hrvoje.niksic@avl.com>
An alternative I am fond of is to to avoid introducing a new type, and simply initialize a PyObject * and register its address. For example:
PyObject *tmp; static PyObject *s_update; // pick a naming convention
PY_IDENTIFIER_INIT(update); tmp = PyObject_CallMethodObj(result, s_update, "O", other);
(but also PyObject_GetAttr(o, s_update), etc.)
PY_IDENTIFIER_INIT(update) might expand to something like:
if (!s_update) { s_update = PyUnicode_InternFromString("**update"); _Py_IdentifierAdd(&s_update); }
It should also check for errors; in this case the initialization is a bit more verbose: if (PY_IDENTIFIER_INIT(update) < 0) <return NULL or return -1 or goto error>; -- Amaury Forgeot d'Arc

On 10/11/2011 02:45 PM, Amaury Forgeot d'Arc wrote:
It should also check for errors; in this case the initialization is a bit more verbose: if (PY_IDENTIFIER_INIT(update) < 0) <return NULL or return -1 or goto error>;
Error checking is somewhat more controversial because behavior in case of error differs between situations and coding patterns. I think it should be up to the calling code to check for s_update remaining NULL. In my example, I would expect PyObject_CallMethodObj and similar to raise InternalError when passed a NULL pointer. Since their return values are already checked, this should be enought to cover the unlikely case of identifier creation failing. Hrvoje

On Oct 11, 2011, at 02:36 PM, Hrvoje Niksic wrote:
On 10/08/2011 04:54 PM, "Martin v. Löwis" wrote:
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result,&PyId_update, "O", other);
An alternative I am fond of is to to avoid introducing a new type, and simply initialize a PyObject * and register its address. For example:
PyObject *tmp; static PyObject *s_update; // pick a naming convention
PY_IDENTIFIER_INIT(update); tmp = PyObject_CallMethodObj(result, s_update, "O", other);
(but also PyObject_GetAttr(o, s_update), etc.)
I like this better too because of the all-caps macro name. Something about seeing "Py_identifier" look like a function call and having it add the magical PyId_update local bugs me. It just looks wrong, whereas the all-caps is more of a cultural clue that something else is going on. -Barry

On Tue, 11 Oct 2011 09:19:43 -0400 Barry Warsaw <barry@python.org> wrote:
On Oct 11, 2011, at 02:36 PM, Hrvoje Niksic wrote:
On 10/08/2011 04:54 PM, "Martin v. Löwis" wrote:
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result,&PyId_update, "O", other);
An alternative I am fond of is to to avoid introducing a new type, and simply initialize a PyObject * and register its address. For example:
PyObject *tmp; static PyObject *s_update; // pick a naming convention
PY_IDENTIFIER_INIT(update); tmp = PyObject_CallMethodObj(result, s_update, "O", other);
(but also PyObject_GetAttr(o, s_update), etc.)
I like this better too because of the all-caps macro name. Something about seeing "Py_identifier" look like a function call and having it add the magical PyId_update local bugs me. It just looks wrong, whereas the all-caps is more of a cultural clue that something else is going on.
+1 for something more recognizable. I think "const string" is more accurate than "identifier" as well. Regards Antoine.

I like this better too because of the all-caps macro name. Something about seeing "Py_identifier" look like a function call and having it add the magical PyId_update local bugs me. It just looks wrong, whereas the all-caps is more of a cultural clue that something else is going on.
If people think the macro should be all upper-case, I can go through and replace them (but only once). Let me know what the exact spelling should be. Originally, I meant to make the variable name equal the string (e.g. then having a variable named __init__ point to the "__init__" string). However, I quickly gave up on that idea, since the strings conflict too often with other identifiers in C. In particular, you couldn't use that approach for calling the "fileno", "read" or "write" methods. So I think it needs a prefix. If you don't like PyId_, let me know what the prefix should be instead. If there is no fixed prefix (i.e. if you have to specify variable name and string value separately), and if there is no error checking, there is IMO too little gain to make this usable. I'm particularly worried about the error checking: the tricky part in C is to keep track of all the code paths. This API hides this by putting the initialization into the callee (PyObject_GetAttrId and friends), hence the (unlikely) failure to initialize the string is reported in the same code path as the (very plausible) error that the actual attribute access failed. Regards, Martin

On Thu, 13 Oct 2011 14:00:38 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
I like this better too because of the all-caps macro name. Something about seeing "Py_identifier" look like a function call and having it add the magical PyId_update local bugs me. It just looks wrong, whereas the all-caps is more of a cultural clue that something else is going on.
If people think the macro should be all upper-case, I can go through and replace them (but only once). Let me know what the exact spelling should be.
Py_CONST_STRING or Py_IDENTIFIER would be fine with me. Given that everything else uses "Id" in their name, Py_IDENTIFIER is probably better?
Originally, I meant to make the variable name equal the string (e.g. then having a variable named __init__ point to the "__init__" string). However, I quickly gave up on that idea, since the strings conflict too often with other identifiers in C. In particular, you couldn't use that approach for calling the "fileno", "read" or "write" methods.
So I think it needs a prefix. If you don't like PyId_, let me know what the prefix should be instead.
I agree with that. Regards Antoine.

On Oct 13, 2011, at 03:23 PM, Antoine Pitrou wrote:
Py_CONST_STRING or Py_IDENTIFIER would be fine with me. Given that everything else uses "Id" in their name, Py_IDENTIFIER is probably better?
I agree that either is fine, with a slight preference for Py_IDENTIFIER for the same reasons.
Originally, I meant to make the variable name equal the string (e.g. then having a variable named __init__ point to the "__init__" string). However, I quickly gave up on that idea, since the strings conflict too often with other identifiers in C. In particular, you couldn't use that approach for calling the "fileno", "read" or "write" methods.
So I think it needs a prefix. If you don't like PyId_, let me know what the prefix should be instead.
I agree with that.
I'm fine with that too, as long as it's all well-documented in the C API guide. -Barry

Py_CONST_STRING or Py_IDENTIFIER would be fine with me. Given that everything else uses "Id" in their name, Py_IDENTIFIER is probably better?
I agree that either is fine, with a slight preference for Py_IDENTIFIER for the same reasons.
Ok, so it's Py_IDENTIFIER.
So I think it needs a prefix. If you don't like PyId_, let me know what the prefix should be instead.
I agree with that.
I'm fine with that too, as long as it's all well-documented in the C API guide.
Hmm. People voted that this should be an internal API, so I'm not sure it should be documented at all outside of the header file, or if, in what document. Currently, this very point is documented in the header file. Regards, Martin

On Oct 13, 2011, at 08:08 PM, Martin v. Löwis wrote:
Py_CONST_STRING or Py_IDENTIFIER would be fine with me. Given that everything else uses "Id" in their name, Py_IDENTIFIER is probably better?
I agree that either is fine, with a slight preference for Py_IDENTIFIER for the same reasons.
Ok, so it's Py_IDENTIFIER.
Given below, shouldn't that be _Py_IDENTIFIER?
So I think it needs a prefix. If you don't like PyId_, let me know what the prefix should be instead.
I agree with that.
I'm fine with that too, as long as it's all well-documented in the C API guide.
Hmm. People voted that this should be an internal API, so I'm not sure it should be documented at all outside of the header file, or if, in what document.
Currently, this very point is documented in the header file.
That's fine, if the macro is prefixed with an underscore. -Barry

Am 13.10.11 20:38, schrieb Barry Warsaw:
On Oct 13, 2011, at 08:08 PM, Martin v. Löwis wrote:
Py_CONST_STRING or Py_IDENTIFIER would be fine with me. Given that everything else uses "Id" in their name, Py_IDENTIFIER is probably better?
I agree that either is fine, with a slight preference for Py_IDENTIFIER for the same reasons.
Ok, so it's Py_IDENTIFIER.
Given below, shouldn't that be _Py_IDENTIFIER?
It actually is _Py_IDENTIFIER (and was _Py_identifier). Regards, Martin

Instead of an explicit prefix, how about a macro, such as Py_ID(__string__)?
That wouldn't be instead, but in addition - you need the variable name, anyway. Not sure whether there is actually a gain in readability - people not familiar with this would assume that it's a function call of some kind, which it would not be. Regards, Martin

Martin v. Löwis wrote:
That wouldn't be instead, but in addition - you need the variable name, anyway.
But the details of exactly how the name is constructed could be kept as an implementation detail.
Not sure whether there is actually a gain in readability - people not familiar with this would assume that it's a function call of some kind, which it would not be.
To me the benefit would be that the name you write as the argument would be *exactly* the identifier it represents. If you have to manually add a prefix, there's room for a bit of confusion, especially if the prefix itself ends with an underscore. E.g. if the identifier is "__init__" and the prefix is "PyID_", do you write "PyID__init__" (two underscores) or "PyID___init__" (three underscores?) And can you easily spot the difference in your editor? -- Greg

Am 15.10.2011 01:32, schrieb Greg Ewing:
Martin v. Löwis wrote:
That wouldn't be instead, but in addition - you need the variable name, anyway.
But the details of exactly how the name is constructed could be kept as an implementation detail.
Is there a use case for keeping that detail hidden?
Not sure whether there is actually a gain in readability - people not familiar with this would assume that it's a function call of some kind, which it would not be.
To me the benefit would be that the name you write as the argument would be *exactly* the identifier it represents.
If you have to manually add a prefix, there's room for a bit of confusion, especially if the prefix itself ends with an underscore. E.g. if the identifier is "__init__" and the prefix is "PyID_", do you write "PyID__init__" (two underscores) or "PyID___init__" (three underscores?) And can you easily spot the difference in your editor?
The compiler can, very easily. Georg

"Martin v. Löwis", 08.10.2011 16:54:
In benchmarking PEP 393, I noticed that many UTF-8 decode calls originate from C code with static strings, in particular PyObject_CallMethod. Many of such calls already have been optimized to cache a string object, however, PyObject_CallMethod remains unoptimized since it requires a char*.
Yes, I noticed that in Cython, too. We often use PyObject_CallMethod() as a fallback for optimistically optimised method calls when the expected fast path does not hit, and it always bugged me that this needs to generate a Python string on each call in order to look up the method.
I propose to add an explicit API to deal with such identifiers. With this API,
tmp = PyObject_CallMethod(result, "update", "O", other);
would be replaced with
PyObject *tmp; Py_identifier(update); ... tmp = PyObject_CallMethodId(result, &PyId_update, "O", other);
Py_identifier expands to a struct
typedef struct Py_Identifier { struct Py_Identifier *next; const char* string; PyObject *object; } Py_Identifier;
string will be initialized by the compiler, next and object on first use.
As I understand it, the macro expands to both the ID variable declaration and the init-at-first-call code, right? This is problematic when more than one identifier is used, as some C compilers strictly require declarations to be written *before* any other code. I'm not sure how often users will need more than one identifier in a function, but if it's not too hard to come up with a way that avoids this problem all together, it would be better to do so right from the start. Also note that existing code needs to be changed in order to take advantage of this. It might be possible to optimise PyObject_CallMethod() internally by making the lookup either reuse a number of cached Python strings, or by supporting a lookup of char* values in a dict *somehow*. However, this appears to be substantially more involved than just moving a smaller burden on the users. Stefan

PyObject *tmp; Py_identifier(update);
As I understand it, the macro expands to both the ID variable declaration and the init-at-first-call code, right?
No. The variable will only get static initialization with the char*. The initialization on first call (of the PyObject*) happens in the helper functions, such as PyObject_GetAttrId.
I'm not sure how often users will need more than one identifier in a function
That's actually fairly common.
Also note that existing code needs to be changed in order to take advantage of this. It might be possible to optimise PyObject_CallMethod() internally by making the lookup either reuse a number of cached Python strings, or by supporting a lookup of char* values in a dict *somehow*. However, this appears to be substantially more involved than just moving a smaller burden on the users.
I think this would have to hash the string in any case, since keying by char* pointer value cannot work (there might be a different string at the same memory the next time). So even if this could side-step many of the steps, you'd keep at least one iteration over the characters; if this is hashing, you actually need two iterations (the second one to determine whether it's the right string). The Py_IDENTIFIER API can do the lookup in constant time for all but the first call. Regards, Martin

Le samedi 8 octobre 2011 16:54:06, Martin v. Löwis a écrit :
In benchmarking PEP 393, I noticed that many UTF-8 decode calls originate from C code with static strings, in particular PyObject_CallMethod. Many of such calls already have been optimized to cache a string object, however, PyObject_CallMethod remains unoptimized since it requires a char*.
Because all identifiers are ASCII (in the C code base), another idea is to use a structure similar to PyASCIIObject but with an additional pointer to the constant char* string: typedef struct { PyASCIIObject _base; const char *ascii; } PyConstASCIIObject; Characters don't have to be copied, just the pointer, but you still have to allocate a structure. Because the size of the structure is also constant, we can have an efficient free list. Pseudo-code to create such object: PyObject* create_const_ascii(const char *str) { PyConstASCIIObject *obj; /* ensure maybe that str is ASCII only? */ obj = get_from_freelist(); # reset the object (e.g. hash) if (!obj) { obj = allocate_new_const_ascii(); if (!obj) return NULL; } obj->ascii = str; return obj; } Except PyUnicode_DATA, such structure should be fully compatible with other PEP 383 structures. We would need a new format for Py_BuildValue, e.g. 'a' for ASCII string. Later we can add new functions like _PyDict_GetASCII(). Victor

Le jeudi 13 octobre 2011 00:44:33, Victor Stinner a écrit :
Le samedi 8 octobre 2011 16:54:06, Martin v. Löwis a écrit :
In benchmarking PEP 393, I noticed that many UTF-8 decode calls originate from C code with static strings, in particular PyObject_CallMethod. Many of such calls already have been optimized to cache a string object, however, PyObject_CallMethod remains unoptimized since it requires a char*.
Because all identifiers are ASCII (in the C code base), another idea is to use a structure similar to PyASCIIObject but with an additional pointer to the constant char* string:
Oh, I realized that Martin did already commit its PyIdentifier API, it's maybe too late :-)
We would need a new format for Py_BuildValue, e.g. 'a' for ASCII string. Later we can add new functions like _PyDict_GetASCII().
The main difference between my new "const ASCII" string idea and PyIdentifier is the lifetime of objects. Using "const ASCII" strings, the strings are destroyed quickly (to not waste memory), whereas PyIdentifiers are intern strings and so they are only destroyed at Python exit. I don't know if "const ASCII" strings solve a real issue. I implemented my idea. I will do some benchmarks to check if it's useful or not :-) Victor

Le jeudi 13 octobre 2011 03:34:00, Victor Stinner a écrit :
We would need a new format for Py_BuildValue, e.g. 'a' for ASCII string. Later we can add new functions like _PyDict_GetASCII().
The main difference between my new "const ASCII" string idea and PyIdentifier is the lifetime of objects. Using "const ASCII" strings, the strings are destroyed quickly (to not waste memory), whereas PyIdentifiers are intern strings and so they are only destroyed at Python exit.
I don't know if "const ASCII" strings solve a real issue. I implemented my idea. I will do some benchmarks to check if it's useful or not :-)
Ok, I did some tests: it is slower with my PyConstASCIIObject. I don't understand why, but it means that the idea is not interesting because the code is not faster. It is also difficult to ensure that the string is "constant" (test the scope of the string). At least, I found a nice GCC function: __builtin_constant_p(str) can be used to ensure that the string is constant (e.g. "abc" vs char*). Victor

Am 14.10.2011 00:30, schrieb Victor Stinner:
Le jeudi 13 octobre 2011 03:34:00, Victor Stinner a écrit :
We would need a new format for Py_BuildValue, e.g. 'a' for ASCII string. Later we can add new functions like _PyDict_GetASCII().
The main difference between my new "const ASCII" string idea and PyIdentifier is the lifetime of objects. Using "const ASCII" strings, the strings are destroyed quickly (to not waste memory), whereas PyIdentifiers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
are intern strings and so they are only destroyed at Python exit.
I don't know if "const ASCII" strings solve a real issue. I implemented my idea. I will do some benchmarks to check if it's useful or not :-)
Ok, I did some tests: it is slower with my PyConstASCIIObject. I don't understand why, but it means that the idea is not interesting because the code is not faster.
I think you've already given the answer above... Georg

Le 14/10/2011 07:44, Georg Brandl a écrit :
Am 14.10.2011 00:30, schrieb Victor Stinner:
Le jeudi 13 octobre 2011 03:34:00, Victor Stinner a écrit :
We would need a new format for Py_BuildValue, e.g. 'a' for ASCII string. Later we can add new functions like _PyDict_GetASCII().
The main difference between my new "const ASCII" string idea and PyIdentifier is the lifetime of objects. Using "const ASCII" strings, the strings are destroyed quickly (to not waste memory), whereas PyIdentifiers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
are intern strings and so they are only destroyed at Python exit.
I don't know if "const ASCII" strings solve a real issue. I implemented my idea. I will do some benchmarks to check if it's useful or not :-)
Ok, I did some tests: it is slower with my PyConstASCIIObject. I don't understand why, but it means that the idea is not interesting because the code is not faster.
I think you've already given the answer above...
I tried with and without interned strings. It doesn't change anything. Victor
participants (10)
-
"Martin v. Löwis"
-
Amaury Forgeot d'Arc
-
Antoine Pitrou
-
Barry Warsaw
-
Benjamin Peterson
-
Georg Brandl
-
Greg Ewing
-
Hrvoje Niksic
-
Stefan Behnel
-
Victor Stinner