Sharing docstrings between the Python and C implementations of a module

Recently I helped out on issue16954, which involved filling in docstrings for methods and classes in ElementTree.py.

While doing so, I tried to test my work in the interpreter like this...

>>> from xml.etree.ElementTree import Element
>>> help(Element)

...but found that help() showed nothing but empty strings!

After some debugging, I found that the culprit was the `from _elementtree import *` near the bottom of the module.

Not wanting to copy & paste docstrings around, I thought one solution might be to just reassign Element.__doc__ with the right docstring. But it seems that you can't do that for C extensions:

>>> from _elementtree import Element as cElement
>>> cElement.__doc__ = 'correct docstring'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't set attributes of built-in/extension type 'xml.etree.ElementTree.Element'

---

Q. Is there a way to maintain the same docstring without resorting to copying and pasting it in two places?

I tried to find an example in the source which addressed this, but found the docstrings in similar cases to be largely duplicated. For instance, _datetimemodule.c, decimal_.c and _json.c all seem to exhibit this docstring copy and pastage.
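The failure above is not specific to ElementTree. A minimal sketch, using the built-in `int` as a stand-in for a C extension type (the helper `try_set_doc` and the two small classes are made up for illustration), shows which kinds of classes accept a new `__doc__`:

```python
def try_set_doc(cls, text):
    """Try to assign cls.__doc__; return True on success, False on TypeError."""
    try:
        cls.__doc__ = text
    except TypeError:
        # Statically defined C types (built-ins, most extension types)
        # reject attribute assignment.
        return False
    return True


class PurePython:
    pass


class IntSubclass(int):
    # A Python-level subclass of a C type is an ordinary heap type.
    pass


print(try_set_doc(PurePython, "doc for a pure Python class"))
print(try_set_doc(int, "doc for a C type"))
print(try_set_doc(IntSubclass, "doc for a Python subclass of a C type"))
```

On a current CPython 3 the first and third assignments should succeed and the second should fail, which suggests that only the C type itself is locked.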

On Mon, Apr 15, 2013 at 9:56 AM, David Lam <david.k.lam1@gmail.com> wrote:
Q. Is there a way to maintain the same docstring without resorting to copying and pasting it in two places?
I tried to find an example in the source which addressed this, but found the docstrings in similar cases to be largely duplicated. For instance, _datetimemodule.c, decimal_.c and _json.c all seem to exhibit this docstring copy and pastage.
Hi

NumPy uses a hack to deal with this problem. It has a small C function that would mutate the docstring under your feet. Personally I would prefer some sort of tagging in C source that can copy-paste stuff instead, honestly. It does sound like a good idea to share docstrings. Seems also relatively trivial to write a test that checks that they stay the same.

Cheers,
fijal
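A minimal sketch of such a consistency test follows. It assumes the Python and C implementations are both available as classes; the helper name `docstring_mismatches` and the two stand-in classes are hypothetical, and in a real test the second class would be the C type (for example the one imported from `_elementtree`).

```python
def docstring_mismatches(py_cls, c_cls):
    """Return the names whose docstrings differ between the two classes."""
    mismatches = []
    if py_cls.__doc__ != c_cls.__doc__:
        mismatches.append("<class>")
    for name in sorted(vars(py_cls)):
        attr = getattr(py_cls, name)
        if name.startswith("_") or not callable(attr):
            continue  # only public methods are compared in this sketch
        other = getattr(c_cls, name, None)
        if other is None or attr.__doc__ != other.__doc__:
            mismatches.append(name)
    return mismatches


class PyElement:
    """An XML element."""

    def append(self, subelement):
        """Add subelement to the end of this element."""

    def clear(self):
        """Reset this element."""


class CElementStandIn:
    """An XML element."""

    def append(self, subelement):
        """Add subelement to the end of this element."""

    def clear(self):  # docstring missing, as in the report above
        pass


print(docstring_mismatches(PyElement, CElementStandIn))
```

A test case could then simply assert that the returned list is empty for each duplicated class.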

On Mon, Apr 15, 2013 at 8:17 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
Hi
NumPy uses a hack to deal with this problem. It has a small C function that would mutate the docstring under your feet. Personally I would prefer some sort of tagging in C source that can copy-paste stuff instead, honestly. It does sound like a good idea to share docstrings. Seems also relatively trivial to write a test that checks that they stay the same.
It's actually even worse than that - if a subclass overrides a method, it has to completely duplicate the docstring, even if the original docstring was still valid. So, for example, ABCs can't provide docstrings for abstract methods.

So yeah, we end up not only duplicating between the C and Python versions, but sometimes we end up duplicating between different subclasses as well (datetime.datetime, datetime.date and datetime.time are the worst offenders here).

I like the idea of at least adding a test that checks the Python docstring and the C docstring are the same in the duplicated cases - that should be a lot easier than adjusting the build process to let the C version use the Python docstrings or vice-versa (even the argument clinic DSL in PEP 436 doesn't try to achieve that - it just tries to cut down on the duplication within the C code itself).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
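The subclass case can be illustrated with a short sketch. The classes `Base` and `Child` and the helper `inherited_doc` are made up for illustration; the helper walks the method resolution order and is only one possible way to recover the parent's text.

```python
class Base:
    def close(self):
        """Release any resources held by this object."""


class Child(Base):
    def close(self):  # override without repeating the docstring
        super().close()


def inherited_doc(cls, name):
    """Return the first non-empty docstring for `name` along cls.__mro__."""
    for klass in cls.__mro__:
        attr = vars(klass).get(name)
        if attr is None:
            continue  # this class does not define the attribute itself
        doc = getattr(attr, "__doc__", None)
        if doc:
            return doc
    return None


print(Child.close.__doc__)
print(inherited_doc(Child, "close"))
```

The overriding method carries no `__doc__` of its own, while the helper finds the text defined on `Base`. As far as I know, `inspect.getdoc()` performs a similar lookup on Python 3.5 and later, but the raw `__doc__` attribute is still empty.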

On Mon, Apr 15, 2013 at 3:45 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It's actually even worse than that - if a subclass overrides a method, it has to completely duplicate the docstring, even if the original docstring was still valid. So, for example, ABCs can't provide docstrings for abstract methods.
So yeah, we end up not only duplicating between the C and Python versions, but sometimes we end up duplicating between different subclasses as well (datetime.datetime, datetime.date and datetime.time are the worst offenders here).
I like the idea of at least adding a test that checks the Python docstring and the C docstring are the same in the duplicated cases - that should be a lot easier than adjusting the build process to let the C version use the Python docstrings or vice-versa (even the argument clinic DSL in PEP 436 doesn't try to achieve that - it just tries to cut down on the duplication within the C code itself).
Would it make sense to think about adding this in the scope of the argument clinic work, or is it too unrelated? This seems like a commonly needed thing for large parts of the stdlib (where the C accelerator overrides Python code).

Eli

On 15 April 2013 13:31, Eli Bendersky <eliben@gmail.com> wrote:
Would it make sense to think about adding this in the scope of the argument clinic work, or is it too unrelated? This seems like a commonly needed thing for large parts of the stdlib (where the C accelerator overrides Python code).
+1

It is a problem I have run into when building extensions as well - when one would like to keep the C parts to a minimum and dynamically populate the docstrings from another source with a Python script, for example.

Would it make sense to think about adding this in the scope of the argument clinic work, or is it too unrelated? This seems like a commonly needed thing for large parts of the stdlib (where the C accelerator overrides Python code).
Or maybe separate docstrings from both code bases altogether and insert them where appropriate as part of the build process? That said, I haven't any idea how you might accomplish that. Maybe this general problem should be thrown over to python-ideas to cook for a while...

Skip

Skip Montanaro writes:
Or maybe separate doc strings from both code bases altogether and insert them where appropriate as part of the build process?
Experience with gettext vs. other kinds of message catalogs for localization suggests that's a really painful approach to take.

It's not entirely clear to me that this whole effort isn't a premature optimization. Eventually (next 5 to 15 years? long run, anyway) we'll probably localize, and *most* messages will be shared (from gettext .mo files) anyway. (Yes, I recognize that space is not the most important aspect of sharing docstrings, and that it's likely shared docstrings can be automatically shared by gettext. We should take care that that's so.)

The other thing that occurs to me is that maybe something like gettext may be the way to deal with these issues anyway.

On 04/15/2013 09:31 AM, Eli Bendersky wrote:
Would it make sense to think about adding this in the scope of the argument clinic work, or is it too unrelated? This seems like a commonly needed thing for large parts of the stdlib (where the C accelerator overrides Python code).
From my perspective, the C accelerator doesn't override the Python code; the Python code is in charge, and elects to call the C accelerator.

To answer your question, Clinic could help but I think it'd kind of suck. We'd have to use Clinic to preprocess the .py file and copy the docstring there, which would mean slurping in the C file. Clumsy but workable. We can talk about it once I finish the revised implementation, which should make processes like this much easier.

/arry

On 4/15/2013 4:21 PM, Larry Hastings wrote:
From my perspective, the C accelerator doesn't override the Python code; the Python code is in charge, and elects to call the C accelerator.
To answer your question, Clinic could help but I think it'd kind of suck. We'd have to use Clinic to preprocess the .py file and copy the docstring there, which would mean slurping in the C file. Clumsy but workable. We can talk about it once I finish the revised implementation, which should make processes like this much easier.
Since Clinic is mostly aimed at making C usage work with Python, isn't the above backwards? Wouldn't it be better to have Clinic slurp in the Python code to find the docstrings, and then put them in the C code? Then there is no need to preprocess the Python file; just read it to obtain the docstrings. You are already preprocessing the C code...
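A rough sketch of the "read the Python file, emit C" direction follows. It is not how Argument Clinic works; it only assumes that the Python source can be parsed with the standard `ast` module and that the C side would consume `PyDoc_STRVAR` definitions. The helper names, the identifier scheme (`Element_append__doc__`), and the sample source are all made up for illustration.

```python
import ast

SAMPLE_SOURCE = '''
class Element:
    """An XML element."""

    def append(self, subelement):
        """Add *subelement* to the end."""
'''


def extract_docstrings(source):
    """Map dotted names to docstrings for classes and functions in source."""
    docs = {}

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef,
                                  ast.AsyncFunctionDef)):
                name = prefix + child.name
                doc = ast.get_docstring(child)
                if doc is not None:
                    docs[name] = doc
                visit(child, name + ".")

    visit(ast.parse(source), "")
    return docs


def c_string_literal(text):
    """Escape text as a C string literal (backslash, quote, newline only)."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return '"' + escaped + '"'


def emit_pydoc_strvars(docs):
    """Render one PyDoc_STRVAR line per docstring, sorted by name."""
    lines = []
    for name, doc in sorted(docs.items()):
        ident = name.replace(".", "_") + "__doc__"  # hypothetical naming scheme
        lines.append("PyDoc_STRVAR({}, {});".format(ident, c_string_literal(doc)))
    return "\n".join(lines)


print(emit_pydoc_strvars(extract_docstrings(SAMPLE_SOURCE)))
```

The generated lines could be written to a header that the C module includes, so the Python file stays the single source of the text. Signature lines in C docstrings, raised elsewhere in this thread, are not handled here.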

On Mon, Apr 15, 2013 at 12:56 AM, David Lam <david.k.lam1@gmail.com> wrote:
I tried to find an example in the source which addressed this, but found the docstrings in similar cases to be largely duplicated.
I find this annoying too. It would be nice to have a common way to share docstrings between C and Python implementations of the same interface. One roadblock, though, is that functions in C modules often document their parameters in their docstring.

>>> import _json
>>> help(_json.scanstring)
scanstring(...)
    scanstring(basestring, end, encoding, strict=True) -> (str, end)

    Scan the string s for a JSON string. End is the index of the
    character in s after the quote that started the JSON string.
    [...]

Argument clinic will hopefully lift this roadblock soon. Perhaps we could add to the clinic DSL a way to fetch the docstring directly from the Python implementation. And as an extra, once we have this in place, it would be easy to add a verification step that checks that both implementations provide similar interfaces.
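One possible shape for such a verification step is sketched below. It assumes both callables expose a signature that `inspect.signature()` can read, which may not hold for C functions that lack signature metadata, so the helper reports `None` in that case. The function names and the two stand-in implementations are hypothetical.

```python
import inspect


def parameter_names(func):
    """Return the parameter names of func, or None if no signature is available."""
    try:
        signature = inspect.signature(func)
    except (ValueError, TypeError):
        return None
    return list(signature.parameters)


def same_interface(py_func, c_func):
    """Return True/False if both signatures are known, otherwise None."""
    py_params = parameter_names(py_func)
    c_params = parameter_names(c_func)
    if py_params is None or c_params is None:
        return None
    return py_params == c_params


# Stand-ins for a Python implementation and its accelerated counterpart.
def py_scan(s, end, strict=True):
    pass


def c_scan_matching(s, end, strict=True):
    pass


def c_scan_different(basestring, end, encoding, strict=True):
    pass


print(same_interface(py_scan, c_scan_matching))
print(same_interface(py_scan, c_scan_different))
```

Comparing only parameter names is deliberately loose; defaults and parameter kinds could be compared as well if stricter checking is wanted.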
participants (10)
- Alexandre Vassalotti
- David Lam
- Eli Bendersky
- Glenn Linderman
- Joao S. O. Bueno
- Larry Hastings
- Maciej Fijalkowski
- Nick Coghlan
- Skip Montanaro
- Stephen J. Turnbull