Add a __cite__ method for scientific packages

Over the last 10 years, Python has slowly inched towards becoming the most popular scientific computing language, beating or seriously challenging Matlab, R, Mathematica and many specialized languages (S, SAS, ...) in numerous applications. A large part of this growth is driven by amazing community packages, such as numpy, scipy, scikit-learn, scikit-image, seaborn or pandas, just to name a few. Development of such packages represents a significant time investment by people working in academic environments. To justify the time invested in developing and supporting such packages, the developers usually associate them with a scientific article. The number of citations of those articles is considered a measure of their usefulness and is required to justify the time spent on them.
Unfortunately, as of now, a significant issue is that such packages are not cited despite being extensively used. Part of this is due to the difficulty of compiling the list of proper citations for each module (and, for libraries associated with multiple update publications, selecting the relevant citation). Part of this is due to users not realizing which of the modules they are using have associated publications and should be cited.
To remedy that situation, I suggest a __citation__ method associated with each package installation and import. Called from __main__, __citation__() would scan the __citation__ of all imported packages and return the list of all relevant top-level citations associated with the packages.
As a scientific package developer working in academia, I find the problem quite serious, and the solution seems relatively straightforward.
What does the Python core team think about the addition and long-term maintenance of such a feature in the import and setup mechanisms? What do other users and scientific package developers think of such a mechanism for citation retrieval?
Best,
Andrei Kucharavy
Post-Doc @ Joel S. Bader Lab
Johns Hopkins University, Baltimore, USA.
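The scan described in the proposal above can be sketched in a few lines. This is only an illustration, assuming that packages expose a plain-string `__citation__` attribute; the helper name `collect_citations` and the `fakesci` package are hypothetical and not part of any existing library.

```python
import sys
import types


def collect_citations(modules=None):
    """Return {top-level package name: citation} for modules exposing __citation__."""
    if modules is None:
        modules = sys.modules
    found = {}
    # Copy the items first: sys.modules can change while we iterate.
    for name, module in list(modules.items()):
        if "." in name:
            continue  # keep only top-level packages, as the proposal suggests
        citation = getattr(module, "__citation__", None)
        if isinstance(citation, str) and citation.strip():
            found[name] = citation.strip()
    return found


# Hypothetical package standing in for a real scientific library.
fake_pkg = types.ModuleType("fakesci")
fake_pkg.__citation__ = "Doe, J. (2018). fakesci: an example package."
sys.modules["fakesci"] = fake_pkg

citations = collect_citations()
print(citations["fakesci"])
```

Nothing here needs interpreter support: the function only reads `sys.modules`, which is one reason later replies treat this as a convention rather than a language feature.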

While I'm not personally in need of citations (and never felt I was), I can easily understand the point -- sometimes citations can make or break a career, and having written a popular software package should be acknowledged. Are there other languages or software communities that do something like this? It would be nice not to have to invent this wheel. Eventually a PEP and an implementation should be presented, but first the idea needs to be explored more.
--Guido
On Wed, Jun 27, 2018 at 3:30 PM Andrei Kucharavy wrote:
--
--Guido van Rossum

This is an interesting proposal. Speaking as a developer of scientific software packages, it would be really cool to have support for something like this in the language itself.
The Software Sustainability Institute in the UK have written several blog posts advocating the use of CITATION files containing this sort of metadata:
A GitHub code search for __citation__ also gets 127 hits that mostly seem to be research software using this attribute more or less as suggested here:
It's also worth pointing out which is sort of a citation search engine for software projects. It uses a number of heuristics to figure out what the appropriate citation for a piece of software is.
On Wed, Jun 27, 2018 at 5:49 PM, Guido van Rossum wrote:

On 28/06/2018 00:00, Nathan Goldbaum wrote:
I just thought that it might be worth pointing out that this should actually work both ways, i.e. if a specific package, module or function is inspired by or directly implements the methods included in a specific publication, then any __citation__ entries within it should also cite that publication (or those publications) or allow references to them to be recovered. The general principle is: if you are expecting to be cited, you also have to cite.
--
Steve (Gadget) Barnes
Any opinions in this message are my personal opinions and do not reflect those of my employer.

I think this is a fine idea, but could be achieved by convention, like __version__, rather than by fiat. And it’s certainly not a language feature. So Nathaniel’s right — the thing to do now is work out the convention, and then advocate for it. -CHB

> Are there other languages or software communities that do something like this? It would be nice not to have to invent this wheel.
While I do not use R regularly, I understand their community is largely academic-driven, and citations are strongly encouraged as seen in their documentation:
Here is an example use of their `citation()` function:
> citation()
To cite R in publications use:
  R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
A BibTeX entry for LaTeX users is
  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2013},
    url = {},
  }
Calling the `citation()` function generates BibTeX output, which is one of the most common citation conventions. For reference, I believe this is the source code:

That's a lot of responses, thanks for the interest and the suggestions!
> Are there other languages or software communities that do something like [...]
To my knowledge, R is the only language that implements such a feature. Package developers add a CITATION text file containing the citation for their package in whatever text format they choose. A specialized citation() built-in function can be called from the REPL; it returns a citation for R itself, including a BibTeX entry for LaTeX users. When citation() is called on a package instead, it returns the contents of CITATION for that package specifically (e.g. citation("ggplot2")) or alternatively uses package metadata to build a sane citation. Given that most work with R is done within a REPL and packages are installed and loaded with commands such as install.packages("ggplot2") and library("ggplot2"), this approach makes sense in that context. This, however, didn't feel terribly Pythonic to me.
As for a PEP and a reference implementation, I will gladly take care of them if the idea gets enough traction, but there already seems to be a PEP draft as well as an attempt at an implementation by one of the AstroPy/AstroML maintainers, using the __citation__ field and a citation() function to unpack it:
There also seem to be some packages in the community using __bibtex__ rather than __citation__ to store BibTeX entries, but I haven't yet found any large project implementing it or any PEP drafts associated with it.
> The Software Sustainability Institute in the UK have written several blog posts advocating the use of CITATION files containing this sort of metadata:
Yes, that's the R approach I presented above. It is viable, especially if hooked to something accessible from the REPL directly, such as a __cite__ or __citation__ attribute/method for modules. I would, however, advocate for a more structured approach - perhaps JSON or BibTeX that would get parsed and converted to a suitable citation format by __cite__, if it was implemented as a method.
> A GitHub code search for __citation__ also gets 127 hits that mostly seem [...]
Most of them are from the AstroPy universe or from the CitationPEP draft I've referenced above.
> This is indeed a serious problem. I suspect python-ideas isn't the [...]
There has been localized discussion popping up among the large scientific package maintainers and some attempts to solve the problem at the local level. Until now they seemed to be winding down due to a lack of a large-scale citation mechanism and a discussion about what is concretely doable at the scale of the language is likely to finalize
As for the list, reserving __citation__/__cite__ for packages at the same level as __version__ is now reserved, and adding a citation()/cite() function to the standard library, seemed like large enough modifications to warrant seeking buy-in from the maintainers and the community at large.
> You'll want to check out the duecredit project:
duecredit looks amazing - I will definitely check it out. The idea was, however, to bring the barrier for adoption and usage as low as possible. In my experience, the vast majority of Python users in academic environments who aren't citing packages properly are beginners who just assembled a pipeline based on widely-used libraries and would need to generate a citation list for it to pass on to their colleagues responsible for the paper assembly and submission. As such, they are unlikely to search for third-party libraries beyond those they've found and used to solve their specific problem.
> I'd actually like to see a more general solution that isn't restricted [...]
Thanks for the reference to the distutils-sig list. I will talk to them if the idea gets traction here.
I am not entirely convinced about the multi-language pipelines. In bioinformatics, often the heavy lifting is done by a single package (for instance bowtie for RNA-seq alignment) and the output is piped to a custom script, mostly in R or Python. The citation for the library doing the heavy lifting is often well-known and widely cited; the issues arise in the custom scripts that import and use libraries that should be cited without citing them.
> One challenge in standardizing this kind of thing is choosing a [...]
CSL-JSON represented as a dict to be supplied to the setup file is definitely one way of doing it. I was, however, thinking more about the BibTeX format, given that CSL-JSON is more closely affiliated with Mendeley.
> Why does this have to be a dunder method? In general, application code shouldn't be calling dunders directly, they're reserved for Python.
I was under the impression that sometimes dunders are used to store relevant information that would not be of use to most users, such as __version__, and sometimes to better control the execution flow (for instance if __name__ == "__main__").
> I think your description of what this method should do is not [...]
My initial idea was to have a __cite__ method embedded in the import mechanism that would parse data from config and, upon a call on a package, return the citation developers want to see associated with the current package version in the format the user needs (for instance, numpy.__cite__('bibtex') would return a citation for the current numpy version in BibTeX format). If called on the script itself, __cite__('bibtex') would iterate through all the imported modules and retrieve their citations one by one, at least for those modules that have an associated citation.
After reading the feedback in this thread, I believe that a __citation__ reserved field that pulls the data from the setup script and a cite() script in the standard library would be a better approach. In the end, I believe the best would be to implement both of them and see which one feels more Pythonic.
> I do think you have identified an important feature, but I think this is [...]
Yes, that's the idea! The biggest reason for me to send the discussion to this list is to check if it would be acceptable to reserve the __citation__ data field in each module and include the cite() script in the standard library.
> Presumably you would need to be able to specify which citation style to [...]
Yes, but to avoid building a configurable citation engine for the thousands of formats there are in the wild, it would handle only a couple of standard, interchangeable formats, such as BibTeX or EndNote xref - both text formats that are simple to use. I was thinking about the approach taken by Google Scholar from that perspective.
The implementation I was thinking about would have required reserving the __citation__/__cite__ dunder or implementing a function that would be injected into installed packages. For setup, I was thinking about adding a citation field to the distutils setup. I was not really aware of the distutils-sig discussion list, which would be more appropriate in that regard.
> A long time ago, I added a feature request for a page in the [...]
That's a good point. Unfortunately, I have not thought about how to cite code that would not have an associated publication. From what I see by checking Google Scholar, as of now people are citing the Python language reference manual if they want to cite Python itself in a scientific publication. GvR didn't seem interested in citations for Python and, from what I understand, neither are the vast majority of non-scientific package developers, given that citations are not essential for their career advancement.
> Should Python have a standard sys.__citation__ field that provides the [...]
The idea for Python itself seems good! However, rather than using a named tuple, I was thinking about using a dict consistent with CSL-JSON or BibTeX. And writing a citation-generating engine that would be consistent with hundreds if not thousands of journal-specific formats is a bit out of the scope of the proposal for now - most of the time people just want something their citation/bibliography engine can ingest and generate a citation from there in their Word/LaTeX documents. BibTeX/EndNote export formats are perfect for that task in my experience.
The general convention is to cite the top-level publication. While some methods definitely deserve a citation on their own (such as the Sobel filter in scikit-image), they provide a link to the relevant citation in their documentation and would normally cite it in their master publication. That's definitely an idea to look at, but I don't see a straightforward way of implementing this so far.
> I think this is a fine idea, but could be achieved by convention, like [...]
This already seems to be an idea floating in the air - AstroPy is inching towards that implementation. The idea is to modify the language to make citing as straightforward as possible and create a universal mechanism for that.
Best,
Andrei Kucharavy
Post-Doc @ Joel S. Bader Lab
Johns Hopkins University, Baltimore, USA.
On Thu, Jun 28, 2018 at 11:48 AM Chris Barker - NOAA Federal via Python-ideas wrote:

One more thing. There's precedent for this: when you start an interactive Python interpreter it tells you how to get help, but also how to get copyright, credits and license information:
$ python3
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> credits
    Thanks to CWI, CNRI,, Zope Corporation and a cast of thousands
    for supporting Python development. See for more information.
It makes total sense to add citations/references to this list (and those should probably print a reference for Python followed by instructions on how to get references for other packages and how to properly add a reference to your own code).
--
--Guido van Rossum

On Thu, Jun 28, 2018 at 05:25:00PM -0400, Andrei Kucharavy wrote:
I think that an approach similar to help/quit/exit is warranted. The cite()/citation() function need not be *literally* built into the language, it could be an external function written in Python and added to builtins by the module. -- Steve
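A helper in the style of help/quit/exit can be written in pure Python and attached to builtins, as the message above suggests. The sketch below is a hypothetical illustration: the `_Citer` class, the `citation` name and the placeholder text are assumptions, and no official citation for Python is implied.

```python
import builtins


class _Citer:
    """Interactive helper in the spirit of the `credits` and `license` objects."""

    def __init__(self, text):
        self._text = text

    def __repr__(self):
        # Typing the bare name at the REPL shows a hint, much like `license` does.
        return "Type citation() to see how to cite this software."

    def __call__(self):
        print(self._text)
        return None


# Placeholder text: no official citation for Python is implied here.
builtins.citation = _Citer("Example Project (2018). Placeholder citation text.")

print(repr(citation))
citation()
```

Because the object is just assigned onto the `builtins` module, any startup hook or third-party package could install it, so the language itself does not have to change.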

For me, it's about setting a standard that is endorsed by the language, and setting expectations for users. There currently is no standard, which is why packages use __citation__, __cite__, __bibtex__, etc., and as a user I don't immediately know where to look for citation information (without going to the source). My feeling is that __citation__ or some other dunder name could be adopted on classes, functions, etc. with less chance of naming conflicts, but I am open to discussion.
I have some notes here about various ideas for more advanced functionality that would support automatically keeping track of citation information for imported packages, classes, functions:
On Thu, Jun 28, 2018 at 10:57 PM, Alex Walters wrote:
--
Adrian M. Price-Whelan
Lyman Spitzer, Jr. Postdoctoral Fellow
Princeton University

But don't all the users who care about citing modules already use the scientific Python packages, with scipy itself at its center? Wouldn't those engaging in science or in academia be better stewards of this than systems programmers? Since you're not asking for anything that can't be done in a third-party module, and there is a third-party module that most of the target audience of this standard would already have, there is zero reason to take up four names in the Python runtime to serve those users.

On Thu, Jun 28, 2018 at 11:26 PM, Alex Walters wrote:
Not all scientific software in Python depends on scipy or even numpy. However, it does all depend on Python. Although perhaps that argues for a cross-language solution :) I still think it would be very nice to have an official standard for citation information in Python packages as codified in a PEP. That would reduce ambiguity and make it much easier for tool-writers who want to parse citation information.

If that's possible, that would be great!
I was not aware this was a possibility - it does seem like a good option!
> If I were you, I'd try organizing a birds-of-a-feather at the next [...]
Not all packages are within the numpy/scipy universe - Pandas and Seaborn are notable examples. I brought this thread to the attention of some major scientific package maintainers as well as the main citationPEP author. I am not entirely sure where this conversation could be moved outside python-ideas, given we are talking about something universal across packages, but would gladly take any suggestions.
> There isn't actually any formal method for registering special names [...]
Thanks for the explanation - Python development and maintenance do seem to be a complex process from the outside, and these kinds of subtleties are not always easy to distinguish :).
> The way to do this is to first get your solution implemented as a [...]
Got it.
> I think you misunderstand how these lists work :-). (Which is fine -- [...]
Got it as well - that does indeed seem a reasonable way of doing things, although I believe there have been precedents where GvR implemented a feature from scratch after studying existing libraries (I am thinking notably about asyncio, which is orders of magnitude more complex and involved than anything we are talking about here).
> And often the custom scripts are a mix of R and Python, and maybe some [...]
In my experience, people tend to go with either one or the other, or use Julia. I am not very familiar with the Fortran ecosystem - as far as I've seen, those are extremely efficient libraries that get wrapped and used in most modern scientific computing languages, but very rarely directly. In addition to that, while I see how granular citations could be implemented in Python, I have a bit more trouble understanding how calls to R, Python, Perl, C, C++ or Fortran from command-line scripts can be analyzed on the fly to get metadata about citations. I have even more trouble imagining how it would be possible to bring developers across all the separate language communities to agree on a single standard.
Hm - I was not aware Zotero uses it as well - it's definitely a good sign and I will have to look into CSL-JSON more in depth.
> Why not scipy.cite() or scipy.citation()? I don't see any reason for these functions to ship with standard python at all.
There are packages that do not depend on scipy, and even for those that do - most users writing analysis pipelines for scientific packages are unaware that they are using scipy/numpy underneath the packages that do what they want at the highest level.
> I don't think that this is a very useful idea, because most people that [...]
Thanks for your opinion, Gael - as the maintainer of scikit-learn you have more experience with this issue than most of us. In my field (computational biology in molecular biology labs) the situation is somewhat different - most of the custom scripts are implemented by people who have often learned Python, or programming at all, in the last couple of years. Most of the time they get asked by the corresponding author to provide 1-5 citations for their analytical pipeline and to describe what they did in the supplementary material, and I had several junior developers in my labs come to me asking what they were supposed to cite and where to find the citations. We aren't likely to convince everyone to cite code overnight, but making citing as easy as possible does seem like a step in the right direction to me.
> I still think it would be very nice to have an official standard for [...]
That's my opinion as well.
To summarize the conversation until now, a __citation__ data field and a cite() script seem to be the preferred option. If the proposal gets traction and is accepted, the citation for Python as well as the instructions to get the citation for a package can be added as a top-level command, similar to credits, copyright or license.
As of now, it seems like the next steps would be to:
- draft a PEP (or complete the existing one) and implement the cite() script as well as a showcase package using __citation__
- talk to major package maintainers to see if they have any objections to the method or suggestions with regard to the PEP/implementation
- talk to the distutils-sig list to see if we could add the __citation__ metadata to
- submit a proper PEP (Would a pull request to be an acceptable way of doing it?)
Is there something I might be missing so far?
Best,
Andrei Kucharavy
Post-Doc @ Joel S. Bader Lab
Johns Hopkins University, Baltimore, USA.
On Fri, Jun 29, 2018 at 10:51 AM Nathan Goldbaum wrote:

On Fri, Jun 29, 2018, 8:14 PM Andrei Kucharavy wrote:
Not all packages are within the numpy/scipy universe - Pandas and Seaborn are notable examples.
Huh?! Pandas is a thin wrapper around NumPy. To be fair, it is a wrapper that adds a huge number of wrapping methods and classes. Seaborn in turn has at least a soft dependency on Pandas (some of the charts really need a DataFrame to work from).
I like the idea of standardizing citation information. But it has little to do with Python itself. Getting the authors of scientific packages to agree on conventions is what is needed, and doing that requires accurately determining their needs, not some mandate from Python itself. Nothing in the language needs to change to agree on some certain collection of names (perhaps dunders, perhaps not), and some certain formats for the data that might live inside them.
Down the road, if there gets to be widespread acceptance of these conventions, the Python standard library might include a function or two to work with them. But the horse should go before the cart.

On Fri, Jun 29, 2018, 17:14 Andrei Kucharavy wrote:
This is thin justification to add something to core. It seems like the very small percentage of academic users whose careers depend on this cannot resolve the political issue of forming a standards body. I don't see how externalizing the standard development will help. Kudos for shortcutting the process in a practical way to just get it done, but this just puts core devs in the middle of silly academic spats. A language-endorsed citation method isn't a 'correct' method, and without the broad consensus that currently doesn't exist, this becomes _your_ method, a picked winner but ultimately a lightning rod for bored tenured professors with personal axes to grind. If this were about implementing an existing correct method I'm sure a grad student would be tasked with it for an afternoon.
This is insanely easy to implement in docstrings, or a standard import, or mandatory include, or decorator, or anywhere else; it's just a parsing protocol. I believe 3.7 now exposes docstrings in the AST, meaning a simple static analyzer can handle all of PyPI, giving you crazy granularity if citations existed. Don't you want to cite the exact algorithm used in an imported method, not just lump them all into one call? Heck, I bet you could use type annotations.
This really feels like you've got an amazing multi-tool but you want to turn the world, not the screw. This isn't a tool the majority of people will use, even if the citations exist. Don't get me wrong, I love designing standards and protocols, but this is pretty niche.
I assume it won't be mandatory so I'm tilting at windmills, but then if it's not mandatory, what's the point of putting it in core? Just create a JSTOR-style git server where obeying the citation protocol is mandatory. Of course, enforcing a missing citation is impossible, but it does mean citations can be generated by parsing imports. This is how it will evolve over time, by employing core devs on that server framework.
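The docstring-based static analysis mentioned above can be sketched with the standard `ast` module, which works on source text without importing or running it. The `Cite:` marker is an assumed convention invented for this example; no such docstring protocol has been agreed on.

```python
import ast

SOURCE = '''
def smooth(data):
    """Smooth the data.

    Cite: Doe, J. (2018). A smoothing method.
    """
    return data


def plain(data):
    """No citation here."""
    return data
'''


def docstring_citations(source, marker="Cite:"):
    """Map function/class names to citation lines found in their docstrings."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        doc = ast.get_docstring(node) or ""
        cites = [line.strip()[len(marker):].strip()
                 for line in doc.splitlines()
                 if line.strip().startswith(marker)]
        if cites:
            found[node.name] = cites
    return found


print(docstring_citations(SOURCE))
```

This gives per-function granularity, but only for code whose authors follow the marker convention, which is the consensus problem raised elsewhere in the thread.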

On Fri, Jun 29, 2018 at 8:58 PM, Matt Arcidy wrote:
> [...] Just create a jstor style git server where obeying the citation protocol is mandatory.
I don't know if it constitutes a standards body, but there are a couple of journals out there that are meant to serve as mechanisms for turning a repo into a published/citable thing; they might be good to look at for prior art as well as for what metadata should be included:
* (sponsored by NumFOCUS)
*
[...] pkg_resources that could pull out a short citation string from some package metadata (a hypothetical `pkg_resources.get_distribution("numpy").citation` that could be wrapped by some helper function if desired)?
The actual mechanism to convert metadata from something in the repo (a dunder cite string in the root module, a separate metadata file, etc.) into the package metadata isn't as important as rolling said metadata into something that is part of the distribution package, like the version or long_description fields. Once the schema of the citation data is defined, you could add it to the metadata spec (outgrowth of PEP-566)
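Reading a citation out of distribution metadata could be sketched as follows. Core metadata files use an email-header-like layout, so the standard `email` parser can read them; the `Citation` field and the `get_citation` helper are hypothetical and are not part of the current metadata specification or of pkg_resources.

```python
from email import message_from_string

# Hypothetical METADATA content; "Citation" is NOT a field in the current
# core metadata specification and is used here purely for illustration.
METADATA = """\
Metadata-Version: 2.1
Name: fakesci
Version: 1.0
Citation: Doe, J. (2018). fakesci: an example package.
"""


def get_citation(metadata_text, default=None):
    """Return the hypothetical Citation field from core-metadata-style text."""
    message = message_from_string(metadata_text)
    return message.get("Citation", default)


print(get_citation(METADATA))
```

With such a field in place, a lookup by distribution name would not need to import the package at all, which is the main appeal of static metadata over a module attribute.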

Putting citation information into pyproject.toml makes a lot more sense than putting it in the modules themselves, where it would have to be introspected to be extracted.
* It puts zero burden on the core developers
* It puts near zero burden on the distutils special interest group
* It doesn't consume names from the package namespace
* It's just a TOML file - you can add sections to it willy-nilly
* It's just a TOML file - there are libraries in almost all ecosystems to handle it.
Nothing has to go into the core metadata specification unless part of your suggestion is that PyPI show the citations. I don't think that is a good idea for the scope of PyPI and the workload of the Warehouse developers.
I don't think it's too much to ask for the scientific community to figure out the solution that works for most people before bringing it back here. I also don't think it's out of scope to suggest taking this to SciPy - yes, not everything depends on SciPy, but you don't need everything, you just need momentum.

On Thu, Jun 28, 2018 at 2:25 PM, Andrei Kucharavy wrote:
Those are the people with the most motivation and expertise to solve this, and whose buy-in you'll need on any solution. If they haven't solved it yet themselves, then there are basically two reasons why that happens: either because they're busy and no-one's had enough time to work on it, or else because they're uncertain about the best path forward. Neither of these is a problem that python-ideas can help with. If you want to be effective here, you need to talk to them to figure out how you can help them move forward. If I were you, I'd try organizing a birds-of-a-feather at the next SciPy conference, or start getting in touch with others working on this (duecredit devs, the folks listed on that citationPEP thing, etc.), and go from there. (Feel free to CC me if you do start up some effort like this.)
There isn't actually any formal method for registering special names like __version__, and they aren't treated specially by the language. They're just variables that happen to have a funny name. You shouldn't start using them willy-nilly, but you don't actually have to ask permission or anything. And it's not very likely that someone else will come along and propose using the name __citation__ for something that *isn't* a citation :-).
The way to do this is to first get your solution implemented as a third-party library and adopted by the scientific packages, and then start thinking about whether it would make sense to move the library into the standard library. It's relatively easy to move things into the standard library. The hard part is making sure that you implemented the right thing in the first place, and that's MUCH more likely if you start out as a third-party package.
I think you misunderstand how these lists work :-). (Which is fine -- it's actually pretty opaque and confusing if you don't already know!) Generally, distutils-sig operates totally independently from python-{ideas,dev} -- if you have a packaging proposal, it goes there and not here; if you have a language proposal, it goes here and not there. *If* what you want to do is add some static metadata to python packages through, then python-ideas is irrelevant and distutils-sig is who you'll have to convince. (But they'll also want to see that your proposal has buy-in from established packages, because they don't understand the intricacies of software citation and will want people they trust to tell them whether the proposal makes sense.)
And often the custom scripts are a mix of R and Python, and maybe some Fortran, ... Plus, if it works for multiple languages, it means you get to share part of the work with other ecosystems, instead of everyone reinventing the wheel. Also, if you want to go down the dynamic route (which is the only way to get accurate fine-grained citations), then it's just as easy to solve the problem in a language independent way.
Huh, is it? I only know it from Zotero.
-n
--
Nathaniel J. Smith

On 29 June 2018 at 12:14, Nathaniel Smith wrote:
The one caveat on dunder names is that we expressly exempt them from our usual backwards compatibility guarantees, so it's worth getting some level of "No, we're not going to do anything that would conflict with your proposed convention" at the language design level.
Aye, in this case I think you can comfortably assume that we'll happily leave the "__citation__" and "__cite__" dunder names alone unless/until there's a clear consensus in the scientific Python community to use them in a particular way.
And even then, it would likely be Python package installers like pip, Python environment managers like pipenv, and data analysis environment managers like conda that would handle the task of actually consuming that metadata (in whatever form it may appear). Having your citation management support depend on which version of Python you were using seems like it would be mostly a source of pain rather than beneficial.
Cheers,
Nick.
--
Nick Coghlan | Brisbane, Australia

On Wed, Jun 27, 2018 at 2:20 PM, Andrei Kucharavy wrote:
This is indeed a serious problem. I suspect python-ideas isn't the best venue for addressing it though - there's nothing here that needs changes to the Python interpreter itself (I think), and the people who understand this problem the best and who are most affected by it mostly aren't here.
You'll want to check out the duecredit project:
One of the things they've thought about is the ability to track citation information in a more fine-grained way than per-package - for example, there might be a paper that should be cited by anyone who calls a particular method (or even passes a specific argument to some specific method, when that turns on some fancy algorithm).
The R world also has some prior art -- in particular I know they have citations as part of the standard metadata in every package.
I'd actually like to see a more general solution that isn't restricted to any one language, because multi-language analysis pipelines are very common. For example, we could standardize a convention where if a certain environment variable is set, then the software writes out citation information to a certain location, and then implement libraries that do this in multiple languages.
Of course, that's a "dynamic" solution that requires running the software -- which is probably necessary if you want to do fine-grained citations, but it might be useful to also have static metadata, e.g. as part of the package metadata that goes into sdists, wheels, and on PyPI. That would be a discussion for the distutils-sig mailing list, which manages that metadata.
One challenge in standardizing this kind of thing is choosing a standard way to represent citation information. Maybe CSL-JSON? There's a lot of complexity as you dig into this, though of course one shouldn't let the perfect be the enemy of the good...
-n
--
Nathaniel J. Smith

On 28 June 2018 at 01:19, Nathaniel Smith <> wrote:
I actually think the opposite. If this is not fixed in a PEP it will stay in the current state. Writing a PEP (and officially accepting it) for this purpose will give a signal that it is a standard practice

I think a __citation__ *method* is a bad idea. This yells out "attribute" to me. A function or two that parses those attributes in some manner is a better idea... And there's no reason that function or two need to be dunders. There's also no reason they need to be in the standard library... There might be many citation/writing applications that process the data to their own needs.

But assuming there is an attribute, WHAT goes inside it? Is it a string? And if so, in what markup format? Is it a dictionary? A list? A custom class? Does some wrapper function deal with different formats? Does the wrapper also scan for __author__, __copyright__, and friends?

We also need to decide what __citation__ is an attribute OF. Only modules? Classes? Methods? Functions? All of the above? If multiple, how are the attributes at different places synthesized or processed? Can one object have multiple citations (e.g. what if a class or method implements multiple algorithms depending on a switch... Or depending on the shape of the data being processed? The different algorithms might need different citations).

These are all questions that could have good answers. But I don't know what the answers are. I've worked in scientific computing for a good while, but not as an academic. And when I was an academic it wasn't in scientific computing. This list is not mostly composed of the relevant experts. Those are the authors and users of SciPy and statsmodels, and scikit-learn, and xarray, and Tensorflow, and astropy, and so on.

There's absolutely nothing in the idea that requires a change in Python, and Python developers or users are not, as such, the relevant experts. In the future, AFTER there is widespread acceptance of what goes in a __citation__ attribute, it would be easy and obvious to add minimal support in Python itself for displaying citation content. But this is the wrong group to mandate what the actual academic needs are here.

On Sun, Jul 1, 2018, 9:07 AM Ivan Levkivskyi <> wrote:

On Sun, Jul 1, 2018 at 9:45 AM David Mertz <> wrote:
This is not entirely true. If some variant of __citation__ is endorsed by the community, I would expect that pydoc would extract this information to fill an appropriate section in the documentation page. Note that pydoc already treats a number of dunder variables specially: '__author__', '__credits__', and '__version__' are a few that come to mind, so I don't think the threshold for adding one more should be too high. On the other hand, maybe '__author__', '__credits__', and '__citation__' should be merged in one structured variable (a dict?) with format designed with some extendability in mind.

CreativeWork has a field with a range of {CreativeWork, Text} There's also a attribute with a domain of CreativeWork and a range of {Organization, Person}

- BibTeX is actually somewhat ill-specified, TBH.
- There is a repository of CSL styles at .
- CSL is sponsored by both Zotero and Mendeley.
- A number of search engines support (and JSONLD)
- The RDFS vocabulary is designed to describe a graph of resources (CreativeWork, Code, SoftwareApplication, ScholarlyArticle, MedicalScholarlyArticle).

    __citation__ = [{}, ]
    __citation__ = {
        '@type': ['schema:ScholarlyArticle'],
        'schema:name': '',
        'schema:author': [{
            '@type': 'schema:Person',
            '...': '...'}]
    }

JSONLD is ideal for describing a graph of resources with varied types.

If the overhead of __citation__ for every import is unjustified, a lookup of methods with dotted names that finds entries for root modules as well would be great:

    citations('json.loads')
    citations('list.sort')
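One possible reading of that dotted-name lookup, as a rough sketch: walk from the full name up to the root module and collect whatever a registry holds at each level. The registry contents, its dict-of-lists shape, and the package name examplepkg are all hypothetical placeholders.

```python
# Hypothetical registry; the entries are placeholders, not real citations.
REGISTRY = {
    "examplepkg": ["Example Package paper (placeholder)"],
    "examplepkg.filters.fancy": ["Fancy filter paper (placeholder)"],
}


def citations(dotted_name, registry=REGISTRY):
    """Collect citations for a dotted name and all of its parents.

    'a.b.c' is looked up as 'a.b.c', then 'a.b', then 'a', so an entry
    for the root package is always included when it exists.
    """
    parts = dotted_name.split(".")
    found = []
    for end in range(len(parts), 0, -1):
        key = ".".join(parts[:end])
        for entry in registry.get(key, []):
            if entry not in found:  # keep order, drop duplicates
                found.append(entry)
    return found
```

The most specific entry comes first, so a caller who only wants the top-level citation can take the last element instead.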
A tracing debugger could look up each and every package, module, function, and method each ScholarlyArticle SoftwareApplication executes (from a registry in e.g. a or a _citations_.jsonld.json).

It'd be a shame to need to manually format citations for a particular Journal's CSL bibliographic metadata template preference.

sphinxcontrib-bibtex is a Sphinx extension for BibTeX support (with a bibliography directive and a cite role) - Src:

Jupyter notebooks support document-level metadata (in JSON that's currently only similar to JSONLD). is search engine indexable.

On Wednesday, July 4, 2018, Alexander Belopolsky <> wrote:

... a schema:Dataset may be part of a Creative work. #LinkedReproducibility #nbmeta On Wednesday, July 4, 2018, Wes Turner <> wrote:

typeshed, dotted lookup, ScholarlyArticle semantic graphs with classes, properties, and URIs

Would external metadata (similar to how typeshed is defined in a 'shadow naming scheme' (?)) be advantageous for dotted name lookup of citation metadata?

> Typeshed contains external type annotations for the Python standard
> library and Python builtins, as well as third party packages.
> This data can e.g. be used for static analysis, type checking or type inference.

    stdlib/{2, 2and3, 3, 3.5, 3.6, 3.7}
    third_party/{2, 2and3, 3}/{jinja2,}

Ideally, a ScholarlyArticle can also be published as HTML with RDFa and/or JSONLD (in addition to two column LaTeX/PDF which is lossy in regards to structured data / linked data) with its own document-level metadata simply as part of a graph of resources (such as schema:citation and schema:Datasets) described using a search-indexed vocabulary such as the RDFS vocabulary.

An aside: has a range of {Text, URL} where the Text should be a 3 character UN/CEFACT Common Code; but there's also QUDT for unit URIs; fortunately, RDF allows repeated property values, so we can just add both.

On Wednesday, July 4, 2018, Wes Turner <> wrote:

On Wed, Jun 27, 2018 at 05:20:01PM -0400, Andrei Kucharavy wrote: [...]
Why does this have to be a dunder method? In general, application code shouldn't be calling dunders directly, they're reserved for Python.

I think your description of what this method should do is not really coherent. On the one hand, you have __citation__() be a method that you call (how?) but on the other hand you have it being a data field __citation__ that you scan. Which is it?

I do think you have identified an important feature, but I think this is a *tool*, not a *language feature*. My spur of the moment thought is:

- we could have a script (a third party script? or in the std lib?) which the user calls, giving the name of their module or package as argument e.g. "python -m cite"
- this script knows how to analyse for a list of dependencies, perhaps filtering out standard library packages;
- it interrogates myapplication, and each dependency, for a citation;
- this might involve reserving a standard __citation__ data field in each module, or a __citation__.xml file in the package, or some other protocol;
- or perhaps the cite script knows how to generate the appropriate citation itself, from any of the standard formatted data fields found in many common modules, like __author__, __version__ etc.
- either way, the script would generate a list of packages and modules used by myapplication, plus citations for them.

Presumably you would need to be able to specify which citation style to use.

The point is, the *grunt work* of generating the citations is just a script. It isn't a language feature. It might not even be in the std lib (although perhaps we could ship it as a standard Python script, like the compileall module and a few other tools, starting in version 3.8).

The protocol of how the script works out the citations can be developed. Perhaps we could reserve a __citation__ dunder as a de facto standard data field, like people already use __author__ and __version__ and similar. Or it could look for a separate XML or TXT file in the package directory.
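A rough sketch of the core of such a script, assuming the simplest protocol mentioned above: a __citation__ data field on the top-level module of each package. The function name collect_citations is made up for illustration, and real dependency analysis (as opposed to looking at what is already imported) is left out.

```python
import sys


def collect_citations(modules=None, skip_stdlib=True):
    """Map top-level package name -> its __citation__ value, if any.

    `modules` defaults to everything currently imported (sys.modules).
    """
    if modules is None:
        modules = dict(sys.modules)
    # sys.stdlib_module_names exists on Python 3.10+; older versions
    # fall back to an empty set, so nothing is filtered there.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    found = {}
    for name, module in sorted(modules.items()):
        if "." in name or module is None:
            continue  # only look at top-level packages
        if skip_stdlib and name in stdlib:
            continue
        citation = getattr(module, "__citation__", None)
        if citation:
            found[name] = citation
    return found
```

Run at the end of an analysis script, collect_citations() reports only the packages that were actually imported and that opted in by defining the field. Formatting the result into a particular citation style would be a separate step.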
What does this have to do with either import or setup?
A long time ago, I added a feature request for a page in the documentation to show how to cite Python in various formats: I don't believe there has been any progress on this. (I certainly don't know the right way to cite software.) Perhaps this can be merged with your idea. Should Python have a standard sys.__citation__ field that provides the relevant detail in some format-independent, machine-readable object like a named tuple? Then this hypothetical tool could read the tuple and format it according to any citation style. -- Steve
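To illustrate the named-tuple idea, here is a small sketch of a format-independent record plus two formatters. The field set, the Citation name, and both formatter names are assumptions; the thread does not define any of them, and the sample values used with it should be treated as placeholders.

```python
from collections import namedtuple

# Hypothetical field set; the thread does not define one.
Citation = namedtuple("Citation", "key authors title year publisher url")


def to_bibtex(c):
    """Render a Citation as a BibTeX @misc entry, skipping empty fields."""
    fields = [
        ("author", " and ".join(c.authors)),
        ("title", c.title),
        ("year", str(c.year)),
        ("publisher", c.publisher),
        ("url", c.url),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields if value)
    return f"@misc{{{c.key},\n{body}\n}}"


def to_plain(c):
    """Render a Citation as a one-line, author-year style reference."""
    return f"{', '.join(c.authors)} ({c.year}). {c.title}. {c.publisher}."
```

The point of the split is that the data object carries no style. Any tool that understands the fields can add another formatter without touching the package that supplied the record.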

While I'm not personally in need of citations (and never felt I was) I can easily understand the point -- sometimes citations can make or break a career and having written a popular software package should be acknowledged. Are there other languages or software communities that do something like this? It would be nice not to have to invent this wheel. Eventually a PEP and an implementation should be presented, but first the idea needs to be explored more. --Guido On Wed, Jun 27, 2018 at 3:30 PM Andrei Kucharavy <> wrote:
-- --Guido van Rossum (

This is an interesting proposal. Speaking as a developer of scientific software packages it would be really cool to have support for something like this in the language itself. The software sustainability institute in the UK have written several blog posts advocating the use of CITATION files containing this sort of metadata: A github code search for __citation__ also gets 127 hits that mostly seem to be research software that are using this attribute more or less as suggested here: It's also worth pointing out which is sort of a citation search engine for software projects. It uses a number of heuristics to figure out what the appropriate citation for a piece of software is. On Wed, Jun 27, 2018 at 5:49 PM, Guido van Rossum <> wrote:

On 28/06/2018 00:00, Nathan Goldbaum wrote:
I just thought that it might be worth pointing out that this should actually work both ways i.e. if a specific package, module or function is inspired by or directly implements the methods included in a specific publication then any __citation__ entries within it should also cite that/those or allow references to them to be recovered. The general principle is if you are expecting to be cited you also have to cite. -- Steve (Gadget) Barnes Any opinions in this message are my personal opinions and do not reflect those of my employer.

I think this is a fine idea, but could be achieved by convention, like __version__, rather than by fiat. And it’s certainly not a language feature. So Nathaniel’s right — the thing to do now is work out the convention, and then advocate for it. -CHB

> Are there other languages or software communities that do something like this? It would be nice not to have to invent this wheel.

While I do not use R regularly, I understand their community is largely academic-driven, and citations are strongly encouraged as seen in their documentation:

Here is an example use of their `citation()` function:

    > citation()

    To cite R in publications use:

      R Core Team (2013). R: A language and environment for statistical
      computing. R Foundation for Statistical Computing, Vienna, Austria. URL

    A BibTeX entry for LaTeX users is

      @Manual{,
        title = {R: A Language and Environment for Statistical Computing},
        author = {{R Core Team}},
        organization = {R Foundation for Statistical Computing},
        address = {Vienna, Austria},
        year = {2013},
        url = {},
      }

Calling the `citation()` function generates a BibTeX output, which is one of the most common citation conventions. For reference, I believe this is the source code:

That's a lot of responses, thanks for the interest and the suggestions!

> Are there other languages or software communities that do something like

To my knowledge, R is the only language that implements such a feature. Package developers add a CITATION text file containing the citation for their package in whatever text format they choose. A specialized citation() built-in function can be called from the REPL; it returns a citation for R itself, including a BibTeX entry for LaTeX users. When citation() is called on a package instead, it returns the contents of CITATION for that package specifically (e.g. citation("ggplot2")) or alternatively uses package metadata to build a sane citation. Given that most work with R is done within a REPL and packages are installed and loaded with commands such as install.packages("ggplot2") and library("ggplot2"), this approach makes sense in that context. This, however, didn't feel terribly Pythonic to me.

As for a PEP and a reference implementation, I will gladly take care of them if the idea gets enough traction, but there already seems to be a PEP draft as well as an attempt at implementation by one of the AstroPy/AstroML maintainers, using the __citation__ field and a citation() function to unpack it: There also seem to be some packages in the community using __bibtex__ rather than __citation__ to store BibTeX entries, but I haven't yet found any large project implementing it or any PEP drafts associated with it.

> The software sustainability institute in the UK have written several blog
> posts advocating the use of CITATION files containing this sort of metadata:

Yes, that's the R approach I presented above. It is viable, especially if hooked to something accessible from the REPL directly, such as a __cite__ or __citation__ attribute/method for modules. I would, however, advocate for a more structured approach - perhaps JSON or BibTeX that would get parsed and converted to a suitable citation format by __cite__, if it was implemented as a method.

> A github code search for __citation__ also gets 127 hits that mostly seem

Most of them are from the AstroPy universe or from the CitationPEP draft I've referenced above.

> This is indeed a serious problem. I suspect python-ideas isn't the

There has been localized discussion popping up among the large scientific package maintainers and some attempts to solve the problem at the local level. Until now they seemed to be winding down due to a lack of a large-scale citation mechanism and a discussion about what is concretely doable at the scale of the language is likely to finalize As for the list, reserving a __citation__/__cite__ for packages at the same level as __version__ is now reserved and adding a citation()/cite() function to the standard library seemed like large enough modifications to warrant seeking buy-in from the maintainers and the community at large.

> You'll want to check out the duecredit project:

duecredit looks amazing - I will definitely check it out. The idea was, however, to bring the barrier for adoption and usage as low as possible. In my experience, the vast majority of Python users in academic environments who aren't citing the packages properly are beginners who just assembled a pipeline based on widely-used libraries and would need to generate a citation list for it to pass on to their colleagues responsible for the paper assembly and submission. As such they are unlikely to search for third-party libraries beyond those they've found and used to solve their specific problem.

> I'd actually like to see a more general solution that isn't restricted

Thanks for the reference to the distutils-sig list. I will talk to them if the idea gets traction here. I am not entirely convinced about the multi-language pipelines. In bioinformatics, often the heavy lifting is done by a single package (for instance bowtie for RNA-seq alignment) and the output is piped to a custom script, mostly in R or Python. The citation for the library doing the heavy lifting is often well known and widely cited; the issues arise in the custom scripts, which import and use libraries that should be cited without citing them.

> One challenge in standardizing this kind of thing is choosing a

CSL-JSON represented as a dict to be supplied to the setup file is definitely one way of doing it. I was, however, thinking more about the BibTeX format, given that CSL-JSON is more closely affiliated with Mendeley.

> Why does this have to be a dunder method? In general, application code shouldn't be calling dunders directly, they're reserved for Python.

I was under the impression that sometimes the dunders are used to store relevant information that would not be of use to most users, such as __version__, and sometimes to better control the execution flow (for instance the if __name__ == "__main__" idiom).

> I think your description of what this method should do is not

My initial idea was to have a __cite__ method embedded in the import mechanism that would parse data from config and, upon a call on a package, return the citation developers want to see associated with the current package version in the format the user needs (for instance numpy.__cite__('bibtex') would return a citation for the current numpy version in BibTeX format). If called on the script itself, __cite__('bibtex') would iterate through all the imported modules and retrieve their citations one by one, at least for those modules that have an associated citation. After reading the feedback in this thread, I believe that a __citation__ reserved field that pulls the data from the setup script and a cite() script in the standard library would be a better approach. In the end, I believe the best would be to implement both of them and see which one feels more Pythonic.

> I do think you have identified an important feature, but I think this is

Yes, that's the idea! The biggest reason for me to send the discussion to this list is to check if it would be acceptable to reserve the __citation__ data field in each module and include the cite() script in the standard library.

> Presumably you would need to be able to specify which citation style to

Yes, but to avoid building a configurable citation engine for the thousands of formats there are in the wild, it would support a couple of standard, interchangeable formats, such as BibTeX or EndNote xref - both text formats that are simple to use. I was thinking about the approach taken by Google Scholar from that perspective.

The implementation I was thinking about would have required __citation__/__cite__ dunder reservation or implementation of a function that would be injected into installed packages. For setup I was thinking about adding the citation field to the distutils setup. I was not really aware of the distutils-sig discussion list, which would be more appropriate in that regard.

> A long time ago, I added a feature request for a page in the

That's a good point. Unfortunately, I have not thought about how to cite code that would not have an associated publication. From what I see by checking Google Scholar, as of now people are citing the Python language reference manual if they want to cite Python itself in a scientific publication. GvR didn't seem interested in citations for Python and, from what I understand, neither are the vast majority of non-scientific package developers, given that citations are not essential for their career advancement.

> Should Python have a standard sys.__citation__ field that provides the

The idea for Python itself seems good! However, rather than using a named tuple, I was thinking about using a dict consistent with CSL-JSON or BibTeX. And writing a citation generating engine that would be consistent with hundreds if not thousands of journal-specific formats is a bit out of the scope of the proposal for now - most of the time people just want something their citation/bibliography engine can ingest and generate a citation from there in their Word/LaTeX documents. BibTeX/EndNote export formats are perfect for that task in my experience.

The general convention is to cite the top-level publication. While some methods definitely deserve a citation of their own (such as the Sobel filter in scikit-image), packages link to the relevant citation in the documentation of those methods and would normally cite them in their master publication. That's definitely an idea to look at but I don't see a straightforward way of implementing this so far.

> I think this is a fine idea, but could be achieved by convention, like

This already seems to be an idea floating in the air - AstroPy is inching towards that implementation. The idea is to modify the language to make citing as straightforward as possible and create a universal mechanism for that.

Best,
*Andrei Kucharavy* Post-Doc @ *Joel S. Bader** Lab* Johns Hopkins University, Baltimore, USA.

On Thu, Jun 28, 2018 at 11:48 AM Chris Barker - NOAA Federal via Python-ideas <> wrote:

One more thing. There's precedent for this: when you start an interactive Python interpreter it tells you how to get help, but also how to get copyright, credits and license information:

    $ python3
    Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> credits
    Thanks to CWI, CNRI,, Zope Corporation and a cast of
    thousands for supporting Python development. See for more information.

It makes total sense to add citations/references to this list (and those should probably print a reference for Python followed by instructions on how to get references for other packages and how to properly add a reference to your own code).

--Guido van Rossum

On Thu, Jun 28, 2018 at 05:25:00PM -0400, Andrei Kucharavy wrote:
I think that an approach similar to help/quit/exit is warranted. The cite()/citation() function need not be *literally* built into the language, it could be an external function written in Python and added to builtins by the module. -- Steve
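For what it's worth, the help/quit/exit pattern is easy to imitate. In CPython those names, along with copyright, credits and license, are ordinary Python objects placed into builtins at startup by the site module. The sketch below does the same for a made-up cite helper: the class name, the function name, the __citation__ field it reads, and the placeholder text returned for Python itself are all assumptions.

```python
import builtins


class _Citer:
    """Object whose repr is a hint and whose call returns a citation."""

    def __repr__(self):
        return "Type cite() for a Python reference, or cite(module) for a package."

    def __call__(self, module=None):
        if module is None:
            # Placeholder: the thread never settles on how to cite Python.
            return "<placeholder: reference for Python itself>"
        fallback = f"No citation recorded for {module.__name__}."
        return getattr(module, "__citation__", fallback)


def install_cite(name="cite"):
    """Expose the helper as a builtin, mirroring how `credits` is exposed."""
    setattr(builtins, name, _Citer())
```

After install_cite() has run, typing cite at the interactive prompt prints the hint (as credits does), and cite(some_module) returns that module's __citation__ if it defines one.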

For me, it's about setting a standard that is endorsed by the language, and setting expectations for users. There currently is no standard, which is why packages use __citation__, __cite__, __bibtex__, etc., and as a user I don't immediately know where to look for citation information (without going to the source). My feeling is that adopting __citation__ or some dunder name could be implemented on classes, functions, etc. with less of a chance of naming conflicts, but am open to discussion. I have some notes here about various ideas for more advanced functionality that would support automatically keeping track of citation information for imported packages, classes, functions: On Thu, Jun 28, 2018 at 10:57 PM, Alex Walters <> wrote:
-- Adrian M. Price-Whelan Lyman Spitzer, Jr. Postdoctoral Fellow Princeton University

But don't all the users who care about citing modules already use the scientific python packages, with scipy itself at its center? Wouldn't those engaging in science or in academia be better stewards of this than systems programmers? Since you're not asking for anything that can't be done in a third party module, and there is a third party module that most of the target audience of this standard would already have, there is zero reason to take up four names in the python runtime to serve those users.

On Thu, Jun 28, 2018 at 11:26 PM, Alex Walters <> wrote:
Not all scientific software in Python depends on scipy or even numpy. However, it does all depend on Python. Although perhaps that argues for a cross-language solution :) I still think it would be very nice to have an official standard for citation information in Python packages as codified in a PEP. That would reduce ambiguity and make it much easier for tool-writers who want to parse citation information.

If that's possible, that would be great!

I was not aware this was a possibility - it does seem like a good option!

> If I were you, I'd try organizing a birds-of-a-feather at the next

Not all packages are within the numpy/scipy universe - Pandas and Seaborn are notable examples. I brought this thread to the attention of some major scientific package maintainers as well as the main citationPEP author. I am not entirely sure where this conversation could be moved outside python-ideas given we are talking about something universal across packages, but would gladly take any suggestions.

> There isn't actually any formal method for registering special names

Thanks for the explanation - Python development and maintenance do seem to be a complex process from the outside and these kinds of subtleties are not always easy to distinguish :).

> The way to do this is to first get your solution implemented as a

Got it.

> I think you misunderstand how these lists work :-). (Which is fine --

Got it as well - that does indeed seem a reasonable way of doing things, although I believe there have been precedents where GvR implemented a feature from scratch after studying existing libraries (I am thinking notably about asyncio, which is orders of magnitude more complex and involved than anything we are talking about here).

> And often the custom scripts are a mix of R and Python, and maybe some

In my experience, people tend to go with either one or the other or use Julia. I am not very familiar with the Fortran ecosystem - as far as I've seen, those are extremely efficient libraries that get wrapped and used in most modern scientific computing languages, but very rarely directly. In addition to that, while I see how granular citations could be implemented in Python, I have a bit more trouble understanding how calls to R, Python, Perl, C, C++ or Fortran from command line scripts can be analyzed on the fly to get metadata about citations. I have even more trouble imagining how it would be possible to bring developers across all the separate language communities to agree on a single standard.

Hm - was not aware Zotero uses it as well - it's definitely a good sign and I will have to look into CSL-JSON more in depth.

> Why not scipy.cite() or scipy.citation()? I don't see any reason for these
> functions to ship with standard python at all.

There are packages that do not depend on scipy and even for those that do - most users writing analysis pipelines for scientific packages are unaware that they are using scipy/numpy underneath the packages that do what they want at the highest level.

> I don't think that this is a very useful idea, because most people that

Thanks for your opinion Gael - as a maintainer of scikit-learn you have more experience with this issue than most of us. In my field (computational biology in molecular biology labs) the situation is somewhat different - most of the custom scripts are implemented by people who have often learned Python, or programming at all, in the last couple of years. Most of the time they get asked by the corresponding author to provide 1-5 citations for their analytical pipeline and to describe what they did in the supplementary material, and I had several junior developers in my labs come to me asking what they were supposed to cite and where to find the citations. We aren't likely to convince everyone to cite code overnight, but making citing as easy as possible does seem like a step in the right direction to me.

> I still think it would be very nice to have an official standard for

That's my opinion as well.

To summarize the conversation until now, it seems that a __citation__ data field and a cite() script are the preferred option. If the proposal gets traction and is accepted, the citation for Python as well as the instructions to get a citation for a package can be added as a top-level command, similar to credits, copyright or license. As of now, it seems like the next steps would be to:

- draft a PEP (or complete the existing one) and implement the cite() script as well as a show-case package using __citation__
- talk to major package maintainers to see if they have any objections to the method or suggestions with regards to the PEP/implementation
- talk to the distutils-sig list to see if we could add the __citation__ metadata to
- submit a proper PEP (Would a pull request to be an acceptable way of doing it?)

Is there something I might be missing so far?

Best,
*Andrei Kucharavy* Post-Doc @ *Joel S. Bader** Lab* Johns Hopkins University, Baltimore, USA.

On Fri, Jun 29, 2018 at 10:51 AM Nathan Goldbaum <> wrote:

On Fri, Jun 29, 2018, 8:14 PM Andrei Kucharavy <> wrote:
Not all packages are within the numpy/scipy universe - Pandas and Seaborn are notable examples.
Huh?! Pandas is a thin wrapper around NumPy. To be fair, it is a wrapper that adds a huge number of wrapping methods and classes. Seaborn in turn has at least a soft dependency on Pandas (some of the charts really need a DataFrame to work from).

I like the idea of standardizing citation information. But it has little to do with Python itself. Getting the authors of scientific packages to agree on conventions is what's needed, and doing that requires accurately determining their needs, not some mandate from Python itself.

Nothing in the language needs to change to agree on some certain collection of names (perhaps dunders, perhaps not), and some certain formats for the data that might live inside them. Down the road, if there gets to be widespread acceptance of these conventions, the Python standard library might include a function or two to work with them. But the horse should go before the cart.

On Fri, Jun 29, 2018, 17:14 Andrei Kucharavy <> wrote:
This is thin justification to add something to core. It seems like the very small percentage of academic users whose careers depend on this cannot resolve the political issue of forming a standards body. I don't see how externalizing the standard development will help. Kudos for shortcutting the process in a practical way to just get it done, but this just puts core devs in the middle of silly academic spats. A language endorsed citation method isn't a 'correct' method, and without the broad consensus that currently doesn't exist, this becomes _your_ method, a picked winner but ultimately a lightning rod for bored tenured professors with personal axes to grind. If this were about implementing an existing correct method I'm sure a grad student would be tasked with it for an afternoon. This is insanely easy to implement in docstrings, or a standard import, or mandatory include, or decorator, or anywhere else, it's just a parsing protocol. I believe 3.7 now exposes docstrings in the AST, meaning a simple static analyzer can handle all of PyPi, giving you crazy granularity if citations existed. Don't you want to cite the exact algorithm used in an imported method, not just lump them all into one call? Heck, I bet you could use type annotations. This really feels like you've got an amazing multi-tool but you want to turn the world, not the screw. This isn't a tool the majority of people will use, even if the citations exist. Don't get me wrong, I love designing standards and protocols, but this is pretty niche. I assume it won't be mandatory so I'm tilting at windmills, but then if it's not mandatory, what's the point of putting it in core? Just create a jstor style git server where obeying the citation protocol is mandatory. Of course, enforcing a missing citation is impossible, but it does mean citations can be generated by parsing imports. This is how it will evolve over time, by employing core devs on that server framework.
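A sketch of the docstring route, to show how little machinery it needs. The "Cite:" line marker is an invented convention, and docstring_citations is a hypothetical name. The code uses ast.get_docstring from the standard library, so the source is parsed but never imported or executed.

```python
import ast

# Hypothetical marker; no docstring convention for citations is standardized.
MARKER = "Cite:"


def docstring_citations(source):
    """Return {qualified name: [citation lines]} without importing the code."""
    results = {}

    def record(node, name):
        doc = ast.get_docstring(node) or ""
        cites = [
            line.strip()[len(MARKER):].strip()
            for line in doc.splitlines()
            if line.strip().startswith(MARKER)
        ]
        if cites:
            results[name] = cites

    def visit(node, prefix):
        # Only direct function/class children are followed, so definitions
        # nested inside if/try blocks are not picked up by this sketch.
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}{child.name}"
                record(child, name)
                visit(child, name + ".")

    tree = ast.parse(source)
    record(tree, "<module>")
    visit(tree, "")
    return results
```

This gives per-function granularity for free. What it cannot tell you is which of those functions a given analysis actually called, which is the part a dynamic approach such as duecredit addresses.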

On Fri, Jun 29, 2018 at 8:58 PM, Matt Arcidy <> wrote:
[...] Just create a jstor style git server where obeying the citation
protocol is mandatory.
I don't know if it constitutes a standards body, but there are a couple journals out there that are meant to serve as mechanisms for turning a repo into a published/citable thing, they might be good to look at for prior art as well as to what metadata should be included: * (sponsored by NumFOCUS) * pkg_resources that could pull out a short citation string from some package metadata (a hypothetical `pkg_resources.get_distribution("numpy").citation` that could be wrapped by some helper function if desired)? The actual mechanism to convert metadata into something in the repo (a dunder cite string in the root module, a separate metadata file, etc.) into the package metadata isn't as important as rolling said metadata into something part of the distribution package like the version or long_description fields. Once the schema of the citation data is defined, you could add it to the metadata spec (outgrowth of PEP-566)
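To sketch what reading a citation out of package metadata might look like: installed-package metadata is stored as RFC 822 style headers, so the standard library email parser can read it. The "Citation" field below is hypothetical and is not part of the core metadata specification; the package name and values are placeholders, and citation_from_metadata is a made-up helper name.

```python
from email import message_from_string

# "Citation" is a hypothetical field; it is not in the core metadata spec.
EXAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: examplepkg
Version: 1.0
Citation: Author, A. (2018). An Example Package (placeholder).
"""


def citation_from_metadata(metadata_text, field="Citation"):
    """Return every value of `field` from RFC 822 style package metadata."""
    message = message_from_string(metadata_text)
    # get_all returns the second argument when the header is absent.
    return message.get_all(field, [])
```

Because the field may repeat, a package with several update publications could list each one on its own line. The installer or environment manager, rather than the interpreter, would be the natural consumer.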

Putting citation information into pyproject.toml makes a lot more sense than putting it in the modules themselves, where they would have to be introspected to be extracted.
* It puts zero burden on the core developers
* It puts near zero burden on the distutils special interest group
* It doesn't consume names from the package namespace
* It's just a TOML file - you can add sections to it willy-nilly
* It's just a TOML file - there are libraries in almost all ecosystems to handle it.
Nothing has to go into the core metadata specification unless part of your suggestion is that PyPI show the citations. I don't think that is a good idea for the scope of PyPI and the workload of the warehouse developers. I don't think it's too much to ask for the scientific community to figure out the solution that works for most people before bringing it back here. I also don't think it's out of scope to suggest taking this to SciPy - yes, not everything depends on SciPy, but you don't need everything, you just need momentum.

On Thu, Jun 28, 2018 at 2:25 PM, Andrei Kucharavy <> wrote:
Those are the people with the most motivation and expertise to solve this, and whose buy-in you'll need on any solution. If they haven't solved it yet themselves, then there are basically two reasons why that happens: either because they're busy and no-one's had enough time to work on it, or else because they're uncertain about the best path forward. Neither of these is a problem that python-ideas can help with. If you want to be effective here, you need to talk to them to figure out how you can help them move forward. If I were you, I'd try organizing a birds-of-a-feather at the next SciPy conference, or start getting in touch with others working on this (duecredit devs, the folks listed on that citationPEP thing, etc.), and go from there. (Feel free to CC me if you do start up some effort like this.)
There isn't actually any formal method for registering special names like __version__, and they aren't treated specially by the language. They're just variables that happen to have a funny name. You shouldn't start using them willy-nilly, but you don't actually have to ask permission or anything. And it's not very likely that someone else will come along and propose using the name __citation__ for something that *isn't* a citation :-).
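Since such a name is just an ordinary module attribute, a collector needs no language support. The sketch below assumes packages opt in by setting a module-level `__citation__` string; that convention, the helper name `collect_module_citations`, and the `fakepkg` demo module are all hypothetical.

```python
import sys
import types


def collect_module_citations(modules=None):
    """Return {name: citation} for top-level modules that define __citation__."""
    modules = sys.modules if modules is None else modules
    citations = {}
    for name, module in list(modules.items()):
        if "." in name:
            continue  # keep only top-level packages
        try:
            citation = getattr(module, "__citation__", None)
        except Exception:
            continue  # some lazy modules raise on attribute access
        if citation is not None:
            citations[name] = citation
    return citations


# Demonstration with a throwaway module; the name and reference are made up.
fake = types.ModuleType("fakepkg")
fake.__citation__ = "Doe, J. (2018). fakepkg. Hypothetical Journal."
sys.modules["fakepkg"] = fake
```

Called at the end of a script, `collect_module_citations()` reports whatever imported packages happen to define the attribute, and silently skips the rest.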
The way to do this is to first get your solution implemented as a third-party library and adopted by the scientific packages, and then start thinking about whether it would make sense to move the library into the standard library. It's relatively easy to move things into the standard library. The hard part is making sure that you implemented the right thing in the first place, and that's MUCH more likely if you start out as a third-party package.
I think you misunderstand how these lists work :-). (Which is fine -- it's actually pretty opaque and confusing if you don't already know!) Generally, distutils-sig operates totally independently from python-{ideas,dev} -- if you have a packaging proposal, it goes there and not here; if you have a language proposal, it goes here and not there. *If* what you want to do is add some static metadata to python packages through, then python-ideas is irrelevant and distutils-sig is who you'll have to convince. (But they'll also want to see that your proposal has buy-in from established packages, because they don't understand the intricacies of software citation and will want people they trust to tell them whether the proposal makes sense.)
And often the custom scripts are a mix of R and Python, and maybe some Fortran, ... Plus, if it works for multiple languages, it means you get to share part of the work with other ecosystems, instead of everyone reinventing the wheel. Also, if you want to go down the dynamic route (which is the only way to get accurate fine-grained citations), then it's just as easy to solve the problem in a language independent way.
Huh, is it? I only know it from Zotero. -n -- Nathaniel J. Smith --

On 29 June 2018 at 12:14, Nathaniel Smith <> wrote:
The one caveat on dunder names is that we expressly exempt them from our usual backwards compatibility guarantees, so it's worth getting some level of "No, we're not going to do anything that would conflict with your proposed convention" at the language design level.
Aye, in this case I think you can comfortably assume that we'll happily leave the "__citation__" and "__cite__" dunder names alone unless/until there's a clear consensus in the scientific Python community to use them a particular way. And even then, it would likely be Python package installers like pip, Python environment managers like pipenv, and data analysis environment managers like conda that would handle the task of actually consuming that metadata (in whatever form it may appear). Having your citation management support depend on which version of Python you were using seems like it would be mostly a source of pain rather than beneficial. Cheers, Nick. -- Nick Coghlan | | Brisbane, Australia

On Wed, Jun 27, 2018 at 2:20 PM, Andrei Kucharavy <> wrote:
This is indeed a serious problem. I suspect python-ideas isn't the best venue for addressing it though – there's nothing here that needs changes to the Python interpreter itself (I think), and the people who understand this problem the best and who are most affected by it, mostly aren't here.
You'll want to check out the duecredit project. One of the things they've thought about is the ability to track citation information in a more fine-grained way than per-package – for example, there might be a paper that should be cited by anyone who calls a particular method (or even passes a specific argument to some specific method, when that turns on some fancy algorithm).
The R world also has some prior art -- in particular I know they have citations as part of the standard metadata in every package.
I'd actually like to see a more general solution that isn't restricted to any one language, because multi-language analysis pipelines are very common. For example, we could standardize a convention where if a certain environment variable is set, then the software writes out citation information to a certain location, and then implement libraries that do this in multiple languages. Of course, that's a "dynamic" solution that requires running the software -- which is probably necessary if you want to do fine-grained citations, but it might be useful to also have static metadata, e.g. as part of the package metadata that goes into sdists, wheels, and on PyPI. That would be a discussion for the distutils-sig mailing list, which manages that metadata.
One challenge in standardizing this kind of thing is choosing a standard way to represent citation information. Maybe CSL-JSON? There's a lot of complexity as you dig into this, though of course one shouldn't let the perfect be the enemy of the good...
-n -- Nathaniel J. Smith --
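The environment-variable convention described above could be sketched roughly as follows. The variable name `CITATION_LOG`, the JSON-lines output, and the record fields are assumptions for illustration only; the fields are loosely modelled on CSL-JSON but are not claimed to be a valid record.

```python
import json
import os

ENV_VAR = "CITATION_LOG"  # hypothetical variable name


def record_citation(entry):
    """Append one citation entry as a JSON line if the env variable is set."""
    path = os.environ.get(ENV_VAR)
    if not path:
        return False  # convention not enabled; do nothing
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return True


def fancy_algorithm(values):
    """Toy function that reports the (made-up) paper it implements."""
    record_citation({"id": "doe2018", "type": "article-journal",
                     "title": "A fancy algorithm"})
    return sorted(values)
```

Because the protocol is only "append a line to the file named by this variable", an R or Fortran component in the same pipeline could take part without sharing any code.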

On 28 June 2018 at 01:19, Nathaniel Smith <> wrote:
I actually think the opposite. If this is not fixed in a PEP it will stay in the current state. Writing a PEP (and officially accepting it) for this purpose will give a signal that it is a standard practice.

I think a __citation__ *method* is a bad idea. This yells out "attribute" to me. A function or two that parses those attributes in some manner is a better idea... And there's no reason that function or two need to be dunders. There's also no reason they need to be in the standard library... There might be many citation/writing applications that process the data to their own needs.
But assuming there is an attribute, WHAT goes inside it? Is it a string? And if so, in what markup format? Is it a dictionary? A list? A custom class? Does some wrapper function deal with different formats? Does the wrapper also scan for __author__, __copyright__, and friends?
We also need to decide what __citation__ is an attribute OF. Only modules? Classes? Methods? Functions? All of the above? If multiple, how are the attributes at different places synthesized or processed? Can one object have multiple citations (e.g. what if a class or method implements multiple algorithms depending on a switch, or depending on the shape of the data being processed)? The different algorithms might need different citations.
These are all questions that could have good answers. But I don't know what the answers are. I've worked in scientific computing for a good while, but not as an academic. And when I was an academic it wasn't in scientific computing. This list is not mostly composed of the relevant experts. Those are the authors and users of SciPy and statsmodels, and scikit-learn, and xarray, and Tensorflow, and astropy, and so on.
There's absolutely nothing in the idea that requires a change in Python, and Python developers or users are not, as such, the relevant experts. In the future, AFTER there is widespread acceptance of what goes on a __citation__ attribute, it would be easy and obvious to add minimal support in Python itself for displaying citation content. But this is the wrong group to mandate what the actual academic needs are here.
On Sun, Jul 1, 2018, 9:07 AM Ivan Levkivskyi <> wrote:

On Sun, Jul 1, 2018 at 9:45 AM David Mertz <> wrote:
This is not entirely true. If some variant of __citation__ is endorsed by the community, I would expect that pydoc would extract this information to fill an appropriate section in the documentation page. Note that pydoc already treats a number of dunder variables specially: '__author__', '__credits__', and '__version__' are a few that come to mind, so I don't think the threshold for adding one more should be too high. On the other hand, maybe '__author__', '__credits__', and '__citation__' should be merged in one structured variable (a dict?) with format designed with some extendability in mind.
CreativeWork has a field with a range of {CreativeWork, Text} There's also a attribute with a domain of CreativeWork and a range of {Organization, Person}
- BibTeX is actually somewhat ill-specified, TBH.
- There is a repository of CSL styles at .
- CSL is sponsored by both Zotero and Mendeley.
- A number of search engines support (and JSONLD)
- The RDFS vocabulary is designed to describe a graph of resources (CreativeWork, Code, SoftwareApplication, ScholarlyArticle, MedicalScholarlyArticle).
__citation__ = [{}, ]
__citation__ = {
    '@type': ['schema:ScholarlyArticle'],
    'schema:name': '',
    'schema:author': [{
        '@type': 'schema:Person',
        '...': '...'}]
}
JSONLD is ideal for describing a graph of resources with varied types. If the overhead of __citation__ for every import is unjustified, a lookup of methods with dotted names that finds entries for root modules as well would be great:
citations('json.loads') citations('list.sort')
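One way such a dotted-name lookup could behave is sketched below. The registry contents, the package names, and the "most specific first" ordering are assumptions made for illustration; no such registry exists today.

```python
# Hypothetical registry; keys are dotted names, values are made-up references.
REGISTRY = {
    "examplepkg": ["Doe, J. (2018). examplepkg. Hypothetical Journal."],
    "examplepkg.solvers.fancy": [
        "Roe, R. (2017). A fancy solver. Hypothetical Proceedings."
    ],
}


def citations(dotted_name, registry=REGISTRY):
    """Collect citations for a name and all of its parents, most specific first."""
    parts = dotted_name.split(".")
    found = []
    # Walk from the full name up to the root module.
    for end in range(len(parts), 0, -1):
        found.extend(registry.get(".".join(parts[:end]), []))
    return found
```

A call for `examplepkg.solvers.fancy` returns both the method-level and the package-level reference, while a name with no entry at any level, such as `json.loads` here, returns an empty list.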
A tracing debugger could look up each and every package, module, function, and method each ScholarlyArticle SoftwareApplication executes (from a registry in e.g. a or a _citations_.jsonld.json). It'd be a shame to need to manually format citations for a particular Journal's CSL bibliographic metadata template preference. sphinxcontrib-bibtex is a Sphinx extension for BibTeX support (with a bibliography directive and a cite role) - Src: Jupyter notebooks support document-level metadata (in JSON that's currently only similar to JSONLD). is search engine indexable.
On Wednesday, July 4, 2018, Alexander Belopolsky <> wrote:

... a schema:Dataset may be part of a Creative work. #LinkedReproducibility #nbmeta On Wednesday, July 4, 2018, Wes Turner <> wrote:

typeshed, dotted lookup, ScholarlyArticle semantic graphs with classes, properties, and URIs
Would external metadata (similar to how typeshed is defined in a 'shadow naming scheme' (?)) be advantageous for dotted name lookup of citation metadata?
Typeshed contains external type annotations for the Python standard library and Python builtins, as well as third party packages. This data can e.g. be used for static analysis, type checking or type inference.
stdlib/{2, 2and3, 3, 3.5, 3.6, 3.7}
third_party/{2, 2and3, 3}/{jinja2,}
Ideally, a ScholarlyArticle can also be published as HTML with RDFa and/or JSONLD (in addition to two column LaTeX/PDF which is lossy in regards to structured data / linked data) with its own document-level metadata simply as part of a graph of resources (such as schema:citation and schema:Datasets) described using a search-indexed vocabulary such as the RDFS vocabulary.
An aside: has a range of {Text, URL} where the Text should be a 3 character UN/CEFACT Common Code; but there's also QUDT for unit URIs; fortunately, RDF allows repeated property values, so we can just add both.
On Wednesday, July 4, 2018, Wes Turner <> wrote:

On Wed, Jun 27, 2018 at 05:20:01PM -0400, Andrei Kucharavy wrote: [...]
Why does this have to be a dunder method? In general, application code shouldn't be calling dunders directly, they're reserved for Python. I think your description of what this method should do is not really coherent. On the one hand, you have __citation__() be a method that you call (how?) but on the other hand you have it being a data field __citation__ that you scan. Which is it?
I do think you have identified an important feature, but I think this is a *tool*, not a *language feature*. My spur of the moment thought is:
- we could have a script (a third party script? or in the std lib?) which the user calls, giving the name of their module or package as argument e.g. "python -m cite"
- this script knows how to analyse for a list of dependencies, perhaps filtering out standard library packages;
- it interrogates myapplication, and each dependency, for a citation;
- this might involve reserving a standard __citation__ data field in each module, or a __citation__.xml file in the package, or some other protocol;
- or perhaps the cite script knows how to generate the appropriate citation itself, from any of the standard formatted data fields found in many common modules, like __author__, __version__ etc.
- either way, the script would generate a list of packages and modules used by myapplication, plus citations for them.
Presumably you would need to be able to specify which citation style to use. The point is, the *grunt work* of generating the citations is just a script. It isn't a language feature. It might not even be in the std lib (although perhaps we could ship it as a standard Python script, like the compileall module and a few other tools, starting in version 3.8). The protocol of how the script works out the citations can be developed. Perhaps we could reserve a __citation__ dunder as a de facto standard data field, like people already use __author__ and __version__ and similar. Or it could look for a separate XML or TXT file in the package directory.
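The script outlined in the list above could start out as something like the following sketch. It assumes the `__citation__` data-field protocol (one of several options mentioned, none of them settled), finds dependencies by parsing import statements only, and uses `sys.stdlib_module_names` (available from Python 3.10) to filter out the standard library. The names `imported_top_level` and `cite` are invented for this example.

```python
import ast
import importlib
import sys


def imported_top_level(source):
    """Return the set of top-level module names imported by the given source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])  # relative imports are skipped
    return names


def cite(source):
    """Map each importable dependency to its __citation__ value, or None."""
    # On Python < 3.10 the stdlib filter is unavailable and nothing is removed.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    report = {}
    for name in sorted(imported_top_level(source) - set(stdlib)):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # dependency not installed in this environment
        report[name] = getattr(module, "__citation__", None)
    return report
```

Formatting the collected values in a particular citation style would be a separate step, and dependencies whose value is `None` are exactly the ones the user would have to look up by hand.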
What does this have to do with either import or setup?
A long time ago, I added a feature request for a page in the documentation to show how to cite Python in various formats. I don't believe there has been any progress on this. (I certainly don't know the right way to cite software.) Perhaps this can be merged with your idea. Should Python have a standard sys.__citation__ field that provides the relevant detail in some format-independent, machine-readable object like a named tuple? Then this hypothetical tool could read the tuple and format it according to any citation style.
-- Steve
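To illustrate the named-tuple idea, here is a small sketch of a format-independent record and two formatters. No `sys.__citation__` attribute exists; the `Citation` fields, the formatter names, and the example values are all placeholders rather than a real reference for Python or any package.

```python
from typing import NamedTuple, Tuple


class Citation(NamedTuple):
    """Hypothetical machine-readable citation record."""
    authors: Tuple[str, ...]
    title: str
    year: int
    url: str


def format_plain(c):
    """Render the record as a single plain-text line."""
    return f"{', '.join(c.authors)} ({c.year}). {c.title}. {c.url}"


def format_bibtex(c, key):
    """Render the record as a BibTeX @misc entry with the given key."""
    return (f"@misc{{{key},\n"
            f"  author = {{{' and '.join(c.authors)}}},\n"
            f"  title = {{{c.title}}},\n"
            f"  year = {{{c.year}}},\n"
            f"  url = {{{c.url}}}\n"
            f"}}")


# Placeholder values only; not an actual reference.
EXAMPLE = Citation(("A. Author", "B. Author"), "Example Software", 2018,
                   "https://example.org")
```

The record itself carries no style information, so supporting another citation style only means adding another formatter function.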
participants (18)
Adrian Price-Whelan
Alex Walters
Alexander Belopolsky
Andrei Kucharavy
Antoine Pitrou
Chris Barker - NOAA Federal
David Mertz
Guido van Rossum
Ivan Levkivskyi
Matt Arcidy
Nathan Goldbaum
Nathaniel Smith
Nick Coghlan
Nick Timkovich
Steve Barnes
Steven D'Aprano
Wes Turner