[Python-ideas] Add a __cite__ method for scientific packages

Andrei Kucharavy andrei.kucharavy at gmail.com
Thu Jun 28 17:25:00 EDT 2018


That's a lot of responses, thanks for the interest and the suggestions!


> Are there other languages or software communities that do something like
> this? It would be nice not to have to invent this wheel. Eventually a PEP
> and an implementation should be presented, but first the idea needs to be
> explored more.


To my knowledge, R is the only language that implements such a feature.
Package developers add a CITATION text file containing a citation for
their package in whatever format they prefer. A specialized citation()
built-in function can be called from the REPL; it returns a citation for R
itself, including a BibTeX entry for LaTeX users. When citation() is called
on a package instead (e.g. citation("ggplot2")), it returns the contents of
CITATION for that package specifically or, failing that, uses the package
metadata to build a sane citation. Given that most work with R is done
within a REPL and packages are installed and loaded with commands such as
install.packages("ggplot2")/library("ggplot2"), this approach makes sense in
that context. It did not, however, feel terribly Pythonic to me.

As for a PEP and a reference implementation, I will gladly take care of them
if the idea gets enough traction, but there seems to be already a PEP draft
as well as an attempt at implementation by one of the AstroPy/AstroML
maintainers, using the __citation__ field and citation() function to unpack
it:

https://github.com/adrn/CitationPEP

There also seem to be some packages in the community using __bibtex__ rather
than __citation__ to store BibTeX entries, but I have not yet found any large
project implementing it or any PEP drafts associated with it.


> The Software Sustainability Institute in the UK have written several blog
> posts advocating the use of CITATION files containing this sort of metadata:
> https://software.ac.uk/blog/2017-12-12-standard-format-citation-files


Yes, that's the R approach I presented above. It is viable, especially if
hooked to something accessible from the REPL directly, such as a __cite__ or
__citation__ attribute/method for modules. I would, however, advocate for a
more structured approach - perhaps JSON or BibTeX that would get parsed and
converted to a suitable citation format by __cite__, if it were
implemented as a method.
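
To make the structured approach concrete, here is a minimal sketch. The
record layout, the field names and the cite()/to_bibtex() helpers are all
hypothetical and only illustrate the idea of parsing stored JSON and
rendering it on request; nothing here is an existing API.

```python
import json

# Hypothetical citation record, as a package author might ship it.
CITATION_JSON = """
{
  "id": "doe2018example",
  "type": "article",
  "author": ["Doe, Jane", "Roe, Richard"],
  "title": "An Example Package for Illustration",
  "journal": "Journal of Examples",
  "year": 2018
}
"""


def to_bibtex(record):
    """Render a citation record (a dict) as a BibTeX entry."""
    # Everything except the key and the entry type becomes a BibTeX field.
    fields = {k: v for k, v in record.items() if k not in ("id", "type")}
    # BibTeX separates multiple authors with the word "and".
    if isinstance(fields.get("author"), list):
        fields["author"] = " and ".join(fields["author"])
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields.items())
    return f"@{record['type']}{{{record['id']},\n{body}\n}}"


def cite(raw_json, fmt="bibtex"):
    """Parse the stored JSON and return it in the requested format."""
    record = json.loads(raw_json)
    if fmt == "bibtex":
        return to_bibtex(record)
    if fmt == "json":
        return json.dumps(record, indent=2)
    raise ValueError(f"unsupported citation format: {fmt!r}")


if __name__ == "__main__":
    print(cite(CITATION_JSON, "bibtex"))
```

Keeping the stored record format-neutral means that supporting another
export format would only require one more rendering function.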

> A github code search for __citation__ also gets 127 hits that mostly seem
> to be research software that are using this attribute more or less as
> suggested here:
> https://github.com/search?q=__citation__&type=Code


Most of them are from the AstroPy universe or from the CitationPEP draft
I've referenced above.

> This is indeed a serious problem. I suspect python-ideas isn't the
> best venue for addressing it though – there's nothing here that needs
> changes to the Python interpreter itself (I think), and the people who
> understand this problem the best and who are most affected by it,
> mostly aren't here.


There has been localized discussion popping up among the large scientific
package maintainers and some attempts to solve the problem at the local
level. Until now they seemed to be winding down due to a lack of a
large-scale citation mechanism and a discussion about what is concretely
doable at the scale of the language is likely to finalize

As for the list, reserving a __citation__/__cite__ name for packages, at the
same level as __version__ is reserved now, and adding a citation()/cite()
function to the standard library seemed like large enough modifications to
warrant seeking buy-in from the maintainers and the community at large.

> You'll want to check out the duecredit project:
> https://github.com/duecredit/duecredit
> One of the things they've thought about is the ability to track
> citation information at a more fine-grained way than per-package – for
> example, there might be a paper that should be cited by anyone who
> calls a particular method (or even passes a specific argument to some
> specific method, when that turns on some fancy algorithm).


duecredit looks amazing - I will definitely check it out. The idea was,
however, to bring the barrier for adoption and usage as low as possible. In
my experience, the vast majority of Python users in academic environments
who aren't citing packages properly are beginners. As such, they are
unlikely to search for third-party libraries beyond those they've found and
used to solve their specific problem.

Typically, these are people who just assembled a pipeline based on
widely-used libraries and need to generate a citation list for it to pass
on to their colleagues responsible for the paper assembly and submission.

> I'd actually like to see a more general solution that isn't restricted
> to any one language, because multi-language analysis pipelines are
> very common. For example, we could standardize a convention where if a
> certain environment variable is set, then the software writes out
> citation information to a certain location, and then implement
> libraries that do this in multiple languages. Of course, that's a
> "dynamic" solution that requires running the software -- which is
> probably necessary if you want to do fine-grained citations, but it
> might be useful to also have static metadata, e.g. as part of the
> package metadata that goes into sdists, wheels, and on PyPI. That
> would be a discussion for the distutils-sig mailing list, which
> manages that metadata.


Thanks for the reference to the distutils-sig list. I will talk to them if
the idea gets traction here.
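
For reference, the environment-variable convention suggested in the quoted
message could look roughly like the sketch below. The variable name
CITATION_OUTPUT_FILE, the tab-separated line format and the
report_citation() helper are assumptions made up for illustration; no such
standard exists as far as I know.

```python
import os

# Hypothetical name of the environment variable that switches reporting on.
CITATION_ENV_VAR = "CITATION_OUTPUT_FILE"


def report_citation(package, citation):
    """Append a citation to the file named by the environment variable.

    Returns True if something was written, False if reporting is disabled.
    """
    path = os.environ.get(CITATION_ENV_VAR)
    if not path:
        return False  # the user did not ask for citation reporting
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{package}\t{citation}\n")
    return True
```

A library would call report_citation() at import time or inside the
specific function that implements a published method, which is what makes
the fine-grained, per-call tracking possible.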

I am not entirely convinced by the multi-language pipeline argument. In
bioinformatics, the heavy lifting is often done by a single package (for
instance bowtie for RNA-seq alignment) and the output is piped to a
custom script, mostly in R or Python. The library doing the heavy lifting
is usually well known and widely cited; the issues arise in the custom
scripts, which import and use libraries that should be cited without
citing them.

> One challenge in standardizing this kind of thing is choosing a
> standard way to represent citation information. Maybe CSL-JSON?
> There's a lot of complexity as you dig into this, though of course one
> shouldn't let the perfect be the enemy of the good...


CSL-JSON represented as a dict to be supplied to the setup file is
definitely one way of doing it. I was, however, thinking more about the
BibTeX format, given that CSL-JSON is more closely affiliated with Mendeley.

> Why does this have to be a dunder method? In general, application code
> shouldn't be calling dunders directly, they're reserved for Python.


I was under the impression that dunders are sometimes used to store
relevant information that would not be of use to most users, such as
__version__, and sometimes to better control the execution flow (for
instance if __name__ == "__main__").

> I think your description of what this method should do is not
> really coherent. On the one hand, you have __citation__() be a method
> that you call (how?) but on the other hand you have it being a data
> field __citation__ that you scan.


My initial idea was to have a __cite__ method embedded in the import
mechanism that would parse data from config and, upon a call on a
package, return the citation the developers want to see associated with the
current package version, in the format the user needs (for instance,
numpy.__cite__('bibtex') would return a citation for the current numpy
version in BibTeX format). If called on the script itself, __cite__('bibtex')
would iterate through all the imported modules and retrieve their citations
one by one, at least for those modules that have an associated citation.

After reading the feedback in this thread, I believe that a __citation__
reserved field that pulls the data from the setup script and a cite()
script in the standard library would be a better approach.
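
As a rough sketch of that second variant: assuming packages expose a
module-level __citation__ string (a convention that is only proposed
here, not an existing one), a cite() helper could simply walk the modules
that have already been imported. The function below is hypothetical and is
not part of the standard library.

```python
import sys
import types


def cite(modules=None):
    """Collect __citation__ strings from imported top-level modules.

    `modules` maps names to module objects and defaults to sys.modules.
    """
    if modules is None:
        modules = sys.modules
    citations = {}
    for name, module in sorted(modules.items()):
        if "." in name:
            continue  # report packages once, not every submodule
        citation = getattr(module, "__citation__", None)
        if citation:
            citations[name] = citation
    return citations


if __name__ == "__main__":
    # Stand-in for a real scientific package that ships a citation.
    fake = types.ModuleType("fakepkg")
    fake.__citation__ = "@misc{fakepkg, title = {Fake Package}}"
    sys.modules["fakepkg"] = fake
    for name, entry in cite().items():
        print(f"{name}: {entry}")
```

Modules without the attribute are skipped silently, so the helper stays
usable while only a few packages have adopted the convention.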

In the end, I believe the best would be to implement both of them and see
which one feels more Pythonic.

> I do think you have identified an important feature, but I think this is
> a *tool*, not a *language feature*. My spur of the moment thought is:
> - we could have a script (a third party script? or in the std lib?)
>   which the user calls, giving the name of their module or package as
>   argument
>   e.g. "python -m cite myapplication.py"
> - this script knows how to analyse myapplication.py for a list of
>   dependencies, perhaps filtering out standard library packages;
> - it interrogates myapplication, and each dependency, for a citation;
> - this might involve reserving a standard __citation__ data field
>   in each module, or a __citation__.xml file in the package, or
>   some other protocol;
> - or perhaps the cite script knows how to generate the appropriate
>   citation itself, from any of the standard formatted data fields
>   found in many common modules, like __author__, __version__ etc.
> - either way, the script would generate a list of packages and
>   modules used by myapplication, plus citations for them.


Yes, that's the idea! The biggest reason for me to send the discussion to
this list is to check if it would be acceptable to reserve the __citation__
data field in each module and include the cite() script in the standard
library.
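
The dependency-analysis step of such a script could be sketched as follows.
This is only an illustration under a few assumptions: it inspects import
statements statically with the ast module, ignores relative and dynamic
imports, and relies on sys.stdlib_module_names (available from Python 3.10)
to filter out the standard library; on older versions nothing is filtered.
The function names are hypothetical.

```python
import ast
import sys


def imported_top_level_names(source):
    """Return the top-level package names imported by some source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative ones are skipped.
            names.add(node.module.split(".")[0])
    return names


def third_party_imports(source):
    """Drop standard library modules where the interpreter can list them."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(imported_top_level_names(source) - set(stdlib))


if __name__ == "__main__":
    # Usage: python cite_sketch.py myapplication.py
    with open(sys.argv[1], encoding="utf-8") as handle:
        for name in third_party_imports(handle.read()):
            print(name)
```

The resulting list of names is what the script would then interrogate for
a __citation__ field or some other agreed protocol.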

> Presumably you would need to be able to specify which citation style to
> use.


Yes, but to avoid building a configurable citation engine for the thousands
of formats there are in the wild, it would support only a couple of
standard, interchangeable formats, such as BibTeX or EndNote xref - both
text formats that are simple to use. I was thinking about the approach
taken by Google Scholar from that perspective.

>> What does Python core team think about addition and long-term maintenance
>> of such a feature to the import and setup mechanisms?
> What does this have to do with either import or setup?


The implementation I was thinking about would have required
__citation__/__cite__ dunder reservation or implementation of a function
that would be injected into installed packages. For setup, I was thinking
about adding a citation field to the distutils setup. I was not really
aware of the distutils-sig discussion list, which would be more appropriate
in that regard.
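
Until a dedicated citation field exists, a tool could fall back on the
metadata that packages already declare at setup time. The sketch below
reads the standard Name, Version, Author and Home-page fields through
importlib.metadata (Python 3.8+); the output layout and the helper names
are my own assumptions, not an established citation style.

```python
from importlib import metadata


def fallback_citation(meta):
    """Build a plain-text citation from core package metadata.

    `meta` is any mapping with a .get() method, such as the object
    returned by importlib.metadata.metadata() or a plain dict.
    """
    author = meta.get("Author") or "Unknown author"
    name = meta.get("Name") or "unknown package"
    text = f"{author}. {name}"
    version = meta.get("Version")
    if version:
        text += f", version {version}"
    url = meta.get("Home-page")
    if url:
        text += f". {url}"
    return text


def cite_installed(distribution_name):
    """Look up an installed distribution and cite it from its metadata."""
    return fallback_citation(metadata.metadata(distribution_name))
```

A real citation field in the setup metadata would take precedence over
this fallback whenever a package provides one.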

> A long time ago, I added a feature request for a page in the
> documentation to show how to cite Python in various formats:
> https://bugs.python.org/issue26597
> I don't believe there has been any progress on this. (I certainly don't
> know the right way to cite software.) Perhaps this can be merged with
> your idea.


That's a good point. Unfortunately, I have not thought about how to cite
code that does not have an associated publication. From what I see by
checking Google Scholar, as of now people cite the Python language
reference manual if they want to cite Python itself in a scientific
publication. GvR didn't seem interested in citations for Python and, from
what I understand, neither are the vast majority of non-scientific package
developers, given that citations are not essential for their career
advancement.

> Should Python have a standard sys.__citation__ field that provides the
> relevant detail in some format-independent, machine-readable object like
> a named tuple? Then this hypothetical cite.py tool could read the tuple
> and format it according to any citation style.


The idea for Python itself seems good! However, rather than using a named
tuple, I was thinking about using a dict consistent with CSL-JSON or
BibTeX. Writing a citation-generating engine that would be consistent
with hundreds if not thousands of journal-specific formats is a bit out of
the scope of the proposal for now - most of the time people just want
something their citation/bibliography engine can ingest to generate a
citation in their Word/LaTeX documents. BibTeX/EndNote export formats are
perfect for that task in my experience.

>
> just thought that it might be worth pointing out that this should
> actually work both ways i.e. if a specific package, module or function
> is inspired by or directly implements the methods included in a specific
> publication then any __citation__ entries within it should also cite
> that/those or allow references to them to be recovered.
> The general principle is if you are expecting to be cited you also have
> to cite.


The general convention is to cite the top-level publication. While some
methods definitely deserve a citation of their own (such as the Sobel filter
in scikit-image), packages provide a link to the relevant citation in the
documentation of those methods and would normally cite them in their master
publication. That's definitely an idea to look at, but I don't see a
straightforward way of implementing it so far.

> I think this is a fine idea, but could be achieved by convention, like
> __version__, rather than by fiat.
> And it’s certainly not a language feature.
> So Nathaniel’s right — the thing to do now is work out the convention,
> and then advocate for it.


This already seems to be an idea floating in the air - AstroPy is inching
towards that implementation. The idea is to modify the language to make
citing as straightforward as possible and create a universal mechanism for
that.

Best,

Andrei Kucharavy

Post-Doc @ Joel S. Bader Lab

Johns Hopkins University, Baltimore, USA.


On Thu, Jun 28, 2018 at 11:48 AM Chris Barker - NOAA Federal via
Python-ideas <python-ideas at python.org> wrote:

> I think this is a fine idea, but could be achieved by convention, like
> __version__, rather than by fiat.
>
> And it’s certainly not a language feature.
>
> So Nathaniel’s right — the thing to do now is work out the convention,
> and then advocate for it.
>
> -CHB