[Python-ideas] Add a __cite__ method for scientific packages

Nathaniel Smith njs at pobox.com
Thu Jun 28 22:14:33 EDT 2018

On Thu, Jun 28, 2018 at 2:25 PM, Andrei Kucharavy
<andrei.kucharavy at gmail.com> wrote:
>> This is indeed a serious problem. I suspect python-ideas isn't the
>> best venue for addressing it though – there's nothing here that needs
>> changes to the Python interpreter itself (I think), and the people who
>> understand this problem the best and who are most affected by it,
>> mostly aren't here.
> There has been localized discussion popping up among the large scientific
> package maintainers and some attempts to solve the problem at the local
> level. Until now they seemed to be winding down due to a lack of a
> large-scale citation mechanism and a discussion about what is concretely
> doable at the scale of the language is likely to finalize

Those are the people with the most motivation and expertise to solve
this, and whose buy-in you'll need on any solution. If they haven't
solved it yet themselves, then there are basically two reasons why
that happens: either because they're busy and no-one's had enough time
to work on it, or else because they're uncertain about the best path
forward. Neither of these is a problem that python-ideas can help
with. If you want to be effective here, you need to talk to them to
figure out how you can help them move forward.

If I were you, I'd try organizing a birds-of-a-feather at the next
SciPy conference, or start getting in touch with others working on
this (duecredit devs, the folks listed on that citationPEP thing,
etc.), and go from there. (Feel free to CC me if you do start up some
effort like this.)

> As for the list, reserving __citation__/__cite__ for packages in the same
> way that __version__ is now reserved, and adding a citation()/cite() function
> to the standard library, seemed like large enough modifications to warrant
> seeking buy-in from the maintainers and the community at large.

There isn't actually any formal method for registering special names
like __version__, and they aren't treated specially by the language.
They're just variables that happen to have a funny name. You shouldn't
start using them willy-nilly, but you don't actually have to ask
permission or anything. And it's not very likely that someone else
will come along and propose using the name __citation__ for something
that *isn't* a citation :-).
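
To illustrate the point about special names, here is a minimal sketch,
assuming a hypothetical package that chooses to expose a __citation__
string. Neither the attribute name nor the get_citation helper is an
established convention, and the module is built in memory only to keep the
example self-contained.

```python
import types

# Hypothetical package built in memory; a real one would simply assign
# these names at the top of its __init__.py.
fakepkg = types.ModuleType("fakepkg")
fakepkg.__version__ = "1.0"
fakepkg.__citation__ = "Doe, J. (2018). fakepkg: an example package."  # placeholder text


def get_citation(module):
    """Return the module's __citation__ attribute, or None if it has none."""
    # Plain attribute lookup: the interpreter gives module-level names
    # like these no special treatment.
    return getattr(module, "__citation__", None)


print(get_citation(fakepkg))
print(get_citation(types))  # stdlib module without the attribute -> None
```

Nothing here needs interpreter support, which is why a convention like this
can be tried out in third-party packages first.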

>> You'll want to check out the duecredit project:
>> https://github.com/duecredit/duecredit
>> One of the things they've thought about is the ability to track
>> citation information in a more fine-grained way than per-package – for
>> example, there might be a paper that should be cited by anyone who
>> calls a particular method (or even passes a specific argument to some
>> specific method, when that turns on some fancy algorithm).
> Due credit looks amazing - I will definitely check it out. The idea was,
> however, to bring the barrier for adoption and usage as low as possible. In
> my experience, the vast majority of Python users in academic environments who
> aren't citing the packages properly are beginners. As such they are unlikely
> to search for third-party libraries beyond those they've found and used to
> solve their specific problem.
>  who just assembled a pipeline based on widely-used libraries and would need
> to generate a citation list for it to pass on to their colleagues
> responsible for the paper assembly and submission.

The way to do this is to first get your solution implemented as a
third-party library and adopted by the scientific packages, and then
start thinking about whether it would make sense to move the library
into the standard library. It's relatively easy to move things into
the standard library. The hard part is making sure that you
implemented the right thing in the first place, and that's MUCH more
likely if you start out as a third-party package.

>> I'd actually like to see a more general solution that isn't restricted
>> to any one language, because multi-language analysis pipelines are
>> very common. For example, we could standardize a convention where if a
>> certain environment variable is set, then the software writes out
>> citation information to a certain location, and then implement
>> libraries that do this in multiple languages. Of course, that's a
>> "dynamic" solution that requires running the software -- which is
>> probably necessary if you want to do fine-grained citations, but it
>> might be useful to also have static metadata, e.g. as part of the
>> package metadata that goes into sdists, wheels, and on PyPI. That
>> would be a discussion for the distutils-sig mailing list, which
>> manages that metadata.
> Thanks for the reference to the distutils-sig list. I will talk to them if
> the idea gets traction here

I think you misunderstand how these lists work :-). (Which is fine --
it's actually pretty opaque and confusing if you don't already know!)
Generally, distutils-sig operates totally independently from
python-{ideas,dev} -- if you have a packaging proposal, it goes there
and not here; if you have a language proposal, it goes here and not
there. *If* what you want to do is add some static metadata to python
packages through setup.py, then python-ideas is irrelevant and
distutils-sig is who you'll have to convince. (But they'll also want
to see that your proposal has buy-in from established packages,
because they don't understand the intricacies of software citation and
will want people they trust to tell them whether the proposal makes

> I am not entirely convinced for the multi-language pipelines. In
> bioinformatics, often the heavy lifting is done by a single package (for
> instance bowtie for RNA-seq alignment) and the output is piped to the custom
> script, mostly in R or Python. The citation for the library doing the
> heavy lifting is often well known and widely cited, and the issues arise in
> the custom scripts that import and use libraries that should be cited
> without citing them.

And often the custom scripts are a mix of R and Python, and maybe some
Fortran, ... Plus, if it works for multiple languages, it means you
get to share part of the work with other ecosystems, instead of
everyone reinventing the wheel.

Also, if you want to go down the dynamic route (which is the only way
to get accurate fine-grained citations), then it's just as easy to
solve the problem in a language-independent way.
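
A rough sketch of what the dynamic, environment-variable convention could
look like on the Python side. Everything named here is an assumption made
for illustration: the CITATION_LOG variable, the record_citation and cites
helpers, and the one-JSON-object-per-line file format. None of it is an
existing standard or the duecredit API.

```python
import functools
import json
import os

CITATION_ENV_VAR = "CITATION_LOG"  # hypothetical name, not an existing standard


def record_citation(entry):
    """Append one citation record as a JSON line, if the env var is set."""
    path = os.environ.get(CITATION_ENV_VAR)
    if not path:
        return False  # convention not enabled: do nothing
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return True


def cites(entry):
    """Decorator: record `entry` each time the wrapped function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record_citation(entry)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Placeholder record; a real library would attach its actual reference.
@cites({"id": "fancy2018", "title": "Placeholder title for a fancy algorithm"})
def fancy_algorithm(x):
    return x * 2
```

Since the record is just a line of JSON appended to a file named by an
environment variable, the R or Fortran parts of a pipeline could write to
the same file with a few lines of their own.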

>> One challenge in standardizing this kind of thing is choosing a
>> standard way to represent citation information. Maybe CSL-JSON?
>> There's a lot of complexity as you dig into this, though of course one
>> shouldn't let the perfect be the enemy of the good...
> CSL-JSON represented as a dict to be supplied to the setup file is
> definitely one way of doing it. I was, however, thinking more about the
> BibTeX format, given that CSL-JSON is more closely affiliated with Mendeley

Huh, is it? I only know it from Zotero.
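
For anyone comparing the two formats, here is a rough sketch of one citation
as a CSL-JSON style dict, plus a naive conversion to a BibTeX entry. The
record is a placeholder rather than a real reference, and csl_to_bibtex is a
hypothetical helper that handles only a few fields (no escaping, and only a
trivial type mapping).

```python
# Placeholder record using common CSL-JSON field names.
csl_item = {
    "id": "doe2018example",
    "type": "article-journal",
    "title": "An example package",
    "author": [
        {"family": "Doe", "given": "Jane"},
        {"family": "Roe", "given": "Richard"},
    ],
    "container-title": "Journal of Examples",
    "issued": {"date-parts": [[2018]]},
}


def csl_to_bibtex(item):
    """Naive CSL-JSON -> BibTeX conversion covering only a few fields."""
    entry_type = "article" if item.get("type") == "article-journal" else "misc"
    authors = " and ".join(
        f"{a['family']}, {a['given']}" for a in item.get("author", [])
    )
    fields = {
        "title": item.get("title"),
        "author": authors or None,
        "journal": item.get("container-title"),
        "year": item.get("issued", {}).get("date-parts", [[None]])[0][0],
    }
    # Emit only the fields that are actually present.
    body = ",\n".join(
        f"  {key} = {{{value}}}" for key, value in fields.items() if value is not None
    )
    return f"@{entry_type}{{{item['id']},\n{body}\n}}"


print(csl_to_bibtex(csl_item))
```

Going the other way (BibTeX to CSL-JSON) is messier, which is part of the
complexity mentioned above.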


Nathaniel J. Smith -- https://vorpus.org
