Hi all,
I was hoping to get some feedback on a proposed refactoring of the
datetime module that should dramatically improve import
performance.
The datetime module is implemented more or less in full both in
pure Python and in C. Currently the pure Python implementation is
defined in datetime.py and the C implementation in _datetime. After
the full Python version has been defined, the C version is
star-imported, so any symbols defined in both versions are taken
from the C version. If the C version is used, any private symbols
used only by the pure Python implementation are then manually
deleted (see the end of the file).
This adds a lot of unnecessary overhead, both to define a bunch
of unused classes and functions and to import modules that are
required for the pure Python implementation but not for the C
implementation. In the issue he created
about this, Victor Stinner demonstrated that moving the pure
Python implementation to its own module would speed up the import
of datetime by a factor of 4.
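For anyone who hasn't looked at the end of datetime.py recently, here
is a minimal, self-contained sketch of the pattern described above.
The names `_fastimpl`, `combined`, `add` and `_helper` are stand-ins I
made up for illustration; only the try / star-import / del shape is
meant to mirror what datetime.py does.

```python
import sys
import types

# Stand-in for the C accelerator (hypothetical name, not the real _datetime).
fast = types.ModuleType("_fastimpl")
fast.__all__ = ["add"]
fast.add = lambda a, b: ("fast", a + b)
sys.modules["_fastimpl"] = fast

# Stand-in for the body of a module laid out the way datetime.py is today.
SOURCE = '''
def _helper(a, b):           # private, needed only by the pure Python version
    return a + b

def add(a, b):               # pure Python implementation, always defined first
    return ("pure", _helper(a, b))

try:
    from _fastimpl import *  # replaces every name defined in both versions
except ImportError:
    pass
else:
    del _helper              # clean up what the fast version never uses
'''

combined = types.ModuleType("combined")
exec(SOURCE, combined.__dict__)

print(combined.add(1, 2))            # ('fast', 3)
print(hasattr(combined, "_helper"))  # False
```

Note that even when the fast version wins, the pure Python `add` and
`_helper` were still defined before being thrown away, which is
exactly the overhead in question.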
I think that we should indeed move the pure Python implementation
into its own module, despite the fact that this is almost
guaranteed to break some people either relying on implementation
details or doing something funky with the import system — I don't
think it should break anyone relying on the guaranteed public
interface. The issue at hand is that we have two options available
for the refactoring: either move the pure Python implementation to
its own private top-level module (single file) such as
`_pydatetime`, or make `datetime` a folder with an `__init__.py`
and move the pure Python implementation to `datetime._pydatetime`
or something of that nature.
The decimal and zoneinfo modules both have this same issue. The
decimal module uses the first strategy, with _pydecimal alongside
decimal, while the zoneinfo module uses a folder with a
zoneinfo._zoneinfo submodule. Assuming we go forward with this, we
need to decide which strategy to adopt for datetime.
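To make the first (single-file) strategy concrete, here is a rough
sketch of the shape the public module could take under it. This is an
illustration under my own assumptions, not the code in any PR:
`dtdemo`, `_pydtdemo` and `_cdtdemo` are hypothetical names standing in
for datetime, _pydatetime and _datetime, and the C accelerator is faked
with an in-memory module so the snippet runs anywhere.

```python
import importlib
import pathlib
import sys
import tempfile
import types

tmp = pathlib.Path(tempfile.mkdtemp())

# Pure Python implementation in its own private top-level module.
(tmp / "_pydtdemo.py").write_text(
    "__all__ = ['add']\n"
    "def add(a, b):\n"
    "    return ('pure', a + b)\n"
)

# The public module is now just a thin shim that picks one implementation.
(tmp / "dtdemo.py").write_text(
    "try:\n"
    "    from _cdtdemo import *\n"
    "except ImportError:\n"
    "    from _pydtdemo import *\n"
)

sys.path.insert(0, str(tmp))
importlib.invalidate_caches()

# Fake accelerator module, registered directly in sys.modules.
accel = types.ModuleType("_cdtdemo")
accel.__all__ = ["add"]
accel.add = lambda a, b: ("fast", a + b)
sys.modules["_cdtdemo"] = accel

import dtdemo

fast_result = dtdemo.add(1, 2)
pure_loaded_with_accel = "_pydtdemo" in sys.modules

# A None entry in sys.modules makes the import raise ImportError,
# which simulates a build without the accelerator.
sys.modules["_cdtdemo"] = None
dtdemo = importlib.reload(dtdemo)
pure_result = dtdemo.add(1, 2)

print(fast_result, pure_loaded_with_accel, pure_result)
```

The point of the sketch is the middle check: when the accelerator is
available, the pure Python module is never imported at all.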
In favor of using a datetime/ folder, I'd say it's cleaner to put
the pure Python implementation of datetime under the datetime
namespace, and also it gives us more freedom to play with the
module's structure in the future, since we could have
lazily-imported sub-components, or we could implement some logic
common to both implementations in Python and import it from a
`datetime._common` module without requiring the C version to
import the entire Python version, similar to the way zoneinfo has
the zoneinfo._common module.
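As a rough idea of what the folder layout would allow, here is a
sketch that builds a throwaway package on disk. All names in it
(`dtpkg_demo`, `_cimpl`, `_pyimpl`, `_common`, `extras`, `make_date`,
`check_year`) are hypothetical, and the lazy loading uses a
module-level `__getattr__` (PEP 562, Python 3.7+), which is just one
way this could be done.

```python
import importlib
import pathlib
import sys
import tempfile
import textwrap

root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "dtpkg_demo"
pkg.mkdir()

files = {
    "_common.py": """
        # Logic shared by both implementations.
        def check_year(year):
            if not 1 <= year <= 9999:
                raise ValueError("year out of range")
            return year
    """,
    "_pyimpl.py": """
        from ._common import check_year

        __all__ = ["make_date"]

        def make_date(year):
            return ("pure", check_year(year))
    """,
    "extras.py": """
        VALUE = 42
    """,
    "__init__.py": """
        import importlib

        try:
            from ._cimpl import *    # hypothetical accelerator, absent here
        except ImportError:
            from ._pyimpl import *

        def __getattr__(name):       # PEP 562: only called for missing names
            if name == "extras":
                return importlib.import_module(".extras", __name__)
            raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
    """,
}
for filename, source in files.items():
    (pkg / filename).write_text(textwrap.dedent(source))

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import dtpkg_demo

loaded_eagerly = "dtpkg_demo.extras" in sys.modules    # before first use
result = dtpkg_demo.make_date(2020)
value = dtpkg_demo.extras.VALUE                        # triggers the import
loaded_after_use = "dtpkg_demo.extras" in sys.modules  # after first use

print(loaded_eagerly, result, value, loaded_after_use)
```

Here `_common` is imported by the pure Python implementation without
the package having to load anything else, and `extras` is only
imported the first time someone touches it.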
The downside of the folder method is that it complicates the way
datetime is imported — especially if we add additional
structure to the module, or add any logic into the __init__.py.
Having two single-file modules side by side, one imported by the
other, doesn't change anything about how the datetime module is
imported, and is much less likely to break anything.
Anyone have thoughts or strong preferences here? Anyone have use
cases where one approach or the other is likely to cause a bunch
of undue hardship? I'd like to avoid moving this more than once.
Best,
Paul
P.S. Victor's PR moving this code to _pydatetime is currently done
in such a way that changes can still be backported from
post-refactoring to pre-refactoring branches. I have not checked,
but I think we should be able to do the same thing with the other
strategy as well.