Hello,
Here's a proposal to fix several niggles we found when distributing Python libraries in Fedora. What do you think? Do you face similar issues in other distros?
You can also discuss at: https://discuss.python.org/t/pysource-file-layout-for-installed-modules/145…
Abstract
========
For modules loaded directly from bytecode cache (``*.pyc``) files, Python will
look for corresponding source in a ``__pysource__`` directory.
The existing ability to load modules from ``*.pyc`` files *only* is
unchanged, but conceptually it becomes a special case of a “pyc-first”
file layout.
Motivation
==========
Most pure Python code is installed as a source file (``*.py``), combined with a
bytecode cache file (``__pycache__/*.pyc``), which is created/updated ahead of
time or on demand.
This layout is designed for rapid iteration. Each time a module is imported,
Python assumes the source might have changed: if a bytecode cache is present,
Python normally checks whether it still corresponds to the source.
:pep:`552` introduced an “unchecked” mode, in which this check is skipped.
However, this causes updates to the source to be silently ignored, possibly
confusing users who aren't aware of this rarely used mode.
The remaining checking modes have their own disadvantages.
In both, even in the best-case scenario (the cache is present and fresh),
Python must access at least two files (the source and the cache). Further:
* In the timestamp-based mode, the source file's last-modification time is
  used as part of the cache key, causing issues with reproducible builds
  as described in :pep:`552`.
* In the hash-based mode, the entire source file is read and hashed.
  This is potentially a slow operation. [XXX data needed.]
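The three invalidation modes from :pep:`552` can be told apart by the flags word stored in the ``*.pyc`` header. As a minimal sketch, the snippet below compiles one throwaway file in each mode with the existing ``py_compile`` API and decodes that word. The helper names ``pyc_flags`` and ``describe`` are made up for this illustration.

```python
import py_compile
import struct
import tempfile
from pathlib import Path


def pyc_flags(pyc_path):
    """Return the 32-bit flags word that follows the 4-byte magic number."""
    header = Path(pyc_path).read_bytes()[:16]
    (flags,) = struct.unpack("<I", header[4:8])
    return flags


def describe(flags):
    """Name the invalidation mode encoded in a PEP 552 flags word."""
    if not flags & 0b01:          # lowest bit clear: timestamp-based pyc
        return "timestamp"
    # lowest bit set: hash-based; second bit says whether the source is checked
    return "checked hash" if flags & 0b10 else "unchecked hash"


modes = {}
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.py"
    src.write_text("VALUE = 42\n")
    for mode in py_compile.PycInvalidationMode:
        pyc = Path(tmp) / f"example.{mode.name.lower()}.pyc"
        py_compile.compile(str(src), cfile=str(pyc),
                           invalidation_mode=mode, doraise=True)
        modes[mode.name] = describe(pyc_flags(pyc))

for name, kind in modes.items():
    print(f"{name:15} -> {kind}")
```

In the timestamp case, the eight header bytes after the flags hold the source mtime and size. In the hash-based cases they hold a hash of the source, which is why those files do not depend on when the source was written.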
Another way to install Python modules is to not install the source,
and use the ``*.pyc`` file directly in place of the ``*.py`` file
(removing the Python version tag from the filename and moving the file
out of the ``__pycache__`` directory).
This layout has two main issues:
* The Python version tag is not used, meaning that modules using
  this layout are only usable by a specific version of Python, and
* the source is not available, making the code hard to debug (tracebacks
  and the ``inspect`` module don't show code, and the file is unreadable
  to the human doing the debugging).
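Both the bytecode-only layout and its debugging drawback can be reproduced with today's Python. The sketch below is illustrative only: the module name ``pyc_only_demo`` is invented. It compiles a module to an untagged ``*.pyc`` next to where the source would be, deletes the source, imports the module, and then asks ``inspect`` for the code.

```python
import importlib
import inspect
import py_compile
import sys
import tempfile
from pathlib import Path

NAME = "pyc_only_demo"  # hypothetical module name used only in this sketch

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / f"{NAME}.py"
    src.write_text("def greet():\n    return 'hello'\n")

    # Put the bytecode where the source would normally live: no __pycache__
    # directory and no version tag in the file name. Then drop the source.
    py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")),
                       doraise=True)
    src.unlink()

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module(NAME)
        loaded_from = Path(module.__file__).name
        greeting = module.greet()
        try:
            inspect.getsource(module.greet)
            source_available = True
        except OSError:
            # inspect cannot find any source text for the function
            source_available = False
    finally:
        sys.path.remove(tmp)
        sys.modules.pop(NAME, None)

print(loaded_from, greeting, source_available)
```

The module runs normally, but ``inspect.getsource`` fails with ``OSError``. That failure is the second issue listed above.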
The first issue is usually not relevant, as most installations are tightly
tied to a specific interpreter. [XXX any examples where this isn't the case?]
This PEP proposes to solve the second issue by allowing installers to
distribute the source file alongside the file with the bytecode.
Rationale
=========
The new file layout is optimized for “installed libraries”: third-party
libraries installed on a user's system.
This can include the Python standard library.
We assume that these files will most likely not be edited after installation.
Python will only consult the bytecode file (``*.pyc``) when loading
a module, and not check whether a ``*.py`` file was edited.
We assume that retrieving a module's source is useful, but it is not a
performance-sensitive operation. It is used when displaying tracebacks
or debugging.
This makes it more palatable for distributors to use the resource-intensive
“checked hash” bytecode files and enjoy their benefits (explained in :pep:`552`).
On the other hand, we believe that Python should remain “hackable”: if a
source file is available, it should be possible to modify it and use the
result -- for example, to add a few ``print`` calls to a library for
some quick-and-dirty debugging (in a throwaway virtual environment, of course),
or even to explore the standard library by breaking it.
The proposed file layout makes this relatively straightforward: when the
source (``*.py``) file is moved out of the ``__pysource__`` directory,
Python will ignore the bytecode file and load the source instead, producing
a cache in ``__pycache__``. (This is the existing behavior when both a
``*.py`` and ``*.pyc`` are present for a given name.)
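The “move the source out and edit it” workflow relies only on existing import behavior, so it can be tried today. In the sketch below, the module name ``hackable_demo`` and the helper ``fresh_import`` are invented for illustration, and a current interpreter simply ignores the ``__pysource__`` directory. The sketch first imports from the stand-alone ``*.pyc``, then moves the source next to it, edits it, and imports again.

```python
import importlib
import py_compile
import sys
import tempfile
from pathlib import Path

NAME = "hackable_demo"  # hypothetical module name used only in this sketch


def fresh_import(name):
    """Import *name* again, bypassing sys.modules and finder caches."""
    sys.modules.pop(name, None)
    importlib.invalidate_caches()
    return importlib.import_module(name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stashed = root / "__pysource__" / f"{NAME}.py"
    stashed.parent.mkdir()
    stashed.write_text("def answer():\n    return 'installed'\n")
    py_compile.compile(str(stashed), cfile=str(root / f"{NAME}.pyc"),
                       doraise=True)

    sys.path.insert(0, tmp)
    try:
        before = fresh_import(NAME)
        before_file, before_value = Path(before.__file__).name, before.answer()

        # "Hack" the library: move the source next to the .pyc and edit it.
        edited = root / f"{NAME}.py"
        stashed.replace(edited)
        edited.write_text("def answer():\n    return 'edited'\n")

        after = fresh_import(NAME)
        after_file, after_value = Path(after.__file__).name, after.answer()
    finally:
        sys.path.remove(tmp)
        sys.modules.pop(NAME, None)

print(before_file, before_value)
print(after_file, after_value)
```

After the move, the second import is served from the ``*.py`` file, because the source loader takes precedence over the stand-alone bytecode file of the same name.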
We hope that users who'd like to do this, but aren't familiar
with the proposed mechanics, will notice the extra directory, search the Web
for ``__pysource__`` and find relevant instructions.
The proposed layout makes it easy to omit the source files, which will be
useful in resource-constrained environments (e.g. minimal Linux containers).
Omitting them should not affect non-debug functionality.
Adding the sources to an installation that omits them involves only creating
directories and copying source files to the right places, which is relatively
easy even for non-Python-specific tools (like Linux package managers).
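To illustrate how little is involved, here is a rough sketch of adding sources to an installation that shipped without them. The function ``add_sources`` and the directory names are hypothetical, not part of any existing tool. It assumes the source tree mirrors the installed tree.

```python
import py_compile
import shutil
import tempfile
from pathlib import Path


def add_sources(source_root, install_root):
    """Copy pkg/mod.py to pkg/__pysource__/mod.py wherever pkg/mod.pyc exists.

    Hypothetical helper: it only creates directories and copies files.
    """
    source_root, install_root = Path(source_root), Path(install_root)
    copied = []
    for src in sorted(source_root.rglob("*.py")):
        rel = src.relative_to(source_root)
        pyc = install_root / rel.with_suffix(".pyc")
        if not pyc.is_file():
            continue  # this module is not installed; nothing to annotate
        target = pyc.parent / "__pysource__" / src.name
        target.parent.mkdir(exist_ok=True)
        shutil.copy2(src, target)
        copied.append(target.relative_to(install_root).as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    sources = Path(tmp) / "sources"
    installed = Path(tmp) / "installed"
    (sources / "pkg").mkdir(parents=True)
    (installed / "pkg").mkdir(parents=True)
    (sources / "pkg" / "__init__.py").write_text("")
    (sources / "pkg" / "mod.py").write_text("VALUE = 1\n")
    (sources / "not_installed.py").write_text("VALUE = 2\n")

    # Simulate a source-less installation of the package.
    for name in ("__init__", "mod"):
        py_compile.compile(str(sources / "pkg" / f"{name}.py"),
                           cfile=str(installed / "pkg" / f"{name}.pyc"),
                           doraise=True)

    copied = add_sources(sources, installed)

print(copied)
```

Nothing here is specific to Python tooling, so a distribution package manager could do the same by shipping the ``__pysource__`` files in a separate package.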
This PEP does not propose that any particular distributor or installer
(including Python's build system) should immediately switch to the new layout.
The PEP will be considered implemented when ``importlib`` supports reading the
layout and stdlib tools like ``py_compile`` can generate it. Switching to it
should be a separate decision -- although one that might not need a PEP.
Specification
=============
``importlib.machinery.SourcelessFileLoader``, the loader that handles
stand-alone ``*.pyc`` files, will be renamed to ``BytecodeFileLoader``.
The old name will remain as an alias for the foreseeable future,
with no ``DeprecationWarning``. However, third-party linters and code-quality
tools are encouraged to treat the old name as suboptimal.
The ``get_source_filename`` method of ``BytecodeFileLoader`` will
be changed to return the expected location of an auxiliary source file, e.g.
``dir/__pysource__/module.py`` for ``dir/module.pyc``.
The ``get_source`` method of ``BytecodeFileLoader`` will
check if the auxiliary source file corresponds to the bytecode file
(as returned by ``get_filename``).
.. note::
   This check is done at the time of the call. There is no check that the
   source file corresponds to an in-memory module loaded by the
   ``BytecodeFileLoader``. For example, if both ``*.pyc`` and ``*.py`` are
   changed after a module is loaded, tracebacks will show lines of the updated
   source, which might not correspond to the running code.
   The same “gotcha” applies to current handling of ``*.py`` files.
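The path mapping and the correspondence check could look roughly like the sketch below. The text above does not say how “corresponds” is decided, so this is an assumption: the sketch compares the source against the hash, or the mtime and size, recorded in the :pep:`552` header of the ``*.pyc`` file. The function names ``expected_source_path`` and ``source_matches_bytecode`` are hypothetical and are not taken from the reference implementation.

```python
import importlib.util
import py_compile
import struct
import tempfile
from pathlib import Path


def expected_source_path(pyc_path):
    """Map dir/module.pyc to dir/__pysource__/module.py."""
    pyc_path = Path(pyc_path)
    return pyc_path.parent / "__pysource__" / (pyc_path.stem + ".py")


def source_matches_bytecode(pyc_path, source_path):
    """Assumed check: compare the source with the data in the pyc header."""
    try:
        header = Path(pyc_path).read_bytes()[:16]
        if len(header) < 16:
            return False
        (flags,) = struct.unpack("<I", header[4:8])
        if flags & 0b01:
            # Hash-based pyc: bytes 8..16 hold the hash of the source.
            source_bytes = Path(source_path).read_bytes()
            return header[8:16] == importlib.util.source_hash(source_bytes)
        # Timestamp-based pyc: bytes 8..16 hold the source mtime and size.
        mtime, size = struct.unpack("<II", header[8:16])
        st = Path(source_path).stat()
        return (int(st.st_mtime) & 0xFFFFFFFF,
                st.st_size & 0xFFFFFFFF) == (mtime, size)
    except OSError:
        return False  # a missing or unreadable file never corresponds


with tempfile.TemporaryDirectory() as tmp:
    pyc = Path(tmp) / "module.pyc"
    source = expected_source_path(pyc)
    source.parent.mkdir()
    source.write_text("VALUE = 1\n")
    py_compile.compile(
        str(source), cfile=str(pyc), doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH)

    relative_location = source.relative_to(tmp).as_posix()
    fresh = source_matches_bytecode(pyc, source)
    source.write_text("VALUE = 2\n")  # edit the source after compiling
    stale = source_matches_bytecode(pyc, source)

print(relative_location, fresh, stale)
```

With a check like this, an edited or missing ``__pysource__`` file would make ``get_source`` report that no source is available, rather than showing text that does not match the bytecode.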
The ``py_compile`` and ``compileall`` modules will gain arguments and CLI
options for compiling to the new layout.
[XXX: This needs fleshing out. The original source needs to be moved. Need to ensure that compilation is still idempotent.]
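Until the new arguments are specified, the intended effect can be approximated with the existing ``py_compile.compile`` function. The helper ``compile_to_pysource_layout`` below is a hypothetical sketch, not the planned API. It moves the source into ``__pysource__``, compiles it to an untagged ``*.pyc`` in the original directory, and can be run again on the same path without changing the resulting layout.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_to_pysource_layout(module_path):
    """Turn dir/module.py into dir/module.pyc + dir/__pysource__/module.py.

    Hypothetical helper. Calling it again on the same path recompiles from
    the already-moved source, so repeated runs leave the same files behind.
    """
    module_path = Path(module_path)
    stashed = module_path.parent / "__pysource__" / module_path.name
    if module_path.exists():
        stashed.parent.mkdir(exist_ok=True)
        module_path.replace(stashed)  # move, overwriting an older copy
    elif not stashed.exists():
        raise FileNotFoundError(module_path)
    pyc = module_path.with_suffix(".pyc")
    # Compile from the final location so the recorded file name is the
    # path where the source actually lives.
    py_compile.compile(
        str(stashed), cfile=str(pyc), doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH)
    return pyc


def layout(root):
    """List all files under *root* as sorted relative POSIX paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "module.py"
    module.write_text("VALUE = 1\n")
    compile_to_pysource_layout(module)
    first = layout(tmp)
    compile_to_pysource_layout(module)  # second run: source already moved
    second = layout(tmp)

print(first)
```

Using the checked-hash mode here is a design choice for the sketch. It records a hash of the source in the ``*.pyc`` header, which a loader could later use to decide whether the ``__pysource__`` file is current.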
Implications
------------
The behavior below follows naturally [XXX verify this!] from the changes above,
but will be tested separately.
``inspect.getsource``, ``inspect.getsourcefile``, ``inspect.getsourcelines``,
and the ``python -m inspect`` CLI will retrieve source for modules using the
new layout (if the ``__pysource__/*.py`` file is available and current).
Tracebacks will show source lines for modules using the new layout
(if the ``__pysource__/*.py`` file is available and current).
Backwards Compatibility
=======================
The proposal is backwards compatible.
However, once an installer (including Python's build process) switches to the
new layout, tools that are not prepared for it may stop working.
This affects tools like IDEs, debuggers, API doc generators, etc. if they
either don't use ``importlib`` or ``inspect``, or use these modules from a
different version of Python than the code they are handling.
Even in that case, the failure -- not being able to retrieve source code
for a third-party module -- is usually a quality-of-life issue rather than
a serious flaw.
Security Implications
=====================
None known.
The proposal adds source code information to modules that can already be
loaded and executed.
How to Teach This
=================
This change does not affect code that users write directly.
Most teaching materials can stay unchanged.
Authors of existing installer tools should read this PEP.
Authors of future installer tools should read documentation that will be added.
Searching for the ``__pysource__`` directory name in Python's documentation
should yield relevant documentation.
We hope that people exploring the libraries installed on their system will
naturally reach relevant docs by searching for ``__pysource__``.
Reference Implementation
========================
https://github.com/encukou/cpython/tree/pysource
Rejected Ideas
==============
Nothing yet.
Open Issues
===========
See the XXX markers above.