[Python-Dev] PEP 488: elimination of PYO files
Brett Cannon
bcannon at gmail.com
Fri Mar 13 14:39:59 CET 2015
Here is the latest draft of the PEP. I have closed the issue of file name
formatting thanks to the informal poll results being very clear on the
preferred format and also closed the idea of embedding the optimization
level in the bytecode file metadata (that can be another PEP if someone
cares to write that one). The only remaining open issue is whether to
special-case non-optimized bytecode file names and completely leave out the
optimization level in that instance.
I think this PEP is close enough to be finished to ask whether Guido wants
to wield his BDFL stick for this PEP or if he wants to delegate to a BDFAP.
================================================
PEP: 488
Title: Elimination of PYO files
Version: $Revision$
Last-Modified: $Date$
Author: Brett Cannon <brett at python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 20-Feb-2015
Post-History:
2015-03-06
2015-03-13
Abstract
========
This PEP proposes eliminating the concept of PYO files from Python.
To continue the support of the separation of bytecode files based on
their optimization level, this PEP proposes extending the PYC file
name to include the optimization level in bytecode repository
directory (i.e., the ``__pycache__`` directory).
Rationale
=========
As of today, bytecode files come in two flavours: PYC and PYO. A PYC
file is the bytecode file generated and read from when no
optimization level is specified at interpreter startup (i.e., ``-O``
is not specified). A PYO file represents the bytecode file that is
read/written when **any** optimization level is specified (i.e., when
``-O`` is specified, including ``-OO``). This means that while PYC
files clearly delineate the optimization level used when they were
generated -- namely no optimizations beyond the peepholer -- the same
is not true for PYO files. Put in terms of optimization levels and
the file extension:
- 0: ``.pyc``
- 1 (``-O``): ``.pyo``
- 2 (``-OO``): ``.pyo``
The reuse of the ``.pyo`` file extension for both level 1 and 2
optimizations means that there is no clear way to tell what
optimization level was used to generate the bytecode file. In terms
of reading PYO files, this can lead to an interpreter using a mixture
of optimization levels with its code if the user was not careful to
make sure all PYO files were generated using the same optimization
level (typically done by blindly deleting all PYO files and then
using the `compileall` module to compile all-new PYO files [1]_).
This issue is only compounded when people optimize Python code beyond
what the interpreter natively supports, e.g., using the astoptimizer
project [2]_.
In terms of writing PYO files, the need to delete all PYO files
every time one either changes the optimization level they want to use
or are unsure of what optimization was used the last time PYO files
were generated leads to unnecessary file churn. The change proposed
by this PEP also allows for **all** optimization levels to be
pre-compiled for bytecode files ahead of time, something that is
currently impossible thanks to the reuse of the ``.pyo`` file
extension for multiple optimization levels.
As for distributing bytecode-only modules, having to distribute both
``.pyc`` and ``.pyo`` files is unnecessary for the common use-case
of code obfuscation and smaller file deployments.
Proposal
========
To eliminate the ambiguity that PYO files present, this PEP proposes
eliminating the concept of PYO files and their accompanying ``.pyo``
file extension. To allow for the optimization level to be unambiguous
as well as to avoid having to regenerate optimized bytecode files
needlessly in the `__pycache__` directory, the optimization level
used to generate a PYC file will be incorporated into the bytecode
file name. Currently bytecode file names are created by
``importlib.util.cache_from_source()``, approximately using the
following expression defined by PEP 3147 [3]_, [4]_, [5]_::
'{name}.{cache_tag}.pyc'.format(name=module_name,
cache_tag=sys.implementation.cache_tag)
This PEP proposes to change the expression to::
'{name}.{cache_tag}.opt-{optimization}.pyc'.format(
name=module_name,
cache_tag=sys.implementation.cache_tag,
optimization=str(sys.flags.optimize))
The "opt-" prefix was chosen so as to provide a visual separator
from the cache tag. The placement of the optimization level after
the cache tag was chosen to preserve lexicographic sort order of
bytecode file names based on module name and cache tag which will
not vary for a single interpreter. The "opt-" prefix was chosen over
"o" so as to be somewhat self-documenting. The "opt-" prefix was
chosen over "O" so as to not have any confusion with "0" while being
so close to the interpreter version number.
A period was chosen over a hyphen as a separator so as to distinguish
clearly that the optimization level is not part of the interpreter
version as specified by the cache tag. It also lends to the use of
the period in the file name to delineate semantically different
concepts.
For example, the bytecode file name of ``importlib.cpython-35.pyc``
would become ``importlib.cpython-35.opt-0.pyc``. If ``-OO`` had been
passed to the interpreter then instead of
``importlib.cpython-35.pyo`` the file name would be
``importlib.cpython-35.opt-2.pyc``.
It should be noted that this change in no way affects the performance
of import. Since the import system looks for a single bytecode file
based on the optimization level of the interpreter already and
generates a new bytecode file if it doesn't exist, the introduction
of potentially more bytecode files in the ``__pycache__`` directory
has no effect. The interpreter will continue to look for only a
single bytecode file based on the optimization level and thus no
increase in stat calls will occur.
Implementation
==============
importlib
---------
As ``importlib.util.cache_from_source()`` is the API that exposes
bytecode file paths as well as being directly used by importlib, it
requires the most critical change. As of Python 3.4, the function's
signature is::
importlib.util.cache_from_source(path, debug_override=None)
This PEP proposes changing the signature in Python 3.5 to::
importlib.util.cache_from_source(path, debug_override=None, *,
optimization=None)
The introduced ``optimization`` keyword-only parameter will control
what optimization level is specified in the file name. If the
argument is ``None`` then the current optimization level of the
interpreter will be assumed. Any argument given for ``optimization``
will be passed to ``str()`` and must have ``str.isalnum()`` be true,
else ``ValueError`` will be raised (this prevents invalid characters
being used in the file name). If the empty string is passed in for
``optimization`` then the addition of the optimization will be
suppressed, reverting to the file name format which predates this
PEP.
It is expected that beyond Python's own
0-2 optimization levels, third-party code will use a hash of
optimization names to specify the optimization level, e.g.
``hashlib.sha256(','.join(['dead code elimination', 'constant
folding'])).hexdigest()``.
While this might lead to long file names, it is assumed that most
users never look at the contents of the __pycache__ directory and so
this won't be an issue.
The ``debug_override`` parameter will be deprecated. As the parameter
expects a boolean, the integer value of the boolean will be used as
if it had been provided as the argument to ``optimization`` (a
``None`` argument will mean the same as for ``optimization``). A
deprecation warning will be raised when ``debug_override`` is given a
value other than ``None``, but there are no plans for the complete
removal of the parameter at this time (but removal will be no later
than Python 4).
The various module attributes for importlib.machinery which relate to
bytecode file suffixes will be updated [7]_. The
``DEBUG_BYTECODE_SUFFIXES`` and ``OPTIMIZED_BYTECODE_SUFFIXES`` will
both be documented as deprecated and set to the same value as
``BYTECODE_SUFFIXES`` (removal of ``DEBUG_BYTECODE_SUFFIXES`` and
``OPTIMIZED_BYTECODE_SUFFIXES`` is not currently planned, but will be
not later than Python 4).
All various finders and loaders will also be updated as necessary,
but updating the previous mentioned parts of importlib should be all
that is required.
Rest of the standard library
----------------------------
The various functions exposed by the ``py_compile`` and
``compileall`` functions will be updated as necessary to make sure
they follow the new bytecode file name semantics [6]_, [1]_. The CLI
for the ``compileall`` module will not be directly affected (the
``-b`` flag will be implicit as it will no longer generate ``.pyo``
files when ``-O`` is specified).
Compatibility Considerations
============================
Any code directly manipulating bytecode files from Python 3.2 on
will need to consider the impact of this change on their code (prior
to Python 3.2 -- including all of Python 2 -- there was no
__pycache__ which already necessitates bifurcating bytecode file
handling support). If code was setting the ``debug_override``
argument to ``importlib.util.cache_from_source()`` then care will be
needed if they want the path to a bytecode file with an optimization
level of 2. Otherwise only code **not** using
``importlib.util.cache_from_source()`` will need updating.
As for people who distribute bytecode-only modules (i.e., use a
bytecode file instead of a source file), they will have to choose
which optimization level they want their bytecode files to be since
distributing a ``.pyo`` file with a ``.pyc`` file will no longer be
of any use. Since people typically only distribute bytecode files for
code obfuscation purposes or smaller distribution size then only
having to distribute a single ``.pyc`` should actually be beneficial
to these use-cases. And since the magic number for bytecode files
changed in Python 3.5 to support PEP 465 there is no need to support
pre-existing ``.pyo`` files [8]_.
Rejected Ideas
==============
Completely dropping optimization levels from CPython
----------------------------------------------------
Some have suggested that instead of accommodating the various
optimization levels in CPython, we should instead drop them
entirely. The argument is that significant performance gains would
occur from runtime optimizations through something like a JIT and not
through pre-execution bytecode optimizations.
This idea is rejected for this PEP as that ignores the fact that
there are people who do find the pre-existing optimization levels for
CPython useful. It also assumes that no other Python interpreter
would find what this PEP proposes useful.
Alternative formatting of the optimization level in the file name
-----------------------------------------------------------------
Using the "opt-" prefix and placing the optimization level between
the cache tag and file extension is not critical. All options which
have been considered are:
* ``importlib.cpython-35.opt-0.pyc``
* ``importlib.cpython-35.opt0.pyc``
* ``importlib.cpython-35.o0.pyc``
* ``importlib.cpython-35.O0.pyc``
* ``importlib.cpython-35.0.pyc``
* ``importlib.cpython-35-O0.pyc``
* ``importlib.O0.cpython-35.pyc``
* ``importlib.o0.cpython-35.pyc``
* ``importlib.0.cpython-35.pyc``
These were initially rejected either because they would change the
sort order of bytecode files, possible ambiguity with the cache tag,
or were not self-documenting enough. An informal poll was taken and
people clearly preferred the formatting proposed by the PEP [9]_.
Since this topic is non-technical and of personal choice, the issue
is considered solved.
Embedding the optimization level in the bytecode metadata
---------------------------------------------------------
Some have suggested that rather than embedding the optimization level
of bytecode in the file name that it be included in the file's
metadata instead. This would mean every interpreter had a single copy
of bytecode at any time. Changing the optimization level would thus
require rewriting the bytecode, but there would also only be a single
file to care about.
This has been rejected due to the fact that Python is often installed
as a root-level application and thus modifying the bytecode file for
modules in the standard library are always possible. In this
situation integrators would need to guess at what a reasonable
optimization level was for users for any/all situations. By
allowing multiple optimization levels to co-exist simultaneously it
frees integrators from having to guess what users want and allows
users to utilize the optimization level they want.
Open Issues
===========
Not specifying the optimization level when it is at 0
-----------------------------------------------------
It has been suggested that for the common case of when the
optimizations are at level 0 that the entire part of the file name
relating to the optimization level be left out. This would allow for
file names of ``.pyc`` files to go unchanged, potentially leading to
less backwards-compatibility issues (although Python 3.5 introduces a
new magic number for bytecode so all bytecode files will have to be
regenerated regardless of the outcome of this PEP).
It would also allow a potentially redundant bit of information to be
left out of the file name if an implementation of Python did not
allow for optimizing bytecode. This would only occur, though, if the
interpreter didn't support ``-O`` **and** didn't implement the ast
module, else users could implement their own optimizations.
Arguments against allowing this special case is "explicit is better
than implicit" and "special cases aren't special enough to break the
rules".
At this people have weakly supporting this idea while no one has
explicitly come out against it.
References
==========
.. [1] The compileall module
(https://docs.python.org/3/library/compileall.html#module-compileall)
.. [2] The astoptimizer project
(https://pypi.python.org/pypi/astoptimizer)
.. [3] ``importlib.util.cache_from_source()``
(
https://docs.python.org/3.5/library/importlib.html#importlib.util.cache_from_source
)
.. [4] Implementation of ``importlib.util.cache_from_source()`` from
CPython 3.4.3rc1
(
https://hg.python.org/cpython/file/038297948389/Lib/importlib/_bootstrap.py#l437
)
.. [5] PEP 3147, PYC Repository Directories, Warsaw
(http://www.python.org/dev/peps/pep-3147)
.. [6] The py_compile module
(https://docs.python.org/3/library/compileall.html#module-compileall)
.. [7] The importlib.machinery module
(
https://docs.python.org/3/library/importlib.html#module-importlib.machinery)
.. [8] ``importlib.util.MAGIC_NUMBER``
(
https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER
)
.. [9] Informal poll of file name format options on Google+
(https://plus.google.com/u/0/+BrettCannon/posts/fZynLNwHWGm)
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20150313/cb341236/attachment.html>
More information about the Python-Dev
mailing list