[Python-Dev] PEP 3147 ready for pronouncement and merging
Guido van Rossum
guido at python.org
Fri Apr 16 05:01:38 CEST 2010
Comments inline. Nothing showstopping, mostly just spewing obscure
background information...
Overall, congratulations! I'm fine with the implementation going in
and the PEP being marked as accepted as long as you get to the
clarifications I suggest below soon after.
--Guido
On Tue, Apr 13, 2010 at 1:21 PM, Barry Warsaw <barry at python.org> wrote:
> I am attaching the latest revision of PEP 3147 to this message, which is also
> available here:
>
> http://www.python.org/dev/peps/pep-3147/
>
> I think the PEP is ready for pronouncement, and the patch is pretty much ready
> for merging into py3k. The only thing that I can think of that is not
> implemented yet is this section on PEP 302 loaders:
>
> PEP 302 [18]_ defined loaders have a `.get_filename()` method which
> points to the `__file__` for a module. As part of this PEP, we will
> extend this API, to include a new method `.get_paths()` which will
> return a 2-tuple containing the path to the source file and the path
> to where the matching `pyc` file is (or would be).
>
> I'm honestly not sure whether this is still essential, or whether the
> importlib ABC changes Brett and I talked about at Pycon are still required. I
> now believe they are at best a minor part of the implementation if so. Maybe
> Brett can chime in on that.
Fine with me to omit.
> Everything else is implemented, tested, and has undergone four rounds of
> Rietveld reviews (thanks Antoine, Benjamin, Brett, and Georg!). A fifth patch
> set has been uploaded and is available here:
>
> http://codereview.appspot.com/842043/show
TL;DR :-)
> This addresses all previous comments, includes some fixes from Brian Curtin
> for Windows (thanks!) and fixes __main__ and -m support. I'd like to commit
> this to py3k sooner rather than later so that we can shake out any additional
> issues that might crop up, without having to continue to maintain my external
> branches.
>
> Guido, what say you?
> -Barry
>
> PEP: 3147
> Title: PYC Repository Directories
> Version: $Revision: 80025 $
> Last-Modified: $Date: 2010-04-12 22:17:40 -0400 (Mon, 12 Apr 2010) $
> Author: Barry Warsaw <barry at python.org>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 2009-12-16
> Python-Version: 3.2
> Post-History: 2010-01-30, 2010-02-25, 2010-03-03, 2010-04-12
>
>
> Abstract
> ========
>
> This PEP describes an extension to Python's import mechanism which
> improves sharing of Python source code files among multiple installed
> versions of the Python interpreter. It does this by
> allowing more than one byte compilation file (.pyc files) to be
> co-located with the Python source file (.py file). The extension
> described here can also be used to support different Python
> compilation caches, such as JIT output that may be produced by an
> Unladen Swallow [1]_ enabled C Python.
>
>
> Background
> ==========
>
> CPython compiles its source code into "byte code", and for performance
> reasons, it caches this byte code on the file system whenever the
> source file has changes. This makes loading of Python modules much
> faster because the compilation phase can be bypassed. When your
> source file is `foo.py`, CPython caches the byte code in a `foo.pyc`
> file right next to the source.
>
> Byte code files contain two 32-bit numbers followed by the marshaled
big-endian
> [2]_ code object. The 32-bit numbers represent a magic number and a
> timestamp. The magic number changes whenever Python changes the byte
> code format, e.g. by adding new byte codes to its virtual machine.
> This ensures that pyc files built for previous versions of the VM
> won't cause problems. The timestamp is used to make sure that the pyc
> file is not older than the py file that was used to create it. When
is not older than -> matches
(Obscure fact: the timestamp in the pyc file must match the source's
mtime exactly.)
> either the magic number or timestamp do not match, the py file is
> recompiled and a new pyc file is written.
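A minimal sketch of the check described above, assuming the magic number and timestamp have already been decoded from the pyc header. The function name `pyc_matches` and its parameters are hypothetical, not CPython's actual API; the real check lives in import.c.

```python
def pyc_matches(pyc_magic, pyc_mtime, current_magic, source_mtime):
    """Return True if a cached pyc may be used for the given source file.

    Hypothetical helper: all four values are assumed to be already decoded.
    """
    # A different magic number means the byte code format has changed.
    if pyc_magic != current_magic:
        return False
    # The stored timestamp is a 32-bit value and must equal the source's
    # mtime exactly; "pyc is not older than the source" is not sufficient.
    return pyc_mtime == (int(source_mtime) & 0xFFFFFFFF)


print(pyc_matches(b"abcd", 100, b"abcd", 100.0))  # same magic, same mtime
print(pyc_matches(b"abcd", 200, b"abcd", 100.0))  # pyc stamped later than source
```

Under this rule a pyc whose timestamp is later than the source's mtime is still rejected, which is the difference between "matches" and "is not older than".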
>
> In practice, it is well known that pyc files are not compatible across
> Python major releases. A reading of import.c [3]_ in the Python
> source code proves that within recent memory, every new CPython major
> release has bumped the pyc magic number.
>
>
> Rationale
> =========
>
> Linux distributions such as Ubuntu [4]_ and Debian [5]_ provide more
> than one Python version at the same time to their users. For example,
> Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1,
> with Python 2.6 being the default.
>
> This causes a conflict for Python source files installed by the
> system (including third party packages), because you cannot compile a
I'd say only 3rd party packages, right? (And code written by the distro,
which from Python's POV is also 3rd party.) At least ought to clarify
that the stdlib is unaffected by this conflict, because multiple
versions of the stdlib *are* installed.
> single Python source file for more than one Python version at a time.
> Thus if your system wanted to install a `/usr/share/python/foo.py`, it
> could not create a `/usr/share/python/foo.pyc` file usable across all
> installed Python versions.
Note that (due to the magic#) Python doesn't crash, it just falls back
on the slower approach of compiling from source.
Perhaps more important is that different Python versions (if the user
has write permission) will fight over the pyc file and rewrite it each
time the source is compiled. Worse, even though the magic# is
initially written as zero and then rewritten with the correct value,
concurrent processes running different Python versions can actually
end up reading corrupt bytecode. (Alex Martelli diagnosed this at
Google years ago.)
> Furthermore, in order to ease the burden on operating system packagers
> for these distributions, the distribution packages do not contain
> Python version numbers [6]_; they are shared across all Python
> versions installed on the system. Putting Python version numbers in
> the packages would be a maintenance nightmare, since all the packages
> - *and their dependencies* - would have to be updated every time a new
> Python release was added to or removed from the distribution. Because of
> the sheer number of packages available, this amount of work is
> infeasible.
>
> C extensions can be source compatible across multiple versions of
> Python. Compiled extension modules are usually not compatible though,
Actually we typically make every effort to support backwards
compatibility for compiled modules, and the module initialization API
contains a version# check. This is a different version# than the
import magic# and historically has changed much less frequently.
> and PEP 384 [7]_ has been proposed to address this by defining a
> stable ABI for extension modules.
>
> Because these distributions cannot share pyc files, elaborate
> mechanisms have been developed to put the resulting pyc files in
> non-shared locations while the source code is still shared. Examples
> include the symlink-based Debian regimes python-support [8]_ and
> python-central [9]_. These approaches make for much more complicated,
> fragile, inscrutable, and fragmented policies for delivering Python
> applications to a wide range of users. Arguably more users get Python
> from their operating system vendor than from upstream tarballs. Thus,
> solving this pyc sharing problem for CPython is a high priority for
> such vendors.
>
> This PEP proposes a solution to this problem.
>
>
> Proposal
> ========
>
> Python's import machinery is extended to write and search for byte
> code cache files in a single directory inside every Python package
> directory. This directory will be called `__pycache__`.
> Further, pyc files will contain a magic string that differentiates the
Clarify that the magic string is in the filename, not in the file contents.
> Python version they were compiled for. This allows multiple byte
> compiled cache files to co-exist for a single Python source file.
>
> This scheme has the added benefit of reducing the clutter in a Python
> package directory.
>
> When a Python source file is imported for the first time, a
> `__pycache__` directory will be created in the package directory, if
Is this still true? ISTR there was a lot of discussion about the
auto-creation and possible security concerns.
> one does not already exist. The pyc file for the imported source will
> be written to the `__pycache__` directory, using the magic-tag
By now the magic-tag format should have been defined (or a "see below"
inserted).
> formatted name. If either the creation of the `__pycache__` directory
> or the pyc file inside that fails, the import will still succeed, just
> as it does in a pre-PEP-3147 world.
>
> If the py source file is missing, the pyc file inside `__pycache__`
> will be ignored. This eliminates the problem of accidental stale pyc
> file imports.
>
> For backward compatibility, Python will still support pyc-only
> distributions, however it will only do so when the pyc file lives in
> the directory where the py file *would* have been, i.e. not in the
> `__pycache__` directory. A pyc file outside of `__pycache__` will only
> be imported if the py source file is missing.
>
> Tools such as `py_compile` [15]_ and `compileall` [16]_ will be
> extended to create PEP 3147 formatted layouts automatically, but will
> have an option to create pyc-only distribution layouts.
>
>
> Examples
> ---------
>
> What would this look like in practice?
>
> Let's say we have a Python package named `alpha` which contains a
> sub-package named `beta`. The source directory layout before byte
> compilation might look like this::
>
>     alpha/
>         __init__.py
>         one.py
>         two.py
>         beta/
>             __init__.py
>             three.py
>             four.py
>
> After byte compiling this package with Python 3.2, you would see the
> following layout::
>
>     alpha/
>         __pycache__/
>             __init__.cpython-32.pyc
>             one.cpython-32.pyc
>             two.cpython-32.pyc
>         __init__.py
>         one.py
>         two.py
>         beta/
>             __pycache__/
>                 __init__.cpython-32.pyc
>                 three.cpython-32.pyc
>                 four.cpython-32.pyc
>             __init__.py
>             three.py
>             four.py
>
> *Note: listing order may differ depending on the platform.*
>
> Let's say that two new versions of Python are installed, one is Python
> 3.3 and another is Unladen Swallow. After byte compilation, the file
> system would look like this::
>
>     alpha/
>         __pycache__/
>             __init__.cpython-32.pyc
>             __init__.cpython-33.pyc
>             __init__.unladen-10.pyc
>             one.cpython-32.pyc
>             one.cpython-33.pyc
>             one.unladen-10.pyc
>             two.cpython-32.pyc
>             two.cpython-33.pyc
>             two.unladen-10.pyc
>         __init__.py
>         one.py
>         two.py
>         beta/
>             __pycache__/
>                 __init__.cpython-32.pyc
>                 __init__.cpython-33.pyc
>                 __init__.unladen-10.pyc
>                 three.cpython-32.pyc
>                 three.cpython-33.pyc
>                 three.unladen-10.pyc
>                 four.cpython-32.pyc
>                 four.cpython-33.pyc
>                 four.unladen-10.pyc
>             __init__.py
>             three.py
>             four.py
>
> As you can see, as long as the Python version identifier string is
> unique, any number of pyc files can co-exist. These identifier
> strings are described in more detail below.
>
> A nice property of this layout is that the `__pycache__` directories
> can generally be ignored, such that a normal directory listing would
> show something like this::
>
>     alpha/
>         __pycache__/
>         __init__.py
>         one.py
>         two.py
>         beta/
>             __pycache__/
>             __init__.py
>             three.py
>             four.py
>
> This is much less cluttered than even today's Python.
(Aside: this is a major selling point of this PEP!)
>
>
> Python behavior
> ===============
>
> When Python searches for a module to import (say `foo`), it may find
> one of several situations. As per current Python rules, the term
> "matching pyc" means that the magic number matches the current
> interpreter's magic number, and the source file is not newer than the
> `pyc` file.
>
>
> Case 1: The first import
> ------------------------
>
> When Python is asked to import module `foo`, it searches for a
> `foo.py` file (or `foo` package, but that's not important for this
> discussion) along its `sys.path`. When Python locates the `foo.py`
> file it will look for a `__pycache__` directory in the directory where
> it found the `foo.py`. If the `__pycache__` directory is missing,
> Python will create it. Then it will parse and byte compile the
> `foo.py` file and save the byte code in `__pycache__/foo.<magic>.pyc`,
> where <magic> is defined by the Python implementation, but will be a
> human readable string such as `cpython-32`.
(Aside: at first I read this as a description of the full algorithm.
But there is a step missing -- the __pycache__/foo.<magic>.pyc file is
searched and not found.)
>
>
> Case 2: The second import
> -------------------------
>
> When Python is asked to import module `foo` a second time (in a
> different process of course), it will again search for the `foo.py`
> file along its `sys.path`. When Python locates the `foo.py` file, it
> looks for a matching `__pycache__/foo.<magic>.pyc` and finding this,
> it reads the byte code and continues as usual.
>
>
> Case 3: __pycache__/foo.<magic>.pyc with no source
> ---------------------------------------------------
>
> It's possible that the `foo.py` file somehow got removed, while
> leaving the cached pyc file still on the file system. If the
> `__pycache__/foo.<magic>.pyc` file exists, but the `foo.py` file used
> to create it does not, Python will raise an `ImportError` when asked
> to import foo. In other words, Python will not import a pyc file from
> the cache directory unless the source file exists.
>
>
> Case 4: legacy pyc files and source-less imports
> ------------------------------------------------
>
> Python will ignore all legacy pyc files when a source file exists next
> to it. In other words, if a `foo.pyc` file exists next to the
> `foo.py` file, the pyc file will be ignored in all cases.
>
> In order to continue to support source-less distributions though, if
> the source file is missing, Python will import a lone pyc file if it
> lives where the source file would have been.
>
>
> Case 5: read-only file systems
> ------------------------------
>
> When the source lives on a read-only file system, or the `__pycache__`
> directory or pyc file cannot otherwise be written, all the same rules
> apply. This is also the case when `__pycache__` happens to have been
> created with permissions that do not allow pyc files to be written inside it.
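A rough sketch of Cases 1 through 4 as a single lookup, for readers who prefer code to prose. The function `find_module_file` and its return values are hypothetical and only model which file would be used; the magic number and timestamp check, the actual compilation, and the write-failure handling of Case 5 are deliberately left out.

```python
import os
import tempfile


def find_module_file(directory, name, tag="cpython-32"):
    """Return (action, path) for module `name` inside `directory`."""
    source = os.path.join(directory, name + ".py")
    cached = os.path.join(directory, "__pycache__",
                          "{}.{}.pyc".format(name, tag))
    legacy = os.path.join(directory, name + ".pyc")

    if os.path.exists(source):
        if os.path.exists(cached):
            return ("load-cache", cached)   # Case 2 (validity check omitted)
        # Case 1: no cache file yet. A legacy pyc next to the source is
        # ignored here as well (Case 4).
        return ("compile", source)
    if os.path.exists(legacy):
        return ("load-legacy", legacy)      # Case 4: source-less distribution
    # Case 3: a pyc inside __pycache__ without its source is never imported.
    raise ImportError("No module named " + name)


def touch(path):
    """Create an empty file (demo helper)."""
    open(path, "wb").close()


with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, "__pycache__"))
    touch(os.path.join(d, "__pycache__", "foo.cpython-32.pyc"))
    try:
        find_module_file(d, "foo")
        case3 = "imported"
    except ImportError:
        case3 = "ImportError"
    touch(os.path.join(d, "foo.py"))
    case2 = find_module_file(d, "foo")[0]
    touch(os.path.join(d, "bar.pyc"))
    case4 = find_module_file(d, "bar")[0]
    touch(os.path.join(d, "baz.py"))
    case1 = find_module_file(d, "baz")[0]

print(case1, case2, case3, case4)
```

The demo builds a throwaway directory so that each case is exercised once without touching any real package.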
>
>
>
> Flow chart
> ==========
>
> Here is a flow chart describing how modules are loaded:
>
> .. image:: pep-3147-1.png
>    :scale: 75
>
>
> Magic identifiers
> =================
>
> pyc files inside of the `__pycache__` directories contain a magic
> identifier in their file names. These are mnemonic tags for the
> actual magic numbers used by the importer. For example, in Python
> 3.2, we could use the hexlified [10]_ magic number as a unique
(Aside: when you search Wikipedia for "hexlify" it says "did you mean:
heavily?" :-)
> identifier::
>
>     >>> from binascii import hexlify
>     >>> from imp import get_magic
>     >>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
>     'foo.580c0d0a.pyc'
>
> This isn't particularly human friendly though. Instead, this PEP
This section reads a bit weird -- first it describes the solution we
*didn't* pick. I'd move that to an "Alternatives Considered and
Rejected" section or some such.
> proposes a *magic tag* that uniquely defines `.pyc` files for the
> current version of Python. Whenever the magic number is bumped, a new
> magic tag is defined which is unique among all versions and
> implementations of Python. The actual contents of the magic tag are
> left up to the implementation, although it is recommended that the tag
> include the implementation name and a version shorthand. In general,
> magic numbers never change between Python micro releases, but the
> convention can be extended to handle magic number changes between
> pre-release development versions.
>
> For example, CPython 3.2 would have a magic tag of `cpython-32` and
> write pyc files like this: `foo.cpython-32.pyc`. When the `-O` flag
> is used, it would write `foo.cpython-32.pyo`. For backports of this
> feature to Python 2, when the `-U` flag is used, a file such as
> `foo.cpython-27u.pyc` can be written.
Does all of this match the implementation?
> The magic tag is available in the `imp` module via the `get_tag()`
> function. This is analogous to the `get_magic()` function already
> available in that module.
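A small sketch of how the tag ends up in the file name. It falls back to `sys.implementation.cache_tag`, which later Python versions (3.3 and up) provide, rather than the `imp.get_tag()` function proposed here; the helper name `cache_file_name` is made up for illustration.

```python
import sys


def cache_file_name(module_name, tag=None, optimized=False):
    """Build a PEP 3147 style file name such as 'foo.cpython-32.pyc'."""
    if tag is None:
        # Assumption: fall back to the running interpreter's own tag.
        tag = sys.implementation.cache_tag
    if tag is None:
        raise NotImplementedError("this implementation defines no cache tag")
    suffix = "pyo" if optimized else "pyc"
    return "{}.{}.{}".format(module_name, tag, suffix)


print(cache_file_name("foo", tag="cpython-32"))
print(cache_file_name("foo", tag="cpython-32", optimized=True))
```

Passing the tag explicitly reproduces the `cpython-32` examples from the text regardless of which interpreter runs the snippet.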
>
>
> Alternative Python implementations
> ==================================
>
> Alternative Python implementations such as Jython [11]_, IronPython
> [12]_, PyPy [13]_, Pynie [14]_, and Unladen Swallow can also use the
> `__pycache__` directory to store whatever compilation artifacts make
> sense for their platforms. For example, Jython could store the class
> file for the module in `__pycache__/foo.jython-32.class`.
>
>
> Implementation strategy
> =======================
>
> This feature is targeted for Python 3.2, solving the problem for that
> and all future versions. It may be back-ported to Python 2.7.
Is there time given that 2.7b1 was released?
> Vendors are free to backport the changes to earlier distributions as
> they see fit.
>
>
> Effects on existing code
> ========================
>
> Adoption of this PEP will affect existing code and idioms, both inside
> Python and outside. This section enumerates some of these effects.
>
>
> __file__
> ---------
>
> In Python 3, when you import a module, its `__file__` attribute points
> to its source `py` file (in Python 2, it points to the `pyc` file). A
> package's `__file__` points to the `py` file for its `__init__.py`.
> E.g.::
>
>     >>> import foo
>     >>> foo.__file__
>     'foo.py'
>     # baz is a package
>     >>> import baz
>     >>> baz.__file__
>     'baz/__init__.py'
>
> Nothing in this PEP would change the semantics of `__file__`.
>
> This PEP proposes the addition of an `__cached__` attribute to
> modules, which will always point to the actual `pyc` file that was
> read or written. When the environment variable
> `$PYTHONDONTWRITEBYTECODE` is set, or the `-B` option is given, or if
> the source lives on a read-only filesystem, then the `__cached__`
> attribute will point to the location that the `pyc` file *would* have
> been written to if it didn't exist. This location of course includes
> the `__pycache__` subdirectory in its path.
Hm. I wish there was a way to find out whether the bytecode (or
whatever) actually *was* read from this file. __file__ in Python 2
supports this (though not in Python 3).
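For illustration, a sketch that asks a module where its cache file is (or would be) and whether that file currently exists. It uses `module.__spec__.cached` from later Python versions as a stand-in for the `__cached__` attribute proposed here, and the helper name `cache_status` is hypothetical. It cannot tell whether the byte code was actually read from that file, which is exactly the gap noted above.

```python
import json
import os


def cache_status(module):
    """Return (cached_path, exists_on_disk) for an already imported module."""
    spec = getattr(module, "__spec__", None)
    cached = getattr(spec, "cached", None)
    if cached is None:
        # Nothing sensible could be calculated for this module.
        return (None, False)
    return (cached, os.path.exists(cached))


path, exists = cache_status(json)
print(path, exists)
```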
>
> For alternative Python implementations which do not support `pyc`
> files, the `__cached__` attribute may point to whatever information
> makes sense. E.g. on Jython, this might be the `.class` file for the
> module: `__pycache__/foo.jython-32.class`. Some implementations may
> use multiple compiled files to create the module, in which case
> `__cached__` may be a tuple. The exact contents of `__cached__` are
> Python implementation specific.
>
> It is recommended that when nothing sensible can be calculated,
> implementations should set the `__cached__` attribute to `None`.
>
>
> py_compile and compileall
> -------------------------
>
> Python comes with two modules, `py_compile` [15]_ and `compileall`
> [16]_ which support compiling Python modules external to the built-in
> import machinery. `py_compile` in particular has intimate knowledge
> of byte compilation, so these will be updated to understand the new
> layout. The `-b` flag is added to `compileall` for writing legacy
> `.pyc` byte-compiled file path names.
>
>
> bdist_wininst and the Windows installer
> ---------------------------------------
>
> These tools also compile modules explicitly on installation. If they
> do not use `py_compile` and `compileall`, then they would also have to
> be modified to understand the new layout.
>
>
> File extension checks
> ---------------------
>
> There exists some code which checks for files ending in `.pyc` and
> simply chops off the last character to find the matching `.py` file.
> This code will obviously fail once this PEP is implemented.
>
> To support this use case, we'll add two new functions to the `imp`
> module [17]_:
>
> * `imp.cache_from_source(py_path)` -> `pyc_path`
> * `imp.source_from_cache(pyc_path)` -> `py_path`
>
> Alternative implementations are free to override these functions to
> return reasonable values based on their own support for this PEP.
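A hedged sketch of the round trip between source path and cache path. In the Python versions that followed, these two helpers live in `importlib.util` (the `imp` module was later deprecated and removed), so the example uses those; the path `alpha/one.py` is just the example package from earlier and need not exist on disk.

```python
import importlib.util
import os

# Source path -> cache path inside __pycache__, tagged for this interpreter.
source = os.path.join("alpha", "one.py")
cached = importlib.util.cache_from_source(source)
print(cached)

# Cache path -> source path, replacing the "chop off the last character" idiom.
recovered = importlib.util.source_from_cache(cached)
print(recovered)
```

The exact cache file name depends on the interpreter running the snippet, so no fixed output is shown.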
>
>
> PEP 302 loaders
> ---------------
>
> PEP 302 [18]_ defined loaders have a `.get_filename()` method which
> points to the `__file__` for a module. As part of this PEP, we will
> extend this API, to include a new method `.get_paths()` which will
> return a 2-tuple containing the path to the source file and the path
> to where the matching `pyc` file is (or would be).
>
>
> Backports
> ---------
>
> For versions of Python earlier than 3.2 (and possibly 2.7), it is
> possible to backport this PEP. However, in Python 3.2 (and possibly
> 2.7), this behavior will be turned on by default, and in fact, it will
> replace the old behavior. Backports will need to support the old
> layout by default. We suggest supporting PEP 3147 through the use of
> an environment variable called `$PYTHONENABLECACHEDIR` or the command
> line switch `-Xenablecachedir` to enable the feature.
I would be okay if a distro decided to turn it on by default, as long
as there was a way to opt out.
>
>
> Makefiles and other dependency tools
> ------------------------------------
>
> Makefiles and other tools which calculate dependencies on `.pyc` files
> (e.g. to byte-compile the source if the `.pyc` is missing) will have
> to be updated to check the new paths.
>
>
> Alternatives
> ============
>
> PEP 304
> -------
>
> There is some overlap between the goals of this PEP and PEP 304 [19]_,
> which has been withdrawn. However PEP 304 would allow a user to
> create a shadow file system hierarchy in which to store `pyc` files.
> This concept of a shadow hierarchy for `pyc` files could be used to
> satisfy the aims of this PEP. Although the PEP 304 does not indicate
> why it was withdrawn, shadow directories have a number of problems.
> The location of the shadow `pyc` files would not be easily discovered
> and would depend on the proper and consistent use of the
> `$PYTHONBYTECODE` environment variable both by the system and by end
> users. There are also global implications, meaning that while the
> system might want to shadow `pyc` files, users might not want to, but
> the PEP defines only an all-or-nothing approach.
>
> As an example of the problem, a common (though fragile) Python idiom
> for locating data files is to do something like this::
>
>     from os.path import dirname, join
>     import foo.bar
>     data_file = join(dirname(foo.bar.__file__), 'my.dat')
>
> This would be problematic since `foo.bar.__file__` will give the
> location of the `pyc` file in the shadow directory, and it may not be
> possible to find the `my.dat` file relative to the source directory
> from there.
>
>
> Fat byte compilation files
> --------------------------
>
> An earlier version of this PEP described "fat" Python byte code files.
> These files would contain the equivalent of multiple `pyc` files in a
> single `pyf` file, with a lookup table keyed off the appropriate magic
> number. This was an extensible file format so that the first 5
> parallel Python implementations could be supported fairly efficiently,
> but with extension lookup tables available to scale `pyf` byte code
> objects as large as necessary.
>
> The fat byte compilation files were fairly complex, and inherently
> introduced difficult race conditions, so the current simplification of
> using directories was suggested. The same problem applies to using
> zip files as the fat pyc file format.
>
>
> Multiple file extensions
> ------------------------
>
> The PEP author also considered an approach where multiple thin byte
> compiled files lived in the same place, but used different file
> extensions to designate the Python version. E.g. foo.pyc25,
> foo.pyc26, foo.pyc31 etc. This was rejected because of the clutter
> involved in writing so many different files. The multiple extension
> approach makes it more difficult (and an ongoing task) to update any
> tools that are dependent on the file extension.
>
>
> .pyc
> ----
>
> A proposal was floated to call the `__pycache__` directory `.pyc` or
> some other dot-file name. This would have the effect on *nix systems
> of hiding the directory. There are many reasons why this was
> rejected by the BDFL [20]_ including the fact that dot-files are only
> special on some platforms, and we actually do *not* want to hide these
> completely from users.
>
>
> Reference implementation
> ========================
>
> Work on this code is tracked in a Bazaar branch on Launchpad [22]_
> until it's ready for merge into Python 3.2. The work-in-progress diff
> can also be viewed [23]_ and is updated automatically as new changes
> are uploaded.
>
> A Rietveld code review issue [24]_ has been opened as of 2010-04-01 (no,
> this is not an April Fools joke :).
>
>
> References
> ==========
>
> .. [1] PEP 3146
>
> .. [2] The marshal module:
> http://www.python.org/doc/current/library/marshal.html
>
> .. [3] import.c:
> http://svn.python.org/view/python/branches/py3k/Python/import.c?view=markup
>
> .. [4] Ubuntu: <http://www.ubuntu.com>
>
> .. [5] Debian: <http://www.debian.org>
>
> .. [6] Debian Python Policy:
> http://www.debian.org/doc/packaging-manuals/python-policy/
>
> .. [7] PEP 384
>
> .. [8] python-support:
> http://wiki.debian.org/DebianPythonFAQ#Whatispython-support.3F
>
> .. [9] python-central:
> http://wiki.debian.org/DebianPythonFAQ#Whatispython-central.3F
>
> .. [10] binascii.hexlify():
> http://www.python.org/doc/current/library/binascii.html#binascii.hexlify
>
> .. [11] Jython: http://www.jython.org/
>
> .. [12] IronPython: http://ironpython.net/
>
> .. [13] PyPy: http://codespeak.net/pypy/dist/pypy/doc/
>
> .. [14] Pynie: http://code.google.com/p/pynie/
>
> .. [15] py_compile: http://docs.python.org/library/py_compile.html
>
> .. [16] compileall: http://docs.python.org/library/compileall.html
>
> .. [17] imp: http://www.python.org/doc/current/library/imp.html
>
> .. [18] PEP 302
>
> .. [19] PEP 304
>
> .. [20] http://www.mail-archive.com/python-dev@python.org/msg45203.html
>
> .. [21] importlib: http://docs.python.org/3.1/library/importlib.html
>
> .. [22] https://code.launchpad.net/~barry/python/pep3147
>
> .. [23] https://code.launchpad.net/~barry/python/pep3147/+merge/22648
>
> .. [24] http://codereview.appspot.com/842043/show
>
>
> ACKNOWLEDGMENTS
> ===============
>
> Barry Warsaw's original idea was for fat Python byte code files.
> Martin von Loewis reviewed an early draft of the PEP and suggested the
> simplification to store traditional `pyc` and `pyo` files in a
> directory. Many other people reviewed early versions of this PEP and
> provided useful feedback including but not limited to:
>
> * David Malcolm
> * Josselin Mouette
> * Matthias Klose
> * Michael Hudson
> * Michael Vogt
> * Piotr Ożarowski
> * Scott Kitterman
> * Toshio Kuratomi
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.
>
>
>
> ..
> Local Variables:
> mode: indented-text
> indent-tabs-mode: nil
> sentence-end-double-space: t
> fill-column: 70
> coding: utf-8
> End:
>
--
--Guido van Rossum (python.org/~guido)