[Python-ideas] A new .pyc file format

Brett Cannon brett at python.org
Fri Apr 25 23:20:50 CEST 2008

On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina
<gagsl-py2 at yahoo.com.ar> wrote:
> Hello
>  (Sorry if you get this twice, I can't see my original post from gmane)
>  I want to propose a new .pyc file format. Currently .pyc files use a very
>  simple format:
>  - MAGIC number (4 bytes, little-endian)
>  - last modification time of source file (4 bytes, little-endian)
>  - code object (marshaled)
>  The problem is that this format is *too* simple. It can't be changed, nor
>  can it accommodate other fields if desired. I propose using a more flexible
>  .pyc format (resembling RIFF files with multiple levels). The layout would
>  be as follows:
>  - A file contains a sequence of sections.
>  - A section has an identifier (4 bytes, usually ASCII letters), followed
>  by its size (4 bytes, not counting the section identifier nor the size
>  itself), followed by the actual section content.
>  - The layout inside each section is arbitrary, but it's suggested to use
>  the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes,
>  actual value)
>  The outer section is called "PYCO" (from Python Code, or a contraction of
>  pyc+pyo) and contains at least 4 subsections:
>   - "VERS": import's "MAGIC number", now seen as a "code version number" (4
>  bytes, same format as before)
>   - "DATE": last modification time of source file (4 bytes, same format as
>  before)
>   - "COFL": the code.co_flags attribute (4 bytes)
>   - "CODE": the marshaled code object
>             4 bytes                 4 bytes
>    +-----.-----.-----.-----+-----.-----.-----.-----+
>    | "P" | "Y" | "C" | "O" | size of whole section |
>    +-----------------------+-----------------------+
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "V" | "E" | "R" | "S" |   4                   | import "MAGIC number" |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "D" | "A" | "T" | "E" |   4                   | source file st_mtime  |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
>    | "C" | "O" | "F" | "L" |   4                   | code.co_flags         |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
>    +-----.-----.-----.-----+-----.-----.-----.-----+     ... ... ...       |
>                                                    |     ... ... ...       |
>                                                    +-----------------------+
>  New sections -or subsections inside a section- can be defined in the
>  future. No implied knowledge of section meanings or their structure is
>  required to read the file; readers can safely skip over sections they
>  don't understand, and never lose synchronization.
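
To make the quoted layout concrete, here is a minimal sketch of writing and
reading such sections. The helper names (pack_section, iter_sections), the
"XTRA" section, and the sample magic bytes are hypothetical, and the
little-endian size field is an assumption, since the proposal only gives its
width.

```python
import struct


def pack_section(ident: bytes, payload: bytes) -> bytes:
    """Return identifier (4 bytes) + payload size (4 bytes) + payload."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    # Assumption: the size is little-endian and excludes the identifier
    # and the size field itself, as the proposal describes.
    return ident + struct.pack("<I", len(payload)) + payload


def iter_sections(data: bytes):
    """Yield (identifier, payload) pairs from a sequence of sections."""
    offset = 0
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError("truncated section header")
        ident = data[offset:offset + 4]
        (size,) = struct.unpack_from("<I", data, offset + 4)
        start = offset + 8
        end = start + size
        if end > len(data):
            raise ValueError("truncated section payload")
        yield ident, data[start:end]
        offset = end  # jump to the next section without interpreting this one


# Build a toy file; every payload below is a made-up placeholder.
inner = (
    pack_section(b"VERS", b"\x01\x02\r\n")
    + pack_section(b"DATE", struct.pack("<I", 1209158450))
    + pack_section(b"XTRA", b"unknown to this reader")
    + pack_section(b"CODE", b"<marshaled code would go here>")
)
blob = pack_section(b"PYCO", inner)

# A reader that only knows some identifiers skips the rest by size.
[(outer_ident, outer_payload)] = list(iter_sections(blob))
wanted = (b"VERS", b"DATE", b"COFL", b"CODE")
known = {i: p for i, p in iter_sections(outer_payload) if i in wanted}
```

The reader never needs to understand "XTRA" to find "CODE"; the size field
alone keeps it in step with the file.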

While I think having a more flexible format is important to allow for
modifying the AST before bytecode write-out, I don't know if it needs
to go quite this far. The magic number, timestamp, and marshaled code
are not about to go away. Thus the current format can basically stay,
but we can add a flexible addition between the timestamp and the code
object. This saves some memory and simplifies the format slightly in
the case where the guaranteed requirements of .pyc regeneration can be
quickly checked (e.g., a quick 8 byte read off the file will quickly
tell if the magic number or timestamp are out of date, thus skipping
having to read the entire header for these two critical sanity checks).
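
As a rough sketch of that quick check, assuming the layout described in the
quoted message (4-byte magic followed by a 4-byte little-endian mtime): the
function name header_is_current and the sample magic bytes are hypothetical,
not anything taken from the import machinery.

```python
import struct


def header_is_current(header: bytes, expected_magic: bytes, source_mtime: int) -> bool:
    """Check only the first 8 bytes: 4-byte magic, then 4-byte mtime."""
    if len(header) < 8:
        return False  # too short to hold both fields
    magic = header[:4]
    (stored_mtime,) = struct.unpack_from("<I", header, 4)
    # The stored timestamp is 32 bits wide, so compare against the
    # source mtime truncated to the same width.
    return magic == expected_magic and stored_mtime == (source_mtime & 0xFFFFFFFF)


magic = b"\x01\x02\r\n"  # made-up magic number for illustration
pyc = magic + struct.pack("<I", 1209158450) + b"rest of the file is never inspected"

# Only the first 8 bytes are needed to decide whether to regenerate.
up_to_date = header_is_current(pyc[:8], magic, 1209158450)
```

Anything after those 8 bytes, flexible or not, only has to be read once this
check has passed.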

The thing I think that the new format needs to easily support is not
just the removal of .pyo, but of user-defined AST transformations
prior to .pyc generation. Now that this can be done at the Python
level some people might start coming up with compiler optimizations
that change semantics. That means there needs
to be a clear way to register an AST transformation as having
occurred. I am just worried that the 4 bytes for labeling something
won't be enough. We could say that all optimizations are labeled
"OPTO" and that which optimization was applied is specified in the value, but
that means supporting multiple instances of the same label in the
header (which I think is fine since this is going to be read linearly
off disk 99% of the time).
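
A small sketch of what reading repeated labels linearly could look like. It
assumes each entry is a 4-byte label, a 4-byte little-endian size, and then
the value; the helper names and the optimization names are invented for
illustration.

```python
import struct
from collections import defaultdict


def entry(label: bytes, value: bytes) -> bytes:
    """Encode one entry as label (4 bytes) + size (4 bytes) + value."""
    if len(label) != 4:
        raise ValueError("label must be exactly 4 bytes")
    return label + struct.pack("<I", len(value)) + value


def collect_labels(area: bytes) -> dict:
    """Group values by label in one linear pass, keeping on-disk order."""
    found = defaultdict(list)
    offset = 0
    while offset < len(area):
        if offset + 8 > len(area):
            raise ValueError("truncated entry header")
        label = area[offset:offset + 4]
        (size,) = struct.unpack_from("<I", area, offset + 4)
        end = offset + 8 + size
        if end > len(area):
            raise ValueError("truncated entry value")
        # A repeated label appends to the list instead of overwriting.
        found[label].append(area[offset + 8:end])
        offset = end
    return dict(found)


# Two "OPTO" entries with a different label in between.
area = (
    entry(b"OPTO", b"constant-folding")
    + entry(b"COFL", struct.pack("<I", 0))
    + entry(b"OPTO", b"strip-asserts")
)
labels = collect_labels(area)
```

Because the values are kept in file order, the sequence in which the
transformations were recorded is preserved as well.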

So I guess this boils down to two points: we don't need to label what MUST
be in the header, and we should allow for multiple instances of
the same label (whether this is always allowed or we use some way to flag
it, such as capitalization).

