[Python-ideas] A new .pyc file format

25 Apr 2008

      Hello

(Sorry if you get this twice, I can't see my original post from gmane)

I want to propose a new .pyc file format. Currently .pyc files use a very
simple format:

- MAGIC number (4 bytes, little-endian)
- last modification time of source file (4 bytes, little-endian)
- code object (marshaled)
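
As a rough illustration of the layout above, here is a minimal Python sketch that packs and unpacks the two 4-byte header fields with struct and the code object with marshal. The helper names, the magic value 12345 and the timestamp are made up for the example; the real magic number is the one defined in import.c.

```python
import marshal
import struct

def pack_old_pyc(magic, mtime, code):
    """Old layout: MAGIC (4 bytes LE) + mtime (4 bytes LE) + marshaled code."""
    return struct.pack("<II", magic, mtime) + marshal.dumps(code)

def unpack_old_pyc(data):
    """Split the 8-byte header from the marshaled code object."""
    magic, mtime = struct.unpack_from("<II", data, 0)
    return magic, mtime, marshal.loads(data[8:])

code = compile("answer = 6 * 7", "<example>", "exec")
data = pack_old_pyc(12345, 1209081600, code)   # placeholder magic and mtime
magic, mtime, loaded = unpack_old_pyc(data)
```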

The problem is that this format is *too* simple. It can't be changed, nor
can it accommodate other fields if desired. I propose using a more flexible
.pyc format (resembling RIFF files with multiple levels). The layout would
be as follows:

- A file contains a sequence of sections.
- A section has an identifier (4 bytes, usually ASCII letters), followed
by its size (4 bytes, not counting the section identifier nor the size
itself), followed by the actual section content.
- The layout inside each section is arbitrary, but it's suggested to use
the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes,
actual value)

The outer section is called "PYCO" (from Python Code, or a contraction of
pyc+pyo) and contains at least 4 subsections:

   - "VERS": import's "MAGIC number", now seen as a "code version number" (4
bytes, same format as before)
   - "DATE": last modification time of source file (4 bytes, same format as
before)
   - "COFL": the code.co_flags attribute (4 bytes)
   - "CODE": the marshaled code object

             4 bytes                 4 bytes
    +-----.-----.-----.-----+-----.-----.-----.-----+
    | "P" | "Y" | "C" | "O" | size of whole section |
    +-----------------------+-----------------------+

    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
    | "V" | "E" | "R" | "S" |   4                   | import "MAGIC number" |
    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
    | "D" | "A" | "T" | "E" |   4                   | source file st_mtime  |
    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
    | "C" | "O" | "F" | "L" |   4                   | code.co_flags         |
    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
    | "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
    +-----.-----.-----.-----+-----.-----.-----.-----+     ... ... ...       |
                                                    |     ... ... ...       |
                                                    +-----------------------+
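
To make the diagram concrete, here is a minimal sketch of a writer for this layout, assuming little-endian unsigned 32-bit sizes and no padding. The function names make_section and build_pyco, and the magic value, are hypothetical; they are not part of any existing module.

```python
import marshal
import struct

def make_section(ident, payload):
    """Return identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    return ident + struct.pack("<I", len(payload)) + payload

def build_pyco(magic, mtime, code):
    """Assemble one "PYCO" section holding the four subsections."""
    body = b"".join([
        make_section(b"VERS", struct.pack("<I", magic)),
        make_section(b"DATE", struct.pack("<I", mtime)),
        make_section(b"COFL", struct.pack("<I", code.co_flags)),
        make_section(b"CODE", marshal.dumps(code)),
    ])
    return make_section(b"PYCO", body)

code = compile("x = 1 + 1", "<example>", "exec")
blob = build_pyco(12345, 1209081600, code)   # placeholder magic and mtime
```

The size stored in the "PYCO" header covers everything after the first 8 bytes, so a reader that does not care about the contents can jump straight past the whole section.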

New sections (or subsections inside a section) can be defined in the
future. No prior knowledge of a section's meaning or structure is
required to read the file; readers can safely skip over sections they
don't understand without ever losing synchronization.
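
The skipping behaviour can be sketched as follows, working on an in-memory bytes object for simplicity. The "XTRA" identifier stands for some hypothetical future section this reader knows nothing about; the helper names are made up as well.

```python
import struct

def make_section(ident, payload):
    """Identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    return ident + struct.pack("<I", len(payload)) + payload

def iter_sections(data):
    """Yield (identifier, payload) pairs, advancing by the stored size."""
    pos = 0
    while pos + 8 <= len(data):
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield ident, data[pos + 8:pos + 8 + size]
        pos += 8 + size

KNOWN = {b"VERS", b"DATE"}

data = (make_section(b"VERS", struct.pack("<I", 12345))
        + make_section(b"XTRA", b"some future payload")   # unknown to this reader
        + make_section(b"DATE", struct.pack("<I", 1209081600)))

# Only the known sections are decoded; "XTRA" is stepped over using its size.
seen = {ident: struct.unpack("<I", payload)[0]
        for ident, payload in iter_sections(data) if ident in KNOWN}
```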

Compared with the current format, it has an overhead of 44 bytes (five
8-byte section headers plus the 4-byte "COFL" value).

The format above can replace the current format used for .pyc/.pyo files
(but see below). Of course it's totally incompatible with the old format.
Apart from changing every place where .pyc files are read or written in
the Python sources (there aren't many; I've identified all of them),
third-party libraries and tools using the old format would have to be
updated. Perhaps a new module should be provided to read and write .pyc
files.
Anyway the change is "safe", in the sense that any old code expecting the
MAGIC number in the first 4 bytes will reject the new format as invalid
and not process it.
Because of this incompatibility, this should be aimed at Python 3.x; I hope
we are still in time to implement this for 3.0?

A step further:

Currently, the generated code object depends on the Python version and the
optimize flag; it used to depend on the Unicode flag too, but that's not
the case for Python 3.
The Python version determines the base MAGIC number; the Unicode flag
increments that number by 1; the optimize flag determines the file
extension used (.pyc/.pyo).
With this new format, there is no need to use two different extensions
anymore: all of this can be gathered from the attributes above, so several
variants of the same code object can be stored in a single file. The
importer can choose which one to load based on those attributes. The
selection can be made quickly: only the relevant attributes actually have
to be read; all other subsections can be skipped entirely without
further parsing.
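
A minimal sketch of such a selection follows. It assumes that the variants are stored as consecutive top-level "PYCO" sections and that a variant is chosen by exact match on its "VERS" and "COFL" values; neither detail is fixed by the proposal, and the function names and byte-string payloads are placeholders standing in for real marshaled code.

```python
import struct

def make_section(ident, payload):
    """Identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    return ident + struct.pack("<I", len(payload)) + payload

def sections(data, start=0, end=None):
    """Yield (identifier, payload_offset, size) without copying payloads."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield ident, pos + 8, size
        pos += 8 + size

def pick_variant(data, want_vers, want_cofl):
    """Return the "CODE" payload of the first matching variant, or None."""
    for ident, start, size in sections(data):
        if ident != b"PYCO":
            continue                      # skip unrelated top-level sections
        attrs, code_span = {}, None
        for sub, sub_start, sub_size in sections(data, start, start + size):
            if sub in (b"VERS", b"COFL"):
                (attrs[sub],) = struct.unpack_from("<I", data, sub_start)
            elif sub == b"CODE":
                code_span = (sub_start, sub_size)   # remembered, not parsed
        if (attrs.get(b"VERS"), attrs.get(b"COFL")) == (want_vers, want_cofl):
            if code_span is not None:
                return data[code_span[0]:code_span[0] + code_span[1]]
    return None

def variant(vers, cofl, code_bytes):
    body = (make_section(b"VERS", struct.pack("<I", vers))
            + make_section(b"DATE", struct.pack("<I", 1209081600))
            + make_section(b"COFL", struct.pack("<I", cofl))
            + make_section(b"CODE", code_bytes))
    return make_section(b"PYCO", body)

data = variant(1, 0, b"variant-A") + variant(2, 0, b"variant-B")
chosen = pick_variant(data, 2, 0)
```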

Some issues:

- endianness: .pyc files currently store the modification time and magic
number in little-endian; probably one should just stick to it.
- making the size of all sections a multiple of 4 may be a good idea, so
marshaled code should be padded with up to 3 NUL bytes at the end.
- section ordering, and subsection ordering inside a section: should not
be relevant; what if one can't seek to an earlier part of the file? (Ok,
unlikely, but currently import.c appears to handle such cases). If "CODE"
comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind
the file to read the code section. The easy fix is to forbid that
situation: "CODE" must come after all of those subsections.
- The co_flags attribute of code objects is now externally visible; future
Python versions should not redefine those flags.
- There is no provision for explicit attribute types: "VERS" is a number,
"CODE" is a marshaled code object... The reader has to *know* that
(although it can completely skip over unknown attributes). No string
attributes are defined (or required). For the *current* needs, it's
enough as it is. But perhaps in the future this will turn out to be a
shortcoming, and the .pyc format will have to be changed *again*, and I'd
hate that.
- Perhaps the source modification date should be stored in a more portable
way?
- a naming problem: currently, the code version number defined in import.c
is called "MAGIC", and is written at the very beginning of the file. It
identifies the file as having a valid code object. In the proposed format,
the file will begin with the letters "PYCO" instead, and the current magic
number is buried inside a subsection... it's not a "magic" anymore, just a
version number, and the "magic" in the sense used by file(1) would be the
4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it
seems too drastic.
- 32 bits should be enough for all sizes (and 640k should be enough for
all people...)
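
Regarding the padding issue in the list above, here is a minimal sketch of how the marshaled code could be padded to a multiple of 4 bytes. It assumes the stored section size would include the padding; pad4 is a made-up helper name. The marshal documentation states that extra bytes after the object are ignored by marshal.loads, so the trailing NULs should not disturb loading.

```python
import marshal

def pad4(payload):
    """Append 0 to 3 NUL bytes so the length is a multiple of 4."""
    return payload + b"\0" * (-len(payload) % 4)

code = compile("answer = 6 * 7", "<example>", "exec")
padded = pad4(marshal.dumps(code))
restored = marshal.loads(padded)   # trailing NUL bytes are ignored
```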

Implementation:

I don't have a complete implementation yet, but if this format is approved
(as is or with any changes) I could submit a patch. I've made a small but
incompatible modification to the current .pyc format in order to find all
the places this change would affect, and there are actually not that many.

-- 
Gabriel Genellina