[Python-ideas] A new .pyc file format

Guido van Rossum guido at python.org
Fri Apr 25 16:29:13 CEST 2008


I think this is a reasonable thing to do, but I'd like to hear more
motivation. Maybe you can write it all up in PEP format and add a
section that explains what features we want from .pyc files?

I like that this would get rid of .pyo files BTW.

--Guido

On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina
<gagsl-py2 at yahoo.com.ar> wrote:
> Hello
>
>  (Sorry if you get this twice, I can't see my original post from gmane)
>
>  I want to propose a new .pyc file format. Currently .pyc files use a very
>  simple format:
>
>  - MAGIC number (4 bytes, little-endian)
>  - last modification time of source file (4 bytes, little-endian)
>  - code object (marshaled)
>
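
A minimal sketch of the three-field layout listed above, using only the standard struct and marshal modules. The helper names and the example magic and mtime values are hypothetical, chosen for illustration; this follows the layout as described in this message, not the header of any particular CPython release.

```python
import marshal
import struct


def write_old_pyc(code, magic, mtime):
    """Serialize a code object in the three-field layout described above."""
    # 4-byte magic and 4-byte mtime, both little-endian, then the marshaled code.
    return struct.pack("<II", magic, mtime) + marshal.dumps(code)


def read_old_pyc(data):
    """Split the fixed 8-byte header from the marshaled code object."""
    magic, mtime = struct.unpack_from("<II", data, 0)
    code = marshal.loads(data[8:])
    return magic, mtime, code


# Hypothetical values, for illustration only.
EXAMPLE_MAGIC = 0x0A0D0C3B
EXAMPLE_MTIME = 1209133753

example_code = compile("x = 6 * 7", "<example>", "exec")
blob = write_old_pyc(example_code, EXAMPLE_MAGIC, EXAMPLE_MTIME)
magic, mtime, loaded = read_old_pyc(blob)
```

Because the header has no length or tag fields, a reader has no way to tell an extra field from the start of the marshaled data, which is the rigidity the proposal addresses.
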
>  The problem is that this format is *too* simple. It can't be changed, nor
>  can it accommodate other fields if desired. I propose using a more flexible
>  .pyc format (resembling RIFF files with multiple levels). The layout would
>  be as follows:
>
>  - A file contains a sequence of sections.
>  - A section has an identifier (4 bytes, usually ASCII letters), followed
>  by its size (4 bytes, not counting the section identifier nor the size
>  itself), followed by the actual section content.
>  - The layout inside each section is arbitrary, but it's suggested to use
>  the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes,
>  actual value)
>
>  The outer section is called "PYCO" (from Python Code, or a contraction of
>  pyc+pyo) and contains at least 4 subsections:
>
>   - "VERS": import's "MAGIC number", now seen as a "code version number" (4
>  bytes, same format as before)
>   - "DATE": last modification time of source file (4 bytes, same format as
>  before)
>   - "COFL": the code.co_flags attribute (4 bytes)
>   - "CODE": the marshaled code object
>
>             4 bytes                 4 bytes
>    +-----.-----.-----.-----+-----.-----.-----.-----+
>    | "P" | "Y" | "C" | "O" | size of whole section |
>    +-----------------------+-----------------------+
>
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "V" | "E" | "R" | "S" |   4                   | import "MAGIC number" |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "D" | "A" | "T" | "E" |   4                   | source file st_mtime  |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
>    | "C" | "O" | "F" | "L" |   4                   | code.co_flags         |
>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>    | "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
>    +-----.-----.-----.-----+-----.-----.-----.-----+     ... ... ...       |
>                                                    |     ... ... ...       |
>                                                    +-----------------------+
>
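
The diagram above could be produced by something like the following sketch. The function names section and build_pyco are hypothetical, and little-endian sizes are an assumption carried over from the current format; no such writer exists in the standard library.

```python
import marshal
import struct


def section(ident, payload):
    """Return identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    # The size counts only the payload, not the identifier or the size field.
    return ident + struct.pack("<I", len(payload)) + payload


def build_pyco(code, version, mtime):
    """Wrap the four subsections in an outer "PYCO" section."""
    body = (
        section(b"VERS", struct.pack("<I", version))
        + section(b"DATE", struct.pack("<I", mtime))
        + section(b"COFL", struct.pack("<I", code.co_flags))
        + section(b"CODE", marshal.dumps(code))
    )
    return section(b"PYCO", body)


# Hypothetical version and mtime values, for illustration only.
example_code = compile("x = 1", "<example>", "exec")
pyco = build_pyco(example_code, 0x0A0D0C3B, 1209133753)
```

Writing "CODE" last matches the ordering constraint discussed under "Some issues" in the proposal.
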
>  New sections (or subsections inside a section) can be defined in the
>  future. No knowledge of a section's meaning or structure is required to
>  read the file; readers can safely skip over sections they don't
>  understand and never lose synchronization.
>
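
A sketch of how a reader might skip sections it does not recognize, assuming little-endian 4-byte sizes as in the layout above. The names iter_sections and read_known, and the "XTRA" identifier, are hypothetical.

```python
import struct


def iter_sections(data):
    """Yield (identifier, payload) for each section found in data."""
    pos = 0
    while pos < len(data):
        if len(data) - pos < 8:
            raise ValueError("truncated section header")
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        if pos + 8 + size > len(data):
            raise ValueError("section extends past end of data")
        yield ident, data[pos + 8:pos + 8 + size]
        # The size field alone tells us where the next section starts.
        pos += 8 + size


KNOWN = {b"VERS", b"DATE"}


def read_known(data):
    """Collect the sections this reader understands; silently skip the rest."""
    return {ident: payload for ident, payload in iter_sections(data)
            if ident in KNOWN}


# "XTRA" stands in for a section defined by some future version.
sample = (
    b"VERS" + struct.pack("<I", 4) + struct.pack("<I", 1)
    + b"XTRA" + struct.pack("<I", 3) + b"abc"
    + b"DATE" + struct.pack("<I", 4) + struct.pack("<I", 2)
)
found = read_known(sample)
```

The reader never interprets the "XTRA" payload; it only uses the size to step over it, which is why it stays in sync.
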
>  Compared with the current format, it has an overhead of 44 bytes.
>
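
The 44-byte figure can be checked with a little arithmetic, assuming the four subsections shown in the diagram and an 8-byte header (identifier plus size) on every section. The variable names are only for this illustration.

```python
SECTION_HEADER = 4 + 4  # identifier + size field

# Current format: magic (4) + mtime (4), then the marshaled code.
old_fixed = 4 + 4

# Proposed format: outer "PYCO" header, three 4-byte subsections
# ("VERS", "DATE", "COFL"), and the header of the "CODE" subsection.
new_fixed = (
    SECTION_HEADER                # "PYCO"
    + 3 * (SECTION_HEADER + 4)    # "VERS", "DATE", "COFL"
    + SECTION_HEADER              # "CODE" header; payload is the same in both
)

overhead = new_fixed - old_fixed
```

Note that 4 of those bytes are the new co_flags value; the remaining 40 are section identifiers and sizes.
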
>  The format above can replace the current format used for .pyc/.pyo files
>  (but see below). Of course it's totally incompatible with the old format.
>  Apart from changing every place where .pyc files are read or written in
>  the Python sources (not so many, I've identified all of them), 3rd party
>  libraries and tools using the old format would have to be updated. Perhaps
>  a new module should be provided to read and write pyc files.
>  Anyway the change is "safe", in the sense that any old code expecting the
>  MAGIC number in the first 4 bytes will reject the new format as invalid
>  and not process it.
>  Due to this incompatibility, this should be aimed at Python 3.x; I hope we
>  are in time to implement this for 3.0?
>
>
>  A step further:
>
>  Currently, the generated code object depends on the Python version and the
>  optimize flag; it used to depend on the Unicode flag too, but that's not
>  the case for Python 3.
>  The Python version determines the base MAGIC number; the Unicode flag
>  increments that number by 1; the optimize flag determines the file
>  extension used (.pyc/.pyo).
>  With this new format, there is no need to use two different extensions
>  anymore: all of this can be gathered from the attributes above, so several
>  variants of the same code object can be stored in a single file. The
>  importer can choose which one to load based on those attributes. The
>  selection can be made rather quickly: only the relevant attributes
>  actually have to be read; all other subsections can be entirely skipped
>  without further parsing.
>
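
One way the selection could work, as a sketch only: assume a file holds several top-level "PYCO" sections, one per variant, and the importer picks the first whose "VERS" and "COFL" values match what it wants. The proposal does not spell out how variants are laid out, so that arrangement, the function names, and the placeholder "CODE" payloads are all assumptions.

```python
import struct


def section(ident, payload):
    """Identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    return ident + struct.pack("<I", len(payload)) + payload


def iter_sections(data):
    """Yield (identifier, payload) pairs; sizes are little-endian."""
    pos = 0
    while pos + 8 <= len(data):
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield ident, data[pos + 8:pos + 8 + size]
        pos += 8 + size


def select_variant(data, version, flags):
    """Return the CODE payload of the first matching variant, else None."""
    for ident, body in iter_sections(data):
        if ident != b"PYCO":
            continue  # unknown top-level section: skip it
        fields = dict(iter_sections(body))
        try:
            (vers,) = struct.unpack("<I", fields[b"VERS"])
            (cofl,) = struct.unpack("<I", fields[b"COFL"])
        except (KeyError, struct.error):
            continue  # variant lacks a usable VERS or COFL: skip it
        if (vers, cofl) == (version, flags):
            return fields.get(b"CODE")
    return None


def variant(version, flags, code_bytes):
    """Build one hypothetical variant; code_bytes is a placeholder payload."""
    return section(b"PYCO",
                   section(b"VERS", struct.pack("<I", version))
                   + section(b"COFL", struct.pack("<I", flags))
                   + section(b"CODE", code_bytes))


multi = variant(1, 0, b"plain") + variant(1, 64, b"flagged")
chosen = select_variant(multi, 1, 64)
```

This in-memory version slices every payload; a file-based reader could instead seek past the "CODE" payload of each rejected variant using its size field.
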
>
>  Some issues:
>
>  - endianness: .pyc files currently store the modification time and magic
>  number in little-endian; probably one should just stick to it.
>  - making the size of every section a multiple of 4 may be a good idea, so
>  marshaled code should be padded with up to 3 NUL bytes at the end.
>  - section ordering, and subsection ordering inside a section: should not
>  be relevant; what if one can't seek to an earlier part of the file? (Ok,
>  unlikely, but currently import.c appears to handle such cases). If "CODE"
>  comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind
>  the file to read the code section. The easy fix is to forbid that
>  situation: "CODE" must come after all of those subsections.
>  - The co_flags attribute of code objects is now externally visible; future
>  Python versions should not redefine those flags.
>  - There is no provision for explicit attribute types: "VERS" is a number,
>  "CODE" is a marshaled code object... The reader has to *know* that
>  (although it can completely skip over unknown attributes). No string
>  attributes were defined (nor required). For the *current* needs, it's
>  enough as it is. But perhaps in the future this turns out to be a shortcoming,
>  and the .pyc format has to be changed *again*, and I'd hate that.
>  - Perhaps the source modification date should be stored in a more portable
>  way?
>  - a naming problem: currently, the code version number defined in import.c
>  is called "MAGIC", and is written at the very beginning of the file. It
>  identifies the file as having a valid code object. In the proposed format,
>  the file will begin with the letters "PYCO" instead, and the current magic
>  number is buried inside a subsection... it's not a "magic" anymore, just a
>  version number, and the "magic" in the sense used by file(1) would be the
>  4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it
>  seems too drastic.
>  - 32 bits should be enough for all sizes (and 640k should be enough for
>  all people...)
>
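
The padding idea from the list above might look like this sketch. The helper name pad4 is hypothetical. It relies on the documented behaviour that marshal.loads ignores extra bytes after the object, so the NUL padding does not need to be stripped before loading.

```python
import marshal


def pad4(payload):
    """Append 0 to 3 NUL bytes so the length is a multiple of 4."""
    return payload + b"\0" * (-len(payload) % 4)


example_code = compile("y = 2 + 3", "<example>", "exec")
padded = pad4(marshal.dumps(example_code))

# Trailing NUL bytes are ignored when the code object is loaded back.
restored = marshal.loads(padded)
```

If the stored size includes the padding, the exact length of the marshaled data is no longer recorded, so a reader depends on the unmarshaller tolerating the trailing bytes.
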
>  Implementation:
>
>  I don't have a complete implementation yet, but if this format is approved
>  (as is or with any changes) I could submit a patch. I've made a small but
>  incompatible modification to the .pyc format currently in use in order to
>  find every place this change would affect, and there aren't actually that
>  many.
>
>  --
>  Gabriel Genellina
>
>  _______________________________________________
>  Python-ideas mailing list
>  Python-ideas at python.org
>  http://mail.python.org/mailman/listinfo/python-ideas
>



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


