[Python-ideas] A new .pyc file format
Guido van Rossum
guido at python.org
Fri Apr 25 16:29:13 CEST 2008
I think this is a reasonable thing to do, but I'd like to hear more
motivation. Maybe you can write it all up in PEP format and add a
section that explains what features we want from .pyc files?
I like that this would get rid of .pyo files BTW.
--Guido
On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina
<gagsl-py2 at yahoo.com.ar> wrote:
> Hello
>
> (Sorry if you get this twice, I can't see my original post from gmane)
>
> I want to propose a new .pyc file format. Currently .pyc files use a very
> simple format:
>
> - MAGIC number (4 bytes, little-endian)
> - last modification time of source file (4 bytes, little-endian)
> - code object (marshaled)
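
A minimal sketch of reading the three-field layout listed above, assuming
exactly the fields named in the post. The function name read_old_pyc and the
sample magic bytes are made up for illustration; they are not taken from
import.c.

```python
import marshal
import struct


def read_old_pyc(data):
    """Split a byte string laid out as: magic, mtime, marshaled code.

    Assumes the layout described in the post: 4 bytes of magic,
    a 4-byte little-endian modification time, then the code object.
    """
    if len(data) < 8:
        raise ValueError("truncated .pyc header")
    magic = data[:4]
    (mtime,) = struct.unpack("<I", data[4:8])
    code = marshal.loads(data[8:])
    return magic, mtime, code


# Build a sample in memory instead of reading a real file.
# The magic value below is a placeholder, not a real Python magic number.
sample_code = compile("x = 1 + 1", "<sample>", "exec")
sample = (
    b"\x01\x02\x03\x04"
    + struct.pack("<I", 1209133753)
    + marshal.dumps(sample_code)
)
magic, mtime, code = read_old_pyc(sample)
```
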
>
> The problem is that this format is *too* simple. It can't be changed, nor
> can it accommodate other fields if desired. I propose using a more flexible
> .pyc format (resembling RIFF files with multiple levels). The layout would
> be as follows:
>
> - A file contains a sequence of sections.
> - A section has an identifier (4 bytes, usually ASCII letters), followed
> by its size (4 bytes, not counting the section identifier nor the size
> itself), followed by the actual section content.
> - The layout inside each section is arbitrary, but it's suggested to use
> the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes,
> actual value)
>
> The outer section is called "PYCO" (from Python Code, or a contraction of
> pyc+pyo) and contains at least 4 subsections:
>
> - "VERS": import's "MAGIC number", now seen as a "code version number" (4
> bytes, same format as before)
> - "DATE": last modification time of source file (4 bytes, same format as
> before)
> - "COFL": the code.co_flags attribute (4 bytes)
> - "CODE": the marshaled code object
>
>          4 bytes                 4 bytes
> +-----.-----.-----.-----+-----.-----.-----.-----+
> | "P" | "Y" | "C" | "O" | size of whole section |
> +-----------------------+-----------------------+
>
> +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
> | "V" | "E" | "R" | "S" |           4           | import "MAGIC number" |
> +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
> | "D" | "A" | "T" | "E" |           4           | source file st_mtime  |
> +-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
> | "C" | "O" | "F" | "L" |           4           |     code.co_flags     |
> +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
> | "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
> +-----.-----.-----.-----+-----.-----.-----.-----+    ...   ...   ...    |
>                                                 |    ...   ...   ...    |
>                                                 +-----------------------+
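
The layout in the diagram could be produced roughly as follows. This is only
a sketch under the assumptions stated in the post (4-byte identifiers, 4-byte
little-endian sizes that exclude the 8-byte header). The helper names
make_section and build_pyco are hypothetical, and the magic bytes are a
placeholder.

```python
import marshal
import struct


def make_section(ident, payload):
    """Return identifier (4 bytes) + little-endian size (4 bytes) + payload."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    return ident + struct.pack("<I", len(payload)) + payload


def build_pyco(magic, mtime, code):
    """Assemble one PYCO section holding VERS, DATE, COFL and CODE."""
    body = b"".join([
        make_section(b"VERS", magic),
        make_section(b"DATE", struct.pack("<I", mtime)),
        make_section(b"COFL", struct.pack("<I", code.co_flags)),
        # CODE is written last, after the small fixed-size subsections.
        make_section(b"CODE", marshal.dumps(code)),
    ])
    return make_section(b"PYCO", body)


code = compile("x = 1 + 1", "<sample>", "exec")
blob = build_pyco(b"\x01\x02\x03\x04", 1209133753, code)
```

Because the same helper wraps both the outer section and the subsections, the
nesting in the diagram falls out of two levels of make_section calls.
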
>
> New sections (or subsections inside a section) can be defined in the
> future. No knowledge of a section's meaning or structure is
> required to read the file; readers can safely skip over sections they
> don't understand, and never lose synchronization.
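
Skipping unknown sections without losing sync might look like the sketch
below. It assumes the identifier/size/payload layout described above.
iter_sections and section are hypothetical helper names, and "XTRA" is an
invented identifier standing in for some future section.

```python
import struct


def iter_sections(data):
    """Yield (identifier, payload) for each section in a byte string."""
    pos = 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated section header")
        ident = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        end = pos + 8 + size
        if end > len(data):
            raise ValueError("section runs past the end of the data")
        yield ident, data[pos + 8:end]
        # The size field alone locates the next section, so the reader
        # needs no knowledge of what the payload means.
        pos = end


def section(ident, payload):
    """Build one section for the sample data."""
    return ident + struct.pack("<I", len(payload)) + payload


body = (
    section(b"VERS", b"\x01\x02\x03\x04")
    + section(b"XTRA", b"ignored!")
    + section(b"DATE", struct.pack("<I", 1209133753))
)
known = {
    ident: payload
    for ident, payload in iter_sections(body)
    if ident in (b"VERS", b"DATE")
}
```
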
>
> Compared with the current format, it has an overhead of 44 bytes.
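
The 44-byte figure can be checked with a little arithmetic, assuming the four
subsections shown in the diagram and the 8-byte fixed part of the current
format:

```python
HEADER = 4 + 4                  # identifier + size field of any section

old_fixed = 4 + 4               # current format: magic + mtime

new_fixed = (
    HEADER                      # outer "PYCO" header
    + 3 * (HEADER + 4)          # "VERS", "DATE", "COFL", each with a 4-byte value
    + HEADER                    # "CODE" header, before the marshaled bytes
)

overhead = new_fixed - old_fixed
```
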
>
> The format above can replace the current format used for .pyc/.pyo files
> (but see below). Of course it's totally incompatible with the old format.
> Apart from changing every place where .pyc files are read or written in
> the Python sources (not so many, I've identified all of them), 3rd party
> libraries and tools using the old format would have to be updated. Perhaps
> a new module should be provided to read and write pyc files.
> Anyway the change is "safe", in the sense that any old code expecting the
> MAGIC number in the first 4 bytes will reject the new format as invalid
> and not process it.
> Due to this incompatibility, this should be aimed at Python 3.x; I hope we
> are in time to implement this for 3.0?
>
>
> A step further:
>
> Currently, the generated code object depends on the Python version and the
> optimize flag; it used to depend on the Unicode flag too, but that's not
> the case for Python 3.
> The Python version determines the base MAGIC number; the Unicode flag
> increments that number by 1; the optimize flag determines the file
> extension used (.pyc/.pyo).
> With this new format, there is no need to use two different extensions
> anymore: all of this can be gathered from the attributes above, so several
> variants of the same code object can be stored in a single file. The
> importer can choose which one to load based on those attributes. The
> selection can be made rather quickly: only the relevant attributes actually
> have to be read; all other subsections can be entirely skipped without
> further parsing.
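
One way an importer could pick among several variants stored in one file is
sketched below. The post does not say how the optimize flag would be encoded,
so this sketch selects on "VERS" only. pick_variant and the other helper
names are hypothetical, and the sample variants omit "DATE" and "COFL" for
brevity.

```python
import struct


def iter_sections(data):
    """Yield (identifier, payload) pairs; stop at a truncated header."""
    pos = 0
    while pos + 8 <= len(data):
        ident = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        yield ident, data[pos + 8:pos + 8 + size]
        pos += 8 + size


def pick_variant(data, wanted_vers):
    """Return the CODE payload of the first PYCO variant whose VERS matches."""
    for ident, body in iter_sections(data):
        if ident != b"PYCO":
            continue  # unknown top-level section: skip it
        # The CODE bytes are sliced here but never unmarshaled.
        fields = dict(iter_sections(body))
        if fields.get(b"VERS") == wanted_vers:
            return fields.get(b"CODE")
    return None


def section(ident, payload):
    """Build one section for the sample data."""
    return ident + struct.pack("<I", len(payload)) + payload


def variant(vers, code_bytes):
    """Build a reduced PYCO variant with only VERS and CODE."""
    return section(b"PYCO", section(b"VERS", vers) + section(b"CODE", code_bytes))


# Placeholder version tags and payloads, not real magic numbers or bytecode.
data = variant(b"AAAA", b"first") + variant(b"BBBB", b"second")
chosen = pick_variant(data, b"BBBB")
```
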
>
>
> Some issues:
>
> - endianness: .pyc files currently store the modification time and magic
> number in little-endian; probably one should just stick to it.
> - making the size of all sections a multiple of 4 may be a good idea, so
> marshaled code should be padded with up to 3 NUL bytes at the end.
> - section ordering, and subsection ordering inside a section: should not
> be relevant; what if one can't seek to an earlier part of the file? (Ok,
> unlikely, but currently import.c appears to handle such cases). If "CODE"
> comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind
> the file to read the code section. The easy fix is to forbid that
> situation: "CODE" must come after all of those subsections.
> - The co_flags attribute of code objects is now externally visible; future
> Python versions should not redefine those flags.
> - There is no provision for explicit attribute types: "VERS" is a number,
> "CODE" is a marshaled code object... The reader has to *know* that
> (although it can completely skip over unknown attributes). No string
> attributes were defined (nor required). For the *current* needs, it's
> enough as it is. But perhaps in the future this will prove to be a shortcoming,
> and the .pyc format has to be changed *again*, and I'd hate that.
> - Perhaps the source modification date should be stored in a more portable
> way?
> - a naming problem: currently, the code version number defined in import.c
> is called "MAGIC", and is written at the very beginning of the file. It
> identifies the file as having a valid code object. In the proposed format,
> the file will begin with the letters "PYCO" instead, and the current magic
> number is buried inside a subsection... it's not a "magic" anymore, just a
> version number, and the "magic" in the sense used by file(1) would be the
> 4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it
> seems too drastic.
> - 32 bits should be enough for all sizes (and 640k should be enough for
> all people...)
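
The padding idea from the list above (rounding each section up to a multiple
of 4 with NUL bytes) could be sketched as follows. The helper name
pad_to_multiple_of_4 is hypothetical. The sketch also assumes the padding
would be counted in the section size, which the post leaves open.

```python
import marshal


def pad_to_multiple_of_4(payload):
    """Append 0 to 3 NUL bytes so the result length is a multiple of 4."""
    return payload + b"\x00" * (-len(payload) % 4)


code = compile("x = 1 + 1", "<sample>", "exec")
padded = pad_to_multiple_of_4(marshal.dumps(code))

# marshal.loads ignores extra bytes after the object, so the padding
# does not get in the way of loading the code object back.
restored = marshal.loads(padded)
```
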
>
> Implementation:
>
> I don't have a complete implementation yet, but if this format is approved
> (as is or with any changes) I could submit a patch. I've made a small but
> incompatible modification to the currently used .pyc format in order to
> detect all the places this change would affect, and there are actually
> not that many.
>
> --
> Gabriel Genellina
>
--
--Guido van Rossum (home page: http://www.python.org/~guido/)