Hello

(Sorry if you get this twice, I can't see my original post from gmane)

I want to propose a new .pyc file format. Currently .pyc files use a very simple format:

- MAGIC number (4 bytes, little-endian)
- last modification time of source file (4 bytes, little-endian)
- code object (marshaled)

The problem is that this format is *too* simple. It can't be changed, nor can it accommodate other fields if desired. I propose using a more flexible .pyc format (resembling RIFF files, with multiple levels). The layout would be as follows:

- A file contains a sequence of sections.
- A section has an identifier (4 bytes, usually ASCII letters), followed by its size (4 bytes, not counting the section identifier nor the size itself), followed by the actual section content.
- The layout inside each section is arbitrary, but it's suggested to use the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes, actual value)

The outer section is called "PYCO" (from Python Code, or a contraction of pyc+pyo) and contains at least 4 subsections:

- "VERS": import's "MAGIC number", now seen as a "code version number" (4 bytes, same format as before)
- "DATE": last modification time of source file (4 bytes, same format as before)
- "COFL": the code.co_flags attribute (4 bytes)
- "CODE": the marshaled code object

         4 bytes                 4 bytes
+-----.-----.-----.-----+-----.-----.-----.-----+
| "P" | "Y" | "C" | "O" | size of whole section |
+-----------------------+-----------------------+

+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "V" | "E" | "R" | "S" |           4           | import "MAGIC number" |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "D" | "A" | "T" | "E" |           4           | source file st_mtime  |
+-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
| "C" | "O" | "F" | "L" |           4           |     code.co_flags     |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
+-----.-----.-----.-----+-----.-----.-----.-----+    ...    ...    ...  |
                                                |    ...    ...    ...  |
                                                +-----------------------+

New sections (or subsections inside a section) can be defined in the future. No implied knowledge of a section's meaning or structure is required to read the file; readers can safely skip over sections they don't understand, and never lose synchronization.

Compared with the current format, it has an overhead of 44 bytes: five 8-byte section headers (PYCO, VERS, DATE, COFL, CODE) plus the 4-byte COFL value.

The format above can replace the current format used for .pyc/.pyo files (but see below). Of course it's totally incompatible with the old format. Apart from changing every place where .pyc files are read or written in the Python sources (not so many, I've identified all of them), 3rd party libraries and tools using the old format would have to be updated. Perhaps a new module should be provided to read and write pyc files. Anyway the change is "safe", in the sense that any old code expecting the MAGIC number in the first 4 bytes will reject the new format as invalid and not process it.

Due to this incompatibility, this should be aimed at Python 3.x; I hope we are in time to implement this for 3.0?

A step further:

Currently, the generated code object depends on the Python version and the optimize flag; it used to depend on the Unicode flag too, but that's not the case for Python 3. The Python version determines the base MAGIC number; the Unicode flag increments that number by 1; the optimize flag determines the file extension used (.pyc/.pyo).

With this new format, there is no need to use two different extensions anymore: all of this can be gathered from the attributes above, so several variants of the same code object can be stored in a single file. The importer can choose which one to load based on those attributes. The selection can be made rather quickly: only the relevant attributes actually have to be read; all other subsections can be skipped entirely without further parsing.
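
To make the layout concrete, here is a minimal Python sketch of the section format and of picking one variant out of several stored in the same file. It is only an illustration under stated assumptions, not the import.c change: the helper names (pack_section, iter_sections, write_pyco, find_variant) are hypothetical, sizes are assumed to be little-endian, no padding is applied, and variants are matched on "VERS" alone.

```python
import marshal
import struct

HEADER = struct.Struct("<4sI")  # 4-byte identifier + 4-byte little-endian size
U32 = struct.Struct("<I")       # 4-byte little-endian unsigned value


def pack_section(ident: bytes, payload: bytes) -> bytes:
    """Return identifier + size + payload; the size excludes the 8-byte header."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    return HEADER.pack(ident, len(payload)) + payload


def iter_sections(data: bytes):
    """Yield (identifier, payload) pairs, advancing by the stored size only."""
    pos = 0
    while pos < len(data):
        ident, size = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        if pos + size > len(data):
            raise ValueError("truncated section %r" % ident)
        yield ident, data[pos:pos + size]
        pos += size


def write_pyco(magic: int, mtime: int, code) -> bytes:
    """Build one "PYCO" section holding the four proposed subsections."""
    body = b"".join([
        pack_section(b"VERS", U32.pack(magic)),
        pack_section(b"DATE", U32.pack(mtime)),
        pack_section(b"COFL", U32.pack(code.co_flags)),
        pack_section(b"CODE", marshal.dumps(code)),
    ])
    return pack_section(b"PYCO", body)


def find_variant(data: bytes, magic: int):
    """Return the subsections of the first "PYCO" whose VERS equals magic.

    Unknown top-level sections are skipped; CODE is never unmarshaled here.
    """
    for ident, payload in iter_sections(data):
        if ident != b"PYCO":
            continue
        fields = dict(iter_sections(payload))
        if fields.get(b"VERS") == U32.pack(magic):
            return fields
    return None
```

A caller would unmarshal the code only after a variant matched, for example marshal.loads(fields[b"CODE"]). Matching on raw bytes keeps the selection cheap, since nothing but the 4-byte "VERS" value is compared.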

Some issues:

- endianness: .pyc files currently store the modification time and magic number in little-endian; probably one should just stick to it.

- making the size of all sections a multiple of 4 may be a good idea, so marshaled code should be padded with up to 3 NUL bytes at the end.

- section ordering, and subsection ordering inside a section: should not be relevant; but what if one can't seek to an earlier part of the file? (Ok, unlikely, but currently import.c appears to handle such cases). If "CODE" comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind the file to read the code section. The easy fix is to forbid that situation: "CODE" must come after all of those subsections.

- The co_flags attribute of code objects is now externally visible; future Python versions should not redefine those flags.

- There is no provision for explicit attribute types: "VERS" is a number, "CODE" is a marshaled code object... The reader has to *know* that (although it can completely skip over unknown attributes). No string attributes were defined (nor required). For the *current* needs, it's enough as it is. But perhaps in the future this turns out to be a shortcoming, and the .pyc format has to be changed *again*, and I'd hate that.

- Perhaps the source modification date should be stored in a more portable way?

- a naming problem: currently, the code version number defined in import.c is called "MAGIC", and is written at the very beginning of the file. It identifies the file as having a valid code object. In the proposed format, the file will begin with the letters "PYCO" instead, and the current magic number is buried inside a subsection... it's not a "magic" anymore, just a version number, and the "magic" in the sense used by file(1) would be the 4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it seems too drastic.

- 32 bits should be enough for all sizes (and 640k should be enough for all people...)
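
Two of the points above, padding to a multiple of 4 and requiring "CODE" to come last so that no seeking is needed, could look roughly like this in Python. This is a sketch under assumptions, not a proposed implementation: pad4, read_exact and read_subsections_forward are hypothetical names, the stored size is assumed to include the padding, and the stream is only ever read forward.

```python
import io
import struct

HEADER = struct.Struct("<4sI")  # 4-byte identifier + 4-byte little-endian size
REQUIRED_BEFORE_CODE = {b"VERS", b"DATE", b"COFL"}


def pad4(payload: bytes) -> bytes:
    """Append up to 3 NUL bytes so the length becomes a multiple of 4."""
    return payload + b"\0" * (-len(payload) % 4)


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes or fail; never seeks."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of file")
    return data


def read_subsections_forward(stream, total_size: int) -> dict:
    """Read the subsections of one section strictly front to back.

    Raises ValueError if "CODE" shows up before VERS, DATE and COFL,
    which is the ordering rule suggested above.
    """
    fields = {}
    remaining = total_size
    while remaining > 0:
        ident, size = HEADER.unpack(read_exact(stream, HEADER.size))
        payload = read_exact(stream, size)
        remaining -= HEADER.size + size
        if ident == b"CODE":
            missing = REQUIRED_BEFORE_CODE.difference(fields)
            if missing:
                raise ValueError("CODE appears before %s" % sorted(missing))
        fields[ident] = payload
    return fields
```

Trailing NUL padding should be harmless for the "CODE" payload, because marshal.loads ignores extra bytes after the object it reads. Wrapping the file in io.BytesIO, or passing any object with a read() method, is enough to try the reader.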

Implementation:

I don't have a complete implementation yet, but if this format is approved (as is or with any changes) I could submit a patch. I've made a small but incompatible modification to the currently used .pyc format in order to find every place this change would affect, and there are actually not so many.

--
Gabriel Genellina