On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina wrote:
Hello
(Sorry if you get this twice, I can't see my original post from gmane)
I want to propose a new .pyc file format. Currently .pyc files use a very simple format:
- MAGIC number (4 bytes, little-endian)
- last modification time of source file (4 bytes, little-endian)
- code object (marshaled)
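As a rough illustration of the three-field layout just listed, here is a minimal sketch that splits such a byte string into its parts. The function name `split_simple_pyc` and the `EXAMPLE_MAGIC` value are made up for the example, and the sample is built in memory rather than read from a real file. Current CPython versions write a longer header, so this is not a reader for files produced by a modern interpreter.

```python
import marshal
import struct

# Placeholder magic for illustration only; real values depend on the interpreter.
EXAMPLE_MAGIC = b"\x00\x00\r\n"


def split_simple_pyc(data):
    """Split bytes laid out as: 4-byte magic, 4-byte mtime, marshaled code."""
    if len(data) < 8:
        raise ValueError("need at least 8 header bytes")
    magic = data[:4]
    (mtime,) = struct.unpack("<I", data[4:8])  # little-endian unsigned 32-bit
    code = marshal.loads(data[8:])
    return magic, mtime, code


# Build a sample in memory instead of reading a real .pyc from disk.
sample_code = compile("answer = 6 * 7", "<example>", "exec")
blob = EXAMPLE_MAGIC + struct.pack("<I", 1209095040) + marshal.dumps(sample_code)

magic, mtime, code = split_simple_pyc(blob)
namespace = {}
exec(code, namespace)  # run the unmarshaled module-level code object
```

Note that the reader has to know the field order in advance; nothing in the bytes themselves says where one field ends and the next begins.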
The problem is that this format is *too* simple. It can't be changed, nor can it accommodate other fields if desired. I propose using a more flexible .pyc format (resembling RIFF files with multiple levels). The layout would be as follows:
- A file contains a sequence of sections.
- A section has an identifier (4 bytes, usually ASCII letters), followed by its size (4 bytes, not counting the section identifier nor the size itself), followed by the actual section content.
- The layout inside each section is arbitrary, but it's suggested to use the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes, actual value)
The outer section is called "PYCO" (from Python Code, or a contraction of pyc+pyo) and contains at least 4 subsections:
- "VERS": import's "MAGIC number", now seen as a "code version number" (4 bytes, same format as before)
- "DATE": last modification time of source file (4 bytes, same format as before)
- "COFL": the code.co_flags attribute (4 bytes)
- "CODE": the marshaled code object
         4 bytes                 4 bytes
+-----.-----.-----.-----+-----.-----.-----.-----+
| "P" | "Y" | "C" | "O" | size of whole section |
+-----------------------+-----------------------+

+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "V" | "E" | "R" | "S" |           4           | import "MAGIC number" |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "D" | "A" | "T" | "E" |           4           | source file st_mtime  |
+-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
| "C" | "O" | "F" | "L" |           4           | code.co_flags         |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
+-----.-----.-----.-----+-----.-----.-----.-----+   ...   ...   ...     |
                                                |   ...   ...   ...     |
                                                +-----------------------+
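The layout in the diagram could be produced along these lines. This is only a sketch: the helper names `make_section` and `build_pyco` are invented here, and the size fields are assumed to be little-endian unsigned 32-bit integers, which the proposal does not state explicitly. The magic value passed in is a placeholder.

```python
import marshal
import struct


def make_section(ident, content):
    """Return identifier + size + content; size excludes the 8 header bytes."""
    if len(ident) != 4:
        raise ValueError("section identifier must be exactly 4 bytes")
    # Assumption: sizes are stored little-endian, like the existing fields.
    return ident + struct.pack("<I", len(content)) + content


def build_pyco(magic, mtime, code):
    """Wrap the four proposed subsections in an outer "PYCO" section."""
    body = (
        make_section(b"VERS", magic)
        + make_section(b"DATE", struct.pack("<I", mtime))
        + make_section(b"COFL", struct.pack("<I", code.co_flags))
        + make_section(b"CODE", marshal.dumps(code))
    )
    return make_section(b"PYCO", body)


# Placeholder magic and timestamp, for illustration only.
example_code = compile("pass", "<example>", "exec")
example = build_pyco(b"\x00\x00\r\n", 1209095040, example_code)
```

Because every level uses the same (identifier, size, content) shape, the outer "PYCO" wrapper is built with the same helper as its subsections.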
New sections (or subsections inside a section) can be defined in the future. No implied knowledge of a section's meaning or structure is required to read the file; readers can safely skip over sections they don't understand, and never lose synchronization.
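To illustrate the skipping behaviour, here is a minimal reader sketch. The names `iter_sections`, `make_section`, and `KNOWN`, and the "XTRA" identifier, are hypothetical, and sizes are again assumed to be little-endian. The reader walks the sections using only the size field, so an identifier it has never seen costs it nothing.

```python
import struct


def iter_sections(data):
    """Yield (identifier, content) pairs from back-to-back sections."""
    offset = 0
    while offset < len(data):
        if len(data) - offset < 8:
            raise ValueError("truncated section header")
        ident = data[offset:offset + 4]
        (size,) = struct.unpack_from("<I", data, offset + 4)
        start = offset + 8
        end = start + size
        if end > len(data):
            raise ValueError("section runs past end of data")
        yield ident, data[start:end]
        offset = end  # jump straight to the next section header


def make_section(ident, content):
    """Helper used only to build the sample data below."""
    return ident + struct.pack("<I", len(content)) + content


# Identifiers this particular reader understands; "XTRA" is not among them.
KNOWN = {b"VERS", b"DATE"}

body = (
    make_section(b"VERS", b"\x00\x00\r\n")
    + make_section(b"XTRA", b"ignored by this reader")
    + make_section(b"DATE", struct.pack("<I", 1209095040))
)

understood = {
    ident: content for ident, content in iter_sections(body) if ident in KNOWN
}
```

The same function can be applied to the content of a section to walk its subsections, which is what makes the nested layout uniform.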
While I think having a more flexible format is important to allow for modifying the AST before bytecode write-out, I don't know if it needs to go quite this far. The magic number, timestamp, and marshaled code are not about to go away. Thus the current format can basically stay, but we can add a flexible addition between the timestamp and the code object. This saves some memory and simplifies the format slightly, since the guaranteed requirements for .pyc regeneration can be checked quickly (e.g., a quick 8-byte read off the file will tell whether the magic number or timestamp is out of date, without having to read the entire header for these two critical sanity checks).

The thing I think the new format needs to easily support is not just the removal of .pyo, but also user-defined AST transformations prior to .pyc generation. Now that this can be done at the Python level, some people might start coming up with compiler optimizations that change semantics. That means there needs to be a clear way to register an AST transformation as having occurred.

I am just worried that the 4 bytes for labeling something won't be enough. We could say that all optimizations are labeled "OPTO" and that what format is used is specified in the value, but that means supporting multiple instances of the same label in the header (which I think is fine since this is going to be read linearly off disk 99% of the time).

So I guess this boils down to: I think we don't need to label what MUST be in the header, and we should allow for multiple instances of the same label (whether this is always true or we use some way to flag that through capitalization).

-Brett
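A rough sketch of the alternative described in this reply: the unlabeled magic number and timestamp stay at the front, followed by a flexible area of labeled entries, followed by the code. Several details here are assumptions, not part of either message: the flexible area is prefixed with its own 4-byte little-endian size so a reader knows where the code starts, and the names `is_stale`, `read_extras`, and `make_entry` and the "OPTO" values are invented for the example. Entries are collected in a list rather than a dict so that repeated labels survive.

```python
import struct


def is_stale(first8, expected_magic, source_mtime):
    """Check the two mandatory fields using only the first 8 bytes."""
    magic = first8[:4]
    (mtime,) = struct.unpack("<I", first8[4:8])
    return magic != expected_magic or mtime != source_mtime


def read_extras(data):
    """Return ([(label, value), ...], offset where the code begins)."""
    # Assumption: a 4-byte size of the whole flexible area follows the timestamp.
    (extras_size,) = struct.unpack_from("<I", data, 8)
    pos = 12
    end = pos + extras_size
    if end > len(data):
        raise ValueError("flexible area runs past end of data")
    entries = []
    while pos < end:
        label = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        entries.append((label, data[pos + 8:pos + 8 + size]))
        pos += 8 + size
    return entries, end


def make_entry(label, value):
    """Helper used only to build the sample data below."""
    return label + struct.pack("<I", len(value)) + value


MAGIC = b"\x00\x00\r\n"  # placeholder magic, for illustration only
extras = make_entry(b"OPTO", b"constant-folding") + make_entry(b"OPTO", b"strip-asserts")
blob = (
    MAGIC
    + struct.pack("<I", 1209095040)
    + struct.pack("<I", len(extras))
    + extras
    + b"<marshaled code>"  # stand-in for the real marshaled code object
)

entries, code_offset = read_extras(blob)
```

With this shape, the staleness check never touches the flexible area, and a linear read of the entries handles duplicate "OPTO" labels without special cases.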