[Python-ideas] A new .pyc file format

Mon Apr 28 16:30:52 CEST 2008

I'll play my part here and toss out some ideas we could use this for.
I'm not really advocating it yet, but I'll say I am +0. Either way,
if we did adopt it, here are some possible uses:

We could break up the code into multiple sections and allow  
alternatives for sections with different versions. Different versions  
could be used for a few different things, including different  
optimization levels, supporting multiple bytecode versions, or  
storing both pre and post AST transformations of the code.

Meta-ish data like docstrings could be pulled out of the code objects
and injected in non-optimized modes. This might also include the original
source for the code, which could be helpful when you change the source
while code is still running and tracebacks then show you the wrong lines.

Bookkeeping data could sit in some sections, detailing things like  
call stats (average call time, frequency, etc) and other information  
that could be useful for optimizers and JIT compilers like psyco.

I am not saying any of these are good ideas or good uses of the  
original idea. I'm just giving thought fodder for the hypothetical.

On Apr 25, 2008, at 10:29 AM, Guido van Rossum wrote:

> I think this is a reasonable thing to do, but I'd like to hear more
> motivation. Maybe you can write it all up in PEP format and add a
> section that explains what features we want from .pyc files?
>
> I like that this would get rid of .pyo files BTW.
>
> --Guido
>
> On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina
> <gagsl-py2 at yahoo.com.ar> wrote:
>> Hello
>>
>>  (Sorry if you get this twice, I can't see my original post from gmane)
>>
>>  I want to propose a new .pyc file format. Currently .pyc files use a very
>>  simple format:
>>
>>  - MAGIC number (4 bytes, little-endian)
>>  - last modification time of source file (4 bytes, little-endian)
>>  - code object (marshaled)
>>
>>  The problem is that this format is *too* simple. It can't be changed, nor
>>  can it accommodate other fields if desired. I propose using a more flexible
>>  .pyc format (resembling RIFF files with multiple levels). The layout would
>>  be as follows:
>>
>>  - A file contains a sequence of sections.
>>  - A section has an identifier (4 bytes, usually ASCII letters), followed
>>  by its size (4 bytes, not counting the section identifier nor the size
>>  itself), followed by the actual section content.
>>  - The layout inside each section is arbitrary, but it's suggested to use
>>  the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes,
>>  actual value)
>>
>>  The outer section is called "PYCO" (from Python Code, or a contraction of
>>  pyc+pyo) and contains at least 4 subsections:
>>
>>   - "VERS": import's "MAGIC number", now seen as a "code version number" (4
>>  bytes, same format as before)
>>   - "DATE": last modification time of source file (4 bytes, same format as
>>  before)
>>   - "COFL": the code.co_flags attribute (4 bytes)
>>   - "CODE": the marshaled code object
>>
>>             4 bytes                 4 bytes
>>    +-----.-----.-----.-----+-----.-----.-----.-----+
>>    | "P" | "Y" | "C" | "O" | size of whole section |
>>    +-----------------------+-----------------------+
>>
>>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>>    | "V" | "E" | "R" | "S" |   4                   | import "MAGIC number" |
>>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>>    | "D" | "A" | "T" | "E" |   4                   | source file st_mtime  |
>>    +-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
>>    | "C" | "O" | "F" | "L" |   4                   | code.co_flags         |
>>    +-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
>>    | "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
>>    +-----.-----.-----.-----+-----.-----.-----.-----+     ... ... ...       |
>>                                                    |     ... ... ...       |
>>                                                    +-----------------------+
>>
>>  New sections -or subsections inside a section- can be defined in the
>>  future. No implied knowledge of section meanings or their structure is
>>  required to read the file; readers can safely skip over sections they
>>  don't understand, and never lose synchronization.
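
To make the layout above concrete, here is a minimal sketch in Python. It is not taken from any patch: the helper names (chunk, iter_chunks, build_pyco, load_code) and the "XTRA" section are made up for illustration, and it assumes little-endian 4-byte sizes and no padding. It builds one "PYCO" section, checks the overhead against the old 8-byte header, and shows a reader stepping over a subsection it does not know.

```python
import importlib.util
import marshal
import struct

def chunk(ident, payload):
    """One section: 4-byte identifier, 4-byte little-endian size, content."""
    assert len(ident) == 4
    return ident + struct.pack("<I", len(payload)) + payload

def iter_chunks(data):
    """Yield (identifier, content) for each section found in data."""
    pos = 0
    while pos + 8 <= len(data):
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield ident, data[pos + 8:pos + 8 + size]
        pos += 8 + size  # the size alone says where the next section starts

def build_pyco(magic, mtime, flags, marshaled, extra=b""):
    """Assemble the outer PYCO section; CODE is written last."""
    body = (chunk(b"VERS", magic)
            + chunk(b"DATE", struct.pack("<I", mtime))
            + chunk(b"COFL", struct.pack("<I", flags))
            + extra
            + chunk(b"CODE", marshaled))
    return chunk(b"PYCO", body)

def load_code(data):
    """Return the code object from the first PYCO section, ignoring the rest."""
    for ident, body in iter_chunks(data):
        if ident == b"PYCO":
            fields = dict(iter_chunks(body))  # unknown identifiers are harmless
            return marshal.loads(fields[b"CODE"])
    raise ValueError("no PYCO section found")

magic = importlib.util.MAGIC_NUMBER   # the running interpreter's 4-byte magic
mtime = 1209393052                    # arbitrary example timestamp
code = compile("answer = 6 * 7", "<example>", "exec")
marshaled = marshal.dumps(code)

old_style = magic + struct.pack("<I", mtime) + marshaled
new_style = build_pyco(magic, mtime, code.co_flags, marshaled)
overhead = len(new_style) - len(old_style)

# A file carrying a hypothetical extra subsection still loads fine.
with_unknown = build_pyco(magic, mtime, code.co_flags, marshaled,
                          extra=chunk(b"XTRA", b"ignored"))
namespace = {}
exec(load_code(with_unknown), namespace)
```

Counting it out: the PYCO header is 8 bytes, VERS, DATE and COFL take 12 bytes each, and the CODE header takes 8, so 52 bytes precede the marshaled code instead of the current 8.
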
>>
>>  Compared with the current format, it has an overhead of 44 bytes.
>>
>>  The format above can replace the current format used for .pyc/.pyo files
>>  (but see below). Of course it's totally incompatible with the old format.
>>  Apart from changing every place where .pyc files are read or written in
>>  the Python sources (not so many, I've identified all of them), 3rd party
>>  libraries and tools using the old format would have to be updated. Perhaps
>>  a new module should be provided to read and write pyc files.
>>  Anyway the change is "safe", in the sense that any old code expecting the
>>  MAGIC number in the first 4 bytes will reject the new format as invalid
>>  and not process it.
>>  Due to this incompatibility, this should be aimed at Python 3.x; I hope we
>>  are in time to implement this for 3.0?
>>
>>
>>  A step further:
>>
>>  Currently, the generated code object depends on the Python version and the
>>  optimize flag; it used to depend on the Unicode flag too, but that's not
>>  the case for Python 3.
>>  The Python version determines the base MAGIC number; the Unicode flag
>>  increments that number by 1; the optimize flag determines the file
>>  extension used (.pyc/.pyo).
>>  With this new format, there is no need to use two different extensions
>>  anymore: all of this can be gathered from the attributes above, so several
>>  variants of the same code object can be stored in a single file. The
>>  importer can choose which one to load based on those attributes. The
>>  selection can be made rather quickly, since only the relevant attributes
>>  actually have to be read; all other subsections can be entirely skipped
>>  without further parsing.
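
A rough sketch of how such a selection could work, under assumptions the proposal does not spell out: each variant is stored as its own top-level "PYCO" section, "CODE" comes after "VERS" and "COFL", and the importer matches on exact equality of both. The function names, the version bytes and the flag values below are hypothetical, and the payloads are dummy bytes rather than real marshaled code.

```python
import io
import struct

def chunk(ident, payload):
    """One section: 4-byte identifier, 4-byte little-endian size, content."""
    return ident + struct.pack("<I", len(payload)) + payload

def read_header(f):
    """Read an (identifier, size) pair, or return None at end of file."""
    head = f.read(8)
    if len(head) < 8:
        return None
    return head[:4], struct.unpack("<I", head[4:])[0]

def find_variant(f, want_vers, want_flags):
    """Return the CODE content of the first matching PYCO section, else None."""
    while True:
        outer = read_header(f)
        if outer is None:
            return None
        ident, size = outer
        end = f.tell() + size
        if ident == b"PYCO":
            vers = flags = None
            while f.tell() < end:
                sub = read_header(f)
                if sub is None:
                    return None
                sub_id, sub_size = sub
                if sub_id == b"VERS":
                    vers = f.read(sub_size)
                elif sub_id == b"COFL":
                    (flags,) = struct.unpack("<I", f.read(sub_size))
                elif sub_id == b"CODE":
                    if vers == want_vers and flags == want_flags:
                        return f.read(sub_size)
                    break  # wrong variant: its code is never read
                else:
                    f.seek(sub_size, io.SEEK_CUR)  # skip without parsing
        f.seek(end)  # jump to the next top-level section

def variant(vers, flags, code_bytes):
    """Build one PYCO section holding a single (dummy) code variant."""
    body = (chunk(b"VERS", vers)
            + chunk(b"COFL", struct.pack("<I", flags))
            + chunk(b"CODE", code_bytes))
    return chunk(b"PYCO", body)

VERS = b"\x01\x02\r\n"  # made-up version number
data = variant(VERS, 0, b"plain") + variant(VERS, 1, b"optimized")

picked = find_variant(io.BytesIO(data), VERS, 1)
missing = find_variant(io.BytesIO(data), VERS, 2)
```

Only the 8-byte headers and the two 4-byte attributes of each variant are read before the decision is made; the code of a rejected variant is stepped over with a single seek.
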
>>
>>
>>  Some issues:
>>
>>  - endianness: .pyc files currently store the modification time and magic
>>  number in little-endian; probably one should just stick to it.
>>  - making the size of all sections multiple of 4 may be a good idea, so
>>  marshaled code should be padded with up to 3 NUL bytes at the end.
>>  - section ordering, and subsection ordering inside a section: should not
>>  be relevant; what if one can't seek to an earlier part of the file? (Ok,
>>  unlikely, but currently import.c appears to handle such cases). If "CODE"
>>  comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind
>>  the file to read the code section. The easy fix is to forbid that
>>  situation: "CODE" must come after all of those subsections.
>>  - The co_flags attribute of code objects is now externally visible; future
>>  Python versions should not redefine those flags.
>>  - There is no provision for explicit attribute types: "VERS" is a number,
>>  "CODE" is a marshaled code object... The reader has to *know* that
>>  (although it can completely skip over unknown attributes). No string
>>  attributes were defined (nor required). For the *current* needs, it's
>>  enough as it is. But perhaps in the future this turns out to be a
>>  shortcoming, and the .pyc format has to be changed *again*, and I'd hate
>>  that.
>>  - Perhaps the source modification date should be stored in a more portable
>>  way?
>>  - a naming problem: currently, the code version number defined in import.c
>>  is called "MAGIC", and is written at the very beginning of the file. It
>>  identifies the file as having a valid code object. In the proposed format,
>>  the file will begin with the letters "PYCO" instead, and the current magic
>>  number is buried inside a subsection... it's not a "magic" anymore, just a
>>  version number, and the "magic" in the sense used by file(1) would be the
>>  4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it
>>  seems too drastic.
>>  - 32 bits should be enough for all sizes (and 640k should be enough for
>>  all people...)
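
On the padding point in the list above, a small sketch of what "up to 3 NUL bytes" could look like. It assumes the size field counts the padding (the proposal does not say either way), and padded_chunk is a made-up name. It relies on marshal.loads ignoring extra bytes after the object, which the marshal documentation states.

```python
import marshal
import struct

def padded_chunk(ident, payload):
    """Section whose content is NUL-padded to a multiple of 4 bytes."""
    pad = -len(payload) % 4              # 0 to 3 bytes of padding
    payload = payload + b"\0" * pad
    return ident + struct.pack("<I", len(payload)) + payload

code = compile("answer = 6 * 7", "<example>", "exec")
section = padded_chunk(b"CODE", marshal.dumps(code))

# Read it back: the trailing NULs are simply ignored by marshal.loads.
(size,) = struct.unpack_from("<I", section, 4)
restored = marshal.loads(section[8:8 + size])
namespace = {}
exec(restored, namespace)
```

Since the identifier and the size are 4 bytes each, padding the content this way keeps every section start on a 4-byte boundary.
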
>>
>>  Implementation:
>>
>>  I don't have a complete implementation yet, but if this format is approved
>>  (as is or with any changes) I could submit a patch. I've made a small but
>>  incompatible modification to the currently used .pyc format in order to
>>  detect all the places this change would affect, and there are actually
>>  not that many.
>>
>>  --
>>  Gabriel Genellina
>>
>>  _______________________________________________
>>  Python-ideas mailing list
>>  Python-ideas at python.org
>>  http://mail.python.org/mailman/listinfo/python-ideas
>>
>
>
>
> -- 
> --Guido van Rossum (home page: http://www.python.org/~guido/)