Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)

Peter Funk
Tue, 30 May 2000 09:08:15 +0200 (MEST)

Greg Stein:
> I don't think we should have a two-byte magic value. Especially where
> those two bytes are printable, 7-bit ASCII.
> To ensure uniqueness, I think a four-byte magic should stay.

Looking at /etc/magic I see many 16-bit magic numbers kept around
from the good old days.  But you are right: choosing a four-byte magic
value would make a clash with some other file format much less likely.

> I would recommend the approach of adding opcodes into the marshal format.
> Specifically, 'V' followed by a single byte. That can only occur at the
> beginning. If it is not present, then you know that you have an old
> marshal value.

But this would not solve the problem with 8 byte versus 4 byte timestamps
in the header on 64-bit OSes.  Trent Mick pointed this out.
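To make the 'V' proposal concrete, here is a minimal Python sketch of how
a reader could tell old and new marshal values apart.  The names
VERSION_TAG and split_version() are hypothetical, and 'V' is treated as
an opaque tag byte in front of the data, not as something the marshal
module itself understands.  Note that it only looks at the marshal part
and never touches the header in front of it:

```python
import marshal

# Hypothetical tag byte from the proposal; assumed not to collide with
# any type code that can start an existing marshal value.
VERSION_TAG = b"V"


def split_version(data):
    """Return (version, payload); version 0 means an old, untagged value."""
    if len(data) >= 2 and data[:1] == VERSION_TAG:
        return data[1], data[2:]
    return 0, data


old_stream = marshal.dumps("spam")                    # value without a tag
new_stream = VERSION_TAG + bytes([1]) + old_stream    # tagged as version 1

version, payload = split_version(new_stream)
value = marshal.loads(payload)
```

An old value falls through to version 0 unchanged, so existing data
would stay readable under this scheme.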

I think the situation we have now is very unsatisfactory:  I don't
see a reasonable solution that allows us to keep the header before
the marshal block at a fixed length of 8 bytes together with a frozen
4-byte magic number.

Moving the version number into the marshal doesn't help to resolve
this conflict.  So either you have to accept a new magic on 64 bit
systems or you have to enlarge the header.
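The conflict can be sketched in a few lines of Python.  This assumes the
layout discussed in this thread (4-byte magic, then a 4-byte
little-endian timestamp, then the marshal data from byte 8 on).
EXAMPLE_MAGIC, build_header() and parse_header() are made-up names for
illustration, and the magic value is a placeholder, not a real one:

```python
import struct

# Assumed layout: 4-byte magic + 4-byte unsigned little-endian mtime.
HEADER = struct.Struct("<4sI")
EXAMPLE_MAGIC = b"XX\r\n"   # placeholder, not an actual Python magic


def build_header(magic, mtime):
    """Pack the fixed 8-byte header; fails if mtime needs more than 32 bits."""
    return HEADER.pack(magic, mtime)


def parse_header(blob):
    """Split a blob into (magic, mtime, marshal_part)."""
    magic, mtime = HEADER.unpack_from(blob, 0)
    return magic, mtime, blob[HEADER.size:]


blob = build_header(EXAMPLE_MAGIC, 959670495) + b"<marshal data>"
magic, mtime, payload = parse_header(blob)

# A timestamp that needs more than 32 bits does not fit the 4-byte field.
try:
    build_header(EXAMPLE_MAGIC, 2**32)
    fits = True
except struct.error:
    fits = False
```

Any reader written like parse_header() hardcodes the offset 8, which is
exactly the assumption that a larger header would break.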

To come up with a new proposal, the following questions should be answered:
  1. Is there really too much code out there that depends on
     the hardcoded assumption that the marshal part of a .pyc file
     starts at byte 8?  I see no further evidence for or against this.
     MAL pointed this out in 
  2. If we decide to enlarge the header, do we really need a new
     header field defining the length of the header?
     This was proposed by Christian Tismer in 
  3. The 'imp' module partly exposes the structure of a .pyc file
     through the function 'get_magic()'.  I proposed changing the signature of
     'imp.get_magic()' in an upward compatible way.  I also proposed
     adding a new function 'imp.get_version()'.  What do you think about
     this idea?
  4. Greg proposed prepending the version number to the marshal
     format.  If we do this, we definitely need a frozen way to find
     out where the marshalled code object actually starts.  This also
     has the disadvantage of making it slightly harder to come up with
     an /etc/magic definition which displays the version number of a
     .pyc file.

If we decide to move the version number into the marshal, we can
also move the .py-timestamp there.  This way the timestamp will be handled
in the same way as large integer literals.  Quoting from the docs:

"""Caveat: On machines where C's long int type has more than 32 bits
   (such as the DEC Alpha), it is possible to create plain Python
   integers that are longer than 32 bits. Since the current marshal
   module uses 32 bits to transfer plain Python integers, such values
   are silently truncated. This particularly affects the use of very
   long integer literals in Python modules -- these will be accepted
   by the parser on such machines, but will be silently be truncated
   when the module is read from the .pyc instead.
   A solution would be to refuse such literals in the parser, since
   they are inherently non-portable. Another solution would be to let
   the marshal module raise an exception when an integer value would
   be truncated. At least one of these solutions will be implemented
   in a future version."""
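The truncation described in this caveat, and the second remedy it
mentions, can be imitated with the struct module.  This is only a sketch
of the arithmetic, not the marshal implementation; truncate_to_int32()
and checked_int32() are hypothetical helper names:

```python
import struct


def truncate_to_int32(value):
    """Keep only the low 32 bits of value, read back as a signed integer."""
    low = value & 0xFFFFFFFF
    return struct.unpack("<i", struct.pack("<I", low))[0]


def checked_int32(value):
    """Alternative behaviour: raise instead of silently truncating."""
    if not -2**31 <= value < 2**31:
        raise OverflowError("integer does not fit into 32 bits")
    return value


small = truncate_to_int32(5)           # fits, so it is unchanged
wrapped = truncate_to_int32(2**32 + 5)  # high bits are lost
flipped = truncate_to_int32(2**31)      # sign bit is set by accident
```

A timestamp stored inside the marshal would go through the same path, so
it would be subject to the same 32-bit limit until one of the two
solutions is implemented.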

Should that future version be 1.6?  Changing the format of .pyc files
over and over again in the 1.x series doesn't look very attractive.

Regards, Peter