Summary of .pyc-Discussion so far (was Re: [Python-Dev] Proposal: .pyc file format change)

M.-A. Lemburg
Tue, 30 May 2000 10:10:25 +0200

Peter Funk wrote:
> Greg Stein:
> > I don't think we should have a two-byte magic value. Especially where
> > those two bytes are printable, 7-bit ASCII.
> [...]
> > To ensure uniqueness, I think a four-byte magic should stay.
> Looking at /etc/magic I see many 16-bit magic numbers kept around
> from the good old days.  But you are right: Choosing a four-byte magic
> value would make the chance of a clash with some other file format
> much less likely.

Just for the record: the current /etc/magic I have on my Linux
machine doesn't know anything about PYC or PYO files, so I
don't really see much of a problem here -- no one seems to
be interested in finding out the file type for these
files anyway ;-)

Also, I don't really get the 16-bit magic argument: we still
have a 32-bit magic number -- one with a 16-bit fixed value and
predefined ranges for the remaining 16 bits. This already is
much better than what we have now with respect to making file(1) work
on PYC files.

> > I would recommend the approach of adding opcodes into the marshal format.
> > Specifically, 'V' followed by a single byte. That can only occur at the
> > beginning. If it is not present, then you know that you have an old
> > marshal value.
> But this would not solve the problem with 8 byte versus 4 byte timestamps
> in the header on 64-bit OSes.  Trent Mick pointed this out.

The switch to 8 byte timestamps is only needed when the current
4 bytes can no longer hold the timestamp value. That will happen
in 2038...

Note that import.c writes the timestamp in 4 bytes until it
reaches an overflow situation.
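
A minimal sketch of the layout under discussion (4-byte magic, 4-byte
timestamp, marshal data starting at byte 8). It builds the bytes in
memory instead of reading a real .pyc file; DEMO_MAGIC and both helper
names are made up for illustration, and the signed 32-bit limit is my
assumption based on the 2038 date above. Later CPython releases use a
longer header, so this is not a reader for current .pyc files.

```python
import marshal
import struct

HEADER_SIZE = 8            # 4-byte magic + 4-byte timestamp
DEMO_MAGIC = b"DEMO"       # placeholder, not a real interpreter magic
MAX_TIMESTAMP = 2**31 - 1  # assumed signed 32-bit limit (reached in 2038)


def build_pyc_bytes(code, mtime, magic=DEMO_MAGIC):
    """Return magic + 32-bit little-endian mtime + marshalled code."""
    if not 0 <= mtime <= MAX_TIMESTAMP:
        raise OverflowError("timestamp does not fit in 4 bytes")
    return magic + struct.pack("<i", mtime) + marshal.dumps(code)


def split_pyc_bytes(data):
    """Split the buffer into (magic, mtime, code object)."""
    magic = data[:4]
    (mtime,) = struct.unpack("<i", data[4:HEADER_SIZE])
    code = marshal.loads(data[HEADER_SIZE:])  # marshal block starts at byte 8
    return magic, mtime, code


code = compile("answer = 6 * 7", "<demo>", "exec")
blob = build_pyc_bytes(code, mtime=959674225)  # arbitrary sample timestamp
magic, mtime, loaded = split_pyc_bytes(blob)
namespace = {}
exec(loaded, namespace)
```

Any tool which slices at a hardcoded offset of 8 keeps working as long
as the timestamp stays within those 4 bytes.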

> I think, the situation we have now, is very unsatisfactory:  I don't
> see a reasonable solution, which allows us to keep the length of the
> header before the marshal-block at a fixed length of 8 bytes together
> with a frozen 4 byte magic number.

Adding a version to the marshal format is a Good Thing --
independent of this discussion.
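
To illustrate Greg's 'V' idea, here is a rough sketch which wraps the
existing marshal module rather than changing the format itself. The
function names are hypothetical, and the sketch assumes that b"V" is not
used as a marshal type code, so an untagged value can never start with
that byte.

```python
import marshal

VERSION_TAG = b"V"  # proposed opcode; may only appear at the very beginning


def dumps_versioned(value, version):
    """Prefix the marshalled value with 'V' and a single version byte."""
    if not 0 <= version <= 255:
        raise ValueError("version must fit in one byte")
    return VERSION_TAG + bytes([version]) + marshal.dumps(value)


def loads_versioned(data):
    """Return (version, value); version is None for old, untagged data."""
    if data[:1] == VERSION_TAG:
        return data[1], marshal.loads(data[2:])
    return None, marshal.loads(data)


tagged = dumps_versioned({"a": 1}, version=3)
untagged = marshal.dumps([1, 2])
```

A reader which finds no tag knows it is looking at an old marshal value
and can fall back to the old behaviour.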

> Moving the version number into the marshal doesn't help to resolve
> this conflict.  So either you have to accept a new magic on 64 bit
> systems or you have to enlarge the header.

No you don't... please read the code: marshal only writes
8 bytes in case 4 bytes aren't enough to hold the value.

> To come up with a new proposal, the following questions should be answered:
>   1. Is there really too much code out there, which depends on
>      the hardcoded assumption, that the marshal part of a .pyc file
>      starts at byte 8?  I see no further evidence for or against this.
>      MAL pointed this out in
>      <>

I have several references in my tool collection, the import
stuff uses it, old import hooks (remember ihooks ?) also do, etc.

>   2. If we decide to enlarge the header, do we really need a new
>      header field defining the length of the header ?
>      This was proposed by Christian Tismer in
>      <>

In Py3K we can do this right (breaking things is allowed)...
and I agree with Christian that a proper file format needs
a header length field too. Basically, these values have to
be present, IMHO:

1. Magic
2. Version
3. Length of Header
4. (Header Attribute)*n
-- Start of Data ---

Header Attribute can be pretty much anything -- timestamps,
names of files or other entities, bit sizes, architecture
flags, optimization settings, etc.
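
One possible encoding of such a header, purely as a sketch: the field
widths, the b"PYHD" magic and the (tag, length, payload) attribute
encoding are all assumptions of mine, nothing that has been agreed on.
The point is that a reader can use the header length to find the start
of the data without understanding every attribute.

```python
import struct

MAGIC = b"PYHD"                 # hypothetical magic value
FIXED = struct.Struct("<4sHH")  # magic, version, total header length


def pack_header(version, attributes):
    """attributes is a list of (tag, payload); tag 0..255, payload <= 255 bytes."""
    body = b""
    for tag, payload in attributes:
        if len(payload) > 255:
            raise ValueError("attribute payload too long")
        body += bytes([tag, len(payload)]) + payload
    return FIXED.pack(MAGIC, version, FIXED.size + len(body)) + body


def unpack_header(data):
    """Return (version, attributes, remaining data after the header)."""
    magic, version, header_len = FIXED.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    attributes = []
    pos = FIXED.size
    while pos < header_len:
        tag, size = data[pos], data[pos + 1]
        attributes.append((tag, data[pos + 2:pos + 2 + size]))
        pos += 2 + size
    return version, attributes, data[header_len:]


# Tag 1: an 8-byte timestamp, tag 2: an optimization setting (both made up).
attrs = [(1, struct.pack("<q", 2**33)), (2, b"-O")]
blob = pack_header(1, attrs) + b"DATA"
version, parsed, rest = unpack_header(blob)
```

With this kind of layout an 8-byte timestamp is just another attribute,
so the 4 versus 8 byte question no longer affects where the data starts.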

>   3. The 'imp' module exposes somewhat the structure of a .pyc file
>      through the function 'get_magic()'.  I proposed changing the signature of
>      'imp.get_magic()' in an upward compatible way.  I also proposed
>      adding a new function 'imp.get_version()'.  What do you think about
>      this idea?

imp.get_magic() would have to return the proposed 32-bit value
('PY' + version byte + option byte).

I'd suggest adding additional functions which can read and write the
header given a PYCHeader object which would hold the 
values version and options.
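
Roughly what I have in mind, as a sketch only: PYCHeader is the name
used above, but the to_magic()/from_magic() methods and the sample
values are invented for illustration and are not an existing API.

```python
import struct
from dataclasses import dataclass

MAGIC_FORMAT = "<2sBB"  # 'PY' + version byte + option byte = 32 bits


@dataclass
class PYCHeader:
    version: int
    options: int

    def to_magic(self):
        """Pack the header into the proposed 4-byte magic value."""
        return struct.pack(MAGIC_FORMAT, b"PY", self.version, self.options)

    @classmethod
    def from_magic(cls, magic):
        """Parse a 4-byte magic value back into a PYCHeader."""
        prefix, version, options = struct.unpack(MAGIC_FORMAT, magic)
        if prefix != b"PY":
            raise ValueError("not a 'PY' magic value")
        return cls(version, options)


header = PYCHeader(version=16, options=1)  # sample values, chosen arbitrarily
magic = header.to_magic()
```

The first two bytes stay fixed, which is what a file(1) entry would
match on; the remaining two bytes carry the version and the options.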

>   4. Greg proposed prepending the version number to the marshal
>      format.  If we do this, we definitely need a frozen way to find
>      out, where the marshalled code object actually starts.  This has
>      also the disadvantage of making the task to come up with a /etc/magic
>      definition which displays the version number of a .pyc file slightly
>      harder.
>
> If we decide to move the version number into the marshal, we can
> also move the .py-timestamp there.  This way the timestamp will be handled
> in the same way as large integer literals.  Quoting from the docs:
> """Caveat: On machines where C's long int type has more than 32 bits
>    (such as the DEC Alpha), it is possible to create plain Python
>    integers that are longer than 32 bits. Since the current marshal
>    module uses 32 bits to transfer plain Python integers, such values
>    are silently truncated. This particularly affects the use of very
>    long integer literals in Python modules -- these will be accepted
>    by the parser on such machines, but will silently be truncated
>    when the module is read from the .pyc instead.
>    [...]
>    A solution would be to refuse such literals in the parser, since
>    they are inherently non-portable. Another solution would be to let
>    the marshal module raise an exception when an integer value would
>    be truncated. At least one of these solutions will be implemented
>    in a future version."""
> Should this be 1.6?  Changing the format of .pyc files over and over
> again in the 1.x series doesn't look very attractive.

Marc-Andre Lemburg
Python Pages: