[Python-Dev] Identifying magic prefix on Python files?

Eric S. Raymond esr@thyrsus.com
Sun, 4 Feb 2001 17:35:59 -0500


Guido van Rossum <guido@digicool.com>:
> I don't understand.  The .pyc file has a magic number.  Why is this
> incompatible with file(1)?

It isn't.  I failed to spot that this is file(1)'s problem, not
Python's; my apologies.  Nevertheless, according to Tim Peters this is
a good time for the issue to come up, because the present method of
generating the magic number is going to break after year-end.  We
might as well redesign it now.
 
> If we're going to redesign the .pyc file header, I'd propose the
> following:
> 
> (1) magic number -- for file(1), never to be changed
> 
> (2) some kind of version -- Python version, or API version, or
>     bytecode version
> 
> (3) mtime of .py file
> 
> (4) options, e.g. is this a .pyc or a .pyo
> 
> (5) size of marshalled code following
> 
> (6) marshalled code

I agree with these desiderata.  Tim has already pointed out that item
(4) needs to include a Unicode bit.
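
To make the six-item layout concrete, here is a minimal sketch of how
such a header might be packed and parsed.  Nothing here is an agreed
format: the field widths, the little-endian byte order, the flag values
and the MAGIC bytes are all assumptions chosen for illustration, and
pack_pyc/unpack_pyc are hypothetical helper names.

```python
import struct

# Hypothetical layout; every width and value below is an assumption.
MAGIC = b"PYCMAGIC"                 # stand-in for the fixed 8-byte magic, item (1)
HEADER = struct.Struct("<8sIIII")   # magic, version, mtime, options, code size

OPT_OPTIMIZED = 0x1                 # item (4): .pyo rather than .pyc
OPT_UNICODE = 0x2                   # item (4): the Unicode bit


def pack_pyc(version, mtime, options, code_bytes):
    """Prefix already-marshalled code with items (1) through (5)."""
    header = HEADER.pack(MAGIC, version, mtime, options, len(code_bytes))
    return header + code_bytes


def unpack_pyc(data):
    """Return (version, mtime, options, code_bytes), or raise ValueError."""
    if len(data) < HEADER.size:
        raise ValueError("file too short to hold a header")
    magic, version, mtime, options, size = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    body = data[HEADER.size:HEADER.size + size]
    if len(body) != size:
        raise ValueError("marshalled code is truncated")
    return version, mtime, options, body
```

Storing the size in item (5) is what lets a reader notice truncation
before handing the bytes to the unmarshaller.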

What I'd like to throw into the pot is the cleverest file signature
design I've ever seen -- PNG's.  Here's a quote from the PNG spec:

----------------------------------------------------------------------------
The first eight bytes of a PNG file always contain the following values: 

   (decimal)              137  80  78  71  13  10  26  10
   (hexadecimal)           89  50  4e  47  0d  0a  1a  0a
   (ASCII C notation)    \211   P   N   G  \r  \n \032 \n

This signature both identifies the file as a PNG file and provides for
immediate detection of common file-transfer problems. The first two
bytes distinguish PNG files on systems that expect the first two bytes
to identify the file type uniquely. The first byte is chosen as a
non-ASCII value to reduce the probability that a text file may be
misrecognized as a PNG file; also, it catches bad file transfers that
clear bit 7. Bytes two through four name the format. The CR-LF
sequence catches bad file transfers that alter newline sequences. The
control-Z character stops file display under MS-DOS. The final line
feed checks for the inverse of the CR-LF translation problem.

A decoder may further verify that the next eight bytes contain an IHDR
chunk header with the correct chunk length; this will catch bad
transfers that drop or alter null (zero) bytes.
----------------------------------------------------------------------------
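
The checks the spec describes can be sketched in a few lines of Python.
The signature bytes come from the table above; the function name
diagnose and the wording of its return values are my own, and it
assumes the caller passes at least the first ten bytes of the file.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def diagnose(head):
    """Describe what happened to the start of a supposed PNG file."""
    if head[:8] == PNG_SIGNATURE:
        return "ok"
    # 0x89 with bit 7 cleared becomes 0x09.
    if head[:4] == b"\x09PNG":
        return "bit 7 cleared"
    if head[:4] != PNG_SIGNATURE[:4]:
        return "not a PNG file"
    rest = head[4:]
    # CR-LF collapsed to a bare LF.
    if rest.startswith(b"\n\x1a\n"):
        return "CR-LF converted to LF"
    # LF expanded to CR-LF, by a converter that either respects an
    # existing CR-LF or naively expands every LF.
    if rest.startswith((b"\r\n\x1a\r\n", b"\r\r\n\x1a\r\n")):
        return "LF converted to CR-LF"
    return "damaged signature"
```

Because each kind of mangling leaves a distinct footprint, the reader
can say what went wrong with the transfer rather than just "bad magic".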

I think we ought to model Python's fixed magic-number part on this prefix.
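
For illustration only, a prefix built on the same pattern might look
like the sketch below.  The 0x8F lead byte and the letters "PYC" are
placeholders I made up, not a concrete proposal, and
has_png_style_properties is a hypothetical helper that just restates
the design rules from the PNG spec.

```python
# Hypothetical prefix: 0x8F and "PYC" are placeholders, not a proposal.
PYC_PREFIX = bytes([0x8F]) + b"PYC" + b"\r\n" + b"\x1a" + b"\n"


def has_png_style_properties(prefix):
    """Check an 8-byte prefix against the PNG signature design rules."""
    return (
        len(prefix) == 8
        and prefix[0] >= 0x80          # non-ASCII first byte, catches bit-7 stripping
        and prefix[1:4].isalpha()      # three ASCII letters naming the format
        and prefix[4:6] == b"\r\n"     # CR-LF, catches newline translation
        and prefix[6] == 0x1A          # control-Z, stops display under MS-DOS
        and prefix[7] == 0x0A          # final LF, catches the inverse translation
    )
```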
-- 
		Eric S. Raymond <http://www.tuxedo.org/~esr/>

No matter how one approaches the figures, one is forced to the rather
startling conclusion that the use of firearms in crime was very much
less when there were no controls of any sort and when anyone,
convicted criminal or lunatic, could buy any type of firearm without
restriction.  Half a century of strict controls on pistols has ended,
perversely, with a far greater use of this weapon in crime than ever
before.
        -- Colin Greenwood, in the study "Firearms Control", 1972