[Python-ideas] A new .pyc file format

gagsl-py2 at yahoo.com.ar gagsl-py2 at yahoo.com.ar
Tue Apr 29 11:26:22 CEST 2008

Replying to all posts jointly (and directly to the
list, looks like gmane doesn't like my posts on the

On Fri, 25 Apr 2008 11:29:13 -0300, Guido van Rossum  
<guido-+ZN9ApsXKcEdnm+yROfE0A at public.gmane.org> wrote:

> I think this is a reasonable thing to do, but I'd
> like to hear more motivation. Maybe you can write 
> it all up in PEP format and add a section that 
> explains what features we want from .pyc files?


> I like that this would get rid of .pyo files BTW.

Yes, both .pyc and .pyo versions could coexist in 
the same file, among other things.

On Fri, 25 Apr 2008 11:33:16 -0300, Blake Winton  
<bwinton-D8CoGe09WXY at public.gmane.org> wrote:

> As a side suggestion, the PNG spec makes the 
> capitalization of each  identifier indicate extra 
> meta-data about the section.  (See:  

> )
> For instance, an identifier that starts with a 
> capital letter means that the decoder must 
> understand this chunk to process the contents of 
> the file, whereas an identifier with a lowercase 
> first letter can safely be skipped.
> An uppercase second letter means that the 
> identifier is defined by Python, whereas a 
> lowercase second letter would indicate a 
> third-party-defined chunk.
> (The PNG spec reserves the case of the third 
> letter, and forces it to be uppercase.  The case 
> of the fourth letter indicates whether it's safe 
> to copy this chunk.  I don't think either of those
> are particularly useful to Python, and so could 
> conveniently be skipped.)
> Would this extra meta-data be useful?  I think so,
> for the "safe to ignore" flag, at least.

Yes, we can reserve the case of the last two letters
(always uppercase now) until any useful meaning 
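
A minimal sketch of how the PNG-style case flags described above could be decoded, assuming 4-letter ASCII identifiers. The function name `chunk_flags` and the tags used in the usage note are hypothetical; nothing here is part of any agreed format.

```python
def chunk_flags(tag: bytes) -> dict:
    """Decode PNG-style case flags from a 4-byte ASCII section identifier.

    Hypothetical helper: only the first two letters carry meaning here,
    the last two are treated as reserved.
    """
    if len(tag) != 4 or not tag.isalpha():
        raise ValueError("identifier must be 4 ASCII letters")
    # Bit 5 (0x20) of an ASCII letter is set for lowercase, clear for uppercase.
    return {
        # Uppercase first letter: the reader must understand this section.
        "critical": (tag[0] & 0x20) == 0,
        # Uppercase second letter: the section is defined by Python itself.
        "standard": (tag[1] & 0x20) == 0,
    }
```

With this convention, a reader that meets an unknown tag such as `b"xtRA"` could skip it safely, while an unknown `b"CODE"`-style tag would have to be treated as an error.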

On Fri, 25 Apr 2008 12:30:06 -0300, Facundo Batista  
<facundobatista-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:

> 2008/4/25, Gabriel Genellina  
> <gagsl-py2-/E1597aS9LQMlKAeRRkD2Q at public.gmane.org>:

>>  The problem is that this format is *too* simple.
>>  It can't be changed, nor can it accommodate other 
>>  fields if desired. I propose using a more 
>>  flexible

> But how do you think that these extended pyc's 
> will be used? I mean, are there use cases for this
> more complex pyc? Or will they just be more 
> complex, but with the same information as 
> before, for years, because nobody needs this 
> flexibility?

.pyc and .pyo files could be merged into a single 
file, always using the .pyc extension. This way the 
logic to locate/search the right file to load would 
be simpler (at the cost of making it more complex 
to locate the *section* inside the file that must 
be loaded!). In the past, zipimport got it wrong in 
some cases (see http://bugs.python.org/issue1346572).

Another example is python -U; it changes the magic 
number, so modules compiled in this mode are 
incompatible with modules compiled in the "normal" 
mode. If a mechanism like this proposal had existed 
in the past, both variants could have been stored 
in the same .pyc file.
(I think that nobody *really* uses python -U, but 
the same argument applies to any alternate code 
generation method: pyc files are unable to contain 
more than one code variant at a time)

>>  Anyway the change is "safe", in the sense that 
>>  any old code expecting the MAGIC number in the 
>>  first 4 bytes will reject the new format as 
>>  invalid and not process it.

> Maybe what we can do here is that, for some Python
> versions (say, 3.0, and maybe 3.1), the importer 
> will try to import in the new form, and if it 
> recognizes it as invalid *and* finds one of the 
> MAGIC numbers in the first 4 bytes, just import it
> in the old-fashioned way.

That could be done, but why? Isn't it the same 
situation as a change in the magic number? That 
invalidates all existing .pyc files and they all 
must be recompiled. If this new .pyc format were 
implemented, it's the same thing; all existing .pyc 
files must be recompiled. Old .pyc files have always
been discarded, and the same should apply to this 
new format, I think.
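
As an illustration of why the change is "safe" for old code, here is a rough sketch of the magic-number check a loader performs on the first 4 bytes. It uses `importlib.util.MAGIC_NUMBER` from current Python 3 (in 2008 the equivalent was `imp.get_magic()`); the function name and the `b"PYCX"` signature are made up for the example.

```python
import importlib.util

def has_current_magic(data: bytes) -> bool:
    """Return True if data starts with this interpreter's .pyc magic number."""
    return data[:4] == importlib.util.MAGIC_NUMBER

# A file written by this interpreter passes the check.
old_style = importlib.util.MAGIC_NUMBER + b"\x00" * 12

# A new-format file would start with some other signature (hypothetical
# here), so a loader doing only this check discards it and recompiles.
new_style = b"PYCX" + b"\x00" * 12
```

Here `has_current_magic(old_style)` is true and `has_current_magic(new_style)` is false, which is the same outcome as after any ordinary magic number bump.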

> Yes, you should start writing a PEP (any help you 
> need here, we can talk about it in the next Python
> Argentina meeting, ;).

(Mmm, I would not rely on that, given the past 
statistics... :-( )

On Fri, 25 Apr 2008 11:35:48 -0300, Mike Meyer  
<mwm-tkOQc4lHIczYtjvyW6yDsg at public.gmane.org> wrote:

>> - A section has an identifier (4 bytes, usually 
>> ASCII letters), followed by its size (4 bytes, 
>> not counting the section identifier nor the size
>> itself), followed by the actual section content.

> AKA Tag/Length/Value triples. While TLV is the 
> common order, it's slightly easier to deal with 
> them if you go with LTV. You *have* to deal with 
> the length in order to read things in. Beyond 
> that, you can treat TV as atomic if you don't care 
> about the tag for some reason.

Ok, the proposed order was that of the RIFF format, 
and the only reason I chose it is that it can be 
read using the chunk.py standard module. But it's 
not a very convincing argument. I like LTV more.
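
To make the LTV layout concrete, here is a small sketch of a writer and reader. It assumes a 4-byte little-endian length that counts only the section content (as in the original proposal), followed by the 4-byte identifier and the content. The byte order, function names and tags are assumptions for illustration only.

```python
import io
import struct

def write_section(out, tag: bytes, payload: bytes) -> None:
    """Write one section as Length, Tag, Value."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    out.write(struct.pack("<I", len(payload)))  # length of the content only
    out.write(tag)
    out.write(payload)

def read_sections(stream):
    """Yield (tag, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length field")
        (size,) = struct.unpack("<I", header)
        # The length is known first, so tag + content can be read in one go.
        body = stream.read(4 + size)
        if len(body) < 4 + size:
            raise ValueError("truncated section")
        yield body[:4], body[4:]

buf = io.BytesIO()
write_section(buf, b"CODE", b"abc")
write_section(buf, b"xtRA", b"")
buf.seek(0)
sections = list(read_sections(buf))
```

Reading the length first is what lets the reader fetch tag and content together, or skip both with a single seek when the section is not needed.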

>> - 32 bits should be enough for all sizes (and 
>> 640k should be enough for all people...)

> Given that there are people who write code that 
> writes code, and the memory and disk capacities 
> of modern systems, I'd say this is likely to cause
> problems. Given those capacities, 8 byte lengths 
> instead of 4 shouldn't be a problem. For embedded 
> devices - well, they're not going to like the idea
> in the first place.

I've found that using more than 32 bits would 
require changes in other places too, including 
the marshal format.
According to this thread from last year, 
it looks like huge code objects are not supported, 
unless something has changed in the meantime.
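
For reference, a short sketch of what the 32-bit limit means for a size field, assuming an unsigned little-endian encoding via `struct` (the helper name `pack_size` is hypothetical):

```python
import struct

MAX_SECTION_SIZE = 2**32 - 1  # largest value a 4-byte unsigned field can hold

def pack_size(size: int) -> bytes:
    """Encode a section size in 4 bytes, refusing values that do not fit."""
    try:
        return struct.pack("<I", size)
    except struct.error as exc:
        raise ValueError(f"size {size} does not fit in 32 bits") from exc

largest = pack_size(MAX_SECTION_SIZE)
# An 8-byte field ("<Q") would accept the same value and much larger ones,
# at the cost of 4 extra bytes per section.
wide = struct.pack("<Q", 2**32)
```

A section of 4 GiB or more cannot be described with the 4-byte field, so `pack_size(2**32)` raises `ValueError`.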

Gabriel Genellina
Softlab SRL
