A new .pyc file format
Hello

(Sorry if you get this twice, I can't see my original post from gmane)

I want to propose a new .pyc file format. Currently .pyc files use a very simple format:

- MAGIC number (4 bytes, little-endian)
- last modification time of source file (4 bytes, little-endian)
- code object (marshaled)

The problem is that this format is *too* simple. It can't be changed, nor can it accommodate other fields if desired. I propose using a more flexible .pyc format (resembling RIFF files with multiple levels). The layout would be as follows:

- A file contains a sequence of sections.
- A section has an identifier (4 bytes, usually ASCII letters), followed by its size (4 bytes, not counting the section identifier nor the size itself), followed by the actual section content.
- The layout inside each section is arbitrary, but it's suggested to use the same technique: a sequence of (identifier x 4 bytes, size x 4 bytes, actual value)

The outer section is called "PYCO" (from Python Code, or a contraction of pyc+pyo) and contains at least 4 subsections:

- "VERS": import's "MAGIC number", now seen as a "code version number" (4 bytes, same format as before)
- "DATE": last modification time of source file (4 bytes, same format as before)
- "COFL": the code.co_flags attribute (4 bytes)
- "CODE": the marshaled code object

        4 bytes                 4 bytes
+-----.-----.-----.-----+-----.-----.-----.-----+
| "P" | "Y" | "C" | "O" | size of whole section |
+-----------------------+-----------------------+

+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "V" | "E" | "R" | "S" |           4           | import "MAGIC number" |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "D" | "A" | "T" | "E" |           4           | source file st_mtime  |
+-----.-----.-----.-----+-----.-----.-----.-----+-----------------------+
| "C" | "O" | "F" | "L" |           4           |     code.co_flags     |
+-----.-----.-----.-----+-----.-----.-----.-----+-----.-----.-----.-----+
| "C" | "O" | "D" | "E" | size of marshaled code| marshaled code object |
+-----.-----.-----.-----+-----.-----.-----.-----+    ...    ...    ...  |
                                                |    ...    ...    ...  |
                                                +-----------------------+

New sections -or subsections inside a section- can be defined in the future. No implied knowledge of section meanings or their structure is required to read the file; readers can safely skip over sections they don't understand, and never lose synchronization.

Compared with the current format, it has an overhead of 44 bytes.

The format above can replace the current format used for .pyc/.pyo files (but see below). Of course it's totally incompatible with the old format. Apart from changing every place where .pyc files are read or written in the Python sources (not so many, I've identified all of them), 3rd party libraries and tools using the old format would have to be updated. Perhaps a new module should be provided to read and write pyc files. Anyway the change is "safe", in the sense that any old code expecting the MAGIC number in the first 4 bytes will reject the new format as invalid and not process it.

Due to this incompatibility, this should be aimed at Python 3.x; I hope we are in time to implement this for 3.0?

A step further:

Currently, the generated code object depends on the Python version and the optimize flag; it used to depend on the Unicode flag too, but that's not the case for Python 3. The Python version determines the base MAGIC number; the Unicode flag increments that number by 1; the optimize flag determines the file extension used (.pyc/.pyo). With this new format, there is no need to use two different extensions anymore: all of this can be gathered from the attributes above, so several variants of the same code object can be stored in a single file. The importer can choose which one to load based on those attributes. The selection can be made rather quickly: only the relevant attributes actually have to be read; all other subsections can be skipped entirely without further parsing.
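To make the proposed layout concrete, here is a minimal sketch in Python of writing and reading such a file. It is only an illustration under stated assumptions, not the patch mentioned in this message: the helper names (`pack_section`, `iter_sections`, `build_pyco`, `load_code`) are made up for the example, sizes are taken to be little-endian as for the existing fields, and no padding is applied.

```python
import importlib.util
import marshal
import struct


def pack_section(ident: bytes, payload: bytes) -> bytes:
    """Identifier (4 bytes) + little-endian size (4 bytes) + content."""
    assert len(ident) == 4
    return ident + struct.pack("<I", len(payload)) + payload


def iter_sections(data: bytes):
    """Yield (identifier, content) pairs without knowing what they mean."""
    pos = 0
    while pos < len(data):
        ident = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        yield ident, data[pos + 8:pos + 8 + size]
        pos += 8 + size  # jump to the next section header


def build_pyco(magic: bytes, mtime: int, code) -> bytes:
    """Build one "PYCO" section with the four subsections from the proposal."""
    body = (
        pack_section(b"VERS", magic)
        + pack_section(b"DATE", struct.pack("<I", mtime))
        + pack_section(b"COFL", struct.pack("<I", code.co_flags))
        + pack_section(b"CODE", marshal.dumps(code))
    )
    return pack_section(b"PYCO", body)


def load_code(data: bytes, magic: bytes):
    """Return the code object of the first PYCO variant whose VERS matches."""
    for ident, content in iter_sections(data):
        if ident != b"PYCO":
            continue  # unknown top-level section: skipped, still in sync
        fields = dict(iter_sections(content))  # unknown subsections are ignored
        if fields.get(b"VERS") == magic:
            return marshal.loads(fields[b"CODE"])
    return None


# Hypothetical usage: one variant compiled for the running interpreter.
code = compile("x = 6 * 7", "<demo>", "exec")
blob = build_pyco(importlib.util.MAGIC_NUMBER, 0, code)
```

With this layout the 44 bytes of overhead break down as 8 bytes for the "PYCO" header, 4 x 8 bytes for the subsection headers, and 4 bytes for the co_flags value. A real importer would presumably also compare "COFL" and "DATE" before picking a variant.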
Some issues:

- endianness: .pyc files currently store the modification time and magic number in little-endian; probably one should just stick to it.
- making the size of all sections a multiple of 4 may be a good idea, so marshaled code should be padded with up to 3 NUL bytes at the end.
- section ordering, and subsection ordering inside a section: should not be relevant; what if one can't seek to an earlier part of the file? (Ok, unlikely, but currently import.c appears to handle such cases). If "CODE" comes before any of "VERS", "COFL", "DATE" it would be necessary to rewind the file to read the code section. The easy fix is to forbid that situation: "CODE" must come after all of those subsections.
- The co_flags attribute of code objects is now externally visible; future Python versions should not redefine those flags.
- There is no provision for explicit attribute types: "VERS" is a number, "CODE" is a marshaled code object... The reader has to *know* that (although it can completely skip over unknown attributes). No string attributes were defined (nor required). For the *current* needs, it's enough as it is. But perhaps in the future this turns out to be a shortcoming, and the .pyc format has to be changed *again*, and I'd hate that.
- Perhaps the source modification date should be stored in a more portable way?
- a naming problem: currently, the code version number defined in import.c is called "MAGIC", and is written at the very beginning of the file. It identifies the file as having a valid code object. In the proposed format, the file will begin with the letters "PYCO" instead, and the current magic number is buried inside a subsection... it's not a "magic" anymore, just a version number, and the "magic" in the sense used by file(1) would be the 4 bytes "PYCO". So the name "MAGIC" should be changed everywhere... it seems too drastic.
- 32 bits should be enough for all sizes (and 640k should be enough for all people...)
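The padding rule from the list above can be sketched as follows. This assumes that the stored size counts the padding bytes, which the proposal does not spell out; `pad4` and `pack_padded_section` are hypothetical names. It relies on the documented behaviour of `marshal.loads`, which ignores extra bytes after the marshaled value.

```python
import marshal
import struct


def pad4(payload: bytes) -> bytes:
    """Append 0 to 3 NUL bytes so the length becomes a multiple of 4."""
    return payload + b"\x00" * (-len(payload) % 4)


def pack_padded_section(ident: bytes, payload: bytes) -> bytes:
    """Section whose size field (assumed to include padding) is 4-aligned."""
    padded = pad4(payload)
    return ident + struct.pack("<I", len(padded)) + padded


# The trailing NUL bytes do not disturb unmarshaling.
value = marshal.loads(pad4(marshal.dumps((1, "a"))))
```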
Implementation: I don't have a complete implementation yet, but if this format is approved (as is or with any changes) I could submit a patch. I've made a small but incompatible modification to the currently used .pyc format in order to detect all the places that this change would affect, and there are actually not so many.

-- 
Gabriel Genellina
I think this is a reasonable thing to do, but I'd like to hear more motivation. Maybe you can write it all up in PEP format and add a section that explains what features we want from .pyc files?

I like that this would get rid of .pyo files BTW.

--Guido

On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina <gagsl-py2@yahoo.com.ar> wrote:
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
I'll play my part here and toss out some ideas we could use this for. I'm not really advocating it, yet, but I'll say I am +0. In either case, if we did, here are some possible uses:

We could break up the code into multiple sections and allow alternatives for sections with different versions. Different versions could be used for a few different things, including different optimization levels, supporting multiple bytecode versions, or storing both pre and post AST transformations of the code.

Meta-ish data like docstrings could be pulled out of the code objects and injected in non-optimized modes. This might also include the original source for the code, which could be helpful when you change the source and tracebacks from still-running code give you invalid lines.

Bookkeeping data could sit in some sections, detailing things like call stats (average call time, frequency, etc) and other information that could be useful for optimizers and JIT compilers like psyco.

I am not saying any of these are good ideas or good uses of the original idea. I'm just giving thought fodder for the hypothetical.

On Apr 25, 2008, at 10:29 AM, Guido van Rossum wrote:
Replying to all posts jointly (and directly to the list, looks like gmane doesn't like my posts on the newsgroup...)

On Fri, 25 Apr 2008 11:29:13 -0300, Guido van Rossum <guido-+ZN9ApsXKcEdnm+yROfE0A@public.gmane.org> wrote:

Ok.

> I like that this would get rid of .pyo files BTW.

Yes, both .pyc and .pyo versions could coexist in the same file, among other things.

On Fri, 25 Apr 2008 11:33:16 -0300, Blake Winton <bwinton-D8CoGe09WXY@public.gmane.org> wrote:
http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html#Chunk-naming-conve...
Yes, we can reserve the case of the last two letters (always uppercase now) until any useful meaning emerges.

On Fri, 25 Apr 2008 12:30:06 -0300, Facundo Batista <facundobatista-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
2008/4/25, Gabriel Genellina <gagsl-py2-/E1597aS9LQMlKAeRRkD2Q@public.gmane.org>:
.pyc and .pyo files could be merged into a single file, always using the .pyc extension. This way the logic to locate/search the right file to load would be simpler (at the cost of making it more complex to locate the *section* inside the file that must be loaded!). In the past, zipimport got it wrong in some cases (see http://bugs.python.org/issue1346572). Another example is python -U; it changes the magic number, so modules compiled in this mode are incompatible with modules compiled in the "normal" mode. If a mechanism like this proposal had existed in the past, both variants could have been stored in the same .pyc file. (I think that nobody *really* uses python -U, but the same argument applies to any alternate code generation method: pyc files are unable to contain more than one code variant at a time)
That could be done, but why? Isn't it the same situation as a change in the magic number? That invalidates all existing .pyc files and they all must be recompiled. If this new .pyc format were implemented, it's the same thing; all existing .pyc files must be recompiled. Old .pyc files have always been discarded, and the same should apply to this new format, I think.
(Mmm, I would not rely on that, given the past statistics... :-( )

On Fri, 25 Apr 2008 11:35:48 -0300, Mike Meyer <mwm-tkOQc4lHIczYtjvyW6yDsg@public.gmane.org> wrote:

Ok, the proposed order was that of the RIFF format, and the only reason I chose it is because it can be read using the chunk.py standard module. But it's not a very convincing argument. I like LTV more.

> - 32 bits should be enough for all sizes (and 640k should be enough for all people...)

I've found that using more than 32 bits would require changes in other places too, including the marshal format. According to this thread from last year http://mail.python.org/pipermail/python-dev/2007-May/073157.html it looks like huge code objects are not supported, unless something has changed in the meantime.

-- 
Gabriel Genellina
Softlab SRL
Gabriel Genellina wrote:
[snip...]
As a side suggestion, the PNG spec makes the capitalization of each identifier indicate extra meta-data about the section. (See: http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html#Chunk-naming-conve... )

For instance, an identifier that starts with a capital letter means that the decoder must understand this chunk to process the contents of the file, whereas an identifier with a lowercase first letter can safely be skipped. An uppercase second letter means that the identifier is defined by Python, whereas a lowercase second letter would indicate a third-party-defined chunk.

(The PNG spec reserves the case of the third letter, and forces it to be uppercase. The case of the fourth letter indicates whether it's safe to copy this chunk. I don't think either of those are particularly useful to Python, and so could conveniently be skipped.)

Would this extra meta-data be useful? I think so, for the "safe to ignore" flag, at least.

Later,
Blake.
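The capitalization convention described above comes down to testing bit 5 (0x20) of each identifier byte, since that is the only bit that differs between an uppercase ASCII letter and its lowercase form. A minimal sketch follows; the function names and the identifiers "sTAT", "xyZZ" and "NEWS" are invented for the example and are not part of the proposal.

```python
def is_critical(ident: bytes) -> bool:
    """Uppercase first letter: the reader must understand this section."""
    return not (ident[0] & 0x20)


def is_python_defined(ident: bytes) -> bool:
    """Uppercase second letter: defined by Python rather than a third party."""
    return not (ident[1] & 0x20)


def on_unknown_section(ident: bytes) -> str:
    """Decide what a reader does with an identifier it does not recognize."""
    if is_critical(ident):
        raise ValueError(f"cannot skip unknown critical section {ident!r}")
    return "skipped"
```

Under this scheme a reader that meets an unknown "sTAT" section can skip it safely, while an unknown "NEWS" section forces it to give up on the file.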
On Fri, 25 Apr 2008 07:44:19 -0300 "Gabriel Genellina" <gagsl-py2@yahoo.com.ar> wrote:
Ok, *why* is this a problem? What proposed other fields do you have, other than putting in multiple code segments with different flags? Beyond that:
AKA Tag/Length/Value triples. While TLV is the common order, it's slightly easier to deal with them if you go with LTV. You *have* to deal with the length in order to read things in. Beyond that, you can treat TV as atomic if you don't care about the tag for some reason.
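A small sketch of the LTV ordering, to show why the reader can treat tag and value as one unit. The message does not say whether the length should count the tag, so this example assumes it covers tag plus value; `pack_ltv` and `iter_ltv` are hypothetical names.

```python
import struct


def pack_ltv(tag: bytes, value: bytes) -> bytes:
    """Length (4 bytes, little-endian, counting tag + value), then tag, then value."""
    body = tag + value
    return struct.pack("<I", len(body)) + body


def iter_ltv(data: bytes):
    """Read the length first, then slice out tag + value as one opaque blob."""
    pos = 0
    while pos < len(data):
        (size,) = struct.unpack_from("<I", data, pos)
        body = data[pos + 4:pos + 4 + size]  # a skipping reader can stop here
        yield body[:4], body[4:]             # split only if the tag matters
        pos += 4 + size


stream = pack_ltv(b"VERS", b"\x01\x02\x03\x04") + pack_ltv(b"DATE", b"\x00" * 4)
```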
> - 32 bits should be enough for all sizes (and 640k should be enough for all people...)

Given that there are people who write code that writes code, and the memory and disk capacities of modern systems, I'd say this is likely to cause problems. Given those capacities, 8 byte lengths instead of 4 shouldn't be a problem. For embedded devices - well, they're not going to like the idea in the first place.

<mike
-- 
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
2008/4/25, Mike Meyer <mwm@mired.org>:
Well, ASN.1 has well-defined semantics to support an L of multiple bytes in a TLV construction; we could adopt that. Note that I think that doing this is too complex for this purpose.

Regards,

-- 
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
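For reference, the variable-width length used by the ASN.1 BER/DER encodings works like this: lengths below 128 take a single byte, and larger lengths are written as a byte 0x80 | n followed by n big-endian length bytes. The sketch below covers only that definite form (the indefinite form is left out), and the function names are made up for the example.

```python
def encode_length(n: int) -> bytes:
    """Encode a non-negative length in BER/DER definite form."""
    if n < 0x80:
        return bytes([n])                     # short form: one byte
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(raw)]) + raw     # long form: count byte + length bytes


def decode_length(data: bytes, pos: int = 0):
    """Return (length, position just after the length field)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1
    count = first & 0x7F                      # number of length bytes that follow
    end = pos + 1 + count
    return int.from_bytes(data[pos + 1:end], "big"), end
```

This removes any fixed limit on section sizes, at the cost of a length field whose width the reader only learns while parsing it.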
2008/4/25, Gabriel Genellina <gagsl-py2@yahoo.com.ar>:
> The problem is that this format is *too* simple. It can't be changed, nor can accommodate other fields if desired. I propose using a more flexible

But how do you think that these extended pyc's will be used? I mean, are there use cases for this more complex pyc? Or will they just be more complex, but with the same information as before, for years, because nobody needs this flexibility?

Maybe what we can do here is that, for some Python versions (say, 3.0, and maybe 3.1), the importer will try to import in the new form, and if it recognizes the file as invalid *and* finds one of the MAGIC numbers in the first 4 bytes, it just imports it the old-fashioned way.
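The fallback described above could look roughly like the following sketch. It assumes the old layout as described at the top of the thread (magic, mtime, marshaled code, with an 8-byte header); newer CPython releases use a longer header, so the legacy file here is built by hand. `detect_format` and `load_legacy` are hypothetical names.

```python
import importlib.util
import marshal
import struct


def detect_format(data: bytes, legacy_magics) -> str:
    """Classify a file by its first 4 bytes."""
    head = data[:4]
    if head == b"PYCO":
        return "new"
    if head in legacy_magics:
        return "old"
    return "invalid"


def load_legacy(data: bytes):
    """Old layout from this thread: magic (4 bytes), mtime (4 bytes), code."""
    return marshal.loads(data[8:])


# Hand-built file in the old layout, using the running interpreter's magic.
code = compile("answer = 42", "<demo>", "exec")
legacy = importlib.util.MAGIC_NUMBER + struct.pack("<I", 0) + marshal.dumps(code)
```

The two cases cannot be confused, because CPython's magic numbers end in the bytes "\r\n" and so never equal "PYCO".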
> - making the size of all sections multiple of 4 may be a good idea, so marshaled code should be padded with up to 3 NUL bytes at the end.

Why? Does this have something to do with memory alignment? I don't see the benefit of this extra rule.

> - 32 bits should be enough for all sizes (and 640k should be enough for all people...)

Regarding this, and the 44 bytes of overhead you mentioned, I checked the average size of the .pyc files on the Linux system I had at hand:

  $ locate pyc | xargs ls -l | awk 'BEGIN {a=0; c=0} {a += $5; c+=1} END {print c, a/c}'
  4960 9069.12

I think that both the overhead and 32 bits for the size are ok.
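The same measurement can be reproduced without locate and awk. This is only a rough Python equivalent: it walks a directory tree and matches files ending in .pyc, whereas `locate pyc` matches any path containing "pyc", so the numbers will not be identical. `pyc_stats` is a hypothetical helper name.

```python
from pathlib import Path


def pyc_stats(root):
    """Return (file count, average size in bytes) for *.pyc files under root."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*.pyc") if p.is_file()]
    average = sum(sizes) / len(sizes) if sizes else 0.0
    return len(sizes), average
```

With the figures quoted above, 44 bytes on an average of 9069.12 bytes is roughly half a percent of overhead per file.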
Yes, you should start writing a PEP (any help you need here, we can talk about it in the next Python Argentina meeting, ;).

Regards,

-- 
. Facundo

Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
On Fri, Apr 25, 2008 at 3:44 AM, Gabriel Genellina <gagsl-py2@yahoo.com.ar> wrote:
While I think having a more flexible format is important to allow for modifying the AST before bytecode write-out, I don't know if it needs to go quite this far. The magic number, timestamp, and marshaled code are not about to go away. Thus the current format can basically stay, but we can add a flexible addition between the timestamp and the code object. This saves some memory and simplifies the format slightly in the case where the guaranteed requirements of .pyc regeneration can be quickly checked (e.g., a quick 8 byte read off the file will quickly tell if the magic number or timestamp are out of date, thus skipping having to read the entire header for these two critical sanity checks).

The thing I think that the new format needs to easily support is not just the removal of .pyo, but of user-defined AST transformations prior to .pyc generation. Now that this can be done at the Python level some people might start coming up with compiler optimizations that they want to do which change semantics. That means there needs to be a clear way to register an AST transformation as having occurred. I am just worried that the 4 bytes for labeling something won't be enough. We could say that all optimizations are labeled "OPTO" and that what format is used is specified in the value, but that means supporting multiple instances of the same label in the header (which I think is fine since this is going to be read linearly off disk 99% of the time).

So I guess this boils down to: I think we don't need to label what MUST be in the header, and that we should allow for multiple instances of the same label (whether this is always true or we use some way to flag that through capitalization).

-Brett
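The quick check mentioned above, with the magic number and timestamp kept unlabeled at the front of the file, fits in a single 8-byte read. A minimal sketch, assuming both fields keep their current little-endian layout; `is_stale` is a hypothetical name, and how the flexible part after the timestamp would be delimited is left open, as it is in the message.

```python
import struct


def is_stale(header: bytes, magic: bytes, source_mtime: int) -> bool:
    """Compare the two fixed fields taken from the first 8 bytes of a file."""
    file_magic, file_mtime = struct.unpack("<4sI", header[:8])
    return file_magic != magic or file_mtime != source_mtime


# Hypothetical header: made-up magic bytes plus a 2008 timestamp.
demo_magic = b"\x3b\x0c\r\n"
demo_header = demo_magic + struct.pack("<I", 1209120259)
```

A reader would only go on to parse the labeled sections and the code object when this check reports that the file is still current.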
On 4/25/08, Gabriel Genellina <gagsl-py2@yahoo.com.ar> wrote:
> The problem is that this [pyc] format is *too* simple. It can't be changed, nor can accommodate other fields if desired.

Why do you need to? Except for bootstrapping, can't you make all these changes with a custom loader/importer?

Shipping Python with default support for a new format may be reasonable as well -- the interpreter already handles both pure python and extension modules. Even hooking it in as an alternate generated format just extends the pyo/pyc decision.

Or were you suggesting that the stdlib should use this new format by default, or even strictly instead of the current format? If so, what are the advantages in the normal case? (Deferring the load of docstrings? Better categorization by some external tool?)

-jJ
On Sat, 26 Apr 2008 13:21:59 -0300, Jim Jewett <jimjjewett-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

I want to unify pyc+pyo so they coexist in the same file.

Yes, the idea is to *replace* the current format completely, not to add another alternative. There are now 4 different code variants: using -O or not, and using -U or not; the first one changes the file extension, the second changes the magic number. If the .pyc format could handle more than one variant, they all could be stored in a single file. Other kinds of data can be stored too. Deferring docstrings can't be done without changing the marshal format, and that's out of the scope of this proposal (until now).

-- 
Gabriel Genellina
Softlab SRL
participants (10)
- Blake Winton
- Brett Cannon
- Calvin Spealman
- Christian Heimes
- Facundo Batista
- Gabriel Genellina
- gagsl-py2@yahoo.com.ar
- Guido van Rossum
- Jim Jewett
- Mike Meyer