Request for Pronouncement: PEP 441 - Improving Python ZIP Application Support
The discussion on PEP 441 (started in thread
https://mail.python.org/pipermail/python-dev/2015-February/138277.html
and on the issue tracker at http://bugs.python.org/issue23491) seems
to have mostly settled down.
I don't think there are any outstanding tasks, so I'm happy that the
PEP is ready for pronouncement. I've included the latest copy of the
PEP inline below, for reference.
(Can I also add a gentle reminder that PEP 486 is ready for
pronouncement? It's be nice to get these into 3.5 alpha 2, which is
due at the end of next week...)
Paul
PEP: 441
Title: Improving Python ZIP Application Support
Version: $Revision$
Last-Modified: $Date$
Author: Daniel Holth
Overall I like this and don't see any reason not to accept it, so +1. I do have a couple comments/questions on the module API, though. On Mon Feb 23 2015 at 12:45:28 PM Paul Moore
<SNIP>
``set_interpreter(archive, new_archive, interpreter=None)`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Modifies the *archive*'s shebang line to contain the specified interpreter, and writes the updated archive to *new_archive*. If the *interpreter* is ``None``, removes the shebang line.
Should new_archive default to None to allow for in-place editing? -Brett
On Mon, Feb 23, 2015 at 1:16 PM, Brett Cannon
Overall I like this and don't see any reason not to accept it, so +1. I do have a couple comments/questions on the module API, though.
On Mon Feb 23 2015 at 12:45:28 PM Paul Moore
wrote: <SNIP>
``set_interpreter(archive, new_archive, interpreter=None)`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Modifies the *archive*'s shebang line to contain the specified interpreter, and writes the updated archive to *new_archive*. If the *interpreter* is ``None``, removes the shebang line.
Should new_archive default to None to allow for in-place editing?
-Brett
That would be cool but more work. Unless the length of the new shebang is <= the old one, the zip file contents have to be moved out of the way.
On Mon Feb 23 2015 at 1:34:03 PM Daniel Holth
On Mon, Feb 23, 2015 at 1:16 PM, Brett Cannon
wrote: Overall I like this and don't see any reason not to accept it, so +1. I do have a couple comments/questions on the module API, though.
On Mon Feb 23 2015 at 12:45:28 PM Paul Moore
wrote: <SNIP>
``set_interpreter(archive, new_archive, interpreter=None)`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Modifies the *archive*'s shebang line to contain the specified interpreter, and writes the updated archive to *new_archive*. If the *interpreter* is ``None``, removes the shebang line.
Should new_archive default to None to allow for in-place editing?
-Brett
That would be cool but more work. Unless the length of the new shebang is <= the old one, the zip file contents have to be moved out of the way.
Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
On 23 February 2015 at 18:40, Brett Cannon
Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?) I'm happy to modify the API along these lines (details to be thrashed out) if people think it's worthwhile. Paul
Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity. On Mon, Feb 23, 2015 at 1:49 PM, Paul Moore
On 23 February 2015 at 18:40, Brett Cannon
wrote: Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?)
I'm happy to modify the API along these lines (details to be thrashed out) if people think it's worthwhile. Paul
On 02/23/2015 11:01 AM, Daniel Holth wrote:
On Mon, Feb 23, 2015 at 1:49 PM, Paul Moore wrote:
On 23 February 2015 at 18:40, Brett Cannon wrote:
Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?)
I'm happy to modify the API along these lines (details to be thrashed out) if people think it's worthwhile.
Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity.
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves? -- ~Ethan~
On 23 February 2015 at 19:01, Daniel Holth
Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity.
The biggest problem with that is finding the end of the prefix data. Frankly it's easier just to write a new prefix then use the zipfile module to rewrite all of the content. That's what the current code does writing to a new file. Paul
On 23 February 2015 at 19:22, Ethan Furman
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves?
Essentially the index is stored at the *end* of the file, and contains relative offsets to all the files. Gory details at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT Paul
On Mon, Feb 23, 2015 at 8:22 PM, Ethan Furman
On Mon, Feb 23, 2015 at 1:49 PM, Paul Moore wrote:
On 23 February 2015 at 18:40, Brett Cannon wrote:
Couldn't you just keep it in memory as bytes and then write directly
over
the file? I realize that's a bit wasteful memory-wise but it is
The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about
On 02/23/2015 11:01 AM, Daniel Holth wrote: possible. this
special case.
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?)
I'm happy to modify the API along these lines (details to be thrashed out) if people think it's worthwhile.
Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity.
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves?
Yes and no. The ZIP format uses a 'central directory' which is a record of
each file in the archive. The offsets are relative (although the
specification is a little vague on what they're relative *to* when using a
.zip file. The wording talks about disk numbers, ZIP being from the era of
floppy disks.) You find the central directory by searching from the end (or
reading a specific spot at the end, if you don't support archive comments.
zipimport, for example, doesn't support archive comments) and it turns out
you can find the central directory from just that information (and as far
as I know, all tools do.) However, there are still some offsets that would
change if you add stuff to the front of the ZIP file (or remove it), and
some zip tools will complain (usually just in verbose mode, though.)
--
Thomas Wouters
So is the PEP ready for pronouncement or should there be more discussion?
Also, do you have a BDFL-delegate or do you want me to review it?
On Mon, Feb 23, 2015 at 11:41 AM, Thomas Wouters
On Mon, Feb 23, 2015 at 8:22 PM, Ethan Furman
wrote: On Mon, Feb 23, 2015 at 1:49 PM, Paul Moore wrote:
On 23 February 2015 at 18:40, Brett Cannon wrote:
Couldn't you just keep it in memory as bytes and then write directly
over
the file? I realize that's a bit wasteful memory-wise but it is
On 02/23/2015 11:01 AM, Daniel Holth wrote: possible.
The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?)
I'm happy to modify the API along these lines (details to be thrashed out) if people think it's worthwhile.
Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity.
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves?
Yes and no. The ZIP format uses a 'central directory' which is a record of each file in the archive. The offsets are relative (although the specification is a little vague on what they're relative *to* when using a .zip file. The wording talks about disk numbers, ZIP being from the era of floppy disks.) You find the central directory by searching from the end (or reading a specific spot at the end, if you don't support archive comments. zipimport, for example, doesn't support archive comments) and it turns out you can find the central directory from just that information (and as far as I know, all tools do.) However, there are still some offsets that would change if you add stuff to the front of the ZIP file (or remove it), and some zip tools will complain (usually just in verbose mode, though.)
-- Thomas Wouters
Hi! I'm an email virus! Think twice before sending your email to help me spread!
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
-- --Guido van Rossum (python.org/~guido)
On 02/23/2015 10:49 AM, Paul Moore wrote:
I did consider this option, and I still quite like it. In fact, originally I wrote the API to *only* be in-place, until I realised that wouldn't work for things bigger than memory (but who has a Python app that's bigger than RAM?)
Depends -- how many image files are that app? ;) -- ~Ethan~
On Mon, Feb 23, 2015 at 8:24 PM, Paul Moore
On 23 February 2015 at 19:01, Daniel Holth
wrote: Sounds reasonable. It could be done by just reading the entire file contents after the shebang and re-writing them with the necessary offset all in RAM, truncating the file if necessary, without involving the zipfile module very much; the shebang could have some amount of padding by default; the file could just be re-compressed in memory depending on your appetite for complexity.
The biggest problem with that is finding the end of the prefix data. Frankly it's easier just to write a new prefix then use the zipfile module to rewrite all of the content. That's what the current code does writing to a new file.
I don't think you need to rewrite all of the contents, if you don't mind poking into zipfile internals: endrec = zipfile._EndRecData(f) prefix_length = endrec[zipfile._ECD_LOCATION] - endrec[zipfile._ECD_SIZE] - endrec[zipfile._ECD_OFFSET] I do something similar to get at the prefix, although I need the zipfile opened anyway, so I use: endrec = zipfile._EndRecData(f) # pylint: disable=protected-access zf = zipfile.ZipFile(f) # endrec is None if reading it failed, but then ZipFile should have # raised an exception... assert endrec prefix_len = zf.start_dir - endrec[zipfile._ECD_OFFSET] # pylint: disable=protected-access Paul
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/thomas%40python.org
--
Thomas Wouters
On 23.02.15 21:22, Ethan Furman wrote:
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves?
Absolute.
On 23 February 2015 at 19:47, Guido van Rossum
So is the PEP ready for pronouncement or should there be more discussion?
I think Brett's idea is worth incorporating, so let's thrash that out first.
Also, do you have a BDFL-delegate or do you want me to review it?
No-one has stepped up as BDFL-delegate, and there's no obvious candidate, so I think you're it, sorry :-) Paul
On 23 February 2015 at 18:40, Brett Cannon
Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about this special case.
The real problem with overwriting is if there's a failure during the overwrite you lose the original file. My original API had overwrite as the default, but I think the risk makes that a bad idea. One option would be to allow outputs (TARGET in pack() and NEW_ARCHIVE in set_interpreter()) to be open files (open for write in bytes mode) as well as filenames[1]. Then the caller has the choice of how to manage the output. The docs could include an example of overwriting via a BytesIO object, and point out the risk. BTW, while I was looking at the API, I realised I don't like the order of arguments in pack(). I'm tempted to make it pack(directory, target=None, interpreter=None, main=None) where a target of None means "use the name of the source directory with .pyz tacked on", exactly as for the command line API. What do you think? The change would be no more than a few minutes' work if it's acceptable. Paul [1] What's the standard practice for such dual-mode arguments? ZipFile tests if the argument is a str instance and assumes a file if not. I'd be inclined to follow that practice here.
On Mon Feb 23 2015 at 3:51:18 PM Paul Moore
Couldn't you just keep it in memory as bytes and then write directly over the file? I realize that's a bit wasteful memory-wise but it is possible. The docs could mention the memory cost is something to watch out for when doing an in-place replacement. Heck the code could even make it an io.BytesIO instance so the rest of the code doesn't have to care about
On 23 February 2015 at 18:40, Brett Cannon
wrote: this special case.
The real problem with overwriting is if there's a failure during the overwrite you lose the original file. My original API had overwrite as the default, but I think the risk makes that a bad idea.
Couldn't you catch the exception, write the original file back out, and then re-raise the exception?
One option would be to allow outputs (TARGET in pack() and NEW_ARCHIVE in set_interpreter()) to be open files (open for write in bytes mode) as well as filenames[1]. Then the caller has the choice of how to manage the output. The docs could include an example of overwriting via a BytesIO object, and point out the risk.
That sounds like a good idea. No reason to do the file opening on someone's behalf when opening files is so easy and keeps layering abstractions at a good level. Would this extend also to the archive being read to be consistent? I should mention I originally thought of extending this to pack() for 'main', but realized that passing in the function to set would require tools to import the code they are simply trying to pack and that was the wrong thing to do.
BTW, while I was looking at the API, I realised I don't like the order of arguments in pack(). I'm tempted to make it pack(directory, target=None, interpreter=None, main=None) where a target of None means "use the name of the source directory with .pyz tacked on", exactly as for the command line API.
What do you think? The change would be no more than a few minutes' work if it's acceptable.
+1 from me. -Brett
Paul
[1] What's the standard practice for such dual-mode arguments? ZipFile tests if the argument is a str instance and assumes a file if not. I'd be inclined to follow that practice here.
On 23.02.15 22:23, Serhiy Storchaka wrote:
On 23.02.15 21:22, Ethan Furman wrote:
This could be a completely stupid question, but how does the zip file know where the individual files are? More to the point, does the index work via relative or absolute offset? If absolute, wouldn't the index have to be rewritten if the zip portion of the file moves?
Absolute.
Oh, I were wrong. The specification and the are not very clear about this.
On 02/23/2015 01:02 PM, Brett Cannon wrote:
On Mon Feb 23 2015 at 3:51:18 PM Paul Moore wrote:
The real problem with overwriting is if there's a failure during the overwrite you lose the original file. My original API had overwrite as the default, but I think the risk makes that a bad idea.
Couldn't you catch the exception, write the original file back out, and then re-raise the exception?
This seems to be getting pretty complex for a nice-to-have.
One option would be to allow outputs (TARGET in pack() and NEW_ARCHIVE in set_interpreter()) to be open files (open for write in bytes mode) as well as filenames[1].
+1 for this.
BTW, while I was looking at the API, I realised I don't like the order of arguments in pack(). I'm tempted to make it pack(directory, target=None, interpreter=None, main=None) where a target of None means "use the name of the source directory with .pyz tacked on", exactly as for the command line API.
What do you think? The change would be no more than a few minutes' work if it's acceptable.
+1 from me.
+1 from me as well. -- ~Ethan~
On 23.02.15 22:51, Paul Moore wrote:
BTW, while I was looking at the API, I realised I don't like the order of arguments in pack(). I'm tempted to make it pack(directory, target=None, interpreter=None, main=None) where a target of None means "use the name of the source directory with .pyz tacked on", exactly as for the command line API.
If the order of arguments is not obvious, make them keyword-only.
On 23 February 2015 at 21:02, Brett Cannon
The real problem with overwriting is if there's a failure during the overwrite you lose the original file. My original API had overwrite as the default, but I think the risk makes that a bad idea.
Couldn't you catch the exception, write the original file back out, and then re-raise the exception?
But you don't *have* the original file. You read the source archive entry-by-entry, not all at once. Apart from the implementation difficulty, this is getting too complex, and I think it's better to just give the user the tools to add whatever robustness or exception handling they want on top. Paul
On 23 February 2015 at 21:18, Serhiy Storchaka
On 23.02.15 22:51, Paul Moore wrote:
BTW, while I was looking at the API, I realised I don't like the order of arguments in pack(). I'm tempted to make it pack(directory, target=None, interpreter=None, main=None) where a target of None means "use the name of the source directory with .pyz tacked on", exactly as for the command line API.
If the order of arguments is not obvious, make them keyword-only.
To be honest, I don't think it *is* that non-obvious. I just think I got it wrong initially. With the new API, you have pack('myappdir') pack('myappdir', 'named_target.pyz') Seems obvious enough to me. It matches the "source, destination" order of similar functions like shutil.copy. And you can use a named argument if you're not sure. But I don't think it's worth forcing it. Paul
On 24 February 2015 at 06:32, Paul Moore
On 23 February 2015 at 19:47, Guido van Rossum
wrote: So is the PEP ready for pronouncement or should there be more discussion?
I think Brett's idea is worth incorporating, so let's thrash that out first.
Also, do you have a BDFL-delegate or do you want me to review it?
No-one has stepped up as BDFL-delegate, and there's no obvious candidate, so I think you're it, sorry :-)
If Guido isn't keen, I'd be willing to cover it as the current runpy module maintainer and the one that updated this part of the interpreter startup sequence to handle the switch to importlib in 3.3. The PEP itself doesn't actually touch any of that internal machinery though, it just makes the capability a bit more discoverable and user-friendly. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I can do it but I don't want to be reviewing and accepting a PEP that's
still under discussion, and I don't have the bandwidth to follow the
discussion here -- I can only read the PEP. I will start that now.
On Tue, Feb 24, 2015 at 1:03 AM, Nick Coghlan
On 24 February 2015 at 06:32, Paul Moore
wrote: On 23 February 2015 at 19:47, Guido van Rossum
wrote: So is the PEP ready for pronouncement or should there be more discussion?
I think Brett's idea is worth incorporating, so let's thrash that out first.
Also, do you have a BDFL-delegate or do you want me to review it?
No-one has stepped up as BDFL-delegate, and there's no obvious candidate, so I think you're it, sorry :-)
If Guido isn't keen, I'd be willing to cover it as the current runpy module maintainer and the one that updated this part of the interpreter startup sequence to handle the switch to importlib in 3.3.
The PEP itself doesn't actually touch any of that internal machinery though, it just makes the capability a bit more discoverable and user-friendly.
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
-- --Guido van Rossum (python.org/~guido)
On 24 February 2015 at 17:46, Guido van Rossum
I can do it but I don't want to be reviewing and accepting a PEP that's still under discussion, and I don't have the bandwidth to follow the discussion here -- I can only read the PEP. I will start that now.
I'm just about to push an update based on Brett's comments, then it should be done. Can you hold off for a short while? Thanks. Paul. (Sorry, always the way - final comments appear as soon as you say "it's done" :-))
participants (8)
-
Brett Cannon
-
Daniel Holth
-
Ethan Furman
-
Guido van Rossum
-
Nick Coghlan
-
Paul Moore
-
Serhiy Storchaka
-
Thomas Wouters