Fast Implementation for ZIP decryption

Hi all, I have implemented the simple ZIP decryption in C (yes, the much-loathed, weak, password-based classical PKWARE encryption, which incidentally is the only one currently supported in Python). As one would expect, it is much faster than the current all-Python implementation. Does it sound worthwhile to create a patch for it and integrate it into Python itself?
-- Regards, Shashank Singh, Senior Undergraduate, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, shashank.sunny.singh@gmail.com, http://www.cse.iitb.ac.in/~shashanksingh

On Sun, 30 Aug 2009 06:55:33 pm Martin v. Löwis wrote:
I would think that for most people, the threat model isn't "the CIA is reading my files" but "my little brother or nosey co-worker is reading my files", and for that, zip encryption with a good password is probably perfectly adequate. E.g. OpenOffice uses it for password-protected documents. Given that Python already supports ZIP decryption (as it should), are there any reasons to prefer the current pure-Python implementation over a faster version? -- Steven D'Aprano

On 12:59 pm, steve@pearwood.info wrote:
Given that the use case is "protect my biology homework from my little brother", how fast does the implementation really need to be? Is speeding it up from 0.1 seconds to 0.001 seconds worth the potential new problems that come with more C code (more code to maintain, less portability to other runtimes, potential for interpreter crashes or even arbitrary code execution vulnerabilities from specially crafted files)? Jean-Paul

Just to give you an idea of the speedup: a 3.3 MB zip file extracted using the current all-Python implementation on my machine (Windows XP, 1.67 GHz, 1.5 GB RAM) takes approximately 38 seconds. The same file extracted using the C implementation takes 0.4 seconds. --shashank
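To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the three numbers reported in the message; the variable names are made up for illustration, and the result (a speedup of roughly 95x) is only as reliable as that single measurement.

```python
# Figures as reported in the message above (one machine, one archive).
size_mb = 3.3      # archive size in MB
t_python = 38.0    # seconds, pure-Python decryption
t_c = 0.4          # seconds, C decryption

speedup = t_python / t_c                 # how many times faster the C version is
throughput_python = size_mb / t_python   # MB per second, pure Python
throughput_c = size_mb / t_c             # MB per second, C

print(f"speedup: {speedup:.0f}x")
print(f"pure Python: {throughput_python:.3f} MB/s")
print(f"C: {throughput_c:.2f} MB/s")
```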

On 30 aug 2009, at 16:34, Shashank Singh wrote:
If this matters to the users of the API, then likely they'd search for alternatives -- no need for it to go into the standard library just because it replaces functionality, or am I misunderstanding? - Ludvig Ericson <ludvig@lericson.se>

On Sun, Aug 30, 2009 at 7:34 AM, Shashank Singh <shashank.sunny.singh@gmail.com> wrote:
Are there any applications/frameworks which have zip files on their critical path, where this kind of (admittedly impressive) speedup would be beneficial? What was the motivation for writing the C version? Collin Winter

-On [20090831 06:29], Collin Winter (collinw@gmail.com) wrote:
Would zipped eggs count? For example, SQLAlchemy runs in the 5 MB range. -- Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai イェルーン ラウフロック ヴァン デル ウェルヴェン http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B All for one, one for all...

On Sun, Aug 30, 2009 at 10:40 PM, Jeroen Ruigrok van der Werven <asmodai@in-nomine.org> wrote:
Unless someone's also pushing for being able to import and execute code from scrambled zip files, no, that doesn't matter. The C code for this should be trivially tiny. See the zipfile._ZipDecrypter class; it's got ~25 lines of actual code in it. It is not worth arguing about. I'll commit this if you post it as a patch in a tracker issue. Please make sure your patch includes the following:
* A unittest that compares the C version of the descrambler to the Python version of the descrambler using a variety of inputs and outputs that exercise any boundary condition.
* Conditional import code in the zipfile module itself so that the module works even if the C module isn't available.
-Greg
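A rough sketch of how the conditional import could look, assuming Python 3. The module name `_zipdecrypt` and the factory name `make_decrypter` are hypothetical and do not refer to any existing API. The pure-Python fallback follows the traditional PKWARE key-update scheme and is written fresh for illustration, not copied from the zipfile module.

```python
# Sketch only: `_zipdecrypt` is a hypothetical name for the C accelerator.
try:
    from _zipdecrypt import make_decrypter  # hypothetical C implementation
except ImportError:
    make_decrypter = None


def _make_crc_table():
    """Build the standard CRC-32 table (reflected polynomial 0xEDB88320)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc_table()


def py_make_decrypter(pwd):
    """Pure-Python fallback: takes password bytes, returns a bytes -> bytes callable."""
    # Initial key values of the traditional PKWARE scheme.
    keys = [0x12345678, 0x23456789, 0x34567890]

    def update_keys(byte):
        keys[0] = (keys[0] >> 8) ^ _CRC_TABLE[(keys[0] ^ byte) & 0xFF]
        keys[1] = (keys[1] + (keys[0] & 0xFF)) & 0xFFFFFFFF
        keys[1] = (keys[1] * 134775813 + 1) & 0xFFFFFFFF
        keys[2] = (keys[2] >> 8) ^ _CRC_TABLE[(keys[2] ^ (keys[1] >> 24)) & 0xFF]

    # Mix the password into the key state.
    for byte in pwd:
        update_keys(byte)

    def decrypt(data):
        out = bytearray()
        for byte in data:
            k = keys[2] | 2
            byte ^= ((k * (k ^ 1)) >> 8) & 0xFF  # XOR with the keystream byte
            update_keys(byte)                    # keys advance on the plaintext
            out.append(byte)
        return bytes(out)

    return decrypt


# Use the C version when it imported, otherwise fall back to pure Python.
if make_decrypter is None:
    make_decrypter = py_make_decrypter
```

Because the returned callable keeps its key state between calls, it can be fed an archive member in chunks. Keeping `py_make_decrypter` importable under its own name also lets a unit test compare it against the accelerated version whenever that one is present.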

On Mon, Aug 31, 2009 at 12:08 PM, Gregory P. Smith <greg@krypto.org> wrote:
For those who have not seen it: http://bugs.python.org/issue6749 asks for such an ability (there was a good deal of discussion about it on python-dev too, and I think, Greg, you were a -1 on it :).
Right you are. It is just a simple translation of the (~25 lines of) code into C.
> It is not worth arguing about. I'll commit this if you post it as a patch
I sure can do that. What boundary conditions do you have in mind? While we are at it (and forgive my obsession with the zip module :), is there enough need to support the Strong Encryption Specification in the zip module? At least one immediate benefit I can see is that the OP of the link I posted above will be happy :) The main reason the idea of supporting import of encrypted modules was shot down is that the simple encryption scheme is too weak to bother with. Supporting Strong Encryption might do away with that problem, besides possibly adding a whole new way of distributing Python modules. Are there any (more?) use cases, or am I missing some very trivial reason why Strong Encryption was never supported in the zip module? -- Shashank
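One possible reading of "boundary conditions" for such a comparison test, offered as a sketch rather than as what the reviewers had in mind. It assumes a hypothetical interface in which each implementation is a factory that takes password bytes and returns a bytes -> bytes callable; the function names and the particular inputs are illustrative choices.

```python
import itertools
import os


def boundary_cases():
    """Return an iterator of (password, data) pairs covering plausible edge cases."""
    passwords = [b"", b"a", b"\x00", b"\xff" * 64, b"secret"]
    payloads = [
        b"",                 # empty input
        b"\x00",             # single zero byte
        b"\xff",             # single high byte
        bytes(range(256)),   # every byte value once
        b"\x00" * 4096,      # long run of identical bytes
        os.urandom(65537),   # large random buffer with an odd length
    ]
    return itertools.product(passwords, payloads)


def find_mismatches(make_a, make_b, cases):
    """Return the (password, data) cases on which two decrypter factories disagree."""
    bad = []
    for pwd, data in cases:
        if make_a(pwd)(data) != make_b(pwd)(data):
            bad.append((pwd, data))
    return bad
```

In a real test, `make_a` and `make_b` would be the pure-Python and the C factories, and the test would assert that `find_mismatches` returns an empty list. Splitting each payload into chunks before decrypting would be a further case worth covering.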

exarkun@twistedmatrix.com wrote:
Also, if the use case is just protecting stuff from a sibling or your children, use an archiving program to zip/extract it :) So -1 here as well. Any added C code has a real cost for the reasons Jean-Paul listed, so it should only be used in cases where there's a major practical benefit to the speed-up. Faster execution of a problematic algorithm that is already well implemented by plenty of other applications doesn't qualify in my book (even if the speedup is by a couple of orders of magnitude). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

participants (9)
- Martin v. Löwis
- Collin Winter
- exarkun@twistedmatrix.com
- Gregory P. Smith
- Jeroen Ruigrok van der Werven
- Ludvig Ericson
- Nick Coghlan
- Shashank Singh
- Steven D'Aprano