Python has support for several compression methods, which is great...sort of.
Recently, I had to port an application that used zips to use tar+gzip. It *sucked*. Why? Inconsistency.
For instance, zipfile uses write to add files, tar uses add. Then, tar has addfile (which confusingly does not add a file), and zip has writestr, for which a tar alternative has been proposed on the bug tracker (http://bugs.python.org/issue22208) but hasn't had any activity for a while.
Then, zipfile and tarfile's extract methods have slightly different arguments. ZipFile has namelist, and TarFile has getnames. It's all a mess.
The idea is to reorganize Python's zip support just like the sha and md5 modules were put into the hashlib module. Name? ziplib.
It would have a few ABCs:
- an ArchiveFile class that ZipFile and TarFile would derive from.
- a BasicArchiveFile class that ArchiveFile extends. bz2, lzma, and (maybe) gzip would be derived from this one.
- ArchiveFileInfo, which ZipInfo and TarInfo would derive from. ArchiveFile and ArchiveFileInfo go hand-in-hand.
Now, the APIs would be more consistent. Of course, some have methods that others don't have, but the usage would be easier. zlib would be completely separate, since it has its own API and all.
I'm not quite sure how gzip fits into this, since it has a very minimal API.
Thoughts?
--
Ryan
If anybody ever asks me why I prefer C++ to C, my answer will be simple: "It's becauseslejfp23(@#Q*(E*EIdc-SEGFAULT. Wait, I don't think that was nul-terminated."
Personal reality distortion fields are immune to contradictory evidence. - srean
Check out my website: http://kirbyfan64.github.io/
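For readers who have not hit this themselves, the naming mismatch described above can be seen in a few lines. This is only a minimal sketch using in-memory buffers; the member name hello.txt and the payload are made up for illustration, and only stdlib calls are used.

```python
import io
import tarfile
import zipfile

payload = b"hello archive\n"

# zipfile: writestr() stores in-memory bytes under a member name.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("hello.txt", payload)

# tarfile: there is no writestr(); a TarInfo has to be built by hand
# and passed to addfile() together with a file object holding the data.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

# Listing members: namelist() on ZipFile versus getnames() on TarFile.
zip_buffer.seek(0)
with zipfile.ZipFile(zip_buffer) as zf:
    zip_names = zf.namelist()

tar_buffer.seek(0)
with tarfile.open(fileobj=tar_buffer, mode="r:gz") as tf:
    tar_names = tf.getnames()

print(zip_names, tar_names)
```

Both listings contain the same single member, but every step needed a different method name or a different calling convention.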
On 02/16/2015 03:10 PM, Ryan Gonzalez wrote:
> The idea is to reorganize Python's zip support just like the sha and md5 modules were put into the hashlib module. Name? ziplib.
Seems like an interesting idea.
> Thoughts?
Have you checked to see if anybody has already done this in the wild?
--
~Ethan~
On 16 February 2015 at 23:19, Ethan Furman
>> The idea is to reorganize Python's zip support just like the sha and md5 modules were put into the hashlib module. Name? ziplib.
> Seems like an interesting idea.
Agreed. A unified interface would be nice for those applications that want to be able to work with general "archives". There's quite a lot of code like that around (shutil has some, distutils' sdists can be zip or tgz files, ...)
>> Thoughts?
> Have you checked to see if anybody has already done this in the wild?
That's certainly worth doing. Also, this may be something that would work better on PyPI - a stdlib module would certainly be a possibility, but there's so much code out there using the zipfile module (maybe not so much using tarfile, I'm not sure) that a ziplib module is unlikely ever to be a *replacement*, just an addition with a unified interface. Backward compatibility would prevent anything else. And as a wrapper round stdlib classes, it may find a good home on PyPI for people that need it.
But your ABC-based approach (which presumably would allow extension to add in new archive types like 7-zip) seems like a good design.
(Hmm, it seems like I'm saying "put a module on PyPI" a lot at the moment - that's odd, as I'm normally a strong proponent of the "batteries included" philosophy. I need to go take my medication...)
Paul
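To make the "ABC plus wrappers around the stdlib classes" idea concrete, here is one possible shape it could take. This is a sketch under assumptions, not the proposed API: the class names ZipArchive and TarArchive, the methods add_bytes and names, and the helper round_trip are all hypothetical and do not appear in the proposal. Only ArchiveFile is a name taken from the thread.

```python
import abc
import io
import tarfile
import zipfile
from typing import List


class ArchiveFile(abc.ABC):
    """Hypothetical common interface; method names are illustrative only."""

    @abc.abstractmethod
    def add_bytes(self, name: str, data: bytes) -> None:
        """Store in-memory data under the given member name."""

    @abc.abstractmethod
    def names(self) -> List[str]:
        """Return the member names in the archive."""

    @abc.abstractmethod
    def close(self) -> None:
        """Finish writing and release the underlying archive object."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


class ZipArchive(ArchiveFile):
    """Thin wrapper delegating to zipfile.ZipFile."""

    def __init__(self, fileobj, mode="r"):
        self._zf = zipfile.ZipFile(fileobj, mode)

    def add_bytes(self, name, data):
        self._zf.writestr(name, data)

    def names(self):
        return self._zf.namelist()

    def close(self):
        self._zf.close()


class TarArchive(ArchiveFile):
    """Thin wrapper delegating to tarfile.TarFile."""

    def __init__(self, fileobj, mode="r"):
        self._tf = tarfile.open(fileobj=fileobj, mode=mode)

    def add_bytes(self, name, data):
        # tarfile needs an explicit TarInfo with the size filled in.
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        self._tf.addfile(info, io.BytesIO(data))

    def names(self):
        return self._tf.getnames()

    def close(self):
        self._tf.close()


def round_trip(archive_cls):
    """Write two members through the common interface, then list them."""
    buffer = io.BytesIO()
    with archive_cls(buffer, "w") as archive:
        archive.add_bytes("a.txt", b"alpha")
        archive.add_bytes("b.txt", b"beta")
    buffer.seek(0)
    with archive_cls(buffer, "r") as archive:
        return archive.names()


print(round_trip(ZipArchive), round_trip(TarArchive))
```

Calling code such as round_trip never mentions zip or tar, which is the point of the design; a 7-zip or cpio backend would only need to supply another subclass.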
I love the idea (I've had to write my own custom wrapper around a subset of functionality for zip and tgz a few times, and I don't remember enjoying it...), and I especially like the idea of starting with the ABCs (which would hopefully encourage all those people writing cpio and 7z and whatever parsers to conform to the same API, even if they do so by wrapping LGPL libs like libarchive or even subprocessing out to command-line tools).
But why "ziplib"? Why not something generic like "archive" that doesn't sound like it's specific to ZIP files?
Also, this definitely seems like something that would be worth putting on PyPI first and then proposing for the stdlib. You're pretty much guaranteed to need a backport for 3.4 and 2.7 users anyway, right?
Sent from a random iPhone
On Mon, Feb 16, 2015 at 5:32 PM, Andrew Barnert
> I love the idea (I've had to write my own custom wrapper around a subset of functionality for zip and tgz a few times, and I don't remember enjoying it...), and I especially like the idea of starting with the ABCs (which would hopefully encourage all those people writing cpio and 7z and whatever parsers to conform to the same API, even if they do so by wrapping LGPL libs like libarchive or even subprocessing out to command-line tools).
> But why "ziplib"? Why not something generic like "archive" that doesn't sound like it's specific to ZIP files?
This occurred to me a few moments ago. How about arclib?
> Also, this definitely seems like something that would be worth putting on PyPI first and then proposing for the stdlib. You're pretty much guaranteed to need a backport for 3.4 and 2.7 users anyway, right?
I'll look into that.
On Feb 16, 2015, at 17:00, Ryan Gonzalez
> This occurred to me a few moments ago. How about arclib?
Sounds good to me--parallel with hashlib and, unlike archive, arclib is still an unused name on PyPI.
One last thing: it might be worth looking at the libarchive C API and the SWIG wrapper on PyPI to see if there's anything useful to steal for your API design. IIRC, it also has a nifty shim to provide the zipfile and tarfile APIs on top of libarchive, which could be handy for your original use case (making legacy tarfile-based code work with zip archives without much rewriting).
* Ryan Gonzalez
> This occurred to me a few moments ago. How about arclib?
To me, that sounds like a library to handle ARC files:
http://en.wikipedia.org/wiki/ARC_(file_format)
Florian
--
http://www.the-compiler.org | me@the-compiler.org (Mail/XMPP)
GPG: 916E B0C8 FD55 A072 | http://the-compiler.org/pubkey.asc
I love long mails! | http://email.is-not-s.ms/
On Mon, Feb 16, 2015 at 10:20 PM, Florian Bruhin
>> This occurred to me a few moments ago. How about arclib?
> To me, that sounds like a library to handle ARC files:
Darn... Does anybody even use that format anymore?
If bikeshedding about the name is all we're doing at this point, I wouldn't worry about the ancient BBS-era .arc file format that nobody will write a Python library to read. :) I'd propose "archivers" as a module name, though arclib is also nice. If you want to really pay homage to a good, little-used archive format, call it arjlib. ;)
"The" problem with zipfile, tarfile, and similar today is that they are all undermaintained and have a variety of things inherent to the underlying archive formats that are not common to all of them. I.e., zip files have an end-of-archive central directory index; tar files do not. rar files can be seen as a seemingly (sadly?) common successor to zip files. No doubt there are others (7z? cpio?). zip files compress individual files; tar files don't support compression themselves, as it is applied to the whole archive after the fact. The amount and type of directory information available within each of these varies. And that doesn't even touch the multi-file, multi-part archive support that some formats have for very horrible, hacky reasons.
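The per-member versus whole-archive compression difference mentioned above shows up directly in the stdlib APIs. The following is a small illustrative sketch with made-up member names; it assumes zlib is available (needed for ZIP_DEFLATED and for the gzip stream).

```python
import io
import tarfile
import zipfile

text = b"compress me " * 200

# zip: the compression method is chosen per member and recorded
# in each member's ZipInfo entry.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("stored.txt", text, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", text, compress_type=zipfile.ZIP_DEFLATED)

zip_buffer.seek(0)
with zipfile.ZipFile(zip_buffer) as zf:
    per_member = {info.filename: info.compress_type for info in zf.infolist()}

# tar: members are written uncompressed into the tar stream, and gzip
# is applied to that stream as a whole (selected here by mode "w:gz").
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as tf:
    for name in ("one.txt", "two.txt"):
        info = tarfile.TarInfo(name)
        info.size = len(text)
        tf.addfile(info, io.BytesIO(text))

# The outer bytes are a gzip stream, recognizable by its two-byte magic number.
is_gzip_stream = tar_buffer.getvalue()[:2] == b"\x1f\x8b"

print(per_member, is_gzip_stream)
```

A common API would have to decide where an option like "compression method" lives: on each member, on the archive, or on both with some combinations rejected.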
Coming up with a common API for these with the key features needed by all is interesting, doubly so for someone pedantic enough to get an implementation of each correct, but it should be done as a third-party library and should ideally not use the stdlib zip or tar support at all. Do such libraries exist for other languages? Any C++ that could be reused? Is there such a thing as high-quality code for this kind of task?
-gps
On 17 February 2015 at 21:17, Gregory P. Smith
> Coming up with a common API for these with the key features needed by all is interesting, doubly so for someone pedantic enough to get an implementation of each correct, but it should be done as a third-party library and should ideally not use the stdlib zip or tar support at all. Do such libraries exist for other languages? Any C++ that could be reused? Is there such a thing as high-quality code for this kind of task?
In the C world, I think libarchive is the best-known "handle all the formats" library. I've only used the command-line bsdtar application that's built on it, though, so I don't know what the library API is like.
Paul
On Feb 17, 2015, at 13:17, "Gregory P. Smith"
> Coming up with a common API for these with the key features needed by all is interesting, doubly so for someone pedantic enough to get an implementation of each correct, but it should be done as a third-party library and should ideally not use the stdlib zip or tar support at all. Do such libraries exist for other languages? Any C++ that could be reused? Is there such a thing as high-quality code for this kind of task?
I already mentioned BSD libarchive. (It's in C, not C++, but why would you want C++ for this?) It does pretty much everything you're asking for, and more (more formats, like ISO disk images; format auto-detection by name or magic; etc.). And python-libarchive seems like a pretty up-to-date and well-maintained wrapper. Plus, as I mentioned, it has compatibility wrappers to offer a subset of its functionality with the zipfile and tarfile APIs, making it easy to modify legacy code--e.g., the original problem that started this thread, modifying an app to use zipfile instead of tarfile, would probably only require a couple of lines of code (import the tarfile compat library, change the filenames).
But the question isn't whether someone can build an ultimate archive library; something that's self-contained, and easy to build and distribute, but only handles the most important types and the basic level of functionality the current stdlib provides (but in a more consistent way) would still be very useful.
On 18 February 2015 at 07:17, Gregory P. Smith
> Coming up with a common API for these with the key features needed by all is interesting, doubly so for someone pedantic enough to get an implementation of each correct, but it should be done as a third-party library and should ideally not use the stdlib zip or tar support at all. Do such libraries exist for other languages? Any C++ that could be reused? Is there such a thing as high-quality code for this kind of task?
It's also worth remembering there's an existing higher-level procedural API in shutil for the cases where all you need is archive creation and unpacking: https://docs.python.org/3/library/shutil.html#archiving-operations
It doesn't help if you need to access or manipulate the internals of an existing archive, though.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
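As a rough illustration of the shutil archiving operations mentioned above, the same two calls handle both zip and gzipped tar. The directory and file names below are invented for the example, and everything happens inside a temporary directory.

```python
import pathlib
import shutil
import tempfile

results = []

with tempfile.TemporaryDirectory() as workdir:
    work = pathlib.Path(workdir)

    # A tiny directory tree to archive.
    source = work / "project"
    source.mkdir()
    (source / "readme.txt").write_text("hello\n")

    # The format is just a string; the calls are otherwise identical.
    for fmt in ("zip", "gztar"):
        archive_path = shutil.make_archive(
            str(work / ("backup-" + fmt)), fmt, root_dir=str(source)
        )
        target = work / ("unpacked-" + fmt)
        # The format is inferred from the file extension when unpacking.
        shutil.unpack_archive(archive_path, str(target))
        results.append(
            (pathlib.Path(archive_path).name, (target / "readme.txt").read_text())
        )

print(results)
```

This covers whole-directory packing and unpacking only; adding a single member or inspecting metadata still requires dropping down to zipfile or tarfile.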
For simple use cases, there is shutil:
https://docs.python.org/dev/library/shutil.html#archiving-operations
Victor
participants (8)
- Andrew Barnert
- Ethan Furman
- Florian Bruhin
- Gregory P. Smith
- Nick Coghlan
- Paul Moore
- Ryan Gonzalez
- Victor Stinner