Dear Python developers,
As a bioinformatician I work a lot with gzip-compressed data. Recently I discovered Intel's Intelligent Storage Acceleration Library (ISA-L) at https://github.com/intel/isa-l. Its developers implemented the DEFLATE and INFLATE algorithms in assembly language, and as a result it is much faster than zlib.
I have posted a few benchmarks in this Python bug: https://bugs.python.org/issue41566. (I just discovered that bugs.python.org is the wrong place for feature requests. I am sorry, I am still learning the proper way of doing these things, as this is my first feature proposal.) The TLDR is that it can speed up compression by 5x while speeding up decompression by 3x compared to standard gzip.
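For anyone who wants baseline numbers to compare an ISA-L-backed build against, a minimal stdlib-only harness along these lines is enough (this is an illustration, not the exact script from the bug report):

    import os
    import time
    import zlib

    # 1 MiB of incompressible data plus 1 MiB of repetitive data,
    # to exercise both ends of the compressibility spectrum.
    payload = os.urandom(1 << 20) + b"ACGT" * (1 << 18)

    t0 = time.perf_counter()
    blob = zlib.compress(payload, 1)
    t1 = time.perf_counter()
    zlib.decompress(blob)
    t2 = time.perf_counter()
    print(f"compress: {t1 - t0:.4f}s  decompress: {t2 - t1:.4f}s  "
          f"ratio: {len(blob) / len(payload):.3f}")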
ISA-L is BSD-3-Clause licensed, so I see no licensing issues with using it in CPython. It is already packaged in Linux distros, so I also see no problems with availability. Furthermore, the non-assembly parts are written in C, so including it in CPython should not pose big problems.
I am willing to write the PEP if more people think it is a good idea to do this.
Best regards, Ruben Vorderman
On Mon, 17 Aug 2020 08:49:23 -0000, "Ruben Vorderman" <r.h.p.vorderman@lumc.nl> wrote:
> Dear Python developers,
> As a bioinformatician I work a lot with gzip-compressed data. Recently I discovered Intel's Intelligent Storage Acceleration Library (ISA-L) at https://github.com/intel/isa-l. Its developers implemented the DEFLATE and INFLATE algorithms in assembly language, and as a result it is much faster than zlib.
> I have posted a few benchmarks in this Python bug: https://bugs.python.org/issue41566. (I just discovered that bugs.python.org is the wrong place for feature requests. I am sorry, I am still learning the proper way of doing these things, as this is my first feature proposal.) The TLDR is that it can speed up compression by 5x while speeding up decompression by 3x compared to standard gzip.
> ISA-L is BSD-3-Clause licensed, so I see no licensing issues with using it in CPython.
In any case, it should be simple enough to post a package on PyPI that exposes the desired wrapper APIs.
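For illustration, such a package could mirror the stdlib gzip module so that switching is a one-line import change. A minimal sketch, assuming a hypothetical isal package (the package and module names here are illustrative, not an existing API):

    try:
        from isal import igzip as gzip  # hypothetical ISA-L-backed drop-in
    except ImportError:
        import gzip  # fall back to the stdlib implementation

    # "reads.fastq.gz" is just an example path.
    with gzip.open("reads.fastq.gz", "rt") as fh:
        for line in fh:
            pass  # process each line of the decompressed file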
Regards
Antoine.
On 17.08.20 15:00, Antoine Pitrou wrote:
> On Mon, 17 Aug 2020 08:49:23 -0000, "Ruben Vorderman" wrote:
>> Dear Python developers,
>> As a bioinformatician I work a lot with gzip-compressed data. Recently I discovered Intel's Intelligent Storage Acceleration Library (ISA-L) at https://github.com/intel/isa-l. Its developers implemented the DEFLATE and INFLATE algorithms in assembly language, and as a result it is much faster than zlib.
>> I have posted a few benchmarks in this Python bug: https://bugs.python.org/issue41566. (I just discovered that bugs.python.org is the wrong place for feature requests. I am sorry, I am still learning the proper way of doing these things, as this is my first feature proposal.) The TLDR is that it can speed up compression by 5x while speeding up decompression by 3x compared to standard gzip.
>> ISA-L is BSD-3-Clause licensed, so I see no licensing issues with using it in CPython.
> In any case, it should be simple enough to post a package on PyPI that exposes the desired wrapper APIs.
I re-opened the ticket to allow for some discussion over there in order to understand the implications better. But I agree that a third-party package on PyPI seems like a good first step, also as a backport.
Stefan
On Mon, Aug 17, 2020 at 04:08:54PM +0200, Stefan Behnel wrote:
> I re-opened the ticket to allow for some discussion over there in order to understand the implications better. But I agree that a third-party package on PyPI seems like a good first step, also as a backport.
Perhaps I have misunderstood, but isn't this a pure implementation change, with no user-visible API changes and backward-compatible output?
So why does it need to go on PyPI first? It isn't as if we need to wait to see whether there is demand for faster reading and writing of gzip files, or for the API to settle down.
-- Steve
On 17.08.20 17:00, Steven D'Aprano wrote:
> On Mon, Aug 17, 2020 at 04:08:54PM +0200, Stefan Behnel wrote:
>> I re-opened the ticket to allow for some discussion over there in order to understand the implications better. But I agree that a third-party package on PyPI seems like a good first step, also as a backport.
> Perhaps I have misunderstood, but isn't this a pure implementation change, with no user-visible API changes and backward-compatible output?
> So why does it need to go on PyPI first? It isn't as if we need to wait to see whether there is demand for faster reading and writing of gzip files, or for the API to settle down.
I didn't say that it won't be accepted into CPython. That depends a lot on how easy it is to integrate and what implications that would have. That will be decided as part of the discussion in the ticket.
However, even if it gets accepted, then that would be a change for CPython 3.10 at the earliest, maybe later, depending on when the patch is ready. Older CPython versions could still benefit from the faster (de-)compression by having a third-party module in PyPI.
Basically, wrapping a zlib-compatible library, e.g. in Cython or even as a copy of CPython's own zlib code, seems rather straightforward to me. I'm more worried about the build-time dependencies and setup that arise here, which would need tackling regardless of where/how we integrate it.
Having a third-party module available would show how easy or difficult it is to build such a module, thus giving an indication for the effort required to integrate and ship it with the CPython code base. It's a good first step with a readily usable outcome, whatever the decision in CPython will be.
Stefan
On Tue, Aug 18, 2020 at 1:05 AM Steven D'Aprano <steve@pearwood.info> wrote:
> On Mon, Aug 17, 2020 at 04:08:54PM +0200, Stefan Behnel wrote:
>> I re-opened the ticket to allow for some discussion over there in order to understand the implications better. But I agree that a third-party package on PyPI seems like a good first step, also as a backport.
> Perhaps I have misunderstood, but isn't this a pure implementation change, with no user-visible API changes and backward-compatible output?
> So why does it need to go on PyPI first? It isn't as if we need to wait to see whether there is demand for faster reading and writing of gzip files, or for the API to settle down.
That's exactly what I'm asking too - it should have compatible output, but does it have a compatible API? If it does - if you can drop it in and everything behaves equivalently - then it sounds like the sort of thing that can be included in Python 3.10 with minimal fuss. But if the API is different, it might be worth creating a wrapper with the old API (and then it *still* can just get included in 3.10).
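A minimal sketch of what such a shim could look like, assuming a hypothetical C extension with its own names (nothing here is a real module): route only the levels ISA-L implements (0-3) to the fast path, and fall back to zlib for everything else:

    import zlib

    try:
        import _isal_binding as _accel  # hypothetical ISA-L-backed extension
    except ImportError:
        _accel = None

    def compress(data: bytes, level: int = -1) -> bytes:
        # ISA-L only implements compression levels 0-3; use zlib otherwise.
        if _accel is not None and 0 <= level <= 3:
            return _accel.compress(data, level)  # hypothetical call
        return zlib.compress(data, level)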
ChrisA
On Mon, Aug 17, 2020 at 8:42 AM Chris Angelico <rosuav@gmail.com> wrote:
> but does it have a compatible API? If it does - if you can drop it in and everything behaves equivalently - then it sounds like the sort of thing that can be included in Python 3.10 with minimal fuss.
I’m confused - if it is fully compatible, then couldn’t it be included in all maintained versions of CPython?
Is anyone going to complain: “Hey! I just did a minor version update, and suddenly my program runs faster! WTF!”
OK: the obligatory XKCD:
In any case, a prototype outside of the CPython code base would be good for testing and review anyway.
--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Tue, Aug 18, 2020 at 3:00 AM Christopher Barker <pythonchb@gmail.com> wrote:
> On Mon, Aug 17, 2020 at 8:42 AM Chris Angelico <rosuav@gmail.com> wrote:
>> but does it have a compatible API? If it does - if you can drop it in and everything behaves equivalently - then it sounds like the sort of thing that can be included in Python 3.10 with minimal fuss.
> I’m confused - if it is fully compatible, then couldn’t it be included in all maintained versions of CPython?
> Is anyone going to complain: “Hey! I just did a minor version update, and suddenly my program runs faster! WTF!”
> OK: the obligatory XKCD:
> In any case, a prototype outside of the CPython code base would be good for testing and review anyway.
That's exactly why it wouldn't be done in 3.9.1, even if it appears to be completely compatible. :) You never know what a change like this might break. This isn't purely an optimization, it's a complete replacement; but that should be safe for 3.10, just not for 3.9.1.
ChrisA
On Mon, Aug 17, 2020 at 8:02 AM Steven D'Aprano <steve@pearwood.info> wrote:
> Perhaps I have misunderstood, but isn't this a pure implementation change, with no user-visible API changes and backward-compatible output?
According to the documentation [1], it only supports compression levels 0-3. They're supposed to be comparable in ratio to zlib's levels 0-3. I found benchmarks [2] of an older version that only had compression level 1, which shows its ratio being quite a bit worse than zlib's level 1, but maybe they've improved it.
The library interface seems similar, but it isn't drop-in compatible. It doesn't appear to have equivalents of inflateCopy and deflateCopy, which are exposed by Python's standard binding. There may be other missing features that I didn't notice.
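For reference, this is the stdlib behaviour in question, using only documented zlib calls; a replacement backend would have to reproduce it (or fall back to zlib for these objects):

    import zlib

    c = zlib.compressobj(1)  # level-1 compressor (wraps deflateInit)
    emitted = c.compress(b"shared prefix, " * 100)
    fork = c.copy()  # snapshot of the compressor state (deflateCopy)
    # Both objects can now continue independently from the shared state.
    stream_a = emitted + c.compress(b"branch A") + c.flush()
    stream_b = emitted + fork.compress(b"branch B") + fork.flush()
    assert zlib.decompress(stream_a).endswith(b"branch A")
    assert zlib.decompress(stream_b).endswith(b"branch B")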
The streams it produces are of course standard-compliant; decompression works with any standard-compliant stream and is probably always faster than zlib.
[1] https://01.org/sites/default/files/documentation/isa-l_api_2.28.0.pdf
[2] https://ci.spdk.io/download/events/2018-summit-prc/08_Liu_Xiaodong_&_Hui...
Testing myself on a large FASTA file, I find far faster performance with igzip, but also a far lower compression ratio (than gzip -6, the default).
My gzipped FASTA is nearly half the size of igzip's output at its -3 maximum. I haven't tried 'gzip -3' to compare sizes at the same level.
This seems to greatly limit its potential as a drop-in replacement in its current version.
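The stdlib makes the level-for-level comparison easy, which separates "igzip compresses worse than zlib" from "level 3 compresses worse than level 6" (the file name below is just an example):

    import zlib

    with open("large.fasta", "rb") as fh:  # any large FASTA file
        data = fh.read()

    for level in (1, 3, 6, 9):
        ratio = len(zlib.compress(data, level)) / len(data)
        print(f"zlib level {level}: {ratio:.3f} of original size")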
On Mon, Aug 17, 2020, 9:47 PM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
> On Mon, Aug 17, 2020 at 8:02 AM Steven D'Aprano <steve@pearwood.info> wrote:
>> Perhaps I have misunderstood, but isn't this a pure implementation change, with no user-visible API changes and backward-compatible output?
> According to the documentation [1], it only supports compression levels 0-3. They're supposed to be comparable in ratio to zlib's levels 0-3. I found benchmarks [2] of an older version that only had compression level 1, which shows its ratio being quite a bit worse than zlib's level 1, but maybe they've improved it.
> The library interface seems similar, but it isn't drop-in compatible. It doesn't appear to have equivalents of inflateCopy and deflateCopy, which are exposed by Python's standard binding. There may be other missing features that I didn't notice.
> The streams it produces are of course standard-compliant; decompression works with any standard-compliant stream and is probably always faster than zlib.
> [1] https://01.org/sites/default/files/documentation/isa-l_api_2.28.0.pdf
> [2] https://ci.spdk.io/download/events/2018-summit-prc/08_Liu_Xiaodong_&_Hui...
On Mon, Aug 17, 2020 at 10:48 PM Ruben Vorderman <r.h.p.vorderman@lumc.nl> wrote:
> Dear Python developers,
> As a bioinformatician I work a lot with gzip-compressed data. Recently I discovered Intel's Intelligent Storage Acceleration Library (ISA-L) at https://github.com/intel/isa-l. Its developers implemented the DEFLATE and INFLATE algorithms in assembly language, and as a result it is much faster than zlib.
> I have posted a few benchmarks in this Python bug: https://bugs.python.org/issue41566. (I just discovered that bugs.python.org is the wrong place for feature requests. I am sorry, I am still learning the proper way of doing these things, as this is my first feature proposal.) The TLDR is that it can speed up compression by 5x while speeding up decompression by 3x compared to standard gzip.
> ISA-L is BSD-3-Clause licensed, so I see no licensing issues with using it in CPython. It is already packaged in Linux distros, so I also see no problems with availability. Furthermore, the non-assembly parts are written in C, so including it in CPython should not pose big problems.
> I am willing to write the PEP if more people think it is a good idea to do this.
You describe this as a feature change. Are there any visible differences when you use ISA-L compared to zlib? AIUI the compressed data stream should be compatible, but are there any API-level changes?
Are there any situations in which one would prefer zlib over ISA-L?
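Stream-level compatibility, at least, is mechanically checkable once any binding exists; a sketch, with 'accelerated' standing in for a hypothetical ISA-L-backed module exposing the stdlib signatures:

    import gzip

    def check_roundtrip(accelerated, data: bytes) -> None:
        # The new backend's output must be readable by the stdlib,
        # and vice versa.
        assert gzip.decompress(accelerated.compress(data)) == data
        assert accelerated.decompress(gzip.compress(data)) == data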
ChrisA