Add the brotli & zstandard compression algorithms as modules
Hello everyone, We have many compression algorithms as standard modules, but we lack two important new algorithms: brotli and zstandard. Brotli is an important compression algorithm both for HTTP2 and for compressing/decompressing HTTP responses. ZStandard is another algorithm for compression that can easily beat gzip with a better compression ratio and less compression time. Both are now stable enough and are past the 1.0 mark. Let's include them in the standard library and, of course, make them optional at build time like the rest of our compression modules. What do you think?
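(For context, all of the stdlib's existing compression modules share the same minimal one-shot interface, which is presumably the shape a brotli or zstd module would take. A short stdlib-only sketch of that interface — the proposed modules themselves don't exist yet, so zlib/bz2/lzma stand in:)

    import bz2
    import lzma
    import zlib

    data = b"python-ideas " * 1000

    # Each stdlib compression module exposes the same one-shot API;
    # a stdlib zstd or brotli module would presumably mirror it.
    for name, mod in [("zlib", zlib), ("bz2", bz2), ("lzma", lzma)]:
        compressed = mod.compress(data)
        assert mod.decompress(compressed) == data
        print(f"{name}: {len(data)} -> {len(compressed)} bytes")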
Hi, On Mon, 21 Sep 2020 09:31:47 -0000 "Omer Katz" <omer.drow@gmail.com> wrote:
Hello everyone,
We have many compression algorithms as standard modules, but we lack two important new algorithms: brotli and zstandard.
Brotli is an important compression algorithm both for HTTP2 and for compressing/decompressing HTTP responses. ZStandard is another algorithm for compression that can easily beat gzip with a better compression ratio and less compression time. Both are now stable enough and are past the 1.0 mark.
Let's include them in the standard library and, of course, make them optional at build time like the rest of our compression modules. What do you think?
I would agree with ZStandard which is best-in-class currently. Brotli sounds less important. Just my 2 cents, though :-) In any case, each of these modules would need a maintainer willing to ensure maintenance inside the Python standard library. Regards Antoine.
On 21 Sep 2020, at 16:14, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hi,
On Mon, 21 Sep 2020 09:31:47 -0000 "Omer Katz" <omer.drow@gmail.com> wrote:
Hello everyone,
We have many compression algorithms as standard modules, but we lack two important new algorithms: brotli and zstandard.
Brotli is an important compression algorithm both for HTTP2 and for compressing/decompressing HTTP responses. ZStandard is another algorithm for compression that can easily beat gzip with a better compression ratio and less compression time. Both are now stable enough and are past the 1.0 mark.
Let's include them in the standard library and, of course, make them optional at build time like the rest of our compression modules. What do you think?
I would agree with ZStandard which is best-in-class currently.
Brotli sounds less important. Just my 2 cents, though :-)
We had to add this to our product; I recall it is used extensively by Google. We packaged the code at https://github.com/google/brotli to support this. I have not heard of zstd being used on the web yet; elsewhere it pops up all the time. Barry
In any case, each of these modules would need a maintainer willing to ensure maintenance inside the Python standard library.
Regards
Antoine.
zstd is used when you need fast compression speed, not the best ratio. Maybe we can ask google and facebook to contribute their implementations?
Hello, On Wed, 23 Sep 2020 06:11:27 -0000 "Omer Katz" <omer.drow@gmail.com> wrote:
zstd is used when you need fast compression speed, not the best ratio.
Maybe we can ask google and facebook to contribute their implementations?
And $$$ to support maintaining it over the years.

In the meantime, why can't you use modules on PyPI/github/wherever else? Why should the CPython stdlib become an ugly, ever-growing monster, siphoning everything around it like a black hole?

-- Best regards, Paul mailto:pmiscml@gmail.com
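(For anyone taking that advice today: the third-party zstandard package on PyPI, mentioned later in this thread, already provides bindings. A minimal sketch of its API, which differs from the bz2/lzma style:)

    # pip install zstandard  (third-party bindings, not the stdlib)
    import zstandard

    data = b"python-ideas " * 1000

    cctx = zstandard.ZstdCompressor(level=3)
    compressed = cctx.compress(data)

    dctx = zstandard.ZstdDecompressor()
    assert dctx.decompress(compressed) == data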
Mainly because we previously explored creating wheels with better compression. But I think that if LZMA was included, then other new algorithms should be included as well.

On Wed, 23 Sep 2020 at 10:08, Paul Sokolovsky <pmiscml@gmail.com> wrote:
Hello,
On Wed, 23 Sep 2020 06:11:27 -0000 "Omer Katz" <omer.drow@gmail.com> wrote:
zstd is used when you need fast compression speed, not the best ratio.
Maybe we can ask google and facebook to contribute their implementations?
And $$$ to support maintaining it over the years.
In the meantime, why can't you use modules on PyPI/github/wherever else? Why should the CPython stdlib become an ugly, ever-growing monster, siphoning everything around it like a black hole?
-- Best regards, Paul mailto:pmiscml@gmail.com
On Wed, 23 Sep 2020 at 08:08, Paul Sokolovsky <pmiscml@gmail.com> wrote:
In the meantime, why can't you use modules on PyPI/github/wherever else?
There are significant use cases where 3rd party modules are not easy to use. But let's not get sucked into that digression here.

The point of this request is that Python's packaging infrastructure is looking at what compression we use for wheels - the current compression is suboptimal for huge binaries like tensorflow. Packaging is in a unique situation, because it *cannot* use external libraries - you can't have the wheel format depend on something that you need to install a wheel to use.

If the decision is not to include these, we just have to stick to the existing compression methods, and accept the cost. But it's certainly worth looking at the possibility.

Paul
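(For reference on "the existing compression methods": wheels are ordinary zip archives, compressed with deflate today, and the stdlib zipfile module already supports a handful of codecs. A stdlib-only sketch:)

    import zipfile

    data = "hello wheel " * 1000

    # Wheels are plain zip archives; the stdlib zipfile module
    # already knows a few codecs beyond the deflate wheels use today:
    for method in (zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED,
                   zipfile.ZIP_BZIP2, zipfile.ZIP_LZMA):
        with zipfile.ZipFile(f"demo-{method}.zip", "w",
                             compression=method) as zf:
            zf.writestr("hello.txt", data)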
On Tue, Sep 22, 2020 at 11:55 PM Paul Moore <p.f.moore@gmail.com> wrote:
The point of this request is that Python's packaging infrastructure is looking at what compression we use for wheels - the current compression is suboptimal for huge binaries like tensorflow. Packaging is in a unique situation, because it *cannot* use external libraries
It's hard to see where packaging would have any advantage with brotli or zstd over lzma. XZ is more widely used, and package size seems to dominate speed. There are definitely some intermediate compression levels where both brotli and zstd are significantly faster, but not at the higher levels where lzma does as well or better.

Is there a concrete need here, or just an abstract point that compression of packages shouldn't be outside the stdlib?

Honestly, if you really want compression size over everything else, PPM is going to beat the LZ-based approaches, but it is ungodly slow and uses tons of memory.

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
I pointed out a use case for Brotli & HTTP2 as a concrete example for why it'd be more convenient to include brotli as a module. I'm sure there are other cases I haven't thought about. I don't understand why LZMA should be included while zstd or brotli shouldn't. What's the actual policy here?

On Wed, 23 Sep 2020 at 13:09, David Mertz <mertz@gnosis.cx> wrote:
On Tue, Sep 22, 2020 at 11:55 PM Paul Moore <p.f.moore@gmail.com> wrote:
The point of this request is that Python's packaging infrastructure is looking at what compression we use for wheels - the current compression is suboptimal for huge binaries like tensorflow. Packaging is in a unique situation, because it *cannot* use external libraries
It's hard to see where packaging would have any advantage with brotli or zstd over lzma. XZ is more widely used, and package size seems to dominate speed. There are definitely some intermediate compression levels where both brotli and zstd are significantly faster, but not at the higher levels where lzma does as well or better.
Is there a concrete need here, or just an abstract point that compression of packages shouldn't be outside the stdlib?
Honestly, if you really want compression size over everything else, PPM is going to beat the LZ-based approaches, but it is ungodly slow and uses tons of memory.
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
Let's put it this way. If you can only support 3 compression algorithms in the stdlib, which three would you choose? If only 4? If only 10? Each one is concrete maintenance work. There's nothing *wrong* with any of them, and someone uses each of the top 10 or 50. But some kind of cut-off of usefulness vs. burden is necessary.

Paul's case of packaging needing stdlib support is plausible. I'm not quite convinced even by that, though. Just like we have ensurepip, it would be perfectly possible to make whizbang-compression the first dependency of every other package, but package whizbang itself using plain tar. However, HTTP2 is absolutely something that's fine for PyPI. That really doesn't need to be stdlib.

On Wed, Sep 23, 2020, 12:26 AM Omer Katz <omer.drow@gmail.com> wrote:
I pointed out a use case for Brotli & HTTP2 as a concrete example for why it'd be more convenient to include brotli as a module. I'm sure there are other cases I haven't thought about.
I don't understand why LZMA should be included while zstd or brotli shouldn't. What's the actual policy here?
On Wed, 23 Sep 2020 at 13:09, David Mertz <mertz@gnosis.cx> wrote:
On Tue, Sep 22, 2020 at 11:55 PM Paul Moore <p.f.moore@gmail.com> wrote:
The point of this request is that Python's packaging infrastructure is looking at what compression we use for wheels - the current compression is suboptimal for huge binaries like tensorflow. Packaging is in a unique situation, because it *cannot* use external libraries
It's hard to see where packaging would have any advantage with brotli or zstd over lzma. XZ is more widely used, and package size seems to dominate speed. There are definitely some intermediate compression levels where both brotli and zstd are significantly faster, but not at the higher levels where lzma does as well or better.
Is there a concrete need here, or just an abstract point that compression of packages shouldn't be outside the stdlib?
Honestly, if you really want compression size over everything else, PPM is going to beat the LZ-based approaches, but it is ungodly slow and uses tons of memory.
-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
I actually disagree on HTTP2 but that's beside the point. What are the use-cases for LZMA that make it qualify to be part of the stdlib? Why was that library included? I think we shouldn't discriminate. If there are a couple of use-cases users need and the implementation is sufficiently stable, I see no reason not to include those libraries in stdlib.
On 9/29/2020 9:26 AM, Omer Katz wrote:
I actually disagree on HTTP2 but that's beside the point.
What are the use-cases for LZMA that make it qualify to be part of the stdlib? Why was that library included? I think we shouldn't discriminate. If there are a couple of use-cases users need and the implementation is sufficiently stable, I see no reason not to include those libraries in stdlib.
I can't say this is the only reason, but back when distributing and installing packages was more difficult, we had a lower bar for stdlib inclusion. Eric
On Tue, Sep 29, 2020 at 6:34 AM Eric V. Smith <eric@trueblade.com> wrote:
I think we shouldn't discriminate. If there are a couple of use-cases users need and the implementation is sufficiently stable, I see no reason not to include those libraries in stdlib.
I think this was covered earlier in this thread, but even if there are good use-cases, a new method won't be included unless there is someone agreeing to implement and maintain it. That is: a maintainer is a necessary but not sufficient criterion for inclusion. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 29.09.20 16:26, Omer Katz wrote:
What are the use-cases for LZMA that make it qualify to be part of the stdlib? Why was that library included? I think we shouldn't discriminate. If there are a couple of use-cases users need and the implementation is sufficiently stable, I see no reason not to include those libraries in stdlib.
That method was very popular for compressing data at the time (and I think it is still popular). It was supported by the tar utility, was included in the zip file specification, was used for distributing source archives, and was used for compressing packages in Linux distributions. It made bzip2 mostly obsolete. In Python it was needed for distutils.
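(For reference, the lzma module that came out of that inclusion follows the same interface as the other compression modules; a minimal stdlib-only sketch:)

    import lzma

    data = b"python-ideas " * 1000

    # One-shot API, same shape as zlib/bz2:
    compressed = lzma.compress(data, preset=9)
    assert lzma.decompress(compressed) == data

    # File API reads/writes .xz, the format used for many source tarballs:
    with lzma.open("demo.xz", "wb") as f:
        f.write(data)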
On Wed, 23 Sep 2020 13:26:13 +0300 Omer Katz <omer.drow@gmail.com> wrote:
I pointed out a use case for Brotli & HTTP2 as a concrete example for why it'd be more convenient to include brotli as a module. I'm sure there are other cases I haven't thought about.
I don't understand why LZMA should be included while zstd or brotli shouldn't. What's the actual policy here?
The main policy requirement is a maintainer willing to shepherd inclusion and then maintain the module in the stdlib. One person volunteered at some point to do that for LZMA, and it had desirable characteristics as a compression algorithm, so it happened. (And when LZMA was included in 2011, ZStandard didn't even exist, AFAIK.) Regards Antoine.
On Wed, 23 Sep 2020 at 11:09, David Mertz <mertz@gnosis.cx> wrote:
It's hard to see where packaging would have any advantage with brotli or zstd over lzma. XZ is more widely used, and package size seems to dominate speed. There are definitely some intermediate compression levels where both brotli and zstd are significantly faster, but not at the higher levels where lzma does as well or better.
Is there a concrete need here, or just an abstract point that compression of packages shouldn't be outside the stdlib?
Honestly, if you really want compression size over everything else, PPM is going to beat the LZ based approaches. But being ungodly slow and using tons of memory.
The discussion over on the Packaging discourse channel is ongoing, and hasn't reached any conclusion yet. But yes, there have been lots of debates over what's the best compression method. That debate triggered this request, but it doesn't indicate a specific need for these compression methods. I can't speak directly for Omer, but I assume the intention was to explore what's possible/likely, to inform the discussion of wheel compression with some concrete information about what we can reasonably assume might be in the stdlib.

I don't have a personal stake here - I'd likely never use brotli or zstd unless a file format that I needed to process used them, and for any use case *other* than packaging, I don't currently care if the support is in the stdlib or not. If either format becomes commonly used, then I'd argue for them being in the stdlib, but right now they seem relatively uncommon in my personal experience.

Paul
On Wed, Sep 23, 2020 at 3:10 AM David Mertz <mertz@gnosis.cx> wrote:
On Tue, Sep 22, 2020 at 11:55 PM Paul Moore <p.f.moore@gmail.com> wrote:
The point of this request is that Python's packaging infrastructure is looking at what compression we use for wheels - the current compression is suboptimal for huge binaries like tensorflow.
There are definitely some intermediate compression levels where both brotli and zstd are significantly faster [than lzma], but not at the higher levels where lzma does as well or better.
I'd assume that only decompression speed matters for packages, and on that metric both brotli and zstd beat lzma by a mile regardless of the compression level. But I think that lzma gets exceptionally good ratios on x86/x64 machine code. Even after all these years it seems to be the state of the art for "best ratio that isn't painfully slow to decompress".
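(The decompression-speed claim is easy to probe for the stdlib codecs; zstd and brotli would need third-party packages, so this unscientific sketch only contrasts zlib and lzma, and the numbers are machine-dependent:)

    import lzma
    import time
    import zlib

    data = bytes(range(256)) * 40_000  # ~10 MB, highly repetitive

    for name, comp, decomp in [
        ("zlib", zlib.compress, zlib.decompress),
        ("lzma", lzma.compress, lzma.decompress),
    ]:
        blob = comp(data)
        start = time.perf_counter()
        decomp(blob)
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(blob):>8} bytes, decompressed in {elapsed:.3f}s")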
I wrote a zstd module for the stdlib: https://github.com/animalize/cpython/pull/8/files And a PyPI version based on it: PyPI: https://pypi.org/project/pyzstd/ Doc: https://pyzstd.readthedocs.io/en/latest/ If you decide to include it in the stdlib, the work can be done in a short time.

Zstd has some advantages: fast speed, multi-threaded compression, dictionaries for small data, etc. IMO it's suitable as a replacement for zlib, but at this time:

1. If it is included in the stdlib, it will take advantage of the huge influence of Python and become popular.
2. If we wait until zstd becomes popular and there is no better alternative, unnecessary time will be wasted. (I'm +0.5 on this option. Python promoting a technology that is still on the rise is a bit strange.)

I heard that in the data science domain the data is often huge, such as hundreds of GB or more. If people can make full use of multi-core CPUs to compress, the experience will be much better than with zlib. Your survey on PyPI mentioned data science. Maybe you can talk to those people about this.
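(A minimal sketch of the interface described above, written against the pyzstd documentation linked above; the CParameter names are that package's and worth double-checking against its docs:)

    # pip install pyzstd  (third-party; API modelled on bz2/lzma)
    import pyzstd

    data = b"python-ideas " * 100_000

    # One-shot, bz2/lzma-style calls:
    compressed = pyzstd.compress(data)
    assert pyzstd.decompress(compressed) == data

    # Multi-threaded compression via advanced parameters
    # (per the pyzstd docs):
    option = {pyzstd.CParameter.compressionLevel: 10,
              pyzstd.CParameter.nbWorkers: 4}
    compressed_mt = pyzstd.compress(data, option)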
On Tue, Oct 13, 2020 at 8:16 AM Ma Lin <malincns@163.com> wrote:
Zstd has some advantages: fast speed, multi-threaded compression, dictionary for small data, etc. IMO it's suitable as a replacement for zlib, but at this time: 1, If it is included into stdlib, it will take advantage of the huge influence of Python and become popular.
Let's stipulate that zstd is "better" than bzip2 or lzma in relevant ways (although the reality is less unambiguous). Zstd was created in 2016, Brotli in 2013. Those are pretty new. Better is nice, but adoption of either of those is only middling. And the two are different and incompatible, although they offer largely similar benefits.

However, I am not at all certain that someone won't introduce something new in 2021 that is better in every technical way than either Zstd or Brotli. That doesn't make those bad, but Python isn't trying to have the "optimal cutting-edge" thing in its standard library. More like "the well-established, widely-used" thing.

I feel like you making Zstd available on PyPI is wonderful, and a huge service to the community. But I'm -0.5 on adding either it or Brotli to the standard library at this time. Until they are older and more established, it feels premature.

-- The dead increasingly dominate and strangle both the living and the not-yet born. Vampiric capital and undead corporate persons abuse the lives and control the thoughts of homo faber. Ideas, once born, become abortifacients against new conceptions.
but Python isn't trying to have the "optimal cutting-edge" thing in its standard library. More like "the well-established, widely-used" thing.
I also agree with this. At present, I have confidence in zstd. There seems to be a trend of some programmer users switching to zstd. I don't know if it will become popular among non-programmer users; if that happens in the future, what path will it take?
I feel like you making Zstd available on PyPI is wonderful, and a huge service to the community.
There has already been a zstandard module on PyPI since 2016, but its API is different from the bz2/lzma modules. I intentionally didn't look at its implementation when implementing my module. https://pypi.org/project/zstandard/
On Tue, 13 Oct 2020 05:58:45 -0000 "Ma Lin" <malincns@163.com> wrote:
I heard in data science domain, the data is often huge, such as hundreds of GB or more. If people can make full use of multi-core CPU to compress, the experience will be much better than zlib.
This is true, but in data science it is extremely beneficial to use specialized file formats, such as Parquet (which incidentally can use zstd under the hood). In that case, the compression is built into the Parquet implementation, and won't depend on zstd being available in the Python standard library. Regards Antoine.
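(For example, with the third-party pyarrow library, whose Parquet writer accepts a codec name; whether "zstd" is available depends on how the pyarrow build was compiled, so treat this as a sketch:)

    # pip install pyarrow  (zstd support depends on the build)
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})
    pq.write_table(table, "data.parquet", compression="zstd")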
I think packaging ought to be able to use binary dependencies. Some disagree. The binary ZStandard decompressor could be offered in a gzip-compressed wheel.

The reason an improved packaging format can only use ZStandard and not LZMA is that we need to improve everyone's experience, not just minimize bandwidth. LZMA can save a lot of bandwidth compared to gz and a tiny amount of bandwidth compared to ZStandard. The package consumer won't care, because (LZMA download time + decompression time) will usually be greater than (standard ZIP download time + decompression time). This is the same reason other package formats like RPM have switched to ZStandard.

At its default settings ZStandard can achieve faster compression and decompression times than gz / standard ZIP compression, and a better compression ratio, at the same time. In other words, if we used LZMA our consumers would be unhappy because it would usually slow them down, to say nothing of the producers. Only the most bandwidth-constrained users would benefit. ZStandard is generally faster and better than Brotli.
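(The download-plus-decompression argument is easy to make concrete with a back-of-envelope model; every number below is an illustrative assumption, not a measurement:)

    # Back-of-envelope model: total time = download time + decompression time.
    uncompressed_mb = 400          # e.g. a large binary wheel
    bandwidth_mb_s = 10            # assumed consumer download speed

    codecs = {
        # name: (assumed compression ratio, assumed decompression MB/s)
        "deflate": (3.0, 400),
        "zstd":    (3.5, 1500),
        "lzma":    (4.2, 80),
    }

    for name, (ratio, speed) in codecs.items():
        size = uncompressed_mb / ratio
        total = size / bandwidth_mb_s + uncompressed_mb / speed
        print(f"{name}: {size:5.0f} MB download, ~{total:4.1f}s total")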
participants (12):
- Antoine Pitrou
- Barry Scott
- Ben Rudiak-Gould
- Christopher Barker
- Daniel Holth
- David Mertz
- Eric V. Smith
- Ma Lin
- Omer Katz
- Paul Moore
- Paul Sokolovsky
- Serhiy Storchaka