[issue733] bz2 decompression is very slow

New submission from Jonas H. <jonas@lophus.org>:

Compared to CPython 2.7, PyPy 1.5 (from the Arch Linux repositories) seems to be ~5 times slower on bz2 decompression. Using this script:

    import sys
    from bz2 import BZ2File

    # Read the archive in 8 KiB chunks until EOF, discarding the data.
    with BZ2File(sys.argv[1]) as f:
        while True:
            if not f.read(8*1024):
                break

to decompress the PyPy lib-python/ directory (60M compressed) takes about 5 seconds on CPython and 20s on PyPy 1.5.

----------
messages: 2569
nosy: jonash, pypy-issue
priority: bug
release: 1.5
status: unread
title: bz2 decompression is very slow

________________________________________
PyPy bug tracker <tracker@bugs.pypy.org>
<https://bugs.pypy.org/issue733>
________________________________________

Jonas H. <jonas@lophus.org> added the comment:

I did some benchmarking and the decompression runtime seems to be much worse than I expected. Here are some stupid decompression benchmarks (script: http://paste.pocoo.org/show/397932):

    $ python bench2-bz2.py
    500000 0.01
    1000000 0.01
    5000000 0.05
    10000000 0.11

    $ pypy bench2-bz2.py
    500000 0.12
    1000000 0.49
    5000000 7.34
    10000000 24.70

The left column is the amount of data that was compressed (in bytes); the right column is the decompression runtime in seconds.

----------
status: unread -> chatting
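The benchmark script behind the paste link is not reproduced in this thread. A minimal sketch of a comparable measurement, assuming the script compresses a buffer of the given size in memory and times only the decompression step, might look like the following. The helper name `time_bz2_decompress` and the choice of input data are illustrative assumptions, not the original code.

```python
import bz2
import os
import time


def time_bz2_decompress(size):
    """Compress `size` bytes in memory, then time decompression only.

    Hypothetical helper: the original bench2-bz2.py is not shown in the thread.
    """
    # Mix random and repetitive bytes so the input compresses, but not trivially.
    chunk = os.urandom(1024) + b"x" * 1024
    data = (chunk * (size // len(chunk) + 1))[:size]
    compressed = bz2.compress(data)

    start = time.time()
    result = bz2.decompress(compressed)
    elapsed = time.time() - start

    # Sanity check: decompression must round-trip the input.
    assert result == data
    return elapsed


if __name__ == "__main__":
    # Same input sizes (in bytes) as the table in the comment above.
    for size in (500000, 1000000, 5000000, 10000000):
        print("%d %.2f" % (size, time_bz2_decompress(size)))
```

Running the same file under both interpreters gives two columns in the same shape as the tables in the comment. Absolute numbers will depend on the machine and on the input data chosen.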

Jonas H. <jonas@lophus.org> added the comment:

The same seems to be true for the gzip module (but bz2 is much worse):

    $ python bench-gz.py
    500000 0.01
    1000000 0.01
    5000000 0.07
    10000000 0.15
    100000000 1.36

    $ pypy bench-gz.py
    500000 0.06
    1000000 0.10
    5000000 0.43
    10000000 0.96
    100000000 7.89

http://paste.pocoo.org/show/397936

Xavier Morel <bugs.pypy.org@masklinn.net> added the comment:

Pasting observations I put in duplicate 770 on the same problem:

Using a clone of PyPy's hg repo (working copy included) as my tar base, decompressing to the filesystem using `tarfile`.

Test archives were created using bsdtar with default options (`tar cjf` and `tar czf`); likewise for tar's decompression baseline (`tar xf` in both cases).

hg id of local PyPy clone is 27df060341f0 tip
OS is OSX 10.6.8

Decompressors tested:

* CPython is Python 2.7.2
* PyPy 1.5 is Python 2.7.1 (?, May 22 2011, 11:59:12) [PyPy 1.5.0-alpha0 with GCC 4.0.1] from macports
* PyPy trunk is Pypy-65b1ed60d7da from nightlies
* Tar is bsdtar 2.6.2 - libarchive 2.6.2

CPython and PyPy were running the exact same script, which can be found at the end of the comment.

All measurements were performed via `time` and are in minutes:seconds; they are the decompression times.

First I tested the behavior for gzipped files, in order to get an idea of what I could expect:

* tar: 0:19
* CPython: 0:31
* PyPy 1.5: 0:47
* PyPy trunk: 0:43

PyPy is ~50% slower than CPython, itself ~60% slower than the native tar.

Then I tested using a bz2-compressed archive:

* tar: 0:54
* CPython: 1:10
* PyPy 1.5: hard crash
* PyPy trunk: 2:58

PyPy trunk is ~150% slower than CPython (about 2.5 times the runtime), which is a significant slowdown.

I believe it might be a source of performance issues when installing bz2-packed modules via pip.

Decompression script:

    import tarfile
    import sys

    tar = tarfile.open(sys.argv[1])
    tar.extractall()
    tar.close()

----------
nosy: +masklinn
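The measurements above were taken externally with `time`. As a rough in-process alternative, the sketch below times `tarfile` extraction for a gzip and a bz2 archive from within a single script. The helper names `time_extract` and `make_sample_archives` and the generated sample files are assumptions for illustration; the thread used a clone of the PyPy repository as input, so absolute numbers will not be comparable.

```python
import os
import shutil
import tarfile
import tempfile
import time


def time_extract(archive_path):
    """Extract `archive_path` into a fresh temp directory; return seconds taken."""
    dest = tempfile.mkdtemp()
    try:
        start = time.time()
        tar = tarfile.open(archive_path)  # compression type is auto-detected
        try:
            tar.extractall(dest)
        finally:
            tar.close()
        return time.time() - start
    finally:
        shutil.rmtree(dest)


def make_sample_archives(workdir, n_files=50):
    """Build small .tar.gz and .tar.bz2 archives from generated stand-in files."""
    src = os.path.join(workdir, "src")
    os.mkdir(src)
    for i in range(n_files):
        with open(os.path.join(src, "file%d.txt" % i), "w") as f:
            f.write(("line %d\n" % i) * 1000)
    paths = {}
    for mode, ext in (("w:gz", "tar.gz"), ("w:bz2", "tar.bz2")):
        path = os.path.join(workdir, "sample." + ext)
        tar = tarfile.open(path, mode)
        try:
            tar.add(src, arcname="src")
        finally:
            tar.close()
        paths[ext] = path
    return paths


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    try:
        for ext, path in sorted(make_sample_archives(workdir).items()):
            print("%s: %.3f s" % (ext, time_extract(path)))
    finally:
        shutil.rmtree(workdir)
```

Timing inside the process leaves out interpreter startup, which `time` includes. For archives that take minutes to unpack the difference is negligible, but for small inputs it can dominate.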

Justin Peel <peelpy@gmail.com> added the comment:

I just thought that I'd post the current results on the tests that jonash was using:

python2.7.1 and bz2:

    500000 0.00
    1000000 0.01
    5000000 0.03
    10000000 0.05
    100000000 0.53

pypy nightly and bz2:

    500000 0.00
    1000000 0.01
    5000000 0.05
    10000000 0.09
    100000000 0.80

python2.7.1 and gzip:

    500000 0.00
    1000000 0.00
    5000000 0.03
    10000000 0.06
    100000000 0.61

pypy nightly and gzip:

    500000 0.01
    1000000 0.02
    5000000 0.13
    10000000 0.24
    100000000 2.06

So things are better for both of them, but gzip in particular is still really struggling in pypy.

----------
nosy: +justinpeel

Alex Gaynor <alex.gaynor@gmail.com> added the comment:

So I just made gzip reading 50% faster with 7cc899d8de19. By my measurements we're still 2x slower than CPython at the large size, though.

----------
nosy: +agaynor

Jonas H. <jonas@lophus.org> added the comment:

Still, `pip install http://bitbucket.org/wkornewald/django-nonrel/get/tip.tar.bz2` takes forever on PyPy (today's nightly: > 10 min) while CPython 2.7 takes ~15 seconds.

Justin Peel <peelpy@gmail.com> added the comment:

I also did that pip install and it only took about 4x longer for me on pypy. This matches my experiments of just untarring (and bunzipping) the file, using the same code that pip uses (namely the tarfile module), which took 3x-6x longer in pypy than in python.

Jonas H. <jonas@lophus.org> added the comment:

Here are some cProfile stats using PyPy and this benchmark script

    import os, tarfile, tempfile
    tarfile.open("x.tar.gz").extractall(tempfile.mkdtemp())

where "x.tar.gz" is created using

    apack x.tar.gz /opt/pypy/lib_pypy/

CPython 2.7.2 takes 0.8 seconds to decompress this whereas PyPy 1.6 takes 4 seconds.

    $ pypy -m cProfile -s time bench.py
             498680 function calls (495247 primitive calls) in 4.027 seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          2/1    0.698    0.349    3.666    3.666 tarfile.py:2025(extractall)
         2436    0.315    0.000    1.341    0.001 shutil.py:45(copyfileobj)
         6003    0.224    0.000    0.963    0.000 tarfile.py:798(read)
         2436    0.218    0.000    1.559    0.001 tarfile.py:259(copyfileobj)
        10207    0.206    0.000    0.206    0.000 {struct.unpack}
        22213    0.203    0.000    0.685    0.000 gzip.py:232(read)
         4311    0.200    0.000    0.200    0.000 {method 'decompress' of 'Decompress' objects}
    [...snip...]

Could we get this issue fixed sooner if I contributed a benchmark case for speed.pypy.org?
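The profile above was collected from the command line with `-m cProfile -s time`. A sketch of the same measurement done programmatically, so the report can be captured as a string and compared across interpreters, could look like this. It assumes a Python 3 interpreter (it uses `io.StringIO` for the report), and the helper name `profile_extract` and the small generated archive are illustrative stand-ins, not part of the original benchmark.

```python
import cProfile
import io
import os
import pstats
import shutil
import tarfile
import tempfile


def profile_extract(archive_path, top=10):
    """Profile tarfile extraction; return a report sorted by internal time."""
    dest = tempfile.mkdtemp()
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(dest)
    finally:
        profiler.disable()
        shutil.rmtree(dest)

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "time" sorts by internal time (tottime), like `-s time` on the command line.
    stats.sort_stats("time").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    try:
        # Small stand-in archive; the thread packed /opt/pypy/lib_pypy/ instead.
        sample = os.path.join(workdir, "sample.txt")
        with open(sample, "w") as f:
            f.write("hello\n" * 10000)
        archive = os.path.join(workdir, "x.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(sample, arcname="sample.txt")
        print(profile_extract(archive))
    finally:
        shutil.rmtree(workdir)
```

Sorting by internal time puts the functions that spend the most time in their own bodies first, which is how the listing above singles out `extractall`, `copyfileobj`, and the gzip `read` path.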

Jonas H. <jonas@lophus.org> added the comment:

Performance of GZip and BZip2 seems to have improved dramatically in PyPy 1.7: it's now almost proportional to CPython's performance (PyPy taking twice as long).

I consider this bug fixed, but you may want to keep it open as a reminder for further optimizations?

Carl Friedrich Bolz <cfbolz@gmx.de> added the comment:

This is fixed.

----------
nosy: +cfbolz
status: chatting -> resolved

participants (5)
- Alex Gaynor
- Carl Friedrich Bolz
- Jonas H.
- Justin Peel
- Xavier Morel