[Python-bugs-list] [ python-Bugs-210648 ] performance-problem decoding quoted-printable (PR#340)
noreply@sourceforge.net
Tue, 21 May 2002 13:34:15 -0700
Bugs item #210648, was opened at 2000-07-31 14:10
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=210648&group_id=5470
Category: Python Library
Group: None
Status: Closed
Resolution: Invalid
Priority: 4
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Barry Warsaw (bwarsaw)
Summary: performance-problem decoding quoted-printable (PR#340)
Initial Comment:
Jitterbug-Id: 340
Submitted-By: Ragnar Kjørstad <ragnark@vestdata.no>
Date: Fri, 26 May 2000 16:51:51 +0200
Version: None
OS: None
Hi
Problem with Python 1.5.1 on Linux 2.2.14aa6.
Decoding quoted-printable files using mimetools.decode is really slow.
The really strange thing is that it appears to work a lot faster on
smaller files!
I put together a little test program that reads blocks from a file and
decodes them individually. (The output will not be 100% correct; the
point was just to test the performance.)
Results show the time it took to decode a 300k file with different
block sizes:
1k: 6.28 s
3k: 6.59 s
10k: 8.57 s
30k: 30.45 s
100k: 127.82 s
300k: 221.67 s
I looked in quopri.decode for clues about the problem, but could not
find any. Is there something _very_ wrong with my reasoning, or is
something wrong here?
--
Ragnar Kjørstad
Attachment: test.py
#!/usr/bin/python
# Usage: test.py <blocksize>
# Reads the quoted-printable file "mail-test" in blocks of the given
# size and decodes each block individually, printing a dot per block.
from mimetools import decode
from StringIO import StringIO
from sys import stdout, argv
from string import atoi

bsize = atoi(argv[1])
output = StringIO()
f = open("mail-test")
s = f.read(bsize)
while s:
    input = StringIO(s)
    decode(input, output, 'quoted-printable')
    s = f.read(bsize)
    stdout.write('.')
    stdout.flush()
stdout.write('done\n')
====================================================================
Audit trail:
Tue Jul 11 08:25:57 2000 guido moved from incoming to open
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2002-05-21 13:34
Message:
Logged In: NO
I've been able to reproduce this using ActivePython 2.1
(build 210) on Windows 2000.
The (apparent) solution is to replace the repeated "new =
new + c" concatenation with a simple "output.write(c)". It seems
kind of silly to write a single character at a time, but it took my
mail program (running through mod_python) from an apparent infinite
loop to decoding a 5MB attachment in seconds.
Thoughts?
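For illustration, a minimal sketch of the two strategies; the function
names and loop structure here are made up, not quoted from quopri.py:

from StringIO import StringIO

def append_chars(line, output):
    # Builds the decoded line with "new = new + c": on the Python
    # versions discussed here every append copies the whole accumulated
    # string, so a line of length n costs O(n**2) character copies.
    new = ''
    for c in line:
        new = new + c
    output.write(new)

def write_chars(line, output):
    # Writes each character straight to the output buffer: O(n) total
    # work, at the price of one method call per character.
    for c in line:
        output.write(c)

out = StringIO()
write_chars('x' * 100000, out)   # fast even for very long "lines"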
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2001-07-23 10:10
Message:
Logged In: NO
I've reproduced this problem. The file that gives the
library fits is a PDF file that reports as quoted-printable.
The file is an actual PDF sent with Microsoft Outlook.
Every line in the quoted-printable body ends in an "=".
It looks like the decode function will keep appending to
a single line if it finds an "=" at the end of the line; this
is the flag that tells quopri.decode that the line is
a partial line.
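To see why such a file degenerates into one huge logical line, here is
a small made-up example (the data is illustrative, not from the PDF in
question):

from StringIO import StringIO
import quopri

# A trailing "=" is the quoted-printable soft line break: the decoder
# strips it and joins the next physical line onto the current logical
# line. A file where *every* line ends in "=" therefore decodes to one
# ever-growing line, the worst case for character-at-a-time appends.
encoded = ("abcdef=\n" * 5) + "end\n"
out = StringIO()
quopri.decode(StringIO(encoded), out)
print repr(out.getvalue())   # 'abcdefabcdefabcdefabcdefabcdefend\n'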
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2000-09-21 18:35
Message:
I marked this Invalid and Closed. There *may* have been a buffering problem in 1.5.2 that got fixed, or it may simply be that the original report used an unrealistic input file (which is what I suspect -- gory details below), or both. But note that the true complaint ("decode is really slow") probably remains valid.
There's *something* still going on here in 2.0b1, at least under Win98SE. Here's the output from a more self-contained version of the program:
C:\Python20>python mtest.py
1024 4.6
2048 6.1
10240 7.9
30720 8.8
102400 8.9
307200 9.4
1024 4.5
2048 5.8
10240 8.3
30720 8.9
102400 9.0
307200 9.4
C:\Python20>
That's with everything in cache, and there's a very clear & reproducible factor-of-2 advantage to using the small blocksize. Here's the program:
BIGFILE = "DLLs/tk83.dll"

def timer(fname, blockseq):
    from mimetools import decode
    from StringIO import StringIO
    from time import clock
    for mult in blockseq:
        start = clock()
        bsize = mult * 1024
        output = StringIO()
        f = open(fname, "rb")
        s = f.read(bsize)
        while s:
            input = StringIO(s)
            decode(input, output, 'quoted-printable')
            s = f.read(bsize)
        f.close()
        del output, f, s, input
        finish = clock()
        print "%10d %10.1f" % (bsize, finish - start)

if __name__ == "__main__":
    timer(BIGFILE, (1, 2, 10, 30, 100, 300) * 2)
Of course, I'm reading in a binary file there, so the test makes no real sense! And I *suspect* the original bug report also used a nonsense file. Why that's important will become clear soon.
If I interrupt the program at random, it's usually on line 72 of quopri.py:
File "C:\PYTHON20\lib\quopri.py", line 72, in decode
new = new + c; i = i+1
That hints at a problem with the quadratic-time nature of repeated appends. Conventional wisdom is that's just theoretical, but it isn't, and the crossover point is highly platform-dependent.
If I replace the call to decode with:
while 1:
    line = input.readline()
    if not line:
        break
    new = ""
    for ch in line:
        pass
    output.write(line)
then there's no apparent dependence on block size. So this is not a problem with buffering, it's specific to decode.
But if I replace the "pass" with
new = new + ch
then I get back a very strong dependence on block size. And that loop is what decode does, except for the useful parts <wink>.
So that explains *my* glitch: the smaller the blocksize, the more "lines" get artificially created, the smaller the longest lines become, and the less the ch-at-a-time append can hurt. But this is all an artifact of me using a binary file for input! That is, it's an artifact of using a test file with nearly arbitrarily long "lines".
If I create a giant 4Mb text file via catenating all the *.html files in the Windows Python doc/lib directory, there is no dependence on blocksize in the test program, other than that it gets a little *faster* the larger the block size. And that's what I'd expect (repeated catenation is linear-time on all platforms I've heard of, so long as the lines are reasonably short).
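A standalone way to reproduce the effect described above, independent
of quopri (a sketch; absolute timings are platform-dependent, and later
interpreters added an in-place concatenation optimization that can hide
the quadratic behavior):

from time import clock
from StringIO import StringIO

def concat(n):
    # Repeated "new = new + ch": quadratic on the interpreters
    # discussed in this report.
    new = ''
    for ch in 'x' * n:
        new = new + ch

def buffered(n):
    # Accumulate via a write buffer instead: linear in n.
    out = StringIO()
    for ch in 'x' * n:
        out.write(ch)

for n in (10000, 50000, 100000):
    for func in (concat, buffered):
        start = clock()
        func(n)
        print "%-8s %7d %6.2f" % (func.__name__, n, clock() - start)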
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2000-09-19 11:11
Message:
Can't reproduce it either. Note he's using Python 1.5.1. Maybe it was a buffering problem in StringIO there - quopri.py is using readline(). Barry, if you don't have another idea, just close it as Invalid.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2000-09-19 09:52
Message:
I can't reproduce the problem. On sparc-sun-solaris2.6, Python 1.5.2 needs the same time for a 330k file, regardless of the block size.
----------------------------------------------------------------------
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=210648&group_id=5470