[Python-bugs-list] [ python-Bugs-210648 ] performance-problem decoding quoted-printable (PR#340)

noreply@sourceforge.net
Tue, 21 May 2002 13:34:15 -0700


Bugs item #210648, was opened at 2000-07-31 14:10
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=210648&group_id=5470

Category: Python Library
Group: None
Status: Closed
Resolution: Invalid
Priority: 4
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Barry Warsaw (bwarsaw)
Summary: performance-problem decoding quoted-printable (PR#340)

Initial Comment:
Jitterbug-Id: 340
Submitted-By: Ragnar Kjørstad <ragnark@vestdata.no>
Date: Fri, 26 May 2000 16:51:51 +0200
Version: None
OS: None


Hi

Problem with Python 1.5.1 on Linux 2.2.14aa6.

Decoding quoted-printable files using mimetools.decode is really slow.
The really strange thing is that it appears to work a lot faster on
smaller files!

I put together a little test program that reads blocks from a file and
decodes them individually. (The output will not be 100% correct; the
point was just to test the performance.)

The results show the time it took to decode a 300k file with different
block sizes:
1k:	6.28 s
3k:	6.59 s
10k:	8.57 s
30k:	30.45 s
100k:	127.82 s
300k:	221.67 s

I looked in quopri.decode for clues about the problem, but could not
find any. Is there something _very_ wrong with my reasoning, or is
something wrong here?



-- 
Ragnar Kjørstad

Content-Disposition: attachment; filename="test.py"

#!/usr/bin/python
# Test program from the report: decode the file "mail-test" as
# quoted-printable in fixed-size blocks, printing a '.' per block.
from mimetools import decode
from StringIO import StringIO
from sys import stdout, argv
from string import atoi

bsize = atoi(argv[1])       # block size in bytes, passed on the command line
output = StringIO()
f = open("mail-test")
s = f.read(bsize)
while s:
    input = StringIO(s)
    decode(input, output, 'quoted-printable')
    s = f.read(bsize)
    stdout.write('.')
    stdout.flush()
stdout.write('done\n')




====================================================================
Audit trail:
Tue Jul 11 08:25:57 2000	guido	moved from incoming to open

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-05-21 13:34

Message:
Logged In: NO 

I've been able to reproduce this using ActivePython 2.1 
(build 210) on Windows 2000.

The (apparent) solution is to replace the repeated concatenation
"new = new + c" with a simple "output.write(c)". It seems kind of
silly to write a single character at a time, but it took my mail
program (running through mod_python) from an apparent infinite loop
to decoding a 5MB attachment in seconds.

Thoughts?
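
(For illustration only: the sketch below is not from quopri.py or from
this tracker. The function names are invented and the per-character
"decoding" is a stand-in for the real quoted-printable logic; it just
contrasts the two inner-loop styles being discussed.)

def decode_line_by_concat(line, output):
    # The pattern quoted from quopri.decode above: build the decoded
    # line by repeated concatenation, then write it once.  Each
    # "new = new + c" copies everything accumulated so far, so a long
    # line costs quadratic time on the interpreters discussed here.
    new = ""
    for c in line:
        new = new + c           # stand-in for decoding one character
    output.write(new)

def decode_line_by_write(line, output):
    # The suggested change: send each decoded character straight to the
    # output file object, so there is no ever-growing temporary string.
    for c in line:
        output.write(c)         # stand-in for the decoded character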

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2001-07-23 10:10

Message:
Logged In: NO 

I've reproduced this problem.  The file that gives the
library fits is a PDF file that reports as quoted-printable.
The file is an actual PDF sent with Microsoft Outlook.
Every line in the quoted-printable data ends in an =.

It looks like the decode function will keep appending to
a single line if it finds an = at the end of the line.  This
is the flag that tells quopri.decode that the line is
a partial line.
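
(Again for illustration only: this helper is invented here, not taken
from quopri.py. It shows roughly how the soft-line-break rule joins
lines, and why input whose every line ends in "=" collapses into one
ever-growing partial line.)

def join_soft_breaks(lines):
    # A line ending in "=" is a quoted-printable soft line break: the
    # "=" is dropped and the line continues on the next physical line.
    # If every line ends in "=", "partial" keeps growing until end of
    # file, and growing it by repeated concatenation is quadratic in
    # the file size.
    logical = []
    partial = ""
    for line in lines:
        line = line.rstrip()
        if line[-1:] == "=":
            partial = partial + line[:-1]   # the append that never ends
        else:
            logical.append(partial + line)
            partial = ""
    if partial:
        logical.append(partial)
    return logical

# e.g. join_soft_breaks(["abc=", "def=", "ghi"]) == ["abcdefghi"]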


----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2000-09-21 18:35

Message:
I marked this Invalid and Closed.  There *may* have been a buffering problem in 1.5.2 that got fixed, or it may simply be that the original report used an unrealistic input file (which is what I suspect -- gory details below), or both.  But note that the true complaint ("decode is really slow") probably remains valid.

There's *something* still going on here in 2.0b1, at least under Win98SE.  Here's the output from a more self-contained version of the program:

C:\Python20>python mtest.py
      1024        4.6
      2048        6.1
     10240        7.9
     30720        8.8
    102400        8.9
    307200        9.4
      1024        4.5
      2048        5.8
     10240        8.3
     30720        8.9
    102400        9.0
    307200        9.4

C:\Python20>

That's with everything in cache, and there's a very clear & reproducible factor-of-2 advantage to using the small blocksize.  Here's the program:

BIGFILE = "DLLs/tk83.dll"

def timer(fname, blockseq):
    from mimetools import decode
    from StringIO import StringIO
    from time import clock
    for mult in blockseq:
        start = clock()
        bsize = mult * 1024
        output = StringIO()
        f = open(fname, "rb")
        s = f.read(bsize)
        while s:
            input = StringIO(s)
            decode(input, output, 'quoted-printable')
            s = f.read(bsize)
        f.close()
        del output, f, s, input
        finish = clock()
        print "%10d %10.1f" % (bsize, finish - start)

if __name__ == "__main__":
    timer(BIGFILE, (1, 2, 10, 30, 100, 300) * 2)

Of course, I'm reading in a binary file there, so the test makes no real sense!  And I *suspect* the original bug report also used a nonsense file.  Why that's important will become clear soon.

If I interrupt the program at random, it's usually on line 72 of quopri.py:

  File "C:\PYTHON20\lib\quopri.py", line 72, in decode
    new = new + c; i = i+1

That hints at a problem with the quadratic-time nature of repeated appends.  Conventional wisdom is that's just theoretical, but it isn't, and the crossover point is highly platform-dependent.

If I replace the call to decode with:

            while 1:
                line = input.readline()
                if not line:
                    break
                new = ""
                for ch in line:
                    pass
                output.write(line)

then there's no apparent dependence on block size.  So this is not a problem with buffering, it's specific to decode.

But if I replace the "pass" with

new = new + ch

then I get back a very strong dependence on block size.  And that loop is what decode does, except for the useful parts <wink>.

So that explains *my* glitch:  the smaller the blocksize, the more "lines" get artificially created, the smaller the longest lines become, and the less the ch-at-a-time append can hurt.  But this is all an artifact of me using a binary file for input!  That is, it's an artifact of using a test file with nearly arbitrarily long "lines".

If I create a giant 4Mb text file via catenating all the *.html files in the Windows Python doc/lib directory, there is no dependence on blocksize in the test program, other than that it gets a little *faster* the larger the block size.  And that's what I'd expect (repeated catenation is linear-time on all platforms I've heard of, so long as the lines are reasonably short).
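
(Illustrative timing sketch, not part of the comment above, written in
the 2.x style of the thread: one artificially long "line" built by
repeated concatenation versus collected in a list and joined once.)

from time import clock

LINE_LENGTH = 100 * 1024    # one artificially long "line", as in the binary-file test

def by_concat(n):
    # Quadratic on the interpreters of this era: each append copies the
    # whole string built so far.
    new = ""
    for i in xrange(n):
        new = new + "x"
    return new

def by_join(n):
    # Linear: collect the pieces and join them once at the end.
    pieces = []
    for i in xrange(n):
        pieces.append("x")
    return "".join(pieces)

if __name__ == "__main__":
    for func in (by_concat, by_join):
        start = clock()
        func(LINE_LENGTH)
        print func.__name__, "%.2f seconds" % (clock() - start)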


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2000-09-19 11:11

Message:
Can't reproduce it either. Note he's using Python 1.5.1. Maybe it was a buffering problem in StringIO there - quopri.py is using readline(). Barry, if you don't have another idea, just close it as Invalid.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2000-09-19 09:52

Message:
I can't reproduce the problem. On sparc-sun-solaris2.6, Python 1.5.2 needs the same time for a 330k file, regardless of the block size.

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=210648&group_id=5470