An Un-optimization anecdote with text-processing

Sat Jun 19 14:05:16 EDT 1999

Chad Netzer <chad at vision.arc.nasa.gov> writes:

> Well, I was in the process of starting to write a Perl program
> for a simple text processing task I have to do, when this post
> steered me back to my language of choice.  As I wrote my script,
> which is called by the Unix "find" command, I got to thinking
> about how it was anti-optimized...  Here is what I wrote:
> 
> import string
> import sys
> 
> # Way stupid algorithm, but I'm running it overnight :)
> filename = sys.argv[1]
> f = open(filename, 'r')
> text = f.read()
> while 1:
>     i = string.find(text, r'/"')
>     if i == -1: break
>     text = text[:i+1] + "index.html" + text[i+1:]
> 
> f.close()
> f = open(filename, 'w')
> f.write(text)
> f.close()
>
> 
> 
> As you can see, the core loop scans the entire file for a string
> match, builds a new string by concatenation, then starts scanning from
> the beginning yet again.  It almost doesn't seem like it could be made
> much worse, unless one deliberately tried.
> 
> Which brings me to my new game...  Rather than finding a nice hack
> that would speed things up and use less lines of code, how about
> finding the hack that uses the least amount of code to make this as
> slow as possible?  Adding pointless operations, or other obfuscations
> don't count.  Try to make this operation as elegantly short, and as
> painfully slow as possible...
> 
> For example:
> 
> I had originally coded this with string.index() using a try/except
> pair to break out of the loop, which took more lines of code but is
> probably faster than my loop (it avoids the 'if' statement in the
> loop, for the overhead of one exception).  So I changed it to be
> shorter and (possibly) slower.  Can anyone think of other elegant
> unoptimizations?

My first attempt at a shortest implementation for the string replacement
part is this:

Python 1.5.2 (#3, Apr 21 1999, 16:51:54)  [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> from string import * 
>>> s = ' "www.python.org/" xyzzy/plugh "fee/fie/foe/foo/" '
>>> join(split(s, '/"'), '/index.html"')
' "www.python.org/index.html" xyzzy/plugh "fee/fie/foe/foo/index.html" '
>>> 

But I fear it's also one of the fastest...

Or how about:

>>> reduce(lambda x,y: x[-1] == '/' and y == '"' and x+'index.html"' or x+y, s)
' "www.python.org/index.html" xyzzy/plugh "fee/fie/foe/foo/index.html" '

you don't even need the string module. This is veerrryyyy
sssssllllllooooooowwwwwwww because the result string is copied over and
over again and you call a python function for every char.

To give you an impression of the speed difference:

>>> def test1(s):
..     reduce(lambda x,y: x[-1] == '/' and y == '"' and x+'index.html"' or x+y, s)
.. 
>>> def test2(s):
..     join(split(s, '/"'), '/index.html"')
.. 
>>> import time 
>>> from string import *
>>> s = open("data.html").read()
>>> t = time.clock(); test1(s); print time.clock() - t 
69.49
>>> t = time.clock(); test2(s); print time.clock() - t
0.01
>>> 

-- 
Bernhard Herzog	  | Sketch, a python based drawing program
herzog at online.de  | http://www.online.de/home/sketch/