PATCH: Speed up direct string concatenation by 20+%!
Colin J. Williams
cjw at sympatico.ca
Sun Oct 1 09:45:44 EDT 2006
Congratulations on the clear way in which you have set out your proposal.
I hope that it will make its way to PEPdom.
Colin W.
Larry Hastings wrote:
> This is such a long posting that I've broken it out into sections.
> Note that while developing this patch I discovered a Subtle Bug
> in CPython, which I have discussed in its own section below.
>
> ____________
> THE OVERVIEW
>
> I don't remember where I picked it up, but I remember reading years
> ago that the simple, obvious Python approach for string concatenation:
> x = "a" + "b"
> is slow. Obviously it's fine for the simple case above, but if you're
> adding together twenty-five things:
> x = a + b + c + d + ... + y
> the performance is terrible, as the interpreter creates temporary
> objects
> for each intermediate result. (a + b, then (a + b) + c,
> then ((a + b) + c) + d, and so on.)
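>
> (To see why, here's a rough C analogy of what the interpreter ends
> up doing; illustrative only, not actual interpreter code. Every +
> allocates a temporary holding the entire prefix so far, so joining
> n pieces copies O(n^2) bytes.)
>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /* Allocate a fresh buffer holding a followed by b. */
>     static char *concat_pair(const char *a, const char *b)
>     {
>         size_t la = strlen(a), lb = strlen(b);
>         char *r = malloc(la + lb + 1);
>         memcpy(r, a, la);
>         memcpy(r + la, b, lb + 1);   /* +1 copies the NUL too */
>         return r;
>     }
>
>     /* Left-to-right concatenation, one temporary per step. */
>     char *concat_all(const char **pieces, size_t n)
>     {
>         char *acc = concat_pair("", "");
>         size_t i;
>         for (i = 0; i < n; i++) {
>             char *tmp = concat_pair(acc, pieces[i]); /* temporary */
>             free(acc);               /* intermediate thrown away */
>             acc = tmp;
>         }
>         return acc;                  /* caller frees */
>     }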
>
> The suggested idiom for fast string concatenation was this:
>     x = "".join([a, b, c, d, ... y])
> This works fine. But I don't like this approach, not only because
> it's kind of ugly and non-obvious, but because it violates the
> "one obvious way to do it" principle.
>
> I mentioned all this to a friend of mine (Mike Thornburgh), and told
> him matter-of-factly that I'd like to fix this, but the design of
> Python simply precluded it. He said "no it doesn't!" and proceeded
> to describe in great detail an approach to permit fast string
> concatenation using +. I have implemented what he suggested--with
> only slight changes--and present it to you below. I only had to
> touch four files, and only did major surgery on one. The resulting
> Python passes the test suite as well as my unmodified locally-built
> Python 2.5. (I didn't install the external source trees for stuff
> like SQLite, so it fails those tests.)
>
> Note that times have changed; using Python 2.5, a + b + c isn't
> *that* much slower than "".join([]). But it is still slower.
> Or, was--with my patch, a + b + c becomes *the fastest method*
> for string concatenation. Hooray!
>
> (It didn't pay off as *much* as I'd hoped, but it's still an
> improvement.)
>
> _________
> THE PATCH
>
> The core concept: adding two strings together no longer returns a
> pure "string" object. Instead, it returns a "string concatenation"
> object which holds references to the two strings but does not
> actually concatenate them... yet. The strings are concatenated only
> when someone requests the string's value, at which point it
> allocates all the space it needs and renders the concatenated
> string all at once.
>
> More to the point, if you add multiple strings together (a + b + c),
> it *doesn't* compute the intermediate strings (a + b).
>
> Upsides to this approach:
>
>     * String concatenation using + is now the fastest way to
>       concatenate strings (that I know of).
>
>     * Throw off the shackles of "".join([]); you don't need it
>       anymore.
>
>     * Did I mention it was faster?
>
> Downsides to this approach:
>
>     * Changes how PyStringObjects are stored internally; ob_sval is
>       no longer a char[1], but a char *. This makes each
>       StringObject four bytes larger.
>
>     * Adds another memory dereference in order to get the value of
>       a string, which is a teensy-weensy slowdown.
>
>     * Would force a recompile of all C modules that deal directly
>       with string objects (which I imagine is most of them).
>
>     * Also, *requires* that C modules use the PyString_AS_STRING()
>       macro, rather than casting the object and grabbing ob_sval
>       directly. (I was pleased to see that the Python source
>       was very good about using this macro; if all Python C
>       modules are this well-behaved, this point is happily moot.)
>       There's a sketch of how the macro might change after this
>       list.
>
>     * On a related note, the file Mac/Modules/MacOS.c implies
>       that there are Mac-specific Python scripts that peer
>       directly into string objects. These would have to be
>       changed to understand the new semantics.
>
>     * String concatenation objects are 36 bytes larger than
>       string objects, and this space will often go unreclaimed
>       after the string is rendered.
>
>     * When rendered, string concatenation objects storing long
>       strings will allocate a second buffer from the heap to
>       store the string. So this adds some minor allocation
>       overhead (though this is offset by the speed gain from
>       the approach overall).
>
>     * Will definitely need some heavy review before it could
>       go in; in particular I worry I got the semantics surrounding
>       "interned" strings wrong.
>
> Internally, a "string concatenation" object is the same as a "string"
> object with two extra fields: an array of references to string objects
> (and string concatenation objects), and a count. Also, ob_sstate now
> stores a third flag, indicating whether or not the string is a string
> concatenation object. External users don't need to worry about these
> details, they just use PyString_AS_STRING() and all is well with their
> world.
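>
> In C terms, the layout might look something like this sketch (field
> and constant names are my guesses based on the description, not
> necessarily what the patch uses; a plain 2.5 string really is
> PyObject_VAR_HEAD plus ob_shash, ob_sstate, and ob_sval):
>
>     #define CONCAT_SLOTS 8      /* assumed; matches the eight-wide
>                                    trees drawn below */
>
>     typedef struct _concatstring {
>         PyObject_VAR_HEAD       /* includes ob_size: total length */
>         long ob_shash;
>         int  ob_sstate;         /* interned state, plus the new
>                                    "is a concatenation" flag */
>         char *ob_sval;          /* was char[1]; NULL until the
>                                    string is rendered */
>         /* the two extra fields, present on concatenation objects: */
>         struct _concatstring *ob_targets[CONCAT_SLOTS];
>                                 /* children: plain strings or other
>                                    concatenation objects (they share
>                                    this leading layout) */
>         int  ob_ntargets;       /* count of slots in use */
>     } PyConcatStringObject;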
>
> I've already added some optimizations to the approach:
>
>     * If you're adding a string to an existing string concatenation
>       object whose reference count is 1, which isn't full, and which
>       hasn't been rendered yet, it adds the string directly to the
>       existing concatenation object (rather than creating a new
>       one). This works whether the existing object is on the right
>       or the left side.
>
>     * The general case is a + b + c ..., where you're only adding
>       strings to the *end*. This will build a degenerate tree where
>       the first slot points to the child. The tree for concatenating
>       the first twenty-two letters of the alphabet would look like
>       this:
>
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> | v v v v v v v
> | p q r s t u v
> |
> v
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> | v v v v v v v
> | i j k l m n o
> |
> v
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> v v v v v v v v
> a b c d e f g h
>
>     * The rendering function is by necessity recursive. However,
>       it is optimized for this shape; when the tree looks like this,
>       it will be rendered purely iteratively--no recursion. (There's
>       a sketch of this after the list.)
>
>     * If, after rendering, the resulting string will be small enough
>       to fit inside the extra space normally used by the
>       concatenation object, the final string will be copied on top
>       of this otherwise now-wasted space.
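>
> A sketch of that shape-optimized renderer, reconstructed from the
> description above (is_concat() and concat_length() stand in for
> whatever flag test and length helper the patch really uses, and it
> assumes the PyConcatStringObject sketch from earlier):
>
>     #include <string.h>   /* memcpy */
>
>     /* Fill buf (total bytes) with the tree's contents.  In the
>      * degenerate a+b+c+... shape only slot 0 is a concatenation,
>      * so copy slots ntargets-1..1 from the back of the buffer and
>      * then step down into slot 0 in a plain loop; recursion is
>      * needed only for bushy trees. */
>     static char *render(PyConcatStringObject *s, char *buf,
>                         size_t total)
>     {
>         char *end = buf + total;
>         while (s != NULL && is_concat(s)) {
>             int i;
>             for (i = s->ob_ntargets - 1; i >= 1; i--) {
>                 PyConcatStringObject *child = s->ob_targets[i];
>                 if (is_concat(child)) {
>                     size_t len = concat_length(child);
>                     end -= len;
>                     render(child, end, len);   /* rare: bushy tree */
>                 } else {
>                     end -= child->ob_size;
>                     memcpy(end, child->ob_sval, child->ob_size);
>                 }
>             }
>             s = s->ob_targets[0];   /* descend without recursing */
>         }
>         if (s != NULL) {            /* plain string ends the chain */
>             end -= s->ob_size;
>             memcpy(end, s->ob_sval, s->ob_size);
>         }
>         return buf;
>     }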
>
> ______________
> THE BENCHMARKS
>
> Benchmark 1:
>     def add(a, b, c, ... t): return a + b + c + ... + t
>     for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
> Benchmark 2:
>     for i in range(10000000):
>         x = "aaa" + "bbb" + "ccc" + ... + "ttt"
> Benchmark 3:
>     for i in range(10000000):
>         x = "".join(["aaa", "bbb", "ccc", ..., "ttt"])
> Benchmark 4:
>     for i in range(10000000):
>         x = "aaa"
>         x += "bbb"
>         x += "ccc"
>         ...
>         x += "ttt"
> Benchmark 5:
>     for i in range(10000000):
>         array = []
>         array.append("aaa")
>         array.append("bbb")
>         array.append("ccc")
>         ...
>         array.append("ttt")
>         x = "".join(array)
> Benchmark 6:
>     for i in range(10000000):
>         x = cStringIO.StringIO()
>         x.write("aaa")
>         x.write("bbb")
>         x.write("ccc")
>         ...
>         x.write("ttt")
>
> Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | % improvement
> ----------+-------------+-------------------+--------------+--------------
>     1     |    32.1s    |       32.8s       |    26.7s     |     22.8%
>     2     |    18.7s    |       18.7s       |    15.1s     |     23.8%
>     3     |    16.1s    |       16.2s       |    16.6s     |     -1.1%
>     4     |    64.6s    |       64.0s       |    57.1s     |     12.1%
>     5     |    74.1s    |       79.4s       |    76.7s     |      3.5%
>     6     |   126.0s    |   too slow, didn't bother!
>
> ______________
> THE SUBTLE BUG
>
> While developing this patch, I discovered a subtle bug in Python.
> It happened only on one gruesome test in the middle of test_descr.py.
> The problem was: '' == '' was returning False.
>
> Starting on line 1116 of stringobject.c we have this:
>
>     if (op == Py_EQ) {
>         /* Supporting Py_NE here as well does not save
>            much time, since Py_NE is rarely used.  */
>         if (a->ob_size == b->ob_size
>             && (a->ob_sval[0] == b->ob_sval[0]
>             && memcmp(a->ob_sval, b->ob_sval, a->ob_size) == 0)) {
>             result = Py_True;
>         } else {
>             result = Py_False;
>         }
>         goto out;
>     }
>
> The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
> apparently a speed optimization: check that the first byte is the
> same before proceeding with the memcmp(). But it does this even
> when *both* strings are *zero length*! I doubt the Python source
> tree has an official position on what the first byte of a
> zero-length string should be, but I assert that it should be
> undefined. In practice, it's seemingly always set to the same value
> in the released Python, but with my changes it's now possible for
> them to be different.
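>
> Here's a standalone toy (mine, not CPython code) showing how the
> shortcut misfires once two zero-length strings can have different
> bytes at offset 0:
>
>     #include <stdio.h>
>     #include <string.h>
>
>     int main(void)
>     {
>         const char a[1] = { 'X' };  /* length-0 string; byte 0 is junk */
>         const char b[1] = { 'Y' };  /* length-0 string; other junk */
>         size_t n = 0;               /* both lengths are zero */
>
>         /* the buggy shortcut peeks at byte 0 even when n == 0: */
>         int buggy = (a[0] == b[0]) && memcmp(a, b, n) == 0;
>
>         /* the fix skips the peek for zero-length strings: */
>         int fixed = (n == 0 || a[0] == b[0]) && memcmp(a, b, n) == 0;
>
>         printf("buggy says equal: %d\n", buggy);  /* prints 0: wrong */
>         printf("fixed says equal: %d\n", fixed);  /* prints 1: right */
>         return 0;
>     }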
>
> I think it's sinful to compare the first bytes of zero-length strings.
> Therefore I suggest this bugfix:
>
>     if (a->ob_size == b->ob_size
>         /* >> change >> */
>         && ((!a->ob_size || a->ob_sval[0] == b->ob_sval[0])
>         && memcmp(a->ob_sval, b->ob_sval, a->ob_size) == 0)) {
> I've already incorporated this fix in my code.
>
> __________
> THE FUTURE
>
> If this patch is accepted, it paves the way for another cheap
> optimization: string slices could be done without copying.
> Add another field into the PyString structure: a reference
> to another PyString *. If non-NULL, it means the local ob_sval
> actually points into the ob_sval owned by that other PyString *.
> It'd be another four bytes, but we'd only need to incur that cost
> when creating a slice; we could set a bit in ob_sstate indicating
> "this is a slice", in which case we'd copy-on-write ob_sval and
> free the reference when the object was destroyed.
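>
> As a sketch of that structure (names again my own, not the patch's):
>
>     /* The lazy-slice idea: a string whose bytes live inside some
>      * other string's buffer until an operation forces a private
>      * copy. */
>     typedef struct _lazystring {
>         PyObject_VAR_HEAD           /* ob_size: length of the slice */
>         long ob_shash;
>         int  ob_sstate;             /* would gain an "is a slice" bit */
>         char *ob_sval;              /* points into ob_sbase's buffer */
>         struct _lazystring *ob_sbase;   /* NULL, or the string that
>                                            owns the bytes; we hold a
>                                            reference as long as we
>                                            point into it */
>     } PyLazyStringObject;
>
> Taking a slice would then just be ob_sval = base->ob_sval + start,
> ob_size = stop - start, plus an incref of the base; no copy happens
> up front, and a later request for the slice's own NUL-terminated
> buffer would trigger the copy-on-write.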
>
> ______________
> THE SUBMISSION
>
> I don't know the protocol from this point on; should I email the
> patch somewhere? Put it up on a web page? Post some diff output?
> (Or just forget about it entirely?)
>
> Also, for what it's worth: I fixed the .dsp so pythoncore builds
> under VC6 again. I would be happy to submit that separately.
> (What can I say, old habits die hard. I can't stand the newer
> Microsoft IDEs.)
>
>
> Cheers,
>
>
> /larry/
>
> p.s. No, I haven't looked at a Unicode port yet. If there's
> interest in this patch at all, I might take a stab at it.
>