PATCH: Speed up direct string concatenation by 20+%!
Colin J. Williams
cjw at sympatico.ca
Sun Oct 1 09:45:44 EDT 2006
Congratulations on the clear way in which you have set out your proposal.
I hope that it will make its way to PEPdom.
Colin W.
Larry Hastings wrote:
> This is such a long posting that I've broken it out into sections.
> Note that while developing this patch I discovered a Subtle Bug
> in CPython, which I have discussed in its own section below.
>
> ____________
> THE OVERVIEW
>
> I don't remember where I picked it up, but I remember reading years
> ago that the simple, obvious Python approach for string concatenation:
> x = "a" + "b"
> is slow. Obviously it's fine for the simple case above, but if you're
> adding together twenty-five things:
> x = a + b + c + d + ... + y
> the performance is terrible, as the interpreter creates temporary
> objects
> for each intermediate result. (a + b, then (a + b) + c,
> then ((a + b) + c) + d, and so on.)
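>
> (To see why, here's a rough C analogy of what the interpreter ends
> up doing; illustrative only, not actual interpreter code. Every +
> allocates a temporary holding the entire prefix so far, so joining
> n pieces copies O(n^2) bytes.)
>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /* Allocate a fresh buffer holding a followed by b. */
>     static char *concat_pair(const char *a, const char *b)
>     {
>         size_t la = strlen(a), lb = strlen(b);
>         char *r = malloc(la + lb + 1);
>         memcpy(r, a, la);
>         memcpy(r + la, b, lb + 1);   /* +1 copies the NUL too */
>         return r;
>     }
>
>     /* Left-to-right concatenation, one temporary per step. */
>     char *concat_all(const char **pieces, size_t n)
>     {
>         char *acc = concat_pair("", "");
>         size_t i;
>         for (i = 0; i < n; i++) {
>             char *tmp = concat_pair(acc, pieces[i]); /* temporary */
>             free(acc);               /* intermediate thrown away */
>             acc = tmp;
>         }
>         return acc;                  /* caller frees */
>     }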
>
> The suggested idiom for fast string concatenation was this:
>     x = "".join([a, b, c, d, ... y])
> This works fine. But I don't like this approach, not only because
> it's kind of ugly and non-obvious, but because it violates the
> "one obvious way to do it" principle.
>
> I mentioned all this to a friend of mine (Mike Thornburgh), and told
> him matter-of-factly that I'd like to fix this, but the design of
> Python simply precluded it. He said "no it doesn't!" and proceeded
> to describe in great detail an approach to permit fast string
> concatenation using +. I have implemented what he suggested--with
> only slight changes--and present it to you below. I only had to
> touch four files, and only did major surgery on one. The resulting
> Python passes the test suite as well as my unmodified locally-built
> Python 2.5. (I didn't install the external source trees for stuff
> like SQLite, so it fails those tests.)
>
> Note that times have changed; using Python 2.5, a + b + c isn't
> *that* much slower than "".join([]). But it is still slower.
> Or, was--with my patch, a + b + c becomes *the fastest method*
> for string concatenation. Hooray!
>
> (It didn't pay off as *much* as I'd hoped, but it's still an
> improvement.)
>
> _________
> THE PATCH
>
> The core concept: adding two strings together no longer returns a
> pure "string" object. Instead, it returns a "string concatenation"
> object which holds references to the two strings but does not
> actually concatenate them... yet. The strings are concatenated only
> when someone requests the string's value, at which point it
> allocates all the space it needs and renders the concatenated
> string all at once.
>
> More to the point, if you add multiple strings together (a + b + c),
> it *doesn't* compute the intermediate strings (a + b).
>
> Upsides to this approach:
>
>     * String concatenation using + is now the fastest way to
>       concatenate strings (that I know of).
>
>     * Throw off the shackles of "".join([]); you don't need it
>       anymore.
>
>     * Did I mention it was faster?
>
> Downsides to this approach:
>
>     * Changes how PyStringObjects are stored internally; ob_sval is
>       no longer a char[1], but a char *. This makes each
>       StringObject four bytes larger.
>
>     * Adds another memory dereference in order to get the value of
>       a string, which is a teensy-weensy slowdown.
>
>     * Would force a recompile of all C modules that deal directly
>       with string objects (which I imagine is most of them).
>
>     * Also, *requires* that C modules use the PyString_AS_STRING()
>       macro, rather than casting the object and grabbing ob_sval
>       directly. (I was pleased to see that the Python source
>       was very good about using this macro; if all Python C
>       modules are this well-behaved, this point is happily moot.)
>       There's a sketch of how the macro might change after this
>       list.
>
>     * On a related note, the file Mac/Modules/MacOS.c implies
>       that there are Mac-specific Python scripts that peer
>       directly into string objects. These would have to be
>       changed to understand the new semantics.
>
>     * String concatenation objects are 36 bytes larger than
>       string objects, and this space will often go unreclaimed
>       after the string is rendered.
>
>     * When rendered, string concatenation objects storing long
>       strings will allocate a second buffer from the heap to
>       store the string. So this adds some minor allocation
>       overhead (though this is offset by the speed gain from
>       the approach overall).
>
>     * Will definitely need some heavy review before it could
>       go in; in particular I worry I got the semantics surrounding
>       "interned" strings wrong.
>
> Internally, a "string concatenation" object is the same as a "string"
> object with two extra fields: an array of references to string objects
> (and string concatenation objects), and a count. Also, ob_sstate now
> stores a third flag, indicating whether or not the string is a string
> concatenation object. External users don't need to worry about these
> details, they just use PyString_AS_STRING() and all is well with their
> world.
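>
> In C terms, the layout might look something like this sketch (field
> and constant names are my guesses based on the description, not
> necessarily what the patch uses; a plain 2.5 string really is
> PyObject_VAR_HEAD plus ob_shash, ob_sstate, and ob_sval):
>
>     #define CONCAT_SLOTS 8      /* assumed; matches the eight-wide
>                                    trees drawn below */
>
>     typedef struct _concatstring {
>         PyObject_VAR_HEAD       /* includes ob_size: total length */
>         long ob_shash;
>         int  ob_sstate;         /* interned state, plus the new
>                                    "is a concatenation" flag */
>         char *ob_sval;          /* was char[1]; NULL until the
>                                    string is rendered */
>         /* the two extra fields, present on concatenation objects: */
>         struct _concatstring *ob_targets[CONCAT_SLOTS];
>                                 /* children: plain strings or other
>                                    concatenation objects (they share
>                                    this leading layout) */
>         int  ob_ntargets;       /* count of slots in use */
>     } PyConcatStringObject;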
>
> I've already added some optimizations to the approach:
>
>     * If you're adding a string to an existing string concatenation
>       object whose reference count is 1, which isn't full, and which
>       hasn't been rendered yet, it adds the string directly to the
>       existing concatenation object (rather than creating a new
>       one). This works whether the existing object is on the right
>       or the left side.
>
>     * The general case is a + b + c ..., where you're only adding
>       strings to the *end*. This will build a degenerate tree where
>       the first slot points to the child. The tree for concatenating
>       the first twenty-two letters of the alphabet would look like
>       this:
>
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> | v v v v v v v
> | p q r s t u v
> |
> v
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> | v v v v v v v
> | i j k l m n o
> |
> v
> [0 1 2 3 4 5 6 7]
> | | | | | | | |
> v v v v v v v v
> a b c d e f g h
>
>     * The rendering function is by necessity recursive. However,
>       it is optimized for this shape; when the tree looks like this,
>       it will be rendered purely iteratively--no recursion. (There's
>       a sketch of this after the list.)
>
>     * If, after rendering, the resulting string will be small enough
>       to fit inside the extra space normally used by the
>       concatenation object, the final string will be copied on top
>       of this otherwise now-wasted space.
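>
> A sketch of that shape-optimized renderer, reconstructed from the
> description above (is_concat() and concat_length() stand in for
> whatever flag test and length helper the patch really uses, and it
> assumes the PyConcatStringObject sketch from earlier):
>
>     #include <string.h>   /* memcpy */
>
>     /* Fill buf (total bytes) with the tree's contents.  In the
>      * degenerate a+b+c+... shape only slot 0 is a concatenation,
>      * so copy slots ntargets-1..1 from the back of the buffer and
>      * then step down into slot 0 in a plain loop; recursion is
>      * needed only for bushy trees. */
>     static char *render(PyConcatStringObject *s, char *buf,
>                         size_t total)
>     {
>         char *end = buf + total;
>         while (s != NULL && is_concat(s)) {
>             int i;
>             for (i = s->ob_ntargets - 1; i >= 1; i--) {
>                 PyConcatStringObject *child = s->ob_targets[i];
>                 if (is_concat(child)) {
>                     size_t len = concat_length(child);
>                     end -= len;
>                     render(child, end, len);   /* rare: bushy tree */
>                 } else {
>                     end -= child->ob_size;
>                     memcpy(end, child->ob_sval, child->ob_size);
>                 }
>             }
>             s = s->ob_targets[0];   /* descend without recursing */
>         }
>         if (s != NULL) {            /* plain string ends the chain */
>             end -= s->ob_size;
>             memcpy(end, s->ob_sval, s->ob_size);
>         }
>         return buf;
>     }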
>
> ______________
> THE BENCHMARKS
>
> Benchmark 1:
>     def add(a, b, c, ... t): return a + b + c + ... + t
>     for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
> Benchmark 2:
>     for i in range(10000000):
>         x = "aaa" + "bbb" + "ccc" + ... + "ttt"
> Benchmark 3:
>     for i in range(10000000):
>         x = "".join(["aaa", "bbb", "ccc", ..., "ttt"])
> Benchmark 4:
>     for i in range(10000000):
>         x = "aaa"
>         x += "bbb"
>         x += "ccc"
>         ...
>         x += "ttt"
> Benchmark 5:
>     for i in range(10000000):
>         array = []
>         array.append("aaa")
>         array.append("bbb")
>         array.append("ccc")
>         ...
>         array.append("ttt")
>         x = "".join(array)
> Benchmark 6:
>     for i in range(10000000):
>         x = cStringIO.StringIO()
>         x.write("aaa")
>         x.write("bbb")
>         x.write("ccc")
>         ...
>         x.write("ttt")
>
> Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | % improvement
> ----------+-------------+-------------------+--------------+--------------
>     1     |    32.1s    |       32.8s       |    26.7s     |     22.8%
>     2     |    18.7s    |       18.7s       |    15.1s     |     23.8%
>     3     |    16.1s    |       16.2s       |    16.6s     |     -1.1%
>     4     |    64.6s    |       64.0s       |    57.1s     |     12.1%
>     5     |    74.1s    |       79.4s       |    76.7s     |      3.5%
>     6     |   126.0s    |   too slow, didn't bother!
>
> ______________
> THE SUBTLE BUG
>
> While developing this patch, I discovered a subtle bug in Python.
> It happened only on one gruesome test in the middle of test_descr.py.
> The problem was: '' == '' was returning False.
>
> Starting on line 1116 of stringobject.c we have this:
>
>     if (op == Py_EQ) {
>         /* Supporting Py_NE here as well does not save
>            much time, since Py_NE is rarely used.  */
>         if (a->ob_size == b->ob_size
>             && (a->ob_sval[0] == b->ob_sval[0]
>             && memcmp(a->ob_sval, b->ob_sval, a->ob_size) == 0)) {
>             result = Py_True;
>         } else {
>             result = Py_False;
>         }
>         goto out;
>     }
>
> The bug is with the "a->ob_sval[0] == b->ob_sval[0]". This is
> apparently a speed optimization: check that the first byte is the
> same before proceeding with the memcmp(). But it does this even
> when *both* strings are *zero length*! I doubt the Python source
> tree has an official position on what the first byte of a
> zero-length string should be, but I assert that it should be
> undefined. In practice, it's seemingly always set to the same value
> in the released Python, but with my changes it's now possible for
> them to be different.
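>
> Here's a standalone toy (mine, not CPython code) showing how the
> shortcut misfires once two zero-length strings can have different
> bytes at offset 0:
>
>     #include <stdio.h>
>     #include <string.h>
>
>     int main(void)
>     {
>         const char a[1] = { 'X' };  /* length-0 string; byte 0 is junk */
>         const char b[1] = { 'Y' };  /* length-0 string; other junk */
>         size_t n = 0;               /* both lengths are zero */
>
>         /* the buggy shortcut peeks at byte 0 even when n == 0: */
>         int buggy = (a[0] == b[0]) && memcmp(a, b, n) == 0;
>
>         /* the fix skips the peek for zero-length strings: */
>         int fixed = (n == 0 || a[0] == b[0]) && memcmp(a, b, n) == 0;
>
>         printf("buggy says equal: %d\n", buggy);  /* prints 0: wrong */
>         printf("fixed says equal: %d\n", fixed);  /* prints 1: right */
>         return 0;
>     }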
>
> I think it's sinful to compare the first bytes of zero-length strings.
> Therefore I suggest this bugfix:
>
>     if (a->ob_size == b->ob_size
>         /* >> change >> */
>         && ((!a->ob_size || a->ob_sval[0] == b->ob_sval[0])
>         && memcmp(a->ob_sval, b->ob_sval, a->ob_size) == 0)) {
> I've already incorporated this fix in my code.
>
> __________
> THE FUTURE
>
> If this patch is accepted, it paves the way for another cheap
> optimization: string slices could be done without copying.
> Add another field into the PyString structure: a reference
> to another PyString *. If non-NULL, it means the local ob_sval
> actually points into the ob_sval owned by that other PyString *.
> It'd be another four bytes, but we'd only need to incur that cost
> when creating a slice; we could set a bit in ob_sstate indicating
> "this is a slice", in which case we'd copy-on-write ob_sval and
> free the reference when the object was destroyed.
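>
> As a sketch of that structure (names again my own, not the patch's):
>
>     /* The lazy-slice idea: a string whose bytes live inside some
>      * other string's buffer until an operation forces a private
>      * copy. */
>     typedef struct _lazystring {
>         PyObject_VAR_HEAD           /* ob_size: length of the slice */
>         long ob_shash;
>         int  ob_sstate;             /* would gain an "is a slice" bit */
>         char *ob_sval;              /* points into ob_sbase's buffer */
>         struct _lazystring *ob_sbase;   /* NULL, or the string that
>                                            owns the bytes; we hold a
>                                            reference as long as we
>                                            point into it */
>     } PyLazyStringObject;
>
> Taking a slice would then just be ob_sval = base->ob_sval + start,
> ob_size = stop - start, plus an incref of the base; no copy happens
> up front, and a later request for the slice's own NUL-terminated
> buffer would trigger the copy-on-write.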
>
> ______________
> THE SUBMISSION
>
> I don't know the protocol from this point on; should I email the
> patch somewhere? Put it up on a web page? Post some diff output?
> (Or just forget about it entirely?)
>
> Also, for what it's worth: I fixed the .dsp so pythoncore builds
> under VC6 again. I would be happy to submit that separately.
> (What can I say, old habits die hard. I can't stand the newer
> Microsoft IDEs.)
>
>
> Cheers,
>
>
> /larry/
>
> p.s. No, I haven't looked at a Unicode port yet. If there's
> interest in this patch at all, I might take a stab at it.
>