[Python-Dev] The "lazy strings" patch

Talin talin at acm.org
Sat Oct 21 08:37:45 CEST 2006

Interesting - is it possible that the same technique could be used to 
hide differences in character width? Specifically, if I concatenate an 
ASCII string with a UTF-32 string, can the up-conversion to UTF-32 also 
be done lazily? If that could be done efficiently, it would resolve some 
outstanding issues that have come up on the Python-3000 list with 
regard to str/unicode convergence.

Larry Hastings wrote:
> I've significantly enhanced my string-concatenation patch, to the point 
> where that name is no longer accurate.  So I've redubbed it the "lazy 
> strings" patch.
> The major new feature is that string *slices* are also represented with 
> a lazy-evaluation placeholder for the actual string, just as 
> concatenated strings were in my original patch.  The lazy slice object 
> stores a reference to the original PyStringObject * it is sliced from, 
> and the desired start and stop slice markers.  (It only supports step = 
> 1.)  Its ob_sval is NULL until the string is rendered--but that rarely 
> happens!  Not only does this mean string slices are faster, but I bet 
> this generally reduces overall memory usage for slices too.
> Now, one rule of the Python programming API is that "all strings are 
> zero-terminated".  That part of the API keeps the life of a Python 
> extension author sane--they don't have to deal with some exotic Python 
> string class; they can just assume C-style strings everywhere.  
> Ordinarily, 
> this means a string slice couldn't simply point into the original 
> string; if it did, and you executed
>  x = "abcde"
>  y = x[1:4]
> internally y->ob_sval[3] would not be 0, it would be 'e', breaking the 
> API's rule about strings.
> However!  When a PyStringObject lives out its life purely within the 
> Python VM, the only code that strenuously examines its internals is 
> stringobject.c.  And that code almost never needs the trailing zero*.  
> So I've added a new static method in stringobject.c:
>    char * PyString_AsUnterminatedString(PyStringObject *)
> If you call it on a lazy-evaluation slice object, it gives you back a 
> pointer into the original string's ob_sval.  The s->ob_size'th element 
> of this *might not* be zero, but if you call this function you're saying 
> that's a-okay, you promise not to look at it.  (If the PyStringObject * 
> is any other variety, it calls into PyString_AsString, which renders 
> string concatenation objects then returns ob_sval.)
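The distinction between the two accessors is whether a copy is forced. A rough Python analogy using `bytes`/`memoryview` (the function names here are invented; the real functions operate on `PyStringObject *` in C):

```python
def as_unterminated_string(buf, start, stop):
    # Zero-copy view into the original buffer, like the lazy slice's
    # pointer into the source ob_sval: the byte just past `stop` may
    # be anything, so callers promise not to look at it.
    return memoryview(buf)[start:stop]

def as_string(buf, start, stop):
    # Rendered result: an independent copy, safe to hand to code
    # that expects a normal, self-contained string.
    return bytes(buf[start:stop])

x = b"abcde\x00"
view = as_unterminated_string(x, 1, 4)
assert view == b"bcd"                  # compares equal without copying
assert as_string(x, 1, 4) == b"bcd"    # this one forces the copy
```

The unterminated accessor is the fast path used internally; the copying accessor is what everything outside stringobject.c effectively gets.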
> Again: this behavior is *never* observed by anyone outside of 
> stringobject.c.  External users of PyStringObjects call 
> PyString_AS_STRING(), which renders all lazy concatenation and lazy 
> slices so they look just like normal zero-terminated PyStringObjects.  
> With my patch applied, trunk still passes all expected tests.
> Of course, lazy slice objects aren't just for literal slices created 
> with [x:y].  There are lots of string methods that return what are 
> effectively string slices, like lstrip() and split().
> With this code in place, string slices that aren't examined by C 
> extension modules are very rarely rendered.  I ran "pybench -n 2" (two 
> rounds, warp 10 (whatever that means)) while collecting some 
> statistics.  When it finished, the interpreter had created a total of 
> 640,041 lazy slices, of which only *19* were ever rendered.
> Apart from lazy slices, there's only one more enhancement when compared 
> with v1: string prepending now reuses lazy concatenation objects much 
> more often. There was an optimization in string_concatenate 
> (Python/ceval.c) that said: "if the left-side string has two references, 
> and we're about to overwrite the second reference by storing this 
> concatenation to an object, tell that object to drop its reference".  
> That often meant the reference on the string dropped to 1, which meant 
> PyString_Resize could just resize the left-side string in place and 
> append the right-side.  I modified it so it drops the reference to the 
> right-hand operand too.  With this change, even with a reduction in the 
> allowable stack depth for right-hand recursion (so it's less likely to 
> blow the stack), I was able to prepend over 86k strings before it forced 
> a render.  (Oh, for the record: I ensure depth limits are enforced when 
> combining lazy slices and lazy concatenations, so you still won't blow 
> your stack when you mix them together.)
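The concatenation side can be sketched the same way. Again a toy model in pure Python, with an invented `MAX_DEPTH` constant (the real patch enforces its own limit on the C side):

```python
MAX_DEPTH = 16   # hypothetical limit; the actual patch picks its own

class LazyConcat:
    """Toy lazy concatenation: store the operands, render on demand,
    and force a render when nesting gets too deep, so a long chain of
    concatenations cannot blow the stack at render time."""

    def __init__(self, left, right):
        self.parts = [left, right]
        depth = max(getattr(p, "depth", 0) for p in self.parts) + 1
        if depth > MAX_DEPTH:
            # too deep: flatten into a single real string now
            self.parts = ["".join(self._flatten())]
            depth = 1
        self.depth = depth

    def _flatten(self):
        for p in self.parts:
            if isinstance(p, LazyConcat):
                yield from p._flatten()
            else:
                yield p

    def render(self):
        return "".join(self._flatten())

s = "x"
for _ in range(1000):              # repeated prepending, as in the patch
    s = LazyConcat("a", s)
assert s.render() == "a" * 1000 + "x"
assert s.depth <= MAX_DEPTH        # depth limit kept renders bounded
```

Each prepend is O(1) until the depth limit trips, at which point one render amortizes the accumulated work.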
> Here are the highlights of a single apples-to-apples pybench run, 2.6 
> trunk revision 52413 ("this") versus that same revision with my patch 
> applied ("other"):
> Test                            minimum run-time        average run-time
>                                 this    other    diff   this    other    diff
> -----------------------------------------------------------------------------
>                ConcatStrings:  204ms    76ms  +168.4%  213ms    77ms  +177.7%
>      CreateStringsWithConcat:  159ms   138ms   +15.7%  163ms   142ms   +15.1%
>                StringSlicing:  142ms    86ms   +65.5%  145ms    88ms   +64.6%
> -----------------------------------------------------------------------------
> Totals:                        7976ms  7713ms   +3.4%  8257ms  7975ms   +3.5%
> I also ran this totally unfair benchmark:
>    x = "abcde" * (20000) # 100k characters
>    for i in xrange(10000000):
>        y = x[1:-1]
> and found my patched version to be 9759% faster.  (You heard that right, 
> 98x faster.)
> I'm ready to post the patch.  However, as a result of this work, the 
> description on the original patch page is really no longer accurate:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1569040&group_id=5470&atid=305470 
> Shall I close/delete that patch and submit a new patch with a more 
> modern description?  After all, there's not a lot of activity on the old 
> patch page...
> Cheers,
> /larry/
> * As I recall, stringobject.c needs the trailing zero in exactly *one* 
> place: when comparing two zero-length strings.  My patch ensures that 
> zero-length slices and concatenations still return nullstring, so this 
> still works as expected.
