[Python-3000] Poll: Lazy Unicode Strings For Py3k

Larry Hastings larry at hastings.org
Wed Jan 31 11:40:52 CET 2007



I'd like to start a (hopefully final) round of discussion on
the "lazy strings" series of patches.  What follows is a summary
on the current state of the patches, followed by five poll
questions.

I apologize in advance for the length of this posting.


A NOTE BEFORE WE BEGIN

The "lazy strings" patches are actually two separate patches:
"lazy concatentation" and "lazy slices".  I did them together
to save work, but I fear I have created some confusion by doing
so.  *Please* keep in mind that "lazy concatenation" and "lazy
slices" are separate things, and should be independently
considered for inclusion.


"LAZY CONCATENATION"

Lazy string concatenation changes the + operator when adding
two strings.  Instead of returning the result of the concatenation,
it returns a placeholder (a PyUnicodeConcatenationObject) that
represents the concatenation.  The placeholder holds references
to the two strings; the first time any caller asks for the
string's value, the string is "rendered": memory is allocated
for the string, its value is copied in from the referenced strings
(recursively only where necessary), the references to the strings
are freed, and the newly computed value is returned.
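
For the sake of discussion, here is a rough C sketch of the idea.
The struct and function names are mine, not the patch's, and the
real code in Objects/unicodeobject.c handles recursive rendering,
error checking, and buffer caching that this toy omits:

    #include "Python.h"
    #include <string.h>

    /* Toy placeholder for "left + right"; nothing is copied yet. */
    typedef struct {
        PyUnicodeObject base;     /* masquerades as an ordinary string */
        PyObject *left;           /* references to the two operands    */
        PyObject *right;
    } LazyConcatObject;

    /* First access: allocate the buffer, copy both halves in, and
       drop the operand references.  This is the "rendering" step. */
    static Py_UNICODE *
    lazy_concat_render(LazyConcatObject *self)
    {
        Py_ssize_t llen = PyUnicode_GET_SIZE(self->left);
        Py_ssize_t rlen = PyUnicode_GET_SIZE(self->right);
        Py_UNICODE *buf = PyMem_NEW(Py_UNICODE, llen + rlen);
        if (buf == NULL)
            return NULL;          /* the new failure mode: no memory */
        memcpy(buf, PyUnicode_AS_UNICODE(self->left),
               llen * sizeof(Py_UNICODE));
        memcpy(buf + llen, PyUnicode_AS_UNICODE(self->right),
               rlen * sizeof(Py_UNICODE));
        Py_CLEAR(self->left);     /* operands are no longer needed */
        Py_CLEAR(self->right);
        return buf;
    }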

Lazy concatenation changes the behavior of the Python C API in two
subtle ways:
1) All C API users asking for the value of a string *must* use
   the macro PyUnicode_AS_UNICODE() or the function
   PyUnicode_AsUnicode().  It is no longer permissible to
   directly access the object's "str" member.
2) It is now possible for PyUnicode_AS_UNICODE() and
   PyUnicode_AsUnicode() to *fail*, returning NULL under
   low-memory conditions.  All code calling either of these
   would have to be amended to check for NULL (see the sketch
   below); few such checks have been added as yet.  Josiah
   Carlson suggests there are 400+ such places in the Py3k
   source tree alone.
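
To make the new obligation concrete, here is a minimal,
hypothetical example of what a C extension touching Unicode
buffers would look like once the patch is applied;
count_spaces() is an invented function, not part of any API:

    #include "Python.h"

    /* With lazy strings applied, the accessor itself may trigger
       rendering, and rendering may fail under low memory, so every
       caller must now check for NULL. */
    static Py_ssize_t
    count_spaces(PyObject *s)
    {
        Py_ssize_t i, n, count = 0;
        Py_UNICODE *p = PyUnicode_AS_UNICODE(s);   /* may render... */
        if (p == NULL)
            return -1;                             /* ...and may fail */
        n = PyUnicode_GET_SIZE(s);
        for (i = 0; i < n; i++)
            if (p[i] == ' ')
                count++;
        return count;
    }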

The implementation also adds complexity underlying a
PyUnicodeObject, though this is effectively hidden from anyone
not peering into "Objects/unicodeobject.c".

With this patch in place, Unicode string concatenation is
approximately 300% faster.  The Pybench "ConcatUnicode" benchmark
finishes in 1/4 the time when compared to an unpatched Py3k.  This
makes using + for string concatenation fast enough that the Python
"".join() idiom is no longer necessary, thus making + the
"one obvious way to do it".


"LAZY SLICES"

Lazy string slicing changes the slicing operator when acting
upon a string and all string methods that return a slice of
the string (split(), strip(), partition(), etc).  Instead of
returning the result of the slice, it returns a placeholder
(a PyUnicodeSliceObject) that represents the slice.  The
placeholder holds a reference to the original string, and the
desired slice's start and end offsets.  (Lazy slice objects only
support slices with a stride of 1; slices with a stride greater
than 1 are computed directly as before.)  The first time any
caller asks for the value of the string, the slice is
"rendered", and its value is returned.

This by itself would not speed anything up; it would actually be
a net slowdown.  The speedup is achieved by avoiding rendering
wherever possible.  Most Unicode string processing is done entirely
within "Objects/unicodeobject.c", and it is relatively simple to
change code within this file such that it operates on *unrendered*
slices.  With such changes in place, most slices are never rendered,
thus obviating the need for allocating memory and memcpy()ing to
store the slice's value.
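
Again a rough sketch with invented names, just to show the shape
of the thing: the placeholder records the source string and the
slice bounds, and code inside unicodeobject.c that knows about
the placeholder can read characters straight out of the source's
buffer without ever rendering the slice:

    #include "Python.h"

    /* Toy placeholder for s[start:start+length] with stride 1. */
    typedef struct {
        PyUnicodeObject base;
        PyObject *source;         /* the original string (a real
                                     reference, in this first patch) */
        Py_ssize_t start;         /* offset into the source buffer */
        Py_ssize_t length;
    } LazySliceObject;

    /* Slice-aware code never needs the slice's own buffer; it just
       indexes into the source, shifted by 'start'.  No allocation,
       no memcpy. */
    static Py_UNICODE
    lazy_slice_getchar(LazySliceObject *self, Py_ssize_t i)
    {
        return PyUnicode_AS_UNICODE(self->source)[self->start + i];
    }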

Lazy slices change the behavior of Python in three subtle
ways.  First, they add the same two changes in C API behavior
that "lazy concatenation" does: requiring use of the accessor
macro/function, and stipulating that these can now fail.

Lazy slices also add a third wrinkle, a change to Python's
memory allocation behavior: slicing a string means the
slice holds a reference to the original string.  If the slice
is long-lived, that means the original string becomes long-lived.
This can produce undesirable memory consumption when a program
contains long-lived small slices of short-lived big strings.  In
this patch's defense, I point out there are other scenarios where
this behavior would actually result in *less* overall memory use,
mainly when taking multiple long-lived slices of a single string.
Also, it is possible to tune your memory use of lazy slices by
explicitly forcing the slice to render.  Still, the worst-case
behavior here is bad enough to give definite pause.  (As We Shall
Soon See.)

Also, as with lazy concatenation, lazy slices add complexity to
PyUnicodeObject, though again this is all hidden within
unicodeobject.c.

With this patch in place, Unicode string slicing is approximately
110% faster.  The Pybench "UnicodeSlicing" benchmark finishes in
less than 1/2 the time when compared to an unpatched Py3k.


GUIDO'S EXAMINATION

Guido examined a "rollup" patch that contained both lazy slices
and lazy concatenation.  His major criticism of the patch was
the "third wrinkle" added by lazy slices, where long-lived small
slices of short-lived big strings cause the big strings to become
long-lived.  Here's his test case:
  a = []
  while True:
      x = u"x"*1000000
      x = x[30:60]  # Short slice of long string
      a.append(x)

An unpatched Py3k can run all day doing this; add the "lazy slice"
patch and it quickly raises a MemoryError (after 1035 iterations
on my XP box).
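
A rough back-of-the-envelope suggests why it dies so quickly
(assuming a 32-bit process and the 2-byte Py_UNICODE of Windows
builds):

    1035 iterations * 1,000,000 chars * 2 bytes/char  ~=  2.07 GB

which is right at the roughly 2 GB of user address space a 32-bit
Windows process gets, because every 30-character slice keeps its
megabyte-long source string alive.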

Guido was unwilling to accept a patch with this behavior.
He agreed (via a short email exchange) that more discussion
on the Py3k list was a good idea.


"V2 LAZY SLICES"

After meditating under a waterfall on a distant mountaintop,
I conceived of an alternate implementation of lazy slices that
wouldn't have this memory behavior while still preserving much
of the benefit of the original lazy slices patch.  I call this
approach "V2 lazy slices".

The implementation is, in a word, complicated.  In V2 lazy
slices, a lazy slice is still a placeholder, with a pointer
back to the original string.  However, it does *not* retain
an official Python reference to the original string object--
ob_refcnt is unchanged.  Instead, there is a sinister second
internal reference, hidden from the outside world.  The original
string maintains a circularly-linked list of all slices from it.
When the original string is dealloc()'d, it walks this list and
forces each slice to render.  As if that weren't enough, the
code must handle some gnarly boundary cases in low-memory
conditions.  This approach necessitated a change to the object
structure; no longer are lazy slices (and lazy concatenations)
internal subtypes of PyUnicodeObject.  Instead, a PyUnicodeObject
has a pointer to a "PyUnusualUnicodeObject" structure which reflects
the string's "unusual" state:
  * I am a lazy concatenation.
  * I am a ("V2") lazy slice.
  * I am a conventional string, but lazy slices point to me.
Apart from this change in reference counting and storage, the V2
lazy slice patch is much the same as the original lazy slice patch.
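
Purely to give a feel for the bookkeeping, here is an
illustrative sketch; the field names are guesses from the
description above, not the patch's actual layout:

    #include "Python.h"

    /* Illustrative only. */
    typedef struct _PyUnusualUnicodeObject {
        int kind;                  /* lazy concatenation, V2 lazy slice,
                                      or "plain string with live slices" */
        PyUnicodeObject *source;   /* for a slice: the original string,
                                      tracked WITHOUT touching ob_refcnt */
        Py_ssize_t start, length;  /* slice bounds (stride 1 only) */
        struct _PyUnusualUnicodeObject *prev, *next;
                                   /* circular list linking all slices of
                                      one original string, so its dealloc
                                      can walk the ring and force each
                                      slice to render before the shared
                                      buffer disappears */
    } PyUnusualUnicodeObject;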

As before, V2 lazy slices change the behavior of the Python
C API in the same two ways that "lazy concatenation" does:
requiring use of the accessor macro/function, and stipulating
that these can now fail.  The V2 patch does *not* share the change
in memory consumption added by the original "lazy slices" patch;
Guido's test case above now runs for as many iterations as you're
willing to sit around for.  It *does* still hide all this complexity
inside unicodeobject.c.


There is one more wrinkle to consider.  Consider the following:
        def isNotFoo(fileHandle):
                a = unicode(fileHandle.readline())
                x = a.strip()
                return x != u"foo"
Here we allocate the string "a", then "x", a V2 lazy slice of "a".
When the function exits, it pops the frame, and the locals are
freed.  But!  "a" is freed *before* "x", which means the original
Unicode object is deallocated, which forces "x" to render--
even though "x" is itself about to be freed.  So this rendering
is a waste of time.

I asked on Python-Dev about this.  The consensus I got was:
 * In CPython locals are freed in the order they were added
   to the dictionary, and
 * this behavior is not a defined part of Python and should
   not be relied on.

So I did what any loser hacker would do: I reversed the order.
Now local variables are freed in the reverse order of their
being added to the locals dictionary.  Obviously this does not
guarantee that all local slices will be freed before their local
original string is.  But I *suspect* that it's better for V2 lazy
slices than the other order, and I've been assured that nobody
can rely on the order here, so it should be a safe (cheap) hack.
This change certainly doesn't seem to break anything.


LOCAL FREELIST RETOOLING

While working on the V2 lazy slice patch, I realized that the
local freelist code in unicodeobject.c could use some renovation.
To make a long story slightly less long, I revamped it to have four
freelists: one for PyUnicodeObjects themselves, one for very short
string buffers, and two for the specialized objects I use with lazy
evaluation.  Even if we don't keep any of the rest of my patches,
I suspect this would be of interest.  I speculate that it will make
Python *slightly* faster and use *slightly* less memory, but I admit
I have not tried it in isolation.
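
For readers who haven't poked at unicodeobject.c, the pattern
being multiplied here is roughly the following.  This is a
generic sketch, not the patch itself, and it assumes all blocks
on a given list are interchangeable (same size, at least
sizeof(FreeNode) bytes):

    #include <stdlib.h>

    /* Freed blocks are threaded onto a singly linked list and handed
       back out on the next allocation, skipping malloc()/free() in
       the common case. */
    typedef struct FreeNode {
        struct FreeNode *next;
    } FreeNode;

    #define MAX_FREE 1024
    static FreeNode *free_list = NULL;
    static int num_free = 0;

    static void *
    freelist_alloc(size_t size)
    {
        if (free_list != NULL) {
            FreeNode *node = free_list;   /* reuse a cached block */
            free_list = node->next;
            num_free--;
            return node;
        }
        return malloc(size);              /* cache empty: really allocate */
    }

    static void
    freelist_free(void *p)
    {
        if (num_free < MAX_FREE) {
            FreeNode *node = (FreeNode *)p;
            node->next = free_list;       /* cache it for later reuse */
            free_list = node;
            num_free++;
        } else {
            free(p);                      /* cache full: really free */
        }
    }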


THE NET RESULT

I have a Py3k with "V2 lazy slices", "lazy concatenation" (retooled
to mate better with the V2 lazy slices), and the "local freelist
retooling", all applied at once.  Preliminary benchmarks from one
run of Pybench, 10 rounds at "warp 10", on my main XP box:
        "ConcatUnicode":           249.1% faster
        "CreateUnicodeWithConcat":  72.0% faster
        "UnicodeSlicing":           90.2% faster
        Overall:                     3.3% faster
I would be remiss if I didn't point out: these tests are really
just microbenchmarks.  "UnicodeSlicing" creates 49 slices of the
same string, then throws them away--neither examining the slices'
values nor storing them in a local variable.  The effect of these
patches on real-world code is as yet untested as far as I know.

I haven't published this rollup patch because it flunks the
regression tests right now.


THE POLLS

Finally, the polls.

#1: Should Py3k incorporate the "lazy concatenation" patch?

#2: Should Py3k incorporate the original "lazy slices" patch?

#3: Should Py3k incorporate the "V2 lazy slices" patch?

#4: Assuming you are +0 or better on #3, is it okay to change the
    frame object such that locals are released in the reverse order
    of their being added to the dictionary?

#5: Should Py3k incorporate the "local freelist retooling" patch?


My votes are as follows:

1: +1
2: -1
3: +0.5
4: +1
5: +1

The reason I'm not fully behind "V2 lazy slices" is that they're
just so gosh-darned complicated.  Yes, they *are* faster, but at the
cost of a lot of possible programmer puzzlement.  My feeling is that
we Python core developers are willing to shoulder the brunt of a certain
amount of complexity to make the lives of Python users happier, but this
only goes so far.  Really I only implemented them because of a misguided
dogged relentlessness.  But I don't claim to have my finger on the pulse
of the Python core development community; perhaps you'll surprise me with
a "damn the complexity, full speed ahead" attitude and express great
interest.

If you'd prefer to see the "V2 lazy slices" patch before passing sentence,
post a request to the list (or email me) and I'll make one happen.

I await your judgment,


/larry/

