[Patches] [ python-Patches-1629305 ] The Unicode "lazy strings" patches

SourceForge.net noreply at sourceforge.net
Sun Jan 14 00:59:51 CET 2007


Patches item #1629305, was opened at 2007-01-06 04:37
Message generated for change (Comment added) made by gvanrossum
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1629305&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Core (C code)
Group: Python 3000
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Larry Hastings (lhastings)
>Assigned to: Guido van Rossum (gvanrossum)
Summary: The Unicode "lazy strings" patches

Initial Comment:
These are patches to add lazy processing to Unicode strings for Python 3000.  I plan to post separate patches for both "lazy concatenation" and "lazy slices", as I suspect "lazy concatenation" has a much higher chance of being accepted.

There is a long discussion about "lazy concatenation" here:
http://mail.python.org/pipermail/python-dev/2006-October/069224.html
And another long discussion about "lazy slices" here:
http://mail.python.org/pipermail/python-dev/2006-October/069506.html

Note that, unlike the 8-bit-character strings patches, I don't expect the "lazy slices" patch to be dependent on the "lazy concatenation" patch.  Unicode objects are stored differently, and already use a pointer to a separately-allocated buffer.   This was the big (and mildly controversial) change made by the 8-bit-character "lazy concatenation" patch, and "lazy slices" needed it too.  Since Unicode objects already look like that, the Unicode lazy patches should be independent.
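
For readers new to the idea, here is a minimal illustrative sketch of the
operations these patches target.  The code is ordinary Python and behaves
identically with or without the patches, which only change what happens
internally:

  # Lazy concatenation: under the patch, "+" records the two operands
  # instead of copying characters; the copy is deferred until the flat
  # character buffer is actually needed.
  prefix = u"Hello, "
  name = u"world"
  greeting = prefix + name

  # Lazy slice: the slice records (base string, start, stop) rather than
  # copying the sliced characters up front.
  hello = greeting[:5]

  # "Rendering" (the deferred copy) happens transparently the first time
  # the flat buffer is required, e.g. by C code via PyUnicode_AS_UNICODE().
  assert hello == u"Hello"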

----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2007-01-13 18:59

Message:
Logged In: YES 
user_id=6380
Originator: NO

Problems so far:

- Style: you set your tab stops to 4 spaces.  That is an absolute
no-no!  You can indent using 4 spaces, but you should NEVER assume
that a TAB character is anything except 8 spaces.

- Segfault in test_array. It seems that it's receiving a unicode slice
object and treating it like a "classic" unicode object.

- I got it to come to a grinding halt with the following worst-case
scenario:

  a = []
  while True:
      x = u"x"*1000000
      x = x[30:60]  # Short slice of long string
      a.append(x)

If you can't do better than that, I'll have to reject it.

PS I used your combined patch, if it matters.


----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-12 19:03

Message:
Logged In: YES 
user_id=364875
Originator: YES

File Added: pybench.first.results.zip

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-12 12:57

Message:
Logged In: YES 
user_id=364875
Originator: YES

josiahcarlson:

I think you misunderstood options 2 and 3.  The empty string (option 2) or
nonempty-but-fixed-size string (option 3) would *only* be returned in the
event of an allocation failure, aka "the process is out of memory".  Since
it's out of memory yet trying to allocate more, it has *already* failed.
My goal in proposing options 2 and 3 was that, when this happens (and it
eventually will), Python would fail *gracefully* with an exception, rather
than *miserably* with a bus error.

As for writing a wrapper, I'm just not interested.  I'm a strong believer
in "There should be one--and preferably only one--obvious way to do it",
and I feel a special-purpose wrapper class for good string performance
adds mental clutter.  The obvious way to do string concatenation is with
"+"; the obvious way to to string slices is with "[:]".  My goal is to
make those fast so that you can use them *everywhere*--even in
performance-critical code.  I don't want a wrapper class, and have no
interest in contributing to one.

For what it's worth, I came up with a fifth approach this morning while
posting to the Python-3000 mailing list: pre-allocate the str buffer,
updating it to the correct size whenever the lazy object changes size.
That would certainly fix the problem; the error would occur in a much
more reportable place.  But it would also slow down the code quite a lot,
negating many of the speed gains of this approach.

----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-01-12 01:55

Message:
Logged In: YES 
user_id=341410
Originator: NO

I don't think that changing the possible return of PyUnicode_AS_UNICODE is
reasonable. (option 1)

Option 2 breaks the buffer interface.

Option 3 severely limits the size of potential unicode strings.  If you
are only manipulating tiny unicode strings (8k?), then the effect of fast
concatenation, slicing, etc., isn't terribly significant.

Option 4 is possible, but I know I would feel bad if all of this work went
to waste.


Note what M. A. Lemburg mentioned.  The functionality is useful, it's the
polymorphic representation that is the issue.  Rather than attempting to
change the unicode representation, what about a wrapper type?  Keep the
base unicode representation simple (both Guido and M. A. have talked about
this).  Guido has also stated that he wouldn't be against views (slicing
and/or concatenation) if they could be shown to have real use-cases.  The
use-cases you have offered here are still applicable, and because a wrapper
wouldn't necessitate a (not insignificant) change in semantics and 3rd
party code, it would be more likely to be acceptable.
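
For what it's worth, here is a minimal sketch of the kind of wrapper
("view") type being suggested; the class name and all details are
hypothetical and are not part of any posted patch:

  class StrView(object):
      """A slice view onto a base string; copies nothing until asked."""
      def __init__(self, base, start=0, stop=None):
          self.base = base
          self.start = start
          self.stop = len(base) if stop is None else stop

      def __len__(self):
          return self.stop - self.start

      def __getitem__(self, index):
          if isinstance(index, slice):
              start, stop, step = index.indices(len(self))
              if step == 1:
                  # A slice of a view is just another view; no copying.
                  return StrView(self.base, self.start + start,
                                 self.start + stop)
          return str(self)[index]

      def __str__(self):
          # Materialize only when an actual string is demanded.
          return self.base[self.start:self.stop]

  view = StrView(u"x" * 1000000)[30:60]   # no megabyte copy happens here
  assert len(view) == 30 and str(view) == u"x" * 30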

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-11 23:32

Message:
Logged In: YES 
user_id=364875
Originator: YES

Just fixed the build under Linux--sorry, should have done that before
posting the original patch.  Patches now built and tested under Win32 and
Linux, and produce the same output as an unpatched py3k trunk.

lemburg: A minor correction: the full "lazy strings" patch (with "lazy
slices") also touches "stringlib/partition.h", "stringlib/readme.txt", and
"Objects/stringobject.c", in addition to the two unicodeobject.* files. 
The changes to these three files are minuscule, and don't affect their
maintainability, so the gist of my statements still holds.  (Besides, all
three of those files will probably go away before Py3k ships.)
File Added: lch.py3k.unicode.lazy.slice.and.concat.patch.53392.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-11 23:25

Message:
Logged In: YES 
user_id=364875
Originator: YES

File Added: lch.py3k.unicode.lazy.concat.patch.53392.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-11 22:12

Message:
Logged In: YES 
user_id=364875
Originator: YES

Attached below you will find the full "lazy strings" patch, which has both
"lazy concatenation" and "lazy slices".  The diff is against the current
revision of the Py3k branch, #53392.  On my machine (Win32) rt.bat produces
identical output before and after the patch, for both debug and release
builds.

As I mentioned in a previous comment, you can read the description (and
ensuing conversation) about "lazy slices" here:
http://mail.python.org/pipermail/python-dev/2006-October/069506.html

One new feature of this version: I added a method on a Unicode string,
s.simplify(), which forces the string to "render" if it's one of my exotic
string subtypes (a lazy concatenation or lazy slice).  My goal is to
assuage fears about pathological memory-use cases where you have long-lived
tiny slices of gigantic strings.  If you realize you're having that
problem, simply add calls to .simplify() on the slices and the problem
should go away.

As for the semantics of .simplify(), it returns a reference to the string
s.  Honestly I wasn't sure whether it should return a new string or just
monkey with the existing string.  Really, rendering doesn't change the
string; it's the same string, with the exact same external behavior, just
with different bits floating around underneath.  For now it monkeys with
the existing string, as that seemed best.  (But I'd be happy to switch it
to returning a new string if it'd help.)
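
For concreteness, a usage sketch of the pathological case and the proposed
fix (this assumes the combined patch is applied; without it strings have no
.simplify() method and the call below would raise AttributeError):

  a = []
  for i in range(1000):
      x = u"x" * 1000000          # gigantic temporary string
      x = x[30:60].simplify()     # force the 30-character slice to render
                                  # now; per the description above, the
                                  # gigantic original can then be collected
      a.append(x)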

I had planned to make the "lazy slices" patch independent of the "lazy
concatenation" patch.  However, it wound up being a bigger pain that I
thought, and anyway I figure the likelyhood that "lazy slices" would be
accepted and "lazy concatenation" would not is effectively zero.  So I
didn't bother.  If there's genuine interest in "lazy slices" without "lazy
concatenation", I can produce such a thing.
File Added: lch.py3k.unicode.lazy.slice.and.concat.patch.53392.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-11 21:50

Message:
Logged In: YES 
user_id=364875
Originator: YES

File Added: lch.py3k.unicode.lazy.concat.patch.53392.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-11 21:42

Message:
Logged In: YES 
user_id=364875
Originator: YES

lemburg:

You're right, the possibility of PyUnicode_AS_UNICODE() returning NULL is
new behavior, and this could conceivably result in crashes.  To be clear:
NULL return values will only happen when allocation of the final "str"
buffer fails during lazy rendering.  This will only happen in out-of-memory
conditions; for right now, while the patch is under early review, I suspect
that's okay.

So far I've come up with four possible ways to resolve this problem, which
I will list here from least-likely to most-likely:

1. Redefine the API such that PyUnicode_AS_UNICODE() is allowed to return
NULL, and fix every place in the Python source tree that calls it to check
for a NULL return.  Document this with strong language for external C
module authors.
2. Change the length to 0 and return a constant empty string.  Suggest
that users of the Unicode API ask for the pointer *first* and the length
*second*.
3. Change the length to 0 and return a previously-allocated buffer of some
hopefully-big-enough-size (4096 bytes? 8192 bytes?), such that even if the
caller iterates over the buffer, odds are good they'll stop before they hit
the end.  Again, suggest that users of the Unicode API ask for the pointer
*first* and the length *second*.
4. The patch is not accepted.

Of course, I'm open to suggestions of other approaches.  (Not to mention
patches!)


Regarding your memory usage and "slice integers" comments, perhaps you'll
be interested in the full lazy patch, which I hope to post later today. 
"Lazy concatenation" is only one of the features of the full patch; the
other is "lazy slices".  For a full description of my "lazy slices"
implementation, see this posting (and the subsequent conversation) to
Python-Dev:
http://mail.python.org/pipermail/python-dev/2006-October/069506.html
And yes, lazy slices suffer from the same
possible-NULL-return-from-PyUnicode_AS_UNICODE() problem that lazy
concatenation does.


As for your final statement, I never claimed that this was a particularly
clean design. I merely claim it makes things faster and is (so far)
self-contained.  For the Unicode versions of my lazy strings patches, the
only files I touched were "Include/unicodeobject.h" and
"Objects/unicodeobject.c".  I freely admit my patch makes those files *even
fussier* to work on than they already are.  But if you don't touch those
files, you won't notice the difference*, and the patch makes some Python
string operations faster without making anything else slower.  At the very
least I suggest the patches are worthy of examination.

* Barring API changes to rectify the possible NULL return from
PyUnicode_AS_UNICODE() problem, that is.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-01-10 15:59

Message:
Logged In: YES 
user_id=38388
Originator: NO

Larry, I probably wasn't clear enough:

PyUnicode_AS_UNICODE() returns a pointer to the underlying Py_UNICODE
buffer. No API using this macro checks for a NULL return value of the macro
since a Unicode object is guaranteed to have a non-NULL Py_UNICODE buffer.
As a result, a memory error caused during the concatenation process cannot be
passed back up the call stack. The NULL return value would result in a
plain segfault in the calling API.

Regarding the tradeoff and trying such an approach: I've done such tests
myself (not with Unicode but with 8-bit strings) and it didn't pay off. The
memory consumption outweighs the performance you gain by using the 'x += y'
approach. The ''.join(list) approach also doesn't really help if you're
after performance (for much the same reasons). 

In mxTextTools I used slice integers pointing into the original parsed
string to work around these problems, which works great and avoids creating
short strings altogether (so you gain speed and memory).
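
For illustration only (this is not mxTextTools code), the "slice integers"
idea looks roughly like this in Python: keep (start, end) index pairs into
the original string rather than materializing each short substring:

  text = u"alpha,beta,gamma,delta"

  spans = []                        # (start, end) pairs into `text`
  start = 0
  for end, ch in enumerate(text):
      if ch == u",":
          spans.append((start, end))
          start = end + 1
  spans.append((start, len(text)))

  # A substring is only created at the point it is actually needed.
  third = text[spans[2][0]:spans[2][1]]
  assert third == u"gamma"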

A patch I would find a lot more useful is one to create a Unicode
alternative to cStringIO - for strings, this is by far the most performant
way of creating a larger string from lots of small pieces. To complement
this, a smart slice type might also be an attractive target; one that
breaks up a larger string into slices and provides operations on these,
including joining them to form a new string.
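
For illustration, here is the build pattern being described, written with
io.StringIO (a Unicode-capable in-memory buffer available in later Python
versions) standing in for the proposed Unicode equivalent of cStringIO:

  import io

  buf = io.StringIO()
  for i in range(1000):
      buf.write(u"piece %d, " % i)   # appending small pieces stays cheap
  result = buf.getvalue()            # the large string is built exactly once
  assert result.startswith(u"piece 0, piece 1, ")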

I'm not convinced that mucking with the underlying object type and doing
"subtyping" on-the-fly is a clean design.


----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-10 15:30

Message:
Logged In: YES 
user_id=364875
Originator: YES

Much of what I do in Python is text processing.  My largest Python project
to date was an IDL which spewed out loads of text; I've also written an
HTML formatter or two.  I seem to do an awful lot of string concatenation
in Python, and I'd like it to be fast.  I'm not alone in this, as there
have been several patches to Python in recent years to speed up string
concatenation.

Perhaps you aren't familiar with my original justification for the patch. 
I've always hated the "".join() idiom for string concatenation, as it
violates the "There should be one--and preferably only one--obvious way to
do it" principle (and arguably others).  With lazy concatenation, the
obvious way (using +) becomes competitive with "".join(), thus dispensing
with the need for this inobvious and distracting idiom.
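
For reference, the two idioms side by side (a minimal sketch; without the
patch the "+" loop can degrade to O(n**2) in the worst case, while with
lazy concatenation it is intended to be competitive with join):

  parts = [u"chunk %d " % i for i in range(1000)]

  # The "obvious" way:
  s1 = u""
  for p in parts:
      s1 += p

  # The idiomatic-but-unobvious way:
  s2 = u"".join(parts)

  assert s1 == s2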

For a more thorough dissection of the (original) patch, including its
implementation and lots of discussion from other people, please see the
original thread on c.l.p:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/b8a8f20bc3c81bcf
Please ignore the benchmarks there, as they were quite flawed.

And, no, I haven't seen a lot of code manipulating Unicode strings yet,
but then I'm not a Python shaker-and-mover.  Obviously I expect to see a
whole lot more when Py3k is adopted.

----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-01-10 13:24

Message:
Logged In: YES 
user_id=341410
Originator: NO

From what I understand, the point of the lazy strings patch is to make
certain operations faster.  What operations?  Generally speaking, looped
concatenation (x += y) and other looping operations that have
traditionally been slow: O(n^2).

While this error is still common among new users of Python, generally
users only get bit once.  They ask about it on python-list and are told:
z = []; z.append(y); x = ''.join(z).
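
A rough way to observe the difference being described (timings vary widely
by interpreter and version, so this only shows how one might measure it):

  import timeit

  looped = (
      "s = ''\n"
      "for i in range(10000):\n"
      "    s += 'x'\n"
  )
  joined = "s = ''.join('x' for i in range(10000))"

  print(timeit.timeit(looped, number=10))
  print(timeit.timeit(joined, number=10))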

Then again, the only place where I've seen the iterative building up of
*text* is really in document reformatting (like textwrap).  Basically all
other use-cases (that I have seen) generally involve the manipulation of
binary data.  Larry, out of curiosity, have you found code out there that
currently loops and concatenates unicode?

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-08 20:26

Message:
Logged In: YES 
user_id=364875
Originator: YES

Continuing the comedy of errors, concat patch #2 was actually the same as
#1; it didn't have the fix for detecting a NULL return of PyMem_NEW(). 
Fixed in concat patch #3.  (Deleting concat patch #2.)
File Added: lch.py3k.unicode.lazy.concat.patch.3.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-08 20:10

Message:
Logged In: YES 
user_id=364875
Originator: YES

Revised the lazy concatenation patch to add (doh!) a check for when
PyMem_NEW() fails in PyUnicode_AsUnicode().
File Added: lch.py3k.unicode.lazy.concat.patch.2.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-08 13:50

Message:
Logged In: YES 
user_id=364875
Originator: YES

jcarlson:
The first time someone calls PyUnicode_AsUnicode() on a concatenation
object, it renders the string, and that's an O(something) operation.  In
general this rendering is linear time, though what it is linear *in*
depends.  (It iterates over the m concatenated strings, and each of
the n characters in those strings, and whether n or m is more important
depends on their values.)  After rendering, the object behaves like any
other Unicode string, including O(1) for array element lookup.

If you're referring to GvR's statement "I mention performance because s[i]
should remain an O(1) operation.", here:
http://mail.python.org/pipermail/python-3000/2006-December/005281.html
I suspect this refers to the UCS-2 vs. UTF-16 debate.

lemburg:
Your criticisms are fair; lazy evaluation is a tradeoff.  In general my
response to theories about how it will affect performance is "I invite you
to try it and see".

As for causing memory errors, the only problem I see is not checking for a
NULL return from PyMem_NEW() in PyUnicode_AsUnicode().  But that's a bug,
not a flaw in my approach, and I'll fix that bug today.  I don't see how
"[my] approach can cause memory errors" in any sort of larger sense.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-01-08 05:59

Message:
Logged In: YES 
user_id=38388
Originator: NO

While I don't think the added complexity in the implementation is worth
it, given that there are other ways of achieving the same kind of
performance (e.g. list of Unicode strings), some comments:

 * you add a long field to every Unicode object - so every single object
in the system pays 4-8 bytes for the small performance advantage

 * Unicode objects are often referenced using PyUnicode_AS_UNICODE(); this
operation doesn't allow passing back errors, yet your lazy evaluation
approach can cause memory errors - how are you going to deal with them?
(currently you don't even test for them)

 * the lazy approach keeps all partial Unicode objects alive until they
finally get concatenated; if you have lots of those (e.g. if you use x += y
in a loop), then you pay the complete Python object overhead for every
single partial Unicode object in the list of strings - given that most such
operations use short strings, you are likely creating a memory overhead far
greater than the total length of all the strings (see the rough sketch
below)
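
A rough sketch of the overhead being described, using sys.getsizeof from
later Python versions (exact numbers depend on the Python version and
build):

  import sys

  pieces = [u"%06d" % i for i in range(100000)]     # many distinct tiny strings
  payload = sum(len(p) for p in pieces)             # characters actually stored
  overhead = sum(sys.getsizeof(p) for p in pieces)  # includes per-object headers
  print(payload, overhead)   # the per-object overhead dwarfs the payload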



----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-01-07 00:08

Message:
Logged In: YES 
user_id=341410
Originator: NO

What are the performance characteristics of each operation?  I presume
that a + b for unicode strings a and b is O(1) time (if I understand your
implementation correctly).  But according to my reading, (a + b + c +
...)[i] is O(number of concatenations performed).  Is this correct?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1629305&group_id=5470

