[Python-3000] Lazy strings (was Re: Py3k release schedule worries)

Guido van Rossum guido at python.org
Fri Jan 12 19:25:54 CET 2007


[Larry Hastings]
> As discussed on that page, the current version of the patch could cause
>  crashes in low-memory conditions.  I welcome suggestions on how best to
>  resolve this problem.  Apart from that fly in the ointment I'm pretty
>  happy with how it all turned out.

[Guido]
> What kind of crashes? The right thing to do is to raise MemoryError.
> Is there anything besides sheer will power that prevents that?

[Larry]
> Nothing *has* prevented that; the relevant code already calls
> PyErr_NoMemory().  The problem is that *its* callers currently won't
> notice, and continue on their merry way.
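
Concretely, the failure path inside the render step looks something
like the sketch below -- the names are hypothetical, not the patch's
actual code.  The exception *is* set correctly; it's the NULL return
that nobody downstream looks at:

    static Py_UNICODE *
    lazy_render(PyUnicodeObject *u)
    {
        /* hypothetical render step for a lazy string */
        Py_UNICODE *buf = PyMem_NEW(Py_UNICODE, u->length + 1);
        if (buf == NULL) {
            PyErr_NoMemory();   /* correctly sets MemoryError ... */
            return NULL;        /* ... but PyUnicode_AS_UNICODE()'s
                                   callers never check for this */
        }
        /* ... concatenate the deferred pieces into buf ... */
        buf[u->length] = 0;
        return buf;
    }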

[Larry]
>  My patch adds a new wrinkle to the API: PyUnicode_AS_UNICODE() can now
> fail.  And currently when it fails it returns NULL.  (Why could it fail?
> Under the covers, PyUnicode_AS_UNICODE() may attempt to allocate memory.)
>
>  Without the patch PyUnicode_AS_UNICODE() always works.  Since no caller
> ever expects it to fail, code looks like this:
>
> static
> int fixupper(PyUnicodeObject *self)
> {
>     Py_ssize_t len = self->length;
>     Py_UNICODE *s = PyUnicode_AS_UNICODE(self);
>     int status = 0;
>
>     while (len-- > 0) {
>         register Py_UNICODE ch;
>
>         ch = Py_UNICODE_TOUPPER(*s);
>         ...
>
> And there you are; when s is NULL, Python crashes.

Which is unacceptable. We might as well not have MemoryError and just
say "if you run out of memory, the behavior of any program is
undefined". And I'll never go for that; it would take away a major
advantage of using Python. This would also directly affect security.
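
To spell out what doing the right thing costs at each call site:
fixupper() above would have to check for NULL, and -- since its return
value today is a modified-flag -- it would also need a distinct error
value that *its* callers check in turn.  A sketch (the loop body is
reconstructed for illustration; this is not the patch's code):

    static
    int fixupper(PyUnicodeObject *self)
    {
        Py_ssize_t len = self->length;
        Py_UNICODE *s = PyUnicode_AS_UNICODE(self);
        int status = 0;

        if (s == NULL)
            return -1;          /* MemoryError is already set; propagate.
                                   Note that -1 is a new error value that
                                   fixupper's own callers must check. */

        while (len-- > 0) {
            register Py_UNICODE ch;

            ch = Py_UNICODE_TOUPPER(*s);
            if (ch != *s) {
                status = 1;
                *s = ch;
            }
            s++;
        }
        return status;
    }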

>  In the patch comments I proposed four possible solutions for this problem,
> listed in order of least-likely to most-likely.  I just came up with a fifth
> one, and I'll include it here.

Thanks -- I haven't had the time to look at the patch yet (but I will).

> 1. Redefine the API such that PyUnicode_AS_UNICODE() is allowed to return NULL,
> and fix every place in the Python source tree that calls it to check for a
> NULL return.  Document this with strong language for external C module
> authors.
> 2. Pre-allocate the str buffer used to render the lazy string objects.  Update
> this buffer whenever the size of the string changes.  That moves the failure
> to a better place for error reporting; once again PyUnicode_AS_UNICODE() can
> never fail.  But this approach also negates a healthy chunk of what made the
> patch faster.
> 3. Change the length to 0 and return a constant empty string.  Suggest that
> users of the Unicode API ask for the pointer *first* and the length
> *second*.
> 4. Change the length to 0 and return a previously-allocated buffer of some
> hopefully-big-enough-size (4096 bytes? 8192 bytes?), such that even if the
> caller iterates over the buffer, odds are good they'll stop before they hit
> the end.  Again, suggest that users of the Unicode API ask for the pointer
> *first* and the length *second*.
> 5. The patch is not accepted. (You see what an optimist I am.)
>
>  I'm open to suggestions (and patches!) of other approaches to solve this
> problem.

#1 would be my preference. We should probably rename the macro (and
change the case since it's no longer a macro) so as to *force* folks
to consider this.
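
That is, call sites would read something like this sketch (the
mixed-case name is illustrative):

    static int
    some_caller(PyUnicodeObject *self)
    {
        /* function-style accessor: the spelling itself warns that it
           can fail, unlike the all-caps macro */
        Py_UNICODE *s = PyUnicode_AsUnicode((PyObject *)self);
        if (s == NULL)
            return -1;          /* MemoryError already set; propagate */
        /* ... use s ... */
        return 0;
    }
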
#2 is a different way of spelling #5; it would defeat the purpose.
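
Spelling that out: under #2 the flat buffer gets rendered eagerly at
every size-changing operation, so the failure moves to a call that can
already report errors -- but then nothing is deferred any more, and the
deferral is where the speedup comes from.  A sketch (hypothetical
names, not the patch's code):

    static int
    lazy_resize(PyUnicodeObject *u, Py_ssize_t newlen)
    {
        /* hypothetical: called from concat/slice when the size changes */
        Py_UNICODE *buf = PyMem_NEW(Py_UNICODE, newlen + 1);
        if (buf == NULL) {
            PyErr_NoMemory();
            return -1;          /* reported here, where callers do check */
        }
        /* ... fill buf and install it as u's rendered string ... */
        u->length = newlen;
        return 0;
    }
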
I don't understand what you mean by #3 and #4; change *which* length?
The phrasing of #4 using "hopefully-big-enough" and "odds" immediately
makes me think "buffer overflow attack", which is a non-starter.

Finally (unrelated to the memory problem) I'd like to see some
benchmarks to prove that this is really worth it.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

