[Python-Dev] unicode vs buffer (array) design issue can crash interpreter

Thu Apr 13 07:13:51 CEST 2006

On 3/31/06, M.-A. Lemburg <mal at egenix.com> wrote:
> Martin v. Löwis wrote:
> > Neal Norwitz wrote:
> >> See http://python.org/sf/1454485 for the gory details.  Basically if
> >> you create a unicode array (array.array('u')) and try to append an
> >> 8-bit string (ie, not unicode), you can crash the interpreter.
> >>
> >> The problem is that the string is converted without question to a
> >> unicode buffer.  Within unicode, it assumes the data to be valid, but
> >> this isn't necessarily the case.  We wind up accessing an array with a
> >> negative index and boom.
> >
> > There are several problems combined here, which might need discussion:
> >
> > - why does the 'u#' converter use the buffer interface if available?
> >   it should just support Unicode objects. The buffer object makes
> >   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
> >   u# shouldn't guess that it is.
> >   (FWIW, it currently truncates the buffer size to the next-smaller
> >    multiple of sizeof(Py_UNICODE), and silently so)
> >
> >   I think that part should just go: u# should be restricted to unicode
> >   objects.
>
> 'u#' is intended to match 's#' which also uses the buffer
> interface. It expects the buffer returned by the object
> to be a Py_UNICODE* buffer, hence the calculation of the
> length.
>
> However, we already have 'es#' which is a lot safer to use
> in this respect: you can explicitly define the encoding you
> want to see, e.g. 'unicode-internal' and the associated
> codec also takes care of range checks, etc.
>
> So, I'm +1 on restricting 'u#' to Unicode objects.

Note:  2.5 no longer crashes, 2.4 does.

Does this mean you would like to see this patch checked in to 2.5? 
What should we do about 2.4?

Index: Python/getargs.c
===================================================================

--- Python/getargs.c    (revision 45333)
+++ Python/getargs.c    (working copy)
@@ -1042,11 +1042,8 @@
                                STORE_SIZE(PyUnicode_GET_SIZE(arg));
                        }
                        else {
-                       char *buf;
-                       Py_ssize_t count = convertbuffer(arg, p, &buf);
-                       if (count < 0)
-                               return converterr(buf, arg, msgbuf, bufsize);
-                       STORE_SIZE(count/(sizeof(Py_UNICODE)));
+                               return converterr("cannot convert raw buffers",
+                                                 arg, msgbuf, bufsize);
                        }
                        format++;
                } else {
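
For reference, the line the patch removes, STORE_SIZE(count/(sizeof(Py_UNICODE))), is where the silent truncation Martin mentioned comes from. Below is a rough Python sketch of that arithmetic only, not of getargs.c itself. The helper names stored_size and dropped_bytes are made up for illustration, and the unit sizes 2 and 4 are assumed values for sizeof(Py_UNICODE) on UCS-2 and UCS-4 builds.

```python
# Sketch of the size calculation in the branch the patch removes.
# Assumption: unit_size stands in for sizeof(Py_UNICODE) (2 or 4).
# stored_size and dropped_bytes are hypothetical helper names.

def stored_size(byte_count, unit_size):
    """Mimic count / sizeof(Py_UNICODE) for a non-negative byte count."""
    return byte_count // unit_size


def dropped_bytes(byte_count, unit_size):
    """Trailing bytes that the integer division silently ignores."""
    return byte_count % unit_size


raw = b"abcde"  # 5 bytes of 8-bit string data, not Py_UNICODE data
for unit_size in (2, 4):
    print("unit size", unit_size,
          "-> length", stored_size(len(raw), unit_size),
          "dropped", dropped_bytes(len(raw), unit_size))
```

With the patch applied there is nothing to truncate, since a raw buffer is rejected outright instead of being reinterpreted.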

> > - should Python guarantee that all characters in a Unicode object
> >   are between 0 and sys.maxunicode? Currently, it is possible to
> >   create Unicode strings with either negative or very large Py_UNICODE
> >   elements.
> >
> > - if the answer to the last question is no (i.e. if it is intentional
> >   that a unicode object can contain arbitrary Py_UNICODE values): should
> >   Python then guarantee that Py_UNICODE is an unsigned type?
>
> Py_UNICODE must always be unsigned. The whole implementation
> relies on this and has been designed with this in mind (see
> PEP 100). AFAICT, the configure does check that Py_UNICODE
> is always unsigned.

Martin fixed the crashing problem in 2.5 by making wchar_t unsigned;
that it was signed was a bug.  (A configure test was reversed, IIRC.)
Can this change to wchar_t be made in 2.4?  Technically it changes all
the interfaces, even though the old behavior was a mistake.  What
should be done for 2.4?
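
To make the signedness point concrete, here is a small Python sketch of how the same four bytes read as a signed versus an unsigned 32-bit unit. It is only an illustration: it assumes a 4-byte, little-endian code unit, and as_signed and as_unsigned are made-up helper names, not anything in the interpreter.

```python
import struct
import sys

# Assumption: a 4-byte, little-endian code unit; real builds may differ.
# as_signed and as_unsigned are hypothetical helper names.

def as_signed(raw4):
    """Interpret 4 bytes as a signed 32-bit integer."""
    return struct.unpack("<i", raw4)[0]


def as_unsigned(raw4):
    """Interpret the same 4 bytes as an unsigned 32-bit integer."""
    return struct.unpack("<I", raw4)[0]


raw = b"abc\xe9"  # 8-bit string data whose last byte has the high bit set
print(as_signed(raw))    # negative, so unusable as an array index
print(as_unsigned(raw))  # non-negative, but still above sys.maxunicode
print(as_unsigned(raw) > sys.maxunicode)
```

So an unsigned type rules out the negative index, but by itself it does not keep the values within sys.maxunicode, which is Martin's other question.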

n