[Python-Dev] unicode vs buffer (array) design issue can crash interpreter
Neal Norwitz
nnorwitz at gmail.com
Thu Apr 13 07:13:51 CEST 2006
On 3/31/06, M.-A. Lemburg <mal at egenix.com> wrote:
> Martin v. Löwis wrote:
> > Neal Norwitz wrote:
> >> See http://python.org/sf/1454485 for the gory details. Basically if
> >> you create a unicode array (array.array('u')) and try to append an
> >> 8-bit string (ie, not unicode), you can crash the interpreter.
> >>
> >> The problem is that the string is converted without question to a
> >> unicode buffer. Within unicode, it assumes the data to be valid, but
> >> this isn't necessarily the case. We wind up accessing an array with a
> >> negative index and boom.
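(For reference, the trigger looks roughly like the following; this is only a
sketch based on the description above, the exact data and follow-up steps
needed to hit the crash are in the SF report.)

import array

a = array.array('u')          # array of Py_UNICODE items
a.append('\xff\xff\xff\xff')  # an 8-bit str, not unicode: its bytes get
                              # reinterpreted as raw Py_UNICODE data, which
                              # per the report can crash 2.4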
> >
> > There are several problems combined here, which might need discussion:
> >
> > - why does the 'u#' converter use the buffer interface if available?
> > it should just support Unicode objects. The buffer object makes
> > no promise that the buffer actually is meaningful UCS-2/UCS-4, so
> > u# shouldn't guess that it is.
> > (FWIW, it currently truncates the buffer size to the next-smaller
> > multiple of sizeof(Py_UNICODE), and silently so)
> >
> > I think that part should just go: u# should be restricted to unicode
> > objects.
>
> 'u#' is intended to match 's#' which also uses the buffer
> interface. It expects the buffer returned by the object
> to be a Py_UNICODE* buffer, hence the calculation of the
> length.
>
> However, we already have 'es#' which is a lot safer to use
> in this respect: you can explicitly define the encoding you
> want to see, e.g. 'unicode-internal' and the associated
> codec also takes care of range checks, etc.
>
> So, I'm +1 on restricting 'u#' to Unicode objects.
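For what it's worth, at the Python level the 'unicode-internal' codec that
'es#' would use just exposes the raw Py_UNICODE buffer, so the byte size
depends on the build's sizeof(Py_UNICODE). A small sketch:

import sys

u = u'abc'
raw = u.encode('unicode-internal')   # the raw Py_UNICODE representation
print len(raw) // len(u)             # 2 on UCS-2 builds, 4 on UCS-4 builds
assert raw.decode('unicode-internal') == u

Going through the codec (as 'es#' does) at least gives a single place to do
the length and range checks that the raw 'u#' buffer path skips.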
Note: 2.5 no longer crashes, 2.4 does.
Does this mean you would like to see this patch checked in to 2.5?
What should we do about 2.4?
Index: Python/getargs.c
===================================================================
--- Python/getargs.c (revision 45333)
+++ Python/getargs.c (working copy)
@@ -1042,11 +1042,8 @@
 			STORE_SIZE(PyUnicode_GET_SIZE(arg));
 		}
 		else {
-			char *buf;
-			Py_ssize_t count = convertbuffer(arg, p, &buf);
-			if (count < 0)
-				return converterr(buf, arg, msgbuf, bufsize);
-			STORE_SIZE(count/(sizeof(Py_UNICODE)));
+			return converterr("cannot convert raw buffers",
+					  arg, msgbuf, bufsize);
 		}
 		format++;
 	} else {
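With the patch, 'u#' rejects non-unicode objects outright, so the array case
above should fail cleanly instead of crashing -- presumably something like
this (exact message untested):

import array

a = array.array('u')
try:
    a.append('\xff\xff\xff\xff')   # no longer accepted as a raw buffer
except TypeError, e:
    print e                        # clean error from getargs, not a crash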
> > - should Python guarantee that all characters in a Unicode object
> > are between 0 and sys.maxunicode? Currently, it is possible to
> > create Unicode strings with either negative or very large Py_UNICODE
> > elements.
> >
> > - if the answer to the last question is no (i.e. if it is intentional
> > that a unicode object can contain arbitrary Py_UNICODE values): should
> > Python then guarantee that Py_UNICODE is an unsigned type?
>
> Py_UNICODE must always be unsigned. The whole implementation
> relies on this and has been designed with this in mind (see
> PEP 100). AFAICT, the configure script does check that Py_UNICODE
> is always unsigned.
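Regarding the guarantee question above: unichr(), at least, enforces the
sys.maxunicode bound at the Python level; the concern is C-level paths (like
the raw buffer reinterpretation in the bug) that bypass such checks. A quick
illustration on 2.x:

import sys

print sys.maxunicode                 # 0xFFFF on UCS-2, 0x10FFFF on UCS-4 builds
print repr(unichr(sys.maxunicode))   # largest value unichr() will construct
try:
    unichr(sys.maxunicode + 1)
except ValueError, e:
    print e                          # unichr() refuses out-of-range values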
Martin fixed the crashing problem in 2.5 by making wchar_t unsigned; its
being signed was a bug (a configure test was reversed, IIRC). Can this
change to wchar_t be made in 2.4? Strictly speaking it changes all the
interfaces, even though the signed type was a mistake. What should be done
for 2.4?
n