unicode vs buffer (array) design issue can crash interpreter

See http://python.org/sf/1454485 for the gory details. Basically if you create a unicode array (array.array('u')) and try to append an 8-bit string (i.e., not unicode), you can crash the interpreter. The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom. This is perhaps somewhat simplified and not exactly accurate. Probably best to look over the patch and make comments there on how best to fix this issue.

n

Neal Norwitz wrote:
See http://python.org/sf/1454485 for the gory details. Basically if you create a unicode array (array.array('u')) and try to append an 8-bit string (ie, not unicode), you can crash the interpreter.
The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom.
There are several problems combined here, which might need discussion:
- why does the 'u#' converter use the buffer interface if available? it should just support Unicode objects. The buffer object makes no promise that the buffer actually is meaningful UCS-2/UCS-4, so u# shouldn't guess that it is. (FWIW, it currently truncates the buffer size to the next-smaller multiple of sizeof(Py_UNICODE), and silently so) I think that part should just go: u# should be restricted to unicode objects.
- should Python guarantee that all characters in a Unicode object are between 0 and sys.maxunicode? Currently, it is possible to create Unicode strings with either negative or very large Py_UNICODE elements.
- if the answer to the last question is no (i.e. if it is intentional that a unicode object can contain arbitrary Py_UNICODE values): should Python then guarantee that Py_UNICODE is an unsigned type?

Regards,
Martin
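To make the first point concrete, here is a rough sketch of the kind of extension code that runs into the old 'u#' behaviour. The module and function names are made up; this only illustrates the conversion behaviour described above, it is not code from the tracker item.

#include <Python.h>

/* Hypothetical extension function: parses one argument with "u#". */
static PyObject *
unit_count(PyObject *self, PyObject *args)
{
    Py_UNICODE *u;
    int len;

    /* With the pre-patch behaviour, "u#" falls back to the buffer
     * interface, so an 8-bit str such as "abc" is accepted: its raw
     * bytes are reinterpreted as Py_UNICODE units and the length is
     * silently truncated to nbytes / sizeof(Py_UNICODE). */
    if (!PyArg_ParseTuple(args, "u#", &u, &len))
        return NULL;

    /* Anything that now trusts u[0..len-1] to hold valid code points
     * may read garbage values (possibly negative ones on builds where
     * Py_UNICODE ended up signed), which is the crash described above. */
    return PyInt_FromLong(len);
}

static PyMethodDef demo_methods[] = {
    {"unit_count", unit_count, METH_VARARGS,
     "Parse an argument with 'u#' and return the unit count."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initdemo(void)
{
    Py_InitModule("demo", demo_methods);
}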

Martin v. Löwis wrote:
Neal Norwitz wrote:
See http://python.org/sf/1454485 for the gory details. Basically if you create a unicode array (array.array('u')) and try to append an 8-bit string (ie, not unicode), you can crash the interpreter.
The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom.
There are several problems combined here, which might need discussion:
- why does the 'u#' converter use the buffer interface if available? it should just support Unicode objects. The buffer object makes no promise that the buffer actually is meaningful UCS-2/UCS-4, so u# shouldn't guess that it is. (FWIW, it currently truncates the buffer size to the next-smaller multiple of sizeof(Py_UNICODE), and silently so)
I think that part should just go: u# should be restricted to unicode objects.
'u#' is intended to match 's#', which also uses the buffer interface. It expects the buffer returned by the object to be a Py_UNICODE* buffer, hence the calculation of the length.
However, we already have 'es#' which is a lot safer to use in this respect: you can explicitly define the encoding you want to see, e.g. 'unicode-internal', and the associated codec also takes care of range checks, etc.
So, I'm +1 on restricting 'u#' to Unicode objects.
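For comparison, a rough sketch of the 'es#' style mentioned above (hypothetical function name; this assumes the documented Python 2 behaviour where passing a NULL buffer pointer makes getargs allocate the result, which the caller then frees with PyMem_Free()):

static PyObject *
unit_count_checked(PyObject *self, PyObject *args)
{
    char *buf = NULL;   /* NULL: let getargs allocate the buffer for us */
    int nbytes = 0;

    /* "es#" routes the argument through the named codec, here
     * 'unicode-internal', so the result comes from the codec machinery
     * rather than from reinterpreting a raw buffer as with "u#". */
    if (!PyArg_ParseTuple(args, "es#", "unicode-internal", &buf, &nbytes))
        return NULL;

    PyMem_Free(buf);
    return PyInt_FromLong((long)(nbytes / sizeof(Py_UNICODE)));
}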
- should Python guarantee that all characters in a Unicode object are between 0 and sys.maxunicode? Currently, it is possible to create Unicode strings with either negative or very large Py_UNICODE elements.
- if the answer to the last question is no (i.e. if it is intentional that a unicode object can contain arbitrary Py_UNICODE values): should Python then guarantee that Py_UNICODE is an unsigned type?
Py_UNICODE must always be unsigned. The whole implementation relies on this and has been designed with this in mind (see PEP 100). AFAICT, the configure does check that Py_UNICODE is always unsigned.
Regarding the permitted range of values, I think the necessary overhead to check that all Py_UNICODE* array values are within the currently permitted range would unnecessarily slow down the implementation.

--
Marc-Andre Lemburg

On 3/31/06, M.-A. Lemburg <mal@egenix.com> wrote:
Martin v. Löwis wrote:
Neal Norwitz wrote:
See http://python.org/sf/1454485 for the gory details. Basically if you create a unicode array (array.array('u')) and try to append an 8-bit string (ie, not unicode), you can crash the interpreter.
The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom.
There are several problems combined here, which might need discussion:
- why does the 'u#' converter use the buffer interface if available? it should just support Unicode objects. The buffer object makes no promise that the buffer actually is meaningful UCS-2/UCS-4, so u# shouldn't guess that it is. (FWIW, it currently truncates the buffer size to the next-smaller multiple of sizeof(Py_UNICODE), and silently so)
I think that part should just go: u# should be restricted to unicode objects.
'u#' is intended to match 's#', which also uses the buffer interface. It expects the buffer returned by the object to be a Py_UNICODE* buffer, hence the calculation of the length.
However, we already have 'es#' which is a lot safer to use in this respect: you can explicitly define the encoding you want to see, e.g. 'unicode-internal', and the associated codec also takes care of range checks, etc.
So, I'm +1 on restricting 'u#' to Unicode objects.
Note: 2.5 no longer crashes, 2.4 does. Does this mean you would like to see this patch checked in to 2.5? What should we do about 2.4?

Index: Python/getargs.c
===================================================================
--- Python/getargs.c    (revision 45333)
+++ Python/getargs.c    (working copy)
@@ -1042,11 +1042,8 @@
 		STORE_SIZE(PyUnicode_GET_SIZE(arg));
 	}
 	else {
-		char *buf;
-		Py_ssize_t count = convertbuffer(arg, p, &buf);
-		if (count < 0)
-			return converterr(buf, arg, msgbuf, bufsize);
-		STORE_SIZE(count/(sizeof(Py_UNICODE)));
+		return converterr("cannot convert raw buffers"",
+				  arg, msgbuf, bufsize);
 	}
 	format++;
 } else {
- should Python guarantee that all characters in a Unicode object are between 0 and sys.maxunicode? Currently, it is possible to create Unicode strings with either negative or very large Py_UNICODE elements.
- if the answer to the last question is no (i.e. if it is intentional that a unicode object can contain arbitrary Py_UNICODE values): should Python then guarantee that Py_UNICODE is an unsigned type?
Py_UNICODE must always be unsigned. The whole implementation relies on this and has been designed with this in mind (see PEP 100). AFAICT, the configure does check that Py_UNICODE is always unsigned.
Martin fixed the crashing problem in 2.5 by making Py_UNICODE unsigned; its being signed was the bug (a configure test was reversed, IIRC). Can this change be made in 2.4? That technically changes all the interfaces even though it was a mistake. What should be done for 2.4?

n

Neal Norwitz wrote:
On 3/31/06, M.-A. Lemburg <mal@egenix.com> wrote:
Martin v. Löwis wrote:
Neal Norwitz wrote:
See http://python.org/sf/1454485 for the gory details. Basically if you create a unicode array (array.array('u')) and try to append an 8-bit string (ie, not unicode), you can crash the interpreter.
The problem is that the string is converted without question to a unicode buffer. Within unicode, it assumes the data to be valid, but this isn't necessarily the case. We wind up accessing an array with a negative index and boom.
There are several problems combined here, which might need discussion:
- why does the 'u#' converter use the buffer interface if available? it should just support Unicode objects. The buffer object makes no promise that the buffer actually is meaningful UCS-2/UCS-4, so u# shouldn't guess that it is. (FWIW, it currently truncates the buffer size to the next-smaller multiple of sizeof(Py_UNICODE), and silently so)
I think that part should just go: u# should be restricted to unicode objects.
'u#' is intended to match 's#', which also uses the buffer interface. It expects the buffer returned by the object to be a Py_UNICODE* buffer, hence the calculation of the length.
However, we already have 'es#' which is a lot safer to use in this respect: you can explicitly define the encoding you want to see, e.g. 'unicode-internal', and the associated codec also takes care of range checks, etc.
So, I'm +1 on restricting 'u#' to Unicode objects.
Note: 2.5 no longer crashes, 2.4 does.
Does this mean you would like to see this patch checked in to 2.5?
Yes.
What should we do about 2.4?
Perhaps you could add a warning that is displayed when using u# with non-Unicode objects ?!
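The idea might look roughly like this, written against the same hunk shown in the quoted patch below (a sketch of the suggestion only, nothing that was actually written; a 2.4 version would differ in details such as the Py_ssize_t/STORE_SIZE handling):

else {
	char *buf;
	Py_ssize_t count;

	/* Warn, but keep accepting raw buffers for 'u#' for now. */
	if (PyErr_Warn(PyExc_DeprecationWarning,
		       "passing a non-unicode buffer object to 'u#' is unsafe "
		       "and will no longer be accepted") < 0)
		return converterr("unicode", arg, msgbuf, bufsize);

	count = convertbuffer(arg, p, &buf);
	if (count < 0)
		return converterr(buf, arg, msgbuf, bufsize);
	STORE_SIZE(count/(sizeof(Py_UNICODE)));
}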
Index: Python/getargs.c
===================================================================
--- Python/getargs.c    (revision 45333)
+++ Python/getargs.c    (working copy)
@@ -1042,11 +1042,8 @@
 		STORE_SIZE(PyUnicode_GET_SIZE(arg));
 	}
 	else {
-		char *buf;
-		Py_ssize_t count = convertbuffer(arg, p, &buf);
-		if (count < 0)
-			return converterr(buf, arg, msgbuf, bufsize);
-		STORE_SIZE(count/(sizeof(Py_UNICODE)));
+		return converterr("cannot convert raw buffers"",
+				  arg, msgbuf, bufsize);
 	}
 	format++;
 } else {
- should Python guarantee that all characters in a Unicode object are between 0 and sys.maxunicode? Currently, it is possible to create Unicode strings with either negative or very large Py_UNICODE elements.
- if the answer to the last question is no (i.e. if it is intentional that a unicode object can contain arbitrary Py_UNICODE values): should Python then guarantee that Py_UNICODE is an unsigned type?
Py_UNICODE must always be unsigned. The whole implementation relies on this and has been designed with this in mind (see PEP 100). AFAICT, the configure does check that Py_UNICODE is always unsigned.
Martin fixed the crashing problem in 2.5 by making Py_UNICODE unsigned; its being signed was the bug (a configure test was reversed, IIRC). Can this change be made in 2.4? That technically changes all the interfaces even though it was a mistake. What should be done for 2.4?
If users want to interface from wchar_t to Python's Unicode type they have to go through the PyUnicode_FromWideChar() and PyUnicode_AsWideChar() interfaces. Assuming that Py_UNICODE is the same as wchar_t is simply wrong (and always was).
I also think that changing the type from signed to unsigned by backporting the configure fix will only make things safer for the user, since extensions will probably not even be aware of the fact that Py_UNICODE could be signed (it has always been documented to be unsigned).
So +1 on backporting the configure test fix to 2.4.

--
Marc-Andre Lemburg
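For illustration, the supported route through those interfaces looks roughly like this (a hypothetical helper; on 2.4 the size parameters are plain int rather than Py_ssize_t):

#include <Python.h>
#include <wchar.h>

/* Hypothetical helper: round-trip wchar_t data through a Python unicode
 * object without assuming that Py_UNICODE and wchar_t are the same type. */
static int
roundtrip_wide(const wchar_t *in, wchar_t *out, Py_ssize_t outsize)
{
    PyObject *u;
    Py_ssize_t copied;

    u = PyUnicode_FromWideChar(in, (Py_ssize_t)wcslen(in));
    if (u == NULL)
        return -1;

    /* Copies at most outsize wide characters; works regardless of
     * whether Py_UNICODE is 2 or 4 bytes, signed or unsigned. */
    copied = PyUnicode_AsWideChar((PyUnicodeObject *)u, out, outsize);
    Py_DECREF(u);
    return (copied < 0) ? -1 : 0;
}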

On 4/13/06, M.-A. Lemburg <mal@egenix.com> wrote:
Does this mean you would like to see this patch checked in to 2.5?
Yes.
Ok, I checked this in to 2.5 (minus the syntax error).
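Presumably the syntax error was the doubled quote in the converterr() call shown in the patch; the checked-in line would then read something like (a reconstruction, not a quote from the commit):

	return converterr("cannot convert raw buffers",
			  arg, msgbuf, bufsize);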
I also think that changing the type from signed to unsigned by backporting the configure fix will only make things safer for the user, since extensions will probably not even be aware of the fact that Py_UNICODE could be signed (it has always been documented to be unsigned).
So +1 on backporting the configure test fix to 2.4.
I'll leave this decision to Martin or someone else, since I'm not familiar with the ramifications. Since it was documented as unsigned, I think it's reasonable to consider changing. Though it could create signed-ness warnings in other modules. I'm not sure but it's possible it could create problems for C++ compilers since they are pickier.

n

Neal Norwitz wrote:
I'll leave this decision to Martin or someone else, since I'm not familiar with the ramifications. Since it was documented as unsigned, I think it's reasonable to consider changing. Though it could create signed-ness warnings in other modules. I'm not sure but it's possible it could create problems for C++ compilers since they are pickier.
My concern is not so much that it becomes unsigned in 2.4.4, but that it stops being a typedef for wchar_t on Linux. C++ code that uses that assumption might stop compiling. Regards, Martin
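The kind of code at risk looks roughly like this (a hypothetical example): it relies on Py_UNICODE being a typedef for wchar_t, which holds on UCS-4 builds on Linux before the change.

#include <Python.h>
#include <wchar.h>

/* Hypothetical: hand the internal buffer of a unicode object straight
 * to a wchar_t API. */
static void
print_assuming_wchar(PyUnicodeObject *u)
{
    /* Compiles cleanly only while Py_UNICODE is a typedef for wchar_t;
     * once Py_UNICODE becomes an unsigned integer type instead, a C
     * compiler warns about the incompatible pointer types and a C++
     * compiler rejects the conversion outright. The portable route is
     * PyUnicode_AsWideChar(). */
    const wchar_t *w = PyUnicode_AS_UNICODE(u);

    wprintf(L"%ls\n", w);
}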

Martin v. Löwis wrote:
Neal Norwitz wrote:
I'll leave this decision to Martin or someone else, since I'm not familiar with the ramifications. Since it was documented as unsigned, I think it's reasonable to consider changing. Though it could create signed-ness warnings in other modules. I'm not sure but it's possible it could create problems for C++ compilers since they are pickier.
My concern is not so much that it becomes unsigned in 2.4.4, but that it stops being a typedef for wchar_t on Linux. C++ code that uses that assumption might stop compiling.
I'd argue that such code is broken anyway: 3rd party code simply cannot make any assumptions on the typedef behind Py_UNICODE.
Note that you'd only see this change when compiling Python in the non-standard UCS4 setting on Linux.

--
Marc-Andre Lemburg

M.-A. Lemburg wrote:
I'd argue that such code is broken anyway: 3rd party code simply cannot make any assumptions on the typedef behind Py_UNICODE.
The code would work just fine now, but will stop working with the change. Users of the code might not be open to arguments. In any case, it's up to the release manager to decide.
Note that you'd only see this change when compiling Python in the non-standard UCS4 setting on Linux.
Sure. That happens to be the default on many Linux distributions, though. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I'd argue that such code is broken anyway: 3rd party code simply cannot make any assumptions on the typedef behind Py_UNICODE.
The code would work just fine now, but will stop working with the change. Users of the code might not be open to arguments.
Fair enough. Let's leave things as they are for 2.4, then.
In any case, it's up to the release manager to decide.
Note that you'd only see this change when compiling Python in the non-standard UCS4 setting on Linux.
Sure. That happens to be the default on many Linux distributions, though.
Unfortunately, that's true.

--
Marc-Andre Lemburg
participants (3):
- "Martin v. Löwis"
- M.-A. Lemburg
- Neal Norwitz