[Python-Dev] unicode vs buffer (array) design issue can crash interpreter

M.-A. Lemburg mal at egenix.com
Fri Mar 31 11:04:59 CEST 2006


Martin v. Löwis wrote:
> Neal Norwitz wrote:
>> See http://python.org/sf/1454485 for the gory details.  Basically if
>> you create a unicode array (array.array('u')) and try to append an
>> 8-bit string (ie, not unicode), you can crash the interpreter.
>>
>> The problem is that the string is converted without question to a
>> unicode buffer.  Within unicode, it assumes the data to be valid, but
>> this isn't necessarily the case.  We wind up accessing an array with a
>> negative index and boom.
> 
> There are several problems combined here, which might need discussion:
> 
> - why does the 'u#' converter use the buffer interface if available?
>   it should just support Unicode objects. The buffer object makes
>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>   u# shouldn't guess that it is.
>   (FWIW, it currently truncates the buffer size to the next-smaller
>    multiple of sizeof(Py_UNICODE), and silently so)
> 
>   I think that part should just go: u# should be restricted to unicode
>   objects.

'u#' is intended to match 's#', which also uses the buffer
interface. It expects the buffer returned by the object
to be a Py_UNICODE* buffer, hence the calculation of the
length.

However, we already have 'es#', which is a lot safer to use
in this respect: you can explicitly define the encoding you
want to see, e.g. 'unicode-internal', and the associated
codec also takes care of range checks, etc.

So, I'm +1 on restricting 'u#' to Unicode objects.
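
For reference, this failure mode is no longer reachable from pure Python:
in modern interpreters array('u') validates each item before storing it,
rather than blindly reinterpreting a buffer as Py_UNICODE data. A small
sketch of that behaviour (the variable names are illustrative):

```python
import array

a = array.array('u')
a.append('x')                  # a single unicode character is accepted
assert a.tounicode() == 'x'

# The 8-bit string of the original report is rejected with TypeError
# instead of being converted without question.
rejected = False
try:
    a.append(b'abc')
except TypeError:
    rejected = True
assert rejected
```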

> - should Python guarantee that all characters in a Unicode object
>   are between 0 and sys.maxunicode? Currently, it is possible to
>   create Unicode strings with either negative or very large Py_UNICODE
>   elements.
> 
> - if the answer to the last question is no (i.e. if it is intentional
>   that a unicode object can contain arbitrary Py_UNICODE values): should
>   Python then guarantee that Py_UNICODE is an unsigned type?

Py_UNICODE must always be unsigned. The whole implementation
relies on this and has been designed with this in mind (see
PEP 100). AFAICT, the configure script does check that
Py_UNICODE is always unsigned.
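
The consequence of signedness is easy to see with ctypes as a stand-in
for a 16-bit Py_UNICODE (as on the UCS-2 builds of the time); this is
only an illustration, not how CPython defines the type:

```python
import ctypes

# 0xFFFF is a valid UCS-2 code unit.  Stored unsigned it stays a
# large positive value; reinterpreted as signed it becomes -1, i.e.
# exactly the kind of negative index the crash report describes.
# (ctypes integer types truncate silently; no overflow checking.)
as_unsigned = ctypes.c_uint16(0xFFFF).value
as_signed = ctypes.c_int16(0xFFFF).value
assert as_unsigned == 65535
assert as_signed == -1
```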

Regarding the permitted range of values, I think the overhead of
checking that all Py_UNICODE* array values stay within the
currently permitted range would slow down the implementation
too much.
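
For comparison, the one place modern Python does pay for such a range
check is at creation time: chr() validates against sys.maxunicode once,
so the rest of the implementation never has to. A sketch:

```python
import sys

top = chr(sys.maxunicode)          # largest permitted code point
assert ord(top) == sys.maxunicode  # round-trips cleanly

out_of_range_rejected = False
try:
    chr(sys.maxunicode + 1)        # one past the limit is rejected
except ValueError:
    out_of_range_rejected = True
assert out_of_range_rejected
```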

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 31 2006)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

