[Python-Dev] unicode vs buffer (array) design issue can crash interpreter

"Martin v. Löwis" martin at v.loewis.de
Thu Mar 30 19:08:37 CEST 2006


Neal Norwitz wrote:
> See http://python.org/sf/1454485 for the gory details.  Basically if
> you create a unicode array (array.array('u')) and try to append an
> 8-bit string (i.e., not unicode), you can crash the interpreter.
> 
> The problem is that the string is converted, without question, to a
> unicode buffer.  The unicode code then assumes the data is valid, but
> this isn't necessarily the case.  We wind up accessing an array with
> a negative index, and boom.
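
For reference, here is a rough sketch of the reproduction described
above (Python 2.x of that era; whether it actually crashes depends on
the build, and patched versions reject the 8-bit string instead):

    import array

    a = array.array('u', u'abc')   # a unicode array
    a.append('not unicode')        # an 8-bit string, accepted via the
                                   # buffer interface and misread as
                                   # raw Py_UNICODE data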

There are several problems combined here, which might need discussion:

- why does the 'u#' converter use the buffer interface if available?
  It should just support Unicode objects. A buffer object makes no
  promise that its contents are actually meaningful UCS-2/UCS-4, so
  'u#' shouldn't guess that they are.
  (FWIW, it currently truncates the buffer size to the next-smaller
   multiple of sizeof(Py_UNICODE), and does so silently; see the
   sketch below.)

  I think that part should just go: u# should be restricted to unicode
  objects.
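
  A sketch of that truncation, using array('H') as a stand-in for a
  UCS-2 Py_UNICODE buffer (this assumes sizeof(Py_UNICODE) == 2; on
  UCS-4 builds the multiple is 4):

      import array

      raw = 'abcde'               # 5 raw bytes, not meaningful UCS-2
      usable = len(raw) // 2 * 2  # next-smaller multiple of the item size
      ucs2 = array.array('H')     # unsigned 16-bit stand-in for Py_UNICODE
      ucs2.fromstring(raw[:usable])
      print(len(ucs2))            # -> 2 items; the trailing byte was
                                  #    dropped without a warning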

- should Python guarantee that all characters in a Unicode object
  are between 0 and sys.maxunicode? Currently, it is possible to
  create Unicode strings with either negative or very large Py_UNICODE
  elements.
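
  At the Python level the boundary is enforced (sys.maxunicode is
  65535 on UCS-2 builds and 1114111 on UCS-4 builds), but C code
  faces no such check:

      import sys

      print(sys.maxunicode)     # 65535 (UCS-2) or 1114111 (UCS-4)
      unichr(sys.maxunicode)    # accepted
      # unichr(sys.maxunicode + 1) raises ValueError, but C extension
      # code can still store values outside this range directly into
      # Py_UNICODE slots.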

- if the answer to the last question is no (i.e. if it is intentional
  that a unicode object can contain arbitrary Py_UNICODE values): should
  Python then guarantee that Py_UNICODE is an unsigned type?
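
  A small sketch of why the signedness matters: the same 16-bit
  pattern is a large positive value when unsigned, but a negative one
  when signed, and a negative Py_UNICODE used as an index reads out
  of bounds:

      import struct

      raw = '\xff\xff'                    # one 16-bit code unit, 0xFFFF
      print(struct.unpack('=H', raw)[0])  # 65535 when read as unsigned
      print(struct.unpack('=h', raw)[0])  # -1 when read as signed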

Regards,
Martin

