[Numpy-discussion] Should we allow arrays with "empty string" dtypes?

Fri Oct 9 13:06:45 EDT 2015

Hi all,

This is a post about strings.  For the purposes of this discussion
I'll be assuming Python 2, where "string" means a non-unicode string.
However, the discussion applies equally to unicode strings.

For a long time Numpy has had the following behavior: When creating an
array with a zero-width string dtype like 'S0', Numpy automatically
increases the width of the dtype to support the longest string in the
input, like so:

>>> np.array(['abc', 'de'], dtype='S0')  # or equivalently dtype=str
array(['abc', 'de'],
      dtype='|S3')

But it *always* converts to a dtype at least one character wide.  So
even when passing in a list of empty strings:

>>> np.array(['', '', ''], dtype='S0')
array(['', '', ''],
      dtype='|S1')

Or even

>>> np.zeros(3, dtype='S0')
array(['', '', ''],
      dtype='|S1')
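
The same checks can be written as a small script that compares the
width of the dtype as requested with the width of the array that
actually gets built.  This is only a sketch: resolved_width is a
hypothetical helper, not part of Numpy, and it uses bytes literals so
that it also runs on Python 3.  The printed widths for the empty-string
cases depend on the Numpy version, so none are stated here.

```python
import numpy as np


def resolved_width(values, dtype='S0'):
    """Return the itemsize (in bytes) of the array Numpy actually builds.

    Hypothetical helper for illustration only.
    """
    return np.array(values, dtype=dtype).dtype.itemsize


# The dtype object itself is zero bytes wide.
requested = np.dtype('S0')
print(requested.itemsize)

# Given non-empty input, the width grows to fit the longest string.
print(resolved_width([b'abc', b'de']))

# The cases under discussion: all-empty input, and np.zeros.
print(resolved_width([b'', b'', b'']))
print(np.zeros(3, dtype='S0').dtype.itemsize)
```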

This behavior is encoded in PyArray_NewFromDescr_int [1] and is very
old (since 2006) [2].  This made sense at the time, certainly, since
the logic for handling zero-sized strides was shaky, but most issues
with that have long since been worked out.

However, there is an associated oversight: it *is* possible to make a
structured dtype that has a zero-width string as one of its fields.
But since even PyArray_View goes through PyArray_NewFromDescr, viewing
such a field results in a non-empty view that contains garbage and
allows writing garbage into the structured array.  This is documented
in several issues, such as #473 [3].
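
To make the setup concrete, here is a minimal sketch of such a
structured dtype.  The field names 'empty' and 'value' are made up for
illustration.  The script only inspects the dtype metadata and the
integer field; it deliberately does not index the zero-width field,
since what that returns is exactly the version-dependent behavior at
issue.

```python
import numpy as np

# A structured dtype whose first field is a zero-width byte string.
# The field names are hypothetical, chosen for this example.
dt = np.dtype([('empty', 'S0'), ('value', 'i4')])

# dt.fields maps each field name to a (dtype, offset) pair.
field_dtype, field_offset = dt.fields['empty'][:2]
print(field_dtype.itemsize)  # width of the 'empty' field, in bytes
print(dt.itemsize)           # total size of one record, in bytes

# Creating the array is fine because the record as a whole is not
# zero-width; only a view of the 'empty' field would be.
arr = np.zeros(3, dtype=dt)
print(arr['value'])
```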

A fix I've proposed in #6430 [4] takes the conservative approach of
keeping all the existing behavior *except* in the case of structured
arrays, where views with a dtype of 'S0' would be allowed.  However, a
simpler fix would be to just remove the restriction on creating arrays
of dtype 'S0' in general (with my first example above being one
exception--given a list of strings it will still convert 'S0' to a
dtype that can hold the longest string in the list).

I think I would prefer the general fix, but it would be a slight
change in behavior for any code using PyArray_NewFromDescr to create
string arrays.  But would anyone actually be negatively impacted by
such a change?  It seems to me that any code that actually relies on
the existing behavior would smell fishy anyway.

Thanks,
Erik

[1] https://github.com/numpy/numpy/blob/8cb3ec6ab804f594daf553e53e7cf7478656bebd/numpy/core/src/multiarray/ctors.c#L940-L956

[2] https://github.com/numpy/numpy/commit/b022765aa487070866663b1707e4a2a0d8ead2e8

[3] https://github.com/numpy/numpy/issues/473

[4] https://github.com/numpy/numpy/pull/6430