[Numpy-discussion] Should we allow arrays with "empty string" dtypes?
erik.m.bray+numpy at gmail.com
Fri Oct 9 13:06:45 EDT 2015
This is a post about strings. For the purposes of discussion I'll
assume Python 2, so "string" means a non-unicode string. However,
the discussion applies just as well to unicode strings.
For a long time Numpy has had the following behavior: When creating an
array with a zero-width string dtype like 'S0', Numpy automatically
increases the width of the dtype to support the longest string in the
input, like so:
>>> np.array(['abc', 'de'], dtype='S0') # or equivalently dtype=str
But it *always* converts to a one character string dtype, at a
minimum. So even when passing in a list of empty strings:
>>> np.array(['', '', ''], dtype='S0')
array(['', '', ''],
      dtype='|S1')
>>> np.zeros(3, dtype='S0')
array(['', '', ''],
      dtype='|S1')
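The promotion described above can be checked on a given install with a small script. This is only a sketch: the helper name promoted_itemsize is made up for illustration, and the minimum width of one is what this post describes, so the exact value may depend on the NumPy version in use.

```python
import numpy as np


def promoted_itemsize(values, dtype="S0"):
    """Hypothetical helper: itemsize NumPy actually picks for `values`."""
    return np.array(values, dtype=dtype).dtype.itemsize


# The requested dtype object itself really is zero bytes wide.
requested = np.dtype("S0").itemsize

# Non-empty input: the width grows to fit the longest string, 'abc'.
grown = promoted_itemsize(["abc", "de"])

# All-empty input and np.zeros: the post describes a minimum width of 1.
empty = promoted_itemsize(["", "", ""])
zeros = np.zeros(3, dtype="S0").dtype.itemsize

print(requested, grown, empty, zeros)
```

Comparing `requested` with `empty` and `zeros` shows whether the array constructor silently widened the dtype that was asked for.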
This behavior is encoded in PyArray_NewFromDescr_int and is very
old (since 2006). This made sense at the time, certainly, since
the logic for handling zero-sized strides was shaky, but most issues
with that have long since been worked out.
However, there's an oversight associated with this: it *is*
possible to make a structured dtype that has a zero-width string as
one of its fields. But since even PyArray_View goes through
PyArray_NewFromDescr, viewing such a field results in a non-empty view
that contains garbage and allows writing garbage into a structured
array. This is documented in several issues, such as #473.
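A minimal way to set up the structured-dtype case, assuming field names ("empty", "value") chosen purely for illustration. The script only reports sizes; what the field view looks like (a widened one-byte view versus a true zero-width one) depends on the NumPy version, so no particular output is claimed here.

```python
import numpy as np

# Structured dtype with a zero-width string field next to a 4-byte integer.
# The field names are hypothetical, picked only for this sketch.
dt = np.dtype([("empty", "S0"), ("value", "i4")])
field_dtype, field_offset = dt.fields["empty"][:2]

arr = np.zeros(3, dtype=dt)

# Viewing the zero-width field is the operation discussed above.
view = arr["empty"]

# Field width as declared, record width, and the width the view ended up with.
print(field_dtype.itemsize, dt.itemsize, view.dtype.itemsize)
```

If the last number is larger than the first, the view covers bytes that the field does not own, which is the source of the garbage described above.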
A fix I've proposed in #6430 takes a conservative approach of
keeping all the existing behavior *except* in the case of structured
arrays, where views with a dtype of 'S0' would be allowed. However, a
simpler fix would be to just remove the restriction on creating arrays
of dtype 'S0' in general (with my first example above being one
exception--given a list of strings it will still convert 'S0' to a
dtype that can hold the longest string in the list).
I think I would prefer the general fix, but it would be a slight
change in behavior for any code using PyArray_NewFromDescr to create
string arrays. But would anyone actually be negatively impacted by
such a change? It seems to me that any code that actually relies on
the existing behavior would smell fishy anyway.