[Numpy-discussion] String type again.

Thu Jul 17 11:48:19 EDT 2014

On Tue, Jul 15, 2014 at 4:29 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
> Thinking more about it, the easiest thing to do might be to make the S dtype
> a UTF-8 encoding. Most of the machinery to deal with that is already in
> place. That change might affect some users though, and we might need to do
> some work to make it backwards compatible with python 2.

I'd be very concerned about backcompat for existing code that uses
e.g. "S128" as a dtype to mean "128 arbitrary bytes". An example is
this file format reading code:
   https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L123
The file format says there are 128 bytes there, and their
interpretation depends on other fields in the header -- but in one
case, for "large montages", there's an encoding where every 3 bytes
represents 4 characters using an ad hoc 6-bit character set:
   https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py#L133
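
To make that packing concrete, here is a minimal sketch of unpacking such a field. The alphabet and the bit order (most significant bits first) are assumptions for illustration only; the real table and layout are in the erpss.py code linked above, and `unpack_6bit` is a hypothetical helper, not a function from that project.

```python
import numpy as np

# Hypothetical 64-symbol alphabet, for illustration only.
# The real ad hoc character set lives in the erpss.py reader.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 _"


def unpack_6bit(raw):
    """Decode bytes in which each 3-byte group holds four 6-bit codes.

    Assumes the first code sits in the top 6 bits of the 24-bit group.
    """
    data = np.frombuffer(raw, dtype=np.uint8)
    usable = len(data) - len(data) % 3  # ignore a trailing partial group
    groups = data[:usable].reshape(-1, 3).astype(np.uint32)
    # Join each 3-byte group into one 24-bit word.
    word = (groups[:, 0] << 16) | (groups[:, 1] << 8) | groups[:, 2]
    # Slice the word into four 6-bit codes, highest bits first.
    codes = np.stack([(word >> s) & 0x3F for s in (18, 12, 6, 0)], axis=1)
    return "".join(ALPHABET[int(c)] for c in codes.ravel())


decoded = unpack_6bit(b"\x00\x10\x83")
```

The point is that the 128 bytes are only meaningful as raw bytes; any attempt to treat them as UTF-8 text before this step would be wrong.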

Perhaps this case could be handled better by using a uint8 subarray or
something (that code also goes to some effort to work around NUL
padding), and that particular project hasn't been ported to py3 yet so
technically wouldn't be affected if we changed the meaning of "S" on
py3. But it does seem useful to have a "fixed length bytes" dtype even
in py3, and if we declare that to be "S" then it avoids breaking any
existing code depending on it...
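
For comparison, here is a rough sketch of the two spellings in a structured dtype. The two-field header layout is made up for illustration and is not the erpss format. Note that in NumPy's dtype codes an unsigned 8-bit integer is "u1" ("u8" would be an 8-byte unsigned integer).

```python
import numpy as np

# Hypothetical header layout: a 2-byte magic number followed by 8 raw bytes.
header_s = np.dtype([("magic", "<u2"), ("payload", "S8")])
header_u1 = np.dtype([("magic", "<u2"), ("payload", "u1", (8,))])

# Example buffer: payload has an embedded NUL and three trailing NULs.
raw = b"\x01\x00" + b"ab\x00cd\x00\x00\x00"

rec_s = np.frombuffer(raw, dtype=header_s)[0]
rec_u1 = np.frombuffer(raw, dtype=header_u1)[0]

# "S8" stores all 8 bytes, but trailing NULs are stripped on access.
payload_s = rec_s["payload"]

# The uint8 subarray hands back every byte, padding included.
payload_u1 = rec_u1["payload"].tobytes()
```

Both dtypes occupy the same 10 bytes per record. The difference is only in what you get back on access: the "S" field drops trailing NULs (which is the padding behaviour code like the reader above has to work around), while the subarray keeps them.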

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org