On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith <njs@pobox.com> wrote:

On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
<aldcroft@head.cfa.harvard.edu> wrote:
> The idea of a one-byte string dtype has been extensively discussed twice
> before, with a lot of good input and ideas, but no action [1, 2].
>
> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string
> dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
> usage in the near term?

I think this is a good idea. I think overall it would be good for
numpy to switch to using variable-length strings in most cases (cf.
pandas), which is a different kind of change, but fixed-length 8-bit
encoded text is obviously a common on-disk format in scientific
applications, so numpy will still need some way to deal with it
conveniently. In the long run we'd like to have more flexibility (e.g.
allowing choice of character encoding), but since this proposal is a
subset of that functionality, it won't interfere with later
improvements. I can see an argument for utf8 over latin1, but it
really doesn't matter that much so whatever, blue and purple bikesheds
are both fine.
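
To make the trade-off concrete, here is a minimal sketch of how things
stand with the existing dtypes. It assumes only numpy's current 'S'
and 'U' dtypes and the np.char helpers; the proposed 's' dtype does
not exist, so nothing below uses it, and the sample values are made up
for illustration.

```python
import numpy as np

# Fixed-width byte strings: one byte per character on disk and in memory,
# but the elements come back as bytes objects under Python 3.
raw = np.array([b"caf\xe9", b"numpy"], dtype="S5")
print(raw.dtype.itemsize)   # bytes per element for the 'S5' dtype

# Decoding as latin-1 yields str elements stored as 'U', which uses
# four bytes per character (UCS-4).
text = np.char.decode(raw, "latin-1")
print(text.dtype.itemsize)  # bytes per element for the decoded array

# latin-1 maps each byte value to exactly one code point, so encoding
# back reproduces the original bytes.
back = np.char.encode(text, "latin-1")
print((back == raw).all())
```

The decode step is the copy and 4x memory cost that a one-byte text
dtype would presumably avoid.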

The tricky bit here is "just" :-). Do you want to implement this? Do
you know someone who does? It's possible but will be somewhat
annoying, since doing it directly, without first refactoring how
dtypes work, means adding lots of copy-paste code to all the
different ufuncs.

I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the "annoying" copy-paste for ufuncs etc., and write tests and docs. I'm assuming that, given a prototype, the rest can be done without any deep understanding of numpy internals (which I do not have).

- Tom

-n

--
Nathaniel J. Smith -- http://vorpus.org

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion