[Numpy-discussion] One-byte string dtype: third time's the charm?

Nathaniel Smith njs at pobox.com
Sun Feb 22 18:44:57 EST 2015


On Feb 22, 2015 3:39 PM, "Aldcroft, Thomas" <aldcroft at head.cfa.harvard.edu>
wrote:
>
>
>
> On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas
> <aldcroft at head.cfa.harvard.edu> wrote:
>>
>>
>>
>> On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>>
>>>
>>>
>>> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas
>>> <aldcroft at head.cfa.harvard.edu> wrote:
>>>>
>>>>
>>>>
>>>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>>
>>>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
>>>>> <aldcroft at head.cfa.harvard.edu> wrote:
>>>>> > The idea of a one-byte string dtype has been extensively discussed
>>>>> > twice before, with a lot of good input and ideas, but no action [1, 2].
>>>>> >
>>>>> > tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte
>>>>> > string dtype named 's' that uses latin-1 encoding as a bridge to enable
>>>>> > Python 3 usage in the near term?
>>>>>
>>>>> I think this is a good idea. I think overall it would be good for
>>>>> numpy to switch to using variable-length strings in most cases (cf.
>>>>> pandas), which is a different kind of change, but fixed-length 8-bit
>>>>> encoded text is obviously a common on-disk format in scientific
>>>>> applications, so numpy will still need some way to deal with it
>>>>> conveniently. In the long run we'd like to have more flexibility (e.g.
>>>>> allowing choice of character encoding), but since this proposal is a
>>>>> subset of that functionality, then it won't interfere with later
>>>>> improvements. I can see an argument for utf8 over latin1, but it
>>>>> really doesn't matter that much so whatever, blue and purple bikesheds
>>>>> are both fine.
>>>>>
>>>>> The tricky bit here is "just" :-). Do you want to implement this? Do
>>>>> you know someone who does? It's possible but will be somewhat
>>>>> annoying, since to do it directly without refactoring how dtypes work
>>>>> first then you'll have to add lots of copy-paste code to all the
>>>>> different ufuncs.
>>>>
>>>>
>>>> I would be happy to have a go at this, with the caveat that someone
>>>> who understands numpy would need to get me started with a minimal
>>>> prototype.  From there I can do the "annoying" copy-paste for ufuncs etc,
>>>> writing tests and docs.  I'm assuming that with a prototype then the rest
>>>> can be done without any deep understanding of numpy internals (which I do
>>>> not have).
>>>>
>>>> - Tom
>>>>
>>>
>>>
>>> The last two new types added to numpy were float16 and datetime64.
>>> Might be worth looking at the steps needed to implement those. There was
>>> also a user type, `rational`, that got added, that could also provide a
>>> template. Maybe we need to have a way to add 'numpy certified' user data
>>> types. It might also be possible to reuse the `c` data type, currently
>>> implemented as `S1` IIRC, but that could cause some problems.
>>
>>
>> OK I'll have a look at those.
>
>
> On second thought..  Maybe I'm being naive, but I think that starting
> from scratch looking at entirely new dtypes is harder than it needs to be,
> or at least not the most straightforward path [EDIT: just saw email from
> Nathan agreeing here].  What is being proposed is essentially:
>
> - For Python 2, the 's' type is exactly a clone of 'S'.  In other words
>   's' will interface with Python as a bytes (aka str) object just like 'S'.
> - For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in
>   all operations, but interfaces with Python as a latin-1 encoded string.  So
>   the only difference is at the interface layer with Python (initialization,
>   comparison, iteration, etc).
>
> So as a starting point we would want to clone 'S' to 's', then fix up the
> interface to Python 3.  Does that sound about right?

Sounds reasonable to me.

You'll also want to consider interactions between the dtypes -- mixed
operations like
  array("a", dtype="s") == array("a", dtype="U")
should do the right thing, and casting s<->U ditto.
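
To make that concrete, here is a rough sketch of the behaviour one might
expect, emulated with today's 'S' and 'U' dtypes since 's' does not exist.
The helper names (s_to_u, u_to_s, mixed_equal) are hypothetical and only
illustrate the idea; the one real assumption is that the proposed dtype
would convert through latin-1, as in Tom's description.

```python
import numpy as np

# Hypothetical helpers: emulate the proposed 's' dtype with a plain 'S'
# array whose bytes are interpreted as latin-1 text.

def s_to_u(arr):
    """Cast bytes ('S') to unicode ('U'), decoding each element as latin-1."""
    return np.char.decode(arr, "latin-1")

def u_to_s(arr):
    """Cast unicode ('U') to bytes ('S'), encoding each element as latin-1."""
    return np.char.encode(arr, "latin-1")

def mixed_equal(s_arr, u_arr):
    """Compare by text content: decode the byte side first, then compare."""
    return s_to_u(s_arr) == u_arr

raw = np.array([b"a", b"caf\xe9"], dtype="S4")       # one byte per character
text = np.array(["a", "caf\u00e9"], dtype="U4")      # same text as unicode

same = mixed_equal(raw, text)        # elementwise boolean result
round_trip = u_to_s(s_to_u(raw))     # s -> U -> s should be lossless
```

Because latin-1 maps every byte value 0-255 to exactly one code point, the
s -> U direction cannot fail; U -> s can, for any character above U+00FF.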

-n