[Numpy-discussion] One-byte string dtype: third time's the charm?

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Sun Feb 22 18:38:54 EST 2015


On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:

>
>
> On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris <
> charlesr.harris at gmail.com> wrote:
>
>>
>>
>> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas <
>> aldcroft at head.cfa.harvard.edu> wrote:
>>
>>>
>>>
>>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>
>>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
>>>> <aldcroft at head.cfa.harvard.edu> wrote:
>>>> > The idea of a one-byte string dtype has been extensively discussed twice
>>>> > before, with a lot of good input and ideas, but no action [1, 2].
>>>> >
>>>> > tl;dr: Perfect is the enemy of good.  Can numpy just add a one-byte string
>>>> > dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
>>>> > usage in the near term?
>>>>
>>>> I think this is a good idea. I think overall it would be good for
>>>> numpy to switch to using variable-length strings in most cases (cf.
>>>> pandas), which is a different kind of change, but fixed-length 8-bit
>>>> encoded text is obviously a common on-disk format in scientific
>>>> applications, so numpy will still need some way to deal with it
>>>> conveniently. In the long run we'd like to have more flexibility (e.g.
>>>> allowing choice of character encoding), but since this proposal is a
>>>> subset of that functionality, it won't interfere with later
>>>> improvements. I can see an argument for utf8 over latin1, but it
>>>> really doesn't matter that much so whatever, blue and purple bikesheds
>>>> are both fine.
>>>>
>>>> The tricky bit here is "just" :-). Do you want to implement this? Do
>>>> you know someone who does? It's possible but will be somewhat
>>>> annoying: doing it directly, without first refactoring how dtypes
>>>> work, means adding lots of copy-paste code to all the different
>>>> ufuncs.
>>>>
>>>
>>> I would be happy to have a go at this, with the caveat that someone
>>> who understands numpy would need to get me started with a minimal
>>> prototype.  From there I can do the "annoying" copy-paste for ufuncs etc.,
>>> and write tests and docs.  I'm assuming that, given a prototype, the rest
>>> can be done without any deep understanding of numpy internals (which I do
>>> not have).
>>>
>>> - Tom
>>>
>>>
>>
>> The last two new types added to numpy were float16 and datetime64. Might
>> be worth looking at the steps needed to implement those. There was also a
>> user type, `rational`, that got added, that could also provide a template.
>> Maybe we need to have a way to add 'numpy certified' user data types. It
>> might also be possible to reuse the `c` data type, currently implemented as
>> `S1` IIRC, but that could cause some problems.
>>
>
> OK I'll have a look at those.
>

On second thought... Maybe I'm being naive, but I think that starting from
scratch by looking at entirely new dtypes is harder than it needs to be, or
at least not the most straightforward path [EDIT: just saw email from Nathan
agreeing here].  What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'.  In other words 's'
will interface with Python as a bytes (aka str) object just like 'S'.
- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in
all operations, but interfaces with Python as a str, with the stored bytes
decoded as latin-1.  So the only difference is at the interface layer with
Python (initialization, comparison, iteration, etc).

So as a starting point we would want to clone 'S' to 's', then fix up the
interface to Python 3.  Does that sound about right?
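
To make the Python 3 behaviour concrete, here is a rough sketch that emulates
the proposed interface layer on top of the existing 'S' dtype.  No 's' dtype
exists in numpy; the helper names to_text and to_bytes are hypothetical and
only illustrate the decode-on-read / encode-on-write idea, using the existing
np.char.decode and np.char.encode functions.

```python
import numpy as np

# Existing behaviour: 'S' stores one byte per character and, on Python 3,
# hands back bytes objects.
raw = np.array([b"abc", b"caf\xe9"], dtype="S4")


def to_text(arr):
    """Hypothetical helper: present an 'S' array as Python 3 str (latin-1)."""
    return np.char.decode(arr, "latin-1")


def to_bytes(text, itemsize):
    """Hypothetical helper: store str data back as fixed-width 'S' bytes."""
    return np.char.encode(text, "latin-1").astype("S%d" % itemsize)


text = to_text(raw)                              # unicode ('U') array of str
roundtrip = to_bytes(text, raw.dtype.itemsize)   # back to one byte per char
```

Because latin-1 maps each of the 256 byte values to exactly one code point,
the round trip never fails to decode and the storage stays at one byte per
character, which is the point of the proposal.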

- Tom


>
> Thanks,
> Tom
>
>
>>
>> Chuck
>>
>