[Numpy-discussion] Text array dtype for numpy
josef.pktd at gmail.com
Fri Jan 24 23:19:44 EST 2014
On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker <chris.barker at noaa.gov> wrote:
> Cool stuff, thanks!
> I'm wondering though what the use-case really is. The P3 text model
> (actually the py2 one, too), is quite clear that you want users to think of,
> and work with, text as text -- and not care how things are encoded in the
> underlying implementation. You only want the user to think about encodings
> on I/O -- transferring stuff between systems where you can't avoid it. And
> you might choose different encodings based on different needs.
> So why have a different, the-user-needs-to-think-about-encodings numpy
> dtype? We already have 'U' for full-on unicode support for text. There is a
> good argument for a more compact internal representation for text compatible
> with one-byte-per-char encoding, thus the suggestion for such a dtype. But I
> don't see the need for quite this. Maybe I'm not being a creative enough
In my opinion something like Oscar's class would be very useful (with
some adjustments, especially making it easy to create an 'S' view or put
an encoding view on top of an 'S' array).
(Disclaimer: my only experience is in converting some examples in
statsmodels to bytes on py 3 and playing with some examples.)
My guess is that 'S'/bytes is very convenient for library code,
because it doesn't care about encodings (assuming we have enough
control that all bytes are in the same encoding), and we don't have
any overhead to convert to strings when comparing or working with them.
'S' is also very flexible because it doesn't tie us down to a minimum
size for the encoding nor to any specific encoding.
The problem with 'S'/bytes is in input/output and interactive work, as
in the examples of Tom Aldcroft. The textarray dtype would allow us to
view any 'S' array so that we get text/string interaction with python
and the correct encoding on input and output.
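As a hedged sketch of the kind of round-trip this enables, numpy's existing np.char.decode/np.char.encode already do the conversion step (the sample data and the choice of latin1 here are just illustrative):

```python
import numpy as np

# An 'S' array holding latin1-encoded bytes (illustrative data)
raw = np.array([b'abc', b'\xe9t\xe9'], dtype='S3')

# Decode it to text with the right encoding ...
text = np.char.decode(raw, 'latin1')    # dtype 'U' array: ['abc', 'été']

# ... and go back to compact one-byte-per-char storage
back = np.char.encode(text, 'latin1')   # dtype 'S' array again
```

A textarray dtype would do this transparently on element access instead of requiring an explicit conversion of the whole array.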
Whether you live in an ascii, latin1, cp1252, iso8859_5 or any
other world, you could get your favorite minimal-memory representation.
I think this is useful as a complement to the current 'S' type, and to
make that more useful on python 3, independent of whatever small-memory
unicode dtype with a predefined encoding numpy might get.
> Also, we may want numpy to interact at a low level with other libs that
> might have binary encoded text (HDF, etc) -- in which case we need a bytes
> dtype that can store that data, and perhaps encoding and decoding ufuncs.
> If we want a more efficient and compact unicode implementation then the py3
> one is a good place to start -- it's pretty slick! Though maybe harder to do
> in numpy as text in numpy probably wouldn't be immutable.
>> To make a slightly more concrete proposal, I've implemented a pure
>> Python ndarray subclass that I believe can consistently handle
>> text/bytes in Python 3.
> this scares me right there -- is it text or bytes??? We really don't want
> something that is both.
Most users won't care about the internal representation of anything.
But when we want or find it useful we can view the memory with any
compatible dtype. That is, with numpy we always have also raw "bytes".
And there are lots of ways to shoot yourself in the foot.
Why would you want to do that? :
>>> a = np.arange(5)
>>> b = a.view('S4')
>>> b[1] = 'h'
>>> a
array([  0, 104,   2,   3,   4])
>>> a[1] = 'h'
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    a[1] = 'h'
ValueError: invalid literal for int() with base 10: 'h'
>> The idea is that the array has an encoding. It stores strings as
>> bytes. The bytes are encoded/decoded on insertion/access. Methods
>> accessing the binary content of the array will see the encoded bytes.
>> Methods accessing the elements of the array will see unicode strings.
>> I believe it would not be as hard to implement as the proposals for
>> variable length string arrays.
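A minimal sketch of such a class (a hypothetical simplification for illustration, not Oscar's actual implementation) could look like this:

```python
import numpy as np

class textarray(np.ndarray):
    """Sketch: an 'S' (bytes) array that encodes/decodes on element
    access, with a fixed per-array encoding.  Illustrative only."""

    def __new__(cls, data, encoding='latin1'):
        arr = np.asarray(data, dtype='S').view(cls)
        arr.encoding = encoding
        return arr

    def __array_finalize__(self, obj):
        # propagate the encoding to views and slices
        self.encoding = getattr(obj, 'encoding', 'latin1')

    def __getitem__(self, index):
        item = super().__getitem__(index)
        if isinstance(item, bytes):      # scalar access: decode to str
            return item.decode(self.encoding)
        return item                      # slice: still a textarray

    def __setitem__(self, index, value):
        if isinstance(value, str):       # text in: encode to bytes
            value = value.encode(self.encoding)
        super().__setitem__(index, value)
```

So `a[0]` gives a unicode string, while `a.view(np.ndarray)` still exposes the raw encoded bytes.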
> except that with some encodings, the number of bytes required is a function
> of what the content of the text is -- so it either has to be variable
> length, or a fixed number of bytes, which is not a fixed number of
> characters, which requires both careful truncation (a pain) and surprising
> results for users ("why can't I fit 10 characters in a length-10 text
> object? And I can if they are different characters?")
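The quoted concern is easy to demonstrate in plain Python (the choice of character is just illustrative):

```python
# Ten Greek capital omegas: 10 characters, but each needs 2 bytes in utf-8
s = '\u03a9' * 10
print(len(s))                    # 10 characters ...
print(len(s.encode('utf-8')))    # ... but 20 bytes
```

So ten of these characters would not fit in a 10-byte field, while ten ascii characters would.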
Not really different from other places where you have to pay attention
to the underlying dtype, and a question of providing the underlying
information (like itemsize).
1 - 1e-20: I had code like that when I wasn't thinking properly or
wasn't paying enough attention to what I was typing.
>> The one caveat is that it will strip
>> null characters from the end of any string.
> which is fatal, but you do want a new dtype after all, which presumably
> wouldn't do that.
The only place so far where I found that this really hurts is in the
decode examples (with utf32LE, for example).
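A sketch of how the null stripping bites with utf32 (using a little-endian encoding for illustration):

```python
import numpy as np

raw = 'a'.encode('utf-32-le')   # b'a\x00\x00\x00' -- 4 bytes per character
arr = np.array([raw], dtype='S4')

# On element access, numpy strips the trailing null bytes ...
print(arr[0])                   # b'a' -- only 1 byte left

# ... so decoding the element fails, even though the stored bytes were valid
try:
    arr[0].decode('utf-32-le')
except UnicodeDecodeError as e:
    print('decode failed:', e)
```

Encode/decode functions working on the raw buffer would see all four bytes and avoid this.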
That's why I think numpy needs to have decode/encode functions, so it
can access the bytes before they are null truncated, besides being
BTW: I wanted to start a new thread "in defence of (null truncated)
'S' string bytes", but I ran into too many other issues to work out
> Christopher Barker, Ph.D.
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
> Chris.Barker at noaa.gov