Array and string interoperability

Just sharing my thoughts and a few ideas about simplifying the casting of strings to arrays. In the examples, assume Numpy is in the namespace (from numpy import *).

Initializing an array from a string currently looks like this:

    s = "012 abc"
    A = fromstring(s, "u1")
    print A
    -> [48 49 50 32 97 98 99]

Perfect. But writing values does not work as IMO it should; consider this example:

    B = zeros(7, "u1")
    B[0] = s[1]
    print B
    -> [1 0 0 0 0 0 0]

Ugh? It parses the character s[1], "1", as an integer and writes 1 to B[0]. The first thing I would expect is a ValueError, and I'd never expect such high-level parsing manipulations. IMO it should ideally do the following instead:

    B[0] = s[1]
    print B
    -> [49 0 0 0 0 0 0]

So it should just write ord(s[1]) to B. Sounds logical? To me, very much so. Further, one could write:

    B[:] = s
    print B
    -> [48 49 50 32 97 98 99]

namely, cast the string into a byte array. IMO this would be the logical, expected behavior. Currently it just throws a ValueError when it meets non-digits in the string, so the current casting is hardly of practical use.

Furthermore, I think this code:

    A = array(s, "u1")

could act exactly the same as:

    A = fromstring(s, "u1")

But this is just a side idea for spelling simplicity/generality; not really necessary.

Further thoughts: if one tries to create a "u1" array from a Python 3 string, the question is whether it should throw an error. I think yes, and in that case the "u4" type should be explicitly specified at initialization, I suppose. Translation from unicode to extended ascii (Latin-1) or whatever should be done on the Python side, or with an explicit translation. Python 3 assumes 4-byte strings, but in reality most of the time we deal with 1-byte strings, so there is a huge waste of resources when dealing with 4 bytes. For many serious projects it is just not needed.

Furthermore, I think some of the methods from the "chararray" submodule should be usable directly on normal integer arrays, without conversions to other array types. So I personally don't really get the need for an additional chararray type: it's all numbers anyway, and it's up to the programmer to decide what size of translation tables/value ranges to use. There could be some convenience methods for ascii operations, like e.g. char.toupper(), but currently they don't seem to work with integer arrays, so why not make those potentially useful methods work on normal integer arrays? Or even migrate them to the root namespace, introducing names with prefixes:

    A = ascii_toupper(A)
    A = ascii_tolower(A)

Many things can be achieved with general numeric methods, e.g. translating/reducing the array. Here, obviously, I mean not dynamic arrays, just fixed-size arrays. How to deal with dynamically changing array sizes is a separate problem; it depends on how the software is designed in the first place and what it does with the data. For my own text-editing software project I consider only fixed, pre-allocated 1D and 2D "uint8" arrays. And since I specifically experiment with my own encodings, just as a side note: I don't think much about encoding should be assumed when creating new array types; it is up to the programmer to decide what 'meanings' the bytes have.

Kind regards,
Mikhail
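A minimal sketch of how the byte-wise writes proposed above can already be spelled explicitly with the existing API (assuming Python 3 and a one-byte encoding such as latin-1; ord, encode and np.frombuffer all exist today, the pairing is just illustrative):

    import numpy as np

    s = "012 abc"
    B = np.zeros(7, "u1")

    # write one character's ordinal explicitly
    B[0] = ord(s[1])     # -> [49  0  0  0  0  0  0]

    # copy a whole string as bytes via an explicit encode
    B[:] = np.frombuffer(s.encode("latin-1"), dtype="u1")
    # -> [48 49 50 32 97 98 99]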

On 04/06/17 20:04, Mikhail V wrote:

I think I would prefer this, actually. However, numpy normally treats strings as objects that can sometimes be cast to numbers, so this behaviour is perfectly logical. For what it's worth, in Python 3 (which you probably should want to be using), everything behaves as you'd expect:
There is also something to be said for the current behaviour:
    >>> np.array('100', 'u1')
    array(100, dtype=uint8)
However, the fact that this works for bytestrings on Python 3 is, in my humble opinion, ridiculous:
    >>> np.array(b'100', 'u1')  # b'100' IS NOT TEXT
    array(100, dtype=uint8)
This is of course consistent with the fact that you can cast a bytestring to builtin python int or float (but not complex). Interestingly enough, numpy complex behaves differently from python complex:
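To illustrate the builtin side of that comparison: this is standard Python 3 behaviour (the numpy complex example from the original message is not reproduced here, and the exact error wording varies by Python version):

    >>> int(b'100'), float(b'100')
    (100, 100.0)
    >>> complex(b'100')
    TypeError: complex() first argument must be a string or a number, not 'bytes'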
If you ask me, passing a unicode string to fromstring with sep='' (i.e. to parse binary data) should ALWAYS raise an error: the semantics only make sense for strings of bytes. Currently, there appears to be some UTF-8 conversion going on, which creates potentially unexpected results:
This is, apparently (https://github.com/numpy/numpy/issues/2152), due to how the internals of Python deal with unicode strings in C code, and not due to anything numpy is doing. Speaking of unexpected results, I'm not sure you realize what fromstring does when you give it a multi-byte dtype:
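A sketch of both surprises as they appeared on NumPy of that era under Python 3 (exact output may vary with version and machine byte order):

    >>> np.fromstring('Ж', dtype=np.uint8)      # unicode input silently UTF-8-encoded
    array([208, 150], dtype=uint8)
    >>> np.fromstring('abcd', dtype=np.uint32)  # bytes reinterpreted as one little-endian
    array([1684234849], dtype=uint32)           # integer, not four code points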
Give fromstring() a numpy unicode string, and all is right with the world:
IMHO calling fromstring(..., sep='') with a unicode string should be deprecated and perhaps eventually forbidden. (Or fixed, but that would break backwards compatibility)
That's quite enough anglo-centrism, thank you. For when you need byte strings, Python 3 has a type for that. For when your strings contain text, bytes with no information on encoding are not enough.
chararray is deprecated.
Agreed! -- Thomas

Just a few notes:

> However, the fact that this works for bytestrings on Python 3 is, in my humble opinion, ridiculous:
Yes, that is a mis-feature -- I think due to bytes and string being the same object in py2 -- so on py3, numpy continues to treat a bytes object as also a 1-byte-per-char string, depending on context. And users want to be able to write numpy code that will run the same on py2 and py3, so we kinda need this kind of thing. Makes me think that an optional "pure-py-3" mode for numpy might be a good idea. If that flag is set, your code will only run on py3 (or at least might run differently).
well, you can pass numbers > 255 into a u1 already:

    In [96]: np.array(456, dtype='u1')
    Out[96]: array(200, dtype=uint8)

and it does the wrap-around overflow thing... so why not?
absolutely!

> If you ask me, passing a unicode string to fromstring with sep='' (i.e. to parse binary data) should ALWAYS raise an error: the semantics only make sense for strings of bytes.
exactly -- we really should have a "frombytes()" alias for fromstring(), and it should only work for actual bytes objects (strings on py2, naturally). Overloading fromstring() to mean both "binary dump of data" and "parse the text" depending on whether the sep argument is set was always a bad idea :-( ... and fromstring(s, sep=a_sep_char) has been semi-broken (or at least not robust) forever anyway.

> Currently, there appears to be some UTF-8 conversion going on, which creates potentially unexpected results:

exactly -- py3 strings are a pretty nifty implementation of unicode text -- they have nothing to do with storing binary data, and should not be used that way. There is essentially no reason you would ever want to pass their actual binary representation to any other code. fromstring should be renamed frombytes, and it should raise an exception if you pass anything other than a bytes object (or maybe a memoryview or other binary container?). We might want to keep fromstring() for parsing strings, but only if it were fixed...

> IMHO calling fromstring(..., sep='') with a unicode string should be deprecated and perhaps eventually forbidden. (Or fixed, but that would break backwards compatibility)
agreed.
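A hypothetical sketch of the stricter frombytes() being proposed here -- the name and the str check are the proposal, not an existing NumPy API; the existing np.frombuffer does the actual work:

    import numpy as np

    def frombytes(buf, dtype=np.uint8):
        # hypothetical: accept only real binary containers, never text
        if isinstance(buf, str):
            raise TypeError("frombytes() takes a bytes-like object, not str; "
                            "encode the string explicitly first")
        return np.frombuffer(buf, dtype=dtype)

    frombytes(b'012 abc')    # -> array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
    # frombytes('012 abc')   # -> TypeError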
There was a big thread about this recently -- it seems to have not quite come to a conclusion. But anglo-centrism aside, there is substantial demand for a "smaller" way to store mostly-ascii text. I _think_ the conversation was steering toward an encoding-specified string dtype, so us anglo-centric folks could use latin-1 or utf-8. But someone would need to write the code. -CHB
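A quick illustration of the size gap driving that demand, using the fixed-width dtypes that already exist (one-byte 'S' bytes strings vs four-byte-per-character 'U' unicode):

    >>> import numpy as np
    >>> np.array(['hello'], dtype='S5').nbytes   # one byte per character
    5
    >>> np.array(['hello'], dtype='U5').nbytes   # four bytes per character
    20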
I agree here. But if one were to add such a thing (vectorized string operations), I'd think the thing to do would be to wrap (or port) the python string methods. But it should only work for actual string dtypes, of course.

Note that another part of the discussion previously suggested that we have a dtype that wraps a native python string object -- then you'd get it all for free. This is essentially an object array with strings in it, which you can do now.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov

On 05/06/17 19:40, Chris Barker wrote:
As it happens, this is pretty much what stdlib bytearray does since 3.2 (http://bugs.python.org/issue8990)
--
Thomas Jollans
m ☎ +31 6 42630259
e ✉ tjol@tjol.eu

On Mon, Jun 5, 2017 at 1:51 PM, Thomas Jollans <tjol@tjol.eu> wrote:
I'm not sure that array.array.fromstring() ever parsed the data string as text, did it? Anyway, this is what array.array now has:

array.frombytes(s)
    Appends items from the string, interpreting the string as an array of machine values (as if it had been read from a file using the fromfile() method).
    New in version 3.2: fromstring() is renamed to frombytes() for clarity.

array.fromfile(f, n)
    Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won't do.

array.fromstring()
    Deprecated alias for frombytes().

I think numpy should do the same. And frombytes() should remove the "sep" parameter. If someone wants to write a fast, efficient, simple text parser, then it should get a new name: fromtext() maybe??? And the fromfile() sep argument should be deprecated as well, for the same reasons.

array also has:

array.fromunicode(s)
    Extends this array with data from the given unicode string. The array must be a type 'u' array; otherwise a ValueError is raised. Use array.frombytes(unicodestring.encode(enc)) to append Unicode data to an array of some other type.

which I think would be better supported by:

    np.frombytes(str.encode('UCS-4'), dtype=uint32)

-CHB
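For comparison, the stdlib API described above, runnable as-is on Python 3.2+ ('B' is the unsigned-byte type code, 'u' the unicode-character one):

    from array import array

    a = array('B')               # unsigned byte array
    a.frombytes(b'012 abc')      # pure binary copy -- no text parsing
    print(list(a))               # -> [48, 49, 50, 32, 97, 98, 99]

    u = array('u')               # unicode character array
    u.fromunicode('012 abc')
    print(u.tounicode())         # -> '012 abc'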

On 5 June 2017 at 19:40, Chris Barker <chris.barker@noaa.gov> wrote:
I have started to read that thread, though I got lost in the idea transitions. Likely it was about some new string array type...
> But anglo-centrism aside, there is substantial demand for a "smaller" way to store mostly-ascii text.
Obviously there is demand. The terror of unicode touches many aspects of programmers' lives. It is not Numpy's problem, though. Realistically satisfying this demand is a hard and wide problem. Foremost, it comes down to the question of defining this "optimal 8-bit character table". And "Latin-1", exactly as it is, is not that optimal table, at least because of the huge number of accented letters. But, granted, if one defines most accented letters as "optional", i.e. deletes them, then it is quite a reasonable basic char table to start with. Then comes the question of popularizing the new table (which doesn't even exist yet).
Well, here I must admit I don't quite understand the whole idea of a "numpy array of string type". How is it used? What is the main benefit/feature...?

Example integer array usage in the context of textual data in my case:
- holding data in a text editor (mutability + indexing/slicing)
- filtering, transformations (e.g. table translations, cryptography, etc.)

String type array? Will this be the string array you describe:

    s = "012 abc"
    arr = np.array(s)
    print ("type ", arr.dtype)
    print ("shape ", arr.shape)
    print ("my array: ", arr)
    arr = np.roll(arr[0], 2)
    print ("my array: ", arr)

    ->
    type  <U7
    shape  ()
    my array:  012 abc
    my array:  012 abc

So what does it do? What's up with the shape? E.g. here I wanted to 'roll' the string. How would I replace chars? Or delete? What is the general idea behind it?

Mikhail

On Mon, Jun 5, 2017 at 4:06 PM, Mikhail V <mikhailwas@gmail.com> wrote:
> Likely it was about some new string array type...
yes, it was.
> Obviously there is demand. The terror of unicode touches many aspects of programmers' lives.
I don't know that I'd call it Terror; frankly, the fact that you need up to 4 bytes for a single character is really not that big an issue. Given that computer memory has grown by literally orders of magnitude since Unicode was introduced, I don't know why there is such a hang-up about it. But we're scientific programmers -- we like to be efficient!
there is no such thing as a single "optimal" set of characters when you are limited to 255 of them... latin-1 is pretty darn good for the, well, latin-based languages....
Then you are down to ASCII, no? But anyway, I don't think a new encoding is really the topic at hand here....
here you go -- you can do this now:

    In [74]: s_arr = np.array([s, "another string"], dtype=np.object)

    In [75]: s_arr
    Out[75]: array(['012 АБВ', 'another string'], dtype=object)

    In [76]: s_arr.shape
    Out[76]: (2,)

You now have an array with python string objects in it -- thus access to all the string functionality:

    In [81]: s_arr[1] = s_arr[1].upper()

    In [82]: s_arr
    Out[82]: array(['012 АБВ', 'ANOTHER STRING'], dtype=object)

and the ability to have each string be a different length. If numpy knew that those were string objects, rather than arbitrary python objects, it could do vectorized operations on them, etc. You can do that now with numpy.vectorize, but it's pretty klunky:

    In [87]: np_upper = np.vectorize(str.upper)

    In [88]: np_upper(s_arr)
    Out[88]: array(['012 АБВ', 'ANOTHER STRING'], dtype='<U14')
> Example integer array usage in the context of textual data in my case: holding data in a text editor (mutability + indexing/slicing)
you really want to use regular old python data structures for that...
> - filtering, transformations (e.g. table translations, cryptography, etc.)
that may be something to do with ordinals and numpy -- but then you need to work with ascii or latin-1 and uint8 dtypes, or full Unicode and the uint32 dtype -- that's that.

> String type array? Will this be the string array you describe:
shape is an empty tuple, meaning this is a numpy scalar containing a single string. The type '<U7' means little-endian, unicode, 7 characters.
the numpy string type (unicode type) works with fixed-length strings -- not characters -- but you can reshape it and make a view:

    In [89]: s = "012 abc"

    In [90]: arr.shape = (1,)

    In [91]: arr.shape
    Out[91]: (1,)

    In [93]: c_arr = arr.view(dtype='<U1')

    In [97]: np.roll(c_arr, 3)
    Out[97]: array(['a', 'b', 'c', '0', '1', '2', ' '], dtype='<U1')

You could also create it as a character array in the first place by unpacking the string into a list first:

    In [98]: c_arr = np.array(list(s))

    In [99]: c_arr
    Out[99]: array(['0', '1', '2', ' ', 'a', 'b', 'c'], dtype='<U1')

-CHB

On 4 June 2017 at 23:59, Thomas Jollans <tjol@tjol.eu> wrote:
Ok, examples do best. I think we have to separate cases though, so I will do examples in recent Python 3 now to avoid confusion.

Case divisions:

-- classify by "forward/backward" conversion: for this time consider only forward, i.e. I copy data from a string to a numpy array.

-- classify by "bytes vs ordinals":

a) bytes: if I need raw bytes, then e.g.

    B = bytes(s.encode())

will do it, and then I can copy the data to an array. So currently there are methods covering this. If I understand correctly, the data extracted corresponds to a utf-?? byte feed, i.e. a non-constant byte length per char (1 up to 4 bytes per char for the 'wide' unicode; correct me if I am wrong).

b) ordinals: yes, I need *ordinals*. For the bytes() method, if a Python 3 string contains only basic ascii, I can convert to bytes and then to an integer array, and the length will be the same, 1 byte for each char. Syntactically, with slicing, this looks e.g. like:

    s = "012 abc"
    B = bytes(s.encode())    # convert to bytes
    k = len(s)
    arr = np.zeros(k, "u1")  # init empty array of length k
    arr[0:2] = list(B[0:2])
    print ("my array: ", arr)
    -> my array:  [48 49 0 0 0 0 0]

The result seems correct. Note that I also need to use list(B), otherwise the slicing does not work (it fills both values with 1; no idea where the 1 comes from). Or I can write e.g.:

    arr[0:2] = np.fromstring(B[0:2], "u1")

But that indeed looks like a 'hack' and is not so simple. Considering your other examples there is another (better?) way, see below. Note, I personally don't know the best practices and many technical nuances here, so I repeat it from your words.

-- classify by "what is the maximal ordinal value in the string": well, say I don't know what the maximal ordinal is; e.g. here I take 3 Cyrillic letters instead of 'abc':

    s = "012 АБВ"
    k = len(s)
    arr = np.zeros(k, "u4")  # init empty 32-bit array of length k
    arr[:] = np.fromstring(np.array(s), "u4")
    -> [  48   49   50   32 1040 1041 1042]

This gives correct results indeed, so I get my ordinals as expected. So this is the better/preferred way, right? Ok...

Just some further thoughts on the topic: I would want to do the above things in a simpler syntax. For example, if there were methods taking Python strings:

    arr = np.ordinals(s)
    arr[0:2] = np.ordinals(s[0:2])  # with slicing

or, e.g. in such a format:

    arr = np.copystr(s)
    arr[0:2] = np.copystr(s[0:2])

which would give me the same result as your proposed:

    arr = np.fromstring(np.array(s), "u4")
    arr[0:2] = np.fromstring(np.array(s[0:2]), "u4")

IOW, omitting the "u4" parameter seems to be OK: e.g. if on the left side of the assignment there is a "u1" array, the values would be silently wrapped(?) according to Numpy rules (as Chris pointed out). And similarly for the backward conversion to a Python string. Though for Python 2 this could raise the question of why casting to "u4" is needed. It would be cool to just use = without any methods, as I originally supposed, but as I understand now that behaviour is already occupied, and touching it would cause backward-compatibility issues.

Those are approximately my ideas. For me it would cover many application cases.

Mikhail

On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <mikhailwas@gmail.com> wrote:
no need to call "bytes" -- encode() returns a bytes object: In [1]: s = "this is a simple ascii-only string" In [2]: b = s.encode() In [3]: type(b) Out[3]: bytes In [4]: b Out[4]: b'this is a simple ascii-only string'
    In [5]: s.encode?
    Docstring: S.encode(encoding='utf-8', errors='strict') -> bytes

So the default is utf-8, but you can set any encoding you want (that python supports):

    In [6]: s.encode('utf-16')
    Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00i\x00m\x00p\x00l\x00e\x00 \x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'
This can be done more cleanly:

    In [15]: s = "012 abc"

    In [16]: b = s.encode('ascii')

(you want to use the ascii encoding so you don't get utf-8 cruft if there are non-ascii characters; you could use latin-1 too, or any other one-byte-per-char encoding)

    In [17]: arr = np.fromstring(b, np.uint8)

(this is using fromstring() in its old sense -- treat the contents as bytes -- it really should be called "frombytes()")

You could also use:

    In [22]: np.frombuffer(b, dtype=np.uint8)
    Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)

    In [19]: print(arr)
    [48 49 50 32 97 98 99]

you got the ordinals:

    In [20]: "".join([chr(i) for i in arr])
    Out[20]: '012 abc'

yes, they are the right ones...
that is odd -- I can't explain it right now either...
is the above OK?
so this is making a numpy string, which is UCS-4-encoded unicode -- i.e. 4 bytes per character. Then you are converting that to a 4-byte unsigned int. But there's no need to do it with fromstring:

    In [52]: s
    Out[52]: '012 АБВ'

    In [53]: s_arr.reshape((1,)).view(np.uint32)
    Out[53]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

We need the reshape() because .view does not work with array scalars -- not sure why not?
> This gives correct results indeed, so I get my ordinals as expected. So this is the better/preferred way, right?
I would maybe do it more "directly" -- i.e. use python's string to do the encoding:

    In [64]: s
    Out[64]: '012 АБВ'

    In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
    Out[67]: array([65279,    48,    49,    50,    32,  1040,  1041,  1042], dtype=uint32)

that first value is the byte-order mark (I think...); you can strip it off with:

    In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
    Out[68]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

or, probably better, simply specify the byte order in the encoding:

    In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
    Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

> arr = np.ordinals(s)
I don't think any of this is necessary -- the UCS4 (or UTF-32) "encoding" is pretty much the ordinals anyway. As you noticed, if you make a numpy unicode string array and change the dtype to unsigned int32, you get what you want. You really don't want to mess with any of this unless you understand unicode and encodings anyway.... Though it is a bit awkward -- what is your actual use-case for working with ordinals???

BTW, you can use regular python to get the ordinals first:

    In [71]: np.array([ord(c) for c in s])
    Out[71]: array([  48,   49,   50,   32, 1040, 1041, 1042])

> Though for Python 2 this could raise the question of why casting to "u4" is needed.
this would all work the same with python 2 if you used unicode objects instead of strings. Maybe good to put:

    from __future__ import unicode_literals

in your source....
> Those are approximately my ideas. For me it would cover many application cases.
I'm still curious as to your use-cases -- when do you have a bunch of ordinal values??

-CHB

On 7 June 2017 at 00:05, Chris Barker <chris.barker@noaa.gov> wrote:
On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Thanks for clarifying, that makes sense. Also it's a good way to validate the string.
Ok, this gives what I want too. So now for unicode I have two possible options (apart from the possible "fromstring" spelling) with indexing, if I want to copy into an already existing array on the fly:

    arr[0:3] = np.fromstring(np.array(s[0:3]), "u4")
    arr[0:3] = np.fromstring(s[0:3].encode('UTF-32LE'), "u4")
No, I am not implying that anything is necessary; it just seems to be sort of a pattern. And from the Python 3 perspective, where string indexing is by wide characters... well, I don't know.
> Example integer array usage in the context of textual data in my case: holding data in a text editor (mutability + indexing/slicing)
I am intentionally choosing a fixed-size array for holding the data and writing values using indexes. But wait a moment: characters *are* integers, identities, [put some other name here].
So here it prints ['a', 'b', 'c', '0', '1', '2', ' '], which is the same data; it is just a matter of printing. If we talk about methods already available in particular libs, then well, yes, they are set up to work on specific object types only. But generally speaking, if I want to select e.g. specific character values, or I am selecting specific values in some discrete sets... But I have no experience with numpy string types and cannot feel their real purposes yet.
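For instance, selecting specific character values with a boolean mask on a uint8 array -- a small sketch of the kind of filtering meant here, using only existing numpy features:

    import numpy as np

    arr = np.frombuffer("012 abc".encode("latin-1"), dtype=np.uint8)
    digits = arr[(arr >= ord('0')) & (arr <= ord('9'))]
    # -> array([48, 49, 50], dtype=uint8)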
------- (Off topic here)

Yeah, it depends much on the criteria of 'optimality' and many other things ;)
No, then I am down to ASCII plus a few vital characters, e.g.:
- dashes (which could solve the painful, old-as-the-world problem of "hyphen" vs "minus")
- multiplication sign, degree sign
- em dash, quotation marks, spaces (non-breaking, half) -- all vital for typesetting
...
If you think about it, 255 units is more than enough to define perfect communication standards.
> But anyway, I don't think a new encoding is really the topic at hand here....
Yes, I think this is off-topic on this list. But it is interesting indeed where it would be on-topic. It seems like those encodings come from some "mysterious castle in the clouds".

Mikhail
