Hi,

I was surprised by this - should I have been?

In [35]: e = np.array(['a'])
In [36]: e.shape
Out[36]: (1,)
In [37]: e.size
Out[37]: 1
In [38]: e.tostring()
Out[38]: 'a'
In [39]: f = np.array(['a'])
In [40]: f.shape == e.shape
Out[40]: True
In [41]: f.size == e.size
Out[41]: True
In [42]: f.tostring()
Out[42]: 'a'
In [43]: z = np.array(['\x00'])
In [44]: z.shape
Out[44]: (1,)
In [45]: z.size
Out[45]: 1
In [46]: z
Out[46]: array([''], dtype='|S1')

That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...

Best,
Matthew
On Wed, Dec 30, 2009 at 8:35 AM, Matthew Brett
Hi,
I was surprised by this - should I have been?
In [35]: e = np.array(['a'])
In [36]: e.shape Out[36]: (1,)
In [37]: e.size Out[37]: 1
In [38]: e.tostring() Out[38]: 'a'
In [39]: f = np.array(['a'])
In [40]: f.shape == e.shape Out[40]: True
In [41]: f.size == e.size Out[41]: True
In [42]: f.tostring() Out[42]: 'a'
In [43]: z = np.array(['\x00'])
In [44]: z.shape Out[44]: (1,)
In [45]: z.size Out[45]: 1
In [46]: z Out[46]: array([''], dtype='|S1')
That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...
I don't see any empty string in your code? They all have one byte. The last one is slightly confusing as far as printing is concerned (I would have expected array(["\x00"]...) instead). It may be a bug in numpy because a byte with value 0 is used as a string delimiter in C.

cheers,
David
Hi,
I don't see any empty string in your code? They all have one byte. The last one is slightly confusing as far as printing is concerned (I would have expected array(["\x00"]...) instead). It may be a bug in numpy because a byte with value 0 is used as a string delimiter in C.
Sorry - I pasted the wrong code:

In [49]: e = np.array([''])
In [50]: e.shape
Out[50]: (1,)
In [51]: e.size
Out[51]: 1
In [52]: f = np.array(['a'])
In [53]: f.shape == e.shape
Out[53]: True
In [54]: f.size == e.size
Out[54]: True
In [55]: e.tostring()
Out[55]: '\x00'
In [56]: f.tostring()
Out[56]: 'a'
In [58]: e == z
Out[58]: array([ True], dtype=bool)

Thanks,
Matthew
On Wed, Dec 30, 2009 at 8:52 AM, Matthew Brett
In [58]: e == z Out[58]: array([ True], dtype=bool)
Ok, it looks like there are at least two issues:

- if an item in a string array is set to '\x00', this seems to be replaced with '', but '' != '\x00':

x = np.array(["\x00"])
x[0] == ''  # True, but should be False ?

- if an item in a string array is set to '', tostring will convert it to '\x00':

x = np.array([""])
x.tostring() == "\x00"  # True, but should be False ?

I guess the root cause is that there does not seem to be a "|S0" type - but that may be difficult to implement, since the array would have > 0 items, but a 0 size. It may have other, equally quirky behavior.

What do you need this for ?

cheers,
David
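For readers on Python 3 with a recent NumPy, the two issues listed above can be reproduced with the short sketch below. It assumes byte strings (b'...') so that the arrays get the same 'S' dtype as in the thread, and it uses tobytes() in place of the older tostring(). The variable names are illustrative only.

```python
import numpy as np

# Issue 1: a stored null byte reads back as an empty string.
x = np.array([b"\x00"])
null_reads_as_empty = bool(x[0] == b"")

# Issue 2: a stored empty string occupies one null byte in the buffer.
y = np.array([b""])
empty_stored_as_null = y.tobytes() == b"\x00"

print(x.dtype, y.dtype)            # both arrays share the same one-byte dtype
print(null_reads_as_empty, empty_stored_as_null)
```

Both flags come out true because the two arrays hold exactly the same single byte, which is the ambiguity being discussed.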
Hi,
Ok, it looks like there are at least two issues: - if an item in a string array is set to '\x00', this seems to be replaced with '', but '' != '\x00'
Sorry - I'm afraid I don't understand. It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0.
What do you need this for ?
I noticed it when I found that writing an empty string array to matlab resulted in a single character array when loaded into matlab. I guess that I will have to special-case the writing code to detect 'empty' strings, but I can't (I don't think) distinguish a real string with \x00 from an empty string. Thanks a lot, Matthew
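The difficulty with special-casing the writer can be illustrated with a small sketch. The helper visible_lengths below is hypothetical (it is not part of NumPy or of any MATLAB writer); it only reports the length of each element as NumPy hands it back, assuming Python 3 byte strings.

```python
import numpy as np

def visible_lengths(arr):
    """Length of each element as NumPy returns it (trailing nulls dropped)."""
    return [len(item) for item in arr.ravel().tolist()]

arr = np.array([b"", b"\x00", b"ab"])   # stored with a fixed two-byte item size
lengths = visible_lengths(arr)
print(arr.dtype, lengths)
```

The first two elements both report length 0, so a check based on element length cannot tell an empty string from one that was given a null byte.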
On Wed, Dec 30, 2009 at 9:33 AM, Matthew Brett
Hi,
Ok, it looks like there are at least two issues: - if an item in a string array is set to '\x00', this seems to be replaced with '', but '' != '\x00'
Sorry - I'm afraid I don't understand
Compare this:

x = "\00"
arr = np.array([x])
lst = [x]
arr[0] == x   # False
arr[0] == ""  # True
lst[0] == x   # True
lst[0] == ""  # False

It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0
Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0). If you create a list with an empty string (x = [""]), you have len(x) == 1 and len(x[0]) == 0. But an empty string has size 0, so the corresponding dtype should have an itemsize of 0 (assuming the array only contains empty strings).
I guess that I will have to special-case the writing code to detect 'empty' strings, but I can't (I don't think) distinguish a real string with \x00 from an empty string.
In python "proper", they are different: "\x00" != "". The problem is that it does not seem possible ATM to create a numpy array with an empty string.

David
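The list-versus-array comparison above, and the point about item size, can be checked with the sketch below. It assumes Python 3 byte strings and a recent NumPy; the dictionary is just a convenient way to collect the four comparisons.

```python
import numpy as np

x = b"\x00"
arr = np.array([x])
lst = [x]

results = {
    "arr[0] == x": bool(arr[0] == x),      # array element lost its null byte
    "arr[0] == b''": bool(arr[0] == b""),
    "lst[0] == x": lst[0] == x,            # plain Python keeps the byte
    "lst[0] == b''": lst[0] == b"",
}

# An array built from an empty string still reserves one byte per item.
empty_itemsize = np.array([b""]).itemsize

for name, value in results.items():
    print(name, value)
print("itemsize of array([b'']):", empty_itemsize)
```

The item size of 1 for an array of empty strings is what stands in for the missing zero-width string type mentioned above.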
Hi,
x = "\00" arr = np.array([x]) lst = [x]
arr[0] == x # False arr[0] == "" # True
lst[0] == x # True lst[0] == "" # False
Ah - thanks - got it.
It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0
Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0).
Is this a string-specific thing? I mean, you can have size 0 1d numeric arrays. Sorry if I'm being slow, it's late here.

In [70]: np.array([[]]).shape
Out[70]: (1, 0)
In [71]: np.array([[]]).size
Out[71]: 0

Cheers,
Matthew
On Wed, Dec 30, 2009 at 10:20 AM, Matthew Brett
Hi,
x = "\00" arr = np.array([x]) lst = [x]
arr[0] == x # False arr[0] == "" # True
lst[0] == x # True lst[0] == "" # False
Ah - thanks - got it.
It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0
Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0).
Is this a string-specific thing?
No. I was not very clear: my point was that a size 0 array is likely not what you want. What you want is arrays whose *itemsize* is 0.
I mean, you can have size 0 1d numeric arrays. Sorry if I'm being slow, it's late here.
In [70]: np.array([[]]).shape Out[70]: (1, 0)
In [71]: np.array([[]]).size Out[71]: 0
Yes, you can create an array with size 0. But I don't think that's what you want - you cannot index them normally (even though the array is 2d, you cannot do arr[0][0], so I don't think that's very useful for your case).

David
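The distinction between an array of size 0 and an array holding one empty item can be seen in the following sketch, assuming a recent NumPy on Python 3. The names empty2d and one_empty are illustrative.

```python
import numpy as np

empty2d = np.array([[]])          # two dimensions, but no elements at all
shape, size = empty2d.shape, empty2d.size

try:
    empty2d[0][0]                 # there is no element to fetch
    indexable = True
except IndexError:
    indexable = False

one_empty = np.array([b""])       # one element, which happens to be empty
print(shape, size, indexable)
print(one_empty.shape, one_empty.size, len(one_empty[0]))
```

The second array keeps its single element addressable, which is why a zero item size, rather than a zero array size, is the closer match to a list containing one empty string.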
Hmmm... I don't see where you created "an empty string array" in your examples. All three of your arrays contain one element, so they are not empty. The element in the array z happens to be zero, is all.

Here's an example of creating an empty string array:

In [2]: e = np.array([],dtype='S')
In [3]: e
Out[3]: array([], dtype='|S1')
In [4]: e.shape
Out[4]: (0,)

Warren

Matthew Brett wrote:
Hi,
I was surprised by this - should I have been?
In [35]: e = np.array(['a'])
In [36]: e.shape Out[36]: (1,)
In [37]: e.size Out[37]: 1
In [38]: e.tostring() Out[38]: 'a'
In [39]: f = np.array(['a'])
In [40]: f.shape == e.shape Out[40]: True
In [41]: f.size == e.size Out[41]: True
In [42]: f.tostring() Out[42]: 'a'
In [43]: z = np.array(['\x00'])
In [44]: z.shape Out[44]: (1,)
In [45]: z.size Out[45]: 1
In [46]: z Out[46]: array([''], dtype='|S1')
That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...
Best,
Matthew _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
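Warren's example of a string array with no elements can be contrasted with an array holding one empty string. The sketch below assumes Python 3 and a recent NumPy, and uses tobytes() rather than tostring().

```python
import numpy as np

no_items = np.array([], dtype="S")    # zero elements
one_item = np.array([b""])            # one element that is an empty string

no_items_buffer = no_items.tobytes()
one_item_buffer = one_item.tobytes()

print(no_items.shape, no_items.size, no_items_buffer)
print(one_item.shape, one_item.size, one_item_buffer)
```

Only the first array has an empty buffer; the second still stores one null byte for its single element.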
On Tue, Dec 29, 2009 at 4:35 PM, Matthew Brett
Hi,
I was surprised by this - should I have been?
In [35]: e = np.array(['a'])
In [36]: e.shape Out[36]: (1,)
In [37]: e.size Out[37]: 1
In [38]: e.tostring() Out[38]: 'a'
In [39]: f = np.array(['a'])
In [40]: f.shape == e.shape Out[40]: True
In [41]: f.size == e.size Out[41]: True
In [42]: f.tostring() Out[42]: 'a'
In [43]: z = np.array(['\x00'])
In [44]: z.shape Out[44]: (1,)
In [45]: z.size Out[45]: 1
In [46]: z Out[46]: array([''], dtype='|S1')
That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...
It isn't empty:

In [3]: array(['\x00']).dtype
Out[3]: dtype('|S1')
In [4]: array(['\x00']).tostring()
Out[4]: '\x00'
In [5]: array(['\x00'])[0]
Out[5]: ''

Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.

In [11]: array(['a\x00\x00'])
Out[11]: array(['a'], dtype='|S3')
In [12]: array(['a\x00b'])
Out[12]: array(['a\x00b'], dtype='|S3')

Chuck
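The trimming of trailing zeros described above can be checked against the raw buffer with this sketch, which assumes Python 3 byte strings and a recent NumPy.

```python
import numpy as np

trailing = np.array([b"a\x00\x00"])   # nulls at the end of the string
embedded = np.array([b"a\x00b"])      # null in the middle of the string

trailing_item = trailing[0]           # trailing nulls are dropped on access
embedded_item = embedded[0]           # the embedded null is kept
trailing_buffer = trailing.tobytes()  # the buffer still holds all three bytes

print(trailing.dtype, repr(trailing_item), trailing_buffer)
print(embedded.dtype, repr(embedded_item))
```

The nulls are still present in memory; they are only dropped when an element is read back or displayed.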
Hi.
It isn't empty:
In [3]: array(['\x00']).dtype Out[3]: dtype('|S1')
In [4]: array(['\x00']).tostring() Out[4]: '\x00'
In [5]: array(['\x00'])[0] Out[5]: ''
No, but my problem was that an empty string is not empty either, and that you can't therefore distinguish between an empty string and a string with all 0 bytes:

In [11]: np.array('') == '\x00\x00\x00'
Out[11]: array(True, dtype=bool)
Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.
In [11]: array(['a\x00\x00']) Out[11]: array(['a'], dtype='|S3')
In [12]: array(['a\x00b']) Out[12]: array(['a\x00b'], dtype='|S3')
I don't think it's a printing problem, I think it's that the trailing zeros are pulled off in the string comparisons, and for printing, even though they are present in memory. I mean, that a.tostring() is right, and the __repr__ and comparisons are - at least to me - confusing.

In [2]: a = np.array('a\x00\x00\x00')
In [3]: a
Out[3]: array('a', dtype='|S4')
In [5]: a == 'a'
Out[5]: array(True, dtype=bool)
In [7]: a == 'a\x00\x00\x00'
Out[7]: array(True, dtype=bool)

See you,
Matthew
On Wed, Dec 30, 2009 at 12:00 PM, Matthew Brett
Hi.
It isn't empty:
In [3]: array(['\x00']).dtype Out[3]: dtype('|S1')
In [4]: array(['\x00']).tostring() Out[4]: '\x00'
In [5]: array(['\x00'])[0] Out[5]: ''
No, but my problem was that an empty string is not empty either, and that you can't therefore distinguish between an empty string and a string with all 0 bytes:
In [11]: np.array('') == '\x00\x00\x00' Out[11]: array(True, dtype=bool)
Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.
In [11]: array(['a\x00\x00']) Out[11]: array(['a'], dtype='|S3')
In [12]: array(['a\x00b']) Out[12]: array(['a\x00b'], dtype='|S3')
I don't think it's a printing problem, I think it's that the trailing zeros are pulled off in the string comparisons, and for printing, even though they are present in memory. I mean, that a.tostring() is right, and the __repr__ and comparisons are - at least to me - confusing.
In [2]: a = np.array('a\x00\x00\x00')
In [3]: a Out[3]: array('a', dtype='|S4')
In [5]: a == 'a' Out[5]: array(True, dtype=bool)
In [7]: a == 'a\x00\x00\x00' Out[7]: array(True, dtype=bool)
That is due to type promotion for the ufunc call:

In [17]: a1 = np.array('a\x00\x00\x00')
In [21]: np.array(['a'], dtype=a1.dtype)[0]
Out[21]: 'a'
In [22]: np.array(['a'], dtype=a1.dtype).tostring()
Out[22]: 'a\x00\x00\x00'

Chuck
Charles R Harris wrote:
That is due to type promotion for the ufunc call:
In [17]: a1 = np.array('a\x00\x00\x00')
In [21]: np.array(['a'], dtype=a1.dtype)[0] Out[21]: 'a'
In [22]: np.array(['a'], dtype=a1.dtype).tostring() Out[22]: 'a\x00\x00\x00'
it took me a bit to figure out what this meant, so in case I'm not the only one, I thought I'd spell it out:

In [3]: s1 = np.array('a')
In [4]: s1.dtype
Out[4]: dtype('|S1')

so s1's dtype is a length-1 string

In [11]: s2 = np.array('a\x00\x00')
In [12]: s2.dtype
Out[12]: dtype('|S3')

and s2's is a length-3 string

In [13]: s1 == s2
Out[13]: array(True, dtype=bool)

when they are compared, s1's dtype is coerced to a length 3 string by padding with nulls, and thus they compare equal.

otherwise, there is nothing special about zero bytes in a string:

In [14]: s3 = np.array('\x00a\x00')
In [15]: s3 == s2
Out[15]: array(False, dtype=bool)
In [16]: s3 == s1
Out[16]: array(False, dtype=bool)

The problem is that zero bytes are the only way to pad a string. I suppose the comparison could be smarter, by comparing without coercing, but that may not be possible without the ufunc machinery.

As for printing, I think it simply reflects that numpy strings are null padded, and most people probably wouldn't want to see all those nulls every time.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
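The comparisons spelled out above can be rerun on Python 3 with byte strings, as in the sketch below. The last line is a plain-Python imitation of the null padding, added only to illustrate the explanation; it is not a description of how NumPy implements the comparison internally.

```python
import numpy as np

s1 = np.array(b"a")             # one-byte string dtype
s2 = np.array(b"a\x00\x00")     # three-byte string dtype
s3 = np.array(b"\x00a\x00")     # null in front, so not just padding

s1_eq_s2 = bool(s1 == s2)       # shorter string behaves as if null padded
s3_eq_s2 = bool(s3 == s2)
s3_eq_s1 = bool(s3 == s1)

# Plain-Python picture of the padding: extend b"a" to three bytes with nulls.
padded_equal = b"a".ljust(3, b"\x00") == b"a\x00\x00"

print(s1.dtype, s2.dtype, s1_eq_s2, s3_eq_s2, s3_eq_s1, padded_equal)
```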
Hi,
On Thu, Dec 31, 2009 at 2:08 AM, Christopher Barker
Charles R Harris wrote:
That is due to type promotion for the ufunc call:
In [17]: a1 = np.array('a\x00\x00\x00')
In [21]: np.array(['a'], dtype=a1.dtype)[0] Out[21]: 'a'
In [22]: np.array(['a'], dtype=a1.dtype).tostring() Out[22]: 'a\x00\x00\x00'
it took me a bit to figure out what this meant, so in case I'm not the only one, I thought I'd spell it out:
I think the summary here is 'numpy strings are zero padded; therefore you may run into surprises with a string that has trailing zeros'. I see why that is - the zero terminator is the only way for numpy arrays to see where the end of the string is... Best, Matthew
Matthew Brett wrote:
I think the summary here is 'numpy strings are zero padded; therefore you may run into surprises with a string that has trailing zeros'.
I see why that is - the zero terminator is the only way for numpy arrays to see where the end of the string is...
almost -- it's not quite zero-terminated, you can have embedded zeros:

In [35]: np.array('aa\x00bb', dtype='S6')
Out[35]: array('aa\x00bb', dtype='|S6')

-Chris
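One possible way around the ambiguity, not proposed in the thread itself, is to keep the strings as Python objects instead of fixed-width items. The sketch below is only an illustration under that assumption (Python 3, recent NumPy); it also views the fixed-width buffer as bytes to show that '' and '\x00' are stored identically.

```python
import numpy as np

raw = [b"", b"\x00", b"a\x00"]

fixed = np.array(raw)                 # fixed-width, null-padded storage
boxed = np.array(raw, dtype=object)   # each element stays a Python bytes object

fixed_items = fixed.tolist()          # trailing nulls are gone
boxed_items = boxed.tolist()          # original byte strings survive

# Raw storage of the fixed-width array, one row of bytes per element.
fixed_bytes = fixed.view(np.uint8).reshape(len(raw), fixed.itemsize).tolist()

print(fixed_items)
print(boxed_items)
print(fixed_bytes)
```

The object array keeps the three inputs distinct, at the cost of losing the compact fixed-width buffer that makes the string dtype convenient for writing to disk.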
participants (5)
- Charles R Harris
- Christopher Barker
- David Cournapeau
- Matthew Brett
- Warren Weckesser