Mailman 3 Empty strings not empty? - NumPy-Discussion

Empty strings not empty?

Matthew Brett

29 Dec 2009

11:35 p.m.

Hi,

I was surprised by this - should I have been?

In [35]: e = np.array(['a'])

In [36]: e.shape
Out[36]: (1,)

In [37]: e.size
Out[37]: 1

In [38]: e.tostring()
Out[38]: 'a'

In [39]: f = np.array(['a'])

In [40]: f.shape == e.shape
Out[40]: True

In [41]: f.size == e.size
Out[41]: True

In [42]: f.tostring()
Out[42]: 'a'

In [43]: z = np.array(['\x00'])

In [44]: z.shape
Out[44]: (1,)

In [45]: z.size
Out[45]: 1

In [46]: z
Out[46]: array([''], dtype='|S1')

That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...

Best,

Matthew


David Cournapeau

29 Dec

11:44 p.m.

On Wed, Dec 30, 2009 at 8:35 AM, Matthew Brett wrote:

...

Hi,

I was surprised by this - should I have been?

In [35]: e = np.array(['a'])

In [36]: e.shape Out[36]: (1,)

In [37]: e.size Out[37]: 1

In [38]: e.tostring() Out[38]: 'a'

In [39]: f = np.array(['a'])

In [40]: f.shape == e.shape Out[40]: True

In [41]: f.size == e.size Out[41]: True

In [42]: f.tostring() Out[42]: 'a'

In [43]: z = np.array(['\x00'])

In [44]: z.shape Out[44]: (1,)

In [45]: z.size Out[45]: 1

In [46]: z Out[46]: array([''], dtype='|S1')

That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...

I don't see any empty string in your code? They all have one byte. The last one is slightly confusing as far as printing is concerned (I would have expected array(["\x00"]...) instead). It may be a bug in numpy because a byte with value 0 is used as a string delimiter in C.

cheers,

David

Matthew Brett

11:52 p.m.

Hi,

...

I don't see any empty string in your code? They all have one byte. The last one is slightly confusing as far as printing is concerned (I would have expected array(["\x00"]...) instead). It may be a bug in numpy because a byte with value 0 is used as a string delimiter in C.

Sorry - I pasted the wrong code:

In [49]: e = np.array([''])

In [50]: e.shape
Out[50]: (1,)

In [51]: e.size
Out[51]: 1

In [52]: f = np.array(['a'])

In [53]: f.shape == e.shape
Out[53]: True

In [54]: f.size == e.size
Out[54]: True

In [55]: e.tostring()
Out[55]: '\x00'

In [56]: f.tostring()
Out[56]: 'a'

In [58]: e == z
Out[58]: array([ True], dtype=bool)

Thanks,

Matthew

David Cournapeau

30 Dec

12:18 a.m.

On Wed, Dec 30, 2009 at 8:52 AM, Matthew Brett wrote:

...

In [58]: e == z Out[58]: array([ True], dtype=bool)

Ok, it looks like there are at least two issues:

- If an item in a string array is set to '\x00', it seems to be replaced with '', but '' != '\x00':

x = np.array(["\x00"])
x[0] == ''  # True, but should be False?

- If an item in a string array is set to '', tostring will convert it to '\x00':

x = np.array([""])
x.tostring() == "\x00"  # True, but should be False?

I guess the root cause is that there does not seem to be a "|S0" type - but that may be difficult to implement, since the array would have > 0 items, but a 0 size. It may have other, but equally quirky, behavior.

What do you need this for?

cheers,

David
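For readers on Python 3, the two observations above can be reproduced with bytes literals. This is only a sketch, assuming a recent NumPy in which `tobytes()` is the current name for what the thread calls `tostring()`; the variable names are illustrative and not from the original messages.

```python
import numpy as np

# A one-element bytes array holding a single NUL byte.
nul_arr = np.array([b"\x00"])
# Reading the item back strips trailing NULs, so it compares equal to b"".
nul_item_is_empty = bool(nul_arr[0] == b"")

# A one-element bytes array built from an empty string.
empty_arr = np.array([b""])
# The dtype is still one byte wide, and that byte is a NUL.
empty_itemsize = empty_arr.dtype.itemsize
empty_raw = empty_arr.tobytes()

print(nul_item_is_empty, empty_itemsize, empty_raw)
```

Both arrays end up with the same one-byte buffer, which is why they cannot be told apart afterwards.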

Matthew Brett

12:33 a.m.

Hi,

...

Ok, it looks like there are at least two issues: - if an item in a string array is set to '\x00', this seems to be replaced with '', but '' != '\x00'

Sorry - I'm afraid I don't understand. It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0.

...

What do you need this for ?

I noticed it when I found that writing an empty string array to matlab resulted in a single character array when loaded into matlab. I guess that I will have to special-case the writing code to detect 'empty' strings, but I can't (I don't think) distinguish a real string with \x00 from an empty string. Thanks a lot, Matthew
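One possible way to keep the distinction, which nobody in the thread proposes, is to hold the strings in an object array so that NumPy keeps the original Python objects instead of a fixed-width buffer. A minimal sketch, assuming Python 3 bytes and a recent NumPy; whether this suits the matlab-writing code is a separate question.

```python
import numpy as np

# Fixed-width bytes dtype: both inputs are stored as the same single NUL byte.
fixed = np.array([b"", b"\x00"])
fixed_same = fixed[:1].tobytes() == fixed[1:].tobytes()

# Object dtype stores references to the original Python bytes objects,
# so the true lengths survive.
boxed = np.array([b"", b"\x00"], dtype=object)
boxed_lengths = [len(item) for item in boxed]

print(fixed_same, boxed_lengths)
```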

David Cournapeau

12:57 a.m.

On Wed, Dec 30, 2009 at 9:33 AM, Matthew Brett wrote:

...

Hi,

...
Ok, it looks like there are at least two issues: - if an item in a string array is set to '\x00', this seems to be replaced with '', but '' != '\x00'

Sorry - I'm afraid I don't understand

Compare this:

x = "\00"
arr = np.array([x])
lst = [x]

arr[0] == x   # False
arr[0] == ""  # True

lst[0] == x   # True
lst[0] == ""  # False

...

It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0

Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0). If you create a list with an empty string (x = [""]), you have len(x) == 1 and len(x[0]) == 0. But an empty string has size 0, so the corresponding dtype should have an itemsize of 0 (assuming the array only contains empty strings).

...

I guess that I will have to special-case the writing code to detect 'empty' strings, but I can't (I don't think) distinguish a real string with \x00 from an empty string.

In python "proper", they are different: "\x00" != "". The problem is that it does not seem possible ATM to create a numpy array with an empty string.

David
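The list-versus-array comparison in this message can be spelled out as follows. This is a sketch assuming Python 3 bytes and a recent NumPy; the names are illustrative.

```python
import numpy as np

lst = [b""]
arr = np.array(lst)

list_len = len(lst)           # number of items in the list
list_item_len = len(lst[0])   # length of the one item
arr_size = arr.size           # number of items in the array
arr_itemsize = arr.itemsize   # bytes reserved per item
arr_item_len = len(arr[0])    # length of the item as read back

print(list_len, list_item_len, arr_size, arr_itemsize, arr_item_len)
```

The array has one item, like the list, but reserves one byte for it even though the item reads back with length 0.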

Matthew Brett

1:20 a.m.

Hi,

...

x = "\00"
arr = np.array([x])
lst = [x]

arr[0] == x   # False
arr[0] == ""  # True

lst[0] == x   # True
lst[0] == ""  # False

Ah - thanks - got it.

...

...
It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0

Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0).

Is this a string-specific thing? I mean, you can have size 0 1d numeric arrays. Sorry if I'm being slow, it's late here.

In [70]: np.array([[]]).shape
Out[70]: (1, 0)

In [71]: np.array([[]]).size
Out[71]: 0

Cheers,

Matthew

David Cournapeau

2:36 a.m.

On Wed, Dec 30, 2009 at 10:20 AM, Matthew Brett wrote:

...

Hi,

...
x = "\00"
arr = np.array([x])
lst = [x]

arr[0] == x   # False
arr[0] == ""  # True

lst[0] == x   # True
lst[0] == ""  # False

Ah - thanks - got it.

...
...
It looks to me as though the buffer contents of [''] is a length 1 string with a 0 byte, and an array.size of 1 - is that also what you think? I guess I think that it should be a length 0 string, with an array.size of 0

Array size of 0 would be very weird: it means it would have no items, whereas it actually has one item (which itself has a size 0).

Is this a string-specific thing?

No. I was not very clear: my point was that a size-0 array is likely not what you want. What you want is arrays whose *itemsize* is 0.

...

I mean, you can have size 0 1d numeric arrays. Sorry if I'm being slow, it's late here.

In [70]: np.array([[]]).shape Out[70]: (1, 0)

In [71]: np.array([[]]).size Out[71]: 0

Yes, you can create an array with size 0. But I don't think that's what you want - you cannot index them normally (even though the array is 2d, you cannot do arr[0][0]), so I don't think that's very useful for your case.

David
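The indexing point can be checked directly. A small sketch, assuming a recent NumPy; `indexable` is just an illustrative flag.

```python
import numpy as np

arr = np.array([[]])   # one row, zero columns
shape, size = arr.shape, arr.size

# The row exists, but it contains no elements to index into.
try:
    arr[0][0]
    indexable = True
except IndexError:
    indexable = False

print(shape, size, indexable)
```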

Warren Weckesser

29 Dec

11:45 p.m.

Hmmm... I don't see where you created "an empty string array" in your examples. All three of your arrays contain one element, so they are not empty. The element in the array z happens to be zero, is all.

Here's an example of creating an empty string array:

In [2]: e = np.array([],dtype='S')

In [3]: e
Out[3]: array([], dtype='|S1')

In [4]: e.shape
Out[4]: (0,)

Warren

Matthew Brett wrote:

...

Hi,

I was surprised by this - should I have been?

In [35]: e = np.array(['a'])

In [36]: e.shape Out[36]: (1,)

In [37]: e.size Out[37]: 1

In [38]: e.tostring() Out[38]: 'a'

In [39]: f = np.array(['a'])

In [40]: f.shape == e.shape Out[40]: True

In [41]: f.size == e.size Out[41]: True

In [42]: f.tostring() Out[42]: 'a'

In [43]: z = np.array(['\x00'])

In [44]: z.shape Out[44]: (1,)

In [45]: z.size Out[45]: 1

In [46]: z Out[46]: array([''], dtype='|S1')

That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...

Best,

Matthew
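The difference between Warren's array with no elements and an array holding one empty string can be seen side by side. This is a sketch assuming Python 3 bytes and a recent NumPy; the variable names are illustrative.

```python
import numpy as np

no_items = np.array([], dtype="S")   # zero elements
one_empty_item = np.array([b""])     # one element, which reads back as b""

print(no_items.shape, no_items.size)
print(one_empty_item.shape, one_empty_item.size)
```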

Charles R Harris

30 Dec

6:34 p.m.

On Tue, Dec 29, 2009 at 4:35 PM, Matthew Brett wrote:

...

Hi,

I was surprised by this - should I have been?

In [35]: e = np.array(['a'])

In [36]: e.shape Out[36]: (1,)

In [37]: e.size Out[37]: 1

In [38]: e.tostring() Out[38]: 'a'

In [39]: f = np.array(['a'])

In [40]: f.shape == e.shape Out[40]: True

In [41]: f.size == e.size Out[41]: True

In [42]: f.tostring() Out[42]: 'a'

In [43]: z = np.array(['\x00'])

In [44]: z.shape Out[44]: (1,)

In [45]: z.size Out[45]: 1

In [46]: z Out[46]: array([''], dtype='|S1')

That is, an empty string array seems to be the same as a string array with a single 0 byte, including having shape (1,) and size 1...

It isn't empty:

In [3]: array(['\x00']).dtype
Out[3]: dtype('|S1')

In [4]: array(['\x00']).tostring()
Out[4]: '\x00'

In [5]: array(['\x00'])[0]
Out[5]: ''

Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.

In [11]: array(['a\x00\x00'])
Out[11]: array(['a'], dtype='|S3')

In [12]: array(['a\x00b'])
Out[12]: array(['a\x00b'], dtype='|S3')

Chuck
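The same trimming shows up outside of printing when an item is read back. A sketch assuming Python 3 bytes and a recent NumPy, with `tobytes()` standing in for the `tostring()` used above:

```python
import numpy as np

trailing = np.array([b"a\x00\x00"])   # NULs at the end
embedded = np.array([b"a\x00b"])      # NUL in the middle

trailing_item = bytes(trailing[0])    # item as read back
trailing_raw = trailing.tobytes()     # bytes actually stored
embedded_item = bytes(embedded[0])

print(trailing_item, trailing_raw, embedded_item)
```

Only the trailing NULs disappear on read-back; the stored buffer still has them, and an embedded NUL is preserved.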

Matthew Brett

7 p.m.

Hi.

...

It isn't empty:

In [3]: array(['\x00']).dtype Out[3]: dtype('|S1')

In [4]: array(['\x00']).tostring() Out[4]: '\x00'

In [5]: array(['\x00'])[0] Out[5]: ''

No, but my problem was that an empty string is not empty either, and that you can't therefore distinguish between an empty string and a string with all 0 bytes:

In [11]: np.array('') == '\x00\x00\x00'
Out[11]: array(True, dtype=bool)

...

Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.

In [11]: array(['a\x00\x00']) Out[11]: array(['a'], dtype='|S3')

In [12]: array(['a\x00b']) Out[12]: array(['a\x00b'], dtype='|S3')

I don't think it's a printing problem, I think it's that the trailing zeros are pulled off in the string comparisons, and for printing, even though they are present in memory. I mean, that a.tostring() is right, and the __repr__ and comparisons are - at least to me - confusing.

In [2]: a = np.array('a\x00\x00\x00')

In [3]: a
Out[3]: array('a', dtype='|S4')

In [5]: a == 'a'
Out[5]: array(True, dtype=bool)

In [7]: a == 'a\x00\x00\x00'
Out[7]: array(True, dtype=bool)

See you,

Matthew

Charles R Harris

7:21 p.m.

On Wed, Dec 30, 2009 at 12:00 PM, Matthew Brett wrote:

...

Hi.

...
It isn't empty:

In [3]: array(['\x00']).dtype Out[3]: dtype('|S1')

In [4]: array(['\x00']).tostring() Out[4]: '\x00'

In [5]: array(['\x00'])[0] Out[5]: ''

No, but my problem was that an empty string is not empty either, and that you can't therefore distinguish between an empty string and a string with all 0 bytes:

In [11]: np.array('') == '\x00\x00\x00' Out[11]: array(True, dtype=bool)

...
Looks like a printing problem to me, something in __repr__ for the string array. It seems that trailing zeros are trimmed off.

In [11]: array(['a\x00\x00']) Out[11]: array(['a'], dtype='|S3')

In [12]: array(['a\x00b']) Out[12]: array(['a\x00b'], dtype='|S3')

I don't think it's a printing problem, I think it's that the trailing zeros are pulled off in the string comparisons, and for printing, even though they are present in memory. I mean, that a.tostring() is right, and the __repr__ and comparisons are - at least to me - confusing.

In [2]: a = np.array('a\x00\x00\x00')

In [3]: a Out[3]: array('a', dtype='|S4')

In [5]: a == 'a' Out[5]: array(True, dtype=bool)

In [7]: a == 'a\x00\x00\x00' Out[7]: array(True, dtype=bool)

That is due to type promotion for the ufunc call:

In [17]: a1 = np.array('a\x00\x00\x00')

In [21]: np.array(['a'], dtype=a1.dtype)[0]
Out[21]: 'a'

In [22]: np.array(['a'], dtype=a1.dtype).tostring()
Out[22]: 'a\x00\x00\x00'

Chuck

Christopher Barker

31 Dec

2:08 a.m.

Charles R Harris wrote:

...

That is due to type promotion for the ufunc call:

In [17]: a1 = np.array('a\x00\x00\x00')

In [21]: np.array(['a'], dtype=a1.dtype)[0] Out[21]: 'a'

In [22]: np.array(['a'], dtype=a1.dtype).tostring() Out[22]: 'a\x00\x00\x00'

it took me a bit to figure out what this meant, so in case I'm not the only one, I thought I'd spell it out:

In [3]: s1 = np.array('a')

In [4]: s1.dtype
Out[4]: dtype('|S1')

so s1's dtype is a length-1 string

In [11]: s2 = np.array('a\x00\x00')

In [12]: s2.dtype
Out[12]: dtype('|S3')

and s2's is a length-3 string

In [13]: s1 == s2
Out[13]: array(True, dtype=bool)

when they are compared, s1's dtype is coerced to a length 3 string by padding with nulls, and thus they compare equal.

otherwise, there is nothing special about zero bytes in a string:

In [14]: s3 = np.array('\x00a\x00')

In [15]: s3 == s2
Out[15]: array(False, dtype=bool)

In [16]: s3 == s1
Out[16]: array(False, dtype=bool)

The problem is that zero bytes are the only way to pad a string. I suppose the comparison could be smarter, by comparing without coercing, but that may not be possible without the ufunc machinery.

As for printing, I think it simply reflects that numpy strings are null padded, and most people probably wouldn't want to see all those nulls every time.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
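The comparisons walked through above can be collected into one runnable snippet. This is a sketch assuming Python 3 bytes and a recent NumPy; the `same_*` names are illustrative.

```python
import numpy as np

s1 = np.array(b"a")          # dtype S1
s2 = np.array(b"a\x00\x00")  # dtype S3, trailing NULs
s3 = np.array(b"\x00a\x00")  # dtype S3, leading NUL

same_12 = bool(s1 == s2)     # trailing NULs act as padding
same_32 = bool(s3 == s2)     # a leading NUL is real content
same_31 = bool(s3 == s1)
raw_differs = s1.tobytes() != s2.tobytes()

print(same_12, same_32, same_31, raw_differs)
```

The first comparison succeeds even though the two buffers differ in length, which matches the padding explanation given in the message.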

Matthew Brett

10:13 a.m.

Hi, On Thu, Dec 31, 2009 at 2:08 AM, Christopher Barker wrote:

...

Charles R Harris wrote:

...
That is due to type promotion for the ufunc call:

In [17]: a1 = np.array('a\x00\x00\x00')

In [21]: np.array(['a'], dtype=a1.dtype)[0] Out[21]: 'a'

In [22]: np.array(['a'], dtype=a1.dtype).tostring() Out[22]: 'a\x00\x00\x00'

it took me a bit to figure out what this meant, so in case I'm not the only one, I thought I'd spell it out:

I think the summary here is 'numpy strings are zero padded; therefore you may run into surprises with a string that has trailing zeros'. I see why that is - the zero terminator is the only way for numpy arrays to see where the end of the string is... Best, Matthew

Christopher Barker

1 Jan

12:35 a.m.

Matthew Brett wrote:

...

I think the summary here is 'numpy strings are zero padded; therefore you may run into surprises with a string that has trailing zeros'.

I see why that is - the zero terminator is the only way for numpy arrays to see where the end of the string is...

almost -- it's not quite zero-terminated, you can have embedded zeros:

In [35]: np.array('aa\x00bb', dtype='S6')
Out[35]: array('aa\x00bb', dtype='|S6')

-Chris
