[Numpy-discussion] numpy.array() of mixed integers and strings can truncate data

Thu Dec 1 16:29:15 EST 2011

On 1 Dec 2011, at 21:35, Chris Barker wrote:

> On 12/1/2011 9:15 AM, Derek Homeier wrote:
>> >>> np.array((2, 12, 0.001+2j), dtype='|S8')
>> array(['2', '12', '(0.001+2'], dtype='|S8')
>> 
>> - notice the last value is only truncated because it had first been converted into
>> a "standard" complex representation, so maybe the problem is already in the way
>> Python treats the input.
> 
> no -- it's truncated because you've specified an 8-character string, and 
> the string representation of the complex value is longer than that. I assume 
> that numpy is using the object's __str__ or __repr__:
> 
> In [13]: str(0.001+2j)
> Out[13]: '(0.001+2j)'
> 
> In [14]: repr(0.001+2j)
> Out[14]: '(0.001+2j)'
> 
That's what I meant by the "Python-side" of the issue, but you're right, there is no
numerical conversion involved.
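
To make that concrete, here is a small pure-Python sketch (no NumPy involved; the
variable names are only for illustration). It shows that the text being cut is simply
what str() returns, and that keeping the first 8 characters reproduces the value seen
in the array above:

```python
# The complex value from the example, converted the way Python itself does it.
value = 0.001 + 2j
text = str(value)

# For complex numbers str() and repr() agree, so either could be the source.
assert text == repr(value)
print(text, len(text))   # (0.001+2j) 10

# A fixed width of 8 keeps only the first 8 of those 10 characters.
truncated = text[:8]
print(truncated)         # (0.001+2
```

So the two missing characters are just the tail of the string, 'j)'.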

> I think the only bug we've identified here is that numpy is selecting 
> the string size based on the longest string input, rather than checking 
> to see how long the string representation of the numeric input is as 
> well. If there is a long-enough string in there, it works fine:
> 
> In [15]: np.array([-345,4,2,'ABC', 'abcde'])
> Out[15]:
> array(['-345', '4', '2', 'ABC', 'abcde'],
>       dtype='|S5')
> 
> An open question is what it should do if you specify the length of the 
> string dtype, but one of the values doesn't fit into that size. Currently 
> it truncates, but should it raise an error?

I would probably raise a warning rather than an error - I think if the user explicitly specifies 
a string length, they should be aware that the data might be truncated (and might even 
want this behaviour). 
Another "issue" could be that the string representation can look quite different from what 
has been typed in, like 

In [95]: np.array(('abcdefg', 12,  0.00001+2j), dtype='|S12')
Out[95]: array(['abcdefg', '12', '(1e-05+2j)'], dtype='|S12')

but then I think one has to accept that 0.00001+2j is not a string and thus cannot be
guaranteed to be represented in that exact way - it can be either understood as a
numerical object or not at all (i.e. one should just type it in as a string - with quotes -
if one wants string-behaviour).
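
As a rough sketch of the "warn, don't fail" idea, in pure Python: the helpers
required_width() and to_fixed_width() below are hypothetical names made up for this
mail, not anything NumPy provides, and they assume the conversion goes through str().

```python
import warnings


def required_width(values):
    """Width needed to hold the str() form of every element (hypothetical helper)."""
    return max(len(str(v)) for v in values)


def to_fixed_width(values, width):
    """Convert values to strings of at most `width` characters.

    Emits a warning, rather than raising, whenever a value has to be cut.
    """
    out = []
    for v in values:
        text = str(v)
        if len(text) > width:
            warnings.warn(
                f"{text!r} needs {len(text)} characters; truncating to {width}",
                stacklevel=2,
            )
        out.append(text[:width])
    return out


# Width chosen from all elements, numbers included, not only from the strings.
print(required_width([-345, 4, 2, 'ABC', 'abcde']))   # 5
print(required_width((2, 12, 0.001 + 2j)))            # 10

# An explicit width of 8 still truncates, but now says so.
print(to_fixed_width((2, 12, 0.001 + 2j), 8))         # ['2', '12', '(0.001+2']
```

A caller who wants the truncation can silence the warning with the standard warnings
filters, and one who does not can turn it into an error the same way.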

Cheers,
				Derek