[Numpy-discussion] Py3 merge

Mon Dec 7 09:50:20 EST 2009

Pauli Virtanen wrote:
> On Mon, 2009-12-07 at 09:12 -0500, Michael Droettboom wrote:
>   
>>> We need character arrays for the astro people. I assume these will be 
>>> byte arrays. Maybe Michael will weigh in here.
>>>       
>> I can't find in the thread where removing byte arrays (meaning arrays of 
>> fixed-length non-unicode strings) was suggested -- though changing the 
>> dtype specifier for them was.  That is 'S' would change to 'B' in 
>> python3 (with some deprecation period for 'S'), and 'U' would remain 
>> 'U'.  That seems acceptable to me, as long as we have some way to have 
>> fixed-length 8-bit strings.  Hopefully all the new chararray unit tests 
>> will help with this transition.
>>     
>
> Removal was suggested, with the motivation that people should just use
> byte arrays instead. I think we're not going to remove it at the moment,
> though.
>   
Maybe I'm missing something, but those don't seem the same thing.  The 
byte type is fundamentally numeric, whereas byte strings are 
lexicographic.  They construct, repr and sort differently, and many 
numerical operations don't make sense on strings.  It doesn't seem like 
(at present) byte arrays are a reasonable substitute for string arrays.
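
A minimal sketch of that difference, assuming a current NumPy in which
'S' is the fixed-length byte-string dtype and 'B' is unsigned 8-bit
integers. The sample values are made up for illustration:

```python
import numpy as np

# Fixed-length byte strings ('S' dtype): compared and sorted lexicographically.
strings = np.array([b"pear", b"apple", b"fig"], dtype="S5")

# Unsigned bytes ('B' dtype, i.e. uint8): plain numbers.
# These happen to be the first byte of each string above.
numbers = np.array([112, 97, 102], dtype="B")

sorted_strings = np.sort(strings)   # ordered as strings
sorted_numbers = np.sort(numbers)   # ordered as integers
incremented = numbers + 1           # arithmetic is meaningful for numbers

print(strings.dtype, sorted_strings)
print(numbers.dtype, sorted_numbers, incremented)
```
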
> The character 'B' is already used by unsigned bytes -- I wonder if it's easy
> to support 'B123' and plain 'B' at the same time, or whether we have to
> pick a different letter for "byte strings". 'y' would be free...
>   
It seems to me the motivation to change the 'S' dtype to something else 
is to make things clearer with respect to the new conventions of Python 
3 (where str -> bytes, and unicode -> str).  In that sense, I'm not 
sure there's any advantage in going from "S" to "y" (particularly without 
also changing "U" to "S"), whereas there's a strong backward-compatibility 
advantage to keeping it as "S", though admittedly it's confusing to someone 
who doesn't know the pre-Python 3 history.

I'm not sure your suggestion of making 'B' and 'B123' both work is a 
good one, because of the semantic differences between numbers and 
strings. Would np.array(['a', 'b']) have a repr of [97, 98] or ['a', 
'b']?  Sorting them would also not necessarily do the right thing.
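
To illustrate the ambiguity, here is a small sketch, assuming a current 
NumPy on Python 3 and using bytes literals so that the array gets the 
'S' dtype. The same two bytes of memory read as strings under 'S1' and 
as integers under 'B':

```python
import numpy as np

# Two one-character byte strings; NumPy infers the dtype 'S1'.
chars = np.array([b"a", b"b"])

# Reinterpret the same memory as unsigned bytes (uint8, dtype code 'B').
as_numbers = chars.view(np.uint8)

print(chars.dtype, chars.tolist())
print(as_numbers.dtype, as_numbers.tolist())
```
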
> The chararray unit tests are all presently failing, so they are
> definitely useful :)
>   
Glad to help :)

Mike

-- 
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA