[Numpy-discussion] Bytes vs. Unicode in Python3

Fri Nov 27 17:19:58 EST 2009

Francesc Alted wrote:
> On Friday 27 November 2009 16:41:04 Pauli Virtanen wrote:
>>>> I think so.  However, I think S is probably closest to bytes... and
>>>> maybe S can be reused for bytes... I'm not sure though.
>>> That could be a good idea because that would ensure compatibility with
>>> existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes',
>>> as it should).  The only thing that I don't like is that 'S' seems
>>> to be the initial letter for 'string', which is actually 'unicode' in
>>> Python 3 :-/ But, for the sake of compatibility, we can probably live
>>> with that.
>> Well, we can "deprecate" 'S' (i.e. never show it in repr, always only 'B'
>> or 'U').
> 
> Well, deprecating 'S' seems a sensible option too.  But why only avoid 
> showing it in repr?  Why not issue a DeprecationWarning too?

One thing to keep in mind here is that PEP 3118 actually defines a 
standard dtype format string, which is (mostly) incompatible with 
NumPy's. It should probably be supported as well when PEP 3118 is 
implemented.
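
To make the incompatibility concrete, here is a small sketch. It assumes a 
NumPy version that already exports its arrays through the PEP 3118 buffer 
interface (not the case at the time of writing), and the variable names are 
only illustrative. The same two characters mean different things depending 
on which syntax reads them:

```python
import struct

import numpy as np

# NumPy's own syntax: type letter first, then the item size in bytes.
numpy_float = np.dtype("f4")      # ONE float that is 4 bytes wide

# struct / PEP 3118 syntax: repeat count first, then the type letter.
struct_floats = "4f"              # FOUR C floats

numpy_itemsize = numpy_float.itemsize
struct_size = struct.calcsize(struct_floats)

# What NumPy reports through the buffer interface for a float64 array:
# the struct-style code 'd' (C double), not NumPy's own 'f8'.
exported_format = memoryview(np.zeros(2, dtype="f8")).format

print(numpy_itemsize, struct_size, exported_format)
```

So a format string cannot simply be passed from one side to the other; 
some translation layer is needed.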

Just something to keep in the back of one's mind when discussing this. 
For instance one could, instead of inventing something new, adopt the 
characters PEP 3118 uses (if there isn't a conflict):

  - b: Raw byte
  - c: ucs-1 encoding (latin 1, one byte)
  - u: ucs-2 encoding, two bytes
  - w: ucs-4 encoding, four bytes
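
As a rough illustration of what the three text codes above would mean in 
terms of storage, here is a sketch using only the standard library. The 
table and the helper `encode_fixed_width` are hypothetical, not part of 
NumPy or of PEP 3118, and the Python codecs are stand-ins: 'utf-16-le' 
only coincides with UCS-2 for characters in the Basic Multilingual Plane, 
which is why the helper checks the resulting length.

```python
# Hypothetical table: PEP 3118 character code -> (bytes per character, codec).
PEP3118_TEXT_CODES = {
    "c": (1, "latin-1"),     # UCS-1
    "u": (2, "utf-16-le"),   # stand-in for UCS-2 (valid for BMP characters only)
    "w": (4, "utf-32-le"),   # UCS-4
}


def encode_fixed_width(text, code):
    """Encode text so that every character takes the same number of bytes."""
    width, codec = PEP3118_TEXT_CODES[code]
    raw = text.encode(codec)
    # Reject text that does not fit the fixed width, e.g. a non-BMP
    # character under 'u', which UTF-16 stores as a 4-byte surrogate pair.
    if len(raw) != width * len(text):
        raise ValueError("text not representable with code %r" % code)
    return raw


sizes = {code: len(encode_fixed_width("abc", code)) for code in "cuw"}
print(sizes)
```

The same three-character string takes 3, 6 or 12 bytes depending on the 
code, which is the distinction NumPy's single 'U' does not express.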

Long-term I hope the NumPy-specific format string will be deprecated, so 
that repr prints out the PEP 3118 format string etc. But I'm aware that 
API breakage shouldn't happen when porting to Python 3.

-- 
Dag Sverre