[Numpy-discussion] Bytes vs. Unicode in Python3
Dag Sverre Seljebotn
dagss at student.matnat.uio.no
Fri Nov 27 17:19:58 EST 2009
Francesc Alted wrote:
> On Friday 27 November 2009 16:41:04, Pauli Virtanen wrote:
>>>> I think so. However, I think S is probably closest to bytes... and
>>>> maybe S can be reused for bytes... I'm not sure though.
>>> That could be a good idea because that would ensure compatibility with
>>> existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes',
>>> as it should). The only thing that I don't like is that 'S' seems
>>> to be the initial letter for 'string', which is actually 'unicode' in
>>> Python 3 :-/ But, for the sake of compatibility, we can probably live
>>> with that.
>> Well, we can "deprecate" 'S' (i.e. never show it in repr, always only 'B'
>> or 'U').
> Well, deprecating 'S' seems a sensible option too. But why only avoid
> showing it in repr? Why not issue a DeprecationWarning too?
One thing to keep in mind here is that PEP 3118 actually defines a
standard dtype format string, which is (mostly) incompatible with
NumPy's. It should probably be supported as well when PEP 3118 is implemented.
Just something to keep in the back of one's mind when discussing this.
For instance one could, instead of inventing something new, adopt the
characters PEP 3118 uses (if there isn't a conflict):
- b: Raw byte
- c: ucs-1 encoding (latin 1, one byte)
- u: ucs-2 encoding, two bytes
- w: ucs-4 encoding, four bytes
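To make the list above concrete, here is a minimal sketch of how the three text characters could map to fixed-width encodings. The table `PEP3118_TEXT_CODES` and the helper `encode_fixed_width` are hypothetical names made up for this illustration; they are not part of NumPy or of PEP 3118. The sketch also assumes little-endian byte order and uses Python's `utf-16-le` codec as a stand-in for ucs-2, which only matches for characters in the Basic Multilingual Plane.

```python
# Hypothetical lookup table built from the character list above.
# It is not part of NumPy, the struct module, or any other library.
PEP3118_TEXT_CODES = {
    "c": ("latin-1", 1),    # ucs-1: one byte per character
    "u": ("utf-16-le", 2),  # stand-in for ucs-2; only exact for BMP characters
    "w": ("utf-32-le", 4),  # ucs-4: four bytes per character
}


def encode_fixed_width(text, code):
    """Encode text with the fixed-width codec assumed for a format character."""
    codec, width = PEP3118_TEXT_CODES[code]
    data = text.encode(codec)
    # Reject text that does not fit the fixed width (e.g. surrogate pairs for 'u').
    if len(data) != width * len(text):
        raise ValueError(f"{text!r} is not representable with code {code!r}")
    return data


# Show how many bytes the same three-character string needs under each code.
for code in "cuw":
    print(code, len(encode_fixed_width("abc", code)))
```

The point of the sketch is only that each character implies a fixed item size per code point, which is what a fixed-width array dtype would need to know.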
Long-term I hope the NumPy-specific format string will be deprecated, so
that repr prints out the PEP 3118 format string etc. But I'm aware that
API breakage shouldn't happen when porting to Python 3.