Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer lets str(b'123') do the right thing (it now gives "b'123'").

Specifically, it seems like astype should always be forbidden from converting between unicode and byte arrays - so that would need to be written as:

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]: 
array(['1', '2', '3'],
      dtype='|S1')

In [3]: a.view('U[ascii]')
Out[3]: 
array([u'1', u'2', u'3'],
      dtype='<U[ascii]1')  

In [4]: a.view('U[ascii]').astype('U[ucs32]')  # re-encoding is an astype operation
Out[4]: 
array([u'1', u'2', u'3'],
      dtype='<U1')     # UCS32 is the current default

In [5]: a.view('U[ascii]').astype('U[ucs32]').view(uint8)
Out[5]:
array([0x31, 0, 0, 0, 0x32, 0, 0, 0, 0x33, 0, 0, 0])
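
For comparison, everything above except the encoding parameter already works, since numpy's 'U' dtype is UCS-4 internally. A runnable sketch in today's numpy (the byte values assume a little-endian machine):

import numpy as np

a = np.array([1, 2, 3], np.uint8) + 0x30  # bytes 0x31..0x33, i.e. '1'..'3'
s = a.view('S1')                          # reinterpret the same bytes as 1-byte strings
u = s.astype('U1')                        # decode to numpy's 4-bytes-per-character unicode
print(u.view(np.uint8))                   # [49  0  0  0 50  0  0  0 51  0  0  0]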

I guess for backwards compatibility, .view('U') would always mean view('U[ucs32]').

As an aside - it’d be nice if parameterized dtypes acquired a non-string syntax, like np.unicode_['ucs32'].
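
To make that concrete: numpy already parameterizes some dtypes through their string spelling, e.g. np.dtype('M8[ms]') for millisecond datetime64. A purely hypothetical indexable spelling of the same idea (not a real numpy API):

import numpy as np

np.dtype('M8[ms]')         # existing: the parameter lives inside the dtype string
# Hypothetical subscript syntax, by analogy:
#   np.unicode_['ucs32']   # would mean np.dtype('U[ucs32]')
#   np.unicode_['ascii']   # would mean np.dtype('U[ascii]')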

Eric

On Tue, 25 Apr 2017 at 19:19 Charles R Harris <charlesr.harris@gmail.com> wrote:

On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:

On Tue, Apr 25, 2017 at 7:09 PM Robert Kern <robert.kern@gmail.com> wrote:
* HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. Compression is supported for fixed-length string arrays but not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecific encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars (a short illustration follows this list).
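
For concreteness, a minimal runnable illustration of that last point:

import numpy as np

a = np.array(['1', '2', '3'], dtype='S1')  # 'S' arrays yield bytes scalars on Python 3
print(a[0] == b'1')                        # True
print(a[0] == '1')                         # False: bytes and str never compare equal
print(a[0].decode('ascii') == '1')         # True, but only after an explicit decode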

It seems the greatest challenge is interacting with binary data from other programs and libraries. If we were living entirely in our own data world, Unicode strings in object arrays would generally be pretty satisfactory. So let's try to get what is needed to read and write other people's formats.

I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map directly to numpy arrays; we can store them however we want, since conversion is necessary anyway.

Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: what is the encoding, and how is the length determined? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X9.23, ISO 10126, PKCS#7, etc. - but those are probably too specialized to be needed. We should make sure we can support all the ways that actually occur.
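
For reference, the two padding conventions mentioned so far look like this in current numpy (numpy's own 'S' dtype is zero-padded; np.char.rstrip can emulate the space-padded handling):

import numpy as np

z = np.array([b'ab'], dtype='S5')     # 'S' is zero-padded: stored as b'ab\x00\x00\x00'
print(z[0])                           # b'ab' - trailing NULs are stripped on access
f = np.array([b'ab   '], dtype='S5')  # space-padded, FITS-header style
print(np.char.rstrip(f)[0])           # b'ab' - space padding must be stripped explicitly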

Agreed on UTF-8 fixed-byte-length strings, although I would tend towards null-terminated.

For byte strings, it looks like we need a parameterized type. This serves two uses: display and conversion to (Python) unicode. One could handle both the display and the conversion using the view and astype methods. For instance, we already have

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]:
array(['1', '2', '3'],
      dtype='|S1')

In [3]: a.view('S1').astype('U')
Out[3]:
array([u'1', u'2', u'3'],
      dtype='<U1')
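
One caveat with that route: the S<->U casts that astype performs today implicitly assume ASCII and raise on anything outside it, which is part of why an explicit encoding parameter is being discussed. A small sketch:

import numpy as np

b = np.array([b'1', b'\xff'], dtype='S1')
print(b[:1].astype('U1'))  # fine: ASCII bytes decode
# b.astype('U1')           # raises UnicodeDecodeError - the implicit codec is ASCII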

Chuck
 