[Numpy-discussion] String type again.

Tue Jul 15 07:26:30 EDT 2014

On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
> As previous posts have pointed out, NumPy's `S` type is currently
> treated as a byte string, which leads to more complicated code in
> Python 3. OTOH, the unicode type is stored as UCS4, which consumes a
> lot of space, especially for ASCII strings. This note proposes to
> adapt the currently existing 'a' type letter, currently aliased to
> 'S', as a new fixed encoding dtype. Python 3.3 introduced two one-byte
> internal representations for unicode strings, ASCII and Latin-1. ASCII
> has the advantage that it is a subset of UTF-8, whereas Latin-1 has a
> few more symbols. Another possibility is to just make it a UTF-8
> encoding, but I think this would involve more overhead as Python would
> need to determine the maximum character size. These are just
> preliminary thoughts; comments are welcome.
> 

Just wondering, couldn't we have a type that carries an arbitrary
(Python-supported) encoding, with "bytes" possibly being just the
special case of no encoding? Basically, the array would store bytes;
on access it would do element[i].decode(specified_encoding), and on
assignment element[i] = value.encode(specified_encoding).
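
To make the idea concrete, here is a minimal pure-Python sketch of such a
container. It is not NumPy code and not a proposed API: the class name
EncodedStringArray, its constructor arguments, and the null-padding scheme
are all assumptions made up for illustration.

```python
class EncodedStringArray:
    """Hypothetical fixed-width container that stores encoded bytes."""

    def __init__(self, length, itemsize, encoding="utf-8"):
        self.length = length
        self.itemsize = itemsize      # bytes reserved per element
        self.encoding = encoding      # any codec name Python knows
        self._buf = bytearray(length * itemsize)  # zero-filled storage

    def _offset(self, i):
        if not 0 <= i < self.length:
            raise IndexError("index out of range")
        return i * self.itemsize

    def __setitem__(self, i, value):
        start = self._offset(i)
        raw = value.encode(self.encoding)   # encode on store
        if len(raw) > self.itemsize:
            raise ValueError("encoded value does not fit in itemsize")
        # Pad with null bytes up to the fixed element width.
        self._buf[start:start + self.itemsize] = raw.ljust(self.itemsize, b"\x00")

    def __getitem__(self, i):
        start = self._offset(i)
        raw = bytes(self._buf[start:start + self.itemsize])
        # Strip the padding, then decode on access.
        return raw.rstrip(b"\x00").decode(self.encoding)


arr = EncodedStringArray(2, 8, encoding="latin-1")
arr[0] = "café"
print(arr[0])
```

With a one-byte encoding such as Latin-1 each character costs one byte of
storage; with UTF-8 the same itemsize holds fewer non-ASCII characters, so
itemsize counts bytes, not characters, in this sketch.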

There is always the never-ending small issue of trailing null bytes. If
we want to be fully compatible, such a type would have to store the
string length explicitly, since null padding alone cannot tell padding
apart from trailing null bytes that belong to the data.
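
A small pure-Python sketch of the difference, under stated assumptions: the
helper names below are hypothetical, and the one-byte length prefix is just
one possible way to store the length, not something NumPy does.

```python
def store_padded(value, itemsize):
    """Null-pad a byte string to a fixed width (no length stored)."""
    if len(value) > itemsize:
        raise ValueError("value does not fit in itemsize")
    return value.ljust(itemsize, b"\x00")


def load_padded(raw):
    """Recover the value by stripping trailing nulls; real nulls are lost."""
    return raw.rstrip(b"\x00")


def store_prefixed(value, itemsize):
    """Store a one-byte length prefix, then the null-padded payload."""
    if len(value) > itemsize or itemsize > 255:
        raise ValueError("value or itemsize too large for this sketch")
    return bytes([len(value)]) + value.ljust(itemsize, b"\x00")


def load_prefixed(raw):
    """Recover exactly len bytes, so trailing nulls in the data survive."""
    n = raw[0]
    return raw[1:1 + n]


value = b"ab\x00"  # the trailing null is part of the data
print(load_padded(store_padded(value, 4)) == value)      # False
print(load_prefixed(store_prefixed(value, 4)) == value)  # True
```

The cost of the explicit length is extra storage per element (one byte here),
which has to be weighed against the compatibility gain.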

- Sebastian

> 
> Chuck  