[Numpy-discussion] String type again.

Charles R Harris charlesr.harris at gmail.com
Tue Jul 15 11:29:13 EDT 2014


On Tue, Jul 15, 2014 at 9:15 AM, Charles R Harris <charlesr.harris at gmail.com> wrote:

>
>
>
> On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebastian at sipsolutions.net> wrote:
>
>> On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
>> > As previous posts have pointed out, Numpy's `S` type is currently
>> > treated as a byte string, which leads to more complicated code in
>> > Python 3. OTOH, the unicode type is stored as UCS4, which consumes a
>> > lot of space, especially for ascii strings. This note proposes to
>> > adapt the currently existing 'a' type letter, currently aliased to
>> > 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte
>> > internal representations for unicode strings, ascii and latin1. Ascii
>> > has the advantage that it is a subset of UTF-8, whereas latin1 has a
>> > few more symbols. Another possibility is to just make it a UTF-8
>> > encoding, but I think this would involve more overhead as Python would
>> > need to determine the maximum character size. These are just
>> > preliminary thoughts, comments are welcome.
>> >
>>
>> Just wondering, couldn't we have a type which actually has an
>> (arbitrary, Python-supported) encoding (and "bytes" might even just be a
>> special case of no encoding)? Basically storing bytes and on access do
>> element[i].decode(specified_encoding) and on storing element[i] =
>> value.encode(specified_encoding).
>>
>> There is always the never ending small issue of trailing null bytes. If
>> we want to be fully compatible, such a type would have to store the
>> string length explicitly to support trailing null bytes.
>>
>
> UTF-8 works with null-terminated storage: it never produces a zero byte
> except when encoding U+0000 itself. That is one of the reasons it is so
> popular.
>
>
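To make the quoted discussion concrete, here is a rough pure-Python sketch of
the encode-on-store, decode-on-access idea. EncodedStringArray and its
parameters are made up for illustration and are not a NumPy API; a real dtype
would be implemented in C. The sketch assumes each fixed-width slot is padded
with null bytes and that the padding is stripped on read:

```python
class EncodedStringArray:
    """Hypothetical fixed-width string container; not a NumPy API."""

    def __init__(self, length, itemsize, encoding="utf-8"):
        self.itemsize = itemsize                   # bytes per element
        self.encoding = encoding                   # any codec Python supports
        self._buf = bytearray(length * itemsize)   # zero-filled storage

    def __len__(self):
        return len(self._buf) // self.itemsize

    def _start(self, i):
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        return i * self.itemsize

    def __setitem__(self, i, value):
        raw = value.encode(self.encoding)          # encode on store
        if len(raw) > self.itemsize:
            raise ValueError("encoded value does not fit in itemsize")
        start = self._start(i)
        # Pad the slot with null bytes up to the fixed width.
        self._buf[start:start + self.itemsize] = raw.ljust(self.itemsize, b"\x00")

    def __getitem__(self, i):
        start = self._start(i)
        raw = bytes(self._buf[start:start + self.itemsize])
        # Strip padding, then decode on access.
        return raw.rstrip(b"\x00").decode(self.encoding)


arr = EncodedStringArray(length=2, itemsize=4, encoding="latin-1")
arr[0] = "café"      # 4 bytes in latin-1, fills the slot exactly
arr[1] = "ab\x00"    # ends in a real null character
print(repr(arr[0]))  # 'café'
print(repr(arr[1]))  # 'ab'
```

The second element comes back as 'ab': without an explicitly stored length, a
trailing null in the data cannot be told apart from padding, which is the
issue Sebastian raises. Note also that "café" would not fit in 4 bytes with
encoding="utf-8", since the encoded form is 5 bytes.
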
Thinking more about it, the easiest thing to do might be to make the S
dtype a UTF-8 encoding. Most of the machinery to deal with that is already
in place. That change might affect some users though, and we might need to
do some work to keep it backwards compatible with Python 2.
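
As a rough illustration of the space argument, the following plain-Python
snippet compares the bytes needed for a few strings in UCS4 (what the unicode
dtype stores) and in UTF-8. The helper names storage_bytes and fixed_itemsize
are invented for this example, and "utf-32-le" is used as a stand-in for
UCS4 storage without a byte-order mark:

```python
def storage_bytes(text):
    """Bytes needed to store one string in each encoding (no padding)."""
    return {
        "ucs4": len(text.encode("utf-32-le")),  # always 4 bytes per code point
        "utf8": len(text.encode("utf-8")),      # 1 to 4 bytes per code point
    }


def fixed_itemsize(strings, encoding):
    """Smallest fixed slot width, in bytes, that holds every encoded string."""
    return max(len(s.encode(encoding)) for s in strings)


words = ["numpy", "café", "日本語"]
for w in words:
    print(w, storage_bytes(w))

print("utf-8 itemsize:", fixed_itemsize(words, "utf-8"))
print("ucs4 itemsize:", fixed_itemsize(words, "utf-32-le"))
```

For ascii text UTF-8 needs a quarter of the space. The catch is that a
fixed-width slot of N bytes then holds a variable number of characters, so
the itemsize no longer says how many characters fit.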

Chuck