[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Wed Apr 26 19:21:26 EDT 2017

On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern <robert.kern at gmail.com> wrote:

> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases.

isn't UTF-32 pretty compressible also? lots of zeros in there....

here's an example with pure ascii  Lorem Ipsum text:

In [17]: len(text)
Out[17]: 446

In [18]: len(utf8)
Out[18]: 446

# the same -- it's pure ascii

In [20]: len(utf32)
Out[20]: 1788

# four times as big -- of course.

In [22]: len(bz2.compress(utf8))
Out[22]: 302

# so from 446 to 302, not that great -- it would probably do better on
# longer text -- but are we compressing whole arrays or individual strings?

In [23]: len(bz2.compress(utf32))
Out[23]: 319

# almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters.

OK -- turns out I'm wrong -- here it is with Greek text -- not a lot of ascii:

In [29]: len(text)
Out[29]: 672

In [30]: utf8 = text.encode("utf-8")

In [31]: len(utf8)
Out[31]: 1180

# not bad, really -- still smaller than utf-16 :-)

In [33]: len(bz2.compress(utf8))
Out[33]: 495

# pretty good then -- better than 50%

In [34]: utf32 = text.encode("utf-32")

In [35]: len(utf32)
Out[35]: 2692

In [36]: len(bz2.compress(utf32))
Out[36]: 515

# still not quite as good as utf-8, but close.

So: utf-8 compresses better than utf-32, but only by a little bit -- at
least with bz2.

But it is a lot smaller uncompressed.
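For anyone who wants to reproduce the comparison, here's a self-contained
sketch -- the Lorem Ipsum line is just a stand-in for the samples I used above:

```python
import bz2

# stand-in for the ASCII Lorem Ipsum sample used above
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 10

utf8 = text.encode("utf-8")    # 1 byte per char for pure ASCII
utf32 = text.encode("utf-32")  # 4 bytes per char, plus a 4-byte BOM

print(len(utf8), len(utf32))   # utf-32 is ~4x bigger uncompressed
print(len(bz2.compress(utf8)), len(bz2.compress(utf32)))  # much closer after bz2
```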

>>> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
> >>
> >> It's not just HDF5. Counting bytes is the Right Way to measure the size
> of UTF-8 encoded text:
> >> http://utf8everywhere.org/#myths

It's really the only way with utf-8 -- which is why it is an impedance
mismatch with python strings.
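A quick illustration of the mismatch (a made-up two-accent sample):

```python
s = "naïve café"       # 10 characters, as Python counts them
b = s.encode("utf-8")  # 12 bytes: "ï" and "é" take 2 bytes each
print(len(s), len(b))  # 10 12

# slicing the bytes at a fixed width can split a character:
# b[:11] ends mid-"é" and won't decode cleanly
```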

>> I also firmly believe (though clearly this is not universally agreed
> upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
> applications.

fortunately, we don't need to agree to that to agree that:

> So if we're adding any new string encodings, it needs to be one of them.

Yup -- the most important one to add -- I don't think it is "The Right Way"
for all applications -- but it "The Right Way" for text interchange.

And regardless of what any of us think -- it is widely used.

> (1) object arrays of strings. (We have these already; whether a
> strings-only specialization would permit useful things like string-oriented
> ufuncs is a question for someone who's willing to implement one.)

This is the right way to get variable length strings -- but I'm concerned
that it doesn't mesh well with numpy uses like npz files, raw dumping of
array data, etc. It should not be the only way to get proper Unicode
support, nor the default when you do:

array(["this", "that"])
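(For reference, here's what that call gives you today -- a fixed-width UCS4
dtype sized to the longest element:)

```python
import numpy as np

a = np.array(["this", "that"])
print(a.dtype)     # <U4 -- four UCS4 code points per element
print(a.itemsize)  # 16 bytes per element
```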

> > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
> All python encodings should be permitted. An additional function to
> truncate encoded data without mangling the encoding would be handy.

I think necessary -- at least when you pass in a python string...
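For UTF-8 it could look something like this sketch (utf8_truncate is a
hypothetical helper name, and it assumes the input is already valid UTF-8):

```python
def utf8_truncate(b: bytes, nbytes: int) -> bytes:
    """Truncate UTF-8 bytes to at most nbytes without splitting a character."""
    cut = b[:nbytes]
    # If the cut lands mid-character, the incomplete tail fails to decode;
    # errors="ignore" silently drops it (valid UTF-8 input assumed).
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

print(utf8_truncate("café".encode("utf-8"), 4))  # b'caf' -- the 2-byte é is dropped whole
```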

> I think it makes more sense for this to be NULL-padded than
> NULL-terminated but it may be necessary to support both; note that
> NULL-termination is complicated for encodings like UCS4.

Is it, if you know it's UCS4? Or even if you just know the size of the
code unit (I think that's the term)?
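To make the complication concrete: in UCS4/UTF-32, NUL *bytes* show up inside
perfectly ordinary characters, so a terminator has to be checked per 4-byte
code unit, not per byte:

```python
utf32 = "AB".encode("utf-32-le")  # b'A\x00\x00\x00B\x00\x00\x00'
print(utf32.index(b"\x00"))       # 1 -- a byte-wise NUL scan "terminates" inside 'A'

# scanning by 4-byte code units instead works fine:
units = [utf32[i:i + 4] for i in range(0, len(utf32), 4)]
print(b"\x00\x00\x00\x00" in units)  # False -- no terminator present
```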

> This also includes the legacy UCS4 strings as a special case.

what's special about them? I think the only thing should be that they are
the default.

> > (3) a dtype for fixed-length byte strings. This doesn't look very
> different from an array of dtype u8, but given we have the bytes type,
> accessing the data this way makes sense.
> The void dtype is already there for this general purpose and mostly works,
> with a few niggles.

I'd never noticed that! And if I had I never would have guessed I could use
it that way.
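For anyone else who hadn't noticed it either, a small example -- and one of
the niggles in action: unlike "S", void round-trips trailing NUL bytes:

```python
import numpy as np

raw = b"alphabetgam\x00\x00\x00\x00\x00"  # 16 bytes of raw data

v = np.frombuffer(raw, dtype="V8")  # two opaque 8-byte records
s = np.frombuffer(raw, dtype="S8")  # same bytes as fixed-width bytestrings

print(v[1].tobytes())  # b'gam\x00\x00\x00\x00\x00' -- trailing NULs preserved
print(s[1])            # b'gam' -- "S" strips trailing NULs
```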

> If it worked more transparently and perhaps rigorously with `bytes`, then
> it would be quite suitable.

Then we should fix those things -- and call it something like "bytes",
please.


> --

Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
