On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern <robert.kern@gmail.com> wrote:
The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases.
isn't UTF-32 pretty compressible also? lots of zeros in there....

here's an example with pure ascii Lorem Ipsum text:

    In [17]: len(text)
    Out[17]: 446

    In [18]: len(utf8)
    Out[18]: 446
    # the same -- it's pure ascii

    In [20]: len(utf32)
    Out[20]: 1788
    # four times as big -- of course.

    In [22]: len(bz2.compress(utf8))
    Out[22]: 302
    # so from 446 to 302, not that great -- probably it would be better for longer text
    # -- but are we compressing whole arrays or individual strings?

    In [23]: len(bz2.compress(utf32))
    Out[23]: 319
    # almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters.

OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii characters:

    In [29]: len(text)
    Out[29]: 672

    In [30]: utf8 = text.encode("utf-8")

    In [31]: len(utf8)
    Out[31]: 1180
    # not bad, really -- still smaller than utf-16 :-)

    In [33]: len(bz2.compress(utf8))
    Out[33]: 495
    # pretty good then -- better than 50%

    In [34]: utf32 = text.encode("utf-32")

    In [35]: len(utf32)
    Out[35]: 2692

    In [36]: len(bz2.compress(utf32))
    Out[36]: 515
    # still not quite as good as utf-8, but close.

So: utf-8 compresses better than utf-32, but only by a little bit -- at least with bz2.

But it is a lot smaller uncompressed.
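For anyone who wants to repeat the comparison, here is a minimal sketch. The sample strings and the `sizes` helper are placeholders, not the text used in the session above, so the numbers will differ (repeated sentences compress far better than real prose).

```python
import bz2

# Hypothetical sample text; substitute real prose for a fairer comparison.
samples = {
    "ascii": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 8,
    "greek": "Το γρήγορο καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί. " * 8,
}


def sizes(text):
    """Return {encoding: (raw_bytes, bz2_bytes)} for UTF-8 and UTF-32."""
    result = {}
    for encoding in ("utf-8", "utf-32"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(bz2.compress(raw)))
    return result


for name, text in samples.items():
    print(name, len(text), sizes(text))
```

Note that Python's "utf-32" codec prepends a 4-byte BOM, which is why the UTF-32 lengths above are four times the character count plus four.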
The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters.
It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: http://utf8everywhere.org/#myths
It's really the only way with utf-8 -- which is why it is an impedance mismatch with python strings.
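A small illustration of that mismatch, using an arbitrary example string: `len()` on a Python `str` counts code points, while a fixed-width UTF-8 field has to be sized by the encoded byte count.

```python
text = "naïve café"            # arbitrary example string
encoded = text.encode("utf-8")

n_chars = len(text)            # code points, as Python strings count them
n_bytes = len(encoded)         # bytes, as a UTF-8 field (e.g. in HDF5) counts them

print(n_chars, n_bytes)        # 10 12
```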
I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications.
fortunately, we don't need to agree to that to agree that:
So if we're adding any new string encodings, it needs to be one of them.
Yup -- the most important one to add -- I don't think it is "The Right Way" for all applications -- but it is "The Right Way" for text interchange. And regardless of what any of us think -- it is widely used.
(1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.)
This is the right way to get variable length strings -- but I'm concerned that it doesn't mesh well with numpy uses like npz files, raw dumping of array data, etc. It should not be the only way to get proper Unicode support, nor the default when you do: array(["this", "that"])
(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy.
I think necessary -- at least when you pass in a python string...
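One possible shape for such a truncation helper, sketched for UTF-8 only. The names `truncate_utf8` and `pad_utf8` are hypothetical, not an existing numpy API. It relies on the input being valid text, so the only invalid bytes after slicing are a trailing partial sequence, which `errors="ignore"` drops.

```python
def truncate_utf8(text, max_bytes):
    """Encode text as UTF-8, cut to at most max_bytes, never splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    # Slicing may leave an incomplete multi-byte sequence at the end;
    # decoding with errors="ignore" discards it, then we re-encode.
    return data[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")


def pad_utf8(text, width):
    """Return exactly `width` bytes: truncated UTF-8, NULL-padded on the right."""
    return truncate_utf8(text, width).ljust(width, b"\x00")


print(truncate_utf8("café", 4))  # b'caf' -- the 2-byte 'é' does not fit
print(pad_utf8("café", 4))       # b'caf\x00'
```

Other encodings would need their own handling (a BOM or a stateful codec complicates the slice-and-redecode trick), so this should be read as a sketch rather than a general solution.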
I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4.
is it if you know it's UCS4? or even know the size of the code-unit (I think that's the term)
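To make the code-unit point concrete: in UCS4 most characters contain zero bytes, so a byte-wise search for NUL stops too early, but scanning in 4-byte units finds the terminator. The helper name `ucs4_terminator_index` is hypothetical, and little-endian data is assumed.

```python
def ucs4_terminator_index(buf, unit=4):
    """Return the byte offset of the first all-zero code unit, or len(buf) if none."""
    zero = b"\x00" * unit
    for offset in range(0, len(buf) - unit + 1, unit):
        if buf[offset:offset + unit] == zero:
            return offset
    return len(buf)


# "ab" in little-endian UCS4, followed by two NULL code units of padding.
buf = "ab".encode("utf-32-le") + b"\x00" * 8

print(buf.find(b"\x00"))             # 1 -- a byte-wise scan lands inside 'a'
end = ucs4_terminator_index(buf)
print(end)                           # 8 -- the first all-zero code unit
print(buf[:end].decode("utf-32-le")) # ab
```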
This also includes the legacy UCS4 strings as a special case.
what's special about them? I think the only thing should be that they are the default.
(3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense.
The void dtype is already there for this general purpose and mostly works, with a few niggles.
I'd never noticed that! And if I had I never would have guessed I could use it that way.
If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable.
Then we should fix a few of those things -- and call it something like "bytes", please. -CHB
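For readers who, like me, had not tried the void dtype for this, a short sketch of how it differs from the `S` dtype on the same buffer. The byte values are arbitrary; only `np.frombuffer` and `tobytes()` are used.

```python
import numpy as np

raw = b"ab\x00\x00cd\x00\x01"               # two 4-byte records

as_bytes = np.frombuffer(raw, dtype="S4")   # NULL-padded byte strings
as_void = np.frombuffer(raw, dtype="V4")    # opaque 4-byte items

print(as_bytes[0])           # b'ab' -- trailing NULs are stripped on access
print(as_void[0].tobytes())  # b'ab\x00\x00' -- all four bytes are kept
```

Getting `bytes` back out requires the explicit `tobytes()` call, which is the kind of niggle mentioned above.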
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov