[Numpy-discussion] proposal: smaller representation of string arrays
Chris Barker
chris.barker at noaa.gov
Wed Apr 26 20:02:12 EDT 2017
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
> Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
> myself have already given), but we seem to be talking past each other here.
>
Yeah -- I think it's not clear which use cases we are talking about.
> I am still -1 on any new string encoding support unless that includes at
> least UTF-8, with length indicated by the number of bytes.
>
I've said multiple times that utf-8 support is key to any "exchange binary
data" use case (memory mapping?) -- so yes, absolutely.
I _think_ this may be part of the source of the confusion:
The name of this thread is: "proposal: smaller representation of string
arrays".
And I got the impression, maybe mistaken, that folks were suggesting that
internally encoding strings in numpy as "UTF-8, with length indicated by
the number of bytes" was THE solution to the problem that
"the 'U' dtype takes up way too much memory, particularly for
mostly-ascii data".
I do not think it is a good solution to that problem.
I think a good solution to that problem is latin-1 encoding. (bear with me
here...)
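To put rough numbers on the memory argument, here is a small sketch using only Python's codecs. It assumes the 'U' dtype's four-bytes-per-code-point storage can be approximated by the "utf-32-le" codec (no BOM); the helper name encoded_sizes is made up for illustration and is not part of numpy.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store `text` in each encoding."""
    return {
        # four bytes per code point, as in numpy's 'U' dtype (UCS-4)
        "ucs-4": len(text.encode("utf-32-le")),
        # one to four bytes per code point
        "utf-8": len(text.encode("utf-8")),
        # exactly one byte per character, but only for code points < 256
        "latin-1": len(text.encode("latin-1")),
    }


sizes = encoded_sizes("some text")
print(sizes)  # {'ucs-4': 36, 'utf-8': 9, 'latin-1': 9}

# One non-ascii character: latin-1 stays at one byte per character,
# utf-8 needs two bytes for the accented letter.
mixed = encoded_sizes("caf\u00e9")
print(mixed)  # {'ucs-4': 16, 'utf-8': 5, 'latin-1': 4}
```

Note that encoding to latin-1 raises UnicodeEncodeError for any character outside its 256 code points, which is the price of the fixed one-byte width.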
But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:
* Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.
For THAT -- utf-8 is critical.
But if I understand Julian's proposal -- he wants to create a parameterized
text dtype that you can set the encoding on, and then numpy will use the
encoding (and python's machinery) to encode / decode when passing to/from
python strings.
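A minimal sketch of how I read that idea, in plain Python rather than as a real dtype. The class EncodedField and its pack/unpack methods are hypothetical names for illustration only; they are not Julian's API and not anything in numpy. The assumptions are that the field has a fixed width in bytes, is null-padded, and that oversized values are rejected rather than truncated.

```python
class EncodedField:
    """Hypothetical fixed-width text field parameterized by encoding."""

    def __init__(self, encoding, nbytes):
        self.encoding = encoding  # any codec name Python knows
        self.nbytes = nbytes      # width of the field in bytes, not characters

    def pack(self, text):
        """Encode a Python str and null-pad it to the field width."""
        raw = text.encode(self.encoding)
        if len(raw) > self.nbytes:
            raise ValueError(
                "%r needs %d bytes, field holds %d" % (text, len(raw), self.nbytes)
            )
        return raw.ljust(self.nbytes, b"\x00")

    def unpack(self, raw):
        """Decode the stored bytes, then drop the trailing null padding."""
        return raw.decode(self.encoding).rstrip("\x00")


latin1 = EncodedField("latin-1", 12)
utf8 = EncodedField("utf-8", 12)
utf16 = EncodedField("utf-16-le", 20)

stored = latin1.pack("some text")
print(stored)                 # b'some text\x00\x00\x00'
print(latin1.unpack(stored))  # some text
```

The same two methods serve every encoding, which is the appeal: the compact case and the interchange case differ only in the parameter.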
It seems this would support all our desires:
* I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.
* Thomas would get latin-1 for binary interchange with mostly-ascii data.
* The HDF-5 folks would get utf-8 for binary interchange (if we can work out
the null-padding issue).
* Even folks that had weird Java or Windows-generated UTF-16 data files could
do the binary interchange thing....
I'm now lost as to what the hang-up is.
-CHB
PS: null padding is a pain; python strings seem to preserve the zeros, which
is odd -- is there a unicode code-point at x00?
But you can use it to strip properly with the unicode sandwich:
In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00'
In [64]: ut16.decode('utf-16')
Out[64]: 'some text\x00\x00\x00'
In [65]: ut16.decode('utf-16').strip('\x00')
Out[65]: 'some text'
In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16')
Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'
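For what it's worth, U+0000 is a real code point (a control character), which is why the decoded string keeps the zeros. The sketch below illustrates why the stripping has to happen on the decoded text rather than on the raw bytes. It uses "utf-16-le" (no BOM) so the byte layout does not depend on the platform, and the variable names are just for illustration.

```python
import unicodedata

# U+0000 is an ordinary code point in the "control" category.
print(unicodedata.category("\x00"))  # Cc

# Nine characters of UTF-16 (18 bytes) followed by six bytes of null padding.
padded = "some text".encode("utf-16-le") + b"\x00" * 6

# Stripping at the byte level also eats the zero high byte of the final 't',
# leaving an odd number of bytes that is no longer valid UTF-16.
stripped_bytes = padded.rstrip(b"\x00")
try:
    stripped_bytes.decode("utf-16-le")
    byte_strip_ok = True
except UnicodeDecodeError:
    byte_strip_ok = False
print(len(stripped_bytes), byte_strip_ok)  # 17 False

# The unicode sandwich: decode first, strip the null characters, re-encode.
text = padded.decode("utf-16-le").rstrip("\x00")
print(text)  # some text
```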
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov