[Numpy-discussion] proposal: smaller representation of string arrays

Julian Taylor jtaylor.debian at googlemail.com
Wed Apr 26 05:15:36 EDT 2017


On 26.04.2017 03:55, josef.pktd at gmail.com wrote:
> On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>>
>>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
>>> <chris.barker at noaa.gov> wrote:
>>>
>>>>> Presumably you're getting byte strings (with unknown encoding).
>>>>
>>>> No -- this is for creating and using mostly ascii string data with
>>>> python and numpy.
>>>>
>>>> Unknown encoding bytes belong in byte arrays -- they are not text.
>>>
>>> You are welcome to try to convince Thomas of that. That is the status quo
>>> for him, but he is finding that difficult to work with.
>>>
>>>> I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii,
>>>> with a few extra characters" data. With all the sloppiness over the years,
>>>> there are way too many files like that.
>>>
>>> That sloppiness that you mention is precisely the "unknown encoding"
>>> problem. Your previous advocacy has also touched on using latin-1 to decode
>>> existing files with unknown encodings as well. If you want to advocate for
>>> using latin-1 only for the creation of new data, maybe stop talking about
>>> existing files? :-)
>>>
>>>> Note: the primary use-case I have in mind is working with ascii text in
>>>> numpy arrays efficiently-- folks have called for that. All I'm saying is use
>>>> Latin-1 instead of ascii -- that buys you some useful extra characters.
>>>
>>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
>>> buys you a whole bunch of useful extra characters. ;-)
>>>
>>> There are several use cases being brought forth here. Some involve file
>>> reading, some involve file writing, and some involve in-memory manipulation.
>>> Whatever change we make is going to impinge somehow on all of the use cases.
>>> If all we do is add a latin-1 dtype for people to use to create new
>>> in-memory data, then someone is going to use it to read existing data in
>>> unknown or ambiguous encodings.
>>
>>
>>
>> The maximum length of a UTF-8 character is 4 bytes, so we could use that to
>> size arrays by character length. The advantage over UTF-32 is that it is
>> easily compressible, probably by a factor of 4 in many cases. That doesn't
>> solve the in memory problem, but does have some advantages on disk as well
>> as making for easy display. We could compress it ourselves after encoding by
>> truncation.
>>
>> Note that for terminal display we will want something supported by the
>> system, which is another problem altogether. Let me break the problem down
>> into four categories
>>
>> Storage -- hdf5, .npy, fits, etc.
>> Display -- ?
>> Modification -- editing
>> Parsing -- fits, etc.
>>
>> There is probably no one solution that is optimal for all of those.
>>
>> Chuck
>>
> 
> 
> quoting Julian
> 
> '''
> I probably should have formulated my goal with the proposal a bit better, I am
> not very interested in a repetition of which encoding to use debate.
> In the end what will be done allows any encoding via a dtype with
> metadata like datetime.
> This allows any codec (including truncated utf8) to be added easily (if
> python supports it) and allows sidestepping the debate.
> 
> My main concern is whether it should be a new dtype or modifying the
> unicode dtype. Though the backward compatibility argument is strongly in
> favour of adding a new dtype that makes the np.unicode type redundant.
> '''
> 
> I don't quite understand why this discussion goes in a direction of an
> either one XOR the other dtype.
> 
> I thought the parameterized 1-byte encoding that Julian mentioned
> initially sounded useful.
> 
> (I'm not sure I will use it much, but I also don't use float16 )
> 
> Josef

Indeed, most of this discussion is irrelevant to numpy.
Numpy only really deals with the in-memory storage of strings, and in
that it is limited to fixed-length strings (measured in bytes or
codepoints).
How you get your messy strings into numpy arrays is not very relevant to
the discussion of a smaller representation of strings.
You can't get messy strings into numpy today without first sorting them
out yourself, and you won't be able to afterwards either.
Numpy will offer a set of encodings, the user chooses which one is best
for the use case, and if the user screws it up, it is not numpy's problem.
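As a rough illustration of the fixed-length storage described above,
here is a minimal sketch using only the existing S and U dtypes. The
variable names are made up for the example; nothing here reflects the
proposed new dtype.

```python
import numpy as np

# U reserves 4 bytes (one UTF-32 code unit) per code point,
# S reserves 1 byte per element.
u = np.array(["abc", "de"], dtype="U3")
s = np.array([b"abc", b"de"], dtype="S3")

u_itemsize = u.itemsize  # 3 code points * 4 bytes each
s_itemsize = s.itemsize  # 3 bytes

# Input longer than the declared length is truncated silently on
# assignment; numpy does not raise here.
t = np.zeros(1, dtype="U3")
t[0] = "abcdef"
truncated = str(t[0])
```

The silent truncation is the reason the element length has to be chosen
by the user up front.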

You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping, which I seriously doubt anyone actually does for the S
and U dtypes

Having a new dtype changes nothing here. You still need to create numpy
arrays from python strings, which are well defined and clean.
If you put something in that doesn't encode, you get an encoding error.
No oddities like surrogate escapes are needed; numpy arrays are not
interfaces to operating systems, nor does numpy need to _add_ support for
historical oddities beyond what it already has.
If you want to represent bytes exactly as they came in, don't use a text
dtype (which includes the S dtype); use i1.
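A small sketch of both points, assuming current numpy behaviour: a
python string that does not fit the chosen codec raises, and the S dtype
is not byte-exact because trailing NUL bytes are dropped when an element
is read back. The sample values are arbitrary.

```python
import numpy as np

# A clean Python str either encodes in the chosen codec or raises.
try:
    "\u03a9".encode("latin-1")  # GREEK CAPITAL LETTER OMEGA is not in Latin-1
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The S dtype strips trailing NUL bytes on read-back, so it does not
# preserve arbitrary bytes exactly.
raw = b"ab\x00"
as_s = bytes(np.array([raw], dtype="S3")[0])

# An integer dtype keeps every byte exactly as it came in.
as_i1 = np.frombuffer(raw, dtype="i1")
round_trip = as_i1.tobytes()
```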

Concerning variable sized strings, this is simply not going to happen.
Nobody is going to rewrite numpy to support it, especially not just for
something as unimportant as strings.
The best you are going to get (or rather, already have) is object
arrays. It makes no sense to discuss it unless someone comes up with an
actual proposal and the willingness to code it.
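For comparison, a minimal sketch of the object-array route next to the
fixed-length U dtype. The word list is invented for the example.

```python
import numpy as np

words = ["a", "a considerably longer string"]

# Fixed-length: every element reserves room for the longest string.
fixed = np.array(words)
# Object array: every element is a reference to an ordinary Python str.
boxed = np.array(words, dtype=object)

fixed_len = fixed.dtype.itemsize // 4  # code points reserved per element
same_object = boxed[1] is words[1]     # no copy of the text is stored
```

The object array avoids the padding, at the cost of one pointer plus a
full python string object per element.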


What is a relevant discussion is whether we really need a more compact
but limited representation of text than 4-byte utf32 at all.
Its use case is for the most part just python3 porting and saving
some memory in some ascii-heavy cases, e.g. astronomy.
It is not that significant anymore, as porting to python3 has mostly
already happened via the ugly byte workaround, and the memory saving is
probably not that significant in the context of numpy, which is already
heavy on memory usage.
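To put a number on the saving, a back-of-the-envelope sketch; the array
size and string length are arbitrary assumptions, and S stands in for
any 1-byte-per-character representation.

```python
import numpy as np

n, length = 1_000_000, 10  # one million strings of up to 10 characters

u_bytes = np.dtype(f"U{length}").itemsize * n  # 4 bytes per character
s_bytes = np.dtype(f"S{length}").itemsize * n  # 1 byte per character

ratio = u_bytes // s_bytes
```

That is 40 MB against 10 MB for this example, a factor of four.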

My initial approach was not to add a new dtype but to make unicode
parametrizable, which would have meant almost no cluttering of numpy's
internals and would have kept the api more or less consistent, making
this a relatively simple addition of minor functionality for people who
want it.
But adding a completely new, partially redundant dtype for this use case
may be too large a change to the api. Having two partially redundant
string types may confuse users more than our current status quo of a
single string type (U).
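The quoted text compares the parametrization to datetime, so here is a
sketch of how the existing datetime64 dtype carries its metadata. The
string spelling in the final comment is purely hypothetical and is not
valid numpy; it only shows the shape of the idea.

```python
import numpy as np

# One dtype kind, parametrized by a unit stored as dtype metadata.
ms = np.dtype("datetime64[ms]")
days = np.dtype("datetime64[D]")

same_kind = ms.kind == days.kind == "M"
different_dtype = ms != days
unit_ms = np.datetime_data(ms)  # (unit, count)

# Hypothetical analogue for text, NOT an existing numpy spelling:
#   np.dtype("U10[latin1]")
```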

Discussing whether we want to support truncated utf8 has some merit, as
it is a decision on whether to give users an even larger gun to shoot
themselves in the foot with.
But I'd like to focus first on the 1-byte type, to add a symmetric API
for python2 and python3.
utf8 can always be added later should we deem it a good idea.
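The foot-gun can be seen in plain python, without numpy: a sketch of
what happens when utf8 bytes are cut to a fixed byte length, compared
with a 1-byte codec. The sample word and the 3-byte slot size are
arbitrary.

```python
text = "na\u00efve"               # 'naive' with i-diaeresis, 5 code points
encoded = text.encode("utf-8")    # 6 bytes, the accented letter takes two
cut = encoded[:3]                 # a 3-byte slot splits that letter in half

try:
    cut.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

# With a 1-byte codec, truncating bytes is the same as truncating
# characters, so the result always decodes.
latin_cut = text.encode("latin-1")[:3].decode("latin-1")
```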

cheers,
Julian
