On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:

> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I seriously doubt anyone actually does for the S
> and U dtype

I fear that you decided that the discussion was irrelevant and thus did not read it, rather than reading it to decide that it was not relevant. Several of us have shown that, yes indeed, we do memory-map string arrays.
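For concreteness, here is a minimal sketch of that kind of use, assuming a hypothetical file of fixed-width, NUL-padded 8-byte records (the file name and record contents are made up for illustration). It only uses `np.memmap` with an `S` dtype as it exists today:

```python
import os
import tempfile

import numpy as np

# Hypothetical data: three records, each padded with NULs to 8 bytes.
records = [b"alpha", b"beta", b"gamma"]
raw = b"".join(r.ljust(8, b"\x00") for r in records)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.bin")
    with open(path, "wb") as f:
        f.write(raw)

    # Map the file directly as an array of 8-byte strings; no copy, no loop.
    names = np.memmap(path, dtype="S8", mode="r")
    count = names.shape[0]
    loaded = names.tolist()  # copy out so the map can be released

    # Drop the map before the temporary directory is removed.
    del names
```

The point is that the on-disk layout dictates the in-memory dtype here, so whatever encodings the dtype supports have to match what is already in the file.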

You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays.

Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am.

> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings which are well defined and clean.
> If you put something in that doesn't encode you get an encoding error.
> No oddities like surrogate escapes are needed, numpy arrays are not
> interfaces to operating systems nor does numpy need to _add_ support for
> historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in don't use a text
> dtype (which includes the S dtype, use i1).

Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities.

> Concerning variable sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> Best you are going to get (or better already have) is object arrays. It
> makes no sense to discuss it unless someone comes up with an actual
> proposal and the willingness to code it.

No one has suggested such a thing. At most, we've talked about specializing object arrays.

> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its usecase is for the most part just for python3 porting and saving
> some memory in some ascii heavy cases, e.g. astronomy.
> It is not that significant anymore as porting to python3 has mostly
> already happened via the ugly byte workaround and memory saving is
> probably not as significant in the context of numpy which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable which would have meant almost no cluttering of numpys
> internals and keeping the api more or less consistent which would make
> this a relatively simple addition of minor functionality for people that
> want it.
> But adding a completely new partially redundant dtype for this usecase
> may be a too large change to the api. Having two partially redundant
> string types may confuse users more than our current status quo of our
> single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit as
> it is a decision whether to give the users an even larger gun to shot
> themselves in the foot with.
> But I'd like to focus first on the 1 byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added latter should we deem it a good idea.

What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date?

If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away.
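To illustrate what ASCII-surrogateescape buys, here is a small sketch in plain Python (no numpy; the sample bytes are made up). A strict decode rejects a stray non-ASCII byte, while the `surrogateescape` error handler keeps it as a lone surrogate so the original bytes can be written back unchanged:

```python
# Hypothetical field from a legacy file: mostly ASCII with one latin-1 byte.
raw = b"caf\xe9 M31"

# Strict decoding refuses the data outright.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9 ...
text = raw.decode("ascii", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

The ASCII portion is usable as text throughout, and nothing is lost when the data goes back out to the file.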

--
Robert Kern