[Numpy-discussion] Does a `mergesorted` function make sense?

Jaime Fernández del Río jaime.frio at gmail.com
Thu Sep 4 13:55:18 EDT 2014


On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:

>
> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
> hoogendoorn.eelco at gmail.com> wrote:
>
>>
>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <
>> jaime.frio at gmail.com> wrote:
>>
>>>  On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>
>>>>>  Not sure about the hashing. Indeed one can also build an index of a
>>>>> set by means of a hash table, but it's questionable whether this leads to
>>>>> improved performance over performing an argsort. Hashing may have better
>>>>> asymptotic time complexity in theory, but many datasets used in practice
>>>>> are very easy to sort (O(N)-ish), and the time-constant of hashing is
>>>>> higher. But more importantly, using a hash-table guarantees poor cache
>>>>> behavior for many operations using this index. By contrast, sorting may
>>>>> (but need not) make one random access pass to build the index, and may (but
>>>>> need not) perform one random access to reorder values for grouping. But
>>>>> insofar as the keys are better behaved than pure random, this coherence
>>>>> will be exploited.
>>>>>
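>>>>> To make that concrete, the sorting-based index boils down to something
>>>>> like the following (just a sketch; `keys` and `values` stand in for
>>>>> whatever arrays are being grouped):
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> keys = np.random.randint(100, size=10000)
>>>>> values = np.random.rand(10000)
>>>>>
>>>>> # the one (potentially) random-access pass: building the index
>>>>> order = np.argsort(keys, kind='mergesort')
>>>>> # the one (potentially) random-access gather: reordering the values
>>>>> sorted_values = values[order]
>>>>> # from here on, every group is a contiguous, cache-friendly slice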
>>>>
>>>> If you want to give it a try, this branch of my numpy fork has hash
>>>> table based implementations of unique (with no extra indices) and in1d:
>>>>
>>>>  https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>
>>>> A use case where the hash table is clearly better:
>>>>
>>>> In [1]: import numpy as np
>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>
>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>> In [4]: %timeit np.unique(a)
>>>> 1000 loops, best of 3: 258 us per loop
>>>> In [5]: %timeit _unique(a)
>>>> 10000 loops, best of 3: 143 us per loop
>>>> In [6]: %timeit np.sort(_unique(a))
>>>> 10000 loops, best of 3: 149 us per loop
>>>>
>>>> It typically runs between 1.5x and 4x faster than sorting. I
>>>> haven't profiled it properly to know, but there may be quite a bit of
>>>> performance still to dig out: type-specific comparison functions,
>>>> optimizing the starting hash table size based on the size of the array
>>>> to avoid reinsertions...
>>>>
>>>> If getting the elements sorted is a necessity, and the array contains
>>>> very few or no repeated items, then the hash table approach may even
>>>> perform worse:
>>>>
>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>> In [9]: %timeit np.unique(a)
>>>> 1000 loops, best of 3: 277 us per loop
>>>> In [10]: %timeit np.sort(_unique(a))
>>>> 1000 loops, best of 3: 320 us per loop
>>>>
>>>> But the hash table still wins in extracting the unique items only:
>>>>
>>>> In [11]: %timeit _unique(a)
>>>> 10000 loops, best of 3: 187 us per loop
>>>>
>>>> Where the hash table shines is in more elaborate situations. If you
>>>> keep the first index where it was found, and the number of repeats, in the
>>>> hash table, you can get return_index and return_counts almost for free,
>>>> which means you are performing an extra 3x faster than with sorting.
>>>> return_inverse requires a little more trickery, so I won't attempt to
>>>> quantify the improvement. But I wouldn't be surprised if, after fine-tuning
>>>> it, there is close to an order of magnitude overall improvement.
>>>>
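>>>> In pure Python, the bookkeeping amounts to something like this (a toy
>>>> sketch, with a dict standing in for the C hash table):
>>>>
>>>> import numpy as np
>>>>
>>>> def unique_with_extras(a):
>>>>     table = {}  # value -> [first index, number of repeats]
>>>>     for i, x in enumerate(a.tolist()):
>>>>         entry = table.get(x)
>>>>         if entry is None:
>>>>             table[x] = [i, 1]
>>>>         else:
>>>>             entry[1] += 1
>>>>     uniques = np.fromiter(table.keys(), a.dtype, len(table))
>>>>     first = np.array([v[0] for v in table.values()])
>>>>     counts = np.array([v[1] for v in table.values()])
>>>>     return uniques, first, counts
>>>>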
>>>> The speed-up for in1d is also nice:
>>>>
>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>> In [18]: %timeit np.in1d(a, b)
>>>> 1000 loops, best of 3: 178 us per loop
>>>> In [19]: %timeit _in1d(a, b)
>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>
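>>>> which makes sense: all it has to do is build a table from `b` once and
>>>> then test membership for each element of `a`. A rough Python equivalent
>>>> (a sketch, not the actual C implementation):
>>>>
>>>> import numpy as np
>>>>
>>>> def in1d_hash(a, b):
>>>>     table = set(b.tolist())  # O(len(b)) one-time build
>>>>     return np.fromiter((x in table for x in a.tolist()),
>>>>                        dtype=bool, count=len(a))
>>>>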
>>>> Of course, there is no point in
>>>>
>>>
>>> Ooops!!! Hit the send button too quick. Not to extend myself too long:
>>> if we are going to rethink all of this, we should approach it with an open
>>> mind. Still, and this post is not helping with that either, I am afraid
>>> that we are discussing implementation details, but are missing a broader
>>> vision of what we want to accomplish and why. That vision of what numpy's
>>> grouping functionality, if any, should be, and how it complements or
>>> conflicts with what pandas is providing, should precede anything else. I
>>> know I haven't, but has anyone looked at how pandas implements grouping?
>>> Their documentation on the subject is well worth a read:
>>>
>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>
>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>
>>> Jaime
>>>
>>> --
>>> (\__/)
>>> ( O.o)
>>> ( > <) This is Conejo. Copy Conejo into your signature and help him in
>>> his plans for world domination.
>>>
>> I would certainly not be opposed to having a hashing based indexing
>> mechanism; I think it would make sense design-wise to have a HashIndex
>> class with the same interface as the rest, and use that subclass in those
>> arraysetops where it makes sense. The 'how to' of indexing and its
>> applications are largely orthogonal I think (with some tiny performance
>> compromises which are worth the abstraction imo). For datasets which are
>> not purely random, have many unique items, and which do not fit into cache,
>> I would expect sorting to come out on top, but indeed it depends on the
>> dataset.
>>
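>> Interface-wise I am thinking of something along these lines (names are
>> only meant to illustrate the idea, nothing is settled):
>>
>> class HashIndex:
>>     """Same interface as the sorting-based index classes, but backed
>>     by a hash table; unique values come out in arbitrary order."""
>>     def __init__(self, keys): ...
>>     @property
>>     def unique(self): ...   # the unique keys
>>     @property
>>     def count(self): ...    # number of occurrences of each unique key
>>     @property
>>     def inverse(self): ...  # maps each key to its position in .unique
>>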
>> Yeah, the question of how pandas does grouping, and whether we can do
>> better, is a relevant one.
>>
>> From what I understand, pandas relies on Cython extensions to get
>> vectorized grouping functionality. This is no longer necessary since the
>> introduction of ufuncs in numpy. I don't know how the implementations
>> compare in terms of performance, but I doubt the difference is huge.
>>
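>> For example, a grouped sum needs no compiled helper anymore; with the
>> unbuffered ufunc methods it is a one-liner (np.add.at shown here, and
>> np.add.reduceat does the same job on sorted keys):
>>
>> import numpy as np
>>
>> keys = np.array([2, 0, 1, 0, 2, 1])
>> values = np.array([1., 2., 3., 4., 5., 6.])
>> sums = np.zeros(3)
>> np.add.at(sums, keys, values)  # unbuffered scatter-add: sums == [6., 9., 6.]
>>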
>> I personally use grouping a lot in my code, and I don't like having to
>> use pandas for it. Most importantly, I don't want to go around creating a
>> dataframe for a single one-line hit-and-run association between keys and
>> values. The permanent association of different types of data and their
>> metadata which pandas offers is I think the key difference from numpy,
>> which is all about manipulating just plain ndarrays. Arguably, grouping
>> itself is a pretty elementary manipulation of ndarrays, and adding calls to
>> DataFrame or Series in between a statement that could simply be
>> group_by(keys).mean(values) feels wrong to me. As does including
>> pandas as a dependency just to use this small piece of
>> functionality. Grouping is a more general functionality than any particular
>> method of organizing your data.
>>
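>> To be explicit about what that one-liner would have to do under the hood,
>> here is a rough sketch of a sorting-based implementation (group_by itself
>> is the hypothetical API):
>>
>> import numpy as np
>>
>> def group_by_mean(keys, values):
>>     order = np.argsort(keys, kind='mergesort')
>>     sk, sv = keys[order], values[order]
>>     # group boundaries: positions where the sorted key changes
>>     starts = np.flatnonzero(np.concatenate(([True], sk[1:] != sk[:-1])))
>>     counts = np.diff(np.append(starts, len(sk)))
>>     return sk[starts], np.add.reduceat(sv, starts) / counts
>>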
>> In terms of features, adding transformations and filtering might be nice
>> too; I hadn't thought about it, but that is because, unlike the currently
>> implemented features, the need has never arisen for me. I'm only a sample
>> size of one, and I don't see any fundamental objection to adding such
>> functionality though. It certainly raises the question as to where to draw
>> the line with pandas; but my rule of thumb is that if you can think of it
>> as an elementary operation on ndarrays, then it probably belongs in numpy.
>>
>>
> Oh I forgot to add: with an indexing mechanism based on sorting, unique
> values and counts also come 'for free', not counting the O(N) cost of
> actually creating those arrays. The only time an operation relying on an
> index incurs another nontrivial amount of overhead is when its 'rank' or
> 'inverse' property is used, which invokes another argsort. But for the vast
> majority of grouping or set operations, these properties are never used.
>
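> For completeness, this is what the 'inverse' property costs: one extra
> argsort to map the group ranks back to the original order (sketch):
>
> order = np.argsort(keys, kind='mergesort')
> sk = keys[order]
> flag = np.concatenate(([True], sk[1:] != sk[:-1]))
> # np.argsort(order) is the second argsort referred to above
> inverse = np.cumsum(flag)[np.argsort(order)] - 1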

That extra argsort is now gone from master:

https://github.com/numpy/numpy/pull/5012

Even with this improvement, returning any index typically makes `np.unique`
run 1.5x to 2x slower:

In [1]: import numpy as np
In [2]: a = np.random.randint(100, size=(1000,))
In [3]: %timeit np.unique(a)
10000 loops, best of 3: 37.3 us per loop
In [4]: %timeit np.unique(a, return_inverse=True)
10000 loops, best of 3: 62.1 us per loop
In [5]: %timeit np.unique(a, return_index=True)
10000 loops, best of 3: 72.8 us per loop
In [6]: %timeit np.unique(a, return_counts=True)
10000 loops, best of 3: 56.4 us per loop
In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
10000 loops, best of 3: 112 us per loop

-- 
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his
plans for world domination.