[Numpy-discussion] Does a `mergesorted` function make sense?

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Thu Sep 4 14:14:22 EDT 2014


I should clarify: I am speaking about my implementation; I haven't looked at
the numpy implementation for a while, so I'm not sure what it is up to. Note
that by 'almost free' we are still talking about three passes over the
whole array plus temp allocations, but I am assuming a use case where the
various sorts involved are the dominant cost, which I imagine they are for
out-of-cache sorts. Perhaps this isn't too realistic an assumption about
the average use case, though; I don't know. Still, I suppose it's a
reasonable guideline to assume that either the dataset is big, or
performance isn't that big a concern in the first place.
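
For concreteness, a minimal sketch of the three passes I mean, in plain
numpy (sorting_index is a name I'm making up for illustration, not an
existing API):

import numpy as np

def sorting_index(keys):
    sorter = np.argsort(keys)                 # pass 1: the dominant sort
    sorted_keys = keys[sorter]                # pass 2: gather into sorted order
    flag = np.empty(len(keys), dtype=bool)    # pass 3: flag group boundaries
    flag[0] = True
    np.not_equal(sorted_keys[1:], sorted_keys[:-1], out=flag[1:])
    return sorter, sorted_keys, flag

keys = np.random.randint(10, size=100000)
sorter, sorted_keys, flag = sorting_index(keys)
unique = sorted_keys[flag]                    # unique values, 'almost free'
starts = np.flatnonzero(flag)
counts = np.diff(np.append(starts, len(keys)))  # group sizes, also cheap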


On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <
jaime.frio at gmail.com> wrote:

> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
> hoogendoorn.eelco at gmail.com> wrote:
>
>>
>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>>
>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>>  On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <
>>>> jaime.frio at gmail.com> wrote:
>>>>
>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>
>>>>>>  Not sure about the hashing. Indeed one can also build an index of a
>>>>>> set by means of a hash table, but it's questionable whether this leads to
>>>>>> improved performance over performing an argsort. Hashing may have better
>>>>>> asymptotic time complexity in theory, but many datasets used in practice
>>>>>> are very easy to sort (O(N)-ish), and the constant factor of hashing is
>>>>>> higher. More importantly, using a hash table guarantees poor cache
>>>>>> behavior for many operations using this index. By contrast, sorting may
>>>>>> (but need not) make one random-access pass to build the index, and may (but
>>>>>> need not) perform one random-access pass to reorder values for grouping; and
>>>>>> insofar as the keys are better behaved than pure random, this coherence
>>>>>> will be exploited.
>>>>>>
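>>>>>> For illustration, a dict-based sketch of the hash-table alternative
>>>>>> under discussion (hash_unique is a hypothetical name, not the API of
>>>>>> any branch):
>>>>>>
>>>>>> import numpy as np
>>>>>>
>>>>>> def hash_unique(a):
>>>>>>     # expected O(N), but every probe is a random memory access
>>>>>>     seen = dict.fromkeys(a.tolist())
>>>>>>     return np.fromiter(seen, dtype=a.dtype, count=len(seen))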
>>>>>
>>>>> If you want to give it a try, this branch of my numpy fork has hash
>>>>> table based implementations of unique (with no extra indices) and in1d:
>>>>>
>>>>>  https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>
>>>>> A use case where the hash table is clearly better:
>>>>>
>>>>> In [1]: import numpy as np
>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>
>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>> In [4]: %timeit np.unique(a)
>>>>> 1000 loops, best of 3: 258 us per loop
>>>>> In [5]: %timeit _unique(a)
>>>>> 10000 loops, best of 3: 143 us per loop
>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>
>>>>> It typically performs between 1.5x and 4x faster than sorting. I
>>>>> haven't profiled it properly, but there may be quite a bit of
>>>>> performance left to dig out: type-specific comparison functions,
>>>>> optimizing the starting hash table size based on the size of the array
>>>>> to avoid reinsertions...
>>>>>
>>>>> If getting the elements sorted is a necessity, and the array contains
>>>>> very few or no repeated items, then the hash table approach may even
>>>>> perform worse:
>>>>>
>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>> In [9]: %timeit np.unique(a)
>>>>> 1000 loops, best of 3: 277 us per loop
>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>
>>>>> But the hash table still wins in extracting the unique items only:
>>>>>
>>>>> In [11]: %timeit _unique(a)
>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>
>>>>> Where the hash table shines is in more elaborate situations. If you
>>>>> keep the first index where each item was found, and the number of repeats,
>>>>> in the hash table, you can get return_index and return_counts almost for
>>>>> free, which leaves you roughly another 3x faster than with sorting.
>>>>> return_inverse requires a little more trickery, so I won't attempt to
>>>>> quantify the improvement. But I wouldn't be surprised if, after fine-tuning
>>>>> it, there is close to an order of magnitude overall improvement.
>>>>>
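>>>>> Roughly, the bookkeeping amounts to this (a pure Python sketch with
>>>>> made-up names, not the actual C implementation):
>>>>>
>>>>> def unique_index_counts(a):
>>>>>     table = {}                        # key -> [first index, count]
>>>>>     for i, x in enumerate(a.tolist()):
>>>>>         entry = table.get(x)
>>>>>         if entry is None:
>>>>>             table[x] = [i, 1]         # first sighting: record position
>>>>>         else:
>>>>>             entry[1] += 1             # repeat: just bump the count
>>>>>     values = np.fromiter(table, dtype=a.dtype, count=len(table))
>>>>>     index = np.array([v[0] for v in table.values()])
>>>>>     counts = np.array([v[1] for v in table.values()])
>>>>>     return values, index, counts
>>>>>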
>>>>> The speed-up for in1d is also nice:
>>>>>
>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>> In [18]: %timeit np.in1d(a, b)
>>>>> 1000 loops, best of 3: 178 us per loop
>>>>> In [19]: %timeit _in1d(a, b)
>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>
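>>>>> The hash version of in1d is just a membership test against a set built
>>>>> from the second array; as a sketch (again with a made-up name):
>>>>>
>>>>> def hash_in1d(a, b):
>>>>>     lookup = set(b.tolist())          # one pass over b to build the set
>>>>>     return np.fromiter((x in lookup for x in a.tolist()),
>>>>>                        dtype=bool, count=len(a))
>>>>>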
>>>>> Of course, there is no point in
>>>>>
>>>>
>>>> Ooops!!! Hit the send button too quickly. Not to extend myself too long:
>>>> if we are going to rethink all of this, we should approach it with an open
>>>> mind. Still, and this post is not helping with that either, I am afraid
>>>> that we are discussing implementation details but missing a broader
>>>> vision of what we want to accomplish and why. That vision of what numpy's
>>>> grouping functionality, if any, should be, and how it complements or
>>>> conflicts with what pandas is providing, should precede anything else. I
>>>> know I haven't, but has anyone looked at how pandas implements grouping?
>>>> Their documentation on the subject is well worth a read:
>>>>
>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>
>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>
>>>> Jaime
>>>>
>>>> --
>>>> (\__/)
>>>> ( O.o)
>>>> ( > <) This is Conejo. Copy Conejo into your signature and help him in
>>>> his plans for world domination.
>>>>
>>>
>>> I would certainly not be opposed to having a hashing-based indexing
>>> mechanism; I think it would make sense design-wise to have a HashIndex
>>> class with the same interface as the rest, and use that subclass in those
>>> arraysetops where it makes sense. The 'how' of indexing and its
>>> applications are largely orthogonal, I think (with some tiny performance
>>> compromises which are worth the abstraction, imo). For datasets which are
>>> not purely random, have many unique items, and do not fit into cache,
>>> I would expect sorting to come out on top, but indeed it depends on the
>>> dataset.
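>>>
>>> Roughly what I have in mind, as a minimal sketch (the class and its
>>> properties here are hypothetical, mirroring the interface of the
>>> sorting-based index):
>>>
>>> import numpy as np
>>>
>>> class HashIndex:
>>>     def __init__(self, keys):
>>>         table = {}                    # key -> list of occurrence indices
>>>         for i, k in enumerate(keys.tolist()):
>>>             table.setdefault(k, []).append(i)
>>>         self._table = table
>>>         self._dtype = keys.dtype
>>>
>>>     @property
>>>     def unique(self):
>>>         return np.fromiter(self._table, dtype=self._dtype,
>>>                            count=len(self._table))
>>>
>>>     @property
>>>     def count(self):
>>>         return np.array([len(v) for v in self._table.values()])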
>>>
>>> Yeah, the question of how pandas does grouping, and whether we can do
>>> better, is a relevant one.
>>>
>>> From what I understand, pandas relies on Cython extensions to get
>>> vectorized grouping functionality. This is no longer necessary since the
>>> introduction of ufuncs in numpy. I don't know how the implementations
>>> compare in terms of performance, but I doubt the difference is huge.
>>>
>>> I personally use grouping a lot in my code, and I don't like having to
>>> use pandas for it. Most importantly, I don't want to go around creating a
>>> dataframe for a single one-line hit-and-run association between keys and
>>> values. The permanent association of different types of data and their
>>> metadata which pandas offers is, I think, the key difference from numpy,
>>> which is all about manipulating plain ndarrays. Arguably, grouping
>>> itself is a pretty elementary manipulation of ndarrays, and adding calls to
>>> DataFrame or Series in between a statement that could simply be
>>> group_by(keys).mean(values) feels wrong to me. As does including
>>> pandas as a dependency just to use this small piece of
>>> functionality. Grouping is a more general functionality than any particular
>>> method of organizing your data.
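>>>
>>> To make that concrete, such a one-liner reduces to a few lines of plain
>>> numpy via ufunc.reduceat (group_by_mean is a name I'm making up here,
>>> not an existing API):
>>>
>>> import numpy as np
>>>
>>> def group_by_mean(keys, values):
>>>     sorter = np.argsort(keys)              # one sort dominates the cost
>>>     sk, sv = keys[sorter], values[sorter]
>>>     starts = np.flatnonzero(np.append(True, sk[1:] != sk[:-1]))
>>>     sums = np.add.reduceat(sv, starts)     # vectorized per-group sums
>>>     counts = np.diff(np.append(starts, len(sk)))
>>>     return sk[starts], sums / counts       # unique keys, per-group means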
>>>
>>> In terms of features, adding transformations and filtering might be nice
>>> too; I hadn't thought about it, but that is because, unlike the currently
>>> implemented features, the need has never arisen for me. I'm only a
>>> sample size of one, though, and I don't see any fundamental objection to
>>> adding such functionality. It certainly raises the question of where to
>>> draw the line with pandas; but my rule of thumb is that if you can think of
>>> it as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>
>>>
>> Oh I forgot to add: with an indexing mechanism based on sorting, unique
>> values and counts also come 'for free', not counting the O(N) cost of
>> actually creating those arrays. The only time an operation relying on an
>> index incurs another nontrivial amount of overhead is when its 'rank' or
>> 'inverse' property is used, which invokes another argsort. But for the vast
>> majority of grouping or set operations, these properties are never used.
>>
>
> That extra argsort is now gone from master:
>
> https://github.com/numpy/numpy/pull/5012
>
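> Schematically, the fix amounts to scattering group ids back through the
> sorting permutation instead of argsorting it a second time; a sketch with
> illustrative names, not the literal diff:
>
> import numpy as np
>
> def inverse_from_sort(sorter, flag):
>     # flag marks the start of each group of equal keys, in sorted order
>     inv_sorted = np.cumsum(flag) - 1   # group id per element, sorted order
>     inverse = np.empty(len(flag), dtype=np.intp)
>     inverse[sorter] = inv_sorted       # scatter back: no second argsort
>     return inverse
>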
> Even with this improvement, returning any index typically makes
> `np.unique` run at least 2x slower:
>
> In [1]: import numpy as np
> In [2]: a = np.random.randint(100, size=(1000,))
> In [3]: %timeit np.unique(a)
> 10000 loops, best of 3: 37.3 us per loop
> In [4]: %timeit np.unique(a, return_inverse=True)
> 10000 loops, best of 3: 62.1 us per loop
> In [5]: %timeit np.unique(a, return_index=True)
> 10000 loops, best of 3: 72.8 us per loop
> In [6]: %timeit np.unique(a, return_counts=True)
> 10000 loops, best of 3: 56.4 us per loop
> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
> 10000 loops, best of 3: 112 us per loop
>
> --
> (\__/)
> ( O.o)
> ( > <) This is Conejo. Copy Conejo into your signature and help him in his
> plans for world domination.
>