[Numpy-discussion] Does a `mergesorted` function make sense?

Eelco Hoogendoorn hoogendoorn.eelco at gmail.com
Thu Sep 4 15:19:14 EDT 2014


Naturally, you'd want to avoid redoing the indexing where you can, which is
another good reason to factor out the indexing mechanisms into separate
classes. A factor-of-two performance difference does not get me too excited;
again, I think it would be the other way around for an out-of-cache dataset
being grouped. But this by itself is of course another argument for
factoring out the indexing behind a uniform interface, so we can play
around with those implementation details later and specialize the indexing
to serve different scenarios. It also really helps with code
maintainability; most arraysetops are almost trivial to implement once you
have abstracted away the indexing machinery.
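
To make that concrete, here is a rough sketch of the kind of uniform
indexing interface I have in mind (names and details are illustrative only,
not the actual implementation):

import numpy as np

class SortingIndex(object):
    """Index over a set of keys, built by sorting; sketch only."""
    def __init__(self, keys):
        keys = np.asarray(keys)
        self.sorter = np.argsort(keys, kind='mergesort')
        self.sorted = keys[self.sorter]
        # flag the start of each run of equal keys in sorted order
        self.flag = np.concatenate(([True], self.sorted[1:] != self.sorted[:-1]))
        self.start = np.flatnonzero(self.flag)
        self.unique = self.sorted[self.start]
        self.counts = np.diff(np.append(self.start, len(keys)))

A hash-based HashIndex exposing the same attributes could then be swapped in
where it pays off; the arraysetops themselves would only ever talk to this
interface, e.g. unique(keys) reduces to SortingIndex(keys).unique.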


On Thu, Sep 4, 2014 at 8:36 PM, Jeff Reback <jeffreback at gmail.com> wrote:

> FYI pandas DOES use a very performant hash table impl for unique (and
> value_counts). Sorted state IS maintained
> by the underlying Index implementation.
> https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx
>
> In [8]: a = np.random.randint(10, size=(10000,))
>
> In [9]: %timeit np.unique(a)
> 1000 loops, best of 3: 284 µs per loop
>
> In [10]: %timeit Series(a).unique()
> 10000 loops, best of 3: 161 µs per loop
>
> In [11]: s = Series(a)
>
> # without the creation overhead
> In [12]: %timeit s.unique()
> 10000 loops, best of 3: 75.3 µs per loop
>
>
>
> On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn <
> hoogendoorn.eelco at gmail.com> wrote:
>
>>
>> On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>> I should clarify: I am speaking about my implementation; I haven't looked
>>> at the numpy implementation for a while, so I'm not sure what it is up to.
>>> Note that by 'almost free', we are still talking about three passes over
>>> the whole array plus temp allocations, but I am assuming a use case where
>>> the various sorts involved are the dominant cost, which I imagine they are
>>> for out-of-cache sorts. Perhaps this isn't too realistic an assumption
>>> about the average use case though, I don't know. Though I suppose it's a
>>> reasonable guideline to assume that either the dataset is big, or
>>> performance isn't that big a concern in the first place.
>>>
>>>
>>> On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <
>>> jaime.frio at gmail.com> wrote:
>>>
>>>> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>
>>>>>
>>>>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <
>>>>>> jaime.frio at gmail.com> wrote:
>>>>>>
>>>>>>>  On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <
>>>>>>> jaime.frio at gmail.com> wrote:
>>>>>>>
>>>>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
>>>>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Not sure about the hashing. Indeed one can also build an index of
>>>>>>>>> a set by means of a hash table, but it's questionable whether this leads to
>>>>>>>>> improved performance over performing an argsort. Hashing may have better
>>>>>>>>> asymptotic time complexity in theory, but many datasets used in practice
>>>>>>>>> are very easy to sort (O(N)-ish), and the time constant of hashing is
>>>>>>>>> higher. But more importantly, using a hash table guarantees poor cache
>>>>>>>>> behavior for many operations using this index. By contrast, sorting may
>>>>>>>>> (but need not) make one random-access pass to build the index, and may (but
>>>>>>>>> need not) perform one random access to reorder values for grouping. But
>>>>>>>>> insofar as the keys are better behaved than pure random, this coherence
>>>>>>>>> will be exploited.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If you want to give it a try, this branch of my numpy fork has
>>>>>>>> hash table based implementations of unique (with no extra indices) and in1d:
>>>>>>>>
>>>>>>>>  https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>>>>
>>>>>>>> A use case where the hash table is clearly better:
>>>>>>>>
>>>>>>>> In [1]: import numpy as np
>>>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>>>>
>>>>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>>>>> In [4]: %timeit np.unique(a)
>>>>>>>> 1000 loops, best of 3: 258 us per loop
>>>>>>>> In [5]: %timeit _unique(a)
>>>>>>>> 10000 loops, best of 3: 143 us per loop
>>>>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>>>>
>>>>>>>> It typically runs between 1.5x and 4x faster than sorting. I
>>>>>>>> haven't profiled it properly, but there may be quite a bit of
>>>>>>>> performance still to dig out: type-specific comparison functions, optimizing
>>>>>>>> the starting hash table size based on the size of the array to avoid
>>>>>>>> reinsertions...
>>>>>>>>
>>>>>>>> If getting the elements sorted is a necessity, and the array
>>>>>>>> contains very few or no repeated items, then the hash table approach may
>>>>>>>> even perform worse:
>>>>>>>>
>>>>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>>>>> In [9]: %timeit np.unique(a)
>>>>>>>> 1000 loops, best of 3: 277 us per loop
>>>>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>>>>
>>>>>>>> But the hash table still wins in extracting the unique items only:
>>>>>>>>
>>>>>>>> In [11]: %timeit _unique(a)
>>>>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>>>>
>>>>>>>> Where the hash table shines is in more elaborate situations. If you
>>>>>>>> keep the first index where each item was found, and the number of repeats, in
>>>>>>>> the hash table, you can get return_index and return_counts almost for free,
>>>>>>>> which means you end up an extra 3x faster than with sorting.
>>>>>>>> return_inverse requires a little more trickery, so I won't attempt to
>>>>>>>> quantify the improvement. But I wouldn't be surprised if, after fine-tuning
>>>>>>>> it, there is close to an order of magnitude overall improvement.
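>>>>>>>>
>>>>>>>> (A pure-Python sketch of the idea, purely to illustrate what I mean by
>>>>>>>> keeping the first index and the count in the table; the branch does the
>>>>>>>> equivalent bookkeeping in C:)
>>>>>>>>
>>>>>>>> def unique_with_index_and_counts(a):
>>>>>>>>     table = {}                           # value -> [first index, count]
>>>>>>>>     for i, v in enumerate(a):
>>>>>>>>         entry = table.get(v)
>>>>>>>>         if entry is None:
>>>>>>>>             table[v] = [i, 1]            # first occurrence
>>>>>>>>         else:
>>>>>>>>             entry[1] += 1                # one more repeat
>>>>>>>>     values = list(table)
>>>>>>>>     index = [table[v][0] for v in values]
>>>>>>>>     counts = [table[v][1] for v in values]
>>>>>>>>     return values, index, counts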
>>>>>>>>
>>>>>>>> The speed-up for in1d is also nice:
>>>>>>>>
>>>>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>>>>> In [18]: %timeit np.in1d(a, b)
>>>>>>>> 1000 loops, best of 3: 178 us per loop
>>>>>>>> In [19]: %timeit _in1d(a, b)
>>>>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>>>>
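>>>>>>>> (Conceptually, the hash based in1d is nothing more than set membership;
>>>>>>>> roughly the following sketch, minus the Python overhead:)
>>>>>>>>
>>>>>>>> import numpy as np
>>>>>>>>
>>>>>>>> def in1d_hash(a, b):
>>>>>>>>     lookup = set(b)                              # one pass over b to build the table
>>>>>>>>     return np.fromiter((x in lookup for x in a), # one membership test per element of a
>>>>>>>>                        dtype=bool, count=len(a))
>>>>>>>>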
>>>>>>>> Of course, there is no point in
>>>>>>>>
>>>>>>>
>>>>>>> Ooops!!! Hit the send button too quickly. Not to go on for too
>>>>>>> long: if we are going to rethink all of this, we should approach it with an
>>>>>>> open mind. Still, and this post is not helping with that either, I am
>>>>>>> afraid that we are discussing implementation details but are missing a
>>>>>>> broader vision of what we want to accomplish and why. That vision of what
>>>>>>> numpy's grouping functionality, if any, should be, and how it complements
>>>>>>> or conflicts with what pandas is providing, should precede anything else. I
>>>>>>> know I haven't, but has anyone looked at how pandas implements grouping?
>>>>>>> Their documentation on the subject is well worth a read:
>>>>>>>
>>>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>>>>
>>>>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>>>>
>>>>>>> Jaime
>>>>>>>
>>>>>>> --
>>>>>>> (\__/)
>>>>>>> ( O.o)
>>>>>>> ( > <) This is Conejo. Copy Conejo into your signature and help him in his
>>>>>>> plans for world domination.
>>>>>>>
>>>>>> I would certainly not be opposed to having a hashing-based indexing
>>>>>> mechanism; I think it would make sense design-wise to have a HashIndex
>>>>>> class with the same interface as the rest, and use that subclass in those
>>>>>> arraysetops where it makes sense. The 'how to' of indexing and its
>>>>>> applications are largely orthogonal, I think (with some tiny performance
>>>>>> compromises which are worth the abstraction, imo). For datasets which are
>>>>>> not purely random, have many unique items, and which do not fit into cache,
>>>>>> I would expect sorting to come out on top, but indeed it depends on the
>>>>>> dataset.
>>>>>>
>>>>>> Yeah, the question of how pandas does grouping, and whether we can do
>>>>>> better, is a relevant one.
>>>>>>
>>>>>> From what I understand, pandas relies on cython extensions to get
>>>>>> vectorized grouping functionality. This is no longer necessary since the
>>>>>> introduction of ufuncs in numpy. I don't know how the implementations
>>>>>> compare in terms of performance, but I doubt the difference is huge.
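>>>>>>
>>>>>> For instance, once the keys are sorted, a grouped reduction is little
>>>>>> more than a ufunc.reduceat over the run boundaries; roughly (a sketch,
>>>>>> not pandas' actual code):
>>>>>>
>>>>>> import numpy as np
>>>>>>
>>>>>> def group_sum(keys, values):
>>>>>>     sorter = np.argsort(keys, kind='mergesort')
>>>>>>     skeys = keys[sorter]
>>>>>>     # indices where a new group starts in the sorted order
>>>>>>     starts = np.flatnonzero(np.concatenate(([True], skeys[1:] != skeys[:-1])))
>>>>>>     return skeys[starts], np.add.reduceat(values[sorter], starts)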
>>>>>>
>>>>>> I personally use grouping a lot in my code, and I don't like
>>>>>> having to use pandas for it. Most importantly, I don't want to go around
>>>>>> creating a dataframe for a single one-line hit-and-run association between
>>>>>> keys and values. The permanent association of different types of data and
>>>>>> their metadata which pandas offers is, I think, the key difference from
>>>>>> numpy, which is all about manipulating just plain ndarrays. Arguably,
>>>>>> grouping itself is a pretty elementary manipulation of ndarrays, and adding
>>>>>> calls to DataFrame or Series in between a statement that could simply be
>>>>>> group_by(keys).mean(values) feels wrong to me. As does including
>>>>>> pandas as a dependency just to use this small piece of
>>>>>> functionality. Grouping is a more general functionality than any particular
>>>>>> method of organizing your data.
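>>>>>>
>>>>>> (To illustrate the kind of one-liner I mean; group_by here refers to my
>>>>>> proposed implementation, so take the exact names as illustrative:)
>>>>>>
>>>>>> import numpy as np
>>>>>>
>>>>>> keys   = np.array(['a', 'b', 'a', 'c', 'b'])
>>>>>> values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>>>>> unique_keys, means = group_by(keys).mean(values)   # no DataFrame construction required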
>>>>>>
>>>>>> In terms of features, adding transformations and filtering might be
>>>>>> nice too; I hadn't thought about it, but that is because unlike the
>>>>>> currently implemented features, the need has never arisen for me. I'm only a
>>>>>> small sample size, and I don't see any fundamental objection to adding such
>>>>>> functionality though. It certainly raises the question as to where to draw
>>>>>> the line with pandas; but my rule of thumb is that if you can think of it
>>>>>> as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>>>>
>>>>>>
>>>>> Oh I forgot to add: with an indexing mechanism based on sorting,
>>>>> unique values and counts also come 'for free', not counting the O(N) cost
>>>>> of actually creating those arrays. The only time an operation relying on an
>>>>> index incurs another nontrivial amount of overhead is when its 'rank' or
>>>>> 'inverse' property is used, which invokes another argsort. But for the vast
>>>>> majority of grouping or set operations, these properties are never used.
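>>>>>
>>>>> (Concretely, with the sorter and run-start flags cached on the index,
>>>>> unique and counts are just a slice and a diff, whereas the inverse needs
>>>>> a second argsort; sketch only:)
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> keys = np.array([3, 1, 3, 2, 1, 3])
>>>>> sorter = np.argsort(keys, kind='mergesort')
>>>>> skeys = keys[sorter]
>>>>> flag = np.concatenate(([True], skeys[1:] != skeys[:-1]))
>>>>>
>>>>> unique = skeys[flag]                                      # 'for free' given the index
>>>>> counts = np.diff(np.flatnonzero(np.append(flag, True)))   # likewise
>>>>> inverse = (np.cumsum(flag) - 1)[np.argsort(sorter)]       # the expensive extra argsort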
>>>>>
>>>>
>>>> That extra argsort is now gone from master:
>>>>
>>>> https://github.com/numpy/numpy/pull/5012
>>>>
>>>> Even with this improvement, returning any index typically makes
>>>> `np.unique` run at least 2x slower:
>>>>
>>>> In [1]: import numpy as np
>>>> In [2]: a = np.random.randint(100, size=(1000,))
>>>> In [3]: %timeit np.unique(a)
>>>> 10000 loops, best of 3: 37.3 us per loop
>>>> In [4]: %timeit np.unique(a, return_inverse=True)
>>>> 10000 loops, best of 3: 62.1 us per loop
>>>> In [5]: %timeit np.unique(a, return_index=True)
>>>> 10000 loops, best of 3: 72.8 us per loop
>>>> In [6]: %timeit np.unique(a, return_counts=True)
>>>> 10000 loops, best of 3: 56.4 us per loop
>>>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
>>>> 10000 loops, best of 3: 112 us per loop
>>>>
>>>> --
>>>> (\__/)
>>>> ( O.o)
>>>> ( > <) This is Conejo. Copy Conejo into your signature and help him in his
>>>> plans for world domination.
>>>>
>>>>
>>>
>> Yeah, I looked at the numpy implementation, and it seems these speed
>> differences are simply the result of the extra O(N) costs involved, so my
>> implementation would have the same characteristics. If another array copy
>> or two has a meaningful impact on performance, then you are done as far as
>> optimization within numpy is concerned, I'd say. You could fuse those loops
>> at the C level, but as you said, I think it's better to think about these
>> kinds of optimizations once we have a more complete picture of the
>> functionality we want.
>>
>> Good call on removing the extra argsort; that hadn't occurred to me.
>>
>>
>
>