[Numpy-discussion] Does a `mergesorted` function make sense?
Eelco Hoogendoorn
hoogendoorn.eelco at gmail.com
Thu Sep 4 14:29:07 EDT 2014
On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <
hoogendoorn.eelco at gmail.com> wrote:
> I should clarify: I am speaking about my implementation, I havnt looked at
> the numpy implementation for a while so im not sure what it is up to. Note
> that by 'almost free', we are still talking about three passes over the
> whole array plus temp allocations, but I am assuming a use-case where the
> various sorts involved are the dominant cost, which I imagine they are, for
> out-of-cache sorts. Perhaps this isn't too realistic an assumption about
> the average use case though, I don't know. Though I suppose its a
> reasonable guideline to assume that either the dataset is big, or
> performance isn't that big a concern in the first place.
>
>
> On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <
> jaime.frio at gmail.com> wrote:
>
>> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <
>> hoogendoorn.eelco at gmail.com> wrote:
>>
>>>
>>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <
>>> hoogendoorn.eelco at gmail.com> wrote:
>>>
>>>>
>>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <
>>>> jaime.frio at gmail.com> wrote:
>>>>
>>>>> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <
>>>>> jaime.frio at gmail.com> wrote:
>>>>>
>>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <
>>>>>> hoogendoorn.eelco at gmail.com> wrote:
>>>>>>
>>>>>>> Not sure about the hashing. Indeed one can also build an index of a
>>>>>>> set by means of a hash table, but its questionable if this leads to
>>>>>>> improved performance over performing an argsort. Hashing may have better
>>>>>>> asymptotic time complexity in theory, but many datasets used in practice
>>>>>>> are very easy to sort (O(N)-ish), and the time-constant of hashing is
>>>>>>> higher. But more importantly, using a hash-table guarantees poor cache
>>>>>>> behavior for many operations using this index. By contrast, sorting may
>>>>>>> (but need not) make one random access pass to build the index, and may (but
>>>>>>> need not) perform one random access to reorder values for grouping. But
>>>>>>> insofar as the keys are better behaved than pure random, this coherence
>>>>>>> will be exploited.
>>>>>>>
>>>>>>
>>>>>> If you want to give it a try, these branch of my numpy fork has hash
>>>>>> table based implementations of unique (with no extra indices) and in1d:
>>>>>>
>>>>>> https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>>
>>>>>> A use cases where the hash table is clearly better:
>>>>>>
>>>>>> In [1]: import numpy as np
>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>>
>>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>>> In [4]: %timeit np.unique(a)
>>>>>> 1000 loops, best of 3: 258 us per loop
>>>>>> In [5]: %timeit _unique(a)
>>>>>> 10000 loops, best of 3: 143 us per loop
>>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>>
>>>>>> It typically performs between 1.5x and 4x faster than sorting. I
>>>>>> haven't profiled it properly to know, but there may be quite a bit of
>>>>>> performance to dig out: have type specific comparison functions, optimize
>>>>>> the starting hash table size based on the size of the array to avoid
>>>>>> reinsertions...
>>>>>>
>>>>>> If getting the elements sorted is a necessity, and the array contains
>>>>>> very few or no repeated items, then the hash table approach may even
>>>>>> perform worse,:
>>>>>>
>>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>>> In [9]: %timeit np.unique(a)
>>>>>> 1000 loops, best of 3: 277 us per loop
>>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>>
>>>>>> But the hash table still wins in extracting the unique items only:
>>>>>>
>>>>>> In [11]: %timeit _unique(a)
>>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>>
>>>>>> Where the hash table shines is in more elaborate situations. If you
>>>>>> keep the first index where it was found, and the number of repeats, in the
>>>>>> hash table, you can get return_index and return_counts almost for free,
>>>>>> which means you are performing an extra 3x faster than with sorting.
>>>>>> return_inverse requires a little more trickery, so I won;t attempt to
>>>>>> quantify the improvement. But I wouldn't be surprised if, after fine tuning
>>>>>> it, there is close to an order of magnitude overall improvement
>>>>>>
>>>>>> The spped-up for in1d is also nice:
>>>>>>
>>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>>> In [18]: %timeit np.in1d(a, b)
>>>>>> 1000 loops, best of 3: 178 us per loop
>>>>>> In [19]: %timeit _in1d(a, b)
>>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>>
>>>>>> Of course, there is no point in
>>>>>>
>>>>>
>>>>> Ooops!!! Hit the send button too quick. Not to extend myself too long:
>>>>> if we are going to rethink all of this, we should approach it with an open
>>>>> mind. Still, and this post is not helping with that either, I am afraid
>>>>> that we are discussing implementation details, but are missing a broader
>>>>> vision of what we want to accomplish and why. That vision of what numpy's
>>>>> grouping functionality, if any, should be, and how it complements or
>>>>> conflicts with what pandas is providing, should precede anything else. I
>>>>> now I haven't, but has anyone looked at how pandas implements grouping?
>>>>> Their documentation on the subject is well worth a read:
>>>>>
>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>>
>>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>>
>>>>> Jaime
>>>>>
>>>>> --
>>>>> (\__/)
>>>>> ( O.o)
>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus
>>>>> planes de dominación mundial.
>>>>>
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list
>>>>> NumPy-Discussion at scipy.org
>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>
>>>>> I would certainly not be opposed to having a hashing based indexing
>>>> mechanism; I think it would make sense design-wise to have a HashIndex
>>>> class with the same interface as the rest, and use that subclass in those
>>>> arraysetops where it makes sense. The 'how to' of indexing and its
>>>> applications are largely orthogonal I think (with some tiny performance
>>>> compromises which are worth the abstraction imo). For datasets which are
>>>> not purely random, have many unique items, and which do not fit into cache,
>>>> I would expect sorting to come out on top, but indeed it depends on the
>>>> dataset.
>>>>
>>>> Yeah, the question how pandas does grouping, and whether we can do
>>>> better, is a relevant one.
>>>>
>>>> From what I understand, pandas relies on cython extensions to get
>>>> vectorized grouping functionality. This is no longer necessary since the
>>>> introduction of ufuncs in numpy. I don't know how the implementations
>>>> compare in terms of performance, but I doubt the difference is huge.
>>>>
>>>> I personally use grouping a lot in my code, and I don't like having to
>>>> use pandas for it. Most importantly, I don't want to go around creating a
>>>> dataframe for a single one-line hit-and-run association between keys and
>>>> values. The permanent association of different types of data and their
>>>> metadata which pandas offers is I think the key difference from numpy,
>>>> which is all about manipulating just plain ndarrays. Arguably, grouping
>>>> itself is a pretty elementary manipulating of ndarrays, and adding calls to
>>>> DataFrame or Series inbetween a statement that could just be
>>>> simply group_by(keys).mean(values) feels wrong to me. As does including
>>>> pandas as a dependency just to use this small piece of
>>>> functionality. Grouping is a more general functionality than any particular
>>>> method of organizing your data.
>>>>
>>>> In terms of features, adding transformations and filtering might be
>>>> nice too; I hadn't thought about it, but that is because unlike the
>>>> currently implemented features, the need has never arose for me. Im only a
>>>> small sample size, and I don't see any fundamental objection to adding such
>>>> functionality though. It certainly raises the question as to where to draw
>>>> the line with pandas; but my rule of thumb is that if you can think of it
>>>> as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>>
>>>>
>>> Oh I forgot to add: with an indexing mechanism based on sorting, unique
>>> values and counts also come 'for free', not counting the O(N) cost of
>>> actually creating those arrays. The only time an operating relying on an
>>> index incurs another nontrivial amount of overhead is in case its 'rank' or
>>> 'inverse' property is used, which invokes another argsort. But for the vast
>>> majority of grouping or set operations, these properties are never used.
>>>
>>
>> That extra argsort is now gone from master:
>>
>> https://github.com/numpy/numpy/pull/5012
>>
>> Even with this improvement, returning any index typically makes
>> `np.unique` run at least 2x slower:
>>
>> In [1]: import numpy as np
>> In [2]: a = np.random.randint(100, size=(1000,))
>> In [3]: %timeit np.unique(a)
>> 10000 loops, best of 3: 37.3 us per loop
>> In [4]: %timeit np.unique(a, return_inverse=True)
>> 10000 loops, best of 3: 62.1 us per loop
>> In [5]: %timeit np.unique(a, return_index=True)
>> 10000 loops, best of 3: 72.8 us per loop
>> In [6]: %timeit np.unique(a, return_counts=True)
>> 10000 loops, best of 3: 56.4 us per loop
>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True,
>> return_coun
>> ts=True)
>> 10000 loops, best of 3: 112 us per loop
>>
>> --
>> (\__/)
>> ( O.o)
>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
>> de dominación mundial.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
Yeah, I looked at the numpy implementation, and it seems these speed
differences are simply the result of the extra O(N) costs involved, so my
implementation would have the same characteristics. If another array copy
or two has meaningful impact on performance, then you are done as far as
optimization within numpy is concerned, id say. You could fuse those loops
on the C level, but as you said I think its better to think about these
kind of optimizations once we have a more complete picture of the
functionality we want.
Good call on removing the extra argsort, that hadn't occurred to me.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140904/498000dd/attachment.html>
More information about the NumPy-Discussion
mailing list