[Numpy-discussion] New numpy functions: filled, filled_like

Mon Jan 14 11:55:39 EST 2013

On Mon, Jan 14, 2013 at 11:22 AM,  <josef.pktd at gmail.com> wrote:
> On Mon, Jan 14, 2013 at 11:15 AM, Olivier Delalleau <shish at keba.be> wrote:
>> 2013/1/14 Matthew Brett <matthew.brett at gmail.com>:
>>> Hi,
>>>
>>> On Mon, Jan 14, 2013 at 9:02 AM, Dave Hirschfeld
>>> <dave.hirschfeld at gmail.com> wrote:
>>>> Robert Kern <robert.kern <at> gmail.com> writes:
>>>>
>>>>>
>>>>> >>> >
>>>>> >>> > One alternative that does not expand the API with two-liners is to let
>>>>> >>> > the ndarray.fill() method return self:
>>>>> >>> >
>>>>> >>> >   a = np.empty(...).fill(20.0)
>>>>> >>>
>>>>> >>> This violates the convention that in-place operations never return
>>>>> >>> self, to avoid confusion with out-of-place operations. E.g.
>>>>> >>> ndarray.resize() versus ndarray.reshape(), ndarray.sort() versus
>>>>> >>> np.sort(), and in the broader Python world, list.sort() versus
>>>>> >>> sorted(), list.reverse() versus reversed(). (This was an explicit
>>>>> >>> reason given for list.sort to not return self, even.)
>>>>> >>>
>>>>> >>> Maybe enabling this idiom is a good enough reason to break the
>>>>> >>> convention ("Special cases aren't special enough to break the rules. /
>>>>> >>> Although practicality beats purity"), but it at least makes me -0 on
>>>>> >>> this...
>>>>> >>>
>>>>> >>
>>>>> >> I tend to agree with the notion that inplace operations shouldn't return
>>>>> >> self, but I don't know if it's just because I've been conditioned this way.
>>>>> >> Not returning self breaks the fluid interface pattern [1], as noted in a
>>>>> >> similar discussion on pandas [2], FWIW, though there's likely some way to
>>>>> >> have both worlds.
>>>>> >
>>>>> > Ah-hah, here's the email where Guido officially proclaims that there
>>>>> > shall be no "fluent interface" nonsense applied to in-place operators
>>>>> > in Python, because it hurts readability (at least for Dutch people):
>>>>> >   http://mail.python.org/pipermail/python-dev/2003-October/038855.html
>>>>>
>>>>> That's a statement about the policy for the stdlib, and just one
>>>>> person's opinion. You, and numpy, are permitted to have a different
>>>>> opinion.
>>>>>
>>>>> In any case, I'm not strongly advocating for it. Its violation of
>>>>> principle ("no fluent interfaces") is roughly in the same ballpark as
>>>>> np.filled() ("not every two-liner needs its own function"), so I
>>>>> thought I would toss it out there for consideration.
>>>>>
>>>>> --
>>>>> Robert Kern
>>>>>
>>>>
>>>> FWIW I'm +1 on the idea. Perhaps because I just don't see many practical
>>>> downsides to breaking the convention but I regularly see a big issue with there
>>>> being no way to instantiate an array with a particular value.
>>>>
>>>> The one obvious way to do it is use ones and multiply by the value you want. I
>>>> work with a lot of inexperienced programmers and I see this idiom all the time.
>>>> It takes a fair amount of numpy knowledge to know that you should do it in two
>>>> lines by using empty and setting a slice.
>>>>
>>>> In [1]: %timeit NaN*ones(10000)
>>>> 1000 loops, best of 3: 1.74 ms per loop
>>>>
>>>> In [2]: %%timeit
>>>>    ...: x = empty(10000, dtype=float)
>>>>    ...: x[:] = NaN
>>>>    ...:
>>>> 10000 loops, best of 3: 28 us per loop
>>>>
>>>> In [3]: 1.74e-3/28e-6
>>>> Out[3]: 62.142857142857146
>>>>
>>>>
>>>> Even when not in the mythical "tight loop", setting an array to one and then
>>>> multiplying uses up a lot of cycles - it's nearly 2 orders of magnitude slower
>>>> than what we know they *should* be doing.
>>>>
>>>> I'm agnostic as to whether fill should be modified or new functions provided but
>>>> I think numpy is currently missing this functionality and that providing it
>>>> would save a lot of new users from shooting themselves in the foot performance-
>>>> wise.
>>>
>>> Is this a fair summary?
>>>
>>> => filled(shape, val), filled_like(arr, val) - new functions, as proposed
>>> For: readable, seems to fit a pattern often used, presence in
>>> namespace may clue people into using the 'fill' rather than * val or +
>>> val
>>> Con: a very simple alias for a = ones(shape) ; a.fill(val), maybe
>>> cluttering already full namespace.
>>>
>>> => empty(shape).fill(val) - by allowing return value from arr.fill(val)
>>> For: readable
>>> Con: breaks guideline not to return anything from in-place operations,
>>> no presence in namespace means users may not find this pattern.
>>>
>>> => no new API
>>> For : easy maintenance
>>> Con : harder for users to discover fill pattern, filling a new array
>>> requires two lines instead of one.
>>>
>>> So maybe the decision rests on:
>>>
>>> How important is it that users see these function names in the
>>> namespace in order to discover the pattern "a = ones(shape) ;
>>> a.fill(val)"?
>>>
>>> How important is it to obey guidelines for no-return-from-in-place?
>>>
>>> How important is it to avoid expanding the namespace?
>>>
>>> How common is this pattern?
>>>
>>> On the last, I'd say that the only common use I have for this pattern
>>> is to fill an array with NaN.
>>
>> My 2 cts from a user perspective:
>>
>> - +1 to have such a function. I usually use numpy.ones * scalar
>> because honestly, spending two lines of code for such a basic
>> operation seems like a waste. Even if it's slower and potentially
>> dangerous due to casting rules.
>> - I think having a noun rather than a verb makes more sense since we
>> have numpy.ones and numpy.zeros (and I always read "numpy.empty" as
>> "give me an empty array", not "empty an array").
>> - I agree the name collision with np.ma.filled is a problem. I have no
>> better suggestion though at this point.
>
> np.array_filled(shape, value, dtype)  ?
> maybe more verbose, but unambiguous AFAICS
>
> BTW
> GAUSS http://en.wikipedia.org/wiki/GAUSS_(software)
> also has zeros and ones. 1st release 1984
>
> np.array_filled((100, 2), -999, int) ?
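
A minimal sketch of what such a helper might look like, assuming it
simply wraps the empty-then-fill two-liner. The name array_filled is
only the suggestion quoted above, not an existing numpy function, and
inferring the dtype from the value when none is given is an assumption
made for this illustration.

```python
import numpy as np

def array_filled(shape, value, dtype=None):
    """Hypothetical helper: return a new array of `shape` filled with `value`."""
    if dtype is None:
        # Assumption: fall back to the dtype numpy infers for the fill value.
        dtype = np.asarray(value).dtype
    out = np.empty(shape, dtype=dtype)  # uninitialized memory, no extra pass
    out.fill(value)                     # in-place fill; the method returns None
    return out

a = array_filled((100, 2), -999, int)
b = array_filled(5, np.nan)
```

Here a is a (100, 2) integer array of -999 and b is a length-5 float
array of NaN, each built with one allocation and one fill.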

A quick check of the statsmodels source

20 occurrences of np.nan * np.ones(...)
50 occurrences of np.empty
     a few filled with values other than nan
     many filled in a loop (optimistically, more often used by new contributors)

It's just a two-liner, but if it's a function it hopefully produces better code.
David's argument looks plausible to me.
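
For reference, a small sketch of the two idioms discussed in this
thread side by side. It only illustrates that they give the same
result and how the dtypes come out; the array sizes and fill values
are arbitrary choices for the example.

```python
import numpy as np

n = 10000

# Idiom 1: allocate ones, then multiply (an extra pass plus a temporary array).
a = np.nan * np.ones(n)

# Idiom 2: allocate uninitialized memory, then fill it in place.
b = np.empty(n, dtype=float)
result = b.fill(np.nan)  # same effect as b[:] = np.nan; fill() returns None

same = a.dtype == b.dtype and np.isnan(a).all() and np.isnan(b).all()

# With multiplication the result dtype comes from type promotion,
# so an integer array of ones silently becomes a float array here.
c = np.ones(3, dtype=int) * 2.5

# With empty + fill the dtype is the one requested at allocation time.
d = np.empty(3, dtype=int)
d.fill(7)
```

Because fill() returns None, the chained form np.empty(...).fill(val)
currently gives None rather than the array, which is why the two-liner
is needed.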

Josef

>
> Josef
>
>
>>
>> -=- Olivier