[Numpy-discussion] gist gist: 1068264

Nathaniel Smith njs at pobox.com
Mon Jul 11 00:02:01 EDT 2011


Hi Bruce,

I think we have some fundamental misunderstandings about what this
proposal would do. Let me see if I can try to be clearer.

On Sun, Jul 10, 2011 at 7:33 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On Fri, Jul 8, 2011 at 5:04 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> Each dtype has a bunch of C functions associated with it that say how
>> to do comparisons, assignment, etc. In the miniNEP design, we add a
>> new function to this list called 'isna', which every dtype that wants
>> to support NAs has to define.
>
> Starting to lose me here because you are adding memory that your
> miniNep was not meant to do.

The memory overhead that people have been worrying about is if they
have, say, an 8 gigabyte array full of doubles, are they also going to
need a 1 gigabyte array full of mask bytes.

These new functions we're talking about are defined just once per
dtype, per Python invocation. This comes to, at worst, a few kilobytes
total.
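The arithmetic behind that worry is simple; a quick sketch using the sizes from the example above:

```python
import numpy as np

n = 10**9                                            # one billion elements
data_gb = n * np.dtype(np.float64).itemsize / 1e9    # 8.0 GB of double values
mask_gb = n * np.dtype(np.uint8).itemsize / 1e9      # 1.0 GB of mask bytes

print(data_gb, mask_gb)   # 8.0 1.0
```

So the per-element mask is where the gigabyte-scale overhead comes from; a handful of per-dtype function pointers is negligible next to either number.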

Also, I should say that there were a few motivations for wanting to
support dtype-style NAs; memory usage is only one of them.

>> Yes, this does mean that code which wants to treat NAs separately has
>> to check for and call this function if it's present, but that seems to
>> be inevitable... *all* of the dtype C functions are supposedly
>> optional, so we have to check for them before calling them and do
>> something sensible if they aren't defined. We could define a wrapper
>> that calls the function if it's defined, or else just fills the
>> provided buffer with zeros (to mean "there are no NAs"), and then code
>> which wanted to avoid a special case could use that. But in general we
>> probably do want to handle arrays that might have NAs differently from
>> arrays which don't have NAs, because if there are no NAs present then
>> it's quicker to skip the handling altogether. That's true for any NA
>> implementation.
>
> Second problem is that we need memory for at least a new function. We
> also have code duplication that needs to be in sync.

Both the masking and dtype ideas for NA support would require new code
be written for Numpy to actually implement the functionality, and this
code does take a small amount of memory, yes. But that's true for
every feature ever.

Also, there isn't any code duplication here, at least as far as I can
tell. If you want to add a fast-path then that does use a tiny amount
more memory, but that's sometimes worth it for speed. Anyway, my point
was just that we can and should decide on a case-by-case basis; if a
fast-path isn't worth it in some situation, then we shouldn't add it.

Any checking you have to do for bit-pattern NAs, you also have to do
for masks, and vice-versa. The checking looks slightly different
(comparing for some magic NA value in the array versus checking for
some special bits in the mask), but the actual work involved is
equivalent.
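For concreteness, here is that equivalence in NumPy terms, with a hypothetical magic value standing in for the bit-pattern NA:

```python
import numpy as np

NA_INT8 = np.int8(-128)          # hypothetical reserved bit-pattern for NA

# Bit-pattern flavor: the NA lives inside the value array itself.
values = np.array([1, -128, 3], dtype=np.int8)
isna_bitpattern = values == NA_INT8

# Masked flavor: a separate byte array marks which elements are NA.
data = np.array([1, 0, 3], dtype=np.int8)
mask = np.array([0, 1, 0], dtype=np.uint8)
isna_masked = mask != 0

# Different checks, same amount of work, same answer:
print(isna_bitpattern)   # [False  True False]
```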

>> Yeah, in the design as written, overflow (among other things) can
>> create accidental NAs. Which kind of sucks. There are a few options:
>>
>> -- Just live with it.
>
> Unfortunately that is impossible and other choice words.

Okay.

>> -- We could add a flag like NPY_NA_AUTO_CHECK, and when this flag is
>> set, the ufunc loop runs 'isna' on its output buffer before returning.
>> If there are any NAs there that did not arise from NAs in the input,
>> then it raises an error. (The reason we would want to make it a flag
>> is that this checking is pointless for dtypes like NA-string, and
>> mostly pointless for dtypes like NA-float.) Also, we'd only want to
>> enable this if we were using the NPY_NA_AUTO_UFUNC ufunc-delegation
>> logic, because if you registered a special ufunc loop *specifically
>> for your NA-dtype*, then presumably it knows what it's doing. This
>> would also allow such an NA-dtype-specific ufunc loop to return NAs on
>> purpose if it wanted to.
>
> This appears to me as masking. But my issue here is the complexity of
> the function involved because ensuring that the calculation is correct
> probably comes with a large performance penalty.

I'm not sure what you mean about "appears as masking".

There would be some overhead for double-checking that output values
didn't accidentally produce NAs, yes. Depending on how caching effects
worked out, this overhead might be zero; the bottleneck for most array
operations is memory, not CPU, and doing these checks wouldn't require
any extra CPU. But every solution does have some trade-offs; if there
was a perfect solution then we wouldn't have anything to debate
:-). The point is that the dtype-NA approach lets you choose which
trade-offs you want to make while still being easy to understand.
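A rough Python sketch of what the NPY_NA_AUTO_CHECK step would do after a ufunc loop (the magic value and the function names here are hypothetical, purely to illustrate the logic):

```python
import numpy as np

NA_INT8 = np.int8(-128)          # hypothetical bit-pattern reserved for NA

def isna(arr):
    return arr == NA_INT8

def checked_add(a, b):
    out = a + b                  # int8 array arithmetic wraps on overflow
    # The AUTO_CHECK step: any NA in the output that does not come
    # from an NA in the inputs must be an accidental overflow.
    accidental = isna(out) & ~(isna(a) | isna(b))
    if accidental.any():
        raise ValueError("operation accidentally produced the NA bit-pattern")
    return out

a = np.array([100], dtype=np.int8)
b = np.array([28], dtype=np.int8)
# 100 + 28 = 128 wraps to -128, which is exactly the reserved NA value,
# so checked_add(a, b) raises ValueError instead of silently creating an NA.
```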

>> -- Use a dtype that adds a separate flag next to the actual integer to
>> indicate NA-ness, instead of stealing one of the integer's values. So
>> your NA-int8 would actually be 2 bytes, where the first byte was 1 to
>> indicate NA, or 0 to indicate that the second byte contains an actual
>> int8. If you do this with larger integers, say an int32, then you have
>> a choice: you could store your int32 in 8 bytes, in which case
>> arithmetic etc. is fast, but you waste a bit of memory. Or you could
>> store your int32 in 5 bytes, in which case arithmetic etc. become
>> somewhat slower, but you don't waste any memory. (This latter case
>> would basically be like using an unaligned or byteswapped array in
>> current numpy, in terms of mechanisms and speed.)
>
> But avoiding any increase in memory was one of the benefits of this
> miniNEP. It really doesn't matter which integer size you use because
> you still have the same problem. Also, people use int8 or whatever by
> choice due say memory constraints.

If you insist on functionality that requires an increase in memory,
then you have to accept an increase in memory. Wanting to be able to
store an int8, have the full range of values available, *plus* NA as a
257th value, means that you need to get an extra byte somewhere. I'm
just explaining how you do that :-).

My point is just that the proposal is flexible enough to make
whichever trade-offs you decide are best for your situation, while
still being easy to understand.
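The flag-byte layout is easy to picture with a structured dtype; this is just an illustration of the memory layout, not the proposed API:

```python
import numpy as np

# Two bytes per element: one flag byte plus the actual int8 payload.
na_int8 = np.dtype([('isna', np.uint8), ('value', np.int8)])
print(na_int8.itemsize)          # 2

arr = np.zeros(3, dtype=na_int8)
arr['value'] = [-128, 0, 127]    # the full int8 range stays usable
arr['isna'][1] = 1               # mark the middle element as NA

print((arr['isna'] != 0).tolist())   # [False, True, False]
```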

>> A related issue is, of the many ways we *can* do integer NA-dtype,
>> which one *should* we do by default. I don't have a strong opinion,
>> really; I haven't heard anyone say that they have huge quantities of
>> integer-plus-NA data that they want to manipulate and
>> memory/speed/allowing the full range of values are all really
>> important. (Maybe that's you?) In the design as written, they're all
>> pretty trivial to implement (you just tweak a few magic numbers in the
>> dtype structure), and probably we should support all of them via
>> more-or-less exotic invocations of np.withNA. (E.g.,
>> 'np.withNA(np.int32, useflag=True, flagsize=1)' to get a 5-byte
>> int32.)
>
> I disagree with the comment that this is 'pretty trivial to
> implement'. I do not think that is trivial to implement with
> acceptable performance and memory costs.

I hope I made clear above that the necessary memory costs you're
thinking of are actually basically non-existent.

I'm not sure what you mean about it not being trivial to
implement. Like I said, giving the different options is literally a
matter of tweaking a few fields, and we want to support those fields
for other reasons, plus they aren't very complicated to start with.

> I am being difficult as I do agree with many of the underlying idea.
> But I want something that works with acceptable performance and memory
> usage (there should be minor penalty of having masked elements over no
> masked elements). I do not find it acceptable when A.dot(B) is slower
> than first creating an array without NAs: C=A.noNA(), C.dot(B). Thus
> to me an API is insufficient to address that.

First, let me say again that this miniNEP is not intended to compete
with the masking idea -- they can coexist.

But if we do want to choose between the two ideas, speed won't help
you make a decision, because they're both going to use very similar
code to do the inner loops, and both are going to be about equally
fast. (And both should be faster than making a whole array copy! If
not, you should complain until someone fixes it...)

If anything, bit-pattern-NAs might be slightly faster than
masking-NAs, because the masking-NAs will force the inner loops to
look at two chunks of memory (one for the mask, and one for the values
to do the actual computation), while the inner loop for a
bit-pattern-NA only needs to look at the values. But again, I suspect
this difference will not be measurable in practice.
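The memory-traffic argument in back-of-the-envelope form (per int8 element; illustrative only):

```python
value_bytes = 1                        # the int8 payload itself
mask_bytes = 1                         # one mask byte per element

bitpattern_traffic = value_bytes               # values only
masked_traffic = value_bytes + mask_bytes      # values plus the mask
print(bitpattern_traffic, masked_traffic)      # 1 2
```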

-- Nathaniel


