Mailman 3 June 2011 - NumPy-Discussion

code review/build & test for datetime business day API
by Mark Wiebe 26 Jul '11

These functions are now fully implemented and documented. As always, code reviews are welcome here:

https://github.com/numpy/numpy/pull/87

and for those that don't want to dig into reviewing C code, the commit for the documentation is here:

https://github.com/m-paradox/numpy/commit/6b5a42a777b16812e774193b06da1b68b…

This is probably also another good place to do a merge to master, so if people could test it on Mac/Windows/other platforms that would be much appreciated.

Thanks,
Mark

On Fri, Jun 10, 2011 at 5:49 PM, Mark Wiebe <mwwiebe(a)gmail.com> wrote:

> I've implemented the busday_offset function with support for the weekmask
> and roll parameters, the commits are tagged 'datetime-bday' in the pull
> request here:
>
> https://github.com/numpy/numpy/pull/87
>
> -Mark
>
> On Thu, Jun 9, 2011 at 5:23 PM, Mark Wiebe <mwwiebe(a)gmail.com> wrote:
>
>> Here's a possible design for a business day API for numpy datetimes:
>>
>> The 'B' business day unit will be removed. All business day-related
>> calculations will be done using the 'D' day unit.
>>
>> A class *BusinessDayDef* to encapsulate the definition of the business
>> week and holidays. The business day functions will either take one of these
>> objects, or separate weekmask and holidays parameters, to specify the
>> business day definition. This class serves as both a performance
>> optimization and a way to encapsulate the weekmask and holidays together,
>> for example if you want to make a dictionary mapping exchange names to their
>> trading days definition.
>>
>> The weekmask can be specified in a number of ways, and internally becomes
>> a boolean array with 7 elements with True for the days Monday through Sunday
>> which are valid business days. Some different notations for the 5-day
>> week include [1,1,1,1,1,0,0], "1111100", "MonTueWedThuFri". The holidays are
>> always specified as a one-dimensional array of dtype 'M8[D]', and are
>> internally used in sorted form.
>>
>> A function *is_busday*(datearray, weekmask=, holidays=, busdaydef=)
>> returns a boolean array matching the input datearray, with True for the
>> valid business days.
>>
>> A function *busday_offset*(datearray, offsetarray,
>> roll='raise', weekmask=, holidays=, busdaydef=) which first applies the
>> 'roll' policy to start at a valid business date, then offsets the date by
>> the number of business days specified in offsetarray. The arrays datearray
>> and offsetarray are broadcast together. The 'roll' parameter can be
>> 'forward'/'following', 'backward'/'preceding', 'modifiedfollowing',
>> 'modifiedpreceding', or 'raise' (the default).
>>
>> A function *busday_count*(datearray1, datearray2, weekmask=, holidays=,
>> busdaydef=) which calculates the number of business days between datearray1
>> and datearray2, not including the day of datearray2.
>>
>> For example, to find the first Monday in Feb 2011,
>>
>> >>> np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon')
>>
>> or to find the number of weekdays in Feb 2011,
>>
>> >>> np.busday_count('2011-02', '2011-03')
>>
>> This set of three functions appears to be powerful enough to express the
>> business-day computations that I've been shown thus far.
>>
>> Cheers,
>> Mark
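For anyone wanting to try the two examples from the quoted proposal, here is a minimal sketch against the functions as they exist in released NumPy. Note one difference from the design text: released NumPy names the calendar object `np.busdaycalendar` and the keyword `busdaycal`, rather than `BusinessDayDef` and `busdaydef`. The holiday date and the sample dates below are illustrative choices, not taken from the thread.

```python
import numpy as np

# First Monday in February 2011: roll forward from 2011-02-01 to the first
# day allowed by the weekmask, then apply an offset of 0 business days.
first_monday = np.busday_offset('2011-02', 0, roll='forward', weekmask='Mon')

# Number of weekdays in February 2011 (the end date itself is excluded).
weekdays_in_feb = np.busday_count('2011-02', '2011-03')

# Holidays are passed as dates; 2011-02-21 is only an illustrative choice.
holidays = np.array(['2011-02-21'], dtype='M8[D]')
workdays_in_feb = np.busday_count('2011-02', '2011-03', holidays=holidays)

# A calendar object bundles weekmask and holidays so they can be reused.
calendar = np.busdaycalendar(weekmask='1111100', holidays=holidays)

# is_busday returns a boolean array matching the input dates
# (a Friday, a Saturday, and the holiday Monday).
dates = np.array(['2011-02-18', '2011-02-19', '2011-02-21'], dtype='M8[D]')
open_days = np.is_busday(dates, busdaycal=calendar)

print(first_monday, weekdays_in_feb, workdays_in_feb, open_days)
```

Keeping the weekmask and holidays in one calendar object is what the proposal describes as useful for, say, a dictionary mapping exchange names to their trading-day definitions.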

histogram2d error with empty inputs
by Benjamin Root 06 Jul '11

I found another empty input edge case. Somewhat recently, we fixed an issue with np.histogram() and empty inputs (so long as the bins are somehow known).

>>> np.histogram([], bins=4)
(array([0, 0, 0, 0]), array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ]))

However, histogram2d needs the same treatment.

>>> np.histogram2d([], [], bins=4)
(array([ 0.,  0.]), array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ]), array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ]))

The first element in the return tuple needs to be 4x4 (in this case).

Thanks,
Ben Root
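A quick way to check the behaviour requested above on a given NumPy installation is sketched here. It only inspects the shapes returned for empty inputs; the variable names are illustrative.

```python
import numpy as np

# Empty x and y samples with a known number of bins per axis.
counts, xedges, yedges = np.histogram2d([], [], bins=4)

# With 4 bins per axis the counts are expected to form a 4x4 grid of zeros,
# and each axis should have 5 bin edges.
print(counts.shape, xedges.shape, yedges.shape)
```

On a release that includes the fix, the counts array should be 4x4 rather than the two-element array shown in the report.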

alterNEP - was: missing data discussion round 2
by Matthew Brett 03 Jul '11

Hi,

On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs(a)pobox.com> wrote:

> Anyway, it's pretty clear that in this particular case, there are two
> distinct features that different people want: the missing data
> feature, and the masked array feature. The more I think about it, the
> less I see how they can be combined into one dessert topping + floor
> wax solution. Here are three particular points where they seem to
> contradict each other:

... [some proposals]

In the interest of making the discussion as concrete as possible, here is my draft of an alternative proposal for NAs and masking, based on Nathaniel's comments. Writing it, it seemed to me that Nathaniel is right, that the ideas become much clearer when the NA idea and the MASK idea are separate. Please do pitch in for things I may have missed or misunderstood:

################################################
An alternative-NEP on masking and missing values
################################################

The principle of this aNEP is to separate the APIs for masking and for missing values, according to

* The current implementation of masked arrays
* Nathaniel Smith's proposal.

This discussion is only of the API, and not of the implementation.

**************
Initialization
**************

First, missing values can be set and be displayed as ``np.NA, NA``::

    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
    array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA dtype::

    >>> np.array([1.0, 2.0, np.NA, 7.0])
    array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set and be displayed as ``np.MASKED, MASKED``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
    array([1., 2., MASKED, 7.], masked=True)

As the initialization is not ambiguous, this can be written without ``masked=True``::

    >>> np.array([1.0, 2.0, np.MASKED, 7.0])
    array([1., 2., MASKED, 7.], masked=True)

******
Ufuncs
******

By default, NA values propagate::

    >>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
    >>> np.sum(na_arr)
    NA('float64')

unless the ``skipna`` flag is set::

    >>> np.sum(na_arr, skipna=True)
    10.0

By default, masking does not propagate::

    >>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
    >>> np.sum(masked_arr)
    10.0

unless the ``propmsk`` flag is set::

    >>> np.sum(masked_arr, propmsk=True)
    MASKED

An array can be masked, and contain NA values::

    >>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])

In the default case, the behavior is obvious::

    >>> np.sum(both_arr)
    NA('float64')

It's also obvious what to do with ``skipna=True``::

    >>> np.sum(both_arr, skipna=True)
    10.0
    >>> np.sum(both_arr, skipna=True, propmsk=True)
    MASKED

To break the tie between NA and MSK, NAs propagate harder::

    >>> np.sum(both_arr, propmsk=True)
    NA('float64')

**********
Assignment
**********

is obvious in the NA case::

    >>> arr = np.array([1.0, 2.0, 7.0])
    >>> arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
    >>> na_arr[2] = np.NA
    >>> na_arr
    array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only via the mask::

    >>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
    >>> masked_arr[2] = np.NA
    TypeError('dtype does not support NA')
    >>> masked_arr[2] = np.MASKED
    TypeError('float() argument must be a string or a number')
    >>> masked_arr.visible[2] = False
    >>> masked_arr
    array([1., 2., MASKED], masked=True)

See y'all,

Matthew
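The `np.NA`, `np.MASKED`, `skipna` and `propmsk` names above belong to the proposal and are not a released NumPy API. As a rough analogue of the two default behaviours, and only an analogue, NaN can stand in for NA and `numpy.ma` for masking. The sketch below shows "propagate by default, skip on request" next to "skip by default".

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 7.0])

# NaN stands in for the proposed NA: it propagates through the reduction...
propagated = np.sum(values)
# ...and np.nansum stands in for the proposed ``skipna=True``.
skipped = np.nansum(values)

# numpy.ma stands in for the proposed MASKED: masked entries are left out
# of the reduction by default. The 99.0 payload is hidden, not deleted.
masked_arr = np.ma.array([1.0, 2.0, 99.0, 7.0],
                         mask=[False, False, True, False])
masked_total = masked_arr.sum()

print(propagated, skipped, masked_total)
```

The analogue breaks down where the proposal is most specific: NaN only exists for floating-point dtypes, and `numpy.ma` has no counterpart to `propmsk`.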

broadcasting question
by Thomas K Gamble 02 Jul '11

I'm trying to convert some IDL code to python/numpy and I'm having some trouble understanding the rules for broadcasting during some operations.

example:

given the following arrays:

a = array((2048,3577), dtype=float)
b = array((256,25088), dtype=float)
c = array((2048,3136), dtype=float)
d = array((2048,3136), dtype=float)

do:

a = b * c + d

In IDL, the computation is done without complaint and all array sizes are preserved. In Python I get a ValueError concerning broadcasting. I can force it to work by taking slices, but the resulting size would be a = (256x3136) rather than (2048x3577).

I admit that I don't understand IDL (or Python, to be honest) well enough to know how it handles this to be able to replicate the result properly. Does it only operate on the smallest dimensions, ignoring the larger indices and leaving their values unchanged? Can someone explain this to me?

--
Thomas K. Gamble
tkgamble(a)windstream.net
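For reference, NumPy compares shapes axis by axis, starting from the last axis. Two extents are compatible only when they are equal or one of them is 1. The sketch below treats the tuples in the question as array shapes (an assumption, since `array((2048,3577))` as written would create a two-element array). It checks compatibility without allocating anything, then does the slicing workaround on scaled-down stand-ins. The helper names are hypothetical, and the code does not claim to reproduce what IDL computes.

```python
import numpy as np

# Shapes taken from the question above.
shape_b = (256, 25088)
shape_c = (2048, 3136)
shape_d = (2048, 3136)


def can_broadcast(*shapes):
    """Return True if NumPy can broadcast the given shapes together."""
    try:
        np.broadcast_shapes(*shapes)  # available in NumPy 1.20 and newer
    except ValueError:
        return False
    return True


def overlap_shape(*shapes):
    """Smallest extent along each axis (shapes must have equal length)."""
    return tuple(min(dims) for dims in zip(*shapes))


def combine_on_overlap(b, c, d):
    """Compute b * c + d on the index region that all three arrays share."""
    shared = overlap_shape(b.shape, c.shape, d.shape)
    region = tuple(slice(0, n) for n in shared)
    return b[region] * c[region] + d[region]


# 256 vs 2048 and 25088 vs 3136: neither pair is equal or contains a 1.
print(can_broadcast(shape_b, shape_c))
print(overlap_shape(shape_b, shape_c, shape_d))

# Scaled-down stand-ins so the arithmetic is cheap to run.
b = np.ones((2, 8))
c = np.full((4, 6), 2.0)
d = np.full((4, 6), 3.0)
result = combine_on_overlap(b, c, d)
print(result.shape)
```

The shared region for the shapes in the question is (256, 3136), which matches the size the poster got by slicing. To leave the rest of a larger `a` untouched, the result could be assigned into the matching slice of `a`.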

review request: introductory datetime documentation
by Mark Wiebe 01 Jul '11

https://github.com/numpy/numpy/pull/101 Thanks, Mark

missing data discussion round 2
by Mark Wiebe 30 Jun '11

First I'd like to thank everyone for all the feedback you're providing, clearly this is an important topic to many people, and the discussion has helped clarify the ideas for me. I've renamed and updated the NEP, then placed it into the master NumPy repository so it has a more permanent home here:

https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst

In the NEP, I've tried to address everything that was raised in the original thread and in Nathaniel's followup 'Concepts' thread. To deal with the issue of whether a mask is True or False for a missing value, I've removed the 'mask' attribute entirely, except for ufunc-like functions np.ismissing and np.isavail which return the two styles of masks. Here's a high level summary of how I'm thinking of the topic, and what I will implement:

*Missing Data Abstraction*

There appear to be two useful ways to think about missing data that are worth supporting.

1) Unknown yet existing data
2) Data that doesn't exist

In 1), an NA value causes outputs to become NA except in a small number of exceptions such as boolean logic, and in 2), operations treat the data as if there were a smaller array without the NA values.

*Temporarily Ignoring Data*

In some cases, it is useful to flag data as NA temporarily, possibly in several different ways, for particular calculations or testing out different ways of throwing away outliers. This is independent of the missing data abstraction, still requiring a choice of 1) or 2) above.

*Implementation Techniques*

There are two mechanisms generally used to implement missing data abstractions,

1) An NA bit pattern
2) A mask

I've described a design in the NEP which can include both techniques using the same interface. The mask approach is strictly more general than the NA bit pattern approach, except for a few things like the idea of supporting the dtype 'NA[f8,InfNan]' which you can read about in the NEP.

My intention is to implement the mask-based design, and possibly also implement the NA bit pattern design, but if anything gets cut it will be the NA bit patterns.

Thanks again for all your input so far, and thanks in advance for your suggestions for improving this new revision of the NEP.

-Mark
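To make the two mask conventions and the two abstractions concrete, here is a small sketch using NaN as a stand-in for NA. `np.ismissing` and `np.isavail` are names from the NEP, not released functions. The `is_missing` and `is_avail` variables below are illustrative equivalents built from `np.isnan`.

```python
import numpy as np

values = np.array([3.0, np.nan, 5.0, np.nan, 10.0])

# Two mask conventions carrying the same information.
is_missing = np.isnan(values)   # True where the value is missing
is_avail = ~is_missing          # True where the value is available

# Abstraction 1 (unknown yet existing data): the unknown values make the
# result unknown as well.
mean_unknown = values.mean()

# Abstraction 2 (data that doesn't exist): compute as if the array only
# contained the available values.
mean_existing = values[is_avail].mean()

print(is_missing, mean_unknown, mean_existing)
```

Boolean indexing copies the selected values, so this only illustrates the semantics of abstraction 2, not the in-place design the NEP describes.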

missing data: semantics
by Lluís 30 Jun '11

Ok, I think it's time to step back and reformulate the problem by completely ignoring the implementation.

Here we have 2 "generic" concepts (i.e., applicable to R), plus another extra concept that is exclusive to numpy:

* Assigning np.NA to an array cannot be undone except through explicit assignment (i.e., assigning a new arbitrary value, or saving a copy of the original array before assigning np.NA).

* np.NA values propagate by default, unless ufuncs have the "skipna = True" argument (or the other way around, it doesn't really matter to this discussion). In order to avoid passing the argument on each ufunc, we either have some per-array variable for the default "skipna" value (undesirable) or we can make a trivial ndarray subclass that will set the "skipna" argument on all ufuncs through the "_ufunc_wrapper_" mechanism.

Now, numpy has the concept of views, which adds some more goodies to the list of concepts:

* With views, two arrays can share the same physical data, so that assignments to any of them will be seen by others (including NA values).

The creation of a view is explicitly stated by the user, so its behaviour should not be perceived as odd (after all, you asked for a view). The good thing is that with views you can avoid costly array copies if you're careful when writing into these views.

Now, you can add a new concept: local/temporal/transient missing data. We can take an existing array and create a view with the new argument "transientna = True". Here, both the view and the "transientna = True" are explicitly stated by the user, so it is assumed that she already knows what this is all about. The difference from a regular view is that you also explicitly asked for local/temporal/transient NA values.

* Assigning np.NA to an array view with "transientna = True" will *not* be seen by any of the other views (nor the "original" array), but anything else will still work "as usual".

After all, this is what *you* asked for when using the "transientna = True" argument.

To conclude, others *must not* care about whether the arrays they're working with have transient NA values. This way, I can create a view with transient NAs, set some uninteresting data to NA, and pass it to a routine written by someone else that sets to NA elements that, for example, are beyond a certain threshold from the mean of the elements. This would be equivalent to storing a copy of the original array before passing it to this 3rd party function, only that "transientna", just like views, provides some handy shortcuts to avoid copies.

My main point here is that views and local/temporal/transient NAs are all *explicitly* requested, so their behaviour should not appear as something unexpected.

Is there an agreement on this?

Lluis

--
"And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
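The "transientna = True" argument is hypothetical, but behaviour close to it can be sketched with the existing `numpy.ma` module. Several masked arrays can wrap the same data buffer, each with its own mask. The example below assumes the default no-copy behaviour of `np.ma.masked_array` when it is given an existing ndarray.

```python
import numpy as np

base = np.array([1.0, 2.0, 100.0, 4.0])

# Two masked views over the same buffer, each with its own mask.
view_a = np.ma.masked_array(base)
view_b = np.ma.masked_array(base)

# Hiding an element in one view only changes that view's mask; the other
# view and the base array still see the value 100.0.
view_a[2] = np.ma.masked

# Ordinary assignments still go through to the shared data.
view_a[0] = 10.0

print(base)
print(view_a.mean(), view_b.mean())
```

Here `view_a` averages only its three visible elements, while `view_b` averages all four, including the value written through `view_a`.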

feedback request: proposal to add masks to the core ndarray
by Mark Wiebe 30 Jun '11

Enthought has asked me to look into the "missing data" problem and how NumPy could treat it better. I've considered the different ideas of adding dtype variants with a special signal value and masked arrays, and concluded that adding masks to the core ndarray appears to be the best way to deal with the problem in general.

I've written a NEP that proposes a particular design, viewable here:

https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-arra…

There are some questions at the bottom of the NEP which definitely need discussion to find the best design choices. Please read, and let me know of all the errors and gaps you find in the document.

Thanks,
Mark

Fitting Log normal or truncated log normal distribution to three points
by Sachin Kumar Sharma 30 Jun '11

Hi,

I have three points: 10800, 81100, 582000.

What is the easiest way of fitting a log normal and truncated log normal distribution to these three points using numpy?

I would appreciate your reply.

Cheers

Sachin

************************************************************************
Sachin Kumar Sharma
Senior Geomodeler
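One simple starting point, sketched under the assumption that the three numbers are independent samples (the post does not say whether they are samples or quantiles such as P10/P50/P90): if X is log-normal then log(X) is normal, so the log-normal parameters can be estimated from the mean and standard deviation of the logs. The truncated case is not covered here, as it generally needs a numerical optimiser beyond plain NumPy.

```python
import numpy as np

points = np.array([10800.0, 81100.0, 582000.0])

# Fit a normal distribution to the logarithms of the data.
log_points = np.log(points)
mu = log_points.mean()
sigma = log_points.std(ddof=0)  # maximum-likelihood form; ddof=1 for the sample estimate

# Summary statistics of the fitted log-normal distribution.
fitted_median = np.exp(mu)
fitted_mean = np.exp(mu + 0.5 * sigma ** 2)

print(mu, sigma, fitted_median, fitted_mean)
```

With only three points the estimate of `sigma` is very uncertain, so the fitted curve is best read as a rough guide.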

Two bugs in recfunctions.join_by with patch
by Skipper Seabold 30 Jun '11

These two cases failed in recfunctions.join_by:

1) The case of either r1postfix or r2postfix being an empty string was not handled.
2) The case of more than one key and more than one variable with a name collision.

Patch and tests in a pull request here:

https://github.com/numpy/numpy/pull/100

Skipper
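For context, here is a minimal sketch of the name-collision handling that the postfix arguments control, written against `numpy.lib.recfunctions.join_by` as it exists in released NumPy. The field names and values are made up for illustration, and the second call exercises the empty `r1postfix` situation from case 1.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Two structured arrays sharing the key 'k' and a colliding field 'v'.
dtype = [('k', int), ('v', float)]
r1 = np.array([(1, 10.0), (2, 20.0), (3, 30.0)], dtype=dtype)
r2 = np.array([(2, 200.0), (3, 300.0), (4, 400.0)], dtype=dtype)

# Default postfixes ('1' and '2'): the colliding field is renamed per input.
joined = rfn.join_by('k', r1, r2, jointype='inner', usemask=False)

# Empty r1postfix: r1's field keeps its name and only r2's is renamed.
joined_plain_r1 = rfn.join_by('k', r1, r2, jointype='inner',
                              r1postfix='', r2postfix='2', usemask=False)

print(joined.dtype.names)
print(joined_plain_r1.dtype.names)
```

An inner join keeps only the keys present in both inputs, so rows with `k` equal to 1 or 4 are dropped.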
