Re: [Python-ideas] [Python-Dev] minmax() function returning (minimum, maximum) tuple of a sequence

On Mon, 11 Oct 2010 05:57:21 am Paul McGuire wrote:
Perhaps more importantly, it is ideal for the use-case where you have an iterator. You can't call min() and then max(), as min() consumes the iterator leaving nothing for max(). It may be undesirable to convert the iterator to a list first -- it may be that the number of items in the data stream is too large to fit into memory all at once, but even if it is small, it means you're now walking the stream three times when one would do. To my mind, minmax() is as obvious and as useful a built-in as divmod(), but if there is resistance to making such a function a built-in, perhaps it could go into itertools. (I would prefer it to keep the same signature as min() and max(), namely that it will take either a single iterable argument or multiple arguments.) I've experimented with minmax() myself. Not surprisingly, the performance of a pure Python version doesn't even come close to the built-ins. I'm +1 on the idea. Presumably follow-ups should go to python-ideas. -- Steven D'Aprano
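For concreteness, a minimal pure-Python sketch of such a single-pass minmax(), keeping the same calling convention as min()/max(); this is an illustration, not the implementation Steven benchmarked:

def minmax(*args):
    # Like min()/max(): accepts one iterable or several arguments.
    # Single pass, so it also works on one-shot iterators.
    it = iter(args[0]) if len(args) == 1 else iter(args)
    try:
        lo = hi = next(it)
    except StopIteration:
        raise ValueError("minmax() arg is an empty sequence")
    for x in it:
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi

print(minmax(iter([3, 1, 4, 1, 5, 9, 2])))   # (1, 9)
print(minmax(3, 1, 4))                       # (1, 4)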

This could be generalized and placed into itertools if we create a function (say, apply, for lack of a better name at the moment) that takes in an iterable and creates new iterables that yield each item from the original (avoiding the need for a list), holding only one in memory. Then you could pass whatever functions you wanted to run over the iterables and get the results back in a tuple. E.g.: itertools.apply(iterable, min, max) ~= (min(iterable), max(iterable)) This class that creates 'associated iterables' from an original iterable, where each new iterable has to be iterated over at the same time, might also be useful in other contexts and could be added to itertools as well. Unfortunately this solution seems incompatible with the for-loop implementations of min and max (e.g. how do you switch functions at the right time?), so it might take some tweaking. -- Zachary Burns (407)590-4814 Aim - Zac256FL On Sun, Oct 10, 2010 at 4:17 PM, Steven D'Aprano <steve@pearwood.info> wrote:

On 2010-10-11, at 02:55 , Zac Burns wrote:
As far as I know, there is no way to force lockstep iteration of arbitrary functions in Python. Though an argument could be made for adding coroutine capabilities to builtins and library functions taking iterables, I don't think that's on the books. As a result, this function would devolve into something along the lines of

def apply(iterable, *funcs):
    return map(lambda c: c[0](c[1]),
               zip(funcs, tee(iterable, len(funcs))))

which would run out of memory on very long or nigh-infinite iterables due to tee memoizing all the content of the iterator.

Masklinn wrote:
We recently needed exactly this -- to do several running calculations in parallel on an iterable. We avoided using co-routines and just created a RunningCalc class with a simple interface, and implemented various running calculations as sub-classes, e.g. min, max, average, variance, n-largest. This isn't very fast, but since generating the iterated values is computationally heavy, this is fast enough for our uses. Having a standard method to do this in Python, with implementations for common calculations in the stdlib, would have been nice. I wouldn't mind trying to work up a PEP for this, if there is support for the idea. - Tal Einat
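Tal doesn't show the class here, but a minimal sketch of what such a RunningCalc interface might look like (the names and details are my guess, not the actual implementation):

class RunningCalc(object):
    # Minimal interface: feed values one at a time, read the result when done.
    def feed(self, value):
        raise NotImplementedError

class RunningMean(RunningCalc):
    def __init__(self):
        self.count = 0
        self.total = 0.0
    def feed(self, value):
        self.count += 1
        self.total += value
    def result(self):
        return self.total / self.count

rmean = RunningMean()
for x in [2, 4, 6]:
    rmean.feed(x)
print(rmean.result())   # 4.0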

On 11 October 2010 21:18, Tal Einat <taleinat@gmail.com> wrote:
The "consumer" interface as described in http://effbot.org/zone/consumer.htm sounds about right for this: class Rmin(object): def __init__(self): self.running_min = None def feed(self, val): if self.running_min is None: self.running_min = val else: self.running_min = min(self.running_min, val) def close(self): pass rmin = Rmin() for val in iter: rmin.feed(val) print rmin.running_min Paul.

Paul Moore wrote:
That's what I was thinking about too. How about something along these lines? http://pastebin.com/DReBL56T I just worked that up now and would like some comments and suggestions. It could either turn into a PEP or an external library, depending on popularity here. - Tal Einat

On 12 October 2010 20:41, Tal Einat <taleinat@gmail.com> wrote:
Looks reasonable. I'd suspect it would be more appropriate as an external library rather than going directly into the stdlib. Also, when I've needed something like this in the past (for simulation code, involving iterators with millions of entries) speed has been pretty critical, so something pure-python like this might not have been enough. Maybe it's something that would be appropriate for numpy? But I like the idea in general. I don't see the need for the RunningCalc base class (duck typing rules!) and I'd be tempted to add dummy close methods, to conform to the published consumer protocol (even though it's not a formal Python standard). I wouldn't necessarily use the given apply function, either, but that's a matter of taste (I would suggest you change the name, though, to avoid reusing the old apply builtin's name, which was something entirely different). Paul

On Mon, Oct 11, 2010 at 10:18 PM, Tal Einat wrote:
After some thought, I've found a way to make running several "running calculations" in parallel fast. Speed should be comparable to having used the non-running variants. The method is to give each running calculation "blocks" of values instead of just one at a time. The apply_in_parallel(iterable, block_size=1000, *running_calcs) function would get blocks of values from the iterable and pass them to each running calculation separately. So RunningMax would look something like this:

class RunningMax(RunningCalc):
    def __init__(self):
        self.max_value = None
    def feed(self, value):
        if self.max_value is None or value > self.max_value:
            self.max_value = value
    def feedMultiple(self, values):
        self.feed(max(values))

feedMultiple() would have a naive default implementation in the base class. Now this is non-trivial and can certainly be useful. Thoughts? Comments? - Tal Einat
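A rough reconstruction of what an apply_in_parallel() along these lines might look like (my own sketch, not Tal's code; the islice-based chunking and the value() accessor on each running calculation are assumptions):

from itertools import islice

def apply_in_parallel(iterable, *running_calcs, **kwds):
    # block_size as a keyword so the running calcs can be passed positionally.
    block_size = kwds.pop('block_size', 1000)
    it = iter(iterable)
    while True:
        block = list(islice(it, block_size))
        if not block:
            break
        for calc in running_calcs:
            calc.feedMultiple(block)
    # Assumes each running calculation exposes its result via value().
    return tuple(calc.value() for calc in running_calcs)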

On Thu, Oct 14, 2010 at 7:54 AM, Tal Einat <taleinat@gmail.com> wrote:
Why use feed() rather than the existing generator send() API?

def runningmax(default_max=None):
    max_value = default_max
    while 1:
        value = max(yield max_value)
        if max_value is None or value > max_value:
            max_value = value

That said, I think this kind of thing requires too many additional assumptions about how things are driven to make a particularly good candidate for standard library inclusion without use in a PyPI library first. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
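For reference, driving such a generator-based consumer would look roughly like this (my illustration; note the priming next() call that Tal objects to below):

rmax = runningmax()
next(rmax)                         # prime the generator up to its first yield
for block in ([3, 1, 4], [1, 5, 9], [2, 6]):
    current = rmax.send(block)     # send() returns the value yielded back
print(current)                     # 9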

On Thu, Oct 14, 2010 at 1:14 AM, Nick Coghlan wrote:
I tried using generators for this and it came out very clumsy. For one thing, using generators for this requires first calling next() once to run the generator up to the first yield, which makes the user-facing API very confusing. Generators also have to yield a value at every iteration, which is unnecessary here. Finally, the feedMultiple optimization is impossible with a generator-based implementation.
I'm not sure. "Rolling your own" for this isn't too difficult, so many developers will prefer to do so rather than add another dependency from PyPI. On the other hand, Python's standard library includes various simple utilities that make relatively simple things easier, standardized and well tested. Additionally, I think this fits in very nicely with Python's embracing of iterators, and complements the itertools library well. While I'm at it, I'd like to mention that I am aiming at a single very simple common usage pattern:

from RunningCalc import apply_in_parallel, RunningCount, RunningNLargest, RunningNSmallest

count, largest10, smallest10 = apply_in_parallel(data, RunningCount(),
                                                 RunningNLargest(10),
                                                 RunningNSmallest(10))

Implementing new running calculation classes would also be very simple. - Tal Einat

On 10/13/2010 07:13 PM, Tal Einat wrote:
Something I noticed about the min and max functions is that they treat values and iterables slightly differently.

# This works
>>> min(3, 4)
3

# The gotcha
>>> min(3)
Traceback (most recent call last):
  ...
TypeError: 'int' object is not iterable

So you need a function like the following to make it handle single values and single iterables the same.

def xmin(value):
    try:
        return min(value)
    except TypeError:
        return min([value])

Then you can do...

@consumer
def Running_Min(out_value=None):
    while 1:
        in_value = yield out_value
        if in_value is not None:
            if out_value is None:
                out_value = xmin(in_value)
            else:
                out_value = xmin(out_value, xmin(in_value))

Or for your class...

def xmax(value):
    try:
        return max(value)
    except TypeError:
        return max([value])

class RunningMax(RunningCalc):
    def __init__(self):
        self.max_value = None
    def feed(self, value):
        if value is not None:
            if self.max_value is None:
                self.max_value = xmax(value)
            else:
                self.max_value = xmax(self.max_value, xmax(value))

Now if they could handle None a bit better we might be able to get rid of the None checks too. ;-) Cheers, Ron
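The @consumer decorator used here isn't shown anywhere in the thread; presumably it is the usual priming decorator in the style of the effbot consumer article and PEP 342, roughly:

def consumer(func):
    # Wrap a generator function so it is advanced to its first yield,
    # making it ready to receive values via send() immediately.
    def wrapper(*args, **kwds):
        gen = func(*args, **kwds)
        next(gen)
        return gen
    return wrapper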

Why would you ever want to write min(1)? (Or min(x) where x is not iterable.) --Guido On Thu, Oct 14, 2010 at 7:09 PM, Ron Adam <rrr@ronadam.com> wrote:
-- --Guido van Rossum (python.org/~guido)

My apologies, I clicked "reply" instead of "reply list" last night. After thinking about this a bit more, it isn't a matter of never needing to do it. The min and max functions wouldn't be able to compare a series of lists as individual values without a keyword switch to choose the specific behavior for a single item, i.e. a list of items vs. an item that happens to be a list. The examples below would not be able to compare sequences correctly. Ron On 10/14/2010 09:14 PM, Guido van Rossum wrote:
Why would you ever want to write min(1)? (Or min(x) where x is not iterable.)
Basically to allow easier duck typing without having to check whether x is an iterable. This isn't a big deal or a must-have. It's just one solution to a problem presented here. My own thought is that little tweaks like this may be helpful when using functions in indirect ways where it's nice not to have to do additional value, type, or attribute checking. [Tal also says]
As Guido mentioned, there is never a reason to do max(value) where value is not an iterable.
Well, you can always avoid doing it, but that doesn't mean it wouldn't be nice to have sometimes. Take a look at the following three coroutines that do the exact same thing. Which is easier to read, and which would be considered the more Pythonic?

def xmin(*args, **kwds):
    # Allow min to work with a single non-iterable value.
    if len(args) == 1 and not hasattr(args[0], "__iter__"):
        return min(args, **kwds)
    else:
        return min(*args, **kwds)

# Accept values or chunks of values and keep a running minimum.
@consumer
def Running_Min(out_value=None):
    while 1:
        in_value = yield out_value
        if in_value is not None:
            if out_value is None:
                out_value = xmin(in_value)
            else:
                out_value = xmin(out_value, xmin(in_value))

@consumer
def Running_Min(out_value=None):
    while 1:
        in_value = yield out_value
        if in_value is not None:
            if not hasattr(in_value, "__iter__"):
                in_value = [in_value]
            if out_value is None:
                out_value = min(in_value)
            else:
                out_value = min(out_value, min(in_value))

@consumer
def Running_Min(out_value=None):
    while 1:
        in_value = yield out_value
        if in_value is not None:
            if not hasattr(in_value, "__iter__"):
                if out_value is None:
                    out_value = in_value
                else:
                    out_value = min(out_value, in_value)
            else:
                if out_value is None:
                    out_value = min(in_value)
                else:
                    out_value = min(out_value, min(in_value))

Am 15.10.2010 19:13, schrieb Ron Adam:
I don't understand this function. Why wouldn't you simply always call return min(args, **kwds) ? Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.

On 10/15/2010 12:27 PM, Georg Brandl wrote:
Because it would always interpret a list of values as a single item. This function looks at args and, if it's a single value without an "__iter__" method, it passes it to min as min([value], **kwds) instead of min(value, **kwds). Another way to do this would be to use a try-except...

try:
    return min(*args, **kwds)
except TypeError:
    return min(args, **kwds)

Ron

Am 15.10.2010 20:09, schrieb Ron Adam:
And that's just gratuitous. If you have the sequence of items to compare already as an iterable, there is absolutely no need to unpack them using *args. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.

On 15 October 2010 19:09, Ron Adam <rrr@ronadam.com> wrote:
But there are many iterable objects which are also comparable (hence it makes sense to consider their min/max), for example strings. So we get:

xmin("foo", "bar", "baz") == "bar"
xmin("foo", "bar") == "bar"

but:

xmin("foo") == "f"

This will create havoc in your running min routine. (Notice the same will hold for min() but at least you know that min(x) considers x as an iterable and complains if it isn't) -- Arnaud

On Fri, Oct 15, 2010 at 4:09 AM, Ron Adam wrote:
Sorry, my bad. The max in "value = max(yield max_value)" was an error, it should have been removed. As Guido mentioned, there is never a reason to do max(value) where value is not an iterable. - Tal

On Thu, 14 Oct 2010 08:54:31 am you wrote:
Speed "should be" comparable? Are you guessing or have you actually timed it? And surely the point of all this extra coding is to make something run *faster*, not "comparable to", the sequential algorithm?
As I understand it, you have a large list of data, and you want to calculate a number of statistics on it. The naive algorithm does each calculation sequentially:

a = min(data)
b = max(data)
c = f(data)  # some other statistics
d = g(data)
...
x = z(data)

If the calculations take t1, t2, t3, ..., tx time, then the sequential calculation takes sum(t1, t2, ..., tx) plus a bit of overhead. If you do it in parallel, this should reduce the time to max(t1, t2, ..., tx) plus a bit of overhead, potentially a big saving. But unless I've missed something significant, all you are actually doing is slicing data up into small pieces, then calling each function min, max, f, g, ..., z on each piece sequentially:

block = data[:size]
a = min(block)
b = max(block)
c = f(block)
...
block = data[size:2*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...
block = data[2*size:3*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...

Each function still runs sequentially, but you've increased the amount of overhead a lot: you're slicing and dicing the data, plus calling each function multiple times. And since each "running calculation" class needs to be hand-written to suit the specifics of the calculation, that's a lot of extra work just to get something which I expect will run slower than the naive sequential algorithm. I'm also distracted by the names, RunningMax and RunningCalc. RunningFoo doesn't mean "do Foo in parallel", it means to return intermediate calculations. For example, if I ran a function called RunningMax on this list: [1, 2, 1, 5, 7, 3, 4, 6, 8, 6] I would expect it to yield or return: [1, 2, 5, 7, 8] -- Steven D'Aprano

On Thu, Oct 14, 2010 at 1:23 PM, Steven D'Aprano wrote:
The use-case I'm targeting is when you can't hold all of the data in memory, and it is relatively "expensive" to generate it, e.g. a large and complex database query. In this case just running the sequential functions one at a time requires generating the data several times, once per function. My goal is to facilitate running several computations on a single iterator without keeping all of the data in memory. In almost all cases this will be slower than having run each of the sequential functions one at a time, if it were possible to keep all of the data in memory. The grouping optimization aims to reduce the overhead. - Tal Einat

On Thu, 14 Oct 2010 11:05:25 pm you wrote:
Okay, fair enough, but I think that's enough of a specialist need that it doesn't belong as a built-in or even in the standard library. I suspect that, even for your application, a more sensible approach would be to write a single function to walk over the data once, doing all the calculations you need. E.g. if your data is numeric, and you need (say) the min, max, mean (average), standard deviation and standard error, rather than doing a separate pass for each function, you can do them all in a single pass:

sum = 0
sum_sq = 0
count = 0
smallest = sys.maxint
biggest = -sys.maxint
for x in data:
    count += 1
    sum += x
    sum_sq += x**2
    smallest = min(smallest, x)
    biggest = max(biggest, x)
mean = sum/count
std_dev = math.sqrt((sum_sq - sum**2/count)/(count-1))
std_err = std_dev/math.sqrt(count)

That expression for the standard deviation is from memory, don't trust it, I've probably got it wrong! Naturally, if you don't know what functions you need to call until runtime it will require a bit more cleverness. A general approach might be a functional approach based on reduce:

def multireduce(functions, initial_values, data):
    values = list(initial_values)
    for x in data:
        for i, func in enumerate(functions):
            values[i] = func(x, values[i])
    return values

The point is that if generating the data is costly, the best approach is to lazily generate the data once only, with the minimal overhead and maximum flexibility. -- Steven D'Aprano
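For what it's worth, a small usage sketch of that multireduce() idea (my illustration, with an arbitrary choice of functions and initial values):

data = [3, 1, 4, 1, 5, 9, 2, 6]
funcs = [lambda x, acc: min(x, acc),   # running minimum
         lambda x, acc: max(x, acc),   # running maximum
         lambda x, acc: acc + x]       # running sum
print(multireduce(funcs, [data[0], data[0], 0], data))
# -> [1, 9, 31]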

On Fri, Oct 15, 2010 at 5:12 AM, Steven D'Aprano wrote:
I don't see this as a specialist need. This is relevant to any piece of code which receives an iterator and doesn't know whether it is feasible to keep all of its items in memory. The way I see it, Python's embracing of iterators is what makes this commonly useful.
What you suggest is that each programmer rolls his own code, which is reasonable for tasks which are not very common and are easy enough to implement. The problem is that in this case, the straightforward solution you suggest has both efficiency and numerical stability problems. These are actually quite tricky to understand and sort out. In light of this, a standard implementation which avoids common stumbling blocks and errors could have its place in the standard library. IIRC these were the reasons for the inclusion of the bisect module, for example. Regarding the numerical stability issues, these don't arise just in extreme edge-cases. Even a simple running average calculation for some large numbers, or numbers whose average is near zero, can have significant errors. Variance and standard deviation are even more problematic in this respect.
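As one concrete example of the stability issue, a numerically robust running mean/variance is usually computed with Welford's online algorithm rather than the sum/sum-of-squares approach; a sketch (my illustration, not code from the thread):

class RunningVariance(object):
    # Welford's online algorithm: numerically stable running mean/variance.
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the current mean
    def feed(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
    def variance(self):
        # sample variance; undefined for fewer than two values
        return self.m2 / (self.count - 1)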
This is precisely what I am suggesting! The only difference is that I suggest using objects with a simple API instead of functions, to allow more flexibility. Some things are hard to implement using just a function as you suggest, and various optimizations are impossible. - Tal Einat

On Mon, Oct 11, 2010 at 1:17 AM, Steven D'Aprano wrote:
The discussion which followed this up has digressed quite a bit, but I'd like to mention that I'm +1 on having an efficient minmax() function available. - Tal

participants (10)

- Arnaud Delobelle
- Georg Brandl
- Guido van Rossum
- Masklinn
- Nick Coghlan
- Paul Moore
- Ron Adam
- Steven D'Aprano
- Tal Einat
- Zac Burns