Standard Forth versus Python: a case study

John Doty jpd at whispertel.LoseTheH.net
Thu Oct 12 17:35:34 CEST 2006


Paul McGuire wrote:
> <idknow at gmail.com> wrote in message 
> news:1160652722.908731.213650 at i42g2000cwa.googlegroups.com...
>> Paul McGuire wrote:
>>> <idknow at gmail.com> wrote in message
>>> news:1160619958.438049.53390 at h48g2000cwc.googlegroups.com...
>>>> [snip]
>>>>
>>>> no sort() is needed to calculate the median of a list.
>>>>
>>>> you just need one temp var.
>>>>
>>> Ok, I'll bite.  How do you compute the median of a list using just a 
>>> single
>>> temp var?
>>>
>>> -- Paul
>> hi Paul; well when this was a stats-class assignment (back when pascal
>> was popular :) i just stepped through the vector and compared it
>>
>> (pseudo-code)
>>
>> ptr p = [with values].
>>
>> fun median {
>> var x = 0.
>>  while( *p++) {
>>    if( (*p) > x) x = *p.
>>  }
>>  return x.
>> }
>>
>> of course, pascal is more verbose but that's median()
>>
> 
> No, that's the maximum.  The median value is the value that is in the middle 
> of the list when the list is sorted.  Many analyses prefer median to mean 
> (also known as "average") because the median is less sensitive to wild 
> outlier points.
> 
> My original question was in response to your post, that sort() wasn't 
> required but only a temp variable.  I am very interested in seeing your 
> solution that does not require the data to be sorted.  (This is not just an 
> academic exercise - given a large historical data set, sorting the data is 
> one of the costliest parts of computing the median, and I would greatly 
> appreciate seeing an alternative algorithm.)

Here's a K&R C function I wrote almost 20 years ago. It's a general 
purpose quantile. The essential idea is to choose an element at random 
(thus mostly avoiding perverse behavior with already sorted data) and 
divide the set into two pieces around it. Then you figure out which 
piece contains the quantile you want, and what quantile it is within 
that piece, and recurse. When you see enough identical elements in the 
piece you're processing, it's done. In the extreme case you'll get down 
to one element.

ixrand(n) generates a random integer in the range 0..n-1. I think that's 
the only nonstandard function used.

The style is torqued by what Unisoft C could and couldn't optimize: I 
wouldn't write it quite like that today. One of my students pointed out 
that all of the recursion is tail recursion so it should be easy to 
flatten. Left as an exercise to the reader...

Somewhere, in Knuth I think, I saw a somewhat similar algorithm that 
promised a little better performance by estimating the median from a 
sample of the data, breaking the data up there, and then looking for a 
quantile (statistically guaranteed to be) close to the min or max of one 
of the subsets.

It shouldn't be hard to recode in Python, Forth, or whatever. That 
wasn't my purpose in the exercise that started the thread though: I 
wanted to see if I could import modules good enough to do the job from 
public sources. In Python I could, and the entire image processing 
program is 15 lines. In Forth I couldn't.

Anyway, here it is:

/* Find the nth from the minimum value in an array */
/* Monte Carlo method intended for finding medians */
/* 2/13/85 jpd */

/* For random data, this routine takes about */
/* 2.6*numdata + O( log( numdata )) comparisons */
/* If the data is tightly clustered about the mean, */
/* there is a speedup; it may take as few as
/* 0.5*numdata comparisons. */
/* There is a slight penalty if the array is completely */
/* or partially sorted; it is at most 25%. */

/* NTH will be nthi, nths, etc., depending on DATATYPE */

NTH( data, numdata, n )
DATATYPE data[];		/* Data array (will be scrambled on return) */
int numdata;		/* lemgth of data array */
int n;			/* index if item to find:
			   1 <= n <= numdata */
{
	register DATATYPE boundary, thisdata;
	register DATATYPE *lowp, *highp;
	DATATYPE v1, v2;
	int nlowbin;

	lowp = data;			/* Init data pointer */

	v1 = data[ ixrand( numdata )];
	{
		register DATATYPE v1r = v1;
		int nc = 1 + numdata - n;	/* "Complement" of n */

		if( nc > n )
			highp = lowp + nc;
		else
			highp = lowp + n;	/* Limit to test for done */

					/* Scan for the first point which
					   doesn't match the boundary point.
					   If we encounter enough
					   matching points,
					   the boundary is the answer */
		while( *lowp++ == v1r ) {
			if( lowp >= highp ) return( v1r );
		}
		v2 = *--lowp;		/* Back up to get point */
	}

	boundary = ( v1 >> 1 ) + ( v2 >> 1 );	/* Beware overflows */

	highp = data + numdata;		/* Now process the whole thing */
	thisdata = *lowp;		/* Prime the pump */

	if( v2 < v1 ) {			/* Bin 2 is low bin */
		for( ; lowp < highp; thisdata = *lowp ) {
			if( thisdata <= boundary ) { /* Bin 2 */
				*lowp = *--highp;	/* Exchange */
				*highp = thisdata;
			}
			else ++lowp;	/* Data point in right place */
		}

		nlowbin = numdata - ( lowp - data );
		if( nlowbin >= n ) return( NTH( highp, nlowbin, n ));
		else return( NTH( data, lowp - data, n - nlowbin ));

	}
	else {			/* Primary bin is low bin */
		for( ; lowp < highp; thisdata = *lowp ) {
			if( thisdata > boundary ) { /* Bin 2 */
				*lowp = *--highp;	/* Exchange */
				*highp = thisdata;
			}
			else ++lowp;	/* Don't move point */
		}

		nlowbin = ( lowp - data );
		if( nlowbin >= n ) return( NTH( data, nlowbin, n ));
		else return( NTH( highp, numdata - nlowbin, n - nlowbin ));
	}
}




-- 
John Doty, Noqsi Aerospace, Ltd.
--
Specialization is for robots.



More information about the Python-list mailing list