Rankordering for nonparametric statistics (Newbie)

Stephen Horne $$$$$$$$$$$$$$$$$ at $$$$$$$$$$$$$$$$$$$$.co.uk
Tue Oct 14 02:24:50 EDT 2003


On Sun, 12 Oct 2003 23:11:16 GMT, baf at texas.antispam.net (Ben
Fairbank) wrote:

>I have a matrix with many rows(say 1000 to make this concrete) and a
>dozen or so columns.  One column has numbers ranging from 1 to several
>hundred.  I have to create a new column with numbers from 1 to 1000
>corresponding to the smallest to largest numbers (don't worry about
>ties (yet)) in the column of interest.  The new numbers thus indicate
>the ordinal or rank order of the values in the given column.  I have
>been futzing around with argsort, but cannot find an elegant fast way
>to do it.  Can a reader suggest?
>
>Thank you,
>
>BAFairbank

I just happen to be looking into stats for psychological studies at
the moment, so I know exactly what you mean. My approach would be...

1.  Build a list for the particular column, containing a tuple of
    value and subscript (position in the column).

    That subscript is there so you can link the results back to the
    original column.

2.  Sort, giving a list ordered by value.

3.  Extend each tuple in the list to add the subscript in the sorted
    version, rearranging the tuple so that the subscript from (1) is
    now the first item.

4.  This would be a good point to handle the ties.

5.  Sort again, putting the result back into the same order as the
    original column.

So (dropping step 4) this would be...

  la = []

  for subs, val in enumerate (column) :  #  column = the column as a list
    la += [(val, subs)]

  la.sort ()

  lb = []
  for rank, pair in enumerate (la) :
    val, subs = pair
    lb += [(subs, val, rank+1)]  #  conventionally, ranks start at '1'

  lb.sort ()

  #  lb is now a list of tuples (subscript, value, rank) in the same
  #  order as the original column
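
For what it's worth, the same steps can be wrapped up as a function.
This is only a sketch: the name simple_ranks and the sample column are
made up for illustration, and it leans on sorted() with a key function
rather than the explicit tuple lists above. Ties are not averaged here;
tied values simply get consecutive ranks in order of appearance.

```python
def simple_ranks(column):
    """Return 1-based ranks for column, in original order (ties not averaged)."""
    # Steps 1 and 2: subscripts ordered by the value they point at.
    order = sorted(range(len(column)), key=lambda subs: column[subs])

    # Steps 3 and 5 combined: write each rank straight back to its subscript,
    # so no second sort is needed.
    ranks = [0] * len(column)
    for rank, subs in enumerate(order):
        ranks[subs] = rank + 1  # conventionally, ranks start at 1
    return ranks


column = [30, 10, 20]  # made-up sample data
print(simple_ranks(column))  # [3, 1, 2]
```

Writing the rank back by subscript is what replaces the second sort:
the subscript already says where each rank belongs.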


To handle ties would need a little extra processing in that second
loop, giving...

  la = []

  for subs, val in enumerate (column) :  #  column = the column as a list
    la += [(val, subs)]

  la.sort ()

  lb = []  #  to hold final result
  lc = []  #  to hold val, subs tuples for current tied group
  ranksum = 0  #  sum of ranks for current tied group, for averaging

  for rank, pair in enumerate (la) :
    if len(lc) == 0 : # this should only happen on the first iteration
      lc = [pair]
      ranksum = (rank + 1)
    else :
      if lc [0] [0] == pair [0] :  #  same value so another tie
        lc += [pair]
        ranksum += (rank + 1)
      else :
        rankmean = float (ranksum) / len (lc) # note - float () forces
                                              # true division, not //

        #  Transfer tied group to result
        for val, subs in lc :
          lb += [(subs, val, rankmean)]

        #  Start new possibly tied group
        lc = [pair]
        ranksum = rank + 1

  #  Handle final tied group, if any
  #  (there always will be unless the original column was empty)
  if len(lc) > 0 :
    rankmean = float (ranksum) / len (lc)

    for val, subs in lc :
      lb += [(subs, val, rankmean)]

  # back to original column ordering
  lb.sort ()
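
A shorter way to get the same averaged ranks, as a rough sketch rather
than a drop-in replacement: itertools.groupby can collect each run of
equal values from the sorted list, which is a different technique from
the hand-rolled lc/ranksum bookkeeping above. The function name
average_ranks and the sample data are made up for illustration.

```python
from itertools import groupby


def average_ranks(column):
    """Return ranks in original order; tied values share their mean rank."""
    # (value, subscript) pairs, ordered by value.
    pairs = sorted((val, subs) for subs, val in enumerate(column))

    ranks = [0.0] * len(column)
    position = 0  # number of items ranked so far
    for _, group in groupby(pairs, key=lambda pair: pair[0]):
        tied = list(group)
        # Ranks position+1 .. position+len(tied) have this mean.
        rankmean = position + (len(tied) + 1) / 2.0
        for _, subs in tied:
            ranks[subs] = rankmean
        position += len(tied)
    return ranks


print(average_ranks([10, 20, 20, 5]))  # [2.0, 3.5, 3.5, 1.0]
```

The two 20s would have taken ranks 3 and 4, so each gets 3.5. Note that
groupby still compares values with '==', so it has the same float
caveat as the loop version.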


I imagine there are far better examples around, but this should be a
reasonable illustration of the principles involved.

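One issue is probably the "if lc [0] [0] == pair [0] :" line. If your
values are floats, this '==' is probably inappropriate - it is
oversensitive to float precision issues, so values that ought to tie
can compare as different. 3.9999999999999... should probably be
treated as equal to 4.0, for instance, which means comparing within
some small tolerance rather than exactly.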
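
One possible way to do that comparison, as a sketch only: math.isclose
(available in Python 3.5 and later) tests equality within a tolerance.
The helper name same_value and the tolerance values are assumptions
picked for illustration; sensible tolerances depend on your data.

```python
import math


def same_value(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """True if a and b are close enough to count as tied.

    The default tolerances are illustrative guesses, not recommendations.
    """
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


print(3.9999999999999 == 4.0)            # False
print(same_value(3.9999999999999, 4.0))  # True
print(same_value(3.9, 4.0))              # False
```

The tie test would then read "if same_value (lc [0] [0], pair [0]) :".
Be aware that this compares each value against the first member of the
group, so a long chain of nearly equal values may be split differently
than if neighbours were compared with each other.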


-- 
Steve Horne

steve at ninereeds dot fsnet dot co dot uk



