[Tutor] how to sort the data inside the file.

Chris Fuller cfuller084 at thinkingplanet.net
Mon Dec 31 17:36:30 CET 2007


On Monday 31 December 2007 06:19, goldgod a wrote:
> hello all,
>             Please find the attached file. I want to sort the content
> of this file based on the "bytes" in descending order. How can i do
> it, any pointers to it.

This is a classic case for regular expressions: a powerful tool, but one 
that can be tricky.  I create an RE for a record, and then use the findall() 
function to get all of them in a list.  I'll walk you through it.

A note of caution: once you get the hang of REs, using them will leave you 
feeling giddy!

Note:  I load the whole file into memory and parse it in one piece.  If your 
log is up in the megabytes, it might make more sense to read it in line by 
line.
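
For a large log, the line-by-line variant mentioned above might look roughly 
like the sketch below.  It is written for Python 3 and assumes each record 
sits on a single line; the sample addresses and domains are made up, and 
read_records() is a hypothetical helper name, not something from the 
standard library.

```python
import io
import re

# Same record pattern as in this post, compiled once and reused for every line.
RECORD = re.compile(r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes')

def read_records(lines):
    """Yield one tuple of captured groups for each line that holds a record."""
    for line in lines:
        m = RECORD.search(line)
        if m:
            # groups('') substitutes '' for an optional group that was skipped
            yield m.groups('')

# Hypothetical sample data standing in for the real log file.
sample = io.StringIO("10-0-0-1 example.com 12 kbytes\n"
                     "10-0-0-2 example.org 3 Mbytes\n")
records = list(read_records(sample))
```

With a real file, an open file object can be passed in place of the StringIO, 
since iterating over a file yields one line at a time.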

All of this is well documented in the Python Library Reference.


The code:

import re

fd = open('test', 'r')
s = fd.read()
fd.close()

# separate into records (raw string, so the backslashes reach the re module as-is)
lin = re.findall(r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes', s)

# iterate through the records, and convert the text of the size to an integer
lout = []
for fields in lin:
   if   fields[3] == ' M':
      mul = 1000000
      #mul = 1024000
      #mul = 1048576

   elif fields[3] == ' k':
      mul = 1000
      #mul = 1024

   else:
      mul = 1

   lout.append( (fields[0], fields[1], int(fields[2])*mul) )

# now sort
lout.sort(lambda a,b: cmp(a[2], b[2]))


Most of the processing happens in this single line:
lin = re.findall(r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes', s)

Here is the regular expression
'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes'

From left to right:
\s*  The \s designates whitespace; the asterisk matches zero or more 
instances.

([^\s]+)  The parentheses designate a "group", which can be referenced later; 
this one captures the hyphen-delimited IP address.  Parentheses also group in 
the familiar order-of-operations manner, which we'll see in the last piece.  
The square brackets designate a set of valid characters.  The caret inside is 
an inversion: match anything that isn't in the set.  \s designates whitespace, 
so this matches one or more non-whitespace characters.

\s+  The plus matches one or more instances of whitespace.

([^\s]+)  The second referenced group, which is the domain name in your file.

\s+  matches the whitespace between fields

(\d+)  the third referenced ("matched") group, which is one or more decimal 
digits.

( [kM])?  Now the tricky bit.  The [kM] matches any character in the set, so 
either "k" or "M".  The space inside the group takes in the space preceding 
the "k" or "M", and the ? applies to the whole group, matching zero or one 
instance, which allows records that have only "bytes" to be matched.  With 
findall(), these will have an empty string returned for that group, and " k" 
or " M" if these were present.

bytes  This is pretty obvious.  Nothing funny going on here.

I hope that made sense.
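
To see the pieces working together, here is a small sketch (Python 3) that 
runs the pattern over three made-up records.  The sample text is an 
assumption, since the real attachment may be laid out differently.  Note that 
as written, the pattern only matches a plain "bytes" record when the digits 
are followed directly by "bytes", so the third sample line is written that 
way.

```python
import re

# Hypothetical sample; the real attachment may be laid out differently.
sample = ("10-0-0-1 example.com 12 kbytes\n"
          "10-0-0-2 example.org 3 Mbytes\n"
          "10-0-0-3 example.net 512bytes\n")

pattern = r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes'

# findall() returns one tuple per record, one string per group
found = re.findall(pattern, sample)

for rec in found:
    print(rec)
# ('10-0-0-1', 'example.com', '12', ' k')
# ('10-0-0-2', 'example.org', '3', ' M')
# ('10-0-0-3', 'example.net', '512', '')
```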

The other tricky bit is the sort.  In Python, lists can be sorted in place, 
and you can pass a function that determines the relative order of two 
elements.  The cmp() built-in is a shorthand that makes this easy.  What this 
line does is sort the list according to the value of the third element, the 
byte count.  Swap the a and b in the cmp() call to get a descending sort.
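
The cmp() built-in and the comparison-function argument to sort() belong to 
Python 2 and were removed in Python 3.  A rough Python 3 equivalent of the 
descending sort, using made-up records, might look like this; compare_desc() 
is a hypothetical name used only for illustration.

```python
from functools import cmp_to_key

# Hypothetical records: (address, domain, size in bytes).
lout = [('10-0-0-1', 'example.com', 12000),
        ('10-0-0-2', 'example.org', 3000000),
        ('10-0-0-3', 'example.net', 512)]

# Key-based sort: largest byte count first.
by_size = sorted(lout, key=lambda rec: rec[2], reverse=True)

# The comparison-function style still works through cmp_to_key;
# comparing b against a (instead of a against b) gives descending order.
def compare_desc(a, b):
    return (b[2] > a[2]) - (b[2] < a[2])

by_size_cmp = sorted(lout, key=cmp_to_key(compare_desc))
```

Both calls produce the same ordering; the key-based form is usually the 
simpler one to read.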


An interesting note:  I mismatched some parentheses at first, and ended up 
multiplying the string (repeating its digits) before converting it to int, 
rather than multiplying the int.  Python will happily convert a million-digit 
int for you, but damn it takes a long time!
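
The difference between the two parenthesizations can be seen with a tiny 
multiplier.  The values below are made up for illustration.

```python
size_text = '12'   # hypothetical captured digits
mul = 3            # small multiplier so the effect is easy to see

wrong = int(size_text * mul)   # repeats the string first: int('121212')
right = int(size_text) * mul   # converts first, then multiplies: 36
```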

Cheers

