[Tutor] how to sort the data inside the file.
Chris Fuller
cfuller084 at thinkingplanet.net
Mon Dec 31 17:36:30 CET 2007
On Monday 31 December 2007 06:19, goldgod a wrote:
> hello all,
> Please find the attached file. I want to sort the content
> of this file based on the "bytes" in descending order. How can i do
> it, any pointers to it.
This is a classic case for regular expressions: a powerful tool, but one
that can be tricky. I create an RE for a record, and then use the findall()
function to get all of them in a list. I'll walk you through it.
A note of caution: once you get the hang of REs, using them will leave you
feeling giddy!
Note: I load the whole file into memory and parse it in one piece. If your
log is up in the megabytes, it might make more sense to read it in line by
line.
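For what it's worth, here is a rough sketch of the line-by-line variant. It
reuses the same pattern as the code below; the function name parse_lines, the
MULTIPLIERS table and the sample lines are made up for illustration, since I
can't show the attached file here.

```python
import re

# Same pattern as in the main code, compiled once and applied to each line.
RECORD = re.compile(r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes')
MULTIPLIERS = {' k': 1000, ' M': 1000000}   # decimal multipliers, as in the main code

def parse_lines(lines):
    """Return a list of (first, second, size_in_bytes), one per matching line."""
    records = []
    for line in lines:
        m = RECORD.search(line)
        if m is None:
            continue  # skip lines that are not records
        first, second, digits, unit = m.groups()
        # unit is None when the optional " k"/" M" group did not take part
        records.append((first, second, int(digits) * MULTIPLIERS.get(unit, 1)))
    return records

# Hypothetical sample lines, shaped to fit the pattern as written (note that
# a record without a prefix needs the digits directly followed by "bytes").
# With a real log you would pass the open file object instead of a list.
sample = [
    "10-0-0-1 example.org 512bytes\n",
    "10-0-0-2 example.net 3 kbytes\n",
]
records = parse_lines(sample)
print(records)
```

Iterating over a file object yields one line at a time, so only the parsed
tuples stay in memory rather than the whole text of the log.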
All of this is well documented in the Python Library Reference.
The code:
    import re

    fd = open('test', 'r')
    s = fd.read()
    fd.close()

    # separate into records
    lin = re.findall('\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes', s)

    # iterate through the records, and convert the text of the size to an integer
    lout = []
    for fields in lin:
        if fields[3] == ' M':
            mul = 1000000
            #mul = 1024000
            #mul = 1048576
        elif fields[3] == ' k':
            mul = 1000
            #mul = 1024
        else:
            mul = 1
        lout.append( (fields[0], fields[1], int(fields[2])*mul) )

    # now sort
    lout.sort(lambda a,b: cmp(a[2], b[2]))
Most of the processing happens in this single line:
lin = re.findall('\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes', s)
Here is the regular expression
'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes'
From left to right:
\s* The \s designates whitespace, the asterisk matches zero or multiple
instances.
([^\s]+) The parentheses designate a "group", which can be referenced later,
and represents the hyphen-delimited IP address. It also works in the
familiar order-of-operations manner. We'll see this in the last piece. The
square brackets designate a set of valid characters. The caret inside is an
inversion: match anything that isn't in the set. \s designates whitespace,
so this matches one or more characters of not-whitespace stuff.
\s+ The plus matches one or multiple instances of whitespace.
([^\s]+) The second referenced group, which is the domain name in your file.
\s+ matches the whitespace between fields
(\d+) the third referenced ("matched") group, which is one or more decimal
digits.
( [kM])? Now the tricky bit. The [kM] matches any character in the set, so
either "k" or "M". The space inside the group matches the space preceding
the "k" or "M", and the ? applies to the whole group and matches zero or one
instance, which allows records that have only "bytes" to be matched. With
findall(), these will have an empty string returned for that group, and
" k" or " M" if these were present.
bytes This is pretty obvious. Nothing funny going on here.
I hope that made sense.
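If it helps, here is a small sketch of what findall() hands back for that
pattern. The three records are invented (hypothetical addresses and domains,
laid out to fit the pattern as written), so treat the data as an assumption
rather than a copy of your file.

```python
import re

pattern = r'\s*([^\s]+)\s+([^\s]+)\s+(\d+)( [kM])?bytes'

# Hypothetical records, one per line.
sample = (
    "10-0-0-1 example.org 512bytes\n"
    "10-0-0-2 example.net 3 kbytes\n"
    "10-0-0-3 example.com 2 Mbytes\n"
)

# One tuple per record, one string per group.
records = re.findall(pattern, sample)
for fields in records:
    print(fields)

# ('10-0-0-1', 'example.org', '512', '')
# ('10-0-0-2', 'example.net', '3', ' k')
# ('10-0-0-3', 'example.com', '2', ' M')
```

Note the fourth item of the first tuple: findall() reports the optional group
as an empty string when it takes no part in the match, so the final else
branch in the loop is what handles plain "bytes" records.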
The other tricky bit is the sort. In Python, lists can be sorted in place,
and you can pass a function that determines the relative order of two
elements. The cmp() built-in is a shorthand that makes this easy. This line
sorts the list in ascending order of the third element. Swap the a and b to
get the descending sort you asked for.
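An alternative worth knowing: since Python 2.4, sort() also takes key and
reverse arguments, and in Python 3 the comparison-function form and cmp()
itself are gone, so this is the spelling that keeps working. A minimal sketch,
with made-up records standing in for the parsed list:

```python
from operator import itemgetter

# Hypothetical parsed records: (first field, second field, size in bytes)
lout = [
    ('10-0-0-1', 'example.org', 512),
    ('10-0-0-3', 'example.com', 2000000),
    ('10-0-0-2', 'example.net', 3000),
]

# Sort in place on the third element, largest first.
lout.sort(key=itemgetter(2), reverse=True)

for record in lout:
    print(record)
```

The key function is called once per element rather than once per comparison,
which also tends to be faster on long lists.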
An interesting note: I mismatched some parentheses at first, and ended up
multiplying the string that was about to be converted to int, rather than
the int itself. Python will happily convert a million-digit int for you, but
damn, it takes a long time!
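To make the slip concrete, here is a tiny sketch of the two parenthesizations.
The variable names are just for illustration, and the multiplier is kept at
1000 so it runs instantly.

```python
digits = '3'
mul = 1000

# Misplaced parenthesis: the string is repeated 1000 times, then converted,
# giving a 1000-digit number.
wrong = int(digits * mul)

# Intended: convert first, then multiply.
right = int(digits) * mul

print(len(str(wrong)))  # 1000
print(right)            # 3000
```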
Cheers