[Tutor] Logfile Manipulation

Martin A. Brown martin at linux-ip.net
Mon Nov 9 10:38:02 CET 2009



Hello,

 : An apache logfile entry looks like this:
 : 
 : 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
 : /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
 : HTTP/1.1" 200 50 "-" "-"
 : 
 : I want to extract 24 hrs of data based on timestamps like this:
 : 
 : [04/Nov/2009:04:02:10 +0000]
 : 
 : I also need to do some filtering (e.g. I actually don't want 
 : anything with service.php), and I also have to do some 
 : substitutions - that's trivial, other than not knowing the 
 : optimum place to do it, i.e. should I do multiple passes?

I wouldn't.  With multiple passes, you pay the decompression CPU, 
line-matching CPU and I/O costs several times over.  I'd do it all 
at once.


 : Or should I try to do all the work at once, only viewing each 
 : line once?  Also what about reading from compressed files?  The 
 : data comes in as 6 gzipped logfiles which expand to 6G in total.

There are standard modules for handling compressed data (gzip and 
bz2).  I'd imagine that the other pythonistas on this list will give 
you more detailed (and probably better) advice, but here's a sample 
of how to use the gzip module, skip the lines containing 
'/service.php', and extract an epoch timestamp from the datestamp 
field(s).  You would pass the filenames to operate on as arguments 
to this script.

  See optparse if you want fancier capabilities for option handling.
  
  See re if you want to match multiple patterns to ignore (see the
    sketch after this list).
  
  See time (and datetime) for mangling time and date strings.  Be
    forewarned, time zone issues will probably be a massive headache.
    Many others have been here before [0].
  
  Look up itertools (and be prepared for some study) if you want the
    lines from your different servers' log files merged in time order
    in the output; a sketch of that also follows this list.
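
To illustrate the re and merging pointers, here's a second toy.  
The ignore patterns are invented, and I reach for heapq.merge (new 
in python 2.6) rather than itertools proper, since it already does 
the lazy merging of sorted streams:

  import gzip, heapq, re, sys, time
  
  # -- hypothetical ignore list; extend the alternation as needed
  ignore = re.compile(r'/service\.php|/favicon\.ico')
  
  def stamped(filename):
    # -- yield (epoch, line) pairs from one gzipped logfile
    for line in gzip.open(filename):
      if ignore.search(line):
        continue
      fields = line.split()
      d = int( time.mktime( time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S") ) )
      yield d, line
  
  # -- each server writes its own log in time order, so every stream is
  #    already sorted; heapq.merge collates them lazily by epoch
  for d, line in heapq.merge(*[ stamped(f) for f in sys.argv[1:] ]):
    print d, line,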

Note that these snippets are toys and make no attempt to trap 
(try/except) any error conditions.  
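If you wanted a minimal guard, something along these lines (the 
skeleton is mine; hang your per-line work inside the loop) would 
warn and move on:

  import sys, gzip
  
  for filename in sys.argv[1:]:
    try:
      f = gzip.open(filename)
      for line in f:
        pass                  # per-line work from the samples goes here
    except IOError, e:
      # -- a missing file fails at open(); a non-gzip file fails at read()
      print >>sys.stderr, "skipping %s: %s" % (filename, e)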

If you are looking for a weblog analytics package once you have 
reassembled the files into a whole, perhaps you could just start 
there (e.g. webalizer and analog are two old-school packages that 
come to mind for processing logs produced in a Common Log Format).

I will echo Alan Gauld's sentiments of a few minutes ago and note 
that there are probably many different Apache log parsers out there 
that can accomplish what you need.  On the other hand, you may be 
using this as an excuse to learn a bit of python.

Good luck,

-Martin

 [0] http://seehuhn.de/blog/52

Sample:

  import sys, time, gzip
  
  files = sys.argv[1:]
  
  for filename in files:   # 'filename', not 'file', to avoid shadowing the builtin
    print >>sys.stderr, "About to open %s" % filename
    f = gzip.open(filename)
    for line in f:
      # -- drop every request for /service.php ('in' is sturdier than
      #    .find() > 0, which would miss a match at position 0)
      if '/service.php' in line:
        continue
      fields = line.split()
      # -- fields[3] looks like "[04/Nov/2009:04:02:10"
      # -- ignoring time zone (fields[4]); you are logging in UTC, right?
      d = int( time.mktime( time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S") ) )
      print d, line,   # trailing comma: the line keeps its own newline
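
You would invoke it with the gzipped filenames, e.g.:

  python extract.py access.log.*.gz > combined.out

(extract.py being whatever you named the script).  And for the 
24-hour part of the question: assuming you know where the window 
starts, compare the epoch against a range in place of the sample's 
final print, something like:

  # -- hypothetical window: midnight 04/Nov/2009 local time, plus 24 hours
  start = int( time.mktime( time.strptime("04/Nov/2009:00:00:00",
                                          "%d/%b/%Y:%H:%M:%S") ) )
  end = start + 24 * 60 * 60
  if start <= d < end:
    print d, line,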


-- 
Martin A. Brown
http://linux-ip.net/

