[Tutor] Logfile Manipulation
Martin A. Brown
martin at linux-ip.net
Mon Nov 9 10:38:02 CET 2009
Hello,
: An apache logfile entry looks like this:
:
: 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
: /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
: HTTP/1.1" 200 50 "-" "-"
:
: I want to extract 24 hrs of data based timestamps like this:
:
: [04/Nov/2009:04:02:10 +0000]
:
: I also need to do some filtering (eg I actually don't want
: anything with service.php), and I also have to do some
: substitutions - that's trivial other than not knowing the optimum
: place to do it? IE should I do multiple passes?
I wouldn't. With multiple passes you pay for decompression, line
matching and I/O several times over. I'd do it all at once.
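To illustrate the single-pass idea, here is a minimal sketch in Python 3 that filters and rewrites each line as it goes by. The substitution rule (masking the last octet of the client address) and the second sample line are made up for the example; substitute whatever rewriting you actually need.

```python
import re

# Hypothetical substitution, for illustration only: mask the last
# octet of the client IP address at the start of the line.
LAST_OCTET = re.compile(r"^(\d+\.\d+\.\d+)\.\d+")

def process(lines, skip="/service.php"):
    """Yield each wanted line exactly once, already filtered and rewritten."""
    for line in lines:
        if skip in line:
            continue                      # filtering step
        yield LAST_OCTET.sub(r"\1.0", line, count=1)   # substitution step

# Made-up sample data in the same shape as the entry quoted above.
sample = [
    '89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET /service.php?s=nav HTTP/1.1" 200 50 "-" "-"\n',
    '89.151.119.196 - - [04/Nov/2009:04:02:11 +0000] "GET /index.html HTTP/1.1" 200 50 "-" "-"\n',
]
result = list(process(sample))
```

Because process() is a generator, it can be fed an open file object directly and never holds more than one line in memory.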
: Or should I try to do all the work at once, only viewing each
: line once? Also what about reading from compressed files? The
: data comes in as 6 gzipped logfiles which expand to 6G in total.
There are standard modules for handling compressed data (gzip and
bz2). I'd imagine that the other pythonistas on this list will give
you more detailed (and probably better) advice, but here's a sample
of how to use the gzip module and how to skip the lines containing
the '/service.php' string, and to extract an epoch timestamp from
the datestamp field(s). You would pass the filenames to operate on
as arguments to this script.
See optparse if you want fancier capabilities for option handling.
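A rough sketch of what optparse handling could look like, written for Python 3. The option names (-x/--exclude, -d/--day) are hypothetical and only there to show the mechanics; the positional arguments are still the log filenames.

```python
from optparse import OptionParser

def parse_cli(argv):
    """Parse a list of command-line words; returns (options, filenames)."""
    parser = OptionParser(usage="%prog [options] LOGFILE.gz ...")
    # Hypothetical options, chosen only for this illustration.
    parser.add_option("-x", "--exclude", action="append", dest="exclude",
                      help="skip lines containing this string (repeatable)")
    parser.add_option("-d", "--day", dest="day",
                      help="day to extract, e.g. 04/Nov/2009")
    options, filenames = parser.parse_args(argv)
    if options.exclude is None:           # option never given
        options.exclude = []
    return options, filenames

options, filenames = parse_cli(
    ["-x", "/service.php", "--day", "04/Nov/2009", "a.gz", "b.gz"])
```

In a real script you would call parse_cli(sys.argv[1:]) instead of passing a literal list.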
See re if you want to match multiple patterns to ignore.
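For example, several ignore patterns can be folded into one compiled regular expression so each line is still scanned only once. This is a sketch in Python 3; only '/service.php' comes from the question, the other patterns are invented for illustration.

```python
import re

# One alternation instead of several separate tests.  Only
# /service.php is from the original question; the rest are made up.
IGNORE = re.compile(r"/service\.php|/favicon\.ico|\.(?:css|js)\b")

def wanted(line):
    """True if the line matches none of the ignore patterns."""
    return IGNORE.search(line) is None
```

Inside the loop this replaces the single substring test: skip the line when wanted(line) is false.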
See time (and datetime) for mangling time and date strings. Be
forewarned, time zone issues will probably be a massive headache.
Many others have been here before [0].
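As one possible way to sidestep part of that headache, the sketch below (Python 3) parses the bracketed timestamp together with its offset into a timezone-aware datetime, then tests whether it falls inside a 24-hour window. It assumes the default C/English locale (because of %b) and that the first "[" on the line opens the timestamp; the function names are mine.

```python
from datetime import datetime, timedelta, timezone

STAMP = "%d/%b/%Y:%H:%M:%S %z"    # e.g. 04/Nov/2009:04:02:10 +0000

def parse_stamp(line):
    """Return an aware datetime built from the [...] field of a log line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], STAMP)

def in_window(line, start, hours=24):
    """True if the line's timestamp lies in [start, start + hours)."""
    when = parse_stamp(line)
    return start <= when < start + timedelta(hours=hours)

line = ('89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] '
        '"GET / HTTP/1.1" 200 50 "-" "-"')
day_start = datetime(2009, 11, 4, tzinfo=timezone.utc)
keep = in_window(line, day_start)
```

Since the datetimes are aware, a line logged with a different offset (say +0100) is compared correctly against the same window.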
Look up itertools (and be prepared for some study) if you want the
lines from your different servers' log files merged into a single
time-ordered output.
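One way to get a merged, time-ordered stream is sketched below in Python 3. Note that it uses heapq.merge from the standard library rather than itertools, and it assumes each input is already in timestamp order on its own; the server names and sample lines are made up.

```python
import heapq
from datetime import datetime

def stamp(line):
    """Sort key: the aware datetime inside the first [...] on the line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*streams):
    """Lazily merge already-sorted iterables of log lines by timestamp."""
    return heapq.merge(*streams, key=stamp)

# Made-up sample data standing in for two servers' logs.
server_a = [
    'a.example - - [04/Nov/2009:04:02:10 +0000] "GET /one HTTP/1.1" 200 50 "-" "-"\n',
    'a.example - - [04/Nov/2009:04:02:30 +0000] "GET /three HTTP/1.1" 200 50 "-" "-"\n',
]
server_b = [
    'b.example - - [04/Nov/2009:04:02:20 +0000] "GET /two HTTP/1.1" 200 50 "-" "-"\n',
]
ordered = list(merge_logs(server_a, server_b))
```

The streams can just as well be open files, e.g. objects returned by gzip.open(name, "rt"), so nothing has to be loaded into memory up front.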
Note that the snippet below is a toy and makes no attempt to trap
(try/except) any error conditions.
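If you do want some error trapping, a minimal sketch (Python 3) might look like the following. It catches only the two failures I would expect from a malformed line, and it uses calendar.timegm, which treats the parsed time as UTC, in place of time.mktime, which treats it as local time; that swap is my choice, not part of the sample.

```python
import calendar
import time

def epoch_or_none(line):
    """Return the epoch timestamp of a log line, or None if it won't parse."""
    try:
        fields = line.split()
        parsed = time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S")
    except (IndexError, ValueError):
        # IndexError: too few fields; ValueError: timestamp in wrong format.
        return None
    return calendar.timegm(parsed)        # assumes the log is written in UTC

good = '89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET / HTTP/1.1" 200 50 "-" "-"'
bad = "this is not a log line"
```

A caller can then count or report the lines that come back as None instead of crashing halfway through 6G of data.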
If you are looking for a weblog analytics package once you have
reassembled the files into a whole, perhaps you could just start
there (webalizer and analog are two old-school packages that come
to mind for processing logs that have been produced in Common
Log Format).
I will echo Alan Gauld's sentiments of a few minutes ago and note
that there are probably many different Apache log parsers out
there which can accomplish what you hope to accomplish. On the
other hand, you may be using this as an excuse to learn a bit of
Python.
Good luck,
-Martin
[0] http://seehuhn.de/blog/52
Sample:
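  # Python 2 syntax (print statement)
  import sys, time, gzip

  files = sys.argv[1:]

  for file in files:
      print >>sys.stderr, "About to open %s" % ( file )
      f = gzip.open( file )
      for line in f:
          # -- skip any request that mentions /service.php
          if '/service.php' in line:
              continue
          fields = line.split()
          # -- ignoring time zone; you are logging in UTC, right?
          # -- (time.mktime() interprets the parsed time as local time)
          # tz = fields[4]
          d = int( time.mktime( time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S") ) )
          # -- trailing comma: the line already ends with a newline
          print d, line,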
--
Martin A. Brown
http://linux-ip.net/