[Tutor] HTML Parsing

Mon Apr 21 17:35:54 CEST 2008

Stephen Nelson-Smith wrote:
> Hi,
>
> I want to write a little script that parses an apache mod_status page.
>
> I want it to return simply the number of page requests a second and
> the number of connections.
>
> It seems this is very complicated... I can do it in a shell one-liner:
>
> curl 10.1.2.201/server-status 2>&1 | grep -i request | grep dt | {
> IFS='> ' read _ rps _; IFS='> ' read _ currRequests _ _ _ _
> idleWorkers _; echo $rps $currRequests $idleWorkers   ; }
>
> But that's horrid.
>
> So is:
>
> $ eval `printf '<dt>3 requests currently being processed, 17 idle
> workers</dt>\n <dt>2.82 requests/sec - 28.1 kB/second - 10.0
> kB/request</dt>\n' | sed -nr '/<dt>/ { N;
> s@<dt>([0-9]*)[^,]*,([0-9]*).*<dt>([0-9.]*).*@workers=$((\1+\2));requests=\3@p;
> }'`
> $ echo "workers: $workers reqs/sec $requests"
> workers: 20 reqs/sec 2.82
>
> The page looks like this:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <html><head>
> <title>Apache Status</title>
> </head><body>
> <h1>Apache Server Status for 10.1.2.201</h1>
>
> <dl><dt>Server Version: Apache/2.0.46 (Red Hat)</dt>
> <dt>Server Built: Aug  1 2006 09:25:45
> </dt></dl><hr /><dl>
> <dt>Current Time: Monday, 21-Apr-2008 14:29:44 BST</dt>
> <dt>Restart Time: Monday, 21-Apr-2008 13:32:46 BST</dt>
> <dt>Parent Server Generation: 0</dt>
> <dt>Server uptime:  56 minutes 58 seconds</dt>
> <dt>Total accesses: 10661 - Total Traffic: 101.5 MB</dt>
> <dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>
> <dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>
> <dt>9 requests currently being processed, 11 idle workers</dt>
> </body></html>
>
> How can/should I do this?
>
> S.
>
>   
I don't know how you get the page HTML, but let's assume the lines are 
in a list named html (a list, because the code below slices it). It 
seems very straightforward to code:

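for lineno, line in enumerate(html):
  x = line.find("requests/sec")
  if x >= 0:
    # skip the leading "<dt>" (4 characters) and trim the trailing space
    no_requests_sec = line[4:x].strip()
    break
# keep searching from the line after the requests/sec match
for line in html[lineno+1:]:
  x = line.find("requests currently being processed")
  if x >= 0:
    no_connections = line[4:x].strip()
    break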

That makes certain assumptions about the file format: the matching text 
is exactly as shown, each of the two lines starts with <dt>, and the 
connections line comes somewhere after the requests/sec line. It does 
not assume that connections is the first line after requests/sec.
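
If the order of the two lines might vary, here is a minimal sketch of 
the same idea using regular expressions instead of fixed offsets. The 
names parse_status, RPS_RE and BUSY_RE are mine, not anything from 
Apache, and the patterns assume the wording in the sample page above:

```python
import re

# Patterns mirror the two <dt> lines in the sample page (an assumption
# about the format; other Apache versions may word these differently).
RPS_RE = re.compile(r"<dt>\s*(\d*\.?\d+) requests/sec")
BUSY_RE = re.compile(r"<dt>\s*(\d+) requests currently being processed")

def parse_status(lines):
    """Return (requests_per_sec, connections) from mod_status HTML lines.

    Either value is None if its line is not found. The two lines may
    appear in any order.
    """
    requests_per_sec = None
    connections = None
    for line in lines:
        m = RPS_RE.search(line)
        if m:
            requests_per_sec = float(m.group(1))
            continue
        m = BUSY_RE.search(line)
        if m:
            connections = int(m.group(1))
    return requests_per_sec, connections

sample = [
    "<dt>CPU Usage: u6.03 s2.15 cu0 cs0 - .239% CPU load</dt>",
    "<dt>3.12 requests/sec - 30.4 kB/second - 9.7 kB/request</dt>",
    "<dt>9 requests currently being processed, 11 idle workers</dt>",
]
print(parse_status(sample))  # (3.12, 9)
```

Any iterable of lines will do here, for example the page text split 
with splitlines(), since nothing is sliced.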

-- 
Bob Gailer
919-636-4239 Chapel Hill, NC