Trouble with quotes

Stephen Nelson-Smith sanelson at gmail.com
Mon Mar 8 12:06:25 EST 2010


Hi,

I've written some (primitive) code to parse some Apache logfiles and
establish whether Apache has appended a session cookie to the end.  We're
finding that some browsers don't send one, and in that case Apache
doesn't append a "-" placeholder; it just omits the field.

It's working fine, but for an edge case:

Couldn't match  192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET
http://sekrit.com/node/175523 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match  192.168.1.107 - - [24/Feb/2010:20:31:15 +0100] "GET
http://sekrit.com/node/175521 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match  192.168.1.107 - - [24/Feb/2010:20:32:07 +0100] "GET
http://sekrit.com/node/175520 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match  192.168.1.107 - - [24/Feb/2010:20:32:33 +0100] "GET
http://sekrit.com/node/175522 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match  192.168.1.107 - - [24/Feb/2010:20:33:01 +0100] "GET
http://sekrit.com/node/175527 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match  192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"
Couldn't match  192.168.1.107 - - [25/Feb/2010:17:02:15 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"

If there are double quotes embedded inside a quoted field (here the
referrer), my regex breaks.

Here's the code:

#!/usr/bin/env python
import re

# One regex, split across adjacent raw strings so it survives line wrapping.
pattern = (
    r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
    r'(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) '
    r'(?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) '
    r'(?P<Timestamp>(\[[^\]]+\])) '
    r'(?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<Status>(\S*)) (?P<Size>(\S*)) '
    r'(?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)( )?'
    r'(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
)

regex = re.compile(pattern)

lines = 0
no_cookies = 0
unmatched = 0

for line in open('/home/stephen/scratch/test-data.txt'):
  lines +=1
  line = line.strip()
  match = regex.match(line)

  if match:
    data = match.groupdict()
    if data['SiteIntelligenceCookie'] == '':
      no_cookies +=1
  else:
    print "Couldn't match ", line
    unmatched +=1

print "I analysed %s lines." % (lines,)
print ("There were %s lines with missing Site Intelligence cookies."
       % (no_cookies,))
print "I was unable to process %s lines." % (unmatched,)
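
The failure described above can be reproduced in isolation. As a minimal
sketch, this applies only the quoted-field part of the pattern to one of
the referrers from the unmatched lines; the names quoted_field and
referrer exist only for this illustration.

```python
import re

# Just the quoted-field part of the pattern above, on its own.
quoted_field = re.compile(r'\"([^"\\]*(?:\\.[^"\\]*)*)\"')

# Referrer field copied from the first unmatched line.
referrer = '"http://sekrit.com/search/results/"3%2B2%20course""'

m = quoted_field.match(referrer)
print(m.group(0))          # what the subpattern takes as the whole field
print(referrer[m.end():])  # what is left over for the rest of the pattern
```

The match stops at the first embedded quote, and the leftover text does
not start with a space, so the full pattern cannot continue from there.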

How can I make the regex a bit more resilient so it doesn't break when
" " is embedded?

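One possible direction, as a rough sketch rather than a tested fix: treat
a quote as closing a field only when it is followed by a space or the end
of the line.  This assumes an embedded quote is never itself followed by
a space, which holds for the lines above but may not hold in general.
The names IP, quoted and tolerant are hypothetical, and unlike the
pattern above the quoted fields are not optional here, so a missing
cookie shows up as None rather than ''.

```python
import re

IP = r'[0-9]{1,3}(?:\.[0-9]{1,3}){3}'

def quoted(name):
    # Hypothetical helper: lazy match up to a quote that is followed by
    # a space or end of line, so quotes inside the field are tolerated.
    return r'"(?P<%s>.*?)"(?= |$)' % name

fields = [
    r'(?P<ForwardedFor>-|%s(?:, %s)*)' % (IP, IP),
    r'(?P<RemoteLogname>\S*)',
    r'(?P<RemoteUser>\S*)',
    r'(?P<Timestamp>\[[^\]]+\])',
    quoted('FirstLineOfRequest'),
    r'(?P<Status>\S*)',
    r'(?P<Size>\S*)',
    quoted('Referrer'),
    quoted('UserAgent'),
]

# The cookie field, with its leading space, is optional as a whole.
tolerant = re.compile(
    '^' + ' '.join(fields)
    + '(?: ' + quoted('SiteIntelligenceCookie') + ')?$'
)

# One of the unmatched lines from above, rejoined into a single line.
line = (
    '192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] '
    '"GET http://sekrit.com/node/175523 HTTP/1.1" 200 - '
    '"http://sekrit.com/search/results/"3%2B2%20course"" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"'
)

m = tolerant.match(line)
print(m.group('Referrer'))
print(m.group('SiteIntelligenceCookie') is None)
```
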
-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com


