Trouble with quotes
Stephen Nelson-Smith
sanelson at gmail.com
Mon Mar 8 12:06:25 EST 2010
Hi,
I've written some (primitive) code to parse some Apache logfiles and
establish whether Apache has appended a session cookie to the end of
each line. We're finding that for some browsers it hasn't, and in that
case Apache doesn't append a "-" placeholder; it just omits the field.
It's working fine, but for an edge case:
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET
http://sekrit.com/node/175523 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:31:15 +0100] "GET
http://sekrit.com/node/175521 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:07 +0100] "GET
http://sekrit.com/node/175520 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:33 +0100] "GET
http://sekrit.com/node/175522 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:33:01 +0100] "GET
http://sekrit.com/node/175527 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:02:15 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"
If there are embedded double quotes (" ") inside a quoted field (the
referrer, in the lines above), my regex breaks.
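A minimal reproduction, applying only the quoted-field subpattern from the full regex to the referrer of one failing line. The names QUOTED_FIELD and sample are just for this illustration:

```python
import re

# The quoted-field subpattern used for the request, referrer and user agent.
QUOTED_FIELD = re.compile(r'\"([^"\\]*(?:\\.[^"\\]*)*)\"')

# Referrer from one of the failing lines, followed by the start of the user agent.
sample = '"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0'

m = QUOTED_FIELD.match(sample)
print(m.group(0))       # what the subpattern takes to be the whole field
print(sample[m.end()])  # the next character; the full pattern needs a space here
```

The subpattern stops at the first unescaped quote, so the field ends after "results/" and the next character is "3" rather than the space that the full pattern requires.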
Here's the code:
#!/usr/bin/env python
import re

pattern = (
    r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
    r'(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) '
    r'(?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) '
    r'(?P<Timestamp>(\[[^\]]+\])) '
    r'(?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<Status>(\S*)) (?P<Size>(\S*)) '
    r'(?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
    r'( )?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
)
regex = re.compile(pattern)

lines = 0
no_cookies = 0
unmatched = 0
for line in open('/home/stephen/scratch/test-data.txt'):
    lines += 1
    line = line.strip()
    match = regex.match(line)
    if match:
        data = match.groupdict()
        if data['SiteIntelligenceCookie'] == '':
            no_cookies += 1
    else:
        print "Couldn't match ", line
        unmatched += 1
print "I analysed %s lines." % (lines,)
print "There were %s lines with missing Site Intelligence cookies." % (no_cookies,)
print "I was unable to process %s lines." % (unmatched,)
How can I make the regex a bit more resilient so it doesn't break when
" " is embedded?
--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com