[Tutor] Making Regular Expressions readable
Stephen Nelson-Smith
sanelson at gmail.com
Mon Mar 8 17:12:35 CET 2010
Hi,
I've written this today:
#!/usr/bin/env python
import re
pattern = r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) (?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) (?P<Timestamp>(\[[^\]]+\])) (?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) (?P<Status>(\S*)) (?P<Size>(\S*)) (?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) (?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)( )?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
regex = re.compile(pattern)
lines = 0
no_cookies = 0
for line in open('/home/stephen/scratch/feb-100.txt'):
    lines += 1
    line = line.strip()
    match = regex.match(line)
    if match:
        data = match.groupdict()
        if data['SiteIntelligenceCookie'] == '':
            no_cookies += 1
    else:
        print "Couldn't match ", line
print "I analysed %s lines." % (lines,)
print "There were %s lines with missing Site Intelligence cookies." % (no_cookies,)
It works fine, but it looks pretty unreadable and unmaintainable to
anyone who hasn't spent all day writing regular expressions.
I remember reading about verbose regular expressions. Would these help?
How could I make the above more maintainable?
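For what it's worth, here is a rough sketch of how I imagine a verbose version might look, using re.VERBOSE. The helper names IP and QUOTED and the sample log line are my own inventions for illustration, and I have swapped the unnamed capturing groups for non-capturing ones, so only the named groups should be relied on. In verbose mode unescaped whitespace is ignored, so each literal space is written as [ ].

```python
import re

# Hypothetical building blocks (names are illustrative, not from any library).
IP = r'[0-9]{1,3}(?:\.[0-9]{1,3}){3}'      # dotted quad, no range checking
QUOTED = r'"[^"\\]*(?:\\.[^"\\]*)*"'       # double-quoted string, backslash escapes allowed

pattern = r'''
    ^(?P<ForwardedFor> - | %(ip)s (?: ,[ ] %(ip)s )* )      # "-" or a comma-separated IP list
    [ ] (?P<RemoteLogname> \S* )
    [ ] (?P<RemoteUser> \S* )
    [ ] (?P<Timestamp> \[ [^\]]+ \] )                       # [day/month/year:time zone]
    [ ] (?P<FirstLineOfRequest> (?: %(quoted)s )? )
    [ ] (?P<Status> \S* )
    [ ] (?P<Size> \S* )
    [ ] (?P<Referrer> (?: %(quoted)s )? )
    [ ] (?P<UserAgent> (?: %(quoted)s )? )
    [ ]? (?P<SiteIntelligenceCookie> (?: %(quoted)s )? )    # empty when the cookie is missing
''' % {'ip': IP, 'quoted': QUOTED}

regex = re.compile(pattern, re.VERBOSE)

# Made-up sample line, only to exercise the pattern.
sample = ('1.2.3.4, 5.6.7.8 - - [08/Mar/2010:17:12:35 +0100] '
          '"GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0" "cookie=abc"')
data = regex.match(sample).groupdict()
```

If that is roughly right, the rest of the script would stay the same, since it only reads the named groups through groupdict().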
S.
--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com