[Tutor] advice on regex matching for dates?

spir denis.spir at free.fr
Thu Dec 11 23:38:52 CET 2008


Serdar Tumgoren wrote:
> Hey everyone,
> 
> I was wondering if there is a way to use the datetime module to check for
> variations on a month name when performing a regex match?
> 
> In the script below, I created a regex pattern that checks for dates in the
> following pattern:  "August 31, 2007". If there's a match, I can then print
> the capture date and the line from which it was extracted.
> 
> While it works in this isolated case, it struck me as not very flexible.
> What happens when I inevitably get data that has dates formatted in a
> different way? Do I have to create some type of library that contains
> variations on each month name (i.e. - January, Jan., 01, 1...) and use that
> to parse each line?
> 
> Or is there some way to use datetime to check for date patterns when using
> regex? Is there a "best practice" in this area that I'm unaware of?
> 
> Apologies if this question has been answered elsewhere. I wasn't sure how to
> research this topic (beyond standard datetime docs), but I'd be happy to RTM
> if someone can point me to some resources.
> 
> Any suggestions are welcome (including optimizations of the code below).
> 
> Regards,
> Serdar
> 
> #!/usr/bin/env python
> 
> import re, sys
> 
> sourcefile = open(sys.argv[1],'r')
> 
> pattern = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')
> 
> pattern2 = re.compile(r'Return to List')
> 
> counter = 0
> 
> for line in sourcefile:
>     x = pattern.search(line)
>     break_point = pattern2.match(line)
> 
>     if x:
>         counter +=1
>         print "%s %d, %d <== %s" % ( x.group('month'), int(x.group('day')),
> int(x.group('year')), line ),
>     elif break_point:
>         break
> 
> print counter
> sourcefile.close()

I just found a simple but nice trick to make regexes more legible: use
substrings to represent sub-patterns. E.g. instead of:

p = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')

write first:

month = r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)'
day = r'(?P<day>\d{1,2})'
year = r'(?P<year>\d{4})'

then:
p = re.compile( r"%s\s%s,\s%s" % (month,day,year) )
or even:
p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" % 
{'month':month,'day':day,'year':year} )
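
Put together, the idea might look like the sketch below. It is only an
illustration: the MONTH_NAMES list, the find_date() helper and the sample
lines are names I made up for the example, not something from the original
script. It builds the month alternation with '|'.join so the names are
listed only once, and it should run unchanged on Python 2 and 3.

```python
import re

# Hypothetical helper list: the month alternatives, written out only once.
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]

# Sub-patterns, each wrapped in its own named group.
month = r'(?P<month>%s)' % '|'.join(MONTH_NAMES)
day = r'(?P<day>\d{1,2})'
year = r'(?P<year>\d{4})'

# Compose the full pattern from the named pieces.
date_pattern = re.compile(r'%(month)s\s%(day)s,\s%(year)s'
                          % {'month': month, 'day': day, 'year': year})


def find_date(line):
    """Return (month, day, year) for the first date in line, or None."""
    m = date_pattern.search(line)
    if m is None:
        return None
    return m.group('month'), int(m.group('day')), int(m.group('year'))


# Made-up sample lines, for illustration only.
print(find_date("Filed on August 31, 2007 in district court"))
print(find_date("no date on this line"))
```

Note that this only covers full month names, like the original pattern.
Abbreviations such as "Jan." would need more alternatives in the month
sub-pattern, but with this layout that change stays in one place.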

denis



