[Tutor] advice on regex matching for dates?
spir
denis.spir at free.fr
Thu Dec 11 23:38:52 CET 2008
Serdar Tumgoren wrote:
> Hey everyone,
>
> I was wondering if there is a way to use the datetime module to check for
> variations on a month name when performing a regex match?
>
> In the script below, I created a regex pattern that checks for dates in the
> following pattern: "August 31, 2007". If there's a match, I can then print
> the capture date and the line from which it was extracted.
>
> While it works in this isolated case, it struck me as not very flexible.
> What happens when I inevitably get data that has dates formatted in a
> different way? Do I have to create some type of library that contains
> variations on each month name (e.g. January, Jan., 01, 1...) and use that
> to parse each line?
>
> Or is there some way to use datetime to check for date patterns when using
> regex? Is there a "best practice" in this area that I'm unaware of?
>
> Apologies if this question has been answered elsewhere. I wasn't sure how to
> research this topic (beyond standard datetime docs), but I'd be happy to RTM
> if someone can point me to some resources.
>
> Any suggestions are welcome (including optimizations of the code below).
>
> Regards,
> Serdar
>
> #!/usr/bin/env python
>
> import re, sys
>
> sourcefile = open(sys.argv[1],'r')
>
> pattern = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')
>
> pattern2 = re.compile(r'Return to List')
>
> counter = 0
>
> for line in sourcefile:
>     x = pattern.search(line)
>     break_point = pattern2.match(line)
>
>     if x:
>         counter +=1
>         print "%s %d, %d <== %s" % ( x.group('month'), int(x.group('day')),
>                                      int(x.group('year')), line ),
>     elif break_point:
>         break
>
> print counter
> sourcefile.close()
I just found a simple but nice trick to make regexes more legible: use
substrings to represent sub-patterns. E.g. instead of:
p = re.compile(r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)\s(?P<day>\d{1,2}),\s(?P<year>\d{4})')
write first:
month = r'(?P<month>January|February|March|April|May|June|July|August|September|October|November|December)'
day = r'(?P<day>\d{1,2})'
year = r'(?P<year>\d{4})'
then:
p = re.compile( r"%s\s%s,\s%s" % (month,day,year) )
or even:
p = re.compile( r"%(month)s\s%(day)s,\s%(year)s" %
{'month':month,'day':day,'year':year} )
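
One payoff of splitting the pattern this way is that a single sub-pattern can
be changed without touching the rest. The sketch below is only an
illustration, not part of the original script: the names MONTH_NAMES,
month_alternative, DATE_PATTERN and find_date are made up here, and accepting
three-letter abbreviations ("Aug", "Aug.") is an assumption about what the
varied data might look like. It converts a match into a datetime.date, which
also rejects impossible dates.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]

def month_alternative(name):
    """Pattern for a full month name or its 3-letter abbreviation (assumed format)."""
    head, tail = name[:3], name[3:]
    if tail:
        # e.g. Aug, Aug. or August
        return r"%s(?:%s|\.)?" % (head, tail)
    # "May" has no longer form; only allow an optional trailing dot
    return r"%s\.?" % head

# Sub-patterns kept as separate strings, then combined by name.
month = r"(?P<month>%s)" % "|".join(month_alternative(n) for n in MONTH_NAMES)
day = r"(?P<day>\d{1,2})"
year = r"(?P<year>\d{4})"
DATE_PATTERN = re.compile(r"%(month)s\s%(day)s,\s%(year)s" %
                          {"month": month, "day": day, "year": year})

# The first three letters identify each English month uniquely.
MONTH_PREFIXES = [n[:3] for n in MONTH_NAMES]

def find_date(line):
    """Return the first date found in line as a datetime.date, or None."""
    m = DATE_PATTERN.search(line)
    if m is None:
        return None
    month_number = MONTH_PREFIXES.index(m.group("month")[:3]) + 1
    # date() raises ValueError for impossible dates such as February 31.
    return date(int(m.group("year")), month_number, int(m.group("day")))
```

Only the month sub-pattern differs from the version above; day, year and the
combining line are unchanged. Purely numeric formats (01, 1) would still need
a separate pattern.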
denis