[Tutor] re.findall parentheses problem

Evert Rol evert.rol at gmail.com
Tue Sep 14 18:56:11 CEST 2010


> I have a regex that matches dates in various formats.  I've tested the regex in a reliable testbed, and it seems to match what I want (dates in formats like "1 Jan 2010" and "January 1, 2010" and also "January 2008").  It's just that using re.findall with it is giving me weird output.  I'm using Python 2.6.5 here, and I've put in line breaks for clarity's sake:
> 
> >>> import re
> 
> >>> date_regex = re.compile(r"([0-3]?[0-9])?((\s*)|(\t*))((Jan\.?u?a?r?y?)|(Feb\.?r?u?a?r?y?)|(Mar\.?c?h?)|(Apr\.?i?l?)|(May)|(Jun[e.]?)|(Jul[y.]?)|(Aug\.?u?s?t?)|(Sep[t.]?\.?e?m?b?e?r?)|(Oct\.?o?b?e?r?)|(Nov\.?e?m?b?e?r?)|(Dec\.?e?m?b?e?r?))((\s*)|(\t*))(2?0?[0-3]?[0-9]\,?)?((\s*)|(\t*))(2?0?[01][0-9])")

This will also match '1 Janry 2010', because every letter after 'Jan' is optional on its own.
Not sure if it should?
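
A quick way to check the 'Janry' remark is to run the pattern from the post against that string. The stricter fragment at the end (strict_jan) is only a hypothetical illustration of one way to tighten a single month name; it is not part of the original regex.

```python
import re

# The pattern exactly as given in the original post.
date_regex = re.compile(r"([0-3]?[0-9])?((\s*)|(\t*))((Jan\.?u?a?r?y?)|(Feb\.?r?u?a?r?y?)|(Mar\.?c?h?)|(Apr\.?i?l?)|(May)|(Jun[e.]?)|(Jul[y.]?)|(Aug\.?u?s?t?)|(Sep[t.]?\.?e?m?b?e?r?)|(Oct\.?o?b?e?r?)|(Nov\.?e?m?b?e?r?)|(Dec\.?e?m?b?e?r?))((\s*)|(\t*))(2?0?[0-3]?[0-9]\,?)?((\s*)|(\t*))(2?0?[01][0-9])")

# Each of u, a, r, y is optional by itself, so 'Janry' gets through.
m = date_regex.search("1 Janry 2010")
print(m.group(0))  # 1 Janry 2010

# Hypothetical stricter fragment: 'Jan', 'Jan.' or 'January', nothing in between.
strict_jan = re.compile(r"\bJan(?:uary|\.)?(?!\w)")
print(strict_jan.search("1 Janry 2010"))             # None
print(strict_jan.search("1 January 2010").group(0))  # January
```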


<snip>two examples</snip>

> >>> test_output = re.findall(date_regex, "The date was January 1, 2008.  But it was not January 2, 2008.")
> 
> >>> print test_output
> [('', ' ', ' ', '', 'January', 'January', '', '', '', '', '', '', '', '', '', '', '', ' ', ' ', '', '1,', ' ', ' ', '', '2008'), ('', ' ', ' ', '', 'January', 'January', '', '', '', '', '', '', '', '', '', '', '', ' ', ' ', '', '2,', ' ', ' ', '', '2008')]
> 
> A friend says: " I think that the problem is that every time that you have a parenthesis you get an output. Maybe there is a way to suppress this."
> 
> My friend's explanation speaks to the empties, but maybe not to the two Januaries.  Either way, what I want is for re.findall, or some other re method that perhaps I haven't properly explored, to return the matches and just the matches.
> 
> I've read the documentation, googled various permutations etc, and I can't figure it out.  Any help much appreciated.

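The docs say: "If one or more groups are present in the pattern, return a list of groups". So your friend is right: every pair of parentheses is a capturing group, and findall returns one entry per group, with an empty string for each group that did not take part in the match. The two Januaries come from nesting: the parentheses around the whole month alternation capture 'January', and so do the inner parentheses around the Jan alternative.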
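
The findall behaviour quoted from the docs is easier to see on a much smaller pattern. The patterns and the sample text below are made up for illustration and are not taken from the original regex.

```python
import re

text = "Jan 2008, Feb 2009"

# No groups: findall returns the full matches.
no_groups = re.findall(r"[A-Z][a-z]+ \d{4}", text)
print(no_groups)   # ['Jan 2008', 'Feb 2009']

# One group: findall returns only what that group matched.
one_group = re.findall(r"([A-Z][a-z]+) \d{4}", text)
print(one_group)   # ['Jan', 'Feb']

# Several groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r"([A-Z][a-z]+) (\d{4})", text)
print(two_groups)  # [('Jan', '2008'), ('Feb', '2009')]

# Nested groups: the outer and the inner group both report the month,
# and the alternative that did not match shows up as ''.
nested = re.findall(r"((Jan)|(Feb)) \d{4}", text)
print(nested)      # [('Jan', 'Jan', ''), ('Feb', '', 'Feb')]
```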

In fact, your last example shows exactly this: findall returned a list of two tuples. Each list element corresponds to one of your two date matches, and each tuple holds the individual group matches for that date.
You could solve this by wrapping the entire regex in one more group (so r"(([0-3 .... [0-9]))" ; I would even use a named group), and then picking out the first tuple element of each list element. With the extra outer group, findall returns:
[(' January 1, 2008', '', ' ', ' ', '', 'January', 'January', '', '', '', '', '', '', '', '', '', '', '', ' ', ' ', '', '1,', ' ', ' ', '', '2008'), (' January 2, 2008', '', ' ', ' ', '', 'January', 'January', '', '', '', '', '', '', '', '', '', '', '', ' ', ' ', '', '2,', ' ', ' ', '', '2008')]
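
A minimal sketch of that suggestion, assuming the pattern from the original post is kept unchanged. The group name "date" and the variable names are my own choices. The second option, re.finditer with group(0), is an alternative I am adding that needs no extra group at all.

```python
import re

# The pattern from the original post, kept as a plain string.
pattern = r"([0-3]?[0-9])?((\s*)|(\t*))((Jan\.?u?a?r?y?)|(Feb\.?r?u?a?r?y?)|(Mar\.?c?h?)|(Apr\.?i?l?)|(May)|(Jun[e.]?)|(Jul[y.]?)|(Aug\.?u?s?t?)|(Sep[t.]?\.?e?m?b?e?r?)|(Oct\.?o?b?e?r?)|(Nov\.?e?m?b?e?r?)|(Dec\.?e?m?b?e?r?))((\s*)|(\t*))(2?0?[0-3]?[0-9]\,?)?((\s*)|(\t*))(2?0?[01][0-9])"
text = "The date was January 1, 2008.  But it was not January 2, 2008."

# Option 1: wrap everything in one named outer group; it becomes
# group 1, so it is the first element of every tuple from findall.
wrapped = re.compile("(?P<date>" + pattern + ")")
dates = [groups[0] for groups in wrapped.findall(text)]
print(dates)   # [' January 1, 2008', ' January 2, 2008']

# Option 2: finditer yields match objects; group(0) is the whole match.
dates2 = [m.group(0) for m in re.finditer(pattern, text)]
print(dates2)  # [' January 1, 2008', ' January 2, 2008']

# The leading space comes from the optional whitespace at the start
# of the pattern; strip() removes it.
print([d.strip() for d in dates2])  # ['January 1, 2008', 'January 2, 2008']
```

Yet another route would be to make every group non-capturing with (?:...), since findall returns the full matches when the pattern has no capturing groups.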


Hth,

  Evert


