pattern matching

Thu Feb 24 07:29:46 EST 2011

On Feb 23, 9:11 pm, monkeys paw <mon... at joemoney.net> wrote:
> if I have a string such as '<td>01/12/2011</td>' and i want
> to reformat it as '20110112', how do i pull out the components
> of the string and reformat them into a YYYYDDMM format?
>
> I have:
>
> import re
>
> test = re.compile('\d\d\/')
> f = open('test.html')  # This file contains the html dates
> for line in f:
>      if test.search(line):
>          # I need to pull the date components here

What you need are parentheses, which capture the part of the text
they enclose. Each set of parentheses creates a "group". To get at
these groups, you need the match object returned by re.search (or by
the search method of a compiled pattern). Group 0 is the entire match,
group 1 is the contents of the first set of parentheses, and so forth.
If the regex does not match, the search returns None instead of a
match object, so test the result before calling group() on it.
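
To make the group numbering concrete, here is a minimal sketch run against the single string from the question. It assumes Python 3, and the name rx_demo is only illustrative:

```python
import re

# Three groups: month, day, year, as laid out in the question's string.
rx_demo = re.compile(r'<td>(\d{2})/(\d{2})/(\d{4})</td>')

m = rx_demo.search('<td>01/12/2011</td>')
if m:
    print(m.group(0))   # entire match: <td>01/12/2011</td>
    print(m.group(1))   # first set of parentheses: 01
    print(m.group(2))   # second set: 12
    print(m.group(3))   # third set: 2011
    print(m.groups())   # all groups at once: ('01', '12', '2011')

# A line with no date: search() returns None, so there is no match
# object to call group() on.
no_match = rx_demo.search('<td>David</td>')
print(no_match)         # None
```

Calling group() on a failed search would raise AttributeError, which is why the result is tested with "if m:" first.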


DATA FILE (test.html):
<table>
    <tr><td>David</td><td>02/19/1967</td></tr>
    <tr><td>Susan</td><td>05/23/1948</td></tr>
    <tr><td>Clare</td><td>09/22/1952</td></tr>
    <tr><td>BP</td><td>08/27/1990</td></tr>
    <tr><td>Roger</td><td>12/19/1954</td></tr>
</table>


CODE:
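# Written for Python 3 (print is a function).
import re

# Three capture groups: (1) month, (2) day, (3) year.
rx_test = re.compile(r'<td>(\d{2})/(\d{2})/(\d{4})</td>')

# "with" closes the file once the loop is done.
with open('test.html') as f:
    for line in f:
        m = rx_test.search(line)
        if m:  # None on lines that contain no date
            # Reassemble the groups as YYYYMMDD.
            new_date = m.group(3) + m.group(1) + m.group(2)
            print("raw text: ", m.group(0))
            print("new date: ", new_date)
            print()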

OUTPUT:
raw text:  <td>02/19/1967</td>
new date:  19670219

raw text:  <td>05/23/1948</td>
new date:  19480523

raw text:  <td>09/22/1952</td>
new date:  19520922

raw text:  <td>08/27/1990</td>
new date:  19900827

raw text:  <td>12/19/1954</td>
new date:  19541219
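
If the goal is to rewrite the dates inside the HTML rather than print them, the same three groups can be referenced from a replacement string passed to sub(). This is only a sketch of that variation, assuming Python 3. The names rx_date and reformat_dates are made up for the example and are not part of the code above:

```python
import re

# Same pattern as before: (1) month, (2) day, (3) year.
rx_date = re.compile(r'<td>(\d{2})/(\d{2})/(\d{4})</td>')

def reformat_dates(text):
    """Rewrite every <td>MM/DD/YYYY</td> cell in text as <td>YYYYMMDD</td>."""
    # \g<3>, \g<1> and \g<2> insert the year, month and day groups.
    return rx_date.sub(r'<td>\g<3>\g<1>\g<2></td>', text)

line = '    <tr><td>David</td><td>02/19/1967</td></tr>'
print(reformat_dates(line))
# prints:     <tr><td>David</td><td>19670219</td></tr>
```

The \g<n> form is used instead of \3\1\2 because the groups sit right next to each other in the replacement, and the explicit form is easier to read. Text that does not match the pattern is returned unchanged.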