pattern matching
John S
jstrickler at gmail.com
Thu Feb 24 07:29:46 EST 2011
On Feb 23, 9:11 pm, monkeys paw <mon... at joemoney.net> wrote:
> if I have a string such as '<td>01/12/2011</td>' and i want
> to reformat it as '20110112', how do i pull out the components
> of the string and reformat them into a YYYYDDMM format?
>
> I have:
>
> import re
>
> test = re.compile('\d\d\/')
> f = open('test.html') # This file contains the html dates
> for line in f:
>     if test.search(line):
>         # I need to pull the date components here
What you need are parentheses, which capture part of the text you're
matching. Each set of parentheses creates a "group". To get at these
groups, you need the match object, which is returned by re.search (or
by the search method of a compiled pattern). Group 0 is the entire
match, group 1 is the contents of the first set of parentheses, and so
forth. If the regex does not match, then re.search returns None.
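As a quick illustration of the group numbering, here is a minimal
sketch using the sample string from the question. The variable names
(m, no_match) are only for illustration.

```python
import re

# Three sets of parentheses -> three groups: month, day, year.
m = re.search(r'(\d{2})/(\d{2})/(\d{4})', '<td>01/12/2011</td>')

print(m.group(0))   # the entire match: 01/12/2011
print(m.group(1))   # first set of parentheses: 01
print(m.group(3))   # third set of parentheses: 2011
print(m.groups())   # all groups as a tuple: ('01', '12', '2011')

# When nothing matches, re.search returns None, so test the result
# before calling .group() on it.
no_match = re.search(r'(\d{2})/(\d{2})/(\d{4})', '<td>David</td>')
print(no_match)     # None
```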
DATA FILE (test.html):
<table>
<tr><td>David</td><td>02/19/1967</td></tr>
<tr><td>Susan</td><td>05/23/1948</td></tr>
<tr><td>Clare</td><td>09/22/1952</td></tr>
<tr><td>BP</td><td>08/27/1990</td></tr>
<tr><td>Roger</td><td>12/19/1954</td></tr>
</table>
CODE:
import re

# Capture month, day, and year as groups 1, 2, and 3.
rx_test = re.compile(r'<td>(\d{2})/(\d{2})/(\d{4})</td>')

with open('test.html') as f:
    for line in f:
        m = rx_test.search(line)
        if m:
            # Reorder the captured groups as YYYYMMDD.
            new_date = m.group(3) + m.group(1) + m.group(2)
            print("raw text:", m.group(0))
            print("new date:", new_date)
            print()
OUTPUT:
raw text: <td>02/19/1967</td>
new date: 19670219

raw text: <td>05/23/1948</td>
new date: 19480523

raw text: <td>09/22/1952</td>
new date: 19520922

raw text: <td>08/27/1990</td>
new date: 19900827

raw text: <td>12/19/1954</td>
new date: 19541219
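
If you would rather rewrite the text in place than print the pieces,
the same groups can be referenced in a replacement string passed to
sub. This is only a sketch of one alternative, not the only way to do
it; the names RX_DATE and to_yyyymmdd are made up for the example.

```python
import re

# Same pattern as in the code above.
RX_DATE = re.compile(r'<td>(\d{2})/(\d{2})/(\d{4})</td>')

def to_yyyymmdd(text):
    """Replace each <td>MM/DD/YYYY</td> cell in text with YYYYMMDD."""
    # \g<3>, \g<1>, \g<2> insert the year, month, and day groups.
    return RX_DATE.sub(r'\g<3>\g<1>\g<2>', text)

print(to_yyyymmdd('<td>01/12/2011</td>'))   # 20110112
```

Note that the pattern only checks for digits, so something like
99/99/2011 would also be rewritten. If that matters, the captured
values need a separate range check.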