[Tutor] Regular Expression help
Kent Johnson
kent37 at tds.net
Wed Jun 27 15:39:20 CEST 2007
Gardner, Dean wrote:
> Hi
>
> I have a text file that I would like to split up so that I can use it in
> Excel to filter a certain field. However as it is a flat text file I
> need to do some processing on it so that Excel can correctly import it.
>
> File Example:
> tag desc VR VM
> (0012,0042) Clinical Trial Subject Reading ID LO 1
> (0012,0050) Clinical Trial Time Point ID LO 1
> (0012,0051) Clinical Trial Time Point Description ST 1
> (0012,0060) Clinical Trial Coordinating Center Name LO 1
> (0018,0010) Contrast/Bolus Agent LO 1
> (0018,0012) Contrast/Bolus Agent Sequence SQ 1
> (0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
> (0018,0015) Body Part Examined CS 1
>
> What I essentially want is to use python to process this file to give me
>
>
> (0012,0042); Clinical Trial Subject Reading ID; LO; 1
> (0012,0050); Clinical Trial Time Point ID; LO; 1
> (0012,0051); Clinical Trial Time Point Description; ST; 1
> (0012,0060); Clinical Trial Coordinating Center Name; LO; 1
> (0018,0010); Contrast/Bolus Agent; LO; 1
> (0018,0012); Contrast/Bolus Agent Sequence; SQ; 1
> (0018,0014); Contrast/Bolus Administration Route Sequence; SQ; 1
> (0018,0015); Body Part Examined; CS; 1
>
> so that I can import to excel using a delimiter.
>
> This file is extremely long and all I essentially want to do is to break
> it into its 'fields'
>
> Now I suspect that regular expressions are the way to go but I have only
> basic experience of using these and I have no idea what I should be doing.
This seems to work:

import re

data = '''\
(0012,0042) Clinical Trial Subject Reading ID LO 1
(0012,0050) Clinical Trial Time Point ID LO 1
(0012,0051) Clinical Trial Time Point Description ST 1
(0012,0060) Clinical Trial Coordinating Center Name LO 1
(0018,0010) Contrast/Bolus Agent LO 1
(0018,0012) Contrast/Bolus Agent Sequence SQ 1
(0018,0014) Contrast/Bolus Administration Route Sequence SQ 1
(0018,0015) Body Part Examined CS 1'''.splitlines()

# Groups: tag, description (non-greedy), VR, VM
fieldsRe = re.compile(r'^(\(\d+,\d+\)) (.*?) (\w+) (\d+)$')

for line in data:
    match = fieldsRe.match(line)
    if match:
        # Parenthesized form works as a print statement or a print() call
        print(';'.join(match.group(1, 2, 3, 4)))

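
For anyone new to regular expressions, the pattern above can be spelled out piece by piece with re.VERBOSE and named groups. This is only a sketch: the group names (tag, desc, vr, vm) and the split_fields helper are invented here for illustration, and it assumes every data line has the tag / description / VR / VM shape shown in the sample.

```python
import re

# Same pattern as above, written with re.VERBOSE so each piece can carry a
# comment. The group names are illustrative, not from the original post.
FIELDS_RE = re.compile(r'''
    ^(?P<tag>\(\d+,\d+\))   # tag in parentheses, e.g. (0012,0042)
    [ ]                     # literal space (bare spaces are ignored in VERBOSE)
    (?P<desc>.*?)           # description, non-greedy so it stops before VR
    [ ]
    (?P<vr>\w+)             # VR code, e.g. LO
    [ ]
    (?P<vm>\d+)$            # VM count, e.g. 1
''', re.VERBOSE)


def split_fields(line):
    """Return (tag, desc, vr, vm) for a matching line, or None otherwise."""
    match = FIELDS_RE.match(line)
    if match is None:
        return None
    return match.group('tag', 'desc', 'vr', 'vm')


print(split_fields('(0018,0015) Body Part Examined CS 1'))
```

The non-greedy (.*?) is what lets the description contain spaces: it grows one character at a time until the rest of the line can be matched as a VR word and a VM number. If the real file has tags containing letters, \d+ would not match them and the character class would need widening; the sample shows only digits.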
I don't think you want the space after the ; that you put in your
example; Excel wants a single-character delimiter.
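
To run this over the whole file instead of a pasted string, something along these lines might do. It is a sketch under assumptions, written for Python 3: the convert function and the file names in the usage comment are placeholders, and any line that does not match (such as the "tag desc VR VM" header) is silently skipped. It uses the standard csv module with a single ';' as the delimiter, so no space follows the separator.

```python
import csv
import re

fieldsRe = re.compile(r'^(\(\d+,\d+\)) (.*?) (\w+) (\d+)$')


def convert(lines, out):
    """Write one ';'-separated row to `out` per matching line.

    Returns the number of rows written. Non-matching lines are skipped.
    """
    writer = csv.writer(out, delimiter=';', lineterminator='\n')
    count = 0
    for line in lines:
        match = fieldsRe.match(line.strip())
        if match:
            writer.writerow(match.groups())
            count += 1
    return count


# Hypothetical usage; both file names are placeholders:
# with open('tags.txt') as src, open('tags.csv', 'w', newline='') as dst:
#     convert(src, dst)
```

Returning the row count makes it easy to compare against the number of lines in the input and spot any that the pattern failed to match.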
Kent