regular expression to extract text
Lonnie Princehouse
fnord at u.washington.edu
Thu Nov 20 14:55:45 EST 2003
One of the beautiful things about Python is that,
while there is usually one obvious and reasonable
way to do something, there are many many ridiculous
ways to do it as well. This is especially true when
regular expressions are involved.
I'd do it like this: (Note that this wants the whole file as
one string, so use read() instead of readline())
data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode Sumformula
<Conventional cell parameters>
------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
------------------------------------------
NHDIIS C45 H40 Cl2
6.532 15.147 22.453 90.00 90.00 90.00 """
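import re
# r1: a dashed separator line, then the refcode (group 1) and
# everything up to the next dash or the end of the string (group 2)
r1 = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)
# r2: an element symbol followed by a count, e.g. C26 or Cl2
r2 = re.compile(r'([A-Z]+\d+)', re.I)
# r3: decimal numbers such as 6.164
r3 = re.compile(r'(\d+\.\d+)')
results = dict([ (name, {
    'isotopes': r2.findall(body),
    'values': [float(x) for x in r3.findall(body)]
    }) for name, body in r1.findall(data) ])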
This assumes that you want the numbers as floats instead of strings;
if you're just going to print them out again, don't call float().
I also assume (perhaps wrongly) that the order of entries isn't
important. Don't do the dict() conversion if that assumption's wrong.
This yields:
{'ARQTYD': {'isotopes': ['C19', 'H23', 'N1', 'O5'],
            'values': [6.0010000000000003,
                       15.227,
                       22.558,
                       90.0,
                       90.0,
                       90.0]},
 'NHDIIS': {'isotopes': ['C45', 'H40', 'Cl2'],
            'values': [6.532,
                       15.147,
                       22.452999999999999,
                       90.0,
                       90.0,
                       90.0]},
 'QEXZUO': {'isotopes': ['C26', 'H31', 'N1', 'O3'],
            'values': [6.1639999999999997,
                       15.891999999999999,
                       22.550999999999998,
                       90.0,
                       90.0,
                       90.0]}}
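
If the order of entries does matter, or the numbers should stay as strings, a variant along these lines might do. It is only a sketch: the names ENTRY, ATOM, NUMBER, parse_entries and parse_file are made up for illustration, and the patterns are the same three as above written as raw strings. It skips the dict() conversion and the float() call, and reads the file with read() so the multi-line pattern sees the whole text at once.

```python
import re

# The same three patterns as above, under hypothetical names.
ENTRY = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)
ATOM = re.compile(r'([A-Z]+\d+)', re.I)
NUMBER = re.compile(r'(\d+\.\d+)')

def parse_entries(text):
    """Return a list of (refcode, info) pairs in the order they appear."""
    return [(name, {'isotopes': ATOM.findall(body),
                    'values': NUMBER.findall(body)})  # left as strings
            for name, body in ENTRY.findall(text)]

def parse_file(path):
    # read() returns the whole file as one string; the ENTRY pattern
    # spans several lines, so reading line by line would not match.
    with open(path) as f:
        return parse_entries(f.read())

sample = """------------------------------------------
QEXZUO C26 H31 N1 O3
6.164 15.892 22.551 90.00 90.00 90.00
------------------------------------------
ARQTYD C19 H23 N1 O5
6.001 15.227 22.558 90.00 90.00 90.00
"""
entries = parse_entries(sample)
for name, info in entries:
    print(name, info['isotopes'], info['values'])
```

A list of pairs keeps the entries in file order; pass it to dict() later if lookup by refcode turns out to be more useful.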