[Tutor] Regular expression on python
Steven D'Aprano
steve at pearwood.info
Tue Apr 14 03:48:16 CEST 2015
On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod_v6 at libero.it wrote:
> Dear all.
> I would like to extract from some file some data.
> The line I'm interested is this:
>
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
> Dropped: 308617 (14.51%)
    Some people, when confronted with a problem, think "I know, I'll
    use regular expressions." Now they have two problems.
    -- Jamie Zawinski
I swear that Perl has been a blight on an entire generation of
programmers. All they know is regular expressions, so they turn every
data processing problem into a regular expression. Or at least they
*try* to. As you have learned, regular expressions are hard to read,
hard to write, and hard to get correct.
Let's write some Python code instead.
def extract(line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results
Now let me test it:
py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
... ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
... ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
... ' 308617 (14.51%)\n')
py>
py> print(line)
Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
Dropped: 308617 (14.51%)
py> extract(line)
{'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving':
6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}
Remember that dicts are unordered in this version of Python (from
Python 3.7 onward they preserve insertion order). All the data is there,
but in arbitrary order. Now that you have a nice function to extract the
data, you can apply it to the lines of a data file in a simple loop:
with open("255.trim.log") as p:
    for line in p:
        if line.startswith("Input "):
            d = extract(line)
            print(d)  # or process it somehow
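One way to "process it somehow" would be to recompute the percentages
that extract() deliberately skips, straight from the integer counts.
This is only a minimal sketch: the helper name survival_rates and its
total_key parameter are hypothetical, not part of the code above, and it
assumes the dict has the same keys as the test output shown earlier and
that the total is not zero.

```python
def survival_rates(d, total_key='Input Read Pairs'):
    """Return every count except the total as a percentage of the total.

    Assumes d maps names to integer counts, as produced by a function
    like extract(), and that d[total_key] is non-zero.
    """
    total = d[total_key]
    return {key: round(100.0 * value / total, 2)
            for key, value in d.items()
            if key != total_key}


# The counts from the sample line, written out by hand.
d = {'Input Read Pairs': 2127436,
     'Both Surviving': 1795091,
     'Forward Only Surviving': 17315,
     'Reverse Only Surviving': 6413,
     'Dropped': 308617}

rates = survival_rates(d)
# Sort the keys so the report comes out in a predictable order.
for key in sorted(rates):
    print('%s: %.2f%%' % (key, rates[key]))
```

Working from the counts rather than parsing the bracketed text keeps the
parsing code simple, and the results can be checked against the
percentages printed in the log line.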
--
Steven