[Tutor] Regular expression on python

Tue Apr 14 10:00:47 CEST 2015

Steven D'Aprano wrote:

> On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod_v6 at libero.it wrote:
>> Dear all.
>> I would like to extract some data from a file.
>> The line I'm interested in is this:
>> 
>> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
>> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
>> Dropped: 308617 (14.51%)
> 
> 
>     Some people, when confronted with a problem, think "I know, I'll
>     use regular expressions." Now they have two problems.
>     -- Jamie Zawinski
>
> I swear that Perl has been a blight on an entire generation of
> programmers. All they know is regular expressions, so they turn every
> data processing problem into a regular expression. Or at least they
> *try* to. As you have learned, regular expressions are hard to read,
> hard to write, and hard to get correct.
> 
> Let's write some Python code instead.
> 
> 
> def extract(line):
>     # Extract key:number values from the string.
>     line = line.strip()  # Remove leading and trailing whitespace.
>     words = line.split()
>     accumulator = []  # Collect parts of the string we care about.
>     for word in words:
>         if word.startswith('(') and word.endswith('%)'):
>             # We don't care about percentages in brackets.
>             continue
>         try:
>             n = int(word)
>         except ValueError:
>             accumulator.append(word)
>         else:
>             accumulator.append(n)
>     # Now accumulator will be a list of strings and ints:
>     # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
>     # Collect consecutive strings as the key, int to be the value.
>     results = {}
>     keyparts = []
>     for item in accumulator:
>         if isinstance(item, int):
>             key = ' '.join(keyparts)
>             keyparts = []
>             if key.endswith(':'):
>                 key = key[:-1]
>             results[key] = item
>         else:
>             keyparts.append(item)
>     # When we have finished processing, the keyparts list should be empty.
>     if keyparts:
>         extra = ' '.join(keyparts)
>         print('Warning: found extra text at end of line "%s".' % extra)
>     return results
> 
> 
> 
> Now let me test it:
> 
> py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
> ...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
> ...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
> ...         ' 308617 (14.51%)\n')
> py>
> py> print(line)
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
> Dropped: 308617 (14.51%)
> 
> py> extract(line)
> {'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving':
> 6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}
> 
> 
> Remember that dicts are unordered. All the data is there, but in
> arbitrary order. Now that you have a nice function to extract the data,
> you can apply it to the lines of a data file in a simple loop:
> 
> with open("255.trim.log") as p:
>     for line in p:
>         if line.startswith("Input "):
>             d = extract(line)
>             print(d)  # or process it somehow

The tempter took possession of me and dictated:

>>> import pprint
>>> import re
>>> pprint.pprint(
... [(k, int(v)) for k, v in
... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
[('Input Read Pairs', 2127436),
 ('Both Surviving', 1795091),
 ('Forward Only Surviving', 17315),
 ('Reverse Only Surviving', 6413),
 ('Dropped', 308617)]
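
For anyone who finds that one-liner hard to read, here is a rough sketch of the same pattern spelled out with re.VERBOSE, one commented piece per line. The names PAIR and extract_pairs are made up for illustration, and the sketch assumes that keys never contain a colon and that every value is a plain integer, as in the sample line.

```python
import re

# Hypothetical name; the same pattern as in the one-liner, in verbose form.
PAIR = re.compile(r"""
    (.+?)              # key: shortest run of characters up to ...
    :\s+               # ... a colon followed by whitespace
    (\d+)              # value: the integer count
    (?:\s+\(.*?\))?    # optional percentage in parentheses, not captured
    \s*                # trailing whitespace, so the next key starts cleanly
""", re.VERBOSE)


def extract_pairs(line):
    """Return a list of (key, int) tuples in the order they appear."""
    return [(key, int(value)) for key, value in PAIR.findall(line)]


line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
        ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
        ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
        ' 308617 (14.51%)\n')
pairs = extract_pairs(line)
```

Because findall() returns the matches from left to right, the list keeps the order of the fields in the line; dict(pairs) gives a mapping like the one extract() returns if the order does not matter.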