[Tutor] Regular expression on python

Steven D'Aprano steve at pearwood.info
Tue Apr 14 03:48:16 CEST 2015


On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod_v6 at libero.it wrote:
> Dear all.
> I would like to extract from some file some data.
> The line I'm interested is this:
> 
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward 
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) 
> Dropped: 308617 (14.51%)


    Some people, when confronted with a problem, think "I know, I'll 
    use regular expressions." Now they have two problems.
    -- Jamie Zawinski

I swear that Perl has been a blight on an entire generation of
programmers. All they know is regular expressions, so they turn every 
data processing problem into a regular expression. Or at least they 
*try* to. As you have learned, regular expressions are hard to read, 
hard to write, and hard to get correct.

Let's write some Python code instead.


def extract(line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results



Now let me test it:

py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
...         ' 308617 (14.51%)\n')
py>
py> print(line)
Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward 
Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) 
Dropped: 308617 (14.51%)

py> extract(line)
{'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving': 
6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}


Remember that dicts are unordered in Python versions before 3.7 (newer 
versions keep insertion order). All the data is there, but possibly in 
arbitrary order. Now that you have a nice function to extract the data, 
you can apply it to the lines of a data file in a simple loop:

with open("255.trim.log") as p:
    for line in p:
        if line.startswith("Input "):
            d = extract(line)
            print(d)  # or process it somehow
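
As for "process it somehow", here is one minimal sketch of what that 
could look like: walk the keys in sorted order, so the output no longer 
depends on dict ordering, and work out each count as a percentage of the 
input pairs. Note that summarize is a made-up helper name, not something 
defined earlier, and the sample dict is simply the result from my test 
line pasted in so that the snippet runs on its own.

```python
def summarize(d):
    # Return (key, count, percent) tuples in sorted key order.
    # Assumes d has an 'Input Read Pairs' entry, as in the test line.
    total = d['Input Read Pairs']
    rows = []
    for key in sorted(d):
        if key == 'Input Read Pairs':
            # This is the denominator, not a category of its own.
            continue
        percent = round(100.0 * d[key] / total, 2)
        rows.append((key, d[key], percent))
    return rows


# Sample values copied from the test line, not read from a file.
d = {'Dropped': 308617, 'Both Surviving': 1795091,
     'Reverse Only Surviving': 6413, 'Forward Only Surviving': 17315,
     'Input Read Pairs': 2127436}

for key, count, percent in summarize(d):
    print('%-24s %8d %6.2f%%' % (key, count, percent))
```

The percentages it computes should agree with the ones in brackets in 
the original line, which is a handy sanity check on the extraction.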



-- 
Steven

