[Tutor] Regular expression on python
Peter Otten
__peter__ at web.de
Tue Apr 14 10:00:47 CEST 2015
Steven D'Aprano wrote:
> On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod_v6 at libero.it wrote:
>> Dear all.
>> I would like to extract from some file some data.
>> The line I'm interested is this:
>>
>> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
>> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
>> Dropped: 308617 (14.51%)
>
>
> Some people, when confronted with a problem, think "I know, I'll
> use regular expressions." Now they have two problems.
> -- Jamie Zawinski
>
> I swear that Perl has been a blight on an entire generation of
> programmers. All they know is regular expressions, so they turn every
> data processing problem into a regular expression. Or at least they
> *try* to. As you have learned, regular expressions are hard to read,
> hard to write, and hard to get correct.
>
> Let's write some Python code instead.
>
>
> def extract(line):
>     # Extract key:number values from the string.
>     line = line.strip()  # Remove leading and trailing whitespace.
>     words = line.split()
>     accumulator = []  # Collect parts of the string we care about.
>     for word in words:
>         if word.startswith('(') and word.endswith('%)'):
>             # We don't care about percentages in brackets.
>             continue
>         try:
>             n = int(word)
>         except ValueError:
>             accumulator.append(word)
>         else:
>             accumulator.append(n)
>     # Now accumulator will be a list of strings and ints:
>     # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
>     # Collect consecutive strings as the key, int to be the value.
>     results = {}
>     keyparts = []
>     for item in accumulator:
>         if isinstance(item, int):
>             key = ' '.join(keyparts)
>             keyparts = []
>             if key.endswith(':'):
>                 key = key[:-1]
>             results[key] = item
>         else:
>             keyparts.append(item)
>     # When we have finished processing, the keyparts list should be empty.
>     if keyparts:
>         extra = ' '.join(keyparts)
>         print('Warning: found extra text at end of line "%s".' % extra)
>     return results
>
>
>
> Now let me test it:
>
> py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
> ... ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
> ... ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
> ... ' 308617 (14.51%)\n')
> py>
> py> print(line)
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
> Dropped: 308617 (14.51%)
>
> py> extract(line)
> {'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving':
> 6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}
>
>
> Remember that dicts are unordered. All the data is there, but in
> arbitrary order. Now that you have a nice function to extract the data,
> you can apply it to the lines of a data file in a simple loop:
>
> with open("255.trim.log") as p:
>     for line in p:
>         if line.startswith("Input "):
>             d = extract(line)
>             print(d)  # or process it somehow
The tempter took possession of me and dictated:
>>> import pprint, re
>>> pprint.pprint(
...     [(k, int(v)) for k, v in
...      re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
[('Input Read Pairs', 2127436),
 ('Both Surviving', 1795091),
 ('Forward Only Surviving', 17315),
 ('Reverse Only Surviving', 6413),
 ('Dropped', 308617)]
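
For anyone who finds the one-liner hard to read, here is a rough sketch of the same pattern written with re.VERBOSE, so each piece can carry a comment. The names PAIR and extract_pairs are made up for this illustration and appear nowhere else in the thread; the sample line is the one from the original question.

```python
import re

# Same pattern as the one-liner, spelled out piece by piece.
# In VERBOSE mode, whitespace in the pattern is ignored and "#" starts a comment.
PAIR = re.compile(r"""
    (.+?):              # key: shortest run of characters up to a colon
    \s+(\d+)            # whitespace, then the integer count
    (?:\s+\(.*?\))?     # optional parenthesised percentage, not captured
    \s*                 # trailing whitespace, so the next key starts cleanly
""", re.VERBOSE)


def extract_pairs(line):
    """Return [(key, count), ...] in the order the pairs appear in line."""
    return [(key, int(value)) for key, value in PAIR.findall(line)]


line = ("Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 (14.51%)\n")
pairs = extract_pairs(line)
```

Because the result is a list of tuples, the order of the fields in the log line is kept; pass it to dict() if lookup by key is more convenient.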