[Tutor] Using Regular Expression to extracting string in brackets on a list

Steven D'Aprano steve at pearwood.info
Mon Dec 30 02:14:52 CET 2013


On Sun, Dec 29, 2013 at 04:02:01PM -0500, Jing Ai wrote:
> Hello,
> 
> I am trying to rewrite some contents on a long list that contains words
> within brackets and outside brackets and I'm having trouble extracting the
> words within brackets, especially since I have to add the append function
> for list as well.  Does anyone have any suggestions? Thank you!
> 
> *An example of list*:
> 
> ['hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n',
> 'hypoxia-inducible factor 1-alpha [Mus musculus]\n', 'hypoxia-inducible
> factor 1-alpha [Gallus gallus]\n' ]
> 
> *What I'm trying to extract out of this*:
> 
> ['Branchiostoma floridae', 'Mus musculus', 'Gallus gallus']

You have a list of strings. Each string has exactly one pair of square 
brackets []. You want the content of the square brackets.

Start with a function that extracts the content of the square brackets 
from a single string.

def extract(s):
    start = s.find('[')
    if start == -1:
        # No opening bracket found. Should this be an error?
        return ''
    start += 1  # skip the bracket, move to the next character
    end = s.find(']', start)
    if end == -1:
        # No closing bracket found after the opening bracket.
        # Should this be an error instead?
        return s[start:]
    else:
        return s[start:end]


Let's test it and see if it works:

py> s = 'hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n'
py> extract(s)
'Branchiostoma floridae'
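
The comments in extract() ask whether a missing bracket should be an 
error. If you decide it should, here is a minimal sketch of a stricter 
variant. The name extract_strict and the choice of ValueError are 
assumptions made for illustration only; nothing in the original question 
requires them.

```python
def extract_strict(s):
    """Return the text between the first '[' and the next ']'.

    Hypothetical variant of extract(): raises ValueError instead of
    returning a fallback value when either bracket is missing.
    """
    start = s.find('[')
    if start == -1:
        raise ValueError('no opening bracket in %r' % s)
    start += 1  # skip the bracket itself
    end = s.find(']', start)
    if end == -1:
        raise ValueError('no closing bracket in %r' % s)
    return s[start:end]
```

The caller can then wrap the call in try/except ValueError to skip or 
log malformed lines, rather than silently collecting empty strings.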

So far so good. Now let's write a loop:

names = []
for line in list_of_strings:
    names.append(extract(line))


where list_of_strings is your big list like the example above.

We can simplify the loop by using a list comprehension:

names = [extract(line) for line in list_of_strings]
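
To see the whole thing run end to end, here is a sketch using the three 
sample strings from the question. extract() is repeated (without its 
comments) so the snippet runs on its own, and list_of_strings is just 
the placeholder name used above.

```python
def extract(s):
    start = s.find('[')
    if start == -1:
        return ''
    start += 1  # skip the bracket, move to the next character
    end = s.find(']', start)
    if end == -1:
        return s[start:]
    return s[start:end]


# The sample data from the question, one string per list item.
list_of_strings = [
    'hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n',
    'hypoxia-inducible factor 1-alpha [Mus musculus]\n',
    'hypoxia-inducible factor 1-alpha [Gallus gallus]\n',
]

names = [extract(line) for line in list_of_strings]
print(names)
# prints ['Branchiostoma floridae', 'Mus musculus', 'Gallus gallus']
```

Note that the trailing newline on each string does no harm, because 
only the text between the brackets is returned.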


If you prefer to use a regular expression, that's simple enough. Here's 
a new version of the extract function:

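import re

def extract(s):
    # Non-greedy match: stop at the first closing bracket, so a string
    # with more than one bracketed group yields only the first group,
    # the same as the find-based version.
    mo = re.search(r'\[(.*?)\]', s)
    if mo:
        return mo.group(1)
    # No complete pair of brackets found.
    return ''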


The list comprehension remains the same.
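
One thing to keep in mind with bracket patterns is the difference 
between a greedy .* and a non-greedy .*? when a string contains more 
than one pair of brackets. The sketch below uses a made-up test string 
(it is not from the question) to show both, plus re.findall in case 
you ever want every bracketed group rather than just the first.

```python
import re

# Made-up example with two bracketed groups, for illustration only.
s = 'protein [Mus musculus] variant [partial]\n'

# Greedy: .* runs on to the LAST closing bracket on the line.
greedy = re.search(r'\[(.*)\]', s).group(1)

# Non-greedy: .*? stops at the FIRST closing bracket.
lazy = re.search(r'\[(.*?)\]', s).group(1)

# findall returns every bracketed group as a list of strings.
all_groups = re.findall(r'\[(.*?)\]', s)

print(greedy)      # Mus musculus] variant [partial
print(lazy)        # Mus musculus
print(all_groups)  # ['Mus musculus', 'partial']
```

With exactly one pair of brackets per string, as in your data, both 
patterns give the same answer.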


-- 
Steven

