[Tutor] Using Regular Expression to extracting string in brackets on a list
Steven D'Aprano
steve at pearwood.info
Mon Dec 30 02:14:52 CET 2013
On Sun, Dec 29, 2013 at 04:02:01PM -0500, Jing Ai wrote:
> Hello,
>
> I am trying to rewrite some contents on a long list that contains words
> within brackets and outside brackets and I'm having trouble extracting the
> words within brackets, especially since I have to add the append function
> for list as well. Does anyone have any suggestions? Thank you!
>
> *An example of list*:
>
> ['hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n',
> 'hypoxia-inducible factor 1-alpha [Mus musculus]\n', 'hypoxia-inducible
> factor 1-alpha [Gallus gallus]\n' ]
>
> *What I'm trying to extract out of this*:
>
> ['Branchiostoma floridae', 'Mus musculus', 'Gallus gallus']
You have a list of strings. Each string has exactly one pair of square
brackets []. You want the content of the square brackets.
Start with a function that extracts the content of the square brackets
from a single string.
def extract(s):
    start = s.find('[')
    if start == -1:
        # No opening bracket found. Should this be an error?
        return ''
    start += 1  # skip the bracket, move to the next character
    end = s.find(']', start)
    if end == -1:
        # No closing bracket found after the opening bracket.
        # Should this be an error instead?
        return s[start:]
    else:
        return s[start:end]
Let's test it and see if it works:
py> s = 'hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n'
py> extract(s)
'Branchiostoma floridae'
So far so good. Now let's write a loop:
names = []
for line in list_of_strings:
    names.append(extract(line))
where list_of_strings is your big list like the example above.
We can simplify the loop by using a list comprehension:
names = [extract(line) for line in list_of_strings]
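To see the pieces working together, here is a minimal self-contained sketch that runs the find-based function over the three sample strings from the question. The function body is repeated from above only so the snippet runs on its own; nothing else is new.

```python
def extract(s):
    # Same logic as the find-based function above, repeated so that
    # this snippet can be run by itself.
    start = s.find('[')
    if start == -1:
        return ''
    start += 1  # skip the opening bracket
    end = s.find(']', start)
    if end == -1:
        return s[start:]
    return s[start:end]

# The sample data from the original question.
list_of_strings = [
    'hypothetical protein BRAFLDRAFT_208408 [Branchiostoma floridae]\n',
    'hypoxia-inducible factor 1-alpha [Mus musculus]\n',
    'hypoxia-inducible factor 1-alpha [Gallus gallus]\n',
]

names = [extract(line) for line in list_of_strings]
print(names)
# ['Branchiostoma floridae', 'Mus musculus', 'Gallus gallus']
```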
If you prefer to use a regular expression, that's simple enough. Here's
a new version of the extract function:
import re
def extract(s):
    mo = re.search(r'\[(.*)\]', s)
    if mo:
        return mo.group(1)
    return ''
The list comprehension remains the same.
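One caveat worth knowing: `.*` is greedy, so the pattern above matches from the first `[` to the last `]` on the line. That is fine under the assumption of exactly one pair of brackets per string. If a line could contain more than one pair, a possible variant (sketched here under the hypothetical name `extract_first`, with a made-up test string) stops at the first closing bracket instead:

```python
import re

def extract_first(s):
    # [^\]]* matches any run of characters other than ']', so the match
    # stops at the first closing bracket instead of the last one.
    mo = re.search(r'\[([^\]]*)\]', s)
    return mo.group(1) if mo else ''

# A made-up string with two bracket pairs, for illustration only.
s = 'protein [partial] [Mus musculus]\n'
greedy = re.search(r'\[(.*)\]', s).group(1)
print(greedy)            # partial] [Mus musculus
print(extract_first(s))  # partial
```

Which behaviour is right depends on the data. For strings with a single bracket pair, both versions give the same answer.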
--
Steven