[Tutor] Extract strings from a text file

Fri Feb 27 13:07:21 CET 2009

On Fri, Feb 27, 2009 at 2:22 AM, spir <denis.spir at free.fr> wrote:
> Anyway for a startup exploration you can use regular expressions (regex) to extract individual data items. For instance:
>
> from re import compile as Pattern
> pattern = Pattern(r""".*<ID>(.+)<.+>.*""")
> line = "text text text <ID>Joseph</text text text>"
> print pattern.findall(line)
> text = """\
> text text text <ID>Joseph</text text text>
> text text text <ID>Jodia</text text text>
> text text text <ID>Joobawap</text text text>
> """
> print pattern.findall(text)
> ==>
> ['Joseph']
> ['Joseph', 'Jodia', 'Joobawap']

You need to be a bit careful with wildcards. Your regex doesn't work
correctly if there are two <ID>s on a line:
In [7]: re.findall(r""".*<ID>(.+)<.+>.*""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[7]: ['Mary']

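The problem is that the initial .* is greedy and first matches the whole
line; the regex then backtracks only as far as the last <ID> and matches
there. That single match covers the entire line, so the first <ID> is
never reported.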
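
To see the backtracking for yourself, here is a small sketch (written for
Python 3, and not part of the original example) that puts the leading
wildcard in its own group so you can inspect how much of the line it took:

```python
import re

line = 'text <ID>Joseph</ID><ID>Mary</ID>'

# Same regex as above, but the leading .* is captured as group 1
# so we can see what it consumed before <ID> was allowed to match.
m = re.search(r"(.*)<ID>(.+)<.+>.*", line)
leading, name = m.groups()

print(repr(leading))  # 'text <ID>Joseph</ID>'
print(repr(name))     # 'Mary'
print(m.span())       # the match covers the whole line
```

The first <ID> ends up inside the leading wildcard, which is why findall
never gets a chance to report it.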

Taking out the initial .* shows another problem:
In [8]: re.findall(r"""<ID>(.+)<.+>""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[8]: ['Joseph</ID><ID>Mary']

Now (.+) is matching to the end of the line, then backing up to find the last <.

One way to fix this is to use non-greedy matching:
In [10]: re.findall(r"""<ID>(.+?)<""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[10]: ['Joseph', 'Mary']
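
As a rough check that the non-greedy version handles both layouts, here
is a sketch (Python 3; the sample text is made up for illustration) that
mixes a line in the original format with a line holding two <ID>s:

```python
import re

# Made-up sample: one line in the original format, one with two IDs.
text = """\
text text text <ID>Joseph</text text text>
text <ID>Jodia</ID><ID>Joobawap</ID>
"""

# .+? takes as few characters as possible, so each match stops
# at the first < that follows the ID.
names = re.findall(r"<ID>(.+?)<", text)
print(names)  # ['Joseph', 'Jodia', 'Joobawap']
```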

Another way is to specifically exclude the character you are matching
from the wildcard match:
In [11]: re.findall(r"""<ID>([^<]+)<""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[11]: ['Joseph', 'Mary']
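
If you want to reuse this, one possible way to package it is sketched
below (Python 3). The names ID_PATTERN and extract_ids are only
illustrative, not anything from the original post:

```python
import re

# [^<]+ matches one or more characters that are not '<', so the
# capture can never run past the start of the next tag.
ID_PATTERN = re.compile(r"<ID>([^<]+)<")

def extract_ids(text):
    """Return the text after each <ID> marker, up to the next '<'."""
    return ID_PATTERN.findall(text)

print(extract_ids('text <ID>Joseph</ID><ID>Mary</ID>'))  # ['Joseph', 'Mary']
print(extract_ids('<ID></ID><ID>Mary</ID>'))             # ['Mary']
```

Note that because of the +, an empty <ID></ID> is skipped rather than
returned as an empty string. Also, unlike . a negated class such as [^<]
matches newlines, so an ID value that wraps onto the next line would
still be captured.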

Kent