python regex: variable length of positive lookbehind assertion

Vlastimil Brom vlastimil.brom at
Wed Jun 15 04:31:22 EDT 2016

2016-06-15 5:28 GMT+02:00 Yubin Ruan <ablacktshirt at>:
> Hi everyone,
> I am struggling to write a regex that matches what I want:
> Problem Description:
> Given a string like this:
>     >>>string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>              true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail"
> I want to match all the text surrounded by those "<a> </a>",
> but only if those "<a> </a>" are located **at some distance** behind "true_head". That is, I expect the result to be like this:
>     >>>import re
>     >>>result = re.findall("the_regex",string)
>     >>>print result
>     ["ccc","ddd","eee"]
> How can I write a regex to match that?
> I have tried to use the **positive lookbehind assertion** in Python regex,
> but it does not allow variable-length lookbehind.
> Thanks in advance,
> Ruan
> --

HTML-like data is generally not very suitable for parsing with regex,
as was explained in the previous answers (especially if comments and
nesting are heavily involved).
However, if this suits your data and the use case, you can use
variable-length lookarounds with the much enhanced "regex" library
for Python (a third-party package, published on PyPI as "regex").

Your pattern might then simply have the form you most likely
intended, e.g.:
>>> import regex
>>> text = "false_head <a>aaa</a> <a>bbb</a> false_tail true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail <a>fff</a> another_false_tail"
>>> regex.findall(r"(?<=true_head.*)<a>([^<]+)</a>(?=.*true_tail)", text)
['ccc', 'ddd', 'eee']
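
If installing a third-party package is not an option, a two-step
approach with the standard "re" module may be enough. This is only a
sketch under assumptions: the helper name links_between and its
parameters are made up for illustration, and it pairs each "true_head"
with the nearest following "true_tail", which is not exactly the same
as the lookaround pattern (that one accepts any earlier "true_head"
and any later "true_tail").

```python
import re

text = (
    "false_head <a>aaa</a> <a>bbb</a> false_tail "
    "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail "
    "<a>fff</a> another_false_tail"
)


def links_between(text, head="true_head", tail="true_tail"):
    """Hypothetical helper: contents of <a>...</a> found between head and tail."""
    results = []
    # Step 1: isolate every head ... tail region (non-greedy, may span lines).
    region_pattern = re.escape(head) + r"(.*?)" + re.escape(tail)
    for region in re.finditer(region_pattern, text, re.DOTALL):
        # Step 2: collect the tag contents inside that region only.
        results.extend(re.findall(r"<a>([^<]+)</a>", region.group(1)))
    return results


print(links_between(text))
```

Splitting the task this way avoids the lookbehind entirely, so the
fixed-width restriction of "re" never comes into play.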

If you are accustomed to using regular expressions, I'd certainly
recommend this excellent library (besides unlimited lookarounds, there
are repeated and recursive patterns, many Unicode-related
enhancements, powerful character set operations, even fuzzy matching
and much more).

