python regex: variable length of positive lookbehind assertion

Vlastimil Brom vlastimil.brom at gmail.com
Wed Jun 15 04:31:22 EDT 2016


2016-06-15 5:28 GMT+02:00 Yubin Ruan <ablacktshirt at gmail.com>:
> Hi everyone,
> I am struggling to write a regex that matches what I want:
>
> Problem Description:
>
> Given a string like this:
>
>     >>>string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>              true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail"
>
> I want to match all the text surrounded by "<a> </a>" tags,
> but only if those tags appear **at some distance** after "true_head". That is, I expect the result to look like this:
>
>     >>>import re
>     >>>result = re.findall("the_regex",string)
>     >>>print result
>     ["ccc","ddd","eee"]
>
> How can I write a regex to match that?
> I have tried to use a **positive lookbehind assertion** in Python regex,
> but it does not allow a variable-length lookbehind.
>
> Thanks in advance,
> Ruan

Hi,
HTML-like data is generally not well suited to parsing with regular
expressions, as was explained in the previous answers (especially if
comments and nesting are heavily involved).
However, if this suits your data and use case, you can use
variable-length lookarounds with the much-enhanced "regex" library
for Python:
https://pypi.python.org/pypi/regex

Your pattern might then simply have the form you most likely
intended, e.g.:
>>> import regex
>>> regex.findall(r"(?<=true_head.*)<a>([^<]+)</a>(?=.*true_tail)", "false_head <a>aaa</a> <a>bbb</a> false_tail true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail <a>fff</a> another_false_tail")
['ccc', 'ddd', 'eee']
>>>
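
If installing a third-party package is not an option, a two-step approach with the standard "re" module may also work. The following is only a sketch under some assumptions: the helper name anchors_between is made up for illustration, and it pairs each "true_head" with the nearest following "true_tail", which is slightly stricter than the lookaround pattern above. It also shows that the standard module rejects the variable-length lookbehind.

```python
import re

text = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
        "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail "
        "<a>fff</a> another_false_tail")

# The standard library refuses a variable-length lookbehind at compile time.
try:
    re.compile(r"(?<=true_head.*)<a>")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False


def anchors_between(s, head="true_head", tail="true_tail"):
    """Hypothetical helper: contents of <a>...</a> found between head and tail."""
    results = []
    # Step 1: isolate each head ... tail segment (non-greedy, may span newlines).
    segment_re = re.escape(head) + r"(.*?)" + re.escape(tail)
    for segment in re.finditer(segment_re, s, re.DOTALL):
        # Step 2: collect the anchor contents inside that segment only.
        results.extend(re.findall(r"<a>([^<]+)</a>", segment.group(1)))
    return results


print(lookbehind_ok)
print(anchors_between(text))
```

The same caveat about HTML-like data applies here: this assumes flat, non-nested <a> elements with no "<" inside their text.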

If you are accustomed to using regular expressions, I'd certainly
recommend this excellent library (besides variable-length lookarounds,
there are repeated and recursive patterns, many Unicode-related
enhancements, powerful character set operations, even fuzzy matching
and much more).

hth,
   vbr
