python regex: variable length of positive lookbehind assertion
Jussi Piitulainen
jussi.piitulainen at helsinki.fi
Wed Jun 15 12:04:29 EDT 2016
alister writes:
> On Wed, 15 Jun 2016 15:55:42 +0300, Jussi Piitulainen wrote:
>
>> alister writes:
>>
>>> On Tue, 14 Jun 2016 20:28:24 -0700, Yubin Ruan wrote:
>>>
>>>> Hi everyone,
>>>> I am struggling to write a regex that matches what I want:
>>>>
>>>> Problem Description:
>>>>
>>>> Given a string like this:
>>>>
>>>> >>> string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>>>> true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> \
>>>> true_tail"
>>>>
>>>> I want to match all the text surrounded by those "<a> </a>" tags,
>>>> but only if those "<a> </a>" tags are located **at some distance** after
>>>> "true_head". That is, I expect the result to be like this:
>>>>
>>>> >>> import re
>>>> >>> result = re.findall("the_regex", string)
>>>> >>> print result
>>>> ["ccc","ddd","eee"]
>>>>
>>>> How can I write a regex to match that?
>>>> I have tried to use the **positive lookbehind assertion** in Python
>>>> regex, but it does not allow a variable-length lookbehind.
>>>>
>>>> Thanks in advance,
>>>> Ruan
>>>
>>> Don't try to use regex to parse HTML; it won't work reliably. I am
>>> surprised no one has mentioned BeautifulSoup yet, which is probably
>>> what you require.
>>
>> Nothing in the question indicates that the data is HTML.
>
> the <a></a> tags are a pretty good indicator though
I can see how they point that way, but to me that alone seemed pretty
weak.
> even if it is not HTML the same advice stands for XML (the quoted
> example would be invalid if it was XML)
It's not valid HTML either, for similar reasons. Or is it? I don't even
want to know.
> if it is neither of these formats but still uses a similar tag
> structure then I would say that regex is still unsuitable & the OP
> would need to write a full parser for the format if one does not
> already exist
That depends on details that weren't provided.
I work with a data format that mixes element tags with line-oriented
data records, and having a dedicated parser would be more of a hassle. A
couple of very simple regexen are useful in making sure that start tags
have a valid form and extracting attribute-value pairs from them. I'm
not at all experiencing "two problems" here. Some uses of regex are
good. (And now I may be about to experience the third problem. That
makes me sad.)
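
To illustrate the kind of very simple regex use described above, here is a minimal sketch. The tag shape (`<name key="value" ...>`), the pattern names and the helper `parse_start_tag` are all made up for this example; the actual data format is not described in this message.

```python
import re

# Hypothetical start-tag shape: <name key="value" key2="value2">
# (an assumption for illustration, not the real format).
START_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_start_tag(line):
    """Return (name, attributes) for a well-formed start tag, else None."""
    match = START_TAG.fullmatch(line.strip())
    if match is None:
        return None  # not a valid start tag, e.g. an ordinary data record
    name, raw_attributes = match.group(1), match.group(2)
    # Second, even simpler regex: pull out the attribute-value pairs.
    return name, dict(ATTRIBUTE.findall(raw_attributes))
```

One pattern checks that the whole line has a valid form, and the other only extracts pairs from a part that has already been validated, so neither has to be clever.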
Anyway, I think you and another person guessed correctly that the OP is
indeed really considering HTML, and then your suggestion is certainly
helpful.
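
For the literal string in the original question, one way around the fixed-width restriction on lookbehind in Python's `re` module is to skip lookbehind and work in two steps: first isolate the text after "true_head", then collect the `<a>...</a>` contents inside it. The sketch below assumes that "true_tail" closes the region of interest, which the question suggests but does not state, and the function name `tagged_after` is invented for the example. It is not a substitute for a real parser if the data is HTML.

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> "
          "true_tail")


def tagged_after(text, head="true_head", tail="true_tail"):
    """Collect <a>...</a> contents found between each head and the next tail."""
    section_pattern = re.escape(head) + r"(.*?)" + re.escape(tail)
    found = []
    # Step 1: find each region delimited by head and tail (assumed delimiters).
    for section in re.finditer(section_pattern, text, re.DOTALL):
        # Step 2: extract the tagged text inside that region only.
        found.extend(re.findall(r"<a>(.*?)</a>", section.group(1), re.DOTALL))
    return found


result = tagged_after(string)
print(result)  # ['ccc', 'ddd', 'eee']
```

Splitting the work this way keeps both patterns trivial, and the "some distance" between "true_head" and the tags can be any length.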