regular expression problem
MRAB
python at mrabarnett.plus.com
Sun Oct 28 16:04:37 EDT 2018
On 2018-10-28 18:51, Karsten Hilbert wrote:
> Dear list members,
>
> I cannot figure out why my regular expression does not work as I expect it to:
>
> #---------------------------
> #!/usr/bin/python
>
> from __future__ import print_function
> import re as regex
>
> rx_works = '\$<[^<:]+?::.*?::\d*?>\$|\$<[^<:]+?::.*?::\d+-\d+>\$'
> # it fails if switched around:
> rx_fails = '\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'
> line = 'junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk'
>
> print ('')
> print ('line:', line)
> print ('expected: $<match_A::options A::4>$')
> print ('expected: $<match_B::options B::4-5>$')
>
> print ('')
> placeholders_in_line = regex.findall(rx_works, line, regex.IGNORECASE)
> print('found (works):')
> for ph in placeholders_in_line:
>     print (ph)
>
> print ('')
> placeholders_in_line = regex.findall(rx_fails, line, regex.IGNORECASE)
> print('found (fails):')
> for ph in placeholders_in_line:
>     print (ph)
>
> #---------------------------
>
> I am sure I simply don't see the problem ?
>
Here are some of the steps the regex engine goes through while matching
the second regex (rx_fails). (View this in a monospaced font.)
1:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
2:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
3:
The .*? matches as few characters as possible, initially none.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
4:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
At this point it can't match: the regex needs a "-" after the digit, but
the next character in the line is ">". So it backtracks.
5:
The .*? matches more characters, including the ":".
After more matching it's like the following.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
6:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
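Again it can't match: after the "::" that follows match_B, the \d+ needs a
digit, but the next character in the line is "o". So it backtracks.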
7:
The .*? matches more characters, including the ":".
After more matching it's like the following.
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
8:
junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk
^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^
Success!
The first choice has matched this:
$<match_A::options A::4>$ junk $<match_B::options B::4-5>$
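Because the first choice succeeded at that starting position, the second
choice is never tried there, and findall carries on searching only after
the end of that long match, where nothing else matches. A minimal sketch to
check this is below. The patterns and the line are from the original post,
written as raw strings. rx_factored is a hypothetical rewrite that is not
from this thread: it assumes it's acceptable to write the shared prefix
once and put the alternation only around the final number field, so that
both number forms are tried before the .*? is extended.

```python
import re

line = 'junk $<match_A::options A::4>$ junk $<match_B::options B::4-5>$ junk'

# The two patterns from the original post, as raw strings.
rx_works = r'\$<[^<:]+?::.*?::\d*?>\$|\$<[^<:]+?::.*?::\d+-\d+>\$'
rx_fails = r'\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'

# Hypothetical rewrite (an assumption, not from the thread): the common
# prefix appears once and only the number field is an alternation.
rx_factored = r'\$<[^<:]+?::.*?::(?:\d+-\d+|\d*?)>\$'

found_works = re.findall(rx_works, line)
found_fails = re.findall(rx_fails, line)
found_factored = re.findall(rx_factored, line)

print('works:   ', found_works)
print('fails:   ', found_fails)
print('factored:', found_factored)
```

With this line, rx_works and rx_factored each find the two placeholders
separately, while rx_fails finds the single long match shown above.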