[New-bugs-announce] [issue40879] Strange regex cycle

Fri Jun 5 16:49:23 EDT 2020

New submission from Matt Miller <usr.bin.bourbon at gmail.com>:

I was evaluating a few regular expressions for parsing URL.  One such expression (https://daringfireball.net/2010/07/improved_regex_for_matching_urls) causes the `re.Pattern` to exhibit some strange behavior (notice the stripped characters in the `repr`):

```
>>> STR_RE_URL = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
>>> print(re.compile(STR_RE_URL))
re.compile('(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\, re.IGNORECASE)
```

The reason I started looking at this was because the following string causes the same `re.Pattern` object's `.search()` method to loop forever for some reason:

```
>>> weird_str = """AY:OhQOhQNhQLdLAX78N'7M&6K%4K#4K#7N&9P(JcHOiQE^=8P'F_DJdLC\@9P&D\;IdKHbJ at Z8AY7@Y7AY7B[9E_<Ha?G`>Jc at Jc:F_1PjRRlSOiLKeAKeAGa=D^:F`=Ga=Fa<MhHRmSRlSJc7Ga1Ic3Kd4Jc3Ga0<V&?Y*D]-Hb1Mg7D^/;S at +@)"""
>>> url_pat.search(weird_str)
```

The `.search(weird_str)` will never exit.

I assume the `.search()` taking forever is is an error in the expression but the fact that it causes the `repr` to strip some characters was something I thought should be looked into.

I have not tested this on any other versions of Python.

----------
components: Library (Lib)
messages: 370784
nosy: Matt Miller
priority: normal
severity: normal
status: open
title: Strange regex cycle
versions: Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40879>
_______________________________________