[issue8064] Large regex handling very slow on Linux

Fri Mar 5 03:14:32 CET 2010

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

This shows that you can write an equivalent regex without listing all the 'letter chars' (tested on both narrow and wide builds):
>>> import re, sys
>>> s = u''.join(unichr(c) for c in range(sys.maxunicode + 1))
>>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U))
>>> diff.remove('-')
>>> re.findall(u'(?:[^\W\d%s]|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U)
True

(I don't like the way I included the '-', but I couldn't find anything better.)
However, it looks like most of the time is spent in the findall itself, and a quick benchmark suggests that my regex is slower, even though it is shorter and compiles faster.
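
A rough way to check where the time goes is to time compilation and findall separately for both patterns. The sketch below is written for Python 3 and is not the benchmark used above. makew() is not shown in this message, so explicit_pattern is a hypothetical stand-in that simply lists every character the short pattern matches. LIMIT, time_compile and time_findall are also names made up for illustration, and the sample is cut to the first 0x3000 code points to keep the run short.

```python
import re
import sys
import timeit

# Assumption: a reduced sample of code points, not the full range used above.
LIMIT = min(0x3000, sys.maxunicode + 1)
s = ''.join(chr(c) for c in range(LIMIT))

# Short pattern: word characters that are not decimal digits.
short_pattern = r'[^\W\d]'

# Hypothetical stand-in for makew(): one explicit class that lists every
# character the short pattern matches, so both patterns are equivalent on s.
explicit_pattern = '[%s]' % ''.join(
    re.escape(c) for c in re.findall(short_pattern, s))


def time_compile(pattern, number=5):
    """Average seconds to compile pattern with the re cache cleared."""
    def run():
        re.purge()  # drop the cached compiled pattern so compile() does real work
        re.compile(pattern)
    return timeit.timeit(run, number=number) / number


def time_findall(pattern, number=5):
    """Average seconds for findall over s, with compilation excluded."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.findall(s), number=number) / number


results = {
    name: (time_compile(pattern), time_findall(pattern))
    for name, pattern in (('short', short_pattern),
                          ('explicit', explicit_pattern))
}

for name, (compile_time, findall_time) in results.items():
    print('%-8s compile: %.6fs  findall: %.6fs'
          % (name, compile_time, findall_time))
```

The absolute numbers depend on the interpreter version, the build and the size of the sample, so only the relative split between compile time and findall time for each pattern is worth reading from the output.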

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8064>
_______________________________________