[Python-Dev] Regular expressions, Unicode etc.
Mike Klaas
mike.klaas at gmail.com
Wed Aug 8 22:29:50 CEST 2007
In 8-Aug-07, at 12:47 PM, Nick Maclaren wrote:
>
>>> The other approach, which is to stick to true regular expressions,
>>> and wholly or partially convert to DFAs, has already been rendered
>>> impossible by even the limited Perl/PCRE extensions that Python
>>> has adopted.
>>
>> Impossible? Surely, a sufficiently-competent re engine could detect
>> when a DFA is possible to construct?
>
> I doubt it. While it isn't equivalent to the halting problem, it IS
> an intractable one! There are two problems:
>
> Firstly, things like backreferences are an absolute no-no. They
> are not regular, and REs with them in cannot be converted to DFAs.
> That could be 'solved' by a parser that kicked out such constructions,
> but it would get screams from many users.
>
> Secondly, anything involving explicit or implicit negation can lead
> to (if I recall) a super-exponential explosion in the size of the
> DFA. That could be 'solved' by imposing a limit, but few people
> would be able to predict when it would bite.
Right. The analysis I envisioned would be more along the lines of
"if troublesome RE extensions are used, do not attempt to construct a
DFA". It could even be exposed via an alternate api (re.compile_dfa
()) that admitted a subset of the usual grammar.
> Thirdly, I would require notice of the question of whether capturing
> parentheses could be supported, and what semantic changes would be
> to which were set and how.
Capturing groups are rather integral to the python regex api and, as
you say, a major difficulty for DFA-based implementations. Sounds
like a task best left to a thirdparty package.
-Mike
More information about the Python-Dev
mailing list