[Doc-SIG] References in the same line as the target text

Simon Budig Simon.Budig@unix-ag.uni-siegen.de
Fri, 5 Jul 2002 15:41:59 +0200


David Goodger (goodger@users.sourceforge.net) wrote:
> Simon Budig wrote:
> > The second point is closely connected to this. When looking at
> > Inline markup the parsing work is done by a class "Inliner". This is
> > dominated by a huge regular expression that matches to a lot of
> > different constructs. In my eyes it would be better to break this
> > apart in different regular expressions and test them in a sequence
> > (it might be necessary to remember which match starts first). An
> > extension could add a regular expression to that list instead of
> > having to replace a complicated regular expression with an even more
> > complicated regex.
> 
> The "Inliner" class has to use one large regular expression.  If we
> have some text like this::
> 
>     Here is an ``inline **literal**``.
> 
> If we check for "strong" (**) first, the result will be wrong.  No
> ordering would get it right for all constructs.  We have to check for
> each start-string simultaneously, because there are no precedence
> rules (almost); first occurrence from left to right in the text is the
> determinant.

This is what I meant when I said it might be necessary to remember
which match starts first. To emulate the behaviour of a big regex we
have to match against all regexes, check which one starts closest to
the beginning of the string and, if this is ambiguous, check which one
is the longest match.

Advantage: this would immediately tell us which construct matched.
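
A minimal sketch of this selection rule in Python. The PATTERNS table
and the names first_match and tokenize are made up for illustration,
and the patterns are heavily simplified; they are not the actual
Inliner rules. Note that Python's "re" tries the alternatives of one
big regex in the order they are written, so the longest-match
tie-break below follows the rule described above rather than what a
plain alternation would do.

```python
import re

# Hypothetical construct table: simplified illustrations only,
# not the real inline markup patterns.
PATTERNS = [
    ("literal", re.compile(r"``.+?``")),
    ("strong", re.compile(r"\*\*.+?\*\*")),
    ("emphasis", re.compile(r"\*.+?\*")),
]


def first_match(text, pos=0):
    """Return (name, match) for the leftmost match at or after pos.

    If several regexes start at the same position, the longest match
    wins. Returns None if nothing matches.
    """
    best = None
    for name, regex in PATTERNS:
        m = regex.search(text, pos)
        if m is None:
            continue
        # Smaller start wins; on equal start, the longer match wins.
        key = (m.start(), -(m.end() - m.start()))
        if best is None or key < best[0]:
            best = (key, name, m)
    if best is None:
        return None
    return best[1], best[2]


def tokenize(text):
    """Collect (construct name, matched text) pairs from left to right."""
    pos = 0
    found = []
    while True:
        hit = first_match(text, pos)
        if hit is None:
            break
        name, m = hit
        found.append((name, m.group()))
        pos = m.end()
    return found
```

With the example from the quoted text, tokenize("Here is an ``inline
**literal**``.") reports a single "literal" construct, because its
match starts before the "**". An extension would only have to append
one more (name, regex) pair to the list.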

> But that idea is close to the solution I'm thinking of.  My idea is to
> break up the one huge regexp into several lists of individual regexps,
> one list per construct/regexp type (find start-string only, find the
> whole construct, etc.), and join them dynamically into compound
> OR-groups, building the large regexp from components at runtime.
> Dynamic syntax directives can install new regexps and rebuild the
> master regexp.

The advantage of this approach is that it might be a bit faster, since
everything happens inside a single regular expression. It makes it a
bit harder to detect which regex actually matched. Of course this is
doable via ((?P<regex1>blablabla)|(?P<regex2>blu(?P<data>b*)lubb))
and then checking which of the named groups regex1 or regex2 matched.
It might be a problem because you have to be careful with the naming of
additional groups in the different regexes to avoid conflicts.
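
A rough sketch of the named-group variant in Python, using the
placeholder patterns from above. COMPONENTS, build_master and
which_construct are hypothetical names chosen for this example, not
existing code. It relies on "re" refusing to compile a pattern that
defines the same group name twice, which is how a naming conflict
between two components would show up.

```python
import re

# Hypothetical components, taken from the placeholder patterns above.
COMPONENTS = {
    "regex1": r"blablabla",
    "regex2": r"blu(?P<data>b*)lubb",
}


def build_master(components):
    """Join the components into one alternation of named groups."""
    parts = ["(?P<%s>%s)" % (name, pattern)
             for name, pattern in components.items()]
    # re.compile raises re.error if two components reuse a group name.
    return re.compile("|".join(parts))


def which_construct(master, components, text):
    """Return (name, match) for the component that matched, or None."""
    m = master.search(text)
    if m is None:
        return None
    for name in components:
        # A group that did not take part in the match is None.
        if m.group(name) is not None:
            return name, m
    return None
```

Installing a new construct then means adding an entry to the dict and
calling build_master again. If two entries both define a group called
"data", the rebuild fails with re.error, so the conflict is at least
detected early.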

Bye,
        Simon
-- 
      Simon.Budig@unix-ag.org       http://www.home.unix-ag.org/simon/