problem negative lookahead assertion

Tue Apr 16 16:38:48 EDT 2002

Quoting sjoerd siebinga (ssiebinga at fa.knaw.nl):
> Thanks Geoff and Amk for your advise.
> 
> Sadly enough my input data requires nested {} due to specialist
> diacritics use.So the ([^}]*?) option gives more errors when
> processing the data through the latex interpreter. I used the
> ([\s,\W]) regex to grab all the curly braces in  the emph.
> 
> Now it is time to ask a silly question. I am a trained linguist and
> not a programmer.
> 
> > Note that this will break in a different situation, if you have {} inside  
> > the contents of \emph, as for example in \emph{ab^{cd}}, because this 
> > problem really needs a full parser to handle the general case.
> 
> What do you mean by a 'full parser'?

I think the way you translate that is "regular expressions aren't
going to be sufficient for what you're doing". You'd need a parser
that would look at each distinct token and/or character, and decide
what its disposition should be.

The problem is the nested {}'s. If you can't count on the first } you
encounter ending the \emph section, then you can't use [^}] to delimit
your \1.

I'm going to take some time and analyze your input document, but I
think the question you need to be asking right now is: how to
reasonably bound your \1 expression?

Let me give you an example, from your examples:

> `to wade through, ford', \emph{vadum} \index{unl~vadum}`ford'. It cannot

I believe this should be generated from a source line:

  `to wade through, ford', \emph{vadum}`ford'. It cannot

Now, what we want to retrieve is the "vadum" part.

For this particular case, it's easy:

  r"\\emph\{(\w+)\}"

This matches a string beginning with \emph{ followed by a group of
alphanumeric characters (and only alphanumeric characters), followed
by a closing brace }. The contents between the brace are extracted as
group 1.

However, this approach doesn't work if the "vadum" part -- the
emphasized term -- needs to be able to contain characters other than
alphanumerics. For example

> \emph{(ge)wolt}, \emph{(ge)walt}, \emph{ge]welt}

None of those would match, because (, ), and ] are not alphanumeric
characters.

So the next step is to try to figure out which characters are
permitted between the two enclosing { } brackets.

This line:

> OS \emph{gi]weldig}, OHG \emph{gi]welt\={\i}g}, \emph{gi]walt\={\i}g},

Is pretty revealing. From this, it looks like we can expect to see
just about anything. :) Further, it doesn't look like there's really
any way to say "a } followed by X" for any value of X where it will
(1) always match at the end of an \emph expression and (2) never
accidentally match in the middle of an \emph expression.

What you really need is a way to do brace-matching within your
expression, so that if \emph is followed by two {'s, you'll wait for
two }'s before you decide the \emph is over.

That is, you need to be able to look at:

> OHG \emph{gi]welt\={\i}g},

and you need to be able to extract gi]welt\={\i}g as the content of
the \emph expression, and to do that, you hafta know that the } after
the \i matches the { right before the \i, not the { after \emph.

I don't think you can't get this with a single regular expression, and
I don't believe that regular expressions in general are not
well-suited to this task. My certainty here is about 90%, and I only
express doubt because I may be ignorant of some techniques.

Below is a sort of pseudocode outline for how you would go about
solving this problem. It might not be the fastest or most efficient
way, but it's pretty readable and straightforward.

The complexity here is due to the fact that you can't trust your
pattern to identify only the conditions we are interested in. Instead,
we're thinking it's likely that the pattern will return lots of false
positives.

This is, by the way, some variation on the theme "a full parser".

- copy the input to a temporary string
- create an empty results buffer.

- while you still have characters in your temporary string:
  - if you match your current emph expression:
    - copy the characters up to the beginning of the match from the
      temp string to the results buffer
    - remove the characters up to the beginning of the match

    - extract group 0 (the whole match)
    - analyze it character-by-character to extract the contents of the
      \emph. this can be done by keeping a simple count of {'s and a
      count of }'s: when the two numbers are equal, you're at the end
      of the expression
    - if the next expression is an \index expression (check this
      here, this is necessary because your lookahead is not reliable):
      - copy all characters up to the start of \index into your "results" 
        buffer, with no transform. this was a false positive.
      - remove those characters from your temp string, and redo the loop.
    - if the next expression is not an \index expression:
      - put the \emph and the created \index into the results buffer.
        this is a true hit.
      - remove the characters up to the end of the \emph from the temp
        buffer.
  - if you don't match your expression, copy the remainder of your
    temporary string into the results buffer, and empty out your
    temporary string.

> As Geoff asked I will send  a couple of samples. My Python script is
> rather too large to be send to the newsgroup. If either of you want to
> see it, I can send it to you with a couple of sample lemmata. Another
> point is that the substitutions are preformed in turn much like an sed
> script. The emph regex is processed last.

I don't think we need more examples at this point, but one point that
would be helpful when picking your examples: pick one piece of input
text, preferably the shortest piece you can isolate that demonstrates
all the hard problems. Then include what happens when you process it,
and what you wish would happen when you processed it.

The examples you gave were pretty big (not a huge problem), and all
different, making it very difficult to compare what you have with what
you want.

-- 
Geoff Gerrietts             "There is no fate that cannot be 
<geoff at gerrietts net>     surmounted by scorn." --Albert Camus