Guru advice needed for mxTextTools

Mon Jun 3 17:19:53 EDT 2002

I'm curious: if you have an EBNF grammar, why not just go ahead and use 
it with SimpleParse to generate the tag-table?  As for the two-line 
parsing, just define the grammar so that each match covers a pair of lines:

file := (match/fail)+

match := properHeaderLine, properContentLine
fail := -[\n]*,"\n",-[\n]*,"\n"? # two lines, second newline optional

<properHeaderLine> := whateveryouneedtotest
properContentLine := (group/-("\n"/STOPCHAR))*, "\n"?

group := STARTCHAR,(group/-("\n"/STOPCHAR))*,STOPCHAR

(That's untested, of course ;) .  Grouping code is always a pain to 
write without having corner cases fry you :) ).
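
For comparison, here is a minimal plain-Python sketch of what the group 
production is meant to recognise at any nesting depth.  It is not a 
SimpleParse grammar or an mx.TextTools tag table, and find_groups and its 
parameters are names made up for illustration.  Unlike the regular 
expression quoted below, it also accepts empty groups such as "?!", and it 
quietly drops delimiters that never pair up.

```python
def find_groups(line, start="?", stop="!"):
    """Return every balanced start...stop group in line, ordered by
    the position of its opening delimiter (outer groups before inner)."""
    open_positions = []   # stack of indices of not-yet-closed start chars
    found = []            # (opening index, group text)
    for i, ch in enumerate(line):
        # When prefix and suffix are the same character, a second
        # occurrence closes the open group instead of nesting.
        if ch == start and not (start == stop and open_positions):
            open_positions.append(i)
        elif ch == stop and open_positions:
            begin = open_positions.pop()
            found.append((begin, line[begin:i + 1]))
        # A stop char with nothing open is ignored; groups still open
        # at the end of the line are dropped.
    return [text for _, text in sorted(found)]
```

On the sample line from the original post this should give the same 
groups as the filtered findall() result, without any limit on depth.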

2MB is a small file, so shouldn't take long to parse using mx.TextTools, 
and it's normally much faster to just slurp that size of file into 
memory and process the whole thing in the mx.TextTools C loop.  I often 
worked with 5-6MB VRML files (far more complex grammar generating large 
numbers of in-memory nodes) and never had a problem with speed there.
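
As a rough illustration of the "slurp it all" approach in plain Python 
(not mx.TextTools), the header/content pairing can be done on the whole 
text at once without tracking line numbers.  The function name is made 
up for this sketch, and the header values 'x' and 'y' are simply the 
ones from the quoted code below.

```python
def wanted_content_lines(text, headers=("x", "y")):
    """Pair up consecutive lines as (header, content) and return the
    stripped content lines whose header is one of the wanted values."""
    lines = text.split("\n")
    # lines[0::2] are the header lines, lines[1::2] the content lines;
    # zip() stops at the shorter list, so a trailing unpaired line is ignored.
    pairs = zip(lines[0::2], lines[1::2])
    return [content.strip() for header, content in pairs
            if header.strip() in headers]
```

With a real file, text would come from something like 
open(filename).read(), which is cheap at 2MB.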

Note: the above grammar will silently ignore improperly formatted 
content lines after a properHeaderLine.  Marc-Andre has added support 
that would allow raising an error in that case, but I've been busy 
elsewhere and don't yet support it in SimpleParse.
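
If you need an error rather than silence in the meantime, one option is 
a separate check on each content line.  The sketch below is plain Python, 
not the mx.TextTools error support mentioned above.  check_content_line 
is a hypothetical helper, and it assumes the prefix and suffix are 
different characters.

```python
def check_content_line(line, start="?", stop="!"):
    """Raise ValueError if the start/stop delimiters in line do not
    balance.  Assumes start and stop are different characters."""
    depth = 0  # number of groups currently open
    for pos, ch in enumerate(line):
        if ch == start:
            depth += 1
        elif ch == stop:
            if depth == 0:
                raise ValueError("unexpected %r at column %d" % (stop, pos))
            depth -= 1
    if depth:
        raise ValueError("%d group(s) left unclosed" % depth)
```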

HTH,
Mike

Pekka Niiranen wrote:
> I am trying to optimize a function that searches for nested strings in
> a set of (almost) flat files (about 2 MB each). If I use regular
> expressions, I must fix the amount of nesting:
> 
<head-hurting code deleted>
...
> 
> The code above supports two levels of nesting. If the prefix is "?"
> and the suffix is "!", then it will evaluate to:
> 
> 
> >>> pattern = re.compile("(\?[^?!]+(\?[^?!]+\!)*[^?!]+\!)")
> >>> Line = "?AA?BB!CC!?DD!ee?EE!ff?FF?GG!HH!"
> >>> print re.findall(pattern, Line)
> [('?AA?BB!CC!', '?BB!'), ('?DD!', ''), ('?EE!', ''), ('?FF?GG!HH!',
> '?GG!')]
> 
...
> 2)    The file is not flat: I also need to check the contents of the
>        previous line. If the previous line does not contain the correct
>        value, I do not have to run the regular expression on the
>        current line:
>             for i in range(1,len(lines),2):
>                 test = lines[i-1].strip()
>                 if (test == 'x' or test == 'y'):
>                     matches = re.findall(pattern, lines[i].strip())
>                     if matches:
>                         # Remove empty results with filters
>                         pars = filter(operator.truth, reduce(operator.add, matches))
> 
> 3)    Amount of nesting may vary in the future
...
> I have thought of EBNF notation, which should be supported by
> SimpleParse + mxTextTools
> 
> Questions are:
> 
> 1)    What is the mxTextTools tag table for the regular expression
>        above, extended to unlimited nesting? If the suffix is the same
>        as the prefix, no nesting is assumed.
> 2)    Is it possible to parse the file without keeping track of the
>        current line number, since the values to be checked are always
>        on odd line numbers and the regular expression is always run on
>        even line numbers? If I could read two lines at a time and parse
>        them both together (as a single line) with mxTextTools (with
>        lookAhead or whatever), could I gain some speed?
> 3)    Should I look for examples in XML tools instead, or write my own
>        parser with C + SWIG?
...
_______________________________________
   Mike C. Fletcher
   http://members.rogers.com/mcfletch/