[Tutor] rtf to xml regexp question
Paul Tremblay
phthenry@earthlink.net
Sat, 8 Jun 2002 18:23:50 -0400
I'm responding to my own question. I've done a lot of research over
the past few days, and realize that for my needs I really need
a parser or state machine.
As with many things in programming and Linux, it is always
frustrating at first. Then once you understand, you think "Ah,
that is a much simpler solution!"
That is how I feel about state machines. I have been killing
myself with regular expressions. No more!
Paul
PS I've included my code below, in case anyone is interested. The
only problem I am having at this point is that the state machine
is putting a space after my "{", but I think I can write a
subroutine to strip trailing spaces.
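
A possible way around the stray space, sketched under an assumption: the
space probably comes from the Python 2 statement "print token[0]," rather
than from the scanner, since a trailing comma makes print put a space
between successive items. Writing each token directly to the output stream
avoids that. The helper name write_tokens and the sample tokens are
hypothetical, and the sketch uses Python 3 syntax.

```python
import io


def write_tokens(tokens, out):
    """Write tokens back to back, with no separator between them."""
    for tok in tokens:
        out.write(tok)


# Collect the output in memory so it can be inspected.
buf = io.StringIO()
write_tokens(["{", "\\i this", "}"], buf)
result = buf.getvalue()
```

In the script itself the same idea would be to pass sys.stdout as the
output stream instead of an in-memory buffer.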
PPS Originally this message was a plea for help. My regular
expressions were so greedy that they were overlapping each other.
But as soon as I started writing this email, the solution came to
me!
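
To illustrate the kind of overlap described above, here is a small sketch.
The patterns and sample text are assumptions made for the example (the
original regular expressions are not shown in this message). It shows a
greedy match swallowing two footnotes, and a non-greedy match that still
fails once groups are nested.

```python
import re

text = r"one {\footnote first} two {\footnote second} three"

# Greedy: .* runs to the LAST closing brace, so the two footnotes merge.
greedy = re.findall(r"\{\\footnote.*\}", text)

# Non-greedy: .*? stops at the first closing brace after each opener.
lazy = re.findall(r"\{\\footnote.*?\}", text)

# Non-greedy still breaks on nested groups: it stops at the inner brace,
# which is why counting the nesting level is needed.
nested = re.findall(r"\{\\footnote.*?\}", r"{\footnote see {\i this} note}")
```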
*******************************
#!/usr/bin/python
from Plex import *
##from Plex.Traditional import re

class MyScanner(Scanner):

    def begin_footnote(self, text):
        # Mark the start of a footnote and enter the 'footnote' state
        # if we are not already inside one.
        self.produce('##Footnote', '')
        if self.nesting_level == 0:
            self.begin('footnote')
        self.nesting_level = self.nesting_level + 1

    def end_something(self, text):
        # A "}" closes either the footnote itself or a nested group.
        self.nesting_level = self.nesting_level - 1
        if self.nesting_level == 0:
            self.produce('##END OF FOOTNOTE##', '')
            self.begin('')    # back to the default state
        else:
            self.produce('}', '')

    def begin_open_bracket(self, text):
        # A "{" inside a footnote opens a nested group.
        self.produce('{', '')
        self.nesting_level = self.nesting_level + 1

    # One or more characters that are not braces.
    string = Rep1(AnyBut("{}"))

    lexicon = Lexicon([
        (Str(r"{\footnote"), begin_footnote),
        State('footnote', [
            (Str(r"{\footnote"), begin_footnote),
            (Str("}"), end_something),
            (Str(r"{"), begin_open_bracket),
            (string, TEXT)
        ]),
        # Outside a footnote, everything passes through unchanged.
        (string, TEXT),
        (Str("{"), TEXT),
        (Str("}"), TEXT),
    ])

    def __init__(self, file, name):
        Scanner.__init__(self, self.lexicon, file, name)
        self.nesting_level = 0

filename = "/home/paul/paultemp/my_file.txt"
file = open(filename, "r")
scanner = MyScanner(file, filename)
while 1:
    token = scanner.read()
    if token[0] is None:    # end of input
        break
    # The trailing comma makes print put a space between tokens.
    print token[0],
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************