[Tutor] rtf to xml regexp question

Paul Tremblay phthenry@earthlink.net
Sat, 8 Jun 2002 18:23:50 -0400


I'm responding to my own question. I've done a lot of research
over the past few days, and realize that for my needs I really
needed a parser or state machine. 

As with many things in programming and Linux, it is always
frustrating at first. Then once you understand, you think "Ah,
that is a much simpler solution!" 

That is how I feel about state machines. I have been killing
myself with regular expressions. No more! 

Paul

PS I've included my code below, in case anyone is interested. The
only problem I am having at this point is that the state machine
is putting a space after my "{", but I think I can write a
subroutine to strip trailing spaces.
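
One likely source of that space is not the scanner itself but the
print statement at the end of the script: in Python 2, "print x,"
with a trailing comma puts a space between successive items. A
minimal sketch of writing the token values back to back instead is
below. The join_tokens helper and the sample list are made up for
illustration; they stand in for the token[0] values that
scanner.read() returns.

```python
import sys


def join_tokens(values):
    """Concatenate token values with no separator between them."""
    return "".join(values)


# Hypothetical token values, in the order the scanner might produce them.
sample = ["##Footnote", " see ", "{", "bold", "}", "##END OF FOOTNOTE##"]

# sys.stdout.write adds neither a space nor a newline of its own.
sys.stdout.write(join_tokens(sample) + "\n")
```

With this approach there is nothing to strip afterwards, since no
extra space is ever written.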

PPS Originally this message was a plea for help. My regular
expressions were so greedy that they were overlapping each other.
But as soon as I started writing this email, the solution came to
me!
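
To illustrate the kind of trouble nested braces cause, here is a
small sketch. The sample string and both patterns are invented for
this example and are not the expressions from my original script.
The greedy pattern swallows everything up to the last "}", and the
non-greedy one stops at the first "}", which is the inner one.

```python
import re

# Hypothetical RTF-like input with a nested group inside a footnote.
sample = r"a {\footnote one {bold} two} b {\footnote three} c"

# Greedy: ".*" runs to the last "}" in the string, merging both footnotes.
greedy = re.findall(r"\{\\footnote.*\}", sample)

# Non-greedy: ".*?" stops at the first "}", cutting the first footnote short.
lazy = re.findall(r"\{\\footnote.*?\}", sample)

print(greedy)
print(lazy)
```

Neither pattern can count braces, which is why a nesting counter
ends up being the simpler tool here.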


*******************************
#!/usr/bin/python

from Plex import *
##from Plex.Traditional import re


class MyScanner(Scanner):

	def begin_footnote(self, text):
		self.produce('##Footnote', '')
		if self.nesting_level == 0:
			self.begin('footnote')
		self.nesting_level = self.nesting_level + 1

	def end_something(self, text):
		self.nesting_level = self.nesting_level - 1
		if self.nesting_level == 0:
			self.produce('##END OF FOOTNOTE##','')
			self.begin('')
		else:
			self.produce('}','')

	def begin_open_bracket(self, text):
		self.produce('{','')
		self.nesting_level = self.nesting_level + 1

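	# One or more characters that are not braces.
	string = Rep1(AnyBut("{}"))

	lexicon = Lexicon([
		# Default state: a footnote opener switches to the 'footnote' state.
		(Str(r"{\footnote"),     begin_footnote),
		
		# These patterns apply only while inside a footnote.
		State('footnote', [
			(Str(r"{\footnote"), begin_footnote),
			(Str("}"), end_something),
			(Str(r"{"), begin_open_bracket),
			(string,	  TEXT)
		]),
		# Default state: pass everything else through unchanged.
		(string,		TEXT),
		(Str("{"), 	TEXT),
		(Str("}"),	TEXT),
	])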

	def __init__(self, file, name):
		Scanner.__init__(self, self.lexicon, file, name)
		self.nesting_level = 0

filename = "/home/paul/paultemp/my_file.txt"
file = open(filename, "r")

scanner = MyScanner(file, filename)
while 1:
	token = scanner.read()
	if token[0] is None:
		break
	print token[0],
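
For anyone without Plex installed, the same nesting-counter idea
can be sketched in plain Python. This is only an approximation of
the scanner above, not a replacement for it. The names
mark_footnotes and FOOTNOTE_OPEN are made up for this sketch. One
difference from the Plex version: braces outside a footnote are
kept inside the surrounding text token instead of being returned
as tokens of their own.

```python
FOOTNOTE_OPEN = "{\\footnote"


def mark_footnotes(text):
    """Return a list of tokens with footnote groups marked."""
    tokens = []
    buf = []     # plain text collected since the last token
    depth = 0    # 0 means we are outside any footnote
    i = 0

    def flush():
        # Emit the collected plain text, if any, as one token.
        if buf:
            tokens.append("".join(buf))
            del buf[:]

    while i < len(text):
        if text.startswith(FOOTNOTE_OPEN, i):
            flush()
            tokens.append("##Footnote")
            depth += 1
            i += len(FOOTNOTE_OPEN)
        elif depth > 0 and text[i] == "{":
            flush()
            tokens.append("{")
            depth += 1
            i += 1
        elif depth > 0 and text[i] == "}":
            flush()
            depth -= 1
            # The brace that brings the depth back to 0 closes the footnote.
            tokens.append("##END OF FOOTNOTE##" if depth == 0 else "}")
            i += 1
        else:
            buf.append(text[i])
            i += 1
    flush()
    return tokens


print(mark_footnotes("a {\\footnote one {bold} two} b"))
```

The counter is what a regular expression lacks: the closing brace
is treated as the end of the footnote only when the depth returns
to zero.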

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************