[Tutor] problems with re module

Sat Nov 15 03:04:19 EST 2003

Hi Guys,

I'm trying to write a function that searches through a string of plain text
that may (or may not) contain some tags which look like this:

<Graphics file: pics/PCs/barbar2.jpg>

and replace those tags with docbook markup, which looks like this:

<graphic srccredit="Fix Me!" fileref='pics/PCs/barbar2.jpg' />
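
For illustration, the conversion described above can be sketched with a single
re.sub() call. This is only a rough sketch under a few assumptions: the names
GRAPHICS_TAG and convert_tags are made up here, a file name is assumed never
to contain a '>' character, and single quotes are used around both attribute
values.

```python
import re

# Hypothetical names for this sketch. The group captures the file name,
# i.e. everything between "file:" and the closing '>'.
GRAPHICS_TAG = re.compile(r"<Graphics\s+file:\s+([^>]*?)\s*>")

def convert_tags(text):
    """Replace every <Graphics file: ...> tag with a docbook <graphic> element."""
    # \1 in the replacement refers to the captured file name.
    return GRAPHICS_TAG.sub(r"<graphic srccredit='Fix Me!' fileref='\1' />", text)

print(convert_tags("see <Graphics file: pics/PCs/barbar2.jpg> here"))
```

Because re.sub() replaces every non-overlapping match from left to right,
text with several tags is handled in one pass.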

I'm using the re module, and a recursive algorithm to find and replace the 
offending strings, but I'm getting very weird results... I've tried to nut 
this out for the last 3-4 hours, but can't seem to get anywhere with it...

Here's the code:

---------------------------------
import re

def processcol(message):
    """This procedure takes a column text as an argument, and returns the
    same text, without any illegal characters for XML. It even does a bit
    of text tidying"""

    message = message.replace('\n', ' ')
    message = message.replace('\t', ' ')

    m = re.search(r"<Graphics\s+file:\s+", message)  # search for the starting tag
    if m:
        start, end = m.span()
        cstart, cend = re.search(r">", message).span()
        fname = message[end:cstart - 1]
        message = message[:start] + "<graphic srccredit='Fix Me!' fileref='%s' />" % (fname) + message[cend:]
        return processcol(message[cend:])
    else:
        return message

-----------------------------------

There's some really simple reason why this doesn't go, but I can't quite put 
my finger on it... There were a whole raft of debugging print statements, but 
I removed them for your sanity ;)

What's *meant* to happen:

A string which may contain the offending tags gets passed to the processcol()
function. A few simple cleanup operations are performed (replacing newlines
and tabs with spaces).

Then, if a bad tag is found, the index where the tag starts is recorded, as
well as where the tag ends. The filename is extracted, and the bad tag is
replaced. Because the regex search goes from left to right, we now pass
the string to the right of the tag we have just fixed back to ourselves. This
means that if there were two bad tags, one after the other, the left-hand one
would be fixed first, and then the right-hand one.

If no bad tags are found, the message is returned.
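
The steps described above can be sketched as follows. This is a hedged
illustration rather than a definitive fix: the name processcol_sketch is made
up, and the sketch assumes a tag ends at the first '>' after its opening
"<Graphics file:" part. Three things are done explicitly here: the closing
'>' is looked for only after the end of the opening match, the filename slice
runs right up to that '>', and the text already processed is kept and joined
to the result of the recursive call on the remainder.

```python
import re

def processcol_sketch(message):
    """Replace <Graphics file: ...> tags with docbook markup, left to right."""
    # Tidy up: newlines and tabs become spaces.
    message = message.replace('\n', ' ').replace('\t', ' ')

    m = re.search(r"<Graphics\s+file:\s+", message)
    if not m:
        return message                  # no bad tag found: return text as-is

    start, end = m.span()
    close = message.find('>', end)      # search for '>' only after the opening part
    if close == -1:
        return message                  # unterminated tag: leave the text alone

    fname = message[end:close].strip()  # everything between "file:" and '>'
    fixed = "<graphic srccredit='Fix Me!' fileref='%s' />" % fname

    # Keep the text to the left, then recurse on the text to the right.
    return message[:start] + fixed + processcol_sketch(message[close + 1:])

print(processcol_sketch("a <Graphics file: x.jpg> b <Graphics file: y.jpg> c"))
```

Each recursive call handles one tag, so the recursion depth grows with the
number of tags in the string; for very long inputs a loop may be a safer
choice.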

Can anyone here help me get this going properly?

-- 
Thomi Richards,
http://once.sourceforge.net/
