[Tutor] Removing/Handling large blocks of text

Max Noel maxnoel_fr at yahoo.fr
Wed Dec 8 16:11:55 CET 2004


On Dec 8, 2004, at 14:42, Jesse Noller wrote:

> Hello,
>
> I'm trying to do some text processing with Python on a fairly large
> text file (actually, XML, but I am handling it as plaintext as all I
> need to do is find/replace/move) and I am having problems with trying
> to identify two lines in the text file, and remove everything in
> between those two lines (but not the two lines) and then write the
> file back (I know the file IO part).

	Okay, here are some hints: you need to track when you enter a <foo> 
block (at the opening tag) and when you leave it (at the closing </foo> 
tag), keeping in mind that both may happen on the same line (e.g. 
<foo>blah</foo>). The rest is trivial.
	The rest of your message is quoted below as spoiler space, in case you 
want to find the solution by yourself. A 17-line program that does the 
job is included at the end of this message. It prints the resulting 
file to standard output, for added flexibility: if you want the result 
in a file, just redirect stdout (python blah.py > out.txt).

	Oh, one last thing: don't use readlines(). It loads the whole file 
into memory at once (which hurts with big files), and you don't need it 
since you're reading the file sequentially. Iterate over the file 
object instead.
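
	To make that concrete, here is a minimal sketch, assuming Python 3. 
The helper name count_lines and the demo file are made up for 
illustration. Looping over the file object hands you one line at a 
time, so only the current line is held in memory, whereas readlines() 
would build a list of every line first.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating over the file object, one line at a time."""
    count = 0
    with open(path) as source:
        for line in source:  # lazy: a line is read only when the loop asks for it
            count += 1
    return count

# Small demonstration file; its name and contents are invented for this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as demo:
    demo.write('one\ntwo\nthree\n')

line_total = count_lines(demo_path)
print(line_total)
```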

> I'm trying to do this with the re module - the two tags look like:
>
> <foo>
>     ...
>     a bunch of text (~1500 lines)
>     ...
> </foo>
>
> I need to identify the first tag, and the second, and unconditionally
> strip out everything in between those two tags, making it look like:
>
> <foo>
> </foo>
>
> I'm familiar with using read/readlines to pull the file into memory
> and alter the contents via string.replace(str, newstr) but I am not
> sure where to begin with this other than the typical open/readlines.
>
> I'd start with something like:
>
> re1 = re.compile('^\<foo\>')
> re2 = re.compile('^\<\/foo\>')
>
> f = open('foobar.txt', 'r')
> for line in f.readlines():
>     match = re.match(re1, line)
>
> But I'm lost after this point really, as I can identify the two lines,
> but I am not sure how to do the processing.
>
> thank you
> -jesse


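#!/usr/bin/env python3

import re

reStart = re.compile(r'^\s*<foo>')
reEnd = re.compile(r'</foo>\s*$')

inBlock = False

with open('foobar.txt') as fileSource:
    for line in fileSource:
        # Leave the block first so that the </foo> line itself is printed.
        if reEnd.search(line): inBlock = False
        # end='' because each line already ends with its own newline.
        if not inBlock: print(line, end='')
        # Enter the block last so that the <foo> line itself is printed.
        if reStart.match(line) and not reEnd.search(line): inBlock = True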
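
	The same in-block/out-of-block idea can also be wrapped in a generator, 
which makes it easy to try on a few lines without touching a file. This 
is only a sketch, assuming Python 3. The name strip_block, its tag 
parameter and the sample data are made up for illustration. It keeps 
the two tag lines themselves and drops what lies between them, as the 
original question asked.

```python
import re

def strip_block(lines, tag='foo'):
    """Yield lines, dropping everything strictly between the <tag> and </tag> lines."""
    start = re.compile(r'^\s*<%s>' % re.escape(tag))
    end = re.compile(r'</%s>\s*$' % re.escape(tag))
    in_block = False
    for line in lines:
        if end.search(line):
            in_block = False   # leave first, so the closing-tag line is kept
        if not in_block:
            yield line
        if start.match(line) and not end.search(line):
            in_block = True    # enter last, the opening-tag line was already yielded

# Invented sample input standing in for the real file.
sample = ['<a>\n', '<foo>\n', 'junk 1\n', 'junk 2\n', '</foo>\n', '</a>\n']
result = list(strip_block(sample))
print(''.join(result), end='')
```

	Since an open file is itself an iterable of lines, it can be passed 
straight in as the lines argument. Note that a block opened and closed 
on a single line (<foo>blah</foo>) is passed through unchanged by this 
sketch.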



-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting 
and sweating as you run through my corridors... How can you challenge a 
perfect, immortal machine?"


