[Tutor] Removing/Handling large blocks of text

Thu Dec 9 21:37:25 CET 2004

On Wed, 8 Dec 2004 15:11:55 +0000, Max Noel <maxnoel_fr at yahoo.fr> wrote:
> 
> 
> 
> On Dec 8, 2004, at 14:42, Jesse Noller wrote:
> 
> > Hello,
> >
> > I'm trying to do some text processing with Python on a fairly large
> > text file (actually, XML, but I am handling it as plaintext as all I
> > need to do is find/replace/move) and I am having problems with trying
> > to identify two lines in the text file, and remove everything in
> > between those two lines (but not the two lines) and then write the
> > file back (I know the file IO part).
> 
>         Okay, here are some hints: you need to identify when you enter a <foo>
> block and when you exit a </foo> block, keeping in mind that this may
> happen on the same line (e.g. <foo>blah</foo>). The rest is trivial.
>         The rest of your message is included as a spoiler space if you want to
> find the solution by yourself -- however, a 17-line program that does
> that is included at the end of this message. It prints the resulting
> file to the standard out, for added flexibility: if you want the result
> to be in a file, just redirect stdout (python blah.py > out.txt).
> 
>         Oh, one last thing: don't use readlines(), it uses up a lot of memory
> (especially with big files), and you don't need it since you're reading
> the file sequentially. Use the file iterator instead.
> 
> 
> 
> > I'm trying to do this with the re module - the two tags look like:
> >
> > <foo>
> >     ...
> >     a bunch of text (~1500 lines)
> >     ...
> > </foo>
> >
> > I need to identify the first tag, and the second, and unconditionally
> > strip out everything in between those two tags, making it look like:
> >
> > <foo>
> > </foo>
> >
> > I'm familiar with using read/readlines to pull the file into memory
> > and alter the contents via string.replace(str, newstr) but I am not
> > sure where to begin with this other than the typical open/readlines.
> >
> > I'd start with something like:
> >
> > re1 = re.compile('^\<foo\>')
> > re2 = re.compile('^\<\/foo\>')
> >
> > f = open('foobar.txt', 'r')
> > for line in f.readlines():
> >     match = re.match(re1, line)
> >
> > But I'm lost after this point really, as I can identify the two lines,
> > but I am not sure how to do the processing.
> >
> > thank you
> > -jesse
> 
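> #!/usr/bin/env python
> 
> import re
> 
> reStart = re.compile(r'^\s*<foo>')   # opening tag at the start of a line
> reEnd = re.compile(r'</foo>\s*$')    # closing tag at the end of a line
> 
> inBlock = False
> 
> fileSource = open('foobar.txt')
> 
> for line in fileSource:
>      if reEnd.search(line): inBlock = False  # leave the block first, so the </foo> line is kept
>      if not inBlock: print(line, end='')     # line already ends with a newline
>      if reStart.match(line) and not reEnd.search(line): inBlock = True  # enter only after printing <foo>
> 
> fileSource.close()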
> 
> -- Max
> maxnoel_fr at yahoo dot fr -- ICQ #85274019
> "Look at you hacker... A pathetic creature of meat and bone, panting
> and sweating as you run through my corridors... How can you challenge a
> perfect, immortal machine?"
> 
> 

Thanks a bunch for all of your fast responses, they helped a lot -
I'll post what I cook up back to the list as soon as I complete it.
Thanks!

-jesse
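
A minimal Python 3 sketch of the approach from the hints above: walk the file line by line, note when a <foo> block opens and closes, and drop only what lies between the two tag lines. It assumes the tags look like the <foo> example in this thread. The names strip_block, START, END and the sample text are hypothetical, chosen for illustration only. A line that opens and closes the block at once (<foo>blah</foo>) is passed through unchanged here, which may or may not be what you want.

```python
import io
import re

# Patterns assumed from the <foo> example in the thread.
START = re.compile(r'^\s*<foo>')   # opening tag at the start of a line
END = re.compile(r'</foo>\s*$')    # closing tag at the end of a line


def strip_block(lines):
    """Yield lines, dropping everything strictly between the tag lines."""
    in_block = False
    for line in lines:
        if in_block:
            # Inside the block: emit nothing until the closing tag shows up.
            if END.search(line):
                in_block = False
                yield line
            continue
        yield line
        # Enter the block only if it does not also close on this same line.
        if START.match(line) and not END.search(line):
            in_block = True


# Stand-in for an open file; any iterable of lines works the same way.
sample = io.StringIO("<a>\n<foo>\nline 1\nline 2\n</foo>\n<b>\n")
result = "".join(strip_block(sample))
print(result, end="")
```

Because strip_block only iterates, it can be fed an open file object directly, so the whole file never has to sit in memory the way it would with readlines().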