[Tutor] Removing/Handling large blocks of text
Jesse Noller
jnoller at gmail.com
Thu Dec 9 21:37:25 CET 2004
On Wed, 8 Dec 2004 15:11:55 +0000, Max Noel <maxnoel_fr at yahoo.fr> wrote:
>
>
>
> On Dec 8, 2004, at 14:42, Jesse Noller wrote:
>
> > Hello,
> >
> > I'm trying to do some text processing with Python on a fairly large
> > text file (actually XML, but I am handling it as plain text, since all
> > I need to do is find/replace/move). I am having trouble identifying
> > two lines in the file, removing everything between them (but not the
> > two lines themselves), and then writing the file back (I know the
> > file I/O part).
>
> Okay, here are some hints: you need to identify when you enter a <foo>
> block and when you exit a </foo> block, keeping in mind that this may
> happen on the same line (e.g. <foo>blah</foo>). The rest is trivial.
> The rest of your message is included as a spoiler space if you want to
> find the solution by yourself -- however, a 17-line program that does
> that is included at the end of this message. It prints the resulting
> file to standard output, for added flexibility: if you want the result
> in a file, just redirect stdout (python blah.py > out.txt).
>
> Oh, one last thing: don't use readlines(). It reads the whole file into
> memory at once (costly with big files), and you don't need it since
> you're reading the file sequentially. Use the file iterator instead.
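
A minimal sketch of the file-iterator advice above, for anyone reading
this in the archive. The function name count_tag_lines and its tag
parameter are made up for illustration; only open() and the re module
are real. Looping over the file object yields one line at a time, so the
whole file never has to sit in memory the way it does with readlines().

```python
import re


def count_tag_lines(path, tag="foo"):
    """Count lines that start with <tag>, reading one line at a time."""
    opening = re.compile(r"^\s*<%s>" % re.escape(tag))
    count = 0
    with open(path) as source:
        for line in source:  # the file object yields lines lazily
            if opening.match(line):
                count += 1
    return count
```

Something like count_tag_lines('foobar.txt') would then report how many
<foo> blocks start at the beginning of a line.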
>
>
>
> > I'm trying to do this with the re module - the two tags look like:
> >
> > <foo>
> > ...
> > a bunch of text (~1500 lines)
> > ...
> > </foo>
> >
> > I need to identify the first tag, and the second, and unconditionally
> > strip out everything in between those two tags, making it look like:
> >
> > <foo>
> > </foo>
> >
> > I'm familiar with using read/readlines to pull the file into memory
> > and alter the contents via string.replace(str, newstr) but I am not
> > sure where to begin with this other than the typical open/readlines.
> >
> > I'd start with something like:
> >
> > import re
> >
> > re1 = re.compile(r'^<foo>')
> > re2 = re.compile(r'^</foo>')
> >
> > f = open('foobar.txt', 'r')
> > for line in f.readlines():
> >     match = re1.match(line)
> >
> > But I'm lost after this point really, as I can identify the two lines,
> > but I am not sure how to do the processing.
> >
> > thank you
> > -jesse
>
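> #!/usr/bin/env python
>
> import re
> import sys
>
> reStart = re.compile(r'^\s*<foo>')   # opening tag at the start of a line
> reEnd = re.compile(r'</foo>\s*$')    # closing tag at the end of a line
> inBlock = False
>
> fileSource = open('foobar.txt')
>
> for line in fileSource:
>     if reEnd.search(line): inBlock = False   # search(): tag may follow text
>     if not inBlock: sys.stdout.write(line)   # write() avoids doubled newlines
>     if reStart.match(line) and not reEnd.search(line): inBlock = True
>
> fileSource.close()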
>
> -- Max
> maxnoel_fr at yahoo dot fr -- ICQ #85274019
> "Look at you hacker... A pathetic creature of meat and bone, panting
> and sweating as you run through my corridors... How can you challenge a
> perfect, immortal machine?"
>
>
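
For the archive, here is a rough sketch of the hints above: track when a
<foo> block is entered and left, keep both tag lines, and cope with a
block that opens and closes on one line. The names strip_blocks and
strip_file are hypothetical, and the sketch assumes the blocks are never
nested. It treats the XML as plain text, as described in the original
question.

```python
import re


def strip_blocks(lines, tag="foo"):
    """Yield lines with everything between <tag> and </tag> removed.

    The tags themselves are kept. Assumes blocks are not nested.
    """
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    # Non-greedy match for a block that opens and closes on one line.
    same_line = re.compile(re.escape(open_tag) + ".*?" + re.escape(close_tag))
    in_block = False
    for line in lines:
        if in_block:
            end = line.find(close_tag)
            if end == -1:
                continue          # still inside the block: drop the line
            line = line[end:]     # keep the closing tag and what follows
            in_block = False
        # Empty any block that opens and closes on this line.
        line = same_line.sub(lambda m: open_tag + close_tag, line)
        start = line.rfind(open_tag)
        if start != -1 and close_tag not in line[start:]:
            # A block opens here and continues on later lines.
            line = line[:start + len(open_tag)] + "\n"
            in_block = True
        yield line


def strip_file(src_path, dst_path, tag="foo"):
    """Copy src_path to dst_path, emptying every <tag> block on the way."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_blocks(src, tag))
```

Calling strip_file('foobar.txt', 'out.txt') would write the stripped copy
to a second file instead of standard output. Since strip_blocks is a
generator, only one line is held in memory at a time.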
Thanks a bunch for all of your fast responses; they helped a lot.
I'll post what I cook up back to the list as soon as I complete it.
Thanks!
-jesse