[Tutor] Removing/Handling large blocks of text
Max Noel
maxnoel_fr at yahoo.fr
Wed Dec 8 16:11:55 CET 2004
On Dec 8, 2004, at 14:42, Jesse Noller wrote:
> Hello,
>
> I'm trying to do some text processing with python on a fairly large
> text file (actually, XML, but I am handling it as plaintext as all I
> need to do is find/replace/move) and I am having problems with trying
> to identify two lines in the text file, and remove everything in
> between those two lines (but not the two lines) and then write the
> file back (I know the file IO part).
Okay, here are some hints: you need to identify when you enter a <foo>
block and when you exit a </foo> block, keeping in mind that this may
happen on the same line (e.g. <foo>blah</foo>). The rest is trivial.
The rest of your message is quoted below as spoiler space, in case you
want to find the solution by yourself. A 17-line program that does the
job is included at the end of this message. It prints the resulting
file to standard output, for added flexibility: if you want the result
to be in a file, just redirect stdout (python blah.py > out.txt).
Oh, one last thing: don't use readlines(). It reads the whole file into
memory at once (which hurts with big files), and you don't need it
since you're reading the file sequentially. Iterate over the file
object instead.
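To illustrate the point about the file iterator, here is a minimal
sketch. The helper name count_matching_lines and its parameters are
made up for this example; the only thing it relies on is that looping
over a file object yields one line at a time, so the whole file never
has to sit in memory as a list.

```python
def count_matching_lines(path, needle):
    """Count the lines of `path` that contain `needle`.

    Hypothetical helper, for illustration only.
    """
    count = 0
    with open(path) as source:
        for line in source:  # the file object yields one line at a time
            if needle in line:
                count += 1
    return count
```

Something like count_matching_lines('foobar.txt', '<foo>') would then
tell you how many blocks open in the file, however large it is.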
> I'm trying to do this with the re module - the two tags looks like:
>
> <foo>
> ...
> a bunch of text (~1500 lines)
> ...
> </foo>
>
> I need to identify the first tag, and the second, and unconditionally
> strip out everything in between those two tags, making it look like:
>
> <foo>
> </foo>
>
> I'm familiar with using read/readlines to pull the file into memory
> and alter the contents via string.replace(str, newstr) but I am not
> sure where to begin with this other than the typical open/readlines.
>
> I'd start with something like:
>
> re1 = re.compile('^\<foo\>')
> re2 = re.compile('^\<\/foo\>')
>
> f = open('foobar.txt', 'r')
> for line in f.readlines():
>     match = re.match(re1, line)
>
> But I'm lost after this point really, as I can identify the two lines,
> but I am not sure how to do the processing.
>
> thank you
> -jesse
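#!/usr/bin/env python
import re
import sys

# <foo> opens a line (after optional whitespace); </foo> closes a line.
reStart = re.compile(r'^\s*<foo>')
reEnd = re.compile(r'</foo>\s*$')
inBlock = False
with open('foobar.txt') as fileSource:
    for line in fileSource:
        if inBlock and reEnd.search(line):
            inBlock = False  # closing-tag line: print it
        if not inBlock:
            sys.stdout.write(line)  # line already ends with a newline
        # A one-line <foo>blah</foo> is printed unchanged.
        if reStart.match(line) and not reEnd.search(line):
            inBlock = True  # skip everything after the opening-tag line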
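
If you also want to empty a block that opens and closes on the same
line (<foo>blah</foo>), one possible variant is sketched below. The
function name strip_foo_blocks is hypothetical, and the sketch assumes
that blocks are never nested and that at most one block starts on any
given line. It works on any iterable of lines, so it can be fed a file
object directly.

```python
import re

START = re.compile(r'<foo>')
END = re.compile(r'</foo>')


def strip_foo_blocks(lines):
    """Yield lines with everything between <foo> and </foo> removed.

    The tags themselves are kept. Assumes no nesting and at most one
    block starting per line (illustrative sketch only).
    """
    in_block = False
    for line in lines:
        if not in_block:
            start = START.search(line)
            if start is None:
                yield line  # ordinary line outside any block
                continue
            end = END.search(line, start.end())
            if end is not None:
                # <foo>blah</foo> on one line: keep the tags, drop "blah"
                yield line[:start.end()] + line[end.start():]
            else:
                # block opens here: keep text up to and including <foo>
                yield line[:start.end()] + "\n"
                in_block = True
        else:
            end = END.search(line)
            if end is not None:
                # block closes here: keep </foo> and whatever follows it
                yield line[end.start():]
                in_block = False
```

To use it on the file, pass the open file object to the generator and
hand the result to sys.stdout.writelines(); since it is a generator,
the file is still read one line at a time.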
-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting
and sweating as you run through my corridors... How can you challenge a
perfect, immortal machine?"