Parsing text
Bengt Richter
bokr at oz.net
Tue Dec 20 18:43:10 EST 2005
On 20 Dec 2005 08:06:39 -0800, "sicvic" <morange.victor at gmail.com> wrote:
>Not homework...not even in school (do any universities even teach
>classes using python?). Just not a programmer. Anyways I should
>probably be more clear about what I'm trying to do.
Ok, not homework.
>
>Since I can't show the actual output file, let's say I had an output file
>that looked like this:
>
>aaaaa bbbbb Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>----------------------------------------------
>aaaaa bbbbb Person: Sarah
>Current Location: San Diego
>Next Location: Miami
>Next Location: New York
>----------------------------------------------
>
>Now I want to put (and all recurrences of "Person: Jimmy")
>
>Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>
>in a file called jimmy.txt
>
>and the same for Sarah in sarah.txt
>
>The code I currently have looks something like this:
>
>import re
>import sys
>
>person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
>person_sarah = open('sarah.txt', 'w') #creates sarah.txt
>
>f = open(sys.argv[1]) #opens output file
>#loop that goes through all lines and parses specified text
>for line in f.readlines():
>    if re.search(r'Person: Jimmy', line):
>        person_jimmy.write(line)
>    elif re.search(r'Person: Sarah', line):
>        person_sarah.write(line)
>
>#closes all files
>
>person_jimmy.close()
>person_sarah.close()
>f.close()
>
>However, this only produces output files that look like this:
>
>jimmy.txt:
>
>aaaaa bbbbb Person: Jimmy
>
>sarah.txt:
>
>aaaaa bbbbb Person: Sarah
>
>My question is what else do I need to add (such as an embedded loop
>where the if statements are?) so the files look like this
>
>aaaaa bbbbb Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>
>and
>
>aaaaa bbbbb Person: Sarah
>Current Location: San Diego
>Next Location: Miami
>Next Location: New York
>
>
>Basically I need to add statements that after finding that line copy
>all the lines following it and stopping when it sees
>'----------------------------------------------'
>
>Any help is greatly appreciated.
>
Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. The '.txt' extension is hardcoded.
I provided a way to direct the output to a directory (not necessarily a
subdirectory, I guess); the directory has to exist already.
Not tested beyond what you see. Tweak to suit.
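The core trick, in case the full script obscures it: the outer and inner
for loops share one iterator, so the inner loop picks up right after the
start line and the outer loop resumes right after the stop line. A minimal
sketch of just that part, written in Python 3 syntax and working on
in-memory lines instead of files (split_segments is a name made up for
illustration only):

```python
import re

def split_segments(lines, start=r'Person:\s+(\w+)', stop='-' * 30):
    """Group lines into {name: [lines]} using one iterator shared by both loops."""
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    lineit = iter(lines)          # single iterator: both loops consume from it
    segments = {}
    for line in lineit:
        match = rxstart.search(line)
        if not match:
            continue              # skip anything outside a segment
        chunk = segments.setdefault(match.group(1).lower(), [])
        chunk.append(match.group(0))
        for data_line in lineit:  # continues where the outer loop left off
            if rxstop.search(data_line):
                break             # the stop mark itself is not kept
            chunk.append(data_line)
    return segments

sample = [
    'aaaaa bbbbb Person: Jimmy',
    'Current Location: Denver',
    'Next Location: Chicago',
    '-' * 46,
    'aaaaa bbbbb Person: Sarah',
    'Current Location: San Diego',
    'Next Location: Miami',
    'Next Location: New York',
    '-' * 46,
]
result = split_segments(sample)
print(result['jimmy'])
```

Using setdefault means a name that shows up more than once keeps
accumulating into the same list, which is the in-memory analogue of
opening the output file in append mode.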
----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
        if fout:
            fout.close()
            print 'file %r ended with source file EOF, not stop mark'%filename
    return files

def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
...irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print ' "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------
Running on your test data:
[15:19] C:\pywk\clp>md extracteddata
[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf
Files created:
"extracteddata\jimmy.txt"
"extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============
[15:20] C:\pywk\clp>md xd
[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----
Files created:
"xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============
[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----
Files created:
"xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============
[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"
Files created:
"xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============
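If the nested loops over one iterator look too tricky, the same job can
probably be done with a single loop and a variable that remembers whether
a segment is currently open. A rough sketch, in Python 3 syntax rather
than the 2.4 used above; extract_with_flag and the temp directory are
illustration only and not part of the script above:

```python
import os
import re
import tempfile

def extract_with_flag(lines, outdir, start=r'Person:\s+(\w+)', stop='-' * 30):
    """Single loop; fout is None outside a segment and an open file inside one."""
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    fout = None
    written = []
    for line in lines:
        if fout is None:
            match = rxstart.search(line)
            if match:
                path = os.path.join(outdir, match.group(1).lower() + '.txt')
                fout = open(path, 'a')   # append, so repeated names accumulate
                fout.write(match.group(0) + '\n')
                if path not in written:
                    written.append(path)
        elif rxstop.search(line):
            fout.close()                 # stop mark ends the segment, not written
            fout = None
        else:
            fout.write(line)
    if fout is not None:                 # source ended without a stop mark
        fout.close()
    return written

text = """\
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
----------------------------------------------
aaaaa bbbbb Person: Jimmy
Current Location: Chicago
----------------------------------------------
"""
outdir = tempfile.mkdtemp()
paths = extract_with_flag(text.splitlines(True), outdir)
with open(os.path.join(outdir, 'jimmy.txt')) as f:
    jimmy = f.read()
print(jimmy)
```

Here the second Jimmy segment lands in the same jimmy.txt because of the
append mode, which covers the "all recurrences" part of the question.
Unlike the script above, this sketch lists each output path only once.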
HTH, NO WARRANTIES ;-)
Regards,
Bengt Richter