[Tutor] Should I use python for parsing text
kent37 at tds.net
Wed Mar 21 11:29:58 CET 2007
Jay Mutter III wrote:
> "Jay Mutter III" <jmutter at uakron.edu> wrote
>> See example next:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
>> Jan. 27 ; v. 270 ; p. 554.
>> For instance, I would like to go to end of line and if last
>> character is a comma or semicolon or hyphen then
>> remove the CR.
> It would look something like:
> output = open('example.fixed','w')
> for line in file('example.txt'):
>     if line[-1] in ',;-': # check last character
The last character will always be a newline; try
    if len(line) > 1 and line[-2] in ';,-':
>         line = line.strip() # lose the C/R
This will also lose any leading or trailing whitespace. line.rstrip()
would be safer.
>         output.write(line) # write to output
>     else: output.write(line) # append the next line complete with C/R
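Putting the two corrections above together, a minimal sketch of the loop
might look like the following. The name join_continued_lines is made up
for illustration, and the function takes any iterable of lines (an open
file works) instead of hard-coding the file names. It is written so that
it also runs on a current Python.

```python
def join_continued_lines(lines):
    """Return the text with the newline removed after ',', ';' or '-'.

    Hypothetical helper, not part of the original suggestion.
    """
    pieces = []
    for line in lines:
        # Assuming the line ends with a newline, line[-1] is that newline,
        # so the last visible character is line[-2].
        if len(line) > 1 and line[-2] in ';,-':
            # rstrip() drops the newline (and any other trailing
            # whitespace) but keeps leading whitespace.
            line = line.rstrip()
        pieces.append(line)
    return ''.join(pieces)


sample = [
    "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;\n",
    "Jan. 27 ; v. 270 ; p. 554.\n",
]
joined = join_continued_lines(sample)
```

With a real file you would pass the open file object and write the
result out, for example output.write(join_continued_lines(open('example.txt'))).
Note that a line with spaces between the punctuation and the newline
would not be caught by this test.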
> Working from the above suggestion (and thank you very much - I did
> enjoy your online tutorial)
> I came up with the following:
> import os
> import sys
> import re
> import string
You don't need any of the above.
> # The next 5 lines are so I have an idea of how many lines i started with in the file.
> in_filename = raw_input('What is the COMPLETE name of the file you want to open: ')
> in_file = open(in_filename, 'r')
> text = in_file.read()
As Luke pointed out, you should use readlines() here.
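To illustrate, here is a rough sketch of what readlines() gives you.
io.StringIO stands in for the open file so the snippet is self-contained,
and the sample text is invented.

```python
import io

# Stand-in for open(in_filename, 'r'); the contents are made up.
in_file = io.StringIO(u"first line-\nsecond line\nthird line\n")

lines = in_file.readlines()   # a list of strings, one per line
num_lines = len(lines)        # no need to count '\n' by hand

# Each item is a whole line, newline included, so the last visible
# character of a line is line[-2].
last_chars = [line[-2] for line in lines if len(line) > 1]
```

Looping over lines then hands you one complete line at a time, which is
what the rest of the program expects.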
> num_lines = text.count('\n')
> print 'There are', num_lines, 'lines in the file', in_filename
> output = open("cleandata.txt","a") # file for writing data to after stripping newline character
> # read file, copying each line to new file
> for line in text:
>     if line[:-1] in '-':
>         line = line.rstrip()
>     else: output.write(line)
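Because text is one big string, iterating over it gives you one character
at a time, not one line. Since line is a single character, line[:-1] is
always an empty string, and an empty string is "in" every string, so the
condition will always be true. What this loop does is strip all
the whitespace out of your file.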
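You can check this with a few lines like these. The sample string is
invented for the demonstration.

```python
text = "ab-\ncd\n"

# Iterating over a string yields single characters, not lines.
chars = [line for line in text]

# Slicing the last character off a one-character string leaves ''.
slices = [line[:-1] for line in text]

# The empty string is considered to be "in" every string.
always_true = all(line[:-1] in '-' for line in text)

# What was probably intended: test the last character of each real line.
ends_with_hyphen = [line.endswith('-') for line in text.splitlines()]
```

Here chars holds seven one-character strings, every entry of slices is
empty, and always_true is True even though only one line ends in a hyphen.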
> print "Data written to cleandata.txt."
> # close the files
> The above ran with no errors, gave me the number of lines in my original
> file but then when I opened the cleandata.txt file
> I got:
This is mysterious. What is the original data? What OS are you running
on? How did you view the file?
> So what did I do to cause all of the strange characters????
> Plus since this goes on it is as if it removed all \n and not just the
> ones after a hyphen which I was using as my test case.
> Thanks again.
>> Then move line by line through the file and delete everything
>> after a numerical sequence
> Slightly more tricky because you need to use a regular expression.
> But if you know regex then only slightly.
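As a rough sketch of that regex step, assuming the "numerical sequence"
means the patent number that follows "No." in the sample data (that is a
guess, the original request does not say). The pattern and the name
truncate_after_number are illustrative only.

```python
import re

# Assumption: the number of interest follows "No." and is made of digit
# groups separated by commas, e.g. "No. 1,329,155".
NUMBER = re.compile(r'(No\.\s*\d+(?:,\d+)*).*')


def truncate_after_number(line):
    """Drop everything on the line after the number (hypothetical helper)."""
    # '.' does not match a newline, so a trailing '\n' is left in place.
    return NUMBER.sub(r'\1', line)


before = ("Aaron, Solomon E., Boston, Mass. Pliers. "
          "No. 1,329,155 ;Jan. 27 ; v. 270 ; p. 554.\n")
after = truncate_after_number(before)
```

Lines that contain no such number pass through unchanged, so the function
can be applied to every line of the file.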
>> I am wondering if Python would be a good tool
> Absolutely, it's one of the areas where Python excels.
>> find information on how to accomplish this
> You could check my tutorial on the three topics:
> Handling text
> Handling files
> Regular Expressions.
> Also the standard python documentation for the general tutorial
> (assuming you've done basic programming in some other language
> before) plus the re module
>> using something like the unix tool awk or something else??
> awk or sed could both be used, but Python is more generally
> useful so unless you already know awk I'd take the time to
> learn the basics of Python (a few hours maybe) and use that.
> Alan Gauld
> Author of the Learn to Program web site