[Tutor] Should I use python for parsing text
Kent Johnson
kent37 at tds.net
Wed Mar 21 11:29:58 CET 2007
Jay Mutter III wrote:
> "Jay Mutter III" <jmutter at uakron.edu> wrote
>
>> See example next:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> ...
>> Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
>> Jan. 27 ; v. 270 ; p. 554.
>>
>> For instance, I would like to go to end of line and if last
>> character is a comma or semicolon or hyphen then
>> remove the CR.
>
> It would look something like:
>
> output = open('example.fixed','w')
> for line in file('example.txt'):
>     if line[-1] in ',;-': # check last character
The last character will almost always be a newline (only the final line
of the file may lack one); try
    if len(line) > 1 and line[-2] in ';,-':
instead.
>         line = line.strip() # lose the C/R
This will also lose any leading or trailing whitespace. line.rstrip()
would be safer.
>         output.write(line) # write to output
>     else: output.write(line) # append the next line complete with C/R
> output.close()
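
Putting both corrections together, a minimal sketch of the line-joining step might look like the following. It is written for Python 3 (the thread itself uses Python 2), and the function name join_continued_lines and its continuation_chars parameter are made up for illustration. It strips only the line ending rather than all trailing whitespace, and, like the quoted code, it inserts no space where two lines are joined.

```python
def join_continued_lines(lines, continuation_chars=",;-"):
    """Join each line that ends in a continuation character to the next one.

    `lines` is any iterable of strings that still carry their line endings,
    for example an open text file. Hypothetical helper, not from the thread.
    """
    pieces = []
    for line in lines:
        body = line.rstrip("\r\n")  # drop only the line ending
        if body and body[-1] in continuation_chars:
            pieces.append(body)     # continuation: write without the newline
        else:
            pieces.append(line)     # complete line: keep the newline
    return "".join(pieces)


sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;\n",
    "Jan. 27 ; v. 270 ; p. 554.\n",
]
joined = join_continued_lines(sample)
```

With a real file, something like join_continued_lines(open('example.txt')) should behave the same way, since iterating over an open text file yields lines with their endings intact.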
>
>
>
>
> Working from the above suggestion (and thank you very much - I did
> enjoy your online tutorial),
> I came up with the following:
>
> import os
> import sys
> import re
> import string
You don't need any of the above.
>
> # The next 5 lines are so I have an idea of how many lines I started with in the file.
>
> in_filename = raw_input('What is the COMPLETE name of the file you want to open: ')
> in_file = open(in_filename, 'r')
> text = in_file.read()
As Luke pointed out, you should use readlines() here.
> num_lines = text.count('\n')
> print 'There are', num_lines, 'lines in the file', in_filename
>
> output = open("cleandata.txt","a") # file for writing data to after stripping newline character
>
> # read file, copying each line to new file
> for line in text:
>     if line[:-1] in '-':
>         line = line.rstrip()
>         output.write(line)
>     else: output.write(line)
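Since text is one big string, iterating over it yields one character at
a time, so line is a single character. line[:-1] is therefore always an
empty string, and because the empty string is "in" every string, the
condition will always be true. What this loop does is strip all the
whitespace out of your file.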
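
The behaviour described above can be checked on a tiny string. This is only an illustrative sketch in Python 3; the sample text and variable names are invented, and splitlines(True) stands in for readlines() so that no file is needed.

```python
text = "a b-\nc d\n"

# Iterating over a str yields single characters, not lines.
pieces = [piece for piece in text]

# Slicing the last character off a one-character string leaves "",
# and the empty string counts as being "in" every string.
always_true = all(piece[:-1] in "-" for piece in pieces)

# So every character goes through rstrip(), which turns each
# whitespace character (spaces and newlines alike) into "".
buggy_output = "".join(piece.rstrip() for piece in text)

# Splitting into lines first gives the loop what it expected.
lines = text.splitlines(True)  # True keeps the line endings
num_lines = len(lines)
```

Here buggy_output has lost the space inside each line as well as both newlines, which matches the symptom that every \n disappeared and not only the ones after a hyphen.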
>
> print "Data written to cleandata.txt."
>
> # close the files
> in_file.close()
> output.close()
>
> The above ran with no errors and gave me the number of lines in my original
> file, but then when I opened the cleandata.txt file
> I got:
>
> A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最�Company.⠀匀攀攀�Sebastian,䄀⸀�A.,�and
> 䌀愀瀀攀猀Ⰰ�assignors.)�A.䜀⸀�A.刀愀椀氀眀愀礀�Light☀�Signal䌀漀⸀�(See
> 䴀攀搀攀渀Ⰰ�Elof�H
assignor.)�A-N䌀漀洀瀀愀渀礀Ⰰ�The.⠀匀攀攀�Alexander愀
> 渀搀�Nasb,愀猀ⴀ�猀椀最渀漀爀猀⸀㬀�䄀一�Company,吀栀攀⸀�(See一愀猀栀Ⰰ�It.
> 䨀⸀Ⰰ�and䄀氀攀砀愀渀搀攀爀Ⰰ�as-�
This is mysterious. What is the original data? What OS are you running
on? How did you view the file?
Kent
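
One possible explanation, offered purely as an assumption because the thread never confirms how the original file was encoded: if the input were UTF-16 (little-endian), every ASCII character would be stored as two bytes, and a loop that deletes each whitespace byte would remove only half of every space or newline. From that point on the remaining bytes pair up wrongly until the next deleted byte restores the alignment. The sketch below, in Python 3, simulates that on one line of the sample data; the variable names are invented for illustration.

```python
original = "A.-C. Manufacturing Company."

# Assumption: the file on disk was UTF-16 little-endian.
raw = original.encode("utf-16-le")

# The bytes that Python 2's str.rstrip() treats as whitespace.
WHITESPACE_BYTES = b" \t\n\r\x0b\x0c"

# Simulate the byte-by-byte loop: every whitespace byte is dropped,
# but the zero byte that followed it stays behind.
stripped = bytes(b for b in raw if b not in WHITESPACE_BYTES)

# A viewer that still decodes the result as UTF-16 sees shifted pairs.
garbled = stripped.decode("utf-16-le", errors="replace")
```

In this simulation the text up to the first space survives, the word after it turns into CJK-looking characters, and the text after the second space is readable again, which resembles the alternating pattern in the output quoted above. Checking the first few bytes of the original file for a byte order mark would be one way to confirm or rule this out.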
>
> So what did I do to cause all of the strange characters????
> Plus, since this goes on, it is as if it removed all \n and not just the
> ones after a hyphen, which I was using as my test case.
>
> Thanks again.
>
> Jay
>
>
>
>> Then move line by line through the file and delete everything
>> after a numerical sequence
>
> Slightly more tricky because you need to use a regular expression.
> But if you know regex then only slightly.
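
As a rough sketch of the regular-expression step, assuming the "numerical sequence" means the patent number that follows "No." in the sample (for example "No. 1,329,155"), something like this could work in Python 3. The pattern and the function name truncate_after_number are guesses made for illustration and would need adjusting to the real data.

```python
import re

# Assumed format: "No." followed by digits grouped in threes with commas.
PATENT_NUMBER = re.compile(r"No\.\s*\d{1,3}(?:,\d{3})*")


def truncate_after_number(line):
    """Return the line cut off right after the first patent number.

    Lines without a match are returned unchanged. Hypothetical helper.
    """
    match = PATENT_NUMBER.search(line)
    if match is None:
        return line
    return line[:match.end()]


entry = ("Aaron, Solomon E., Boston, Mass. Pliers. "
         "No. 1,329,155 ;Jan. 27 ; v. 270 ; p. 554.")
trimmed = truncate_after_number(entry)
```

Running this after the lines have been joined matters, because in the sample the date and volume information sits on the line after the number.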
>
>> I am wondering if Python would be a good tool
>
> Absolutely, it's one of the areas where Python excels.
>
>> find information on how to accomplish this
>
> You could check my tutorial on the three topics:
>
> Handling text
> Handling files
> Regular Expressions.
>
> Also the standard Python documentation for the general tutorial
> (assuming you've done basic programming in some other language
> before) plus the re module.
>
>> using something like the unix tool awk or something else??
>
> awk or sed could both be used, but Python is more generally
> useful so unless you already know awk I'd take the time to
> learn the basics of Python (a few hours maybe) and use that.
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
>
>