[Tutor] Should I use python for parsing text

Kent Johnson kent37 at tds.net
Wed Mar 21 11:29:58 CET 2007


Jay Mutter III wrote:
> "Jay Mutter III" <jmutter at uakron.edu> wrote
> 
>> See example next:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> ...
>> Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
>> Jan. 27 ; v. 270 ; p. 554.
>>
>> For instance, I would like to go to end of line and if last
>> character is a comma or semicolon or hyphen then
>> remove the CR.
> 
> It would look something like:
> 
> output = open('example.fixed','w')
> for line in file('example.txt'):
>     if line[-1] in ',;-':            # check last character

The last character will always be a newline (except possibly on the
final line of the file); try
  if len(line) > 1 and line[-2] in ';,-':
instead.

>       line = line.strip()         # lose the C/R

This will also lose any leading or trailing whitespace. line.rstrip() 
would be safer.

>       output.write(line)        # write to output
>     else: output.write(line)  # append the next line complete with C/R
> output.close()
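
Putting the two corrections above together, here is a minimal sketch of
the loop. It is written in Python 3 syntax rather than the Python 2 used
in this thread, and the function name join_continued_lines and the
sample data are made up for illustration. Instead of looking at
line[-2], it strips trailing whitespace first and then checks the last
remaining character, which also copes with a final line that has no
newline.

```python
def join_continued_lines(lines, continuation_chars=',;-'):
    """Join each line ending in one of continuation_chars to the next line.

    `lines` is any iterable of strings that still carry their newlines,
    such as an open file or the result of readlines().
    """
    pieces = []
    for line in lines:
        stripped = line.rstrip()      # drops the newline and trailing spaces only
        if stripped and stripped[-1] in continuation_chars:
            pieces.append(stripped)   # line continues: leave the newline out
        else:
            pieces.append(line)       # ordinary line: keep it, newline included
    return ''.join(pieces)


# Hypothetical sample taken from the record quoted earlier in the thread.
sample = [
    "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;\n",
    "Jan. 27 ; v. 270 ; p. 554.\n",
]
print(join_continued_lines(sample))
```

Note that nothing is inserted at the join, so "155 ;" runs straight
into "Jan."; that is fine for hyphenated words but you may want to add
a space after a comma or semicolon.
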
> 
> 
> 
> 
> Working from the above suggestion (and thank you very much; I did 
> enjoy your online tutorial),
> I came up with the following:
> 
> import os
> import sys
> import re
> import string

You don't need any of the above.
> 
> # The next 5 lines are so I have an idea of how many lines I started with in the file.
> 
> in_filename = raw_input('What is the COMPLETE name of the file you want to open:    ')
> in_file = open(in_filename, 'r')
> text = in_file.read()

As Luke pointed out, you should use readlines() here.
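
To illustrate the difference, a small sketch in Python 3 syntax (the
thread itself uses Python 2). io.StringIO stands in for the real file
here and the sample text is made up; with real data you would call
readlines() on the object returned by open().

```python
import io

# Stand-in for an open file, so the example runs without a file on disk.
in_file = io.StringIO("first line,\nsecond line\n")

lines = in_file.readlines()   # a list of strings, one per line, newlines kept
num_lines = len(lines)        # no need to count '\n' by hand
print('There are', num_lines, 'lines')

for line in lines:
    print(repr(line))         # each item is a whole line, not a single character
```

read() would return the whole file as one string instead, and looping
over a string gives you one character at a time.
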

> num_lines = text.count('\n')
> print 'There are', num_lines, 'lines in the file', in_filename
> 
> output = open("cleandata.txt","a")    # file for writing data to after stripping newline character
> 
> # read file, copying each line to new file
> for line in text:
>     if line[:-1] in '-':
>         line = line.rstrip()
>         output.write(line)
>     else: output.write(line)

Since line is a single character, line[:-1] is always an empty string 
and the condition will always be true. What this loop does is strip all 
the whitespace out of your file.
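
You can see this for yourself with a short experiment. This is a
sketch in Python 3 syntax; the function name strip_like_the_loop and
the sample text are made up, but the loop body follows the quoted code.

```python
def strip_like_the_loop(text):
    """Mimic the quoted loop, collecting output in a list instead of a file."""
    out = []
    for ch in text:            # looping over a string yields single characters
        if ch[:-1] in '-':     # ch[:-1] is '', and '' is "in" every string
            ch = ch.rstrip()   # so every whitespace character becomes ''
        out.append(ch)
    return ''.join(out)


text = "as-\nsignors. (See\n"
print(repr(strip_like_the_loop(text)))   # prints 'as-signors.(See'
```

Both newlines and the space are gone, not just the newline after the
hyphen.
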
> 
> print "Data written to cleandata.txt."
> 
> # close the files
> in_file.close()
> output.close()
> 
> The above ran with no errors and gave me the number of lines in my original 
> file, but then when I opened the cleandata.txt file
> I got:
> 
> A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最�Company.⠀匀攀攀�Sebastian,䄀⸀�A.,�and 
> 䌀愀瀀攀猀Ⰰ�assignors.)�A.䜀⸀�A.刀愀椀氀眀愀礀�Light☀�Signal䌀漀⸀�(See 
> 䴀攀搀攀渀Ⰰ�Elof�H
> assignor.)�A-N䌀漀洀瀀愀渀礀Ⰰ�The.⠀匀攀攀�Alexander愀 
> 渀搀�Nasb,愀猀ⴀ�猀椀最渀漀爀猀⸀㬀�䄀一�Company,吀栀攀⸀�(See一愀猀栀Ⰰ�It. 
> 䨀⸀Ⰰ�and䄀氀攀砀愀渀搀攀爀Ⰰ�as-�

This is mysterious. What is the original data? What OS are you running 
on? How did you view the file?

Kent
> 
> So what did I do to cause all of the strange characters?
> Also, since this goes on, it looks as if it removed every \n and not just the 
> ones after a hyphen, which I was using as my test case.
> 
> Thanks again.
> 
> Jay
> 
> 
> 
>> Then move line by line through the file and delete everything
>> after a numerical sequence
> 
> Slightly more tricky because you need to use a regular expression.
> But if you know regex then only slightly.
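
As a rough idea of what that regular expression could look like, here
is a sketch in Python 3 syntax. It assumes that "numerical sequence"
means the first run of digits with optional comma-separated groups,
such as the patent number in the sample record; that reading, the
pattern, and the name truncate_after_number are all assumptions made
for illustration.

```python
import re

# Assumed shape of the number: digits, optionally followed by ",digits" groups.
NUMBER = re.compile(r'\d+(?:,\d+)*')


def truncate_after_number(line):
    """Return the line cut off right after the first number found in it."""
    match = NUMBER.search(line)
    if match is None:
        return line               # no number: leave the line untouched
    return line[:match.end()]     # keep everything up to the end of the number


record = "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;Jan. 27 ; v. 270 ; p. 554."
print(truncate_after_number(record))
```

If a name or place earlier in the line can contain digits, the pattern
would need to be anchored on something more specific, for example the
"No." that precedes the number in this sample.
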
> 
>> I am wondering if Python would be a good tool
> 
> Absolutely, it's one of the areas where Python excels.
> 
>> find information on how to accomplish this
> 
> You could check my tutorial on the three topics:
> 
> Handling text
> Handling files
> Regular Expressions.
> 
> Also see the standard Python documentation: the general tutorial
> (assuming you've done basic programming in some other language
> before) plus the documentation for the re module.
> 
>> using something like the unix tool awk or something else??
> 
> awk or sed could both be used, but Python is more generally
> useful so unless you already know awk I'd take the time to
> learn the basics of Python (a few hours maybe) and use that.
> 
> -- 
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
> 
> 


