[Tutor] Should I use python for parsing text

Wed Mar 21 03:47:40 CET 2007

"Jay Mutter III" <jmutter at uakron.edu> wrote

 > See example  next:
 > A.-C. Manufacturing Company. (See Sebastian, A. A.,
 > and Capes, assignors.)
 >...
 >Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ;
 >Jan. 27 ; v. 270 ; p. 554.
 >
 > For instance, I would like to go to end of line and if last
 > character  is a comma or semicolon or hyphen then
 > remove the CR.

It would look something like:

output = open('example.fixed','w')
for line in open('example.txt'):
     stripped = line.rstrip()                 # lose the C/R
     if stripped.endswith((',', ';', '-')):   # check last character before the C/R
       output.write(stripped)      # write to output without the C/R
     else: output.write(line)  # write the line complete with C/R
output.close()

Working from the above suggestion (and thank you very much, I did
enjoy your online tutorial), I came up with the following:

import os
import sys
import re
import string

# The next 5 lines are so I have an idea of how many lines I started
# with in the file.

in_filename = raw_input('What is the COMPLETE name of the file you want to open:    ')
in_file = open(in_filename, 'r')
text = in_file.read()
num_lines = text.count('\n')
print 'There are', num_lines, 'lines in the file', in_filename

output = open("cleandata.txt","a")    # file for writing data to
                                      # after stripping newline character

# read file, copying each line to new file
for line in text:
     if line[:-1] in '-':
         line = line.rstrip()
         output.write(line)
     else: output.write(line)

print "Data written to cleandata.txt."

# close the files
in_file.close()
output.close()

The above ran with no errors and gave me the number of lines in my
original file, but when I opened the cleandata.txt file
I got:

A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最 Company.⠀匀攀攀  
Sebastian,䄀⸀ A., and䌀愀瀀攀猀Ⰰ assignors.) A.䜀⸀ A.刀 
愀椀氀眀愀礀 Light☀ Signal䌀漀⸀ (See䴀攀搀攀渀 
Ⰰ Elof H
assignor.) A-N䌀漀洀瀀愀渀礀Ⰰ The.⠀匀攀攀  
Alexander愀渀搀 Nasb,愀猀ⴀ 猀椀最渀漀爀猀⸀㬀 䄀一  
Company,吀栀攀⸀ (See一愀猀栀Ⰰ It.䨀⸀Ⰰ and䄀氀攀砀 
愀渀搀攀爀Ⰰ as- 

So what did I do to cause all of the strange characters?
Also, since this goes on, it is as if it removed every \n and not just
the ones after a hyphen, which I was using as my test case.

Thanks again.

Jay
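
A possible reading of the symptoms in the script above, offered as a
sketch rather than a confirmed diagnosis. Because text is one big
string, "for line in text" walks through it one character at a time.
For a single character, line[:-1] is the empty string, and '' in '-'
is always true, so rstrip() runs on every character and deletes every
newline and every space. The strange characters look consistent with
the input file being stored as UTF-16 (two bytes per character):
dropping one byte of a two-byte space shifts everything after it by
one byte until the next space is dropped. The names
join_continued_lines and clean_file below, and the encoding='utf-16'
default, are assumptions made for illustration.

```python
import io


def join_continued_lines(lines, markers=(',', ';', '-')):
    """Yield each line, dropping its newline when the text ends in a marker."""
    for line in lines:
        stripped = line.rstrip()        # trailing whitespace removed, newline included
        if stripped.endswith(markers):  # last visible character is , ; or -
            yield stripped              # the next line will be appended to this one
        else:
            yield line                  # keep the line and its newline unchanged


def clean_file(in_filename, out_filename, encoding='utf-16'):
    # encoding='utf-16' is an assumption based on the garbled output;
    # change it if the file turns out to use a different encoding.
    with io.open(in_filename, 'r', encoding=encoding) as in_file:
        with io.open(out_filename, 'w', encoding='utf-8') as output:
            # Iterating over the file object yields whole lines, not characters.
            for piece in join_continued_lines(in_file):
                output.write(piece)
```

Something like clean_file('example.txt', 'cleandata.txt') would then
write the joined lines. Note that the two halves are joined with no
space between them, which suits a hyphenated word but may not be what
is wanted after a comma or semicolon.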

 > Then move line by line through the file and delete everything
 > after a  numerical sequence

Slightly more tricky because you need to use a regular expression.
But if you know regex then only slightly.
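
As a rough sketch of the regular expression approach, assuming the
"numerical sequence" means the patent number that follows "No." in
the sample entry (the pattern and the name trim_after_number are
illustrative guesses, not something specified in the thread):

```python
import re

# Hypothetical pattern: "No." followed by digits and commas, then the rest of the line.
PATENT_NUMBER = re.compile(r'(No\.\s*\d[\d,]*).*')


def trim_after_number(line):
    """Return the line cut off just after the first patent number, if there is one."""
    # Group 1 keeps "No. 1,329,155"; the trailing .* (which stops at a newline) is dropped.
    return PATENT_NUMBER.sub(r'\1', line)


entry = "Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155 ; Jan. 27 ; v. 270 ; p. 554."
print(trim_after_number(entry))
# -> Aaron, Solomon E., Boston, Mass. Pliers. No. 1,329,155
```

Lines with no "No." number pass through unchanged, so the function can
be applied to every line of the file.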

 >  I am wondering if Python would be a good tool

Absolutely, it's one of the areas where Python excels.

 > find information on how to accomplish this

You could check  my tutorial on the three topics:

Handling text
Handling files
Regular Expressions.

Also the standard Python documentation: the general tutorial
(assuming you've done basic programming in some other language
before) plus the documentation for the re module.

 > using something like the unix tool awk or something else??

awk or sed could both be used, but Python is more generally
useful so unless you already know awk I'd take the time to
learn the basics of Python (a few hours maybe) and use that.

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld