[Tutor] Should I use python for parsing text?

Thu Mar 22 12:51:27 CET 2007

Jay Mutter III wrote:
> Kent;
> 
> Thanks for the reply on tutor-python.
> 
> My data file which is just a .txt file created under WinXP by an OCR 
> program contains lines like:
> 
> A.-C. Manufacturing Company. (See Sebastian, A. A.,
> and Capes, assignors.)
> A. G. A. Railway Light & Signal Co. (See Meden, Elof
> H„ assignor.)
> A-N Company, The. (See Alexander and Nasb, as-
> signors.;
> AN Company, The. (See Nash, It. J., and Alexander, as-
> signors.)
> 
> I use an Intel iMac running OS X 10.4.9, and when I used Python to append 
> one file to another I got a file that opened in OS X's
> TextEdit program with characters that looked like Japanese/Chinese 
> characters.
> 
> When I pasted them into my mail client (OS X's Mail) they were then just 
> a sequence of question marks, so I am not sure what happened.
> 
> Any thoughts???

For some reason, after you run the Python program, TextEdit thinks the 
file is not ASCII data; it seems to think it is UTF-8 or a Chinese 
encoding. Your original email was encoded as UTF-8, which points in that 
direction but is not conclusive.
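
One way to get more evidence is to look at the first few bytes of each 
file for a byte order mark (BOM). The following is only a sketch, written 
in Python 3 syntax rather than the Python 2 used elsewhere in this thread; 
sniff_bom is a made-up helper name, and a file with no BOM tells you 
nothing either way.

```python
import codecs

# Known byte order marks. UTF-32 is checked before UTF-16 because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(filename):
    """Return a guessed encoding name if the file starts with a BOM, else None.

    Hypothetical helper for illustration; it only inspects the first 4 bytes.
    """
    with open(filename, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

Running it on both the original file and cleandata.txt would show whether 
either one starts with a BOM.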

If you zip up and send me the original file and the cleandata.txt file 
*exactly as it is produced* by the Python program - not edited in any 
way - I will take a look and see if I can guess what is going on.
> 
> And i tried  using the following on the above data:
> 
> in_filename = raw_input('What is the COMPLETE name of the file you want 
> to open:    ')
> in_file = open(in_filename, 'r')

It wouldn't hurt to use universal newlines here, since you are working 
cross-platform:
   open(in_filename, 'rU')
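
To illustrate what universal newlines do, here is a small sketch in 
Python 3 syntax, where this translation is the default for text mode and 
no 'U' flag is needed. The function name and the temporary file are 
invented for the example.

```python
import os
import tempfile

def read_lines_universal(filename):
    """Read lines in text mode; CRLF and bare CR are translated to '\\n'."""
    # newline=None is the default in Python 3 and enables universal
    # newlines on input.
    with open(filename, 'r', newline=None) as f:
        return f.readlines()

# Build a small file with mixed line endings to show the effect.
path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
with open(path, 'wb') as f:
    f.write(b'one\r\ntwo\rthree\n')

lines = read_lines_universal(path)
# Every line now ends with a plain '\n', whatever the file used.
```

Without the translation, a line from a Windows file ends in '\r\n', so a 
test such as line[-2] looks at the '\r' rather than the last visible 
character.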

> text = in_file.readlines()
> num_lines = text.count('\n')

Here 'text' is a list of lines, so text.count('\n') is counting the 
number of blank lines (lines containing only a newline) in your file. 
You should use
   num_lines = len(text)
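
A tiny example with made-up data shows the difference between the two 
(this runs unchanged in Python 2 and Python 3):

```python
# A list shaped like the result of readlines(): one string per line.
text = ['first line\n', '\n', 'second line\n', '\n', 'third line\n']

# list.count() counts elements equal to '\n', i.e. the blank lines only.
blank_lines = text.count('\n')

# len() counts every element, i.e. every line in the file.
num_lines = len(text)
```

Here blank_lines is 2 and num_lines is 5. A file with no blank lines at 
all gives a count of 0, which matches what you saw.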

> print 'There are', num_lines, 'lines in the file', in_filename
> 
> output = open("cleandata.txt","a")    # file for writing data to after 
> stripping newline character

I agree with Luke: use 'w' for now, so that the file contains only the 
output of this run. Maybe something already in the file is making it 
look like UTF-8...
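
The difference between the two modes is easy to see with a throwaway 
file. This is a sketch in Python 3 syntax; write_output and the temporary 
path are invented for the example.

```python
import os
import tempfile

def write_output(filename, lines, mode):
    """Write lines to filename using the given mode ('w' or 'a')."""
    with open(filename, mode) as out:
        out.writelines(lines)

path = os.path.join(tempfile.mkdtemp(), 'cleandata.txt')

# First run leaves some data behind.
write_output(path, ['old run\n'], 'w')

# 'a' appends: the earlier contents are still there.
write_output(path, ['new run\n'], 'a')
with open(path) as f:
    appended = f.read()

# 'w' truncates first: only the output of this run remains.
write_output(path, ['new run\n'], 'w')
with open(path) as f:
    overwritten = f.read()
```

With 'a', anything left over from earlier experiments stays at the top of 
the file, and that is what an editor looks at when it guesses the encoding.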

> 
> # read file, copying each line to new file
> for line in text:
>     if len(line) > 1 and line[-2] in ';,-':
>         line = line.rstrip()
>         output.write(line)
>     else: output.write(line)
> 
> print "Data written to cleandata.txt."
> 
> # close the files
> in_file.close()
> output.close()
> 
> As written above it tells me that there are 0 lines which is surprising 
> because if I run the first part by itself it tells there are 1982 lines 
> ( actually 1983 so i am figuring EOF)
> It copies/writes the data to the cleandata file but it does not strip 
> out CR and put data on one line ( a sample of what i am trying to get is 
> next)
> 
> A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes, assignors.)
> 
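
A minimal sketch of the joining step, in Python 3 syntax. It is one 
possible approach, not a tested solution for your file. The function name 
is invented, and it assumes that a trailing '-' always marks a word split 
across lines (so the hyphen is dropped) and that a trailing ',' or ';' 
should be followed by a single space. Using rstrip() before the test means 
a stray '\r' or trailing space cannot hide the last visible character.

```python
def join_continued_lines(lines):
    """Join each line ending in ';', ',' or '-' with the line after it."""
    out = []
    for line in lines:
        stripped = line.rstrip()        # drops '\n', '\r' and trailing spaces
        if stripped.endswith('-'):
            # Assumption: the hyphen is a word break, so remove it.
            out.append(stripped[:-1])
        elif stripped.endswith((',', ';')):
            # Keep the punctuation and separate it from the next line.
            out.append(stripped + ' ')
        else:
            # An ordinary line: keep it on its own line.
            out.append(stripped + '\n')
    return ''.join(out)

sample = [
    'A.-C. Manufacturing Company. (See Sebastian, A. A.,\r\n',
    'and Capes, assignors.)\r\n',
    'AN Company, The. (See Nash, It. J., and Alexander, as-\r\n',
    'signors.)\r\n',
]
joined = join_continued_lines(sample)
```

OCR errors such as "signors.;" will still be joined to the following 
line, so the output needs checking by eye.
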
> 
> My apologies if i have intruded.

Please reply on-list in the future.

Kent