[Tutor] Should I use python for parsing text?

Fri Mar 23 14:51:00 CET 2007

First, thanks for all of the help.
I am actually starting to see the light.

On Mar 22, 2007, at 7:51 AM, Kent Johnson wrote:

> Jay Mutter III wrote:
>> Kent;
>> Thanks for the reply on tutor-python.
>> My data file which is just a .txt file created under WinXP by an  
>> OCR program contains lines like:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> A. G. A. Railway Light & Signal Co. (See Meden, Elof
>> H„ assignor.)
>> A-N Company, The. (See Alexander and Nasb, as-
>> signors.;
>> AN Company, The. (See Nash, It. J., and Alexander, as-
>> signors.)
>> I use an Intel iMac running OS X 10.4.9, and when I used Python to
>> append one file to another I got a file that opened in OS X's
>> TextEdit program with characters that looked like Japanese/Chinese
>> characters.
>> When I pasted them into my mail client (OS X's Mail) they were
>> then just a sequence of question marks, so I am not sure what
>> happened.
>> Any thoughts?
>
> For some reason, after you run the Python program, TextEdit thinks
> the file is not ASCII data; it seems to think it is UTF-8 or a
> Chinese encoding. Your original email was UTF-8, which points in
> that direction but is not conclusive.
>
> If you zip up and send me the original file and the cleandata.txt  
> file *exactly as it is produced* by the Python program - not edited  
> in any way - I will take a look and see if I can guess what is  
> going on.

You are correct that it was UTF-8.
Multiple people were scanning pages and converting them to text; some
saved as ASCII and some saved as Unicode.
The sample used above was UTF-8, so after your comment I checked all of
the files, converted everything to ASCII, combined all the pieces into
one file, and normalized the line endings to Unix style.
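
For anyone repeating this cleanup in code rather than by hand, here is a minimal sketch of the two steps described above (convert to ASCII, normalize line endings). It is written for Python 3, not the Python 2 used elsewhere in this thread, and it assumes the input really is UTF-8. The function name normalize_text is made up for illustration.

```python
def normalize_text(raw_bytes, encoding="utf-8"):
    """Decode bytes, switch to Unix line endings, and re-encode as ASCII."""
    text = raw_bytes.decode(encoding)
    # Handle Windows endings (\r\n) before bare \r so that a single
    # \r\n pair does not turn into two newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # errors="replace" turns every non-ASCII character into '?',
    # which makes stray OCR characters easy to find afterwards.
    return text.encode("ascii", errors="replace")


# Assumed sample: one OCR line with a non-ASCII character and mixed endings.
sample = "H\u201e assignor.)\r\nA-N Company, The.\r".encode("utf-8")
print(normalize_text(sample))
```

Replacing unknown characters with '?' loses information, so it may be worth searching the result for '?' and checking those spots against the scanned page.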

>> And I tried using the following on the above data:
>> in_filename = raw_input('What is the COMPLETE name of the file you  
>> want to open:    ')
>> in_file = open(in_filename, 'r')
>
> It wouldn't hurt to use universal newlines here since you are  
> working cross-platform:
>   open(in_filename, 'Ur')
>

Corrected this.
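
A note for readers on Python 3: there the 'U' mode flag is deprecated and was removed in Python 3.11, because text mode already translates \r\n and \r to \n by default (newline=None). The sketch below shows that behaviour on a throwaway temporary file; the helper name read_lines_universal and the ASCII encoding are assumptions for illustration.

```python
import os
import tempfile


def read_lines_universal(path):
    """Read a text file, letting Python translate all line endings to \\n."""
    # newline=None is the default; it is spelled out here for clarity.
    with open(path, "r", encoding="ascii", newline=None) as handle:
        return handle.readlines()


# Write a small file with Windows, old Mac, and Unix line endings mixed.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"one\r\ntwo\rthree\n")

lines = read_lines_universal(path)
os.remove(path)
print(lines)
```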

>> text = in_file.readlines()
>> num_lines = text.count('\n')
>
> Here 'text' is a list of lines, so text.count('\n') is counting the  
> number of blank lines (lines containing only a newline) in your  
> file. You should use
>   num_lines = len(text)
>

Changed.
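
The difference between the two counts is easy to see on a small made-up input. This is a Python 3 sketch that uses an in-memory file (io.StringIO) in place of the real data file:

```python
import io

# Five lines: two of them are blank, and the last has no trailing newline.
handle = io.StringIO("first line\n\nthird line\n\nlast line, no newline")
text = handle.readlines()

# list.count looks for elements equal to '\n', i.e. blank lines only.
blank_lines = text.count("\n")
# len gives the number of lines that readlines() returned.
num_lines = len(text)

print(blank_lines, num_lines)
```

A file with no blank lines therefore gives a count of 0, which matches the result reported for the full program.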

>> print 'There are', num_lines, 'lines in the file', in_filename
>> output = open("cleandata.txt","a")    # file for writing data to  
>> after stripping newline character
>
> I agree with Luke, use 'w' for now to make sure the file has only  
> the output of this program. Maybe something already in the file is  
> making it look like utf-8...
>
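
To illustrate the point about 'w' versus 'a', here is a small Python 3 sketch that simulates running a program twice against the same output file. The helper name write_twice and the temporary file are made up for the demonstration.

```python
import os
import tempfile


def write_twice(mode):
    """Write the same line to one file in two separate opens; return the content."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    for _ in range(2):
        with open(path, mode) as out:
            out.write("run output\n")
    with open(path) as handle:
        content = handle.read()
    os.remove(path)
    return content


print(repr(write_twice("w")))  # 'w' truncates, so only the last run remains
print(repr(write_twice("a")))  # 'a' appends, so both runs accumulate
```

With 'a', anything left in the file by an earlier run or a manual edit stays at the top of the output.
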
>> # read file, copying each line to new file
>> for line in text:
>>     if len(line) > 1 and line[-2] in ';,-':
>>         line = line.rstrip()
>>         output.write(line)
>>     else: output.write(line)
>> print "Data written to cleandata.txt."
>> # close the files
>> in_file.close()
>> output.close()
>> As written above it tells me that there are 0 lines, which is
>> surprising because if I run the first part by itself it tells me
>> there are 1982 lines (actually 1983, so I am figuring EOF).
>> It copies/writes the data to the cleandata file but it does not
>> strip out the CR and put the data on one line (a sample of what I am
>> trying to get is next):
>> A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes,  
>> assignors.)
>> My apologies if I have intruded.
>
> Please reply on-list in the future.
>
> Kent
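
For reference, one possible way to get the joined output shown in the sample is sketched below in Python 3. This is not the program from the thread. It rests on two assumptions: a trailing hyphen always marks a word split across lines (so the hyphen is dropped), and a trailing comma or semicolon should be followed by a single space. The function name join_continuations is hypothetical.

```python
def join_continuations(lines):
    """Merge each line ending in ';', ',' or '-' with the line that follows it."""
    merged = []
    pending = ""
    for line in lines:
        stripped = line.rstrip()
        if stripped.endswith("-"):
            # Assumption: the hyphen splits a word, so drop it and add no space.
            pending += stripped[:-1]
        elif stripped.endswith((";", ",")):
            # Assumption: the entry continues, so keep the mark and add a space.
            pending += stripped + " "
        else:
            merged.append(pending + stripped)
            pending = ""
    if pending:
        # The input ended on a continuation line; keep what was collected.
        merged.append(pending.rstrip())
    return merged


sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "AN Company, The. (See Nash, It. J., and Alexander, as-\n",
    "signors.)\n",
]
for entry in join_continuations(sample):
    print(entry)
```

A line that ends with a genuine hyphen (a hyphenated name at the line break, for example) would be joined incorrectly under the first assumption, so the output still needs a manual check.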