[Tutor] Should I use python for parsing text?
Jay Mutter III
jmutter at uakron.edu
Fri Mar 23 14:51:00 CET 2007
First, thanks for all of the help.
I am actually starting to see the light.
On Mar 22, 2007, at 7:51 AM, Kent Johnson wrote:
> Jay Mutter III wrote:
>> Kent;
>> Thanks for the reply on tutor-python.
>> My data file, which is just a .txt file created under WinXP by an
>> OCR program, contains lines like:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> A. G. A. Railway Light & Signal Co. (See Meden, Elof
>> H„ assignor.)
>> A-N Company, The. (See Alexander and Nasb, as-
>> signors.;
>> AN Company, The. (See Nash, It. J., and Alexander, as-
>> signors.)
>> I use an Intel iMac running OS X 10.4.9, and when I used Python to
>> append one file to another I got a file that opened in OS X's
>> TextEdit program with characters that looked like Japanese/Chinese
>> characters.
>> When I pasted them into my mail client (OS X's Mail) they were
>> then just a sequence of question marks, so I am not sure what
>> happened.
>> Any thoughts?
>
> For some reason, after you run the Python program, TextEdit thinks
> the file is not ascii data; it seems to think it is utf-8 or a
> Chinese encoding. Your original email was utf-8 which points in
> that direction but is not conclusive.
>
> If you zip up and send me the original file and the cleandata.txt
> file *exactly as it is produced* by the Python program - not edited
> in any way - I will take a look and see if I can guess what is
> going on.
>>
You are correct that it was UTF-8.
Multiple people were scanning pages and converting them to text; some
saved as ASCII and some saved as Unicode.
The sample used above was UTF-8, so after your comment I checked all
of the files, saved everything as ASCII, combined all the pieces into
one file, and normalized the line endings to Unix style.
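
For anyone following along, the cleanup described above could be sketched roughly as follows. This is a Python 3 sketch, not the code actually used here (the thread itself uses Python 2). The function names are hypothetical, and it assumes that the Unicode pieces start with a byte-order mark (BOM) and that pieces without one are ASCII or UTF-8.

```python
import codecs

def decode_ocr_bytes(raw):
    """Decode raw file bytes, using a byte-order mark (BOM) if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return raw.decode("utf-16")
    # Assumption: anything without a BOM is UTF-8 (plain ASCII is a subset).
    return raw.decode("utf-8")

def normalize_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def combine(pieces):
    """Decode each piece, normalize line endings, and join into one string."""
    return "".join(normalize_newlines(decode_ocr_bytes(p)) for p in pieces)

# Two in-memory "files": one ASCII, one saved as UTF-16 with a BOM.
pieces = [
    b"A-N Company, The.\r\n",
    "AN Company, The.\r\n".encode("utf-16"),
]
combined = combine(pieces)
```

In real use the byte strings would come from open(path, "rb").read() for each scanned piece, and the combined text would be written out once with a single encoding.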
>> And I tried using the following on the above data:
>> in_filename = raw_input('What is the COMPLETE name of the file you want to open: ')
>> in_file = open(in_filename, 'r')
>
> It wouldn't hurt to use universal newlines here since you are
> working cross-platform:
> open(in_filename, 'Ur')
>
Corrected this.
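
A note for readers on Python 3: the 'U' mode flag no longer exists in recent versions, because text mode already translates line endings by default (newline=None). The sketch below illustrates that behavior. The helper name and the temporary demo file are assumptions made for the example, not part of the program discussed here.

```python
import os
import tempfile

def read_lines_universal(path):
    """Read text lines, translating CRLF and CR line endings to LF."""
    # newline=None is the Python 3 default and enables universal newlines.
    with open(path, "r", newline=None, encoding="ascii") as handle:
        return handle.readlines()

# Write a small file that mixes Windows, old Mac, and Unix line endings.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"first\r\nsecond\rthird\n")

demo_lines = read_lines_universal(demo_path)
os.remove(demo_path)
```

Every line in demo_lines ends with "\n", whichever convention the file used on disk.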
>> text = in_file.readlines()
>> num_lines = text.count('\n')
>
> Here 'text' is a list of lines, so text.count('\n') is counting the
> number of blank lines (lines containing only a newline) in your
> file. You should use
> num_lines = len(text)
>
Changed.
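
To make the difference concrete, here is a small Python 3 illustration. The sample lines are adapted from the data above, with one blank line added as an assumption so that the two counts differ.

```python
# A list of lines as readlines() would return it; the middle item is a blank line.
lines = [
    "A-N Company, The. (See Alexander and Nasb, as-\n",
    "\n",
    "signors.;\n",
]

# list.count() counts items that are exactly equal to "\n", i.e. blank lines.
blank_lines = lines.count("\n")

# len() counts every line in the list.
total_lines = len(lines)

# On one big string, str.count() counts newline characters instead.
newline_chars = "".join(lines).count("\n")
```

Here blank_lines is 1, while total_lines and newline_chars are both 3.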
>> print 'There are', num_lines, 'lines in the file', in_filename
>> # file for writing data to after stripping newline character
>> output = open("cleandata.txt","a")
>
> I agree with Luke, use 'w' for now to make sure the file has only
> the output of this program. Maybe something already in the file is
> making it look like utf-8...
>
>> # read file, copying each line to new file
>> for line in text:
>>     if len(line) > 1 and line[-2] in ';,-':
>>         line = line.rstrip()
>>         output.write(line)
>>     else: output.write(line)
>> print "Data written to cleandata.txt."
>> # close the files
>> in_file.close()
>> output.close()
>> As written above, it tells me that there are 0 lines, which is
>> surprising because if I run the first part by itself it tells me
>> there are 1982 lines (actually 1983, so I am figuring EOF).
>> It copies/writes the data to the cleandata file, but it does not
>> strip out the CR and put the data on one line (a sample of what I am
>> trying to get is next):
>> A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes,
>> assignors.)
>> My apologies if I have intruded.
>
> Please reply on-list in the future.
>
> Kent
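
For reference, one possible way to get the joined output described above is sketched here in Python 3. It is not the program from this thread: the function name is hypothetical, and it assumes that a trailing hyphen marks a word split across lines (so the hyphen is dropped) while a trailing comma or semicolon should be followed by a single space.

```python
def join_continuations(lines, markers=";,-"):
    """Merge each line ending in one of `markers` with the line that follows."""
    out = []
    for line in lines:
        stripped = line.rstrip()
        if stripped and stripped[-1] in markers:
            if stripped.endswith("-"):
                # Assumption: a trailing hyphen marks a word split across lines.
                out.append(stripped[:-1])
            else:
                # Trailing comma or semicolon: keep it and add one space.
                out.append(stripped + " ")
        else:
            # Complete entry (or blank line): keep it, newline included.
            out.append(line)
    return "".join(out)

sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "AN Company, The. (See Nash, It. J., and Alexander, as-\n",
    "signors.)\n",
]
cleaned = join_continuations(sample)
```

With the sample lines, cleaned holds two complete entries, one per line. Note that an OCR slip such as "signors.;" would be treated as a continuation by this rule, so the data may need a manual pass as well.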