[Tutor] Should I use python for parsing text

Thu Mar 22 12:59:52 CET 2007

Jay Mutter III wrote:
> Luke;
I'm a bit pressed for time right now and I can't look over this e-mail.
Please reply on-list in the future using the 'reply-all' feature.  
You're more likely to get a prompt response.
(This e-mail is carbon-copied to the list, so don't worry about sending 
another.)
>
> Actually it did help but the following
>
> for line in text:
>     if len(line) > 1 and line[-2] in ';,-':
>         line = line.rstrip()
>         output.write(line)
>     else: output.write(line)
>
> does not have any apparent effect on my data.
>
> I start with lines
>
>
> A.-C. Manufacturing Company. (See Sebastian, A. A.,
> and Capes, assignors.)
> A. G. A. Railway Light & Signal Co. (See Meden, Elof
> H„ assignor.)
> A-N Company, The. (See Alexander and Nasb, as-
> signors.;
> AN Company, The. (See Nash, It. J., and Alexander, as-
> signors.)
> A/S. Arendal Smelteverk.    (See Kaaten, Einar, assignor.)
> A/S. Bjorgums Gevaei'kompani. (See Bjorguni, Nils, as-
> signor.)
> A/S  Mekano.     (Sec   Schepeler,   Herman  A.,  assignor.)
> A/S Myrens Verkstad.    (See Klling, Jens W. A., assignor.)
> A/S Stordo Kisgruber. (See Nielsen, C., and Ilelleland,
> assignors.)
>
> and I end up with the same.
> My goal is to strip out the CR or LF or whatever so that all info for 
> one entity is on 1 line.
>
> Any ideas of where I am going wrong?
>
> Thanks
>
> Jay
>
>
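One likely reason the loop above changes nothing: if text still comes from read(), each 'line' is a single character, so len(line) > 1 is never true and every character is written back unchanged. Below is a minimal sketch of the joining step, assuming the lines come from readlines() and that a trailing ';', ',' or '-' really does mean "continued on the next line". The name join_continuations and the choice to drop a trailing hyphen (so "as-" plus "signors.)" becomes "assignors.)") are my assumptions, not something from your script.

```python
def join_continuations(lines, markers=";,-"):
    """Join each line that ends in one of `markers` with the line after it."""
    joined = []
    pending = ""          # text carried over from unfinished lines
    for line in lines:
        stripped = line.rstrip()          # removes '\n', '\r' and trailing spaces
        if stripped and stripped[-1] in markers:
            if stripped.endswith("-"):
                pending += stripped[:-1]  # assumed to be a hyphenated word: drop the hyphen
            else:
                pending += stripped + " "
        else:
            joined.append(pending + stripped)
            pending = ""
    if pending:                           # file ended on a continuation line
        joined.append(pending.rstrip())
    return joined


sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "AN Company, The. (See Nash, It. J., and Alexander, as-\n",
    "signors.)\n",
]
result = join_continuations(sample)
```

This will not catch a continuation line that ends in an ordinary word (such as the one ending in "Elof"), and a scanning error like "signors.;" would be treated as a continuation, so the markers probably need tuning against the real data.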
> On Mar 21, 2007, at 1:41 AM, Luke Paireepinart wrote:
>
>>
>>> # The next 5 lines are so I have an idea of how many lines I started with in the file.
>>>
>>> in_filename = raw_input('What is the COMPLETE name of the file you want to open:    ')
>>> in_file = open(in_filename, 'r')
>>> text = in_file.read()
>> read() returns the whole file as a single string, not a list with 
>> one element per line.
>> Use readlines() for that.
>> (E.g., for a file with contents 'hello\nhoware\nyou?', read() would 
>> return that string unchanged, but
>> readlines() would return ['hello\n','howare\n','you?'].)
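To make the difference concrete, here is a small sketch using io.StringIO as a stand-in for the real file (that substitution is my assumption, so the example runs without a file on disk). It shows what each call returns and what a for loop sees in each case.

```python
import io

contents = u"hello\nhoware\nyou?"

# read() gives back one string holding the entire file.
as_string = io.StringIO(contents).read()

# readlines() gives back a list with one string per line.
as_lines = io.StringIO(contents).readlines()

# Looping over the string yields single characters...
first_items_of_string = [item for item in as_string][:3]
# ...while looping over the list yields whole lines.
first_items_of_list = [item for item in as_lines][:3]

newline_count = as_string.count(u"\n")
line_count = len(as_lines)
```

Note that counting '\n' can come out one lower than the number of lines when the last line has no trailing newline, as it does here.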
>>> num_lines = text.count('\n')
>> or just len(text) if you're using readlines()
>>> print 'There are', num_lines, 'lines in the file', in_filename
>>>
>>> output = open("cleandata.txt","a")    # file for writing data to after stripping newline character
>> You might want to open this file in 'write' mode while you're 
>> testing, so previous test results don't confuse you.
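A quick sketch of why append mode can mislead during testing. The temporary directory and the run_once helper are only for illustration; the real script writes straight to cleandata.txt.

```python
import os
import tempfile

# Hypothetical scratch location so the demo does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "cleandata.txt")

def run_once(mode):
    """Simulate one test run of the script writing its output."""
    out = open(path, mode)
    out.write("one test run\n")
    out.close()

# Two runs in append mode: the second run's output lands after the first.
run_once("a")
run_once("a")
with open(path) as f:
    appended = f.read()

# One run in write mode: the file is truncated first, so only this run remains.
run_once("w")
with open(path) as f:
    overwritten = f.read()
```

With 'a', old results pile up in the file, so it is hard to tell which output came from the latest change to the code.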
>>>
>>> # read file, copying each line to new file
>>> for line in text:
>> Since read() returns a single string, you're looping over every 
>> character in the file, not every line.
>>>     if line[:-1] in '-':
>> Because your 'line' variable holds a single character here, line[:-1] 
>> is the empty string, and '' in '-' is always True, so this branch 
>> runs for every character.
>>>         line = line.rstrip()
>>>         output.write(line)
>>>     else: output.write(line)
>>>
>>> print "Data written to cleandata.txt."
>>>
>>> # close the files
>>> in_file.close()
>>> output.close()
>>>
>>> The above ran with no errors and gave me the number of lines in my 
>>> original file, but then when I opened the cleandata.txt file
>>> I got:
>>>
>>> A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最 �Company.⠀匀攀攀�Sebastian,䄀⸀�A., 
>>> �and 䌀愀瀀攀猀Ⰰ�assignors.)�A.䜀⸀�A.刀愀椀氀眀愀礀 �Light☀�Signal䌀 
>>> 漀⸀� (See䴀攀搀攀渀Ⰰ�Elof�H
assignor.)�A-N䌀漀洀瀀愀渀礀Ⰰ�The.⠀匀攀 
>>> 攀 �Alexander愀渀搀�Nasb,愀猀ⴀ�猀椀最渀漀爀猀⸀㬀�䄀一�Company,吀栀攀 
>>> ⸀� (See一愀猀栀Ⰰ�It.䨀⸀Ⰰ�and䄀氀攀砀愀渀搀攀爀Ⰰ�as-�
>> Not sure what caused all of those characters.
>> HTH,
>> -Luke
>
>
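On the stray characters in the earlier output: one guess (I have not seen the file, so treat this as an assumption) is that the input is UTF-16 encoded, where every ASCII character is stored as two bytes, one of them zero. The earlier loop ran rstrip() on each byte separately, which turns a space or newline byte into an empty string but leaves its zero partner behind. Each dropped byte shifts the pairing by one, and a viewer that decodes the result as UTF-16 then shows CJK-looking characters until the next dropped byte shifts it back. A minimal sketch of that effect, with the little-endian byte order chosen purely for illustration:

```python
text = u"A.-C. Manufacturing Company."

# Each ASCII character becomes two bytes: the character itself, then 0x00.
raw = text.encode("utf-16-le")

# Drop only the 0x20 byte of every space, as a per-byte rstrip() would,
# leaving the 0x00 partner in the stream.
damaged = raw.replace(b" ", b"")

# Decoding the damaged bytes pairs each 0x00 with the wrong neighbour.
garbled = damaged.decode("utf-16-le")
```

Here garbled starts with a readable "A.-C.", turns "Manufacturing" into characters such as U+4D00 and U+6100, and becomes readable again at "Company.", which matches the pattern in the pasted output. If that is what is going on, opening the file with codecs.open(in_filename, 'r', 'utf-16') instead of plain open() should give properly decoded text to work with.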