[Tutor] Should I use python for parsing text
rabidpoobear at gmail.com
Thu Mar 22 12:59:52 CET 2007
Jay Mutter III wrote:
I'm a bit pressed for time right now and I can't look over this e-mail.
Please reply on-list in the future using the 'reply-all' feature.
You're more likely to get a prompt response.
(this e-mail is carbon copied to the list, so don't worry about sending
> Actually it did help but the following
> for line in text:
>     if len(line) > 1 and line[-2] in ';,-':
>         line = line.rstrip()
>     else: output.write(line)
> does not have any apparent effect on my data.
> I start with lines
> A.-C. Manufacturing Company. (See Sebastian, A. A.,
> and Capes, assignors.)
> A. G. A. Railway Light & Signal Co. (See Meden, Elof
> H„ assignor.)
> A-N Company, The. (See Alexander and Nasb, as-
> AN Company, The. (See Nash, It. J., and Alexander, as-
> A/S. Arendal Smelteverk. (See Kaaten, Einar, assignor.)
> A/S. Bjorgums Gevaei'kompani. (See Bjorguni, Nils, as-
> A/S Mekano. (Sec Schepeler, Herman A., assignor.)
> A/S Myrens Verkstad. (See Klling, Jens W. A., assignor.)
> A/S Stordo Kisgruber. (See Nielsen, C., and Ilelleland,
> and I end up with the same.
> My goal is to strip out the CR or LF or whatever so that all info for
> one entity is on 1 line.
> Any ideas of where I am going wrong?
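One thing that stands out in the loop quoted above: rstrip() returns a new string, and that stripped line is never written to output, so nothing gets joined. Also, if text still comes from read(), each 'line' is a single character, len(line) > 1 is never true, and every character is written back unchanged, which would explain getting the same data out. Here is a minimal sketch of one way to join the continuation lines, written for current Python. The function name join_continuations and the rule for hyphens (drop the hyphen and glue the two word halves together) are my assumptions, not something from your code:

```python
def join_continuations(lines, markers=";,-"):
    """Merge lines that end in a continuation marker with the line that follows.

    Hypothetical helper: the name and the hyphen rule are assumptions.
    """
    joined = []
    pending = ""  # text collected so far for the current entry
    for line in lines:
        text = line.rstrip("\r\n")  # drop only the line ending
        if text and text[-1] in markers:
            if text[-1] == "-":
                # Assumed rule: a trailing hyphen splits a word, so drop it.
                pending += text[:-1]
            else:
                # A trailing comma or semicolon: keep it and add a space.
                pending += text + " "
        else:
            # No marker: this line completes the entry.
            joined.append(pending + text)
            pending = ""
    if pending:
        # The input ended in the middle of an entry; keep what we have.
        joined.append(pending.rstrip())
    return joined


sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\n",
    "and Capes, assignors.)\n",
    "A-N Company, The. (See Alexander and Nasb, as-\n",
    "signors.)\n",  # assumed second half of the hyphenated entry
]
for entry in join_continuations(sample):
    print(entry)
```

Note that this expects a list of lines (for example from readlines()), not the single string that read() returns.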
> On Mar 21, 2007, at 1:41 AM, Luke Paireepinart wrote:
>>> # The next 5 lines are so I have an idea of how many lines I started with in the file.
>>> in_filename = raw_input('What is the COMPLETE name of the file you want to open: ')
>>> in_file = open(in_filename, 'r')
>>> text = in_file.read()
>> read() returns a single string containing all the data, not a list
>> with one element per line.
>> Use readlines() for this functionality.
>> (E.g., for a file with contents 'hello\nhoware\nyou?', read() would
>> return exactly that string, but
>> readlines() would return ['hello\n','howare\n','you?'].)
>>> num_lines = text.count('\n')
>> or just len(text) if you're using readlines()
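To make the difference concrete, here is a small sketch for current Python. It uses io.StringIO as a stand-in for a real file, which is my choice so the example runs without anything on disk:

```python
import io

# Stand-in for an open file with the contents from the example above.
sample = io.StringIO('hello\nhoware\nyou?')

as_string = sample.read()      # one string holding the whole file
sample.seek(0)                 # rewind so the data can be read again
as_lines = sample.readlines()  # a list with one string per line

newline_count = as_string.count('\n')
line_count = len(as_lines)
```

Here newline_count is 2 but line_count is 3, because the last line has no trailing newline. The two ways of counting only agree when the file ends with a newline.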
>>> print 'There are', num_lines, 'lines in the file', in_filename
>>> output = open("cleandata.txt","a") # file for writing data to after stripping newline character
>> You might want to open this file in 'write' mode while you're
>> testing, so previous test results don't confuse you.
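A short sketch of why the mode matters, for current Python. The temporary directory and the "result" text are made up for the illustration:

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), "cleandata.txt")

# Two test runs in append mode: the second run adds to the first.
for run in range(2):
    with open(path, "a") as output:
        output.write("result\n")
with open(path) as check:
    after_append = check.read()

# Two test runs in write mode: each run starts from an empty file.
for run in range(2):
    with open(path, "w") as output:
        output.write("result\n")
with open(path) as check:
    after_write = check.read()
```

After the append runs the file holds the line twice; after the write runs it holds it once.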
>>> # read file, copying each line to new file
>>> for line in text:
>> Since read() returns a string, you're looping over every
>> character in the file, not every line.
>>>     if line[:-1] in '-':
>> Because your 'line' variable only contains a single character here,
>> line[:-1] is the empty string, and '' in '-' is always True, so this
>> test passes for every character rather than only for '-'.
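A quick way to see what that loop is actually working with, as a sketch for current Python with a made-up four-character string standing in for the file contents:

```python
text = "as-\n"  # stand-in for what read() returned

# Iterating over a string yields one character at a time.
pieces = [line for line in text]

# Slicing off the last character of a 1-character string leaves '',
# and the empty string counts as being "in" any string.
slice_checks = [line[:-1] in '-' for line in text]

# Comparing the last character directly picks out only the hyphen.
last_char_checks = [line[-1] == '-' for line in text]
```

With real lines from readlines(), the last character is usually the newline, so a test for a trailing hyphen has to look at line[-2] or strip the line ending first.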
>>>         line = line.rstrip()
>>>     else: output.write(line)
>>> print "Data written to cleandata.txt."
>>> # close the files
>>> The above ran with no errors, gave me the number of lines in my
>>> original file but then when I opened the cleandata.txt file
>>> I got:
>>> A.-C.䴀愀渀甀昀愀挀琀甀爀椀渀最 �Company.⠀匀攀攀�Sebastian,䄀⸀�A.,
>>> �and 䌀愀瀀攀猀Ⰰ�assignors.)�A.䜀⸀�A.刀愀椀氀眀愀礀 �Light☀�Signal䌀
>>> 漀⸀� (See䴀攀搀攀渀Ⰰ�Elof�H
>>> 攀 �Alexander愀渀搀�Nasb,愀猀ⴀ�猀椀最渀漀爀猀⸀㬀�䄀一�Company,吀栀攀
>>> ⸀� (See一愀猀栀Ⰰ�It.䨀⸀Ⰰ�and䄀氀攀砀愀渀搀攀爀Ⰰ�as-�
>> Not sure what caused all of those characters.
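One guess, and it is only a guess since I have not seen the file: the stray characters look like what you get when UTF-16 text is decoded with the wrong byte order. The sketch below (current Python) takes the word 'Manufacturing', encodes it as little-endian UTF-16, and decodes it as big-endian, which reproduces the first run of odd characters in your output. That your file is UTF-16 is an assumption on my part:

```python
original = "Manufacturing"

# Encode as UTF-16 with the low byte first (little-endian)...
raw_bytes = original.encode("utf-16-le")

# ...then decode as if the high byte came first (big-endian).
misread = raw_bytes.decode("utf-16-be")

# Reversing the two steps recovers the original word.
recovered = misread.encode("utf-16-be").decode("utf-16-le")

print(misread)
print(recovered)
```

If that is the cause, opening the input with an explicit encoding may give you clean text, e.g. codecs.open(in_filename, 'r', 'utf-16') in Python 2 or open(in_filename, encoding='utf-16') in Python 3.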