[Tutor] parsing text

Jay Mutter III jmutter at uakron.edu
Sun Mar 25 00:25:10 CET 2007


Kent, thanks for this; I was clearly confused about the difference
between a string and a list of strings.
I am, however, still having difficulty solving a related problem.

I have the following text:

Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City,  
Mo.    Automatic display-sign.    No. 1,330 411-Apr. 13 ; v. 273 ; p.  
193.
Barnett,  John  II..  Tettenhall,  England.     Seat  of   
motorcars.    No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett, Otto  
R.    (See Scott, John M., assignor.)
Barnett. Otto R.     (See Sponenburg, Hiram H., assignor)
Barnett, William A., Lincoln. Nebr.    Attachment for garment- 
turning   machines.     No.   1,342,937;   June   8 ?   v 270 ; p. 313."
Barnhart, Clarence D., Brooklyn, assignor to W. S. Rockwell Company,  
New York. N. Y.    Conveyer for furnaces No. 1.333.371 ; Mar. 9 ; v.  
272 ; p. 278.
Barnhart, Clarence v., Waynesboro, Pa., assignor to J. K. Hoffman and  
W. M. Raeclitel.  Hagerstowu, Md.     Seed-planter.    No. 1,357.43S:  
Nov. 2; v. 280: p. 45.
Barnhart, John E.    (See Haves, J. P.. and Barnhart )
Barnhart,-Mollie E.    (See Freeman. Alpheus J., assignor) Barnhill,  
E. B., and J. Stone, Indianapolis, Ind.    Auto-tire 477513

1.) When I call readlines() to create a list and then print the list,
there is a blank line between every line of text.
2.) In the second line, after "p. 487." there is the beginning of a new
line of data, only it isn't on a new line.
I tried string.replace(s,'p.','\n') in an attempt to put a CR in, but
it just put the characters \n in the string.

Ideas?

Thanks again

jay



Jay Mutter III wrote:
 > Thanks for the response
 > Actually the number of lines this returns is the same number of lines
 > given when i put it in a text editor (TextWrangler).
 > Luke had mentioned the same thing earlier but when I do change read to
 > readlines  i get the following
 >
 >
 > Traceback (most recent call last):
 >   File "extract_companies.py", line 17, in ?
 >     count = len(text.splitlines())
 > AttributeError: 'list' object has no attribute 'splitlines'

I think maybe you are confused about the difference between "all the
text of a file in a single string" and "all the lines of a file in a
list of strings."

When you open() a file and read() the contents, you get all the text of
a file in a single string. len() will give you the length of the string
(the total file size) and iterating over the string gives you one
character at a time.

Here is an example of a string:
In [1]: s = 'This is text'
In [2]: len(s)
Out[2]: 12
In [3]: for i in s:
     ...:     print i
     ...:
     ...:
T
h
i
s

i
s

t
e
x
t

On the other hand, if you open() the file and then readlines() from the
file, the result is a list of strings, each of which is the contents of
one line of the file, up to and including the newline. len() of the list
is the number of lines in the list, and iterating the list gives each
line in turn.

Here is an example of a list of strings:
In [4]: l = [ 'line1', 'line2' ]
In [5]: len(l)
Out[5]: 2
In [6]: for i in l:
     ...:     print i
     ...:
     ...:
line1
line2

Notice that s and l are *used* exactly the same way with len() and for,
but the results are different.
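
The same distinction shows up with file objects. Here is a minimal sketch,
assuming Python 3 (where print is a function) and using io.StringIO as a
stand-in for a real file opened with open(). The sample text and the variable
names are made up for illustration:

```python
import io

# Stand-in for the contents of a real file (an assumption for this sketch);
# io.StringIO behaves like an open text file.
sample = "line1\nline2\nline3\n"

text = io.StringIO(sample).read()        # one string holding all the text
lines = io.StringIO(sample).readlines()  # a list of strings, one per line

print(type(text).__name__, len(text))    # len() counts characters
print(type(lines).__name__, len(lines))  # len() counts lines

# splitlines() is a string method, so it exists on text but not on lines.
# Calling lines.splitlines() raises the AttributeError from the traceback.
print(hasattr(text, "splitlines"), hasattr(lines, "splitlines"))
```

So after changing read() to readlines(), the call to splitlines() is no longer
needed; len() of the list is already the line count.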

As a further wrinkle, there are two easy ways to get all the lines in a
file and they give slightly different results.

open(...).readlines() returns a list of lines in the file and each line
includes the final newline if it was in the file. (The last line will
not include a newline if the last line of the file did not.)

open(...).read().splitlines() also gives a list of lines in the file,
but the newlines are not included.
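
To make the difference concrete, here is a small sketch, again assuming
Python 3 and an io.StringIO stand-in for the file (the sample text is
invented). It also suggests why printing lines from readlines() can show a
blank line after each one: the line already ends in a newline and print adds
its own.

```python
import io

# Invented sample; note that the last line has no trailing newline.
sample = "first\nsecond\nlast line has no newline"

with_newlines = io.StringIO(sample).readlines()
without_newlines = io.StringIO(sample).read().splitlines()

print(with_newlines)     # each item except the last ends with '\n'
print(without_newlines)  # no trailing newlines

# Stripping the trailing newline before printing avoids doubled line breaks.
for line in with_newlines:
    print(line.rstrip("\n"))
```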

HTH,
Kent


