[Tutor] removing line ends from Word text files (continued)

Sat Jul 17 21:40:55 CEST 2004

(Continuing - the earlier post was an accident)

On Sat, 2004-07-17 at 12:54, David Rock wrote:
> * Michael Janssen <Janssen at rz.uni-frankfurt.de> [2004-07-17 15:55]:
> > On Fri, 9 Jul 2004, Christian Meesters wrote:
> > 
> > > Right now I have the problem that I want to remove the MS Word line end
> > > token from text files: When saving a text file as 'text only' line ends
> > > are displayed as '^M' in a shell (SGI IRIX (tcsh) and Mac (tcsh or
> > > bash)). I want to get rid of these elements for further processing of
> > > the file and have no idea how to access them in a Python script. Any
(snipped)
> > 
> > You can always ask Python when you want to know how it will represent
> > this character: Read one line with "readline" and print its repr-string:
> > 
> > fo = open("filename")
> > line = fo.readline()
> > print repr(line)
> > 
> > repr gives you an alternative string representation of any object. Used
> > on a string, repr shows control characters as backslash escapes such as
> > \n or \r instead of printing them. As you are on a Mac, I would guess
> > your newline character is a simple "\r".
> > 
> > You can also ask Python for the character's ordinal:
> > print ord(line[-2]) # just in case one newline consists of two chars
> > print ord(line[-1])
> > 
> > It's probably best to do such investigations with an interactive Python
> > session. But now since I've realized that readline is Unix-only, I don't
> > think interactive mode is that much fun on MAC/Win: without readline you
> > can't repeat your commands (without having to type them again and again).
> > You can't use the cursor keys. Perhaps Idle offers elaborate line editing
> > even on those systems.
> 
> OK, a couple things... 
> readline is NOT a Unix-only thing. I just tried it on my XP box and it's
> fine. open is also an older way of doing things with opening files, as
> of 2.2, file is probably what you want.

I too was shifting from open(...) to file(...), however, Guido is
recommending a change to the documentation and continued use of open.
http://mail.python.org/pipermail/python-dev/2004-July/045931.html

> 
> http://www.python.org/doc/current/lib/built-in-funcs.html#l2h-25
> 
> and for the sake of completeness, here is the info about built-in file
> objects:
> http://www.python.org/doc/current/lib/bltin-file-objects.html
> 
(snipped)
> 
> as for interactive Python, I have recently been introduced to ipython
> and it's great. It has a LOT of features that aren't in the normal
> shell:
> http://ipython.scipy.org/
> 
> And finally, ^M is decimal 13 (hex 0D), \n is 10, and \r is 13 ...
> hmm, I guess that means ^M == \r
> 
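A quick check in the interpreter confirms the arithmetic. This is only a
sketch in Python 3 syntax (an assumption; the snippets quoted above use the
Python 2 print statement), and `sample` is a made-up line, not data from the
original file.

```python
# Ctrl-M, carriage return, and "\r" are all the same character: code 13.
CR = "\r"
LF = "\n"

assert ord(CR) == 13 == 0x0D
assert ord(LF) == 10 == 0x0A
assert chr(13) == CR == "\x0d"

# Hypothetical line, ended the way a classic Mac "text only" export might be.
sample = "first line\r"
last_code = ord(sample[-1])
print(repr(sample), last_code)
```
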
> One thing that I have used over the years to strip newline chars off
> lines is this, it's not the prettiest, but you'll get the idea:
> 
> 	if '\n' in line:
> 		line = line[:-1]
> 	if '\r' in line:
> 		line = line[:-1]
I think
	while line and line[-1] in "\n\r":
		line = line[:-1]

is much less risky, depending on the source of the file. (The "line and"
guard keeps the loop from raising IndexError once the line is empty.)

Most of the time
	line = line.strip()	# rstrip would do only trailing white space

will do what you want.  However, it strips ALL leading and trailing
white space characters, not just the \r and \n at the end of the line.
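
If only the line-end characters should go, passing them to rstrip as an
argument is one option. A minimal sketch in Python 3 syntax; the helper name
`chomp` and the sample strings are made up for illustration.

```python
def chomp(line):
    """Remove trailing CR/LF characters only (hypothetical helper name)."""
    # rstrip with an argument strips any trailing run of the given characters
    # and leaves spaces, tabs, and all leading whitespace alone.
    return line.rstrip("\r\n")


# Made-up sample lines: Windows, classic Mac, Unix, empty, newline only.
examples = ["  indented text  \r\n", "mac line\r", "unix line\n", "", "\n"]
cleaned = [chomp(s) for s in examples]

for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

Unlike indexing with line[-1], this is also safe on an empty string.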

> 
> basically, it's assuming (in the case of Windows) that the file ends
> with '\r\n', and strips them off one at a time.
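For cleaning a whole file rather than one line at a time, another approach is
to read the bytes and replace the line ends in one pass. This is a rough
sketch in Python 3 syntax, not a tested tool: the function names and the file
names are invented for the example, and it assumes the file is small enough
to read into memory.

```python
import os
import tempfile


def normalize_newlines(data):
    """Convert CRLF (Windows) and bare CR (classic Mac) line ends to LF."""
    # Replace CRLF first so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    # Binary mode keeps Python from translating line ends on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(normalize_newlines(data))


# Demo on a throwaway file with mixed line ends (made-up content).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "word_export.txt")
dst = os.path.join(tmpdir, "cleaned.txt")
with open(src, "wb") as f:
    f.write(b"one\r\ntwo\rthree\n")

convert_file(src, dst)
with open(dst, "rb") as f:
    result = f.read()
print(repr(result))
```
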
-- 

Lloyd Kvam
Venix Corp.
1 Court Street, Suite 378
Lebanon, NH 03766-1358

voice:	603-653-8139
fax:	801-459-9582