[Tutor] removing line ends from Word text files

David Rock david at graniteweb.com
Sat Jul 17 18:54:33 CEST 2004


* Michael Janssen <Janssen at rz.uni-frankfurt.de> [2004-07-17 15:55]:
> On Fri, 9 Jul 2004, Christian Meesters wrote:
> 
> > Right now I have the problem that I want to remove the MS Word line end
> > token from text files: When saving a text file as 'text only' line ends
> > are displayed as '^M' in a shell (SGI IRIX (tcsh) and Mac (tcsh or
> > bash)). I want to get rid of these elements for further processing of
> > the file and have no idea how to access them in a Python script. Any
> > idea how to replace the '^M' against a simple '\n'? (I already tried
> > '\r\n' and various other combinations of characters, but apparently all
> > aren't '^M'.) '^M' is one character.
> 
> You can always ask Python when you want to know how it will represent
> this character: Read one line with "readline" and print its repr-string:
> 
> fo = open("filename")
> line = fo.readline()
> print repr(line)
> 
> repr gives you an alternative string representation of any object. repr
> used on strings shows escape sequences like \n or \r literally instead
> of interpreting them. As you are on a Mac, I would guess your newline
> character is a simple "\r".
> 
> You can also ask Python for the character's ordinal:
> print ord(line[-2]) # just in case one newline consists of two chars
> print ord(line[-1])
> 
> It's probably best to do such investigations with an interactive Python
> session. But now since I've realized that readline is Unix-only, I don't
> think interactive mode is that much fun on MAC/Win: without readline you
> can't repeat your commands (without having to type them again and again).
> You can't use the cursor keys. Perhaps Idle offers elaborate line editing
> even on those systems.

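OK, a couple of things...
The readline() file method is NOT a Unix-only thing; I just tried it on
my XP box and it's fine. (The separate readline module that gives the
interactive shell its line editing is a different matter.) open is also
an older way of opening files; as of 2.2, file is probably what you want.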

http://www.python.org/doc/current/lib/built-in-funcs.html#l2h-25

and for the sake of completeness, here is the info about built-in file
objects:
http://www.python.org/doc/current/lib/bltin-file-objects.html

So this:
fo = open("filename")
line = fo.readline()
print repr(line)

becomes this:
fo = file("filename")
line = fo.readline()
print repr(line)
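
For anyone trying this on a current Python 3 interpreter, here is a
rough sketch of the same inspection. It assumes Python 3 (where print is
a function and file() no longer exists, so open is used), and the helper
name first_line_repr is made up for illustration. Binary mode keeps
Python from translating line ends, so a raw '\r' stays visible:

```python
def first_line_repr(path, limit=80):
    # Hypothetical helper: return repr() of the first `limit` raw bytes.
    # 'rb' means binary mode, so no newline translation takes place.
    # read() is used instead of readline() because a file that only
    # contains '\r' line ends would come back as one single long "line".
    with open(path, "rb") as fo:
        chunk = fo.read(limit)
    return repr(chunk)
```

Calling print(first_line_repr("filename")) should then show any \r or
\r\n exactly as stored in the file.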

As for interactive Python, I have recently been introduced to ipython
and it's great. It has a LOT of features that aren't in the normal
shell:
http://ipython.scipy.org/

And finally, ^M is decimal 13 (hex 0D), \n is 10, and \r is 13 ...
hmm, I guess that means ^M == \r
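
A quick way to check that for yourself, written as a small sketch for
Python 3 (the variable name ctrl_m is only for illustration). It relies
on caret notation, where ^X stands for the control character whose code
is 64 below that of the letter X:

```python
# Caret notation: ^M is the character 64 below the letter 'M'.
ctrl_m = chr(ord("M") - 64)

print(ord(ctrl_m))      # 13
print(ctrl_m == "\r")   # True
print(ord("\n"))        # 10
```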

One thing that I have used over the years to strip newline chars off
lines is this. It's not the prettiest, but you'll get the idea:

	# only strip a character that is actually at the end of the line
	if line.endswith('\n'):
		line = line[:-1]
	if line.endswith('\r'):
		line = line[:-1]

Basically, it assumes (in the case of Windows) that each line ends
with '\r\n', and strips the two characters off one at a time.
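
To get back to the original question (turning '^M' into a plain '\n'),
the same idea can be sketched as two small functions. This is only one
possible approach, written for Python 3, and the names strip_line_end
and normalize_newlines are invented for this example:

```python
def strip_line_end(line):
    # Remove at most one trailing line end, whichever style it is.
    # The two-character Windows ending is checked first.
    for ending in ("\r\n", "\n", "\r"):
        if line.endswith(ending):
            return line[:-len(ending)]
    return line


def normalize_newlines(text):
    # Replace Windows (CR LF) and old Mac (CR) line ends with LF.
    # Order matters: CR LF goes first so it does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For example, normalize_newlines("a\rb\r\nc\n") gives "a\nb\nc\n". On
Python 3, opening the file with open(path, newline="") hands the text
over with its original line ends intact, so there is still something
for normalize_newlines to replace.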

-- 
David Rock
david at graniteweb.com