Jack Jansen wrote:
- On input, unix line-endings are now acceptable for all text files. This
is an experimental feature (awaiting a general solution, for which a
PEP has been promised but not started yet, the guilty parties know who
they are:-), and it can be turned off with a preference.
I don't know if I qualify as one of the "guilty" parties, but I did
volunteer to help with a PEP about this, and I'd still like to. I do
have some ideas about what I'd like to see in that PEP.
The one thing I have done is write a prototype in pure Python for how I
would like platform neutral text files to work. I've enclosed it with
this message, and invite comments.
Has anyone started this PEP yet? If so, I'd like to help; if not, then
the following is a very early draft of my thoughts. Note that I am
writing this from memory, without going back to the archives to see
what all the comments were at the time. I will do that before I call
this a PEP.
Here are my quick thoughts:
This started (the recent thread, anyway) with the need for MacPython
(with the introduction of OS-X) to be able to read both traditional mac
style text files and unix style text files. An import hook was
suggested, but then it was brought up that a lot of python code can be
read in other ways than an import, from execfile() and a whole lot of
others, so an import hook would not be enough. In general, the problem
stems from the fact that while Python knows what system it is running
on, a file that is being read may or may not be on that same system.
This is most egregious with OS-X, as you essentially have both Unix and
MacOS running on the same machine at the same time, often sharing a file
system. The issue also comes up with heterogeneous networks, where the
file might reside on a server running on a different system than Python,
and that file may be accessed by various systems. Some servers can do
line feed translation on the fly, but this is not universal or reliable.
In addition to Python code, many Python programs need to read and write
text files that are not in a native format, and the format may not be
known by the programmer when the code is written.
My proposed solution to these problems is to have a new type of file: a
"Universal" text file. This would be a text file that would do line-ending
translation to the internal representation on the fly as the file was
being read (like the current text file type), but it would translate any
of the known text file formats automatically ("\r\n", "\r", "\n"; any
others?). When the file was being written to, a single terminator
would have to be specified, defaulting to the native one, or, in the case
of a file opened for appending, perhaps the one already in the file when
it is opened. The user could specify a non-native terminator when opening
a file for writing.
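The translation on read is simple in principle; here is a minimal sketch (mine, not the prototype's exact code) that maps any of the three conventions to "\n":

```python
def to_internal(text):
    """Translate any of the known line endings (\\r\\n, \\r, \\n) to \\n.

    \\r\\n must be collapsed first; otherwise a DOS file would yield
    doubled newlines once the bare \\r is translated afterwards.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(to_internal("a\r\nb\rc\n")))  # 'a\nb\nc\n' regardless of source platform
```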
The two big issues that came up in the discussion were backward
compatibility and performance:
1) The python open() function currently defaults to a text file type.
However, on Posix systems, there is no difference between a text file
and a binary file, so many programmers writing code that is designed to
run only on such systems left the "b" flag off when opening files for
binary reading and writing. If the behaviour of a file opened without
the binary flag were to change, a lot of code would break.
2) In recent versions of Python, a lot of effort was put into improving
performance of line oriented text file reading. These optimisations
require the use of native line endings. In order to get similar
performance with non-native endings, some portions of the C stdio
library would have to be re-written. This is a major undertaking, and no
one has stepped up to volunteer.
The proposed solution to both of these problems is to introduce a new
flag to the open() function: "t". If the "t" flag is present, the
function returns a Universal Text File, rather than a standard text
file. As this is a new flag, no old code should be broken. The default
would return a standard text file with the current behaviour. This would
allow the implementation to be written in a way that was robust, but
perhaps not have optimum performance. If performance were critical, a
programmer could always use the old style text file. If, at some point,
code is written that allows the performance of Universal Text Files to
approach that of standard text files, perhaps the two could be merged.
It is unfortunate that the default would be the performance-optimised
but less generally useful case, but that is a reasonable price to be
paid for backward compatibility. Perhaps the default could be changed at
some point in the future when other incompatibilities are introduced.
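Typical use under this proposal might look like the following. This is hypothetical, since the "t" flag does not exist yet; the stand-in function `open_universal` is my invention, used here only to make the intended semantics of `open(name, "rt")` concrete:

```python
def open_universal(filename):
    """Stand-in for the proposed open(filename, "rt"): read the file in
    binary mode and translate all three line-ending conventions to \\n."""
    f = open(filename, "rb")
    data = f.read().decode("ascii")
    f.close()
    return data.replace("\r\n", "\n").replace("\r", "\n")

# Old code is untouched: open(name) still returns a standard text file,
# while open(name, "rt") would return the new Universal Text File.
```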
In the case of Python code being read, performance of the file read is
unlikely to be critical to the performance of the application as a
whole.
Issues / questions:
Some systems (VMS?) store text files in the file system as a series of
lines, rather than just a string of bytes like most common systems
today. It would take a little more code to accommodate this, but it
could be done.
Should a file being read be required to have a single line termination
type, or could they be mixed and matched? The prototype code allows mix
and match, but I'm not married to that idea. If it requires a single
terminator, then some performance could be gained by checking the
terminator type when opening the file, and using the existing native
text file code when it is a native file.
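If a single terminator were required, the check when opening could be as simple as sniffing the first chunk of the file; a rough sketch (the function name and probe size are my invention):

```python
def sniff_line_ending(data, probe=1024):
    """Guess the line terminator from the first chunk of a file's bytes.

    Returns "\\r\\n", "\\r", "\\n", or None if no terminator was seen
    within the probe window.
    """
    chunk = data[:probe]
    nl = chunk.find(b"\n")
    cr = chunk.find(b"\r")
    if cr != -1 and nl == cr + 1:           # \r immediately before \n: DOS
        return "\r\n"
    if cr != -1 and (nl == -1 or cr < nl):  # bare \r comes first: Mac
        return "\r"
    if nl != -1:                            # bare \n comes first: Posix
        return "\n"
    return None

print(repr(sniff_line_ending(b"first line\r\nsecond\r\n")))  # '\r\n'
```

A native result could then dispatch to the existing, fast text file code; anything else would get the Universal Text File path.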
I'd love to hear all your feedback on this write-up, as well as my code.
Please either CC me or the MacPython list, as I'm not subscribed to
ChrisHBarker@home.net
Oil Spill Modeling
Water Resources Engineering
Coastal and Fluvial Hydrodynamics
"""
TextFile.py : a module that provides a UniversalTextFile class, and a
replacement for the native python "open" command that provides an
interface to that class.

It would usually be used as:

from TextFile import open

then you can use the new open just like the old one (with some added
flags and arguments):

file = TextFile.open(filename, flags, [bufsize], [LineEndingType], [LineBufferSize])

please send bug reports, helpful hints, and/or feature requests to:

Copyright/licence is the same as whatever version of python you are
running.
"""

import os
## Re-map the open function
_OrigOpen = open

def open(filename, flags="", bufsize=-1, LineEndingType="native", LineBufferSize=""):
    """
    A new open function that returns a regular python file object for
    the old calls, and returns a new nifty universal text file when
    requested.

    This works just like the regular open command, except that a new
    flag and a new parameter have been added.

    The new flag is "t", which indicates that the file to be opened is a
    universal text file. While the standard open() function defaults to
    a text file, on Posix systems there is no difference between a text
    file and a binary file, so there is a lot of code out there that opens
    files as text when a binary file is really required. This code
    currently works just fine on Posix systems, so it was necessary to
    introduce a new flag to maintain backward compatibility. The old
    style, line ending dependent text file will also provide better
    performance.

    file = open(filename, flags = "", bufsize = -1, LineEndingType = "native")

    - filename is the name of the file to be opened
    - flags is a string of one letter flags, the same as the standard open
      command, plus a "t" for universal text file.
      - "b" means binary file; this returns the standard binary file object
      - "t" means universal text file
      - "r" for read only
      - "w" for write. If there is both a "w" and a "t", the user can
        specify a line ending type to be used with the LineEndingType
        parameter.
      - "a" means append to existing file
    - bufsize specifies the buffer size to be used by the system. Same
      as the regular open function.
    - LineEndingType is used only for writing (and appending) files, to
      specify a non-native line ending to be written.
      - The options are: "native", "DOS", "Posix", "Unix", "Mac", or the
        characters themselves ("\r\n", etc.). "native" will result in
        using the standard file object, which uses whatever is native
        for the system that python is running on.
    - LineBufferSize is the size of the buffer used to read data in
      a readline() operation. The default is currently set to 200
      characters. If you will be reading files with many lines over 200
      characters long, you should set this number to the largest expected
      line length.

    NOTE: I'm sure the flag checking could be more robust.
    """
    if "t" in flags:  # this is a universal text file
        if ("w" in flags) and ("w+" not in flags) and LineEndingType == "native":
            # writing with native line endings: the standard file object
            # does the right thing, and faster
            return _OrigOpen(filename, flags.replace("t", ""), bufsize)
        return UniversalTextFile(filename, flags, LineEndingType, LineBufferSize)
    else:  # this is a regular old file
        return _OrigOpen(filename, flags, bufsize)
class UniversalTextFile:
    """
    A class that acts just like a python file object, but has a mode
    that allows the reading of arbitrarily formatted text files, i.e.
    with either Unix, DOS or Mac line endings [\n, \r\n, or \r].

    To keep it truly universal, it checks for each of these line ending
    possibilities at every line, so it should work on a file with mixed
    endings as well.
    """
    def __init__(self, filename, flags="", LineEndingType="native", LineBufferSize=""):
        self._file = _OrigOpen(filename, flags.replace("t", "") + "b")

        LineEndingType = LineEndingType.lower()
        if LineEndingType == "native":
            self.LineSep = os.linesep
        elif LineEndingType == "dos":
            self.LineSep = "\r\n"
        elif LineEndingType == "posix" or LineEndingType == "unix":
            self.LineSep = "\n"
        elif LineEndingType == "mac":
            self.LineSep = "\r"
        else:
            self.LineSep = LineEndingType

        ## some attributes
        self.closed = 0
        self.mode = flags
        self.softspace = 0
        if LineBufferSize:
            self._BufferSize = LineBufferSize
        else:
            self._BufferSize = 200
    def readline(self):
        start_pos = self._file.tell()
        ##print "Current file position is:", start_pos
        line = ""
        TotalBytes = 0
        Buffer = self._file.read(self._BufferSize)
        while not TotalBytes:
            ##print "Buffer = ", repr(Buffer)
            newline_pos = Buffer.find("\n")
            return_pos = Buffer.find("\r")
            if return_pos == newline_pos - 1 and return_pos >= 0:  # we have a DOS line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = newline_pos + 1
            elif ((return_pos < newline_pos) or newline_pos < 0) and return_pos >= 0:  # we have a Mac line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = return_pos + 1
            elif newline_pos >= 0:  # we have a Posix line
                line = Buffer[:newline_pos] + "\n"
                TotalBytes = newline_pos + 1
            else:  # we need a larger buffer
                NewBuffer = self._file.read(self._BufferSize)
                if NewBuffer:
                    Buffer = Buffer + NewBuffer
                else:  # we are at the end of the file, without a line ending.
                    self._file.seek(start_pos + len(Buffer))
                    return Buffer
        self._file.seek(start_pos + TotalBytes)
        return line
    def readlines(self, sizehint=None):
        """
        readlines acts like the regular readlines, except that it
        understands any of the standard text file line endings ("\r\n",
        "\n", "\r").

        If sizehint is used, it will read a maximum of that many
        bytes. It will never round up, as the regular readlines sometimes
        does. This means that if your buffer size is less than the
        length of the next line, you'll get an empty list, which could
        incorrectly be interpreted as the end of the file.
        """
        if sizehint:
            Data = self._file.read(sizehint)
        else:
            Data = self._file.read()
        if len(Data) == sizehint:
            ##print "The buffer is full"
            FullBuffer = 1
        else:
            FullBuffer = 0
        Data = Data.replace("\r\n", "\n").replace("\r", "\n")
        Lines = [line + "\n" for line in Data.split("\n")]
        ## If the last line is only a linefeed it is an extra line
        if Lines[-1] == "\n":
            del Lines[-1]
        ## if it isn't, then either it's the end of the buffer, and the
        ## partial line should be pushed back, or the last line didn't
        ## have a linefeed, so we need to remove the one we put on.
        elif FullBuffer:
            self._file.seek(-(len(Lines[-1]) - 1), 1)  # reset the file position
            del Lines[-1]
        else:
            Lines[-1] = Lines[-1][:-1]
        return Lines
    def readnumlines(self, NumLines=1):
        """
        readnumlines is an extension to the standard file object. It
        returns a list containing the number of lines that are
        requested. I have found this to be very useful, and it allows me
        to avoid the many loops like:

        lines = []
        for i in range(N):
            lines.append(file.readline())

        Also, if I ever get around to writing this in C, it will
        provide a speed improvement.
        """
        Lines = []
        while len(Lines) < NumLines:
            Line = self.readline()
            if not Line:
                break  # end of file
            Lines.append(Line)
        return Lines
    def read(self, size=None):
        """
        read acts like the regular read, except that it translates any
        of the standard text file line endings ("\r\n", "\n", "\r")
        into a "\n".

        If size is used, it will read a maximum of that many bytes,
        before translation. This means that if the line endings have
        more than one character, the size returned will be smaller. This
        could be fixed, but it didn't seem worth it. If you want that
        much control, use a binary file.
        """
        if size:
            Data = self._file.read(size)
        else:
            Data = self._file.read()
        return Data.replace("\r\n", "\n").replace("\r", "\n")
    def write(self, string):
        """
        write is just like the regular one, except that it uses the line
        separator specified when the file was opened for writing or
        appending.
        """
        self._file.write(string.replace("\n", self.LineSep))

    def writelines(self, list):
        for line in list:
            self.write(line)

    # The rest of the standard file methods mapped
    def close(self):
        self._file.close()
        self.closed = 1

    def seek(self, offset, whence=0):
        self._file.seek(offset, whence)