Import hook to do end-of-line conversion?

[Oops, try again] There's talk on the PythonMac-SIG to create an import hook that would read modules with either \r, \n or \r\n newlines and convert them to the local convention before feeding them to the rest of the import machinery. The reason this has become interesting is the mixed unixness/macness of MacOS X, where such an import hook could be used to share a Python tree between MacPython and bsd-Python. They would only need a different site.py (probably), living somewhere near the head of sys.path, that would be in the local end-of-line convention and enable the hook. However, it seems that such a module would have a much more general scope, for instance if you're accessing Samba partitions from Windows, or other foreign file systems, etc. Does this sound like a good idea? And (even better:-) has anyone done this already? Would it be of enough interest to include it in the core Lib? -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++ www.oratrix.nl/~jack | ++++ see http://www.xs4all.nl/~tank/ ++++
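A minimal sketch of the conversion step such a hook would perform (present-day Python; the module name "spam.py" and the helper name are made up, not code from this thread):

    def normalize_newlines(source):
        # fold \r\n first, then any remaining bare \r, so every convention ends up as \n
        return source.replace("\r\n", "\n").replace("\r", "\n")

    # the hook would then feed the normalized text to the usual import machinery,
    # e.g. by compiling it:
    source = open("spam.py", "rb").read().decode("utf-8")
    code = compile(normalize_newlines(source), "spam.py", "exec")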

On Sat, Apr 07, 2001 at 06:25:52PM +0200, Fredrik Lundh wrote:
Exactly. That is where the correct fix should go. The compiler can/should recognize all types of newlines as the NEWLINE token. Cheers, -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
But if we only fix the compiler, we'll get complaints that other things don't work, e.g. bogus tracebacks due to a non-fixed linecache.py, broken IDEs, etc. By the way, I can't seem to think of any examples that would break after such a change. I mean, who would depend on a \n text file with embedded \r's? Just

The same goes for file objects in text mode...
Yes.
probably -- but changing that can break stuff (in theory, at least), and may require a PEP. Changing the compiler is more of a bugfix, really...
Yes.
Yes.
On Unix, currently, tell() always gives you a number that exactly matches the number of characters you've read since the beginning of the file. This would no longer be true. In general, code written on Unix with no expectation of ever leaving Unix can currently be sloppy about using binary mode, and open binary files in text mode. Such code could break. I'm sure there's plenty of such code around (none written by me :-). --Guido van Rossum (home page: http://www.python.org/~guido/)
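A small illustration of the offset problem (present-day Python, made-up data; it only simulates what a translating text mode would do):

    raw = b"spam\r\neggs\r\n"            # 12 bytes on disk
    seen = raw.replace(b"\r\n", b"\n")   # what a translating text mode would hand back
    print(len(raw), len(seen))           # 12 vs 10: character counts no longer match tell()
    # the same shrinkage silently corrupts binary data opened in text mode:
    # every 0x0D 0x0A pair collapses, and every lone 0x0D would become 0x0A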

Guido:
Maybe there should be a third mode, "extremely text mode", which Python-source-processing utilities (and anything else which wants to be cross-platform-line-ending-friendly) can use. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

I know that it's too late for 2.1, but for 2.2, I think we can do better: like Java, the import mechanism should accept all three line ending conventions on all platforms! It would also be nice if opening a file in text mode did this transformation, but alas, that would probably require more work on the file object than I care for. But import should be doable! --Guido van Rossum (home page: http://www.python.org/~guido/)

As Guido said, Java defines that source-code lines end with any of LF, CR, or CRLF, and that needn't even be consistent across lines. If source files are opened in C binary mode, this is easy enough to do but puts all the burden for line-end detection on Python. Opening source files in C text mode doesn't solve the problem either. For example, if you open a source file with CR endings in Windows C text mode, Windows thinks the entire file is "one line". I expect the same is true if CR files are opened in Unix text mode. So, in the end, binary mode appears to be better (more uniform code). But then what happens under oddball systems like OpenVMS, which seem to use radically different file structures for text and binary data? I've no idea what happens if you try to open a text file in binary mode under those. [Guido]
Well, Python source files aren't *just* read by "the compiler" in Python. For example, assorted tools in the std library analyze Python source files via opening as ordinary (Python) text files, and the runtime traceback mechanism opens Python source files in (C) text mode too. For that stuff to work correctly regardless of line ends is lots of work in lots of places. In the end I bet it would be easier to replace all direct references to C textfile operations with a "Python text file" abstraction layer. importing-is-only-the-start-of-the-battle-ly y'rs - tim

Jack Jansen <jack@oratrix.nl>:
read modules with either \r, \n or \r\n newlines [...] Does this sound like a good idea?
YES! It's always annoyed me that the Mac (seemingly without good reason) complains about sources with \n line endings. I have often shuttled code between Mac and Unix systems during development, and having to do \r/\n translations every time is a royal pain.
Would it be of enough interest to include it in the core Lib?
I'd vote for building it right into the interpreter! Is there any reason why anyone would want *not* to have it? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

I'd vote for building it right into the interpreter! Is there any reason why anyone would want *not* to have it?
No, but (as has been explained) fixing the parser isn't enough -- all tools dealing with source would have to be fixed. Or we would have to write our own C-level file object, which has its own drawbacks. --Guido van Rossum (home page: http://www.python.org/~guido/)

I doubt that we could use anything that was done for another language, because everybody who codes this kind of thing makes it do exactly what their environment needs, e.g. in terms of error handling API, functionality, and performance.
What are the drawbacks?? (besides the below example)
The drawbacks aren't so much technical (I have a pretty good idea of how to build such a thing), they're political and psychological. There's the need for supporting the old way of doing things for years, there's the need for making it easy to convert existing code to the new way, there's the need to be no slower than the old solution, there's the need to be at least as portable as the old solution (which may mean implementing it *on top of* stdio since on some systems that's all you've got).
It would be one way towards that goal. But notice that we've already gotten most of the way there with the recent readline changes in 2.1. --Guido van Rossum (home page: http://www.python.org/~guido/)

Proposal for 2.2, outline for a PEP?

1) The Python file object needs to be modified so that in text mode it can recognize all major line ending conventions (Unix, Win and Mac).

Reading data:
- recognize \n, \r and \r\n as line endings, present them as \n to Python

Writing data:
- convert \n to the platform line ending (this is already the case; see the sketch after this message)

This modification should be _optional_, because it may break code under Unix (insert Guido's explanation here), and because it may not support oddball systems like OpenVMS. It should be _on_ by default under:
- Windows
- MacPython Classic
- MacPython Carbon
- Unix Python under MacOS X / Darwin

It should probably be off by default on all other systems (I think a compile-time switch is good enough). Maybe if we advertize the potential sloppy-unix-code-breakage loudly enough we can make the feature mandatory in a later release; however, I don't see a practical way of issuing warnings for the situation.

2) I assume there are quite a few places where Python uses raw C text files: these places should be identified, and we should figure out how much work it is to fix them so they behave just like the Python file object as described above.

Who would like to team up with me to write a decent PEP and maybe an example implementation?

Just
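A tiny sketch of the write-side rule in 1) (present-day Python; the function name and the use of os.linesep are illustrative, not part of the proposal text):

    import os

    def write_text(binfile, text, linesep=os.linesep):
        # binfile is opened in binary mode; \n in the Python-level text is
        # translated to the platform (or explicitly requested) line ending
        binfile.write(text.replace("\n", linesep).encode("utf-8"))

    # e.g.: write_text(open("out.txt", "wb"), "one\ntwo\n")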

Just van Rossum: ...I don't see a practical way of issuing warnings for the situation.
It should be on by default for the Python interpreter reading Python programs, as making it off by default leads to the inability to run programs written with Windows or Mac tools on Unix, which was the problem reported by 'dsavitsk' on comp.lang.python. If it is going to be off by default, then the error message should include "Rerun with -f to fix this error". Neil

Neil Hodgson wrote:
Yes, but as was mentioned before: this will lead to other problems for which we wouldn't have a good excuse: any program printing a traceback with the traceback module will output bogus data if linecache.py reads the source files incorrectly. And that's just one example. I don't think the two features should be switchable separately. Maybe it should be on by default, provided we have a command line switch to turn the new behavior *off*, just like there used to be a command line switch to revert to string based exceptions. Just

Just van Rossum wrote:
Proposal for 2.2, outline for a PEP?
Thanks, Just, for getting this rolling.
I agree that it should be possible to turn the proposed behavior off, but I still think it should be on by default, even on *nix systems (which is mostly what I use, by the way), as it would only cause a problem for "sloppy" code anyway. Would it be possible to have it be turned on/off at runtime, rather than at compile time? It would be pretty awkward to have a program need a specific version of the interpreter to run. Even a command line flag could be awkward: only the main program could specify the flag, and modules might not be compatible.

Another option is for the new version to have another flag or set of flags to the open command, which would indicate that the file being opened is "Unix", "Mac", "DOS", or "Any". This would make it easy to write text files in a non-native format, as well as read them. Even if we didn't go that far, we could use the "t" flag (analogous to "b" for binary) to specify the universal text format, and the default would still be the current, native format. This would keep the "sloppy" *nix code from breaking, and still give full functionality to new code.

While we are at it, what would get written is something we need to consider. If we just have the above proposal, reading a file would work great: it could be on a server with a different line ending format, and that would be transparent. Writing, on the other hand, is an issue. If a program is running on a Windows box, and writing a file on a *nix server, what kind of line ending should it write? Would it even know what the native format is on the server? It seems we would need to be able to specify the line ending format explicitly for writing.

Just a few thoughts, maybe we'll get a PEP out of this after all! -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Chris Barker <chrishbarker@home.net>:
Yes, I think that's the best that can be done. To do any better would require all file servers to be aware of the text/binary distinction and be willing to translate, and for there to be some way for the Python file object to communicate to the OS which mode is intended. Neither of these things are true in general. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

You might need to be able to specify a specific line ending format, but there should also be a default -- and it should be the default appropriate to the OS. So, \n on Unix, \r\n on Windows, \r on Mac running in "Mac mode", and \n on MacOS X running in "Unix mode". --Guido van Rossum (home page: http://www.python.org/~guido/)

At 21:41 -0500 4/9/01, Guido van Rossum wrote:
Is it the same in Mac OS X when reading a file from a UFS volume as from an HFS(+) volume? Only if the underlying libraries make it so. (Typing in Mac OS X, but I don't have any UFS volumes lying around.) It's a little scary to contemplate that reading two different files, which happen to be on the same disk spindle, will behave differently for the file on the HFS+ volume than for the file on the UFS volume. [There are perhaps similar issues for our Linux friends who mount Windows volumes.] Whatever happened to "move text files to another system using FTP in ASCII mode?" Ah, yes...it probably died of Unicode. --John (there may not be any answers for this) Baxter -- John Baxter jwblist@olympus.net Port Ludlow, WA, USA

[me]
[JW Baxter]
Is it the same in Mac OS X when reading a file from a UFS volume as from an HFS(+) volume?
I'm not sure that the volume from which you're *reading* could or should have any influence on the default delimiter used for *writing*. The volume you're *writing* to might, if it's easy to determine -- but personally, I'd be happy with a default set at compile time.
Anyway, disk spindles are the wrong abstraction level to consider here. Who cares about what spindle your files are on?
Whatever happened to "move text files to another system using FTP in ASCII mode?" Ah, yes...it probably died of Unicode.
No, obviously it's cross-platform disk sharing. The first time this came up was when it became possible to mount Unix volumes on NT boxes many years ago, and that's when Python's parser (eventually) grew the habit of silently ignoring a \r just before a \n in a source file. It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
I know I shouldn't bite, but I find this a very childish remark, Guido! It's also not true... Here's an excerpt from a private thread between me, Jack and Guido. It's dated January 8, 1996; I remember I was just learning Python. (I'll give a translation below.) """
(Ik neem aan dat je bedoelt files met '\n' in plaats van '\r' als line separator.)
Hmm, ik weet niet of ik dit een goed idee vindt. Weet je wat: vraag eens wat Guido er van vind (met een cc-tje naar mij).
Geen goed idee, tenzij de C stdio library dit automatisch doet (kennelijk niet dus). Het is over het algemeel een kleine moeite dit bij het file transport recht te trekken (ftp in text mode etc.). """ Translation: """ [Just]
[Guido] (I take it you mean files with '\n' instead of '\r' as line separator.) [Jack]
Hm, I don't know whether I think this is a good idea. You know what, ask Guido what he thinks (and cc me).
[Guido] Not a good idea, unless the C stdio library does this automatically (apparently it doesn't). In general it's a small effort to correct this during the file transport (ftp in text mode etc.). """ So it's not that the problem wasn't there, it was just not taken very seriously at the time... Just

Guido van Rossum wrote:
No, obviously it's cross-platform disk sharing. The first time this came up was when it became possible to mount Unix volumes on NT boxes
I'm sure it came up before that, I know it has for me, and I don't happen to do any cross platform disk sharing. It is just a little more soluble if you aren't doing disk sharing.
many years ago, and that's when Python's parser (eventually) grew the habit of silently ignoring a \r just before a \n in a source file.
It can do that? I had no idea. Probably because I work on the Mac and Linux almost exclusively, and hardly ever encounter a Windows box.
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
Actually it's a sign of how *nix/Windows focused Python is. It's sad to see that someone thought to fix the problem for *nix/Windows, and didn't even consider the Mac (as Just pointed out, the problem has been known for a long time). Frankly, it's also a symptom of the isolationist attitude of a lot of Mac users/developers. And don't get me started on the spaces vs. tabs thing! Just, are you planning on putting together a PEP from all of this? I'd really like to see this happen! -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

[Guido]
[Chris Barker]
It can do that? I had no idea. Probably because I work on the Mac and Linux almost exclusively, and hardly ever encounter a Windows box.
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
This is a reversal of history. The code to ignore \r when seeing \r\n originally (1995) applied to *all* platforms. I don't know why, but Jack submitted a patch to disable this behavior only when "#ifdef macintosh", in revision 2.29 of Parser/tokenizer.c, about 4 years ago. The #ifdef introduced then still exists today; 3 lines introduced by that patch start with XXX here for clarity (appropriately defined <wink>):

XXX #ifndef macintosh
        /* replace "\r\n" with "\n" */
XXX     /* For Mac we leave the \r, giving a syntax error */
        pt = tok->inp - 2;
        if (pt >= tok->buf && *pt == '\r') {
            *pt++ = '\n';
            *pt = '\0';
            tok->inp = pt;
        }
XXX #endif

I have no idea what Mac C libraries return for text-mode reads. They must convert \r to \n, right? In which case I guess any \r remaining *should* be "an error" (but where would it come from, if the C library converts all \r thingies?). Do they leave \n alone? Etc: submit a patch that makes the code above "work", and I'm sure it would be accepted, but a non-Mac person can't guess what's needed. As to "considering the Mac", guilty as charged: I don't know anything about it. What's to consider? How often do you consider the impact of changes on, say, OpenVMS? Same thing, provided you're as ignorant of it as I am of your system.
The std for distributed Python code is 4-space indents, no hard tab characters. So there's nothing left there to get started on <wink>. it's-not-that-we-don't-want-to-"fix"-macs-it's-that-we-don't-know- how-macs-work-or-what-"fix"-*means*-to-a-macizoid-ly y'rs - tim

Tim Peters wrote:
Interesting, I didn't know that. Jack's on holiday now, so he won't be able to comment for a while.
I have no idea what Mac C libraries return for text-mode reads. They must convert \r to \n, right?
Yes.
Nope: \r's get translated to \n and for whatever reason \n's get translated to \r... So when opening a unix file on the Mac, it will look like it has \r line endings and when opening a Windows text file on the Mac, it will appear as if it has \n\r line endings...
That's probably easy enough -- although it would require changing all tokenizer code that looks for \n to also look for \r, including PyOS_ReadLine(), so it goes well beyond the snippet you posted. And then there's the Python file object... Just

Just van Rossum <just@letterror.com>:
Unless you're using the MPW compiler, which swaps the meanings of \r and \n in the source instead! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

[Just van Rossum]
Then it's probably a Good Thing Jack disabled this code, since it wouldn't have done anything useful on a Mac anyway (for Python to ever see \r\n the source file would have had to contain \n\r, which is nobody's text file convention).
No, there's nothing wrong with the tokenizer code: it's coded in C, and the C text convention is that lines end with \n, period. Reliance on that convention is ubiquitous -- and properly so. What we need instead are platform-specific implementations of fgets() functionality, which deliver lines containing \n where and only where the platform Python is supposed to believe a line ends. Then nothing else in the parser needs to be touched (and, indeed, the current \r\n mini-hack could be thrown away).
And then there's the Python file object...
Different issue. If this ever gets that far, note that the crunch to speed up line-at-a-time file input ended up *requiring* use of the native fgets() on Windows, as that was the only way on that platform to avoid having the OS do layers of expensive multithreading locks for each character read. So there's no efficient way in general to get Windows to recognize \r line endings short of implementing our own stdio from the ground up. On other platforms, fileobject.c's get_line() reads one character at a time, and I expect its test for "is this an EOL char?" could be liberalized at reasonable cost. OTOH, how does the new-fangled Mac OS fit into all this? Perhaps, for compatibility, their C libraries already recognize both Unix and Mac Classic line conventions, and deliver plain \n endings for both? Or did they blow that part too <wink>?

I expect that the right solution here is indeed to write our own stdio-like library from the ground up. That can solve any number of problems: telling how many characters are buffered (so you don't have to use unbuffered mode when using select or poll), platform-independent line end recognition, and super-efficient readline() to boot. But it's a lot of work, and won't be compatible with existing extensions that use FILE* (not too many I believe). --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
We also have the old http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=210821 complaining that use of FILE* in our C API can make it impossible to (in that fellow's case) write an app in Borland C++ on Windows that tries to use those API functions (cuz Borland's FILE* is incompatible with MS's FILE*). I'm not sure the best solution to *that* is to give them a FILE* that's incompatible with everyone's, though <wink>
But it's a lot of work, and won't be compatible with existing extensions that use FILE* (not too many I believe).
I'm more concerned about the "lot of work" part, with which I agree. OTOH, Plauger's book "The Standard C Library" contains source code for every library required by C89. He reported that implementing libm took him twice as long as everything else combined. But those who haven't written a libm will be prone to take a wrong lesson from that <wink>. it's-not-that-i/o-is-easy-despite-that-his-libm-code-isn't-production-quality-ly y'rs - tim

I don't get it: why would a thin layer on top of stdio be bad? Seems much less work than reimplementing stdio.
Because by layering stuff you lose performance. Example: fgets() is often implemented in a way that is faster than you can ever do yourself with portable code. (Because fgets() can peek in the buffer and see if there's a \n somewhere ahead, using memcmp() -- and if this succeeds, it can use memcpy(). You can't do that yourself -- only the stdio implementation can.) And this is not a hypothetical situation -- Tim used fgets() for a significant speed-up of readline() in 2.1. But if we want to use our own line end convention, we can't use fgets() any more, so we lose big. --Guido van Rossum (home page: http://www.python.org/~guido/)

[ re: various remarks about layering on stdio ] Has anybody looked at sfio ? I used it long ago for other reasons -- for a while the distribution seemed to have disappeared from att ( or maybe I just couldn't find it on netlib ), but I just did a google search and found that there is a new distribution: sfio2000: http://www.research.att.com/sw/tools/sfio/ I haven't looked at the package or the code for a LONG time & I don't know how portable it is, but it has some nice features and advantages -- if you're at the point of considering rewriting stdio it might be worth looking at. -- Steve Majewski

Steven D. Majewski wrote:
[ re: various remarks about layering on stdio ]
Has anybody looked at sfio ?
That reminds me of QIO, the stdio replacement in INN, which has already been ported to Python. -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 http://www.rahul.net/aahz Androgynous poly kinky vanilla queer het I don't really mind a person having the last whine, but I do mind someone else having the last self-righteous whine.

[Steven D. Majewski]
Did just now. Only runs on Unix boxes, so would be a heavyweight way to solve line-end problems across platforms that don't have any <wink>. Possible to run it on Windows, but only on top of the commercial UWIN Unix emulation package (http://www.research.att.com/sw/tools/uwin/). They didn't mention Macs at all. The papers should be worth reading for anyone intending to tackle this, though.

[Guido]
Well, people said "we couldn't" use fgets() for get_line() either, because Python strings can contain embedded nulls but fgets() doesn't tell you how many bytes it read and makes up null bytes of its own. But I have 200 lines of excruciating code in fileobject.c that proved them excruciatingly wrong <wink>. The same kind of excruciating crap could almost certainly be used to search for alternative line endings on top of fgets() too. We would have to layer our own buffer on top of the hidden platform buffer to get away with this, because while fgets() will stop at the first \n it sees, there's no way to ask it to stop at any other character (so in general fgets() would "over-read" when looking for a non-native line-end, and we'd have to save the excess in our own buffer). Hard to say how much that would cost. I think it surprised everyone (incl. me!) that even with all the extra buffer-filling and buffer-searching the fgets() hackery does, that method was at worst a wash with the getc_unlocked() method on all platforms tried. In any case, the fgets() hack is only *needed* on Windows, so every other platform could just make get_line()'s character-at-a-time loop search for more end conditions. This can't be impossible <wink>. s/\r\n?/\n/g-ly y'rs - tim
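Roughly what "layer our own buffer on top of the hidden platform buffer" looks like, sketched in present-day Python with readline() standing in for fgets(); the class name and details are illustrative only, not Python's actual get_line():

    import io

    class PushbackLineReader:
        # wraps a binary file; the native readline() stops only at \n, so when a
        # line really ended at a bare \r we save the over-read tail for the next call
        def __init__(self, fileobj):
            self._f = fileobj
            self._excess = b""

        def readline(self):
            buf = self._excess or self._f.readline()
            self._excess = b""
            if not buf:
                return b""                      # EOF
            cr = buf.find(b"\r")
            if cr < 0:
                return buf                      # ordinary \n-terminated (or final) line
            rest = buf[cr + 1:]
            if rest.startswith(b"\n"):          # \r\n: swallow both characters
                rest = rest[1:]
            self._excess = rest                 # the excess, saved in our own buffer
            return buf[:cr] + b"\n"

    r = PushbackLineReader(io.BytesIO(b"mac\runix\ndos\r\n"))
    print([r.readline() for _ in range(3)])     # [b'mac\n', b'unix\n', b'dos\n']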

I understand now that I simply don't have enough clue about the implementation to even try to be involved with this. Unless it makes sense to have a PEP that doesn't touch the implementation at all (doubtful, IMHO), I'll take back my offer to write one. I still think it's an important issue, but it's simply beyond what I can deal with. To solve the issues on MacOS X, maybe it's enough to hack the Carbon version of stdio so it can handle Unix text files. That way we can simply settle for Unix line endings if sharing code between BSD Python and Carbon Python is desired. At the same time this would allow using CVS under Darwin for MacPython sources, which is something I look forward to... Just

Just van Rossum wrote:
Please write the results of this discussion up as a PEP. PEPs don't necessarily have to provide an implementation of what is covered; it sometimes simply suffices to start out with a summary of the discussions that have been going on. Then someone may pick up the threads from there and possibly find a solution which will then get implemented.
AFAIR, this discussion was about handling line endings in Python source code. There have been discussions about turning the tokenizer into a Unicode based machine. We could then use the Unicode tools to do line separations. I don't know why this thread led to tweaking stdio -- after all we only need a solution for the Python tokenizer and not a general purpose stdio abstraction of text files, unless I'm missing something here... -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
Aaaaaaaaaaaargh! ;-) Here we go again: fixing the tokenizer is great and all, but then what about all tools that read source files line by line? E.g. linecache.py, all IDEs, etc. etc. As Tim wrote a while back: importing-is-only-the-start-of-the-battle So no, we don't "only need a solution for the Python tokenizer"... Just

Just van Rossum wrote:
<grin> I'll repeat my question of yesterday: is there any reason why we couldn't start with QIO? I did some checking after I sent that out, and QIO claims that it can be configured to recognize different kinds of line endings. QIO is claimed to be 2-3 times faster than Python 1.5.2; don't know how that compares to 2.x. [the previous message was sent to python-dev only; this time I'm including pythonmac-sig] -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 http://www.rahul.net/aahz Androgynous poly kinky vanilla queer het I don't really mind a person having the last whine, but I do mind someone else having the last self-righteous whine.

[MAL]
I don't know why this thread lead to tweaking stdio -- after all we only need a solution for the Python tokenizer ...
[Just]
Note that this is why the topic needs a PEP: nothing here is new; the same debates reoccur every time it comes up. [Aahz]
It can be, yes, but in the same sense as Awk/Perl paragraph mode: you can tell it to consider any string (not just single character) as meaning "end of the line", but it's a *fixed* string per invocation. What people want *here* is more the ability to recognize the regular expression \r\n?|\n as ending a line, and QIO can't do that directly (as currently written). And MAL probably wants Unicode line-end detection: http://www.unicode.org/unicode/reports/tr13/
QIO is claimed to be 2-3 times faster than Python 1.5.2; don't know how that compares to 2.x.
The bulk of that was due to QIO avoiding per-character thread locks. 2.1 avoids them too, so most of QIO's speed advantage should be gone now. But QIO's internals could certainly be faster than they are (this is obscure because QIO.readline() has so many optional behaviors that the maze of if-tests makes it hard to see the speed-crucial bits; studying Perl's line-reading code is a better model, because Perl's speed-crucial inner loop has no non-essential operations -- Perl makes the *surrounding* code sort out the optional bits, instead of bogging down the loop with them).

Tim Peters wrote:
Right.
Right ;-)
Just curious: for the applications which Just has in mind, reading source code line-by-line is not really needed. Wouldn't it suffice to read the whole file, split it into lines and then let the tools process the resulting list of lines? Maybe a naive approach, but one which will most certainly work on all platforms without having to replace stdio... -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/
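For what it's worth, the split step itself needs nothing new: splitlines() already treats \r, \n and \r\n as line boundaries (a quick illustration in present-day Python, with made-up data):

    text = "mac\runix\ndos\r\n"
    print(text.splitlines())        # ['mac', 'unix', 'dos'] -- all three conventions recognized
    print(text.splitlines(True))    # ['mac\r', 'unix\n', 'dos\r\n'] -- keep terminators if a tool needs them
    # (on str it also splits on a few extra Unicode line boundaries, which ties
    # into the Unicode line separation MAL mentions)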

M.-A. Lemburg wrote:
The point is to let existing tools work with all line end conventions *without* changing the tools. Whether this means replacing stdio I still don't know <wink>, but it sure means changing the behavior of the Python file object in text mode. Just

Just van Rossum wrote:
See... that's why we need a PEP on these things ;-) Seriously, I thought that you were only talking about being able to work on Python code from different platforms in a network (e.g. code is shared by a Windows box and development takes place on a Mac). Now it seems that you want to go for the full Monty :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
So no, we don't "only need a solution for the Python tokenizer"...
See... that's why we need a PEP on these things ;-)
Agreed. I'll try to write one, once I'm feeling better: having the flu doesn't seem to help focussing on actual content... Just

Just van Rossum wrote:
Just (or anyone else) Have you made any progress on this PEP? I'd like to see it happen, so if you haven't done it, I'll try to find the time to make a start on it myself. I have written a simple class that implements a line-ending-neutral text file class. I wrote it because I have a need for it, and I thought it would be a reasonable prototype for any syntax and methods we might want to use in an actual implementation. I doubt anyone would find the methods I used particularly clean or elegant (or fast) but it's the first thing I've come up with, and it seems to work. I've enclosed the module with this email. If that doesn't work, let me know and I'll put it on a website. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

#!/usr/bin/env python

"""
TextFile.py : a module that provides a UniversalTextFile class, and a
replacement for the native python "open" command that provides an
interface to that class.

It would usually be used as:

    from TextFile import open

then you can use the new open just like the old one (with some added
flags and arguments), or:

    import TextFile
    file = TextFile.open(filename, flags, [bufsize], [LineEndingType],
                         [LineBufferSize])
"""

import os

## Re-map the open function
_OrigOpen = open


def open(filename, flags="", bufsize=-1, LineEndingType="native",
         LineBufferSize=""):
    """
    A new open function that returns a regular python file object for
    the old calls, and returns a new nifty universal text file when
    required.

    This works just like the regular open command, except that a new
    flag and new parameters have been added.

    Call: file = open(filename, flags="", bufsize=-1, LineEndingType="")

    - filename is the name of the file to be opened.
    - flags is a string of one letter flags, the same as the standard
      open command, plus a "t" for universal text file:
      - "b" means binary file; this returns the standard binary file object
      - "t" means universal text file
      - "r" for read only
      - "w" for write. If there are both "w" and "t" then the user can
        specify a line ending type to be used with the LineEndingType
        parameter.
      - "a" means append to existing file
    - bufsize specifies the buffer size to be used by the system. Same
      as the regular open function.
    - LineEndingType is used only for writing (and appending) files, to
      specify a non-native line ending to be written.
      - The options are: "native", "DOS", "Posix", "Unix", "Mac", or the
        characters themselves ("\r\n", etc.). "native" will result in
        using the standard file object, which uses whatever is native
        for the system that python is running on.
    - LineBufferSize is the size of the buffer used to read data in a
      readline() operation. The default is currently set to 100
      characters. If you will be reading files with many lines over 100
      characters long, you should set this number to the largest
      expected line length.
    """
    if "t" in flags:  # this is a universal text file
        if ("w" in flags or "a" in flags) and LineEndingType == "native":
            return _OrigOpen(filename, flags.replace("t", ""), bufsize)
        return UniversalTextFile(filename, flags, LineEndingType,
                                 LineBufferSize)
    else:  # this is a regular old file
        return _OrigOpen(filename, flags, bufsize)


class UniversalTextFile:
    """
    A class that acts just like a python file object, but has a mode
    that allows the reading of arbitrarily formatted text files, i.e.
    with either Unix, DOS or Mac line endings [\n, \r\n, or \r].

    To keep it truly universal, it checks for each of these line ending
    possibilities at every line, so it should work on a file with mixed
    endings as well.
    """

    def __init__(self, filename, flags="", LineEndingType="native",
                 LineBufferSize=""):
        self._file = _OrigOpen(filename, flags.replace("t", "") + "b")

        LineEndingType = LineEndingType.lower()
        if LineEndingType == "native":
            self.LineSep = os.linesep
        elif LineEndingType == "dos":
            self.LineSep = "\r\n"
        elif LineEndingType == "posix" or LineEndingType == "unix":
            self.LineSep = "\n"
        elif LineEndingType == "mac":
            self.LineSep = "\r"
        else:
            self.LineSep = LineEndingType

        ## some attributes
        self.closed = 0
        self.mode = flags
        self.softspace = 0
        if LineBufferSize:
            self._BufferSize = LineBufferSize
        else:
            self._BufferSize = 100

    def readline(self):
        start_pos = self._file.tell()
        ##print "Current file position is:", start_pos
        line = ""
        TotalBytes = 0
        Buffer = self._file.read(self._BufferSize)
        while Buffer:
            ##print "Buffer = ", repr(Buffer)
            newline_pos = Buffer.find("\n")
            return_pos = Buffer.find("\r")
            if return_pos == newline_pos - 1 and return_pos >= 0:
                # we have a DOS line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = newline_pos + 1
                break
            elif ((return_pos < newline_pos) or newline_pos < 0) and return_pos >= 0:
                # we have a Mac line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = return_pos + 1
                break
            elif newline_pos >= 0:
                # we have a Posix line
                line = Buffer[:newline_pos] + "\n"
                TotalBytes = newline_pos + 1
                break
            else:
                # we need a larger buffer
                NewBuffer = self._file.read(self._BufferSize)
                if NewBuffer:
                    Buffer = Buffer + NewBuffer
                else:
                    # we are at the end of the file, without a line ending.
                    self._file.seek(start_pos + len(Buffer))
                    return Buffer
        self._file.seek(start_pos + TotalBytes)
        return line

    def readlines(self, sizehint=None):
        """
        readlines acts like the regular readlines, except that it
        understands any of the standard text file line endings ("\r\n",
        "\n", "\r").

        If sizehint is used, it will read a maximum of that many bytes.
        It will not round up, as the regular readline does. This means
        that if your buffer size is less than the length of the next
        line, you won't get anything.
        """
        if sizehint:
            Data = self._file.read(sizehint)
        else:
            Data = self._file.read()

        if len(Data) == sizehint:
            #print "The buffer is full"
            FullBuffer = 1
        else:
            FullBuffer = 0

        Data = Data.replace("\r\n", "\n").replace("\r", "\n")
        Lines = [line + "\n" for line in Data.split('\n')]
        #print Lines
        ## If the last line is only a linefeed it is an extra line
        if Lines[-1] == "\n":
            del Lines[-1]
        ## if it isn't, then the last line didn't have a linefeed, so we need
        ## to remove the one we put on
        else:  ## or it's the end of the buffer
            if FullBuffer:
                #print "the file is at:", self._file.tell()
                #print "the last line has length:", len(Lines[-1])
                self._file.seek(-(len(Lines[-1]) - 1), 1)  # reset the file position
                del Lines[-1]
            else:
                Lines[-1] = Lines[-1][:-1]
        return Lines

    def readnumlines(self, NumLines=1):
        """
        readnumlines is an extension to the standard file object. It
        returns a list containing the number of lines that are
        requested. I have found this to be very useful, and it allows
        me to avoid the many loops like:

            lines = []
            for i in range(N):
                lines.append(file.readline())

        Also, if I ever get around to writing this in C, it will
        provide a speed improvement.
        """
        Lines = []
        while len(Lines) < NumLines:
            Lines.append(self.readline())
        return Lines

    def read(self, size=None):
        """
        read acts like the regular read, except that it translates any
        of the standard text file line endings ("\r\n", "\n", "\r")
        into a "\n".

        If size is used, it will read a maximum of that many bytes,
        before translation. This means that if the line endings have
        more than one character, the size returned will be smaller.
        This could be patched, but it didn't seem worth it. If you want
        that much control, use a binary file.
        """
        if size:
            Data = self._file.read(size)
        else:
            Data = self._file.read()
        return Data.replace("\r\n", "\n").replace("\r", "\n")

    def write(self, string):
        """
        write is just like the regular one, except that it uses the
        line separator specified when the file was opened for writing
        or appending.
        """
        self._file.write(string.replace("\n", self.LineSep))

    def writelines(self, list):
        for line in list:
            self.write(line)

    # The rest of the standard file methods mapped
    def close(self):
        self._file.close()
        self.closed = 1

    def flush(self):
        self._file.flush()

    def fileno(self):
        return self._file.fileno()

    def seek(self, offset, whence=0):
        self._file.seek(offset, whence)

    def tell(self):
        return self._file.tell()

Yesterday I found I had need for an end-of-line conversion import hook. I looked around but found none (did I miss some code on this thread?), so I whipped one up (below). It seems to do the job. If you see any goofs, gaffes or gotchas, or if you know of a better way to do this, please let me know. I will post this code to c.l.py in a few days for the enjoyment of all. -- David Goodger dgoodger@bigfoot.com Open-source projects: - The Go Tools Project: http://gotools.sourceforge.net - reStructuredText: http://structuredtext.sourceforge.net (soon!)

-----%<----------cut----------%<----------%<----------cut----------%<-----

# Import hook for end-of-line conversion,
# by David Goodger (dgoodger@bigfoot.com).
# Put in your sitecustomize.py, anywhere on sys.path, and you'll be able to
# import Python modules with any of Unix, Mac, or Windows line endings.

import ihooks, imp, py_compile

class MyHooks(ihooks.Hooks):

    def load_source(self, name, filename, file=None):
        """Compile source files with any line ending."""
        if file:
            file.close()
        py_compile.compile(filename)        # line ending conversion is in here
        cfile = open(filename + (__debug__ and 'c' or 'o'), 'rb')
        try:
            return self.load_compiled(name, filename, cfile)
        finally:
            cfile.close()

class MyModuleLoader(ihooks.ModuleLoader):

    def load_module(self, name, stuff):
        """Special-case package directory imports."""
        file, filename, (suff, mode, type) = stuff
        path = None
        if type == imp.PKG_DIRECTORY:
            stuff = self.find_module_in_dir("__init__", filename, 0)
            file = stuff[0]                 # package/__init__.py
            path = [filename]
        try:                                # let superclass handle the rest
            module = ihooks.ModuleLoader.load_module(self, name, stuff)
        finally:
            if file:
                file.close()
        if path:
            module.__path__ = path          # necessary for pkg.module imports
        return module

ihooks.ModuleImporter(MyModuleLoader(MyHooks())).install()

"M.-A. Lemburg" wrote:
Just, I second that. I really think this is a very useful improvement to Python, and I'd really like to see it happen. I am probably even more out of my depth than you when it comes to suggesting implementation, but I'd be glad to help with any parts of a PEP that I can. Guido van Rossum wrote:
Great! I have to say that it really seemed that someone must have produced an open source solution to this problem somewhere, and it turns out there is something Python related already. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

[Tim]
[Just van Rossum]
I don't get it: why would a thin layer on top of stdio be bad? Seems much less work than reimplementing stdio.
What does that question have to do with the snippet you quoted? In context, that snippet was saying that if you did write a small layer on top of stdio, one that just made \n show up when and only when you think Python should believe a line ends, then nothing in the tokenizer would need to change (except to call that layer instead of fgets()), and even the tokenizer's current \r\n mini-hack could be thrown away. If that's all you want, that's all it takes. If you want more than just that, you need more than just that (but I see Guido already explained that, and I explained too why the Windows Python cannot recognize \r endings with reasonable speed for *general* use short of building our own stdio -- but I don't really much care how fast the compiler runs, if all you want is the same limited level of hack afforded by the existing one-shot \r\n tokenizer trick -- and the compiler isn't using the *general*-case fileobject.c get_line() anyway). you-pay-for-what-you-want-and-the-more-you-want-the-more-you'll-pay-ly y'rs - tim

Chris Barker <chrishbarker@home.net>:
That's a good point. The only thing that could break is if you opened a non-Unix file in *text* mode, and then expected it to behave as though it had been opened in *binary* mode. I can't imagine any code being screwy enough to do that! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Greg Ewing wrote:
Actually, I thought about it more, and of course, Guido is right. On *nix, if you open a binary file in text mode, it works just fine, as there is no difference. However, under the proposed scheme, the text mode would translate "\r" into "\n", messing up your binary data. It would also do it only with a couple of particular byte values, so it might not be obvious that anything is wrong right away. I've done that myself, by mistake. I wrote a little tool that used FTP to transfer some binary files. It worked fine under Linux, but then I tried to run it on the Mac, and the files got corrupted. It took me WAY too long to figure out that I had read the file in text mode. Personally, I've always thought it was unfortunate that the default was text mode, rather than binary, or, even better, that there were no default: you'd have to specify either "b" or "t", and then there would be no room for confusion. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Chris Barker <chrishbarker@home.net>:
It took me WAY too long to figure out that I had read the file in text mode.
My favourite way of falling into that trap involves AUFS (the Appleshare Unix File Server). You're browsing the web on a Unix box and come across a juicy-looking Stuffit file. You download it into your AUFS directory, hop over to the Mac and feed it to Stuffit Expander, which promptly throws a wobbly. "Shazbot," you mutter, "it got corrupted in the download somehow." You try a couple more times, with the same result. You're just about to write to the web site maintainer telling them that their file is corrupt, when it dawns on you that: (a) AUFS performs CR/LF translation on files whose Mac type code is 'TEXT'; (b) Unix-created files default to type 'TEXT'. (Sorry, not really Python-related. Pretend you've implemented your Stuffit expander in Python...) Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

Actually, that *is* the scenario I'm worried about. Someone can open a GIF file in text mode today on a Unix platform and it'll just work (until they port the program to another platform, that is. ;-). So Unix weenies haven't had much of an incentive (or warning) about using binary mode properly. In text translation mode, if there happen to be bytes with values 0x0d in the file, they will be mangled. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Tue, 10 Apr 2001, Greg Ewing <greg@cosc.canterbury.ac.nz> wrote:
Then you've got another thing coming. Most UNIXers aren't aware that the 'b' modifier exists: open(file) opens the file, whether it is text or binary. -- "I'll be ex-DPL soon anyway so I'm |LUKE: Is Perl better than Python? looking for someplace else to grab power."|YODA: No...no... no. Quicker, -- Wichert Akkerman (on debian-private)| easier, more seductive. For public key, finger moshez@debian.org |http://www.{python,debian,gnu}.org

Disregard what I just said. The problem isn't about reading text files at all, it's about reading non-text files without explicitly opening them in binary mode. I think the trouble is with the idea that if you don't specify the mode explicitly it defaults to text mode, which on Unix just happens to be the same as binary mode. Could we change that so binary mode is the default on Unix, and if you want any line ending translation, you have to specify text mode explicitly? Is there any standard which says that text mode must be the default? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

What I said. :-)
It's pretty clear that the default is text mode. But we could add a new mode character, 't', to force text mode on. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, Apr 07, 2001 at 06:25:52PM +0200, Fredrik Lundh wrote:
Exactly. That is where the correct fix should go. The compile can/should recognize all types of newlines as the NEWLINE token. Cheers, -g -- Greg Stein, http://www.lyra.org/

Fredrik Lundh wrote:
But if we only fix the compiler, we'll get complaints that other things don't work, eg. bogus tracebacks due to a non-fixed linecache.py, broken IDE's, etc. Btw. I can't seem to think of any examples that would break after such a change. I mean, who would depend on a \n text file with embedded \r's? Just

The same goes for file objects in text mode...
Yes.
probably -- but changing can break stuff (in theory, at least), and may require a PEP. changing the compiler is more of a bugfix, really...
Yes.
Yes.
On Unix, currently, tell() always give you a number that exactly matches the number of characters you've read since the beginning of the file. This would no longer be true. In general, code written on Unix with no expectation to ever leave Unix, can currently be sloppy about using binary mode, and open binary files in text mode. Such code could break. I'm sure there's plenty such code around (none written by me :-). --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido:
Maybe there should be a third mode, "extremely text mode", which Python-source-processing utilities (and anything else which wants to be cross-platform-line-ending-friendly) can use. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

I know that it's too late for 2.1, but for 2.2, I think we can do better: like Java, the import mechanism should accept all three line ending conventions on all platforms! It would also be nice if opening a file in text mode did this transformation, but alas, that would probably require more work on the file object than I care for. But import should be doable! --Guido van Rossum (home page: http://www.python.org/~guido/)

As Guido said, Java defines that source-code lines end with any of LF, CR, or CRLF, and that needn't even be consistent across lines. If source files are opened in C binary mode, this is easy enough to do but puts all the burden for line-end detection on Python. Opening source files in C text mode doesn't solve the problem either. For example, if you open a source file with CR endings in Windows C text mode, Windows thinks the entire file is "one line". I expect the same is true if CR files are opened in Unix text mode. So, in the end, binary mode appears to be better (more uniform code). But then what happens under oddball systems like OpenVMS, which seem to use radically different file structures for text and binary data? I've no idea what happens if you try to open a text file in binary mode under those. [Guido]
Well, Python source files aren't *just* read by "the compiler" in Python. For example, assorted tools in the std library analyze Python source files via opening as ordinary (Python) text files, and the runtime traceback mechanism opens Python source files in (C) text mode too. For that stuff to work correctly regardless of line ends is lots of work in lots of places. In the end I bet it would be easier to replace all direct references to C textfile operations with a "Python text file" abstraction layer. importing-is-only-the-start-of-the-battle-ly y'rs - tim

Jack Jansen <jack@oratrix.nl>:
read modules with either \r, \n or \r\n newlines Does this sound like a good idea?
YES! It's always annoyed me that the Mac (seemingly without good reason) complains about sources with \n line endings. I have often shuttled code between Mac and Unix systems during development, and having to do \r/\n translations every time is a royal pain.
Would it be of enough interest to include it in the core Lib?
I'd vote for building it right into the interpreter! Is there any reason why anyone would want *not* to have it? Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

I'd vote for building it right into the interpreter! Is there any reason why anyone would want *not* to have it?
No, but (as has been explained) fixing the parser isn't enough -- all tools dealing with source would have to be fixed. Or we would have to write our own C-level file object, which has its own drawbacks. --Guido van Rossum (home page: http://www.python.org/~guido/)

I doubt that we could use anything that was done for another language, because everybody who codes this kind of thing makes it do exactly what their environment needs, e.g. in terms of error handling API, functionality, and performance.
What are the drawbacks?? (besides the below example)
The drawbacks aren't so much technical (I have a pretty good idea of how to build such a thing), they're political and psychological. There's the need for supporting the old way of doing things for years, there's the need for making it easy to convert existing code to the new way, there's the need to be no slower than the old solution, there's the need to be at least as portable as the old solution (which may mean implementing it *on top of* stdio since on some systems that's all you've got).
It would be one way towards that goal. But notice that we've already gotten most of the way there with the recent readline changes in 2.1. --Guido van Rossum (home page: http://www.python.org/~guido/)

Proposal for 2.2, outline for a PEP? 1) The Python file object needs to be modified so that in text mode it can recognize all major line ending conventions (Unix, Win and Mac). Reading data: - recognize \n, \r and \r\n as line ending, present as \n to Python Writing data: - convert \n to the platform line endings (this is already the case) This modification should be _optional_, because it may break code under unix (insert Guido's explanation here), and because it may not support oddball systems like OpenVMS. It should be _on_ by default under: - Windows - MacPython Classic - MacPython Carbon - Unix Python under MacOS X / Darwin It should probably be off by default on all other systems (I think a compile-time switch is good enough). Maybe if we advertize the potential sloppy-unix-code-breakage loud enough we can make the feature mandatory in a later release, however I don't see a practical way of issuing warnings for the situation. 2) I assume there are quite a few places where Python uses raw C text files: these places should be identified, we should figure out how much work it is to fix these so they behave just like the Python file object as described above. Who would like to team up with me to write a decent PEP and maybe an example implementation? Just

Just van Rossum: the
situation.
It should be on by default for the Python interpreter reading Python programs as making it off by default leads to the inability to run programs written with Windows or Mac tools on Unix which was the problem reported by 'dsavitsk' on comp.lang.python. If it is going to be off by default then the error message should include "Rerun with -f to fix this error". Neil

Neil Hodgson wrote:
Yes, but as was mentioned before: this will lead to other problems for which we wouldn't have a good excuse: any program printing a traceback with the traceback module will output bogus data if linecache.py reads the source files incorrectly. And that's just one example. I don't think the two features should be switchable separately. Maybe it should be on by default, provided we have a command line switch to turn the new behavior *off*, just like there used to be a command line switch to revert to string-based exceptions. Just

Just van Rossum wrote:
Proposal for 2.2, outline for a PEP?
Thanks, Just, for getting this rolling.
I agree that it should be possible to turn the proposed behavior off, but I still think it should be on by default, even on *nix systems (which is mostly what I use, by the way), as it would only cause a problem for "sloppy" code anyway.

Would it be possible to have it turned on/off at runtime, rather than at compile time? It would be pretty awkward to have a program need a specific version of the interpreter to run. Even a command line flag could be awkward: only the main program could specify the flag, and modules might not be compatible.

Another option is for the new version to have another flag or set of flags to the open command, which would indicate that the file being opened is "Unix", "Mac", "DOS", or "Any". This would make it easy to write text files in a non-native format, as well as read them. Even if we didn't go that far, we could use the "t" flag (analogous to "b" for binary) to specify the universal text format, and the default would still be the current, native format. This would keep the "sloppy" *nix code from breaking, and still give full functionality to new code.

While we are at it, what would get written is something we need to consider. If we just have the above proposal, reading a file would work great: it could be on a server with a different line ending format, and that would be transparent. Writing, on the other hand, is an issue. If a program is running on a Windows box, and writing a file on a *nix server, what kind of line ending should it write? Would it even know what the native format is on the server? It seems we would need to be able to specify the line ending format explicitly for writing.

Just a few thoughts, maybe we'll get a PEP out of this after all!

-Chris

-- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

Chris Barker <chrishbarker@home.net>:
Yes, I think that's the best that can be done. To do any better would require all file servers to be aware of the text/binary distinction and be willing to translate, and for there to be some way for the Python file object to communicate to the OS which mode is intended. Neither of these things is true in general. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

You might need to be able to specify a specific line ending format, but there should also be a default -- and it should be the default appropriate to the OS. So, \n on Unix, \r\n on Windows, \r on Mac running in "Mac mode", and \n on MacOS X running in "Unix mode". --Guido van Rossum (home page: http://www.python.org/~guido/)
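
For reference, the per-platform default described above is what os.linesep already holds; a trivial illustration of that mapping, not a proposed change:

import os

def write_line(f, text):
    # os.linesep is "\n" on Unix, "\r\n" on Windows, "\r" on classic MacPython.
    f.write(text + os.linesep)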

At 21:41 -0500 4/9/01, Guido van Rossum wrote:
Is it the same in Mac OS X when reading a file from a UFS volume as from an HFS(+) volume? Only if the underlying libraries make it so. (Typing in Mac OS X, but I don't have any UFS volumes lying around.) It's a little scary to contemplate that reading two different files, which happen to be on the same disk spindle, will behave differently for the file on the HFS+ volume than for the file on the UFS volume. [There are perhaps similar issues for our Linux friends who mount Windows volumes.] What ever happened to "move text files to another system using FTP in ASCII mode?" Ah, yes...it probably died of Unicode. --John (there may not be any answers for this) Baxter -- John Baxter jwblist@olympus.net Port Ludlow, WA, USA

[me]
[JW Baxter]
Is it the same in Mac OS X when reading a file from a UFS volume as from an HFS(+) volume?
I'm not sure that the volume from which you're *reading* could or should have any influence on the default delimiter used for *writing*. The volume you're *writing* to might, if it's easy to determine -- but personally, I'd be happy with a default set at compile time.
Anyway, disk spindles are the wrong abstraction level to consider here. Who cares about what spindle your files are on?
What ever happened to "move text files to another system using FTP in ASCII mode?" Ah, yes...it probably died of Unicode.
No, obviously it's cross-platform disk sharing. The first time this came up was when it became possible to mount Unix volumes on NT boxes many years ago, and that's when Python's parser (eventually) grew the habit of silently ignoring a \r just before a \n in a source file. It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
I know I shouldn't bite, but I find this a very childish remark, Guido! It's also not true... Here's an excerpt from a private thread between me, Jack and Guido. It's dated January 8, 1996; I remember I was just learning Python. (I'll give a translation below.) """
(Ik neem aan dat je bedoelt files met '\n' in plaats van '\r' als line separator.)
Hmm, ik weet niet of ik dit een goed idee vindt. Weet je wat: vraag eens wat Guido er van vind (met een cc-tje naar mij).
Geen goed idee, tenzij de C stdio library dit automatisch doet (kennelijk niet dus). Het is over het algemeel een kleine moeite dit bij het file transport recht te trekken (ftp in text mode etc.). """ Translation: """ [Just]
[Guido] (I take it you mean files with '\n' instead of '\r' as line separator.) [Jack]
Hm, I don't know whether I think this is a good idea. You know what, ask Guido what he thinks (and cc me).
[Guido] Not a good idea, unless the C stdio library does this automatically (apparently it doesn't). In general it's a small effort to correct this during the file transport (ftp in text mode etc.). """ So it's not that the problem wasn't there, it was just not taken very seriously at the time... Just

Guido van Rossum wrote:
No, obviously it's cross-platform disk sharing. The first time this came up was when it became possible to mount Unix volumes on NT boxes
I'm sure it came up before that, I know it has for me, and I don't happen to do any cross platform disk sharing. It is just a little more soluble if you aren't doing disk sharing.
many years ago, and that's when Python's parser (eventually) grew the habit of silently ignoring a \r just before a \n in a source file.
It can do that? I had no idea. Probably because I work on the Mac and Linux almost exclusively, and hardly ever encounter a Windows box.
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
Actually it's a sign of how *nix/Windows focused Python is. It's sad to see that someone thought to fix the problem for *nix/Windows, and didn't even consider the Mac (as Just pointed out, the problem has been known for a long time). Frankly, it's also a symptom of the isolationist attitude of a lot of Mac users/developers. And don't get me started on the spaces vs tabs thing!

Just, are you planning on putting together a PEP from all of this? I'd really like to see this happen!

-Chris

-- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

[Guido]
[Chris Barker]
It can do that? I had no idea. Probably because I work on the Mac and Linux almost exclusively, and hardly ever encounter a Windows box.
It's a sign of how backward the Mac world is that the problem only now pops up for the Mac. :-)
This is a reversal of history. The code to ignore \r when seeing \r\n originally (1995) applied to *all* platforms. I don't know why, but Jack submitted a patch to disable this behavior only when "#ifdef macintosh", in revision 2.29 of Parser/tokenizer.c, about 4 years ago. The #ifdef introduced then still exists today; the 3 lines introduced by that patch start with XXX here for clarity (appropriately defined <wink>):

XXX     #ifndef macintosh
                /* replace "\r\n" with "\n" */
XXX             /* For Mac we leave the \r, giving a syntax error */
                pt = tok->inp - 2;
                if (pt >= tok->buf && *pt == '\r') {
                        *pt++ = '\n';
                        *pt = '\0';
                        tok->inp = pt;
                }
XXX     #endif

I have no idea what Mac C libraries return for text-mode reads. They must convert \r to \n, right? In which case I guess any \r remaining *should* be "an error" (but where would it come from, if the C library converts all \r thingies?). Do they leave \n alone? Etc: submit a patch that makes the code above "work", and I'm sure it would be accepted, but a non-Mac person can't guess what's needed.

As to "considering the Mac", guilty as charged: I don't know anything about it. What's to consider? How often do you consider the impact of changes on, say, OpenVMS? Same thing, provided you're as ignorant of it as I am of your system.
The std for distributed Python code is 4-space indents, no hard tab characters. So there's nothing left there to get started on <wink>. it's-not-that-we-don't-want-to-"fix"-macs-it's-that-we-don't-know- how-macs-work-or-what-"fix"-*means*-to-a-macizoid-ly y'rs - tim

Tim Peters wrote:
Interesting, I didn't know that. Jack's on holiday now, so he won't be able to comment for a while.
I have no idea what Mac C libraries return for text-mode reads. They must convert \r to \n, right?
Yes.
Nope: \r's get translated to \n and for whatever reason \n's get translated to \r... So when opening a unix file on the Mac, it will look like it has \r line endings and when opening a Windows text file on the Mac, it will appear as if it has \n\r line endings...
That's probably easy enough -- although would require changing all tokenizer code that looks for \n to also look for \r, including PyOS_ReadLine(), so it goes well beyond the snippet you posted. And then there's the Python file object... Just

Just van Rossum <just@letterror.com>:
Unless you're using the MPW compiler, which swaps the meanings of \r and \n in the source instead! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

[Just van Rossum]
Then it's probably a Good Thing Jack disabled this code, since it wouldn't have done anything useful on a Mac anyway (for Python to ever see \r\n the source file would have had to contain \n\r, which is nobody's text file convention).
No, there's nothing wrong with the tokenizer code: it's coded in C, and the C text convention is that lines end with \n, period. Reliance on that convention is ubiquitous -- and properly so. What we need instead are platform-specific implementations of fgets() functionality, which deliver lines containing \n where and only where the platform Python is supposed to believe a line ends. Then nothing else in the parser needs to be touched (and, indeed, the current \r\n mini-hack could be thrown away).
And then there's the Python file object...
Different issue. If this ever gets that far, note that the crunch to speed up line-at-a-time file input ended up *requiring* use of the native fgets() on Windows, as that was the only way on that platform to avoid having the OS do layers of expensive multithreading locks for each character read. So there's no efficient way in general to get Windows to recognize \r line endings short of implementing our own stdio from the ground up.

On other platforms, fileobject.c's get_line() reads one character at a time, and I expect its test for "is this an EOL char?" could be liberalized at reasonable cost.

OTOH, how does the new-fangled Mac OS fit into all this? Perhaps, for compatibility, their C libraries already recognize both Unix and Mac Classic line conventions, and deliver plain \n endings for both? Or did they blow that part too <wink>?
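
A rough sketch of what such a "liberalized" end-of-line test might look like, transcribed into Python rather than the C of fileobject.c's get_line() (an illustration of the logic only, not a patch; f is a real, seekable file opened in binary mode):

def get_line(f):
    # Character-at-a-time reader that accepts \n, \r or \r\n endings and
    # always hands back a line terminated by a single "\n" (or "" at EOF).
    chars = []
    while 1:
        c = f.read(1)
        if c == "":
            break                     # EOF without a terminator
        if c == "\n":
            chars.append("\n")
            break
        if c == "\r":
            # Bare \r (Mac) or the first half of a DOS \r\n: peek one more
            # character and push it back unless it is the matching \n.
            nxt = f.read(1)
            if nxt != "" and nxt != "\n":
                f.seek(-1, 1)
            chars.append("\n")
            break
        chars.append(c)
    return "".join(chars)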

I expect that the right solution here is indeed to write our own stdio-like library from the ground up. That can solve any number of problems: telling how many characters are buffered (so you don't have to use unbuffered mode when using select or poll), platform-independent line end recognition, and super-efficient readline() to boot. But it's a lot of work, and won't be compatible with existing extensions that use FILE* (not too many I believe). --Guido van Rossum (home page: http://www.python.org/~guido/)

[Guido]
We also have the old http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=210821 complaining that use of FILE* in our C API can make it impossible to (in that fellow's case) write an app in Borland C++ on Windows that tries to use those API functions (cuz Borland's FILE* is incompatible with MS's FILE*). I'm not sure the best solution to *that* is to give them a FILE* that's incompatible with everyone's, though <wink>.
But it's a lot of work, and won't be compatible with existing extensions that use FILE* (not too many I believe).
I'm more concerned about the "lot of work" part, with which I agree. OTOH, Plauger's book "The Standard C Library" contains source code for every library required by C89. He reported that implementing libm took him twice as long as everything else combined. But those who haven't written a libm will be prone to take a wrong lesson from that <wink>. it's-not-that-i/o-is-easy-despite-that-his-libm-code-isn't-production- quality-ly y'rs - tim

I don't get it: why would a thin layer on top of stdio be bad? Seems much less work than reimplementing stdio.
Because by layering stuff you lose performance. Example: fgets() is often implemented in a way that is faster than you can ever do yourself with portable code. (Because fgets() can peek in the buffer and see if there's a \n somewhere ahead, using memchr() -- and if this succeeds, it can use memcpy(). You can't do that yourself -- only the stdio implementation can.) And this is not a hypothetical situation -- Tim used fgets() for a significant speed-up of readline() in 2.1. But if we want to use our own line end convention, we can't use fgets() any more, so we lose big. --Guido van Rossum (home page: http://www.python.org/~guido/)

[ re: various remarks about layering on stdio ] Has anybody looked at sfio ? I used it long ago for other reasons -- for a while the distribution seemed to have disappeared from att ( or maybe I just couldn't find it on netlib ), but I just did a google search and found that there is a new distribution: sfio2000: http://www.research.att.com/sw/tools/sfio/ I haven't looked at the package or the code for a LONG time & I don't know how portable it is, but it has some nice features and advantages -- if you're at the point of considering rewriting stdio it might be worth looking at. -- Steve Majewski

Steven D. Majewski wrote:
[ re: various remarks about layering on stdio ]
Has anybody looked at sfio ?
That reminds me of QIO, the stdio replacement in INN, which has already been ported to Python. -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 http://www.rahul.net/aahz Androgynous poly kinky vanilla queer het I don't really mind a person having the last whine, but I do mind someone else having the last self-righteous whine.

[Steven D. Majewski]
Did just now. Only runs on Unix boxes, so would be a heavyweight way to solve line-end problems across platforms that don't have any <wink>. Possible to run it on Windows, but only on top of the commercial UWIN Unix emulation package (http://www.research.att.com/sw/tools/uwin/). They didn't mention Macs at all. The papers should be worth reading for anyone intending to tackle this, though.

[Guido]
Well, people said "we couldn't" use fgets() for get_line() either, because Python strings can contain embedded nulls but fgets() doesn't tell you how many bytes it read and makes up null bytes of its own. But I have 200 lines of excruciating code in fileobject.c that proved them excruciatingly wrong <wink>. The same kind of excruciating crap could almost certainly be used to search for alternative line endings on top of fgets() too.

We would have to layer our own buffer on top of the hidden platform buffer to get away with this, because while fgets() will stop at the first \n it sees, there's no way to ask it to stop at any other character (so in general fgets() would "over-read" when looking for a non-native line-end, and we'd have to save the excess in our own buffer).

Hard to say how much that would cost. I think it surprised everyone (incl. me!) that even with all the extra buffer-filling and buffer-searching the fgets() hackery does, that method was at worst a wash with the getc_unlocked() method on all platforms tried.

In any case, the fgets() hack is only *needed* on Windows, so every other platform could just make get_line()'s character-at-a-time loop search for more end conditions. This can't be impossible <wink>.

s/\r\n?/\n/g-ly y'rs - tim
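
The over-read-and-stash idea described above might look roughly like this at the Python level, using readline() as a stand-in for fgets() since both stop only at \n (a sketch of the buffering scheme, not the fileobject.c code; the class name is made up):

class UniversalLineReader:
    # Wraps a binary-mode file and hands out lines ending in \n, whatever
    # the file's actual convention.  A line that really ended at a bare \r
    # has been "over-read" by the underlying readline(); the excess is
    # stashed and served up on the next call.
    def __init__(self, f):
        self.f = f
        self.pending = ""

    def readline(self):
        chunk = self.pending or self.f.readline()
        self.pending = ""
        cr = chunk.find("\r")
        if cr < 0:
            return chunk                  # plain Unix line, or "" at EOF
        rest = chunk[cr + 1:]
        if rest[:1] == "\n":
            rest = rest[1:]               # \r\n pair: swallow both characters
        self.pending = rest               # save the over-read remainder
        return chunk[:cr] + "\n"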

I understand now that I simply don't have enough clue about the implementation to even try to be involved with this. Unless it makes sense to have a PEP that doesn't touch the implementation at all (doubtful, IMHO), I'll take back my offer to write one. I still think it's an important issue, but it's simply beyond what I can deal with. To solve the issues on MacOS X, maybe it's enough to hack the Carbon version of stdio so it can handle unix text files. That way we can simply settle for unix line ending if sharing code between BSD Python and Carbon Python is desired. At the same time this would allow using CVS under Darwin for MacPython sources, which is something I look forward to... Just

Just van Rossum wrote:
Please write the results of this discussion up as a PEP. PEPs don't necessarily have to provide an implementation of what is covered; it sometimes simply suffices to start out with a summary of the discussions that have been going on. Then someone may pick up the threads from there and possibly find a solution which will then get implemented.
AFAIR, this discussion was about handling line endings in Python source code. There have been discussions about turning the tokenizer into a Unicode-based machine. We could then use the Unicode tools to do line separations. I don't know why this thread led to tweaking stdio -- after all, we only need a solution for the Python tokenizer and not a general purpose stdio abstraction of text files, unless I'm missing something here... -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
Aaaaaaaaaaaargh! ;-) Here we go again: fixing the tokenizer is great and all, but then what about all tools that read source files line by line? Eg. linecache.py, all IDE's, etc. etc. As Tim wrote a while back: importing-is-only-the-start-of-the-battle So no, we don't "only need a solution for the Python tokenizer"... Just

Just van Rossum wrote:
<grin> I'll repeat my question of yesterday: is there any reason why we couldn't start with QIO? I did some checking after I sent that out, and QIO claims that it can be configured to recognize different kinds of line endings. QIO is claimed to be 2-3 times faster than Python 1.5.2; don't know how that compares to 2.x. [the previous message was sent to python-dev only; this time I'm including pythonmac-sig] -- --- Aahz (@pobox.com) Hugs and backrubs -- I break Rule 6 http://www.rahul.net/aahz Androgynous poly kinky vanilla queer het I don't really mind a person having the last whine, but I do mind someone else having the last self-righteous whine.

[MAL]
I don't know why this thread led to tweaking stdio -- after all, we only need a solution for the Python tokenizer ...
[Just]
Note that this is why the topic needs a PEP: nothing here is new; the same debates recur every time it comes up. [Aahz]
It can be, yes, but in the same sense as Awk/Perl paragraph mode: you can tell it to consider any string (not just single character) as meaning "end of the line", but it's a *fixed* string per invocation. What people want *here* is more the ability to recognize the regular expression \r\n?|\n as ending a line, and QIO can't do that directly (as currently written). And MAL probably wants Unicode line-end detection: http://www.unicode.org/unicode/reports/tr13/
QIO is claimed to be 2-3 times faster than Python 1.5.2; don't know how that compares to 2.x.
The bulk of that was due to QIO avoiding per-character thread locks. 2.1 avoids them too, so most of QIO's speed advantage should be gone now. But QIO's internals could certainly be faster than they are (this is obscure because QIO.readline() has so many optional behaviors that the maze of if-tests makes it hard to see the speed-crucial bits; studying Perl's line-reading code is a better model, because Perl's speed-crucial inner loop has no non-essential operations -- Perl makes the *surrounding* code sort out the optional bits, instead of bogging down the loop with them).
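
For what it's worth, the \r\n?|\n recognition mentioned above is easy to state at the Python level; a throwaway sketch of the pattern (neither QIO nor the C tokenizer):

import re

_EOL = re.compile("\r\n|\r|\n")           # same language as \r\n?|\n

def split_universal(text):
    # Split on any of the three conventions and return lines with a uniform "\n".
    lines = _EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()                       # a trailing terminator leaves an empty tail
    return [line + "\n" for line in lines]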

Tim Peters wrote:
Right.
Right ;-)
Just curious: for the applications which Just has in mind, reading source code line-by-line is not really needed. Wouldn't it suffice to read the whole file, split it into lines and then let the tools process the resulting list of lines ? Maybe a naive approach, but one which will most certainly work on all platforms without having to replace stdio... -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/
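
The read-it-all-then-split approach suggested above is indeed easy to write down; a sketch in the Python 2 style used elsewhere in this thread, with the file opened in binary mode so the platform's stdio can't interfere (note that str.splitlines() already recognizes \r, \n and \r\n):

def read_source_lines(filename):
    # Slurp the whole file and let splitlines() sort out the endings.
    f = open(filename, "rb")
    try:
        data = f.read()
    finally:
        f.close()
    return data.splitlines()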

M.-A. Lemburg wrote:
The point is to let existing tools work with all line end conventions *without* changing the tools. Whether this means replacing stdio I still don't know <wink>, but it sure means changing the behavior of the Python file object in text mode. Just

Just van Rossum wrote:
See... that's why we need a PEP on these things ;-) Seriously, I thought that you were only talking about being able to work on Python code from different platforms in a network (e.g. code is shared by a Windows box and development takes place on a Mac). Now it seems that you want to go for the full Monty :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg wrote:
So no, we don't "only need a solution for the Python tokenizer"...
See... that's why we need a PEP on these things ;-)
Agreed. I'll try to write one, once I'm feeling better: having the flu doesn't seem to help focussing on actual content... Just

Just van Rossum wrote:
Just (or anyone else): Have you made any progress on this PEP? I'd like to see it happen, so if you haven't done it, I'll try to find the time to make a start on it myself.

I have written a simple class that implements a line-ending-neutral text file class. I wrote it because I have a need for it, and I thought it would be a reasonable prototype for any syntax and methods we might want to use in an actual implementation. I doubt anyone would find the methods I used particularly clean or elegant (or fast), but it's the first thing I've come up with, and it seems to work. I've enclosed the module with this email. If that doesn't work, let me know and I'll put it on a website.

-Chris

-- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------

#!/usr/bin/env python

"""
TextFile.py : a module that provides a UniversalTextFile class, and a
replacement for the native python "open" command that provides an
interface to that class.

It would usually be used as:

from TextFile import open

then you can use the new open just like the old one (with some added
flags and arguments), or:

import TextFile
file = TextFile.open(filename, flags, [bufsize], [LineEndingType],
                     [LineBufferSize])
"""

import os

## Re-map the open function
_OrigOpen = open

def open(filename, flags = "", bufsize = -1, LineEndingType = "", LineBufferSize = ""):
    """
    A new open function, that returns a regular python file object for
    the old calls, and returns a new nifty universal text file when
    required.

    This works just like the regular open command, except that a new
    flag and a new parameter have been added.

    Call:
    file = open(filename, flags = "", bufsize = -1, LineEndingType = "")

    - filename is the name of the file to be opened
    - flags is a string of one letter flags, the same as the standard
      open command, plus a "t" for universal text file.
      - "b" means binary file; this returns the standard binary file object
      - "t" means universal text file
      - "r" for read only
      - "w" for write. If there is both "w" and "t" then the user can
        specify a line ending type to be used with the LineEndingType
        parameter.
      - "a" means append to existing file
    - bufsize specifies the buffer size to be used by the system. Same
      as the regular open function.
    - LineEndingType is used only for writing (and appending) files, to
      specify a non-native line ending to be written.
      - The options are: "native", "DOS", "Posix", "Unix", "Mac", or the
        characters themselves ("\r\n", etc.). "native" will result in
        using the standard file object, which uses whatever is native
        for the system that python is running on.
    - LineBufferSize is the size of the buffer used to read data in a
      readline() operation. The default is currently set to 200
      characters. If you will be reading files with many lines over 200
      characters long, you should set this number to the largest
      expected line length.
    """
    if "t" in flags:  # this is a universal text file
        if ("w" in flags or "a" in flags) and LineEndingType == "native":
            return _OrigOpen(filename, flags.replace("t", ""), bufsize)
        return UniversalTextFile(filename, flags, LineEndingType, LineBufferSize)
    else:  # this is a regular old file
        return _OrigOpen(filename, flags, bufsize)


class UniversalTextFile:
    """
    A class that acts just like a python file object, but has a mode
    that allows the reading of arbitrarily formatted text files, i.e.
    with either Unix, DOS or Mac line endings [\n, \r\n, or \r].

    To keep it truly universal, it checks for each of these line ending
    possibilities at every line, so it should work on a file with mixed
    endings as well.
    """
    def __init__(self, filename, flags = "", LineEndingType = "native", LineBufferSize = ""):
        self._file = _OrigOpen(filename, flags.replace("t", "") + "b")
        LineEndingType = LineEndingType.lower()
        if LineEndingType == "native":
            self.LineSep = os.linesep
        elif LineEndingType == "dos":
            self.LineSep = "\r\n"
        elif LineEndingType == "posix" or LineEndingType == "unix":
            self.LineSep = "\n"
        elif LineEndingType == "mac":
            self.LineSep = "\r"
        else:
            self.LineSep = LineEndingType

        ## some attributes
        self.closed = 0
        self.mode = flags
        self.softspace = 0
        if LineBufferSize:
            self._BufferSize = LineBufferSize
        else:
            self._BufferSize = 200

    def readline(self):
        start_pos = self._file.tell()
        ##print "Current file position is:", start_pos
        line = ""
        TotalBytes = 0
        Buffer = self._file.read(self._BufferSize)
        while Buffer:
            ##print "Buffer = ", repr(Buffer)
            newline_pos = Buffer.find("\n")
            return_pos = Buffer.find("\r")
            if return_pos == newline_pos - 1 and return_pos >= 0:
                # we have a DOS line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = newline_pos + 1
                break
            elif ((return_pos < newline_pos) or newline_pos < 0) and return_pos >= 0:
                # we have a Mac line
                line = Buffer[:return_pos] + "\n"
                TotalBytes = return_pos + 1
                break
            elif newline_pos >= 0:
                # we have a Posix line
                line = Buffer[:newline_pos] + "\n"
                TotalBytes = newline_pos + 1
                break
            else:
                # we need a larger buffer
                NewBuffer = self._file.read(self._BufferSize)
                if NewBuffer:
                    Buffer = Buffer + NewBuffer
                else:
                    # we are at the end of the file, without a line ending.
                    self._file.seek(start_pos + len(Buffer))
                    return Buffer
        self._file.seek(start_pos + TotalBytes)
        return line

    def readlines(self, sizehint = None):
        """
        readlines acts like the regular readlines, except that it
        understands any of the standard text file line endings ("\r\n",
        "\n", "\r").

        If sizehint is used, it will read a maximum of that many bytes.
        It will not round up, as the regular readlines does. This means
        that if your buffer size is less than the length of the next
        line, you won't get anything.
        """
        if sizehint:
            Data = self._file.read(sizehint)
        else:
            Data = self._file.read()

        if len(Data) == sizehint:
            #print "The buffer is full"
            FullBuffer = 1
        else:
            FullBuffer = 0

        Data = Data.replace("\r\n", "\n").replace("\r", "\n")
        Lines = [line + "\n" for line in Data.split('\n')]
        #print Lines
        ## If the last line is only a linefeed it is an extra line
        if Lines[-1] == "\n":
            del Lines[-1]
        ## if it isn't, then the last line didn't have a linefeed, so we
        ## need to remove the one we put on.
        else:
            ## or it's the end of the buffer
            if FullBuffer:
                #print "the file is at:", self._file.tell()
                #print "the last line has length:", len(Lines[-1])
                self._file.seek(-(len(Lines[-1]) - 1), 1)  # reset the file position
                del Lines[-1]
            else:
                Lines[-1] = Lines[-1][:-1]
        return Lines

    def readnumlines(self, NumLines = 1):
        """
        readnumlines is an extension to the standard file object. It
        returns a list containing the number of lines that are
        requested. I have found this to be very useful, and it allows
        me to avoid the many loops like:

        lines = []
        for i in range(N):
            lines.append(file.readline())

        Also, if I ever get around to writing this in C, it will
        provide a speed improvement.
        """
        Lines = []
        while len(Lines) < NumLines:
            Lines.append(self.readline())
        return Lines

    def read(self, size = None):
        """
        read acts like the regular read, except that it translates any
        of the standard text file line endings ("\r\n", "\n", "\r")
        into a "\n".

        If size is used, it will read a maximum of that many bytes,
        before translation. This means that if the line endings have
        more than one character, the size returned will be smaller.
        This could be patched, but it didn't seem worth it. If you want
        that much control, use a binary file.
        """
        if size:
            Data = self._file.read(size)
        else:
            Data = self._file.read()
        return Data.replace("\r\n", "\n").replace("\r", "\n")

    def write(self, string):
        """
        write is just like the regular one, except that it uses the
        line separator specified when the file was opened for writing
        or appending.
        """
        self._file.write(string.replace("\n", self.LineSep))

    def writelines(self, list):
        for line in list:
            self.write(line)

    # The rest of the standard file methods mapped
    def close(self):
        self._file.close()
        self.closed = 1

    def flush(self):
        self._file.flush()

    def fileno(self):
        return self._file.fileno()

    def seek(self, offset, whence = 0):
        self._file.seek(offset, whence)

    def tell(self):
        return self._file.tell()

Yesterday I found I had need for an end-of-line conversion import hook. I looked around but found none (did I miss some code on this thread?), so I whipped one up (below). It seems to do the job. If you see any goofs, gaffes or gotchas, or if you know of a better way to do this, please let me know. I will post this code to c.l.py in a few days for the enjoyment of all.

-- David Goodger dgoodger@bigfoot.com Open-source projects: - The Go Tools Project: http://gotools.sourceforge.net - reStructuredText: http://structuredtext.sourceforge.net (soon!)

-----%<----------cut----------%<----------%<----------cut----------%<-----

# Import hook for end-of-line conversion,
# by David Goodger (dgoodger@bigfoot.com).

# Put in your sitecustomize.py, anywhere on sys.path, and you'll be able to
# import Python modules with any of Unix, Mac, or Windows line endings.

import ihooks, imp, py_compile

class MyHooks(ihooks.Hooks):

    def load_source(self, name, filename, file=None):
        """Compile source files with any line ending."""
        if file:
            file.close()
        py_compile.compile(filename)    # line ending conversion is in here
        cfile = open(filename + (__debug__ and 'c' or 'o'), 'rb')
        try:
            return self.load_compiled(name, filename, cfile)
        finally:
            cfile.close()

class MyModuleLoader(ihooks.ModuleLoader):

    def load_module(self, name, stuff):
        """Special-case package directory imports."""
        file, filename, (suff, mode, type) = stuff
        path = None
        if type == imp.PKG_DIRECTORY:
            stuff = self.find_module_in_dir("__init__", filename, 0)
            file = stuff[0]             # package/__init__.py
            path = [filename]
        try:                            # let superclass handle the rest
            module = ihooks.ModuleLoader.load_module(self, name, stuff)
        finally:
            if file:
                file.close()
        if path:
            module.__path__ = path      # necessary for pkg.module imports
        return module

ihooks.ModuleImporter(MyModuleLoader(MyHooks())).install()

"M.-A. Lemburg" wrote:
Just, I second that. I really think this is a very useful improvement to Python, and I'd really like to see it happen. I am probably even more out of my depth than you when it comes to suggesting implementation, but I'd be glad to help with any parts of a PEP that I can. Guido van Rossum wrote:
Great! I have to say that it really seemed that someone must have produced an open source solution to this problem somewhere, and it turns out there is something Python related already. -Chris -- Christopher Barker, Ph.D. ChrisHBarker@home.net --- --- --- http://members.home.net/barkerlohmann ---@@ -----@@ -----@@ ------@@@ ------@@@ ------@@@ Oil Spill Modeling ------ @ ------ @ ------ @ Water Resources Engineering ------- --------- -------- Coastal and Fluvial Hydrodynamics -------------------------------------- ------------------------------------------------------------------------
participants (15)
- aahz@rahul.net
- Chris Barker
- David Goodger
- Fredrik Lundh
- Greg Ewing
- Greg Stein
- Guido van Rossum
- Jack Jansen
- John W Baxter
- Just van Rossum
- M.-A. Lemburg
- Moshe Zadka
- Neil Hodgson
- Steven D. Majewski
- Tim Peters