Universal newlines support in Python 3.0
Python 3.0 currently has limited universal newlines support: by default, \r\n is translated into \n for text files, but this can be controlled by the newline= keyword parameter. For details on how, see PEP 3116. The PEP prescribes that a lone \r must also be translated, though this hasn't been implemented yet (any volunteers?). However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this? Has anyone even ever used this? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum writes:
However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this?
I have run into files that intentionally have more than one newline convention used (mbox and Babyl mail folders, with messages received from various platforms). However, most of the time multiple newline conventions is a sign that the file is either corrupt or isn't text. If so, then saving the file may corrupt it. The newlines attribute could be used to check for this condition.
Has anyone even ever used this?
Not I. When I care about such issues I prefer that the codec raise an exception at the time of detection.
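The check described above can be sketched with the 'newlines' attribute as it exists on io.TextIOWrapper in current Python 3 releases (the thread itself had not yet settled whether the attribute would survive). The helper names observed_newlines and has_mixed_newlines are hypothetical, chosen only for this illustration, and the in-memory BytesIO stands in for a real file.

```python
import io


def observed_newlines(raw, encoding="ascii"):
    """Return the set of line endings seen while reading raw in text mode."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding) as f:
        f.read()  # the attribute only reflects what has been read so far
        seen = f.newlines
    if seen is None:           # no line ending encountered at all
        return set()
    if isinstance(seen, str):  # exactly one convention encountered
        return {seen}
    return set(seen)           # a tuple when several conventions were seen


def has_mixed_newlines(raw):
    """True if more than one newline convention appears in raw."""
    return len(observed_newlines(raw)) > 1
```

A caller could refuse to save, or at least warn, when has_mixed_newlines() returns True. As noted later in the thread, this says nothing about which line used which ending.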
In article <87wsw3p5em.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Guido van Rossum writes:
However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this?
I have run into files that intentionally have more than one newline convention used (mbox and Babyl mail folders, with messages received from various platforms). However, most of the time multiple newline conventions is a sign that the file is either corrupt or isn't text. If so, then saving the file may corrupt it. The newlines attribute could be used to check for this condition.
There is at least one Mac source code editor (SubEthaEdit) that is all too happy to add one kind of newline to a file that started out with a different line ending character. As a result I have seen a fair number of text files with mixed line endings. I don't see as many these days, though; perhaps because the current version of SubEthaEdit handles things a bit better. So perhaps it won't matter much for Python 3000. -- Russell
On 8/13/07, Russell E Owen <rowen@cesmail.net> wrote:
In article <87wsw3p5em.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Guido van Rossum writes:
However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this?
I have run into files that intentionally have more than one newline convention used (mbox and Babyl mail folders, with messages received from various platforms). However, most of the time multiple newline conventions is a sign that the file is either corrupt or isn't text. If so, then saving the file may corrupt it. The newlines attribute could be used to check for this condition.
There is at least one Mac source code editor (SubEthaEdit) that is all too happy to add one kind of newline to a file that started out with a different line ending character. As a result I have seen a fair number of text files with mixed line endings. I don't see as many these days, though; perhaps because the current version of SubEthaEdit handles things a bit better. So perhaps it won't matter much for Python 3000.
I've seen similar behavior in MS VC++ (long ago, dunno what it does these days). It would read files with \r\n and \n line endings, and whenever you edited a line, that line also got a \r\n ending. But unchanged lines that started out with \n-only endings would keep the \n only. And there was no way for the end user to see or control this.
To emulate this behavior in Python you'd have to read the file in binary mode *or* we'd have to have an additional flag specifying to return line endings as encountered in the file. The newlines attribute (as defined in 2.x) doesn't help, because it doesn't tell which lines used which line ending. I think the newline feature in PEP 3116 falls short too; it seems mostly there to override the line ending *written* (from the default os.linesep).
I think we may need different flags for input and for output.
For input, we'd need two things: (a) which are acceptable line endings; (b) whether to translate acceptable line endings to \n or not. For output, we need two things again: (c) whether to translate line endings at all; (d) which line endings to translate. I guess we could map (c) to (b) and (d) to (a) for a signature that's the same for input and output (and makes sense for read+write files as well). The default would be (a)=={'\n', '\r\n', '\r'} and (b)==True.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
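To make the input-side proposal concrete, here is a minimal sketch of flags (a) and (b) as a pure function over an already-decoded string. The name split_lines and its parameters accepted and translate are hypothetical and are not an API from PEP 3116 or the io module; the sketch only illustrates the semantics being discussed.

```python
import re


def split_lines(text, accepted=("\n", "\r\n", "\r"), translate=True):
    """Split text into lines using only the accepted endings (flag a).

    With translate=True (flag b), each accepted ending is replaced by "\n";
    otherwise the ending is kept exactly as it appeared in the input.
    """
    # Try longer endings first so "\r\n" is not read as "\r" then "\n".
    endings = sorted(accepted, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in endings))
    lines = []
    start = 0
    for match in pattern.finditer(text):
        ending = "\n" if translate else match.group()
        lines.append(text[start:match.start()] + ending)
        start = match.end()
    if start < len(text):
        lines.append(text[start:])  # final line without a terminator
    return lines
```

With translate=False the result records which line used which ending, which is the information the 2.x newlines attribute cannot provide.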
On Aug 13, 2007, at 4:15 PM, Guido van Rossum wrote:
I've seen similar behavior in MS VC++ (long ago, dunno what it does these days). It would read files with \r\n and \n line endings, and whenever you edited a line, that line also got a \r\n ending. But unchanged lines that started out with \n-only endings would keep the \n only. And there was no way for the end user to see or control this.
To emulate this behavior in Python you'd have to read the file in binary mode *or* we'd have to have an additional flag specifying to return line endings as encountered in the file. The newlines attribute (as defined in 2.x) doesn't help, because it doesn't tell which lines used which line ending. I think the newline feature in PEP 3116 falls short too; it seems mostly there to override the line ending *written* (from the default os.linesep).
I think we may need different flags for input and for output.
For input, we'd need two things: (a) which are acceptable line endings; (b) whether to translate acceptable line endings to \n or not. For output, we need two things again: (c) whether to translate line endings at all; (d) which line endings to translate. I guess we could map (c) to (b) and (d) to (a) for a signature that's the same for input and output (and makes sense for read+write files as well). The default would be (a)=={'\n', '\r\n', '\r'} and (b)==True.
I haven't thought about the output side of the equation, but I've already hit a situation where I'd like to see the input side (b) option implemented. I'm still sussing out the email package changes (down to 7F/9E of 247 tests!) but in trying to fix things I found myself wanting to open files in text mode so that I got strings out of the file instead of bytes. This was all fine except that some of the tests started failing because of the EOL translation that happens unconditionally now. The file contained \r\n and the test was ensuring these EOLs were preserved in the parsed text. I switched back to opening the file in binary mode, and doing a crufty conversion of bytes to strings (which I suspect is error prone but gets me farther along). It would have been perfect, I think, if I could have opened the file in text mode so that read() gave me strings, with universal newlines and preservation of line endings (i.e. no translation to \n). -Barry
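For readers on current Python 3 releases, the behaviour asked for here is what newline='' provides in the io module as eventually released: universal newline recognition without translation. The sketch below assumes that later behaviour, not the 2007 py3k branch discussed in this thread, and uses an in-memory BytesIO in place of a real mail file.

```python
import io

# Hypothetical sample data mixing all three conventions.
raw = b"Subject: hi\r\nfirst\rsecond\n"

# newline="" recognises \n, \r and \r\n as line ends but leaves them as-is.
with io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="") as f:
    preserved_lines = f.readlines()

# The default (newline=None) recognises the same endings and maps them to \n.
with io.TextIOWrapper(io.BytesIO(raw), encoding="ascii") as f:
    translated_lines = f.readlines()
```

The same keyword is accepted by the built-in open(), so a real file could be read with open(path, newline='').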
On Tue, Aug 14, 2007 at 09:58:32AM -0400, Barry Warsaw wrote:
This was all fine except that some of the tests started failing because of the EOL translation that happens unconditionally now. The file contained \r\n and the test was ensuring these EOLs were preserved in the parsed text. I switched back to opening the file in binary mode, and doing a crufty conversion of bytes to strings (which I suspect is error prone but gets me farther along).
It would have been perfect, I think, if I could have opened the file in text mode so that read() gave me strings, with universal newlines and preservation of line endings (i.e. no translation to \n).
FWIW this same issue (and solution) came up while fixing the csv tests. -- Adam Hupp | http://hupp.org/adam/
On Aug 10, 2007, at 11:23 AM, Guido van Rossum wrote:
Python 3.0 currently has limited universal newlines support: by default, \r\n is translated into \n for text files, but this can be controlled by the newline= keyword parameter. For details on how, see PEP 3116. The PEP prescribes that a lone \r must also be translated, though this hasn't been implemented yet (any volunteers?).
I'm working on this, but now I'm not sure how the file is supposed to be read when the newline parameter is \r or \r\n. Here's the PEP language:
buffer is a reference to the BufferedIOBase object to be wrapped with the TextIOWrapper. encoding refers to an encoding to be used for translating between the byte-representation and character-representation. If it is None, then the system's locale setting will be used as the default. newline can be None, '\n', '\r', or '\r\n' (all other values are illegal); it indicates the translation for '\n' characters written. If None, a system-specific default is chosen, i.e., '\r\n' on Windows and '\n' on Unix/Linux. Setting newline='\n' on input means that no CRLF translation is done; lines ending in '\r\n' will be returned as '\r\n'. ('\r' support is still needed for some OSX applications that produce files using '\r' line endings; Excel (when exporting to text) and Adobe Illustrator EPS files are the most common examples.)
Is this ok: when newline='\r\n' or newline='\r' is passed, only that string is used to determine the end of lines. No translation to '\n' is done.
However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this? Has anyone even ever used this?
This strikes me as a pragmatic feature, making it easy to read a file and write back the same line ending. I can include it in the patch.
http://www.google.com/codesearch?hl=en&q=+lang:python+%22.newlines%22+show:cz2Fhijwr3s:yutdXigOmYY:YDns9IyEkLQ&sa=N&cd=12&ct=rc&cs_p=http://ftp.gnome.org/pub/gnome/sources/meld/1.0/meld-1.0.0.tar.bz2&cs_f=meld-1.0.0/filediff.py#a0
http://www.google.com/codesearch?hl=en&q=+lang:python+%22.newlines%22+show:SLyZnjuFadw:kOTmKU8aU2I:VX_dFr3mrWw&sa=N&cd=37&ct=rc&cs_p=http://svn.python.org/projects/ctypes/trunk&cs_f=ctypeslib/ctypeslib/dynamic_module.py#a0
Thanks -Tony
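The read-then-write-back use case can be sketched as follows, assuming the 'newlines' attribute and the newline= keyword as they behave in current Python 3 releases. The function name rewrite_preserving_newline and its transform argument are hypothetical, and the fallback to "\n" for files with no or mixed endings is an arbitrary choice made for this illustration.

```python
import io


def rewrite_preserving_newline(raw, transform, encoding="ascii"):
    """Apply transform to the text in raw, writing back the ending it used."""
    # Read with default universal-newline translation and note what was seen.
    with io.TextIOWrapper(io.BytesIO(raw), encoding=encoding) as reader:
        text = reader.read()
        seen = reader.newlines
    # Only a single observed convention can be written back faithfully;
    # for none or mixed endings this sketch falls back to "\n".
    ending = seen if isinstance(seen, str) else "\n"
    out = io.BytesIO()
    writer = io.TextIOWrapper(out, encoding=encoding, newline=ending)
    writer.write(transform(text))  # each "\n" written becomes `ending`
    writer.flush()
    result = out.getvalue()
    writer.detach()  # keep the wrapper from closing `out` later
    return result
```

For example, rewrite_preserving_newline(data, str.upper) upper-cases the text while keeping a \r\n file as \r\n.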
On 8/11/07, Tony Lownds <tony@pagedna.com> wrote:
On Aug 10, 2007, at 11:23 AM, Guido van Rossum wrote:
Python 3.0 currently has limited universal newlines support: by default, \r\n is translated into \n for text files, but this can be controlled by the newline= keyword parameter. For details on how, see PEP 3116. The PEP prescribes that a lone \r must also be translated, though this hasn't been implemented yet (any volunteers?).
I'm working on this, but now I'm not sure how the file is supposed to be read when the newline parameter is \r or \r\n. Here's the PEP language:
buffer is a reference to the BufferedIOBase object to be wrapped with the TextIOWrapper. encoding refers to an encoding to be used for translating between the byte-representation and character-representation. If it is None, then the system's locale setting will be used as the default. newline can be None, '\n', '\r', or '\r\n' (all other values are illegal); it indicates the translation for '\n' characters written. If None, a system-specific default is chosen, i.e., '\r\n' on Windows and '\n' on Unix/Linux. Setting newline='\n' on input means that no CRLF translation is done; lines ending in '\r\n' will be returned as '\r\n'. ('\r' support is still needed for some OSX applications that produce files using '\r' line endings; Excel (when exporting to text) and Adobe Illustrator EPS files are the most common examples.)
Is this ok: when newline='\r\n' or newline='\r' is passed, only that string is used to determine the end of lines. No translation to '\n' is done.
I *think* it would be more useful if it always returned lines ending in \n (not \r\n or \r). Wouldn't it? Although this is not how it currently behaves; when you set newline='\r\n', it returns the \r\n unchanged, so it would make sense to do this too when newline='\r'. Caveat user I guess.
However, the old universal newlines feature also set an attribute named 'newlines' on the file object to a tuple of up to three elements giving the actual line endings that were observed on the file so far (\r, \n, or \r\n). This feature is not in PEP 3116, and it is not implemented. I'm tempted to kill it. Does anyone have a use case for this? Has anyone even ever used this?
This strikes me as a pragmatic feature, making it easy to read a file and write back the same line ending. I can include it in the patch.
OK, if you think you can, that's good. It's not always sufficient (not if there was a mix of line endings) but it's a start.
http://www.google.com/codesearch?hl=en&q=+lang:python+%22.newlines%22+show:cz2Fhijwr3s:yutdXigOmYY:YDns9IyEkLQ&sa=N&cd=12&ct=rc&cs_p=http://ftp.gnome.org/pub/gnome/sources/meld/1.0/meld-1.0.0.tar.bz2&cs_f=meld-1.0.0/filediff.py#a0
http://www.google.com/codesearch?hl=en&q=+lang:python+%22.newlines%22+show:SLyZnjuFadw:kOTmKU8aU2I:VX_dFr3mrWw&sa=N&cd=37&ct=rc&cs_p=http://svn.python.org/projects/ctypes/trunk&cs_f=ctypeslib/ctypeslib/dynamic_module.py#a0
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Aug 11, 2007, at 10:29 AM, Guido van Rossum wrote:
Is this ok: when newline='\r\n' or newline='\r' is passed, only that string is used to determine the end of lines. No translation to '\n' is done.
I *think* it would be more useful if it always returned lines ending in \n (not \r\n or \r). Wouldn't it? Although this is not how it currently behaves; when you set newline='\r\n', it returns the \r\n unchanged, so it would make sense to do this too when newline='\r'. Caveat user I guess.
Because there's an easy way to translate, having the option to not translate apply to all valid newline values is probably more useful. I do think it's easier to define the behavior this way.
OK, if you think you can, that's good. It's not always sufficient (not if there was a mix of line endings) but it's a start.
Right -Tony
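The rule agreed on here (a specific newline value is the only terminator, and nothing is translated) matches how io.TextIOWrapper behaves on input in current Python 3 releases. The sketch below assumes that released behaviour; the sample bytes are made up for the illustration.

```python
import io


def read_lines(raw, newline):
    """Read raw as ASCII text using the given newline= setting."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline) as f:
        return f.readlines()


# Only "\r" ends a line; the embedded "\n" is ordinary data and stays put.
cr_lines = read_lines(b"a\rb\nc\rd\r", "\r")

# Only "\r\n" ends a line; the bare "\n" after "b" does not split the line.
crlf_lines = read_lines(b"a\r\nb\nc\r\n", "\r\n")
```

Callers who want '\n'-terminated lines can still translate afterwards, for instance with line.replace('\r\n', '\n'), which is the "easy way to translate" mentioned above.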
On 11/08/07, Guido van Rossum <guido@python.org> wrote:
On 8/11/07, Tony Lownds <tony@pagedna.com> wrote:
Is this ok: when newline='\r\n' or newline='\r' is passed, only that string is used to determine the end of lines. No translation to '\n' is done.
I *think* it would be more useful if it always returned lines ending in \n (not \r\n or \r). Wouldn't it? Although this is not how it currently behaves; when you set newline='\r\n', it returns the \r\n unchanged, so it would make sense to do this too when newline='\r'. Caveat user I guess.
Neither this wording nor the PEP is clear to me, but I'm assuming/hoping that there will be a way to spell the current behaviour for universal newlines on input[1], namely that files can have *either* bare \n, *or* the combination \r\n, to delimit lines. Whichever is used (I have no need for mixed-style files) gets translated to \n so that the program sees the same data regardless.
[1] ... at least the bit I care about :-)
This behaviour is immensely useful for uniform treatment of Windows text files, which are an inconsistent mess of \n-only and \r\n conventions. Specifically, I'm looking to replicate this behaviour:
xxd crlf
0000000: 610d 0a62 0d0a    a..b..

xxd lf
0000000: 610a 620a    a.b.

python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open('crlf').read()
'a\nb\n'
>>> open('lf').read()
'a\nb\n'
As demonstrated, this is the default in Python 2.5. I'd hope it was so in 3.0 as well. Sorry I can't test this for myself - I don't have the time/toolset to build my own Py3k on Windows... Paul.
Paul Moore schrieb:
Specifically, I'm looking to replicate this behaviour:
xxd crlf
0000000: 610d 0a62 0d0a    a..b..

xxd lf
0000000: 610a 620a    a.b.

python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open('crlf').read()
'a\nb\n'
>>> open('lf').read()
'a\nb\n'
As demonstrated, this is the default in Python 2.5. I'd hope it was so in 3.0 as well.
Note that Python does nothing special in the above case. For non-Windows platforms, you'd get two different results -- the conversion from \r\n to \n is done by the Windows C runtime since the default open() mode is text mode. Only with mode 'U' does Python use its own universal newline mode. With Python 3.0, the C library is not used and Python uses universal newline mode by default. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
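The Python 3 default described here can be checked without relying on the platform's C runtime. The sketch below assumes the io module as released in current Python 3 versions and reuses the byte contents of the 'crlf' and 'lf' files from the xxd dumps; the 'cr' sample is an added assumption covering the lone \r case raised elsewhere in the thread.

```python
import io


def read_default(raw):
    """Read raw as ASCII text with the default newline=None setting."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="ascii") as f:
        return f.read()


crlf_text = read_default(b"a\r\nb\r\n")  # contents of the 'crlf' file
lf_text = read_default(b"a\nb\n")        # contents of the 'lf' file
cr_text = read_default(b"a\rb\r")        # old Mac style, not in the original demo
```

All three reads give the same string on every platform, because the translation is done by Python's own text layer and not by the operating system.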
On 12/08/07, Georg Brandl <g.brandl@gmx.net> wrote:
Note that Python does nothing special in the above case. For non-Windows platforms, you'd get two different results -- the conversion from \r\n to \n is done by the Windows C runtime since the default open() mode is text mode.
Only with mode 'U' does Python use its own universal newline mode.
Pah. You're right - I almost used 'U' and then "discovered" that I didn't need it (and got bitten by a portability bug as a result :-()
With Python 3.0, the C library is not used and Python uses universal newline mode by default.
That's what I expected, but I was surprised to find that the PEP is pretty unclear on this. The phrase "universal newlines" is mentioned only once, and never defined. Knowing the meaning, I can see how the PEP is intended to say that universal newlines on input is the default (and you set the newline argument to specify a *specific*, non-universal, newline value) - but I missed it on first reading. Thanks for the clarification. Paul.
Paul> ... that files can have *either* bare \n, *or* the combination \r\n, to delimit lines.
As someone else pointed out, \r needs to be supported as well. Many Mac applications (Excel comes to mind) still emit text files with \r as the line terminator.
Skip
participants (10)
- Adam Hupp
- Barry Warsaw
- Georg Brandl
- Guido van Rossum
- Paul Moore
- Russell E Owen
- skip@pobox.com
- Stephen J. Turnbull
- Tony Lownds
- Tony Lownds