New subject: Iterating non-newline-separated files should be easier

23 Jul 2014

      On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i@gmail.com> wrote:
...
Andrew Barnert <abarnert@yahoo.com> writes:
...
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i@gmail.com> wrote:
...
Paul Moore <p.f.moore@gmail.com> writes:
...
On 21 July 2014 01:41, Andrew Barnert
<abarnert@yahoo.com.dmarc.invalid> wrote:
...
OK, I wrote up a draft PEP, and attached it to the bug (if that's
not a good thing to do, apologies); you can find it at
http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated
filename filter - the sort of thing that could go in a find -print0 |
xxx | xargs -0 pipeline? If I understand it, that's one of the key
motivating examples for this change, so seeing how it's done would be
a great help.
Here's the sort of thing I mean, written for newline-separated files:
import sys

def process(filename):
    """Trivial example"""
    return filename.lower()

if __name__ == '__main__':
    for filename in sys.stdin:
        filename = process(filename)
        print(filename)
This is also an example of why I'm struggling to understand how an
open() parameter "solves all the cases". There's no explicit open()
call here, so how do you specify the record separator? Seeing how you
propose this would work would be really helpful to me.
`find -print0 | ./tr-filename -0 | xargs -0` example implies that you
can replace `sys.std*` streams without worrying about preserving
`sys.__std*__` streams:
#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path
def transform_filename(filename: str) -> str: # example
   """Normalize whitespace in basename."""
   path = Path(filename)
   new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
   path.replace(new_path) # rename on disk if necessary
   return str(new_path)
def SystemTextStream(bytes_stream, **kwargs):
   encoding = sys.getfilesystemencoding()
   return io.TextIOWrapper(bytes_stream,
       encoding=encoding,
       errors='surrogateescape' if encoding != 'mbcs' else 'strict',
       **kwargs)
nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
   print(transform_filename(line.rstrip(nl)), end=nl)
Nice, much more complete example than mine. I just tried to handle as
many edge cases as the original he asked about, but you handle
everything.
...
io.TextIOWrapper() plays the role of open() in this case. The code
assumes that `newline` parameter accepts '\0'.
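For comparison, here is a minimal sketch of the same iteration done by hand, without relying on the proposed `newline='\0'` support. The helper name `iter_records` and its parameters are hypothetical; they are not part of the io module or of the draft PEP. It reads a binary stream in chunks and splits on the separator itself.

```python
import io


def iter_records(stream, sep=b"\0", chunk_size=8192):
    """Yield sep-terminated records from a binary stream (separator removed)."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last separator is complete; the tail may be
        # a partial record that continues in the next chunk.
        *complete, pending = pending.split(sep)
        yield from complete
    if pending:
        yield pending  # final record that lacks a trailing separator


# Simulated `find -print0` output; one name contains an embedded newline.
data = io.BytesIO(b"a\nb.txt\0c d.txt\0last")
names = [record.decode("utf-8", "surrogateescape")
         for record in iter_records(data, chunk_size=4)]
print(names)
```

The small `chunk_size` is only there to exercise records that span chunk boundaries; a real filter would pass `sys.stdin.buffer` and keep the default.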
The example function handles Unicode whitespace to demonstrate why
opaque bytes-based cookies can't be used to represent filenames in this
case even on POSIX, though which characters are recognized depends on
sys.getfilesystemencoding().
Note:
- `end=nl` is necessary because `print()` prints '\n' by default -- it
does not use `file.newline`
Actually, yes it does. Or, rather, print pastes on a '\n', but
sys.stdout.write translates any '\n' characters to sys.stdout.writenl
(a private variable that's initialized from the newline argument at
construction time if it's anything other than None or '').
You are right. I've stopped reading the source for print() function at
`PyFile_WriteString("\n", file);` line assuming that "\n" is not
translated if newline="\0". But the current behaviour if "\0" were in
"the other legal values" category (like "\r") would be to translate "\n"
[1]:
When writing output to the stream, if newline is None, any '\n'
characters written are translated to the system default line
separator, os.linesep. If newline is '' or '\n', no translation takes
place. If newline is any of the other legal values, any '\n'
characters written are translated to the given string.
[1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
Example:
$ ./python -c 'import sys, io; 
sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); 
sys.stdout.write("\n\r\r\n")'| xxd
0000000: 0d0a 0d0d 0d0a                           ......
"\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
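The same check can be reproduced in-process, without a shell pipeline. This is a small sketch that wraps an io.BytesIO instead of the real stdout; the helper name `written_bytes` is made up for the illustration.

```python
import io


def written_bytes(text, newline):
    """Return the bytes produced by writing `text` through a TextIOWrapper."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()


# newline="\r\n": each "\n" becomes b"\r\n"; a bare "\r" is left untouched.
crlf = written_bytes("\n\r\r\n", "\r\n")
# newline="": no translation takes place on output.
untouched = written_bytes("\n\r\r\n", "")

print(crlf)
print(untouched)
```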
For the newline="\0" case to work, it should instead behave like the
newline='' or newline='\n' case, i.e., no translation should take
place, to avoid corrupting embedded "\n\r" characters.
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.

For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it?
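A rough sketch of that point, using an in-memory stream rather than the real stdout. Since '\0' is not an accepted newline value today, newline="\r" stands in for "any other legal value" to show what output translation would do to a filename with an embedded '\n'; the helper `write_names` is hypothetical.

```python
import io

names = ["plain.txt", "with\nnewline.txt"]


def write_names(newline):
    """Write NUL-terminated names through a TextIOWrapper; return the bytes."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    for name in names:
        # The record terminator comes from end=, not from the stream.
        print(name, end="\0", file=out)
    out.flush()
    return raw.getvalue()


# No translation: the embedded "\n" survives intact.
untranslated = write_names("")
# Output translation rewrites the embedded "\n" inside the filename.
translated = write_names("\r")

print(untranslated)
print(translated)
```

In both cases the NUL terminators come only from `end="\0"`; the `newline` argument changes nothing on output except the unwanted replacement.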

Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
...
My original code
works as is in this case i.e., *end=nl is still necessary*.
...
...
But of course that's the newline argument to sys.stdout, and you only
changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
want output translation here anyway, because that could also translate
'\n' characters in the middle of a line, re-creating the same problem
we're trying to avoid...)
But it uses sys.stdout.newline, not sys.stdin.newline.
The code affects *both* sys.stdout/sys.stdin. Look [2]:
I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it.

As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal.

It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo.
...
...
...
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
   print(transform_filename(line.rstrip(nl)), end=nl)
[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
...
...
- SystemTextStream() handles filenames that are undecodable in the
current locale, i.e., non-ASCII names are allowed even in the C locale
(LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear
how to pass an undecodable filename via a pipe on Windows -- perhaps
`GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
assumes that the short path exists and that it is always encodable
using mbcs. If we can control all parts of the pipeline *and* the
Windows API uses proper utf-16 (not ucs-2), then utf-8 can be used to
pass filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could
be tried, e.g., https://github.com/Drekin/win-unicode-console
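To illustrate the first note, here is a minimal sketch of how the 'surrogateescape' error handler lets an undecodable filename round-trip through str. The encoding is fixed to UTF-8 here so the result does not depend on the locale; SystemTextStream() itself uses sys.getfilesystemencoding().

```python
raw_name = b"report-\xff.txt"  # b"\xff" is never valid in UTF-8

# Undecodable bytes are smuggled through as lone surrogates (U+DC80..U+DCFF).
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text_name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate instead of guessing.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text_name), round_trip == raw_name, strict_ok)
```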
First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on
top of it guarantee that you can never get such unencodable filenames
(sometimes by just pretending the file doesn't exist, but if possible
by having the filesystem map it to something valid, unique, and
persistent for this session, usually the short name)?
Second, trying to solve this implies that you have some other native
(as opposed to Cygwin) tool that passes or accepts such filenames over
simple pipes (as opposed to PowerShell typed ones). Are there any?
What does, say, mingw's find do with invalid filenames if it finds
them?
In short: I don't know :)
To be clear, I'm talking about native Windows applications (not
find/xargs on Cygwin). The goal is to process robustly *arbitrary*
filenames on Windows via a pipe (SystemTextStream()) or network (bytes
interface).
Yes, I assumed that, I just wanted to make that clear.

My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".) 

At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.)
...
I know that the (A)nsi API (and therefore the "POSIX-ish layer" that
uses narrow strings, such as main(), fopen(), fstream) is broken, e.g.,
Thai filenames on a Greek computer [3].
Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.)
...
Unicode (W) API should enforce utf-16 in principle
since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many
places due to bad programming practices (based on the common wrong
assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not
fixed due to MS' backwards compatibility policies in the past [5].
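A small sketch of why the UTF-16 == UCS-2 assumption breaks: a character outside the Basic Multilingual Plane is one code point but two 16-bit code units (a surrogate pair) in UTF-16. The sample character is arbitrary.

```python
ch = "\U0001F600"  # a single code point outside the BMP

encoded = ch.encode("utf-16-le")
code_units = len(encoded) // 2  # each UTF-16 code unit is 2 bytes

# Code that assumes UCS-2 would treat this as two separate "characters".
print(len(ch), code_units, encoded.hex())
```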
Yes, I've run into such bugs in the past. It's even more fun when you're dealing with interfaces that take an unterminated string plus a separate length. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here.
