[Python-ideas] Iterating non-newline-separated files should be easier

Wed Jul 23 06:40:54 CEST 2014

On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote:

> Paul Moore <p.f.moore at gmail.com> writes:
> 
>> On 21 July 2014 01:41, Andrew Barnert
>> <abarnert at yahoo.com.dmarc.invalid> wrote:
>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>> not a good thing to do, apologies); you can find it at
>>> http://bugs.python.org/file36008/pep-newline.txt
>> 
>> As a suggestion, how about adding an example of a simple nul-separated
>> filename filter - the sort of thing that could go in a find -print0 |
>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>> motivating examples for this change, so seeing how it's done would be
>> a great help.
>> 
>> Here's the sort of thing I mean, written for newline-separated files:
>> 
>> import sys
>> 
>> def process(filename):
>>    """Trivial example"""
>>    return filename.lower()
>> 
>> if __name__ == '__main__':
>> 
>>    for filename in sys.stdin:
>>        # Strip the trailing newline so print() does not emit a blank line.
>>        filename = process(filename.rstrip('\n'))
>>        print(filename)
>> 
>> This is also an example of why I'm struggling to understand how an
>> open() parameter "solves all the cases". There's no explicit open()
>> call here, so how do you specify the record separator? Seeing how you
>> propose this would work would be really helpful to me.
> 
> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
> can replace `sys.std*` streams without worrying about preserving
> `sys.__std*__` streams:
> 
>  #!/usr/bin/env python
>  import io
>  import re
>  import sys
>  from pathlib import Path
> 
>  def transform_filename(filename: str) -> str: # example
>      """Normalize whitespace in basename."""
>      path = Path(filename)
>      new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>      path.replace(new_path) # rename on disk if necessary
>      return str(new_path)
> 
>  def SystemTextStream(bytes_stream, **kwargs):
>      encoding = sys.getfilesystemencoding()
>      return io.TextIOWrapper(bytes_stream,
>          encoding=encoding,
>          errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>          **kwargs)
> 
>  nl = '\0' if '-0' in sys.argv else None
>  sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>  for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>      print(transform_filename(line.rstrip(nl)), end=nl)

Nice, that's a much more complete example than mine. I only tried to handle as many edge cases as the original that Paul asked about; yours handles everything.
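
For comparison, here is a rough sketch of what the same loop has to do on current Python, where TextIOWrapper rejects newline='\0': read the binary stream in chunks and split on the separator by hand. The helper name `iter_records` and its `chunk_size` parameter are made up for illustration; they are not part of the draft PEP or the stdlib.

```python
import io
import os

def iter_records(bytes_stream, sep=b'\0', chunk_size=8192):
    """Yield sep-terminated records from a binary stream, separator removed."""
    buf = b''
    while True:
        chunk = bytes_stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; keep the tail.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        yield buf  # final record that had no trailing separator

# Stand-in for sys.stdin.buffer fed by `find -print0`.
fake_stdin = io.BytesIO(b'a b\0c\nd\0e')
names = [os.fsdecode(record) for record in iter_records(fake_stdin)]
```

This works, but every script ends up carrying its own buffering loop, which is presumably the kind of boilerplate the proposal would remove.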

> io.TextIOWrapper() plays the role of open() in this case. The code
> assumes that `newline` parameter accepts '\0'.
> 
> The example function handles Unicode whitespace to demonstrate why
> opaque bytes-based cookies can't be used to represent filenames in this
> case even on POSIX, though which characters are recognized depends on
> sys.getfilesystemencoding().
> 
> Note:
> 
> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>  does not use `file.newline`

Actually, yes it does. Or, rather, print appends a '\n', but sys.stdout.write translates every '\n' character to sys.stdout.writenl (a private attribute that's initialized from the newline argument at construction time if it's anything other than None or '').

But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)

But it uses sys.stdout.newline, not sys.stdin.newline.
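
A quick way to see the write-side translation with today's io module, using only newline values that TextIOWrapper already accepts ('\r\n', '\n' and ''). The helper `written_bytes` exists only for this illustration:

```python
import io

def written_bytes(newline, text):
    """Print `text` to a fresh text stream and return the raw bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding='utf-8', newline=newline)
    print(text, file=stream)  # print() itself always appends '\n'
    stream.flush()
    return raw.getvalue()

# With newline='\r\n', every '\n' is translated on write, including the
# one in the middle of the text, not just the one print() appended.
crlf = written_bytes('\r\n', 'a\nb')

# With newline='\n' or newline='', nothing is translated.
plain = written_bytes('\n', 'a\nb')
untranslated = written_bytes('', 'a\nb')
```

The embedded '\n' being rewritten in the first case is exactly the mid-line translation mentioned above.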

> - `-0` option is required in the current implementation if filenames may
>  have trailing whitespace. It can be improved.
> - SystemTextStream() handles filenames that are undecodable in the
>  current locale, i.e., non-ascii names are allowed even in C locale
>  (LC_CTYPE=C)
> - undecodable filenames are not supported on Windows. It is not clear
>  how to pass an undecodable filename via a pipe on Windows -- perhaps
>  `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It
>  assumes that the short path exists and it is always encodable using
>  mbcs. If we can control all parts of the pipeline *and* Windows API
>  uses proper utf-16 (not ucs-2), then utf-8 can be used to pass
>  filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could be
>  tried, e.g., https://github.com/Drekin/win-unicode-console

First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of them guarantee that you can never get such unencodable filenames? (Sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name.)

Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them?

On Unix, of course, it's a real problem.
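
To make the Unix case concrete, here is a minimal sketch of how the surrogateescape handler used in SystemTextStream above carries an undecodable byte through str and back. The filename is made up, and UTF-8 is assumed as the filesystem encoding:

```python
# A filename as raw bytes: Latin-1 encoded, so not valid UTF-8.
raw_name = b'caf\xe9.txt'

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
text_name = raw_name.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode('utf-8', errors='surrogateescape')

# Without the handler, the lone surrogate cannot be encoded at all.
try:
    text_name.encode('utf-8')
except UnicodeEncodeError:
    strict_fails = True
else:
    strict_fails = False
```

So the str form is safe to manipulate and pass back to the OS, but it can't be written to a strictly encoded stream, which is why both ends of the pipe need the same error handler.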