[Python-ideas] Iterating non-newline-separated files should be easier

Akira Li 4kir4.1i at gmail.com
Tue Jul 22 18:05:42 CEST 2014


Paul Moore <p.f.moore at gmail.com> writes:

> On 21 July 2014 01:41, Andrew Barnert
> <abarnert at yahoo.com.dmarc.invalid> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>> not a good thing to do, apologies); you can find it at
>> http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
> import sys
>
> def process(filename):
>     """Trivial example"""
>     return filename.lower()
>
> if __name__ == '__main__':
>
>     for filename in sys.stdin:
>         filename = process(filename)
>         print(filename)
>
> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.
>

The `find -print0 | ./tr-filename -0 | xargs -0` example implies that
you can replace the `sys.std*` streams without worrying about preserving
the `sys.__std*__` streams:

  #!/usr/bin/env python3
  import io
  import re
  import sys
  from pathlib import Path

  def transform_filename(filename: str) -> str: # example
      """Normalize whitespace in basename."""
      path = Path(filename)
      new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
      path.replace(new_path) # rename on disk if necessary
      return str(new_path)

  def SystemTextStream(bytes_stream, **kwargs):
      encoding = sys.getfilesystemencoding()
      return io.TextIOWrapper(bytes_stream,
          encoding=encoding,
          errors='surrogateescape' if encoding != 'mbcs' else 'strict',
          **kwargs)

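  # '\0' relies on the proposed `newline` support; None keeps the defaults
  nl = '\0' if '-0' in sys.argv else None
  sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
  for line in SystemTextStream(sys.stdin.detach(), newline=nl):
      # rstrip(None) strips all trailing whitespace; end=None means '\n'
      print(transform_filename(line.rstrip(nl)), end=nl)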

io.TextIOWrapper() plays the role of open() in this case. The code
assumes that the `newline` parameter accepts '\0'.
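
For comparison, here is a minimal sketch of how the same records could
be read without the proposed change, since io.TextIOWrapper() currently
raises ValueError for newline='\0'. The name `iter_records` and its
`sep`/`chunk_size` parameters are hypothetical, chosen only for this
illustration; decoding via os.fsdecode() is my assumption about what a
filename filter would want:

```python
import io
import os

def iter_records(bytes_stream, sep=b'\0', chunk_size=8192):
    """Yield records split on `sep` from a binary stream, decoded as filenames.

    Hypothetical helper: it splits on the separator by hand and decodes
    each record with os.fsdecode().
    """
    buf = b''
    while True:
        chunk = bytes_stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # everything before the last separator is complete; keep the tail
        *records, buf = buf.split(sep)
        for record in records:
            yield os.fsdecode(record)
    if buf:  # final record that has no trailing separator
        yield os.fsdecode(buf)

# an in-memory stream stands in for sys.stdin.buffer
demo = io.BytesIO(b'a b\0c\nd\0')
print(list(iter_records(demo)))  # ['a b', 'c\nd']
```

In a real filter the stream would be `sys.stdin.buffer`, and each result
would be written back with something like
`sys.stdout.buffer.write(os.fsencode(name) + b'\0')`.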

The example function handles Unicode whitespace to demonstrate why
opaque bytes-based cookies can't be used to represent filenames in this
case even on POSIX, though which characters are recognized depends on
sys.getfilesystemencoding().
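
A small sketch of the difference, assuming UTF-8 encoded names (the
helper names `normalize_str` and `normalize_bytes` are made up for the
illustration). With a str pattern `\s` matches Unicode whitespace such
as U+2003 EM SPACE and U+00A0 NO-BREAK SPACE; with a bytes pattern it
matches only ASCII whitespace, so the encoded name is left untouched:

```python
import re

def normalize_str(name):
    # str pattern: \s matches Unicode whitespace
    return re.sub(r'\s+', ' ', name)

def normalize_bytes(name):
    # bytes pattern: \s matches only ASCII whitespace
    return re.sub(rb'\s+', b' ', name)

name = 'a\u2003\u00a0b.txt'   # EM SPACE and NO-BREAK SPACE in the basename
encoded = name.encode('utf-8')  # assumption: names are UTF-8 on disk

print(ascii(normalize_str(name)))
print(normalize_bytes(encoded) == encoded)  # the bytes version changes nothing
```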

Note:

- `end=nl` is necessary because `print()` prints '\n' by default -- it
  does not use `file.newline`
- the `-0` option is required in the current implementation if filenames
  may have trailing whitespace, because `line.rstrip(None)` strips it
  together with the newline. This could be improved
- SystemTextStream() handles filenames that are undecodable in the
  current locale, i.e., non-ASCII names are allowed even in the C locale
  (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear
  how to pass an undecodable filename via a pipe on Windows -- perhaps
  `GetShortPathNameW -> fsencode -> pipe` might work in some cases. That
  assumes that the short path exists and that it is always encodable
  using mbcs. If we control all parts of the pipeline *and* the Windows
  API uses proper UTF-16 (not UCS-2), then UTF-8 can be used to pass
  filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could be
  tried, e.g., https://github.com/Drekin/win-unicode-console
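
The 'surrogateescape' behaviour that SystemTextStream() relies on can
be sketched as follows. The encoding is fixed to UTF-8 here only to make
the example deterministic; the real code uses
sys.getfilesystemencoding(), and the sample filename is invented:

```python
raw = b'caf\xe9.txt'  # Latin-1 bytes; not valid UTF-8

# strict decoding rejects the name outright
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xE9 -> U+DCE9), so the name survives as a str
text = raw.decode('utf-8', errors='surrogateescape')

# encoding with the same handler restores the original bytes exactly
roundtrip = text.encode('utf-8', errors='surrogateescape')

print(strict_ok, ascii(text), roundtrip == raw)
```

The round trip is what lets an undecodable name pass through a text
stream and still refer to the same file on output.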


--
Akira


