[Python-ideas] Iterating non-newline-separated files should be easier
Akira Li
4kir4.1i at gmail.com
Tue Jul 22 18:05:42 CEST 2014
Paul Moore <p.f.moore at gmail.com> writes:
> On 21 July 2014 01:41, Andrew Barnert
> <abarnert at yahoo.com.dmarc.invalid> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>> not a good thing to do, apologies); you can find it at
>> http://bugs.python.org/file36008/pep-newline.txt
>
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
>
> Here's the sort of thing I mean, written for newline-separated files:
>
>     import sys
>
>     def process(filename):
>         """Trivial example"""
>         return filename.lower()
>
>     if __name__ == '__main__':
>
>         for filename in sys.stdin:
>             filename = process(filename)
>             print(filename)
>
> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.
>

The `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
can replace the `sys.std*` streams without worrying about preserving the
`sys.__std*__` streams:

  #!/usr/bin/env python
  import io
  import re
  import sys
  from pathlib import Path

  def transform_filename(filename: str) -> str: # example
      """Normalize whitespace in basename."""
      path = Path(filename)
      new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
      path.replace(new_path) # rename on disk if necessary
      return str(new_path)

  def SystemTextStream(bytes_stream, **kwargs):
      encoding = sys.getfilesystemencoding()
      return io.TextIOWrapper(bytes_stream,
          encoding=encoding,
          errors='surrogateescape' if encoding != 'mbcs' else 'strict',
          **kwargs)

  nl = '\0' if '-0' in sys.argv else None
  sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
  for line in SystemTextStream(sys.stdin.detach(), newline=nl):
      print(transform_filename(line.rstrip(nl)), end=nl)

io.TextIOWrapper() plays the role of open() in this case. The code
assumes that the `newline` parameter accepts '\0'.
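
For comparison, here is a rough sketch of how NUL-separated records can
be read without relying on `newline='\0'`. It works on a binary stream
and splits manually; `iter_records` and its parameters are hypothetical
names used only for illustration, not part of the proposal or of the
stdlib.

```python
import io

def iter_records(stream, sep=b'\0', chunk_size=8192):
    """Yield records from a binary stream, split on sep (separator removed)."""
    buf = b''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF
            break
        buf += chunk
        # Everything before the last separator is complete; keep the tail.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        yield buf  # final record that had no trailing separator

# A tiny chunk size forces records to span several reads.
data = io.BytesIO(b'a b.txt\0c\nd.txt\0last')
names = list(iter_records(data, chunk_size=4))
print(names)
```

In a real filter the stream would presumably be `sys.stdin.buffer`, and
each record could be turned into a str with `os.fsdecode()`.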

The example function handles Unicode whitespace to demonstrate why
opaque bytes-based cookies can't be used to represent filenames in this
case even on POSIX, though which characters are recognized depends on
sys.getfilesystemencoding().
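
A small illustration of the point about Unicode whitespace, assuming the
name is encoded as UTF-8 (the sample name is made up): the same `\s+`
substitution collapses an EM SPACE in a str, but leaves the encoded
bytes untouched, because a bytes pattern only matches ASCII whitespace.

```python
import re

name = 'a\u2003\u2003b.txt'  # two EM SPACE characters in the name

# str pattern: \s matches Unicode whitespace
as_text = re.sub(r'\s+', ' ', name)

# bytes pattern: \s matches only ASCII whitespace
raw = name.encode('utf-8')
as_bytes = re.sub(rb'\s+', b' ', raw)

print(repr(as_text))
print(as_bytes == raw)
```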

Note:

- `end=nl` is necessary because `print()` prints '\n' by default -- it
  does not use `file.newline`
- the `-0` option is required in the current implementation if filenames
  may have trailing whitespace (without it, `line.rstrip(None)` strips
  all trailing whitespace, not just the newline). This can be improved
- SystemTextStream() handles filenames that are undecodable in the
  current locale, i.e., non-ASCII names are allowed even in the C locale
  (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear
  how to pass an undecodable filename via a pipe on Windows -- perhaps
  `GetShortPathNameW -> fsencode -> pipe` might work in some cases. That
  assumes that the short path exists and that it is always encodable
  using mbcs. If we can control all parts of the pipeline *and* the
  Windows API uses proper UTF-16 (not UCS-2), then UTF-8 can be used to
  pass filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could
  be tried, e.g., https://github.com/Drekin/win-unicode-console
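
To make the note about undecodable filenames concrete, here is a minimal
sketch of the 'surrogateescape' round trip. It hard-codes UTF-8 instead
of calling sys.getfilesystemencoding() so that the result does not
depend on the locale; that choice, and the sample bytes, are assumptions
made for the example.

```python
raw = b'caf\xe9.txt'  # Latin-1 bytes; not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same error handler restores the original bytes.
round_tripped = name.encode('utf-8', errors='surrogateescape')

# A strict encode rejects the lone surrogate.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(name), round_tripped == raw, strict_ok)
```
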
--
Akira