[Python-ideas] Iterating non-newline-separated files should be easier

Thu Jul 24 11:07:59 CEST 2014

Andrew Barnert <abarnert at yahoo.com> writes:

> On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i at gmail.com> wrote:
>> Andrew Barnert <abarnert at yahoo.com> writes:
>>> On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i at gmail.com> wrote:
>>>> Paul Moore <p.f.moore at gmail.com> writes:
>>>>> On 21 July 2014 01:41, Andrew Barnert
>>>>> <abarnert at yahoo.com.dmarc.invalid> wrote:
>>>>>> OK, I wrote up a draft PEP, and attached it to the bug (if that's
>>>>>> not a good thing to do, apologies); you can find it at
>>>>>> http://bugs.python.org/file36008/pep-newline.txt
>>>>>
>>>>> As a suggestion, how about adding an example of a simple nul-separated
>>>>> filename filter - the sort of thing that could go in a find -print0 |
>>>>> xxx | xargs -0 pipeline? If I understand it, that's one of the key
>>>>> motivating examples for this change, so seeing how it's done would be
>>>>> a great help.
>>>>
>>>> `find -print0 | ./tr-filename -0 | xargs -0` example implies that you
>>>> can replace `sys.std*` streams without worrying about preserving
>>>> `sys.__std*__` streams:
>>>>
>>>> #!/usr/bin/env python
>>>> import io
>>>> import re
>>>> import sys
>>>> from pathlib import Path
>>>>
>>>> def transform_filename(filename: str) -> str: # example
>>>>    """Normalize whitespace in basename."""
>>>>    path = Path(filename)
>>>>    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
>>>>    path.replace(new_path) # rename on disk if necessary
>>>>    return str(new_path)
>>>>
>>>> def SystemTextStream(bytes_stream, **kwargs):
>>>>    encoding = sys.getfilesystemencoding()
>>>>    return io.TextIOWrapper(bytes_stream,
>>>>        encoding=encoding,
>>>>        errors='surrogateescape' if encoding != 'mbcs' else 'strict',
>>>>        **kwargs)
>>>>
>>>> nl = '\0' if '-0' in sys.argv else None
>>>> sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
>>>> for line in SystemTextStream(sys.stdin.detach(), newline=nl):
>>>>    print(transform_filename(line.rstrip(nl)), end=nl)
>>>
>>> Nice, much more complete example than mine. I just tried to handle as
>>> many edge cases as the original he asked about, but you handle
>>> everything.
>>>>
>>>> io.TextIOWrapper() plays the role of open() in this case. The code
>>>> assumes that `newline` parameter accepts '\0'.
>>>>
>>>> The example function handles Unicode whitespace to demonstrate why
>>>> opaque bytes-based cookies can't be used to represent filenames in this
>>>> case even on POSIX, though which characters are recognized depends on
>>>> sys.getfilesystemencoding().
>>>>
>>>> Note:
>>>>
>>>> - `end=nl` is necessary because `print()` prints '\n' by default -- it
>>>> does not use `file.newline`
>>>
>>> Actually, yes it does. Or, rather, print pastes on a '\n', but
>>> sys.stdout.write translates any '\n' characters to sys.stdout.writenl
>>> (a private variable that's initialized from the newline argument at
>>> construction time if it's anything other than None or '').
>>
>> You are right. I've stopped reading the source for print() function at
>> `PyFile_WriteString("\n", file);` line assuming that "\n" is not
>> translated if newline="\0". But the current behaviour if "\0" were in
>> "the other legal values" category (like "\r") would be to translate "\n"
>> [1]:
>>
>> When writing output to the stream, if newline is None, any '\n'
>> characters written are translated to the system default line
>> separator, os.linesep. If newline is '' or '\n', no translation takes
>> place. If newline is any of the other legal values, any '\n'
>> characters written are translated to the given string.
>>
>> [1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
>>
>> Example:
>>
>> $ ./python -c 'import sys, io;
>> sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n");
>> sys.stdout.write("\n\r\r\n")'| xxd
>> 0000000: 0d0a 0d0d 0d0a                           ......
>>
>> "\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
>>
>> In order for the newline="\0" case to work, it should behave like the
>> newline='' or newline='\n' case instead, i.e., no translation should take
>> place, to avoid corrupting embedded "\n\r" characters.
>
> The draft PEP discusses this. I think it would be more consistent to
> translate for \0, just like \r and \r\n.

I read the [draft]. No translation is a better choice here. Otherwise
(at the very least) it breaks the `find -print0` use case.

[draft] http://bugs.python.org/file36008/pep-newline.txt

Simple things should be simple (i.e., no translation unless special case):

- binary file -- a stream of bytes: no structure, no translation on
  read/write
- text file -- a stream of Unicode codepoints
- file with fixed-length chunks:

    for chunk in iter(partial(file.read, chunksize), EOF):  # EOF: '' or b''
        pass

- file with variable-length records (aka lines) which end with a
  separator or EOF: no translation, no escaping (no embedded separators):

    for line in file:
        pass

  or

    line = file.readline() # next(file)
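
The fixed-length chunk idiom from the list above can be written as a
small runnable sketch. The helper name `iter_chunks` is made up for
illustration; the only assumption is that read() returns an empty
value ('' or b'') at EOF, which holds for ordinary blocking streams:

```python
import io
from functools import partial

def iter_chunks(stream, chunksize):
    """Yield successive chunks of at most chunksize items until EOF."""
    # read(0) returns '' for text streams and b'' for binary ones, so it
    # gives an EOF sentinel of the right type for iter().
    sentinel = stream.read(0)
    return iter(partial(stream.read, chunksize), sentinel)

# The last chunk may be shorter than chunksize.
chunks = list(iter_chunks(io.BytesIO(b"abcdefgh12"), 4))
```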

newline in {None, '', '\r', '\r\n'} is a (very important) special case
that represents the complicated legacy behavior for text files.

newline='\0' (like '\n') should be a *much simpler* case: no
translation on read/write, no escaping (no embedded '\0'; each '\0' in
the stream is a separator).

newline='\0' is simple to explain: readline/next return everything until
the next '\0' (including it) or EOF. It is simple to implement - no
translation is required.
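
To make the intended semantics concrete, here is a rough sketch that
emulates them on top of plain read() calls. `iter_records` is a
hypothetical helper, not part of the io module or of the draft; it
returns each record with its separator attached and performs no
translation:

```python
import io

def iter_records(stream, sep, chunksize=8192):
    """Yield records from stream; each keeps its trailing sep, if any."""
    pending = stream.read(0)          # '' or b'', matching the stream type
    while True:
        chunk = stream.read(chunksize)
        if not chunk:                 # EOF
            break
        pending += chunk
        # Everything before the last separator is complete; the tail may
        # be a partial record (or a separator split across two chunks).
        *complete, pending = pending.split(sep)
        for record in complete:
            yield record + sep
    if pending:
        yield pending                 # last record ended at EOF, no sep

records = list(iter_records(io.StringIO("one\ntwo\0three\0four"), "\0"))
```

Note that the embedded '\n' in the first record survives untouched,
which is the property the `find -print0` case depends on.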

A keyword-only parameter such as readline(keep_end=True) and/or a
chomp()-like method could be added to simplify removing a trailing
newline.

newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} would behave like
newline="\n", i.e., no translation. New *docs for writing text files*:

  When writing output to the stream:

  - if newline is None, any '\n' characters written are translated to
    the system default line separator, os.linesep
  - if newline is '\r' or '\r\n', any '\n' characters written are
    translated to the given string
  - no translation takes place for any other newline value.
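
For comparison, the following sketch shows what the released
io.TextIOWrapper does on write for the values it accepts today, and
that it currently rejects '\0'. The helper names `written_bytes` and
`accepts_newline` are invented for this example; the exception type
for '\0' has differed between Python versions, so both are caught:

```python
import io

def written_bytes(text, newline):
    """Return the bytes produced by writing text with the given newline."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

def accepts_newline(value):
    """Report whether TextIOWrapper accepts value as its newline argument."""
    try:
        io.TextIOWrapper(io.BytesIO(), encoding="utf-8", newline=value)
    except (TypeError, ValueError):
        return False
    return True

no_translation = written_bytes("a\nb\r", "\n")     # '\n' left alone
crlf_translation = written_bytes("a\nb\r", "\r\n")  # only '\n' is rewritten
```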

The docs for binary files are simpler:

   No translation takes place for any newline value. The line terminator
   is the newline parameter (default is b'\n').

The new *docs for reading text files*:

  When reading input from the stream:

  - if newline is None, universal newlines mode is enabled: lines in the
    input can end in '\n', '\r', or '\r\n', and these are translated
    into '\n' before being returned to the caller
  - if newline is '', universal newlines mode is enabled, but line
    endings are returned to the caller untranslated
  - if newline is any other value, input lines are only terminated by
    the given string, and the line ending is returned to the caller
    untranslated.
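
The three reading rules can be checked against the values that
io.TextIOWrapper already accepts. This is only an illustration of
existing behavior; `read_lines` and `DATA` are names made up for the
example:

```python
import io

DATA = b"a\nb\r\nc\rd"

def read_lines(newline):
    """Return the lines obtained by iterating DATA with the given newline."""
    stream = io.TextIOWrapper(io.BytesIO(DATA), encoding="ascii",
                              newline=newline)
    return list(stream)

universal = read_lines(None)    # all endings recognized, translated to '\n'
untranslated = read_lines("")   # all endings recognized, returned as-is
only_cr = read_lines("\r")      # only '\r' terminates a line
```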

The new behavior, while more powerful, is no more complex than the old one:
https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

Backwards compatibility is preserved except that the newline parameter
accepts more values.

> For your script, there is no reason to pass newline=nl to the
> stdout replacement. The only effect that has on output is \n
> replacement, which you don't want. And if we removed that effect from
> the proposal, it would have no effect at all on output, so why pass
> it?

Keep in mind, I expect that newline='\0' does *not* translate '\n' to
'\0'. If you remove newline=nl, then an embedded '\n' might be corrupted
(with the default newline=None, it is translated to os.linesep on
output), i.e., it breaks the `find -print0` use case. Both newline=nl
for stdout and end=nl are required here. Though (optionally) it would be
nice to change `print()` so that it would use `end=file.newline or '\n'`
by default instead.
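
A small sketch of why both pieces matter. Since newline='\0' is not
accepted by current Python, newline='\n' stands in here for the
proposed "no translation" mode; `emit` is a hypothetical helper:

```python
import io

def emit(names, newline):
    """Print each name followed by NUL and return the resulting bytes."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding="utf-8",
                           errors="surrogateescape", newline=newline)
    for name in names:
        # end must be given explicitly: print() appends '\n' otherwise.
        print(name, end="\0", file=out)
    out.flush()
    return raw.getvalue()

names = ["a\nb", "c"]                 # first name has an embedded '\n'
intact = emit(names, "\n")            # no translation: name preserved
corrupted = emit(names, "\r\n")       # translation rewrites the filename
```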

There is also the line_buffering parameter. From the docs:

  If line_buffering is True, flush() is implied when a call to write
  contains a newline character.

i.e., you might also need newline=nl to flush() the stream in time.
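
The effect can be observed with an in-memory stream. This sketch
describes what CPython's C implementation of io does as far as I can
tell; other implementations may pass writes through sooner, so treat
the first observation as an assumption:

```python
import io

raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n",
                       line_buffering=True)

out.write("name\0")
after_nul = raw.getvalue()       # '\0' does not count as a line ending

out.write("\n")
after_newline = raw.getvalue()   # '\n' triggers the implied flush()
```

With line_buffering=True, the NUL-terminated record stays in the
wrapper's buffer until a '\n' (or an explicit flush()) arrives, which
is the situation that can stall a reader waiting on the other end of a
pipe.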

For example, the absence of the flush() call on newline may lead to a
deadlock if the subprocess module is used to implement pexpect-like
behavior. There are corresponding Python issues:

- text mode http://bugs.python.org/issue21332 : add line_buffering=True
  if bufsize=1, to avoid a deadlock (regression from Python 2 behavior)

- binary mode http://bugs.python.org/issue21471 : implement
  line_buffering=True behavior for binary files when bufsize=1

> Do you have a use case where you need to pass a non-standard newline
> to a text file/stream, but don't want newline replacement?

The `find -print0` use case that my code implements above.

> Or is it just a matter of avoiding confusion if people accidentally
> pass it for stdout when they didn't want it?

See the explanation above that starts with "Simple things should be simple."

>> My original code
>> works as is in this case i.e., *end=nl is still necessary*.
>
>>> But of course that's the newline argument to sys.stdout, and you only
>>> changed sys.stdin, so you do need end=nl anyway. (And you wouldn't
>>> want output translation here anyway, because that could also translate
>>> '\n' characters in the middle of a line, re-creating the same problem
>>> we're trying to avoid...)
>>>
>>> But it uses sys.stdout.newline, not sys.stdin.newline.
>>
>> The code affects *both* sys.stdout/sys.stdin. Look [2]:
>
> I didn't notice that you passed it for stdout as well--as I explained
> above, you don't need it, and shouldn't do it.

Both newline=nl and end=nl are needed because I assume that there is no
newline translation in the newline='\0' case. See the explanation
above. Here's the same code for context:

  sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
  for line in SystemTextStream(sys.stdin.detach(), newline=nl):
      print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html

> As a side note, I think it might have been a better design to have
> separate arguments for input newline, output newline, and universal
> newlines mode, instead of cramming them all into one argument; for
> some simple cases the current design makes things a little less
> verbose, but it gets in the way for more complex cases, even today
> with \r or \r\n. However, I don't think that needs to be changed as
> part of this proposal.

Usually, different objects are used for input and output, so a single
newline parameter already allows input newlines to be different from
output newlines.

The newline behavior for reading and writing is different, but the two
are closely related. Having two parameters wouldn't make the
documentation simpler.

Separate parameters might be useful if the same file object is used for
reading and writing *and* input/output newlines are different from each
other. But I don't think it is worth it to complicate the common case
(separate objects).

--
Akira