[Python-ideas] Iterating non-newline-separated files should be easier

Sat Jul 19 11:01:59 CEST 2014

On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> On 19 July 2014 03:32, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> >> I still favour my proposal there to add a separate "readrecords()"
> >> method, rather than reusing the line based iteration methods - lines
> >> and arbitrary records *aren't* the same thing
> >
> > But they might well be the same thing. Look at all the Unix commands
> > that usually separate output with \n, but can be told to separate with
> > \0 instead. If you're reading from something like that, it should be
> > just as easy to split on \n as on \0.
> 
> Python isn't Unix, and Python has never supported \0 as a "line
> ending". Changing the meaning of existing constructs is fraught with
> complexity, and should only be done when there is absolutely no
> alternative. In this case, there's an alternative: a new method,
> specifically for reading arbitrary records.

I don't have an opinion one way or the other, but I don't quite see why 
you're worried about allowing the newline parameter to be set to some 
arbitrary separator. The best I can come up with is a scenario something 
like this:

I open a file with some record separator:

  fp = open(filename, newline="\0")

then pass it to a function:

  spam(fp)

which assumes that each chunk ends with a linefeed:

   assert next(fp).endswith('\n')

But in a case like that, the function is already buggy. I can see at 
least two problems with such an assumption:

- what if universal newlines has been turned off and you're reading
  a file created under (e.g.) classic Mac OS or RISC OS?

- what if the file contains a single line which does not end with an
  end of line character at all?

   open('/tmp/junk', 'wb').write(b"hello world!")  # bytes, since the mode is 'wb'
   next(open('/tmp/junk', 'r'))  # returns 'hello world!', with no trailing '\n'
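
To make both cases concrete, here is a rough sketch that works today. The
first_chunk() helper is a name I've made up for illustration; it just writes
some bytes to a temporary file, reopens it in text mode, and returns whatever
next(fp) gives back. Passing newline='\n' is how I'm modelling "universal
newlines turned off".

```python
import os
import tempfile

def first_chunk(data, newline=None):
    """Hypothetical helper: write bytes to a temp file, reopen it in
    text mode with the given newline setting, and return next(fp)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as out:
            out.write(data)
        with open(path, 'r', newline=newline) as fp:
            return next(fp)
    finally:
        os.remove(path)

# Case 1: classic Mac OS style '\r' line endings, read with
# newline='\n' so that only '\n' counts as a line terminator.
mac = first_chunk(b"spam\reggs\r", newline='\n')

# Case 2: a single line with no end-of-line character at all.
bare = first_chunk(b"hello world!")

print(repr(mac), mac.endswith('\n'))
print(repr(bare), bare.endswith('\n'))
```

In neither case does the chunk end with '\n', so a function making that
assertion fails without any custom separator being involved.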

Have I missed something?

Although I don't mind whether files grow a readrecords() method or 
re-use the readlines() method, I'm not convinced that API decisions 
should be driven solely by the needs of programs which are already 
buggy.
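
For reference, whichever spelling wins, the behaviour being discussed can be
approximated today with a small generator. This is only a sketch under my own
assumptions: the name readrecords() is borrowed from Nick's proposal, but the
signature, the chunk_size parameter, and the decision to strip the separator
from each record are all mine, not anything that has been agreed.

```python
import io

def readrecords(fp, separator, chunk_size=8192):
    """Yield records from the text file fp, split on separator.

    Hypothetical helper: the separator is stripped from each record,
    and a trailing separator does not produce a final empty record.
    """
    buf = ''
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; the tail
        # stays in the buffer in case its record continues (or a
        # multi-character separator straddles the chunk boundary).
        *records, buf = buf.split(separator)
        for record in records:
            yield record
    if buf:
        yield buf

# Usage with NUL-separated data, as produced by e.g. "find -print0":
fp = io.StringIO("spam\0eggs\0cheese\0")
records = list(readrecords(fp, "\0"))
print(records)
```

Note that this differs from line iteration in one respect: iterating a file
keeps the '\n' on each line, whereas this sketch drops the separator. Which of
those is right for records is exactly the sort of thing the API discussion
would need to settle.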

-- 
Steven