[Python-ideas] Iterating non-newline-separated files should be easier

Fri Jul 18 02:04:00 CEST 2014

On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

>  On Thursday, July 17, 2014 2:40 PM, Alexander Heger <python at 2sn.net> wrote:

>>  Could the "split" (or splitline) keyword-only
>> parameter instead be passed to the open function 
>> (and the __init__ of IOBase and be stored there)?
> 
> Good idea. It's less powerful/flexible, but probably
> good enough for almost all use cases. (I can't think
> of any file where I'd need to split part of it on \0
> and the rest on \n…) Also, it means you can stick with
> the normal __iter__ instead of needing a separate
> iterlines method.

It turns out to be even simpler than I expected.

I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
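For context, here is a small sketch of what stock (unpatched) Python does with these arguments today. The helper name `open_error` and the file name `records.bin` are made up for illustration; the only assumption is that `open` raises `ValueError` in both rejected cases, without relying on the exact message.

```python
import os
import tempfile

def open_error(path, mode, newline):
    """Hypothetical helper: return the ValueError message from open(), or None on success."""
    try:
        with open(path, mode, newline=newline):
            return None
    except ValueError as exc:
        return str(exc)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    with open(path, "wb") as f:
        f.write(b"a\0b\0")

    # Text mode: a newline value outside the standard set is rejected.
    text_err = open_error(path, "r", "\0")
    # Binary mode: any newline argument at all is rejected.
    binary_err = open_error(path, "rb", "\n")
    # A standard value in text mode is accepted.
    ok = open_error(path, "r", "\n")

print(text_err)
print(binary_err)
```

These two rejections are the checks the change described here would relax.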

For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check that rejects a newline argument, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
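To make the intended behavior concrete, here is a rough pure-Python approximation of what iterating a binary file with a custom separator would yield. This is not the patch itself: `iter_records` and its `sep` and `chunk_size` parameters are hypothetical names, and it works on any object with a binary `read` method rather than touching the io classes.

```python
import io

def iter_records(raw, sep=b"\0", chunk_size=8192):
    """Yield sep-terminated records from a binary file object (illustrative helper)."""
    if not sep:
        raise ValueError("empty separator")
    pending = b""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last separator is complete; the tail may be
        # a partial record (or a partial separator) that needs more data.
        *complete, pending = pending.split(sep)
        for record in complete:
            # Keep the terminator, as readline does with b"\n".
            yield record + sep
    if pending:
        # Final record with no trailing separator.
        yield pending

records = list(iter_records(io.BytesIO(b"a\0bc\0d"), chunk_size=2))
print(records)  # [b'a\x00', b'bc\x00', b'd']
```

With the newline parameter stored on the file object, the same result would come from a plain `for record in f:` loop, with no helper needed.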

(Of course you'd also want to add it to all of the stdlib cases like zipfile.ZipFile.open/zipfile.ZipExtFile.__init__, but there aren't too many of those.)

This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline. I think that's a good thing ('\r\n' and '\r' would need exceptions for backward compatibility; '\0'.encode('utf-16-le') isn't a very useful thing to split on; etc.), but doing it the other way is almost as easy, and very little code will ever care.
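A quick illustration of the UTF-16 point, using only str and bytes methods (the variable names are just for this example): splitting the encoded bytes on the encoded separator can match across code-unit boundaries, so the byte-level split disagrees with the text-level one.

```python
text = "a\0b"
encoded = text.encode("utf-16-le")      # b'a\x00\x00\x00b\x00'
byte_sep = "\0".encode("utf-16-le")     # b'\x00\x00'

# Byte-level split: the first b'\x00\x00' match straddles the high byte of
# 'a' and the low byte of the NUL, so the pieces are misaligned.
naive = encoded.split(byte_sep)

# Text-level split after decoding gives the expected records.
decoded = encoded.decode("utf-16-le").split("\0")

print(naive)    # [b'a', b'\x00b\x00']
print(decoded)  # ['a', 'b']
```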