[Python-ideas] Iterating non-newline-separated files should be easier

Andrew Barnert abarnert at yahoo.com
Fri Jul 18 18:43:26 CEST 2014


Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... Why not add one of those? 

Pushing back is easier to implement (since it's already there as a private method), but a bit funky, whereas peeking would make TextIOWrapper work the same way as buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.

While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs instead of just something that one implementation happens to have, and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, falls back to byte by byte if not.)
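
To illustrate the "use peek if present, otherwise go byte by byte" pattern, here is a minimal sketch for binary files. The helper name read_record is hypothetical, it is not the actual io mixin code, and it assumes a single-byte separator so that a separator can never straddle two peek windows.

```python
import io

def read_record(f, sep=b"\0"):
    """Read one sep-terminated record from a binary file object.

    Hypothetical helper, not part of the io module.
    """
    if len(sep) != 1:
        raise ValueError("this sketch handles single-byte separators only")
    parts = []
    peek = getattr(f, "peek", None)
    while True:
        if peek is not None:
            # peek() may return more or fewer bytes than requested,
            # and returns b"" at end of file.
            window = peek(1)
            if not window:
                break
            idx = window.find(sep)
            # Read up to and including the separator, or the whole window.
            n = idx + 1 if idx >= 0 else len(window)
        else:
            # No peek available: fall back to one byte at a time.
            n = 1
        chunk = f.read(n)
        if not chunk:
            break
        parts.append(chunk)
        if chunk.endswith(sep):
            break
    return b"".join(parts)
```

With io.BufferedReader the peek branch is taken; with a bare io.BytesIO, which has no peek method, the same function falls back to single-byte reads.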

On Jul 18, 2014, at 4:53, Wolfgang Maier <wolfgang.maier at biologie.uni-freiburg.de> wrote:

> On 07/18/2014 02:04 AM, Andrew Barnert wrote:
>> On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> 
>> 
>> 
>>>   On Thursday, July 17, 2014 2:40 PM, Alexander Heger <python at 2sn.net> wrote:
>> 
>>>>   Could the "split" (or splitline) keyword-only
>>>> parameter instead be passed to the open function
>>>> (and the __init__ of IOBase and be stored there)?
>>> 
>>> Good idea. It's less powerful/flexible, but probably
>>> good enough for almost all use cases. (I can't think
>>> of any file where I'd need to split part of it on \0
>>> and the rest on \n…) Also, it means you can stick with
>>> the normal __iter__ instead of needing a separate
>>> iterlines method.
>> 
>> It turns out to be even simpler than I expected.
>> 
>> I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
>> 
>> For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
> 
> You are not the first one to come up with this idea and suggest solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) for almost a decade:
> 
> http://bugs.python.org/issue1152248
> 
> Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:
> 
> http://bugs.python.org/issue1152248#msg109117

Thanks.

Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.
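
For readers who haven't seen either recipe, the general shape of the "any iterable" version is roughly the following. This is a minimal sketch of the technique, not the actual resplit from the earlier message or Douglas's recipe; it assumes the chunks are all str or all bytes, matching the separator's type, and it yields records without the trailing separator.

```python
def resplit(chunks, sep):
    """Yield sep-separated records from an iterable of str or bytes chunks.

    Illustrative sketch only; the signature is an assumption.
    """
    pending = sep[:0]  # empty str or bytes, matching the type of sep
    for chunk in chunks:
        pending += chunk
        # Everything before the last separator is complete; the tail may
        # be a partial record (or a partial separator), so keep it.
        *records, pending = pending.split(sep)
        yield from records
    if pending:
        yield pending
```

Because the unfinished tail is carried over between chunks, a multi-character separator that straddles a chunk boundary is still found. Feeding it a file object directly means the chunks are whatever the file's own line iteration produces, so a file with no newlines arrives as one huge chunk; passing fixed-size reads such as iter(lambda: f.read(8192), "") avoids that.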

The discussion there is also useful, as it raises the similar features in perl, awk, bash, etc. All of those work by having the user change either a global or something on the file object, rather than putting it in the line-reading code. That reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and that my initial approach of putting it in readline or a new readuntil method was wrong.

> Not that I wouldn't like to see this feature shipping with Python, but it may help to read through all aspects of the problem that have been discussed before.
> 
> Best,
> Wolfgang

