Stream programming

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Mar 23 23:23:44 EDT 2012


On Fri, 23 Mar 2012 17:00:23 +0100, Kiuhnm wrote:

> I've been writing a little library for handling streams as an excuse for
> doing a little OOP with Python.
> 
> I don't share some of the views on readability expressed on this ng.
> Indeed, I believe that a piece of code may very well start as complete
> gibberish and become a pleasure to read after some additional
> information is provided.
[...]
> numbers - push - avrg - 'med' - pop - filter(lt('med'), ge('med'))\
>      - ['same', 'same'] - streams(cat) - 'same'
> 
> Ok, we're at the "complete gibberish" phase.
> 
> Time to give you the "additional information".

There are multiple problems with your DSL. Having read your explanation, 
and subsequent posts, I think I understand the data model, but the syntax 
itself is not very good and far from readable. It is just too hard to 
reason about the code.

Your syntax conflicts with established, far more common, uses of the same
symbols: you use - (normally subtraction) to mean "call a function" and |
(normally a pipe, or bitwise or) to join two or more streams into a flow.

You also use () for calling functions, and the difference between - and 
() isn't clear. So a mystery there -- your DSL seems to have different 
function syntax, depending on... what?

The semantics are unclear even after your examples. You give examples to
explain the syntax, but to understand the examples, the reader needs to
understand the syntax first. That suggests that the semantics are unclear
even in your own mind, or at least too difficult to explain in simple
examples.

Take this example:

> Flows can be saved (push) and restored (pop) :
>    [1,2,3,4] - push - by(2) - 'double' - pop | val('double')
>        <=> [1,2,3,4] | [2,4,6,8]

What the hell does that mean? The reader initially doesn't know what 
*any* of push, by(2), pop or val('double') means. All they see is an 
obfuscated series of calls that starts with a stream as input, makes a 
copy of it, and doubles the entries in the copy: you make FIVE function 
calls to perform TWO conceptual operations. So the reader can't even map 
a function call to a result.

With careful thought and further explanations from you, the reader (me) 
eventually gets a mental model here. Your DSL has a single input which is 
pipelined through a series of function calls by the - operator, plus a 
separate stack. (I initially thought that, like Forth, your DSL was stack 
based. But it isn't, is it?)
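
To make that mental model concrete, here is a minimal Python sketch. It is 
only my reading of the DSL, not the library's actual implementation: the 
Flow class, its "saved" attribute and the bodies of push, pop and by are 
all assumptions made for illustration.

```python
class Flow:
    """Hypothetical model: a list of streams plus a separate save stack."""

    def __init__(self, streams, saved=None):
        self.streams = [list(s) for s in streams]
        self.saved = list(saved or [])

    def __sub__(self, func):
        # "flow - f" pipes the whole flow through f, one stage at a time.
        return func(self)


def push(flow):
    # Save a copy of the current streams; the flow itself is unchanged.
    return Flow(flow.streams, flow.saved + [flow.streams])


def pop(flow):
    # Restore the most recently saved streams, discarding the current ones.
    return Flow(flow.saved[-1], flow.saved[:-1])


def by(n):
    # Return a stage that multiplies every item of every stream by n.
    def stage(flow):
        return Flow([[x * n for x in s] for s in flow.streams], flow.saved)
    return stage


result = Flow([[1, 2, 3, 4]]) - push - by(2)
print(result.streams)          # [[2, 4, 6, 8]]
print((result - pop).streams)  # [[1, 2, 3, 4]]
```

Note that push leaves the current streams in place, so the "stack" here 
is a side channel and not the thing the pipeline operates on.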

It seems to me that the - operator is only needed as syntactic sugar to 
avoid using reverse Polish notation and an implicit stack. Instead of the 
Forth-like:

[1,2,3,4] dup 2 *

your DSL has an explicit stack, and an explicit - operator to call a 
function. Presumably "[1,2] push" would be a syntax error.

I think this is a good example of an inferior syntax. Contrast your:

[1,2,3,4] - push - by(2) - 'double' - pop | val('double')

with the equivalent RPL:

[1,2,3,4] dup 2 *


Now *that* is a pleasure to read, once you wrap your head around reverse 
Polish notation and the concept of a stack. Which you need in your DSL 
anyway, to understand push and pop.
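
For comparison, the stack version is small enough to model in a few lines 
of Python. This is a sketch of a made-up two-word dialect (only "dup" and 
"*" are handled, and "*" is assumed to scale a stream by a number), not a 
real Forth or RPL interpreter.

```python
def rpn(tokens):
    """Evaluate a tiny hypothetical RPN dialect with 'dup' and '*'."""
    stack = []
    for tok in tokens:
        if tok == "dup":
            # Duplicate the top of the stack.
            stack.append(stack[-1])
        elif tok == "*":
            # Pop a number and a stream, push the scaled stream.
            n = stack.pop()
            stream = stack.pop()
            stack.append([x * n for x in stream])
        else:
            # Anything else is a literal and is pushed as-is.
            stack.append(tok)
    return stack


print(rpn([[1, 2, 3, 4], "dup", 2, "*"]))  # [[1, 2, 3, 4], [2, 4, 6, 8]]
```

Every token maps to exactly one stack operation, which is what makes the 
one-liner easy to reason about.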

You say that this is an "easier way to get the same result":

[1,2,3,4] - [id, by(2)]

but it isn't, is it? The more complex example above ends up with two 
streams joined in a single flow:

[1,2,3,4]|[2,4,6,8]

whereas the shorter version using the magic "id" gives you a single 
stream containing nested streams:

[[1,2,3,4], [2,4,6,8]]


So, how could you make this more readable?

* Don't fight the reader's expectations. If they've programmed in Unix 
shells, they expect | as the pipelining operator. If they haven't, they 
probably will find >> easy to read as a dataflow operator. Either way, 
they're probably used to seeing a|b as meaning "or" (as in "this stream, 
or this stream") rather than the way you seem to be using it ("this 
stream, and this stream").

Here's my first attempt at improved syntax that doesn't fight the user:

[1,2,3,4] >> push >> by(2) >> 'double' >> pop & val('double')

"push" and "pop" are poor choices of words. Push does not actually push 
its input onto the stack, which would leave the input stream empty. It 
makes a copy. You explain what they do:

"Flows can be saved (push) and restored (pop)"

so why not just use SAVE and RESTORE as your functions? Or if they're too 
verbose, STO and RCL, or my preference, store and recall.

[1,2,3,4] >> store >> by(2) >> 'double' >> recall & val('double')
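
As it happens, that line is valid Python if you overload the operators, 
since >> binds more tightly than &. Here is a rough sketch; the Flow 
class and the bodies of store, recall, by and val are my own guesses at 
the semantics, not your library's code.

```python
class Flow:
    """Hypothetical flow: streams, a save stack, and named streams."""

    def __init__(self, streams, saved=(), names=None):
        self.streams = [list(s) for s in streams]
        self.saved = list(saved)
        self.names = dict(names or {})

    def __rshift__(self, stage):
        # "flow >> 'name'" labels the current streams; "flow >> f" calls f.
        if isinstance(stage, str):
            names = dict(self.names)
            names[stage] = self.streams
            return Flow(self.streams, self.saved, names)
        return stage(self)

    def __and__(self, other):
        # "flow & val(name)" joins the named streams onto this flow.
        return Flow(self.streams + other(self), self.saved, self.names)


def store(flow):
    # Save the current streams without consuming them.
    return Flow(flow.streams, flow.saved + [flow.streams], flow.names)


def recall(flow):
    # Replace the current streams with the most recently stored ones.
    return Flow(flow.saved[-1], flow.saved[:-1], flow.names)


def by(n):
    def stage(flow):
        scaled = [[x * n for x in s] for s in flow.streams]
        return Flow(scaled, flow.saved, flow.names)
    return stage


def val(name):
    # Look the name up in whichever flow it is joined to.
    return lambda flow: flow.names[name]


flow = Flow([[1, 2, 3, 4]]) >> store >> by(2) >> 'double' >> recall & val('double')
print(flow.streams)  # [[1, 2, 3, 4], [2, 4, 6, 8]]
```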

I'm still not happy with & for the join operator. I think that the use of
+ for concatenate and & for join is just one of those arbitrary choices
that the user will have to learn, although I'm tempted to try using a
colon instead.

[1,2,3]:[4,5,6]

would be a flow with two streams.

I don't like the syntax for defining and using names. Here's a random 
thought:

[1,2,3,4] >> store >> by(2) >> @double >> recall & double 

Use @name to store to a name, and the name alone to retrieve from it. But 
I haven't given this too much thought, so it too might suck.

Some other problems with your DSL:

> A flow can be transformed:
>    [1,2] - f <=> [f(1),f(2)]

but that's not consistently true. For instance:

[1,2] - push  <=/=>  [push(1), push(2)]


So the reader needs to know all the semantics of the particular function 
f before being able to reason about the flow. Your DSL displays magic 
behaviour, which is bad and makes it hard to read the code because the 
reader may not know which functions are magic and which are not.
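
Here is the problem in miniature, as a Python sketch. I am assuming the 
operator dispatches on some kind of marker; the "special" decorator and 
the apply function are invented for illustration and may be nothing like 
what your library does.

```python
def special(func):
    # Hypothetical marker: func takes the whole stream, not one item.
    func.special = True
    return func


def apply(stream, func):
    # What "stream - func" would have to do under this assumption.
    if getattr(func, "special", False):
        return func(stream)
    return [func(x) for x in stream]


def double(x):
    return x * 2


@special
def flatten(stream):
    return [x for item in stream for x in item]


print(apply([1, 2], double))                    # [2, 4]
print(apply([[], (1, 2), [3, 4, 5]], flatten))  # [1, 2, 3, 4, 5]
```

The two calls look identical at the call site. Nothing there tells the 
reader that one maps over the items and the other swallows the stream 
whole.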


> Some functions are special and almost any function can be made special:
>    [1,2,3,4,5] - filter(isprime) <=> [2,3,5] 
>    [[],(1,2),[3,4,5]] - flatten <=> [1,2,3,4,5]

You say that as if it were a good thing.



-- 
Steven


