On Sat, Mar 15, 2014 at 3:41 AM, Nathaniel Smith <njs@pobox.com> wrote:
> I think we need to
> know something about how often the Mat @ Mat @ vec type cases arise in
> practice. How often do non-scalar * and np.dot show up in the same
> expression? How often does it look like a * np.dot(b, c), and how often does
> it look like np.dot(a * b, c)? How often do we see expressions like
> np.dot(np.dot(a, b), c), and how often do we see expressions like np.dot(a,
> np.dot(b, c))? This would really help guide the debate. I don't have this
> data, and I'm not sure the best way to get it. A super-fancy approach would
> be to write a little script that uses the 'ast' module to count things
> automatically. A less fancy approach would be to just pick some code you've
> written, or a well-known package, grep through for calls to 'dot', and make
> notes on what you see. (An advantage of the less-fancy approach is that as a
> human you might be able to tell the difference between scalar and non-scalar
> *, or check whether it actually matters what order the 'dot' calls are done
> in.)

Okay, I wrote a little script [1] that scans Python source files and looks for things like 'dot(a, dot(b, c))' or 'dot(dot(a, b), c)', or the ndarray.dot method equivalents. So what we get out is:
- a count of how many 'dot' calls there are
- a count of how often we see left-associative nestings: dot(dot(a, b), c)
- a count of how often we see right-associative nestings: dot(a, dot(b, c))
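
For anyone who wants the general idea without opening the gist, here is a rough sketch of how such a count could be done with the 'ast' module. This is not the script from [1]; the helper names (is_dot_call, dot_operands, count_dots) are made up for illustration, and it assumes that any call whose name or attribute is 'dot' is a matrix product, which will occasionally be wrong.

```python
import ast


def is_dot_call(node):
    """True for call nodes of the form dot(...), np.dot(...), or x.dot(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    if isinstance(func, ast.Name):
        return func.id == "dot"
    if isinstance(func, ast.Attribute):
        return func.attr == "dot"
    return False


def dot_operands(node):
    """Return the (left, right) operand nodes of a dot call, or None if unclear."""
    func, args = node.func, node.args
    if isinstance(func, ast.Attribute) and len(args) == 1:
        # Method form a.dot(b): the receiver is the left operand.
        return func.value, args[0]
    if len(args) == 2:
        # Function form dot(a, b) or np.dot(a, b).
        return args[0], args[1]
    return None


def count_dots(source):
    """Count dot calls and left/right nestings in a string of Python source."""
    counts = {"dots": 0, "left": 0, "right": 0}
    for node in ast.walk(ast.parse(source)):
        if not is_dot_call(node):
            continue
        counts["dots"] += 1
        operands = dot_operands(node)
        if operands is None:
            continue
        left, right = operands
        # dot(dot(a, b), dot(c, d)) is counted once in each column.
        if is_dot_call(left):
            counts["left"] += 1
        if is_dot_call(right):
            counts["right"] += 1
    return counts


example = (
    "x = np.dot(np.dot(a, b), c)\n"
    "y = dot(a, dot(b, c))\n"
    "z = a.dot(b).dot(c)\n"
)
print(count_dots(example))
```

Note that a sketch like this treats a.dot(b).dot(c) as a left-associative nesting, since the receiver of the outer call is itself a dot call.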

Running it on a bunch of projects, I get:

| project      | dots | left | right | right/left |
|--------------+------+------+-------+------------|
| scipy        |  796 |   53 |    27 |       0.51 |
| nipy         |  275 |    3 |    19 |       6.33 |
| scikit-learn |  472 |   11 |    10 |       0.91 |
| statsmodels  |  803 |   46 |    38 |       0.83 |
| astropy      |   17 |    0 |     0 |        nan |
| scikit-image |   15 |    1 |     0 |       0.00 |
|--------------+------+------+-------+------------|
| total        | 2378 |  114 |    94 |       0.82 |

(Any other projects worth trying? This is something that could vary a lot between different projects, so it seems more important to get lots of projects here than to get a few giant projects. Or if anyone wants to run the script on their own private code, please do! Running it on my personal pile of random junk finds 3 left-associative and 1 right.)

Two flaws with this approach:
1) Probably some proportion of those nested dot calls are places where it doesn't actually matter which evaluation order one uses -- dot() forces you to pick one, so you have to. If people prefer to, say, use the "left" form in cases where it doesn't matter, then this could bias the left-vs-right results -- hard to say. (Somewhere in this thread it was suggested that the use of the .dot method could create such a bias, because a.dot(b).dot(c) is more natural than a.dot(b.dot(c)), but only something like 6% of the dot calls here use the method form, so this probably doesn't matter.)

OTOH, this also means that the total frequency of @ expressions where associativity even matters at all is probably *over*-estimated by the above.

2) This approach misses cases where the cumbersomeness of dot has caused people to introduce temporary variables, like 'foo = np.dot(a, b); bar = np.dot(foo, c)'. So this causes us to *under*-estimate how often associativity matters. I did read through the 'dot' uses in scikit-learn and nipy, though, and only caught a handful of such cases, so I doubt it changes anything much.
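
To make "whether the evaluation order matters" in flaw 1 concrete: for a three-term product the result is the same either way, but the cost depends on the operand shapes. The following back-of-the-envelope sketch counts scalar multiplications for each grouping. It assumes plain 2-d shapes, a vector treated as an (n, 1) column, and the textbook m*n*p cost for an (m, n) by (n, p) product; the function names are made up for illustration.

```python
def matmul_cost(shape_a, shape_b):
    """Scalar multiplications for an (m, n) by (n, p) product, plus the result shape."""
    m, n = shape_a
    n2, p = shape_b
    if n != n2:
        raise ValueError("inner dimensions do not match")
    return m * n * p, (m, p)


def chain_costs(shape_a, shape_b, shape_c):
    """Return (cost of (a @ b) @ c, cost of a @ (b @ c))."""
    cost_ab, shape_ab = matmul_cost(shape_a, shape_b)
    cost_ab_c, _ = matmul_cost(shape_ab, shape_c)
    cost_bc, shape_bc = matmul_cost(shape_b, shape_c)
    cost_a_bc, _ = matmul_cost(shape_a, shape_bc)
    return cost_ab + cost_ab_c, cost_bc + cost_a_bc


n = 1000
# Mat @ Mat @ vec: the grouping matters a great deal.
mat_mat_vec = chain_costs((n, n), (n, n), (n, 1))
# Mat @ Mat @ Mat with square operands: both groupings cost the same.
mat_mat_mat = chain_costs((n, n), (n, n), (n, n))
print(mat_mat_vec, mat_mat_mat)
```

With n = 1000, the Mat @ Mat @ vec case comes out at 1,001,000,000 multiplications for the left grouping versus 2,000,000 for the right one, while the all-square case costs the same either way. Nested dot calls of the second kind are the ones where the author had to pick an order without it mattering.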


[1] https://gist.github.com/njsmith/9157645#file-grep-dot-dot-py

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh