[Python-ideas] Python multi-dimensional array constructor

Wed Oct 19 23:31:47 EDT 2016

On Wed, Oct 19, 2016 at 03:08:21PM -0400, Todd wrote:

[taking your later comment out of the order it was written]
> If this sort of thing doesn't interest you I won't be offended if you stop
> reading now, and I apologize if it is considered off-topic for this ML.

No problem Todd, we shouldn't be offended by ideas, and this is 
definitely on-topic.

> I have been thinking about how to go about having a multidimensional array
> constructor in python.  I know that Python doesn't have a built-in
> multidimensional array class and won't for the foreseeable future.

Generally speaking, Python doesn't invent syntax just on the off-chance 
that it will come in handy, nor does it typically invent operators for 
third-party libraries to use if they have no use in the built-ins.

I'm only aware of two exceptions to this, and both were added for numpy: 
extended slicing seq[start:end:step] and matrix multiplication A @ B. 
Extended slicing is now used by the built-ins, but originally it was 
added specifically for numpy.
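
For reference, here is a minimal sketch of both features in use. The slicing half uses only built-ins. The `Matrix` class is a hypothetical toy, written only to show the `__matmul__` hook that `@` calls; it is not how numpy implements matrix multiplication.

```python
# Extended slicing on a built-in sequence: seq[start:end:step]
seq = list(range(10))
evens = seq[0:10:2]        # every second item
backwards = seq[::-1]      # a negative step walks the sequence in reverse


# The @ operator dispatches to __matmul__. Matrix is a hypothetical
# toy class for illustration only.
class Matrix:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __matmul__(self, other):
        # Pair each row of self with each column of other.
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])


A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])
product = (A @ B).rows
```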

However, in both cases, the suggestion came from the numpy developers 
themselves, and they had a specific, concrete need for the feature. Both 
features were solutions to real problems found by numpy users. I wasn't 
around when extended slicing was added, but matrix multiplication is an 
excellent example of a well-researched, well-written PEP:

http://python.org/dev/peps/pep-0465/

Whereas your suggestion seems more like a solution in search of a 
problem. You've come up with syntax for building arrays, but you don't 
seem to know which, if any, array will use this; nor do you seem to have 
identified an actual problem with the existing solution used by numpy 
(apart from calling them "somewhat verbose").

> The problem is finding an operator that isn't already being used, wouldn't
> conflict with existing rules, wouldn't break existing code, but that would
> still be clearer and more concise than the current syntax.

Just a brief note on terminology: you're not describing an operator, 
you're describing a "display" syntax: delimiters used to build a type 
such as tuple, list or dict. I still think of them as "list literals" 
etc, [1, 2, 3, 4] for example, even though technically they are not 
necessarily literals (i.e. known at compile-time) and officially they are 
called "list displays" etc.
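
The distinction can be seen with the standard `ast` module. This is a small sketch assuming Python 3.8 or later, where constants parse to `ast.Constant`; the variable names are only for illustration.

```python
import ast

# A number is a true literal: the parser produces a constant node.
number_node = ast.parse("123", mode="eval").body

# A list display gets its own node type, even when every element
# happens to be a constant.
list_node = ast.parse("[1, 2, 3, 4]", mode="eval").body

# Displays may contain arbitrary expressions, evaluated at run time.
x = 10
values = [x, x + 1, x * 2]
```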

> The notation I came up with uses "[|" and "|]".  I picked this for 4
> reasons.  First, it isn't currently valid python syntax.  Second, it is
> clearly connected with the list constructor "[ ]".  Third, it is
> reminiscent of the "⟦ ⟧" symbols used for matrices in mathematics.

Sometimes used for matrices. It's more common to use a multiple-line 
version of [ ] which is, of course, hard to type in a regular editor :-)

See examples of matrices here:

http://mathworld.wolfram.com/Matrix.html

Moving on to the multi-dimensional examples you give:

> For a 2D array, you would use two vertical bars as a dimension separator
> "||" (multiple vertical bars are also not valid python syntax):
> 
> a = [| 0, 1, 2 || 3, 4, 5 |]
> 
> Or, on multiple lines (whitespace is ignored):
> 
> a = [| 0, 1, 2 ||
>        3, 4, 5 |]

To me, that looks decidedly strange. The | symbol has the disadvantage 
that you cannot tell which is opening a row and which is closing a row. 
The above looks like:

- first row: opened with a single bar, closed with two bars;
- second row: no opening delimiter at all, closed with a single bar.

I think that you have to compete with existing syntax for nested lists. 
The lowest common denominator for any array is to use nested lists and a 
function call. Nested lists can be easily converted into *any* array 
type you like, rather than picking out one, and only one, array type for 
special treatment.
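
As a rough illustration of that point, the same nested list can feed several different container types. The sketch below sticks to the standard library; the names `nested`, `as_tuples`, `flat` and `ncols` are made up for the example.

```python
from array import array

nested = [[0, 1, 2], [3, 4, 5]]

# An immutable 2D structure: a tuple of tuples.
as_tuples = tuple(tuple(row) for row in nested)

# The standard library's array.array is 1D only, so flatten row by row
# and keep the row length separately.
flat = array("i", (value for row in nested for value in row))
ncols = len(nested[0])

# A third-party type accepts the same input, e.g. numpy.array(nested).
```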

If Python had a built-in array type, then maybe this would be justified, 
but it doesn't, and isn't likely to get one: lists fill the role that 
arrays do in most other languages. There is an array type in the 
standard library, array.array, but it's not built-in and not important 
enough to be built-in or to get special syntax of its own.

And I'm not sure that numpy users miss the ability to write 
multi-dimensional arrays using syntax instead of a function call. 
Normally they would want the ability to specify a type and an order 
(rows first, like C, or columns first, like Fortran), and I think that 
for multi-dimensional arrays it is more usual and simpler to write out 
the values in a linear array and tell the array constructor to 
re-arrange them. Trying to write out a visual representation of anything 
with more than two dimensions is cumbersome when you are limited to the 
flat plane of a text file.

Consider:

[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

If your editor can highlight matching brackets, it's quite easy to see 
where each row and plane begins and ends. Whereas your suggested syntax 
looks to me like a whole bunch of confusing lines. I cannot even work 
out what are the dimensions of this example:

> b = [||| 0, 1, 2
>       || 3, 4, 5
>      ||| 6, 7, 8
>       || 9, 10, 11
>      |||]

although if I sit and stare at it for a while I might guess... 4*3? If I 
already know it is meant to be 3D, then I might be able to work out that 
the extra bar means something, and guess 2*3*2, but I really wouldn't 
want to bet my sanity on understanding what those lines mean.

(Especially since, later on, the exact number and placement of lines is 
optional.)

What's the rule for when to use triple bars ||| and when to use double 
bars || or a single bar | ? It's a mystery to me. At least with matching 
left and right delimiters [ ] I can match them up to see where they 
begin and end.

> The rule for the number of dimensions is just the highest-specified
> dimension.  So these are equivalent:
> 
> a = [| 0, 1, 2 ||
>        3, 4, 5 |]
> 
> b = [|| 0, 1, 2 ||
>         3, 4, 5 ||]

Okay, now I'm completely lost. Doesn't the first example with a single 
vertical bar | mean that it is a 1D array? What's the "highest-specified 
dimension"? Are you suggesting that we have to count vertical bars to 
work out the dimension?

> This also means you would only strictly need to set the dimensions at one
> end.  That means these are equivalent, although the second and third case
> should be discouraged:
> 
> a = [|| 0, 1, 2 ||]
> 
> b = [| 0, 1, 2 ||]
> 
> c = [|| 0, 1, 2 |]

This strikes me as a HUGE bug magnet. More like a bug black hole 
actually, sucking in bugs from all through the universe and inserting 
them into your arrays... *wink*

Effectively, what you are saying is that *as an intentional feature*, a 
stray | accidentally inserted into your array will not cause a syntax 
error, but will instead increase the number of dimensions of the array. 
So instead of having a 17*10*30 array as you expected, you have a 
1*17*10*30 or 17*10*30*1 array, which may or may not fail deep in your 
code with some more or less unexpected and hard to diagnose error.

This (anti-)feature also makes syntax highlighting of matching bars 
impossible, instead of merely fiendishly difficult. Since it isn't an 
error for the bars not to match, you can't even count the bars to work 
out which ones are supposed to match. You have to somehow intuit or 
guess what the dimensions of the array are supposed to be, then reason 
backwards to see whether the right number of bars are in the right 
places to be compatible with those dimensions, and if not, your guess of 
the dimensions might be wrong... or not.

> At least in my opinion, this sort of approach really shines when making
> higher-dimensional arrays.

You should compare your approach to that of mathematicians and other 
programming languages.

Mathematicians don't really use multi-dimensional arrays. They have 
vectors, which are 1D, and matrices which are 2D, then they have tensors 
which confuse me, but they don't seem to use anything which corresponds 
to a simple higher-dimension analog of matrices. Tensors come close, but 
they don't seem to have anything like matrix-notation for tensors. 
(Given that tensors are often infinite dimensional, I'm hardly 
surprised.)

Matlab has syntax for 2D arrays, which can be expanded into 3D:

A = [1 2; 3 4];
A(:,:,2) = [5 6; 7 8]

R has an array function:

> array(1:8, c(2,2,2))
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

Differences in ordering (row first or column first) aside, they are 
equivalent to Python's:

[[[1, 2], [3, 4]],
 [[5, 6], [7, 8]],
 ]
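
The ordering difference can be made concrete. The sketch below fills a 2 x 2 x 2 array in column-major order, which is what the R output above shows, and compares it with the row-major nested list. The helper `r_style_slices` is hypothetical, written only for this comparison.

```python
def r_style_slices(values, dims):
    """Fill a rows x cols x planes array in column-major order and
    return it as a list of 2D planes, each a list of rows."""
    nrows, ncols, nplanes = dims
    return [[[values[i + nrows * j + nrows * ncols * k] for j in range(ncols)]
             for i in range(nrows)]
            for k in range(nplanes)]


r_slices = r_style_slices(list(range(1, 9)), (2, 2, 2))

python_nested = [[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]]

# Transposing each column-major plane gives the row-major nested list.
transposed = [[list(col) for col in zip(*plane)] for plane in r_slices]
```

Here `r_slices` matches the two slices R prints, and `transposed` equals `python_nested`.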

My HP-48 calculator uses square brackets for matrices, with the 
convenience that in the calculator interface I only need to close the 
first pair of brackets:

2D: I can enter the keystrokes:

    [[1 2] 3 4

to get the 2D matrix:

    [[ 1 2 ]
     [ 3 4 ]]

but it has no support for 3D arrays.

Here's how C# does it:

https://msdn.microsoft.com/en-us/library/2yd9wwz4.aspx

> a = [|||| 48, 11, 141, 13, -60, -37, 58, -52, -29, 134
>        || -6, 96, -66, 137, -59, -147, -118, -104, -123, -7
>       ||| -103, 50, -89, -12,  28, -12, 119, -131, -73, 21
>        || -58, 105, 25, -138, -106, -118, -29, -49, -63, -56
>      |||| -43, -34, 101, -115, 41, 121, 3, -117, 101, -145
>        || 100, -128, 76, 128, -113, -90, 52, -91, -72, -15
>       ||| 22, -65, -118, 134, -58, 55, -73, -118, -53, -60
>        || -85, -136, 83, -66, -35, -117, -71, 115, -56, 133
>      ||||]

I wouldn't even want to guess what dimensions that is supposed to be. 10 
columns, because I can count them, but everything else is a mystery.

> Compared to the current approach:
> 
> a = np.ndarray([[[[48, 11, 141, 13, -60, -37, 58, -52, -29, 134],
>                   [-6, 96, -66, 137, -59, -147, -118, -104, -123, -7]],
>                  [[-103, 50, -89, -12,  28, -12, 119, -131, -73, 21],
>                   [-58, 105, 25, -138, -106, -118, -29, -49, -63, -56]]],
>                 [[[-43, -34, 101, -115, 41, 121, 3, -117, 101, -145],
>                   [100, -128, 76, 128, -113, -90, 52, -91, -72, -15]],
>                  [[22, -65, -118, 134, -58, 55, -73, -118, -53, -60],
>                   [-85, -136, 83, -66, -35, -117, -71, 115, -56, 133]]]])

But that's easy! Look at the nested brackets. The opening sequence tells 
you that there are four dimensions:

    [[[[

I can count the ten columns (and if I align them, I can visually verify 
that each row has exactly ten columns). Looking at the nested lists, I 
see:

    [[[[ten columns],
       [ten columns]],

so that's two rows by ten, then continuing:

     [2 x 10]],

which closes another layer, so that's 2 items in the third dimension, 
then we have another dimension:

   [2 x 10 x 2]]

and the array is closed, giving us in total:

2 x 10 x 2 x 2
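
The same bracket-counting can be done mechanically. This is a minimal sketch, assuming the nested lists are regular (every sub-list at a given depth has the same length); `shape` is a hypothetical helper, and `four_d` is a zero-filled stand-in with the same nesting as the example above. It lists the extents outermost first, so the four sizes counted above come out in a different order.

```python
def shape(nested):
    """Return the extents of a regular nested list, outermost first."""
    dims = []
    level = nested
    while isinstance(level, list):
        dims.append(len(level))
        if not level:       # an empty list has nothing to descend into
            break
        level = level[0]    # follow the first opening bracket downwards
    return tuple(dims)


cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
# Stand-in with the same nesting as the 4D example: 2 x 2 x 2 groups of 10.
four_d = [[[[0] * 10 for _ in range(2)] for _ in range(2)] for _ in range(2)]

cube_shape = shape(cube)
four_d_shape = shape(four_d)
```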

In my opinion anyone trying to write out a single 4D array like this is 
on a hiding to nothing, even with clear nesting and 
matching open/close delimiters. Since we don't have 4D text files, it's 
better to write:

L = [48, 11, 141, 13, -60, -37, 58, -52, -29, 134,
    -6, 96, -66, 137, -59, -147, -118, -104, -123, -7,
    -103, 50, -89, -12,  28, -12, 119, -131, -73, 21,
    -58, 105, 25, -138, -106, -118, -29, -49, -63, -56,
    -43, -34, 101, -115, 41, 121, 3, -117, 101, -145,
    100, -128, 76, 128, -113, -90, 52, -91, -72, -15,
    22, -65, -118, 134, -58, 55, -73, -118, -53, -60,
    -85, -136, 83, -66, -35, -117, -71, 115, -56, 133]
assert len(L) == 2*10*2*2
arr = array(L, dim=(2,10,2,2))

or something similar, and let the array constructor resize as needed.
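
The `array(L, dim=...)` call above is a placeholder, not a real API. One way the idea could look in pure Python is sketched below. The `reshape` helper is hypothetical, fills in row-major order, and takes its dimensions outermost first, so matching the nested version above means passing (2, 2, 2, 10). With numpy, something like `numpy.array(L).reshape(2, 2, 2, 10)` would play the same role.

```python
from math import prod


def reshape(flat, dims):
    """Regroup a flat list into nested lists, row-major, outermost first."""
    if prod(dims) != len(flat):
        raise ValueError("dimensions do not match the number of items")
    result = list(flat)
    # Group from the innermost dimension outwards.
    for size in reversed(dims[1:]):
        result = [result[i:i + size] for i in range(0, len(result), size)]
    return result


L = list(range(2 * 10 * 2 * 2))   # stand-in for the 80 values above
arr = reshape(L, (2, 2, 2, 10))
```

The length check plays the same role as the assert above: a missing or extra value is reported up front instead of silently changing the array.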

-- 
Steve