[Python-ideas] Python multi-dimensional array constructor
Steven D'Aprano
steve at pearwood.info
Wed Oct 19 23:31:47 EDT 2016
On Wed, Oct 19, 2016 at 03:08:21PM -0400, Todd wrote:
[taking your later comment out of the order it was written]
> If this sort of thing doesn't interest you I won't be offended if you stop
> reading now, and I apologize if it is considered off-topic for this ML.
No problem Todd, we shouldn't be offended by ideas, and this is
definitely on-topic.
> I have been thinking about how to go about having a multidimensional array
> constructor in python. I know that Python doesn't have a built-in
> multidimensional array class and won't for the foreseeable future.
Generally speaking, Python doesn't invent syntax just on the off-chance
that it will come in handy, nor does it typically invent operators for
third-party libraries to use if they have no use in the built-ins.
I'm only aware of two exceptions to this, and both were added for numpy:
extended slicing seq[start:end:step] and matrix multiplication A @ B.
Extended slicing now is used by the built-ins, but originally it was
added specifically for numpy.
However, in both cases, the suggestion came from the numpy developers
themselves, and they had a specific, concrete need for the feature. Both
features were solutions to real problems found by numpy users. I wasn't
around when extended slicing was added, but matrix multiplication is an
excellent example of a well-researched, well-written PEP:
http://python.org/dev/peps/pep-0465/
Whereas your suggestion seems more like a solution in search of a
problem. You've come up with syntax for building arrays, but you don't
seem to know which, if any, array will use this; nor do you seem to have
identified an actual problem with the existing solution used by numpy
(apart from calling them "somewhat verbose").
> The problem is finding an operator that isn't already being used, wouldn't
> conflict with existing rules, wouldn't break existing code, but that would
> still be at clearer and and more concise than the current syntax.
Just a brief note on terminology: you're not describing an operator,
you're describing a "display" syntax: delimiters used to build a type
such as tuple, list or dict. I still think of them as "list literals"
etc, [1, 2, 3, 4] for example, even though technically they are not
necessarily literals (i.e. known at compile-time) and officially they are
called "list displays" etc.
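To make the distinction concrete, here is a minimal sketch using the standard ast module (the variable and helper names are made up for this illustration). Both expressions parse to the same kind of node, a list display, but only the first is built purely from constants; the second contains names and a call that can only be evaluated at run time.

```python
import ast

# A list display whose elements are all constants.
constant_display = ast.parse("[1, 2, 3, 4]", mode="eval").body

# A list display whose elements are only known at run time.
runtime_display = ast.parse("[x, x + 1, f(x)]", mode="eval").body

# Both are the same kind of syntax node.
print(type(constant_display).__name__)   # List
print(type(runtime_display).__name__)    # List


def all_constants(node):
    """Return True if every element of the display is a constant."""
    return all(isinstance(elt, ast.Constant) for elt in node.elts)


print(all_constants(constant_display))   # True
print(all_constants(runtime_display))    # False
```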
> The notation I came up with uses "[|" and "|]". I picked this for 4
> reasons. First, it isn't currently valid python syntax. Second, it is
> clearly connected with the list constructor "[ ]". Third, it is
> reminiscent of the "⟦ ⟧" symbols used for matrices in mathematics.
Sometimes used for matrices. It's more common to use a multiple-line
version of [ ] which is, of course, hard to type in a regular editor :-)
See examples of matrices here:
http://mathworld.wolfram.com/Matrix.html
Moving on to the multi-dimensional examples you give:
> For a 2D array, you would use two vertical bars as a dimension separator
> "||" (multiple vertical bars are also not valid python syntax):
>
> a = [| 0, 1, 2 || 3, 4, 5 |]
>
> Or, on multiple lines (whitespace is ignored):
>
> a = [| 0, 1, 2 ||
> 3, 4, 5 |]
To me, that looks decidedly strange. The | symbol has the disadvantage
that you cannot tell which is opening a row and which is closing a row.
The above looks like:
- first row: opened with a single bar, closed with two bars;
- second row: no opening delimiter at all, closed with a single bar.
I think that you have to compete with existing syntax for nested lists.
The lowest common denominator for any array is to use nested lists and a
function call. Nested lists can be easily converted into *any* array
type you like, rather than picking out one, and only one, array type for
special treatment.
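As a rough sketch of what "nested lists plus a function call" looks like in practice, the same nested list below is fed to two different containers using only the standard library. The helper name flatten is hypothetical, not an existing function; numpy users would pass the same nested list to numpy.array instead.

```python
from array import array


def flatten(nested):
    """Yield the leaf values of arbitrarily nested lists, in row-major order."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


rows = [[0, 1, 2], [3, 4, 5]]

# The same nested list can feed whichever container you prefer.
flat = array("i", flatten(rows))             # stdlib typed array of C ints
as_tuples = tuple(tuple(r) for r in rows)    # immutable 2D structure

print(flat.tolist())    # [0, 1, 2, 3, 4, 5]
print(as_tuples)        # ((0, 1, 2), (3, 4, 5))
```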
If Python had a built-in array type, then maybe this would be justified,
but it doesn't, and isn't likely to get one: lists fill the role that
arrays do in most other languages. There is an array type in the
standard library, array.array, but it's not built-in and not important
enough to be built-in or to get special syntax of its own.
And I'm not sure that numpy users miss the ability to write
multi-dimensional arrays using syntax instead of a function call.
Normally they would want the ability to specify a type and an order
(rows first, like C, or columns first, like Fortran), and I think that
for multi-dimensional arrays it is more usual and simpler to write out
the values in a linear array and tell the array constructor to
re-arrange them. Trying to write out a visual representation of anything
with more than two dimensions is cumbersome when you are limited to the
flat plane of a text file.
Consider:
[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
If your editor can highlight matching brackets, it's quite easy to see
where each row and plane begins and ends. Whereas your suggested syntax
looks to me like a whole bunch of confusing lines. I cannot even work
out what are the dimensions of this example:
> b = [||| 0, 1, 2
> || 3, 4, 5
> ||| 6, 7, 8
> || 9, 10, 11
> |||]
although if I sit and stare at it for a while I might guess... 4*3? If I
already know it is meant to be 3D, then I might be able to work out that
the extra bar means something, and guess 2*3*2, but I really wouldn't
want to bet my sanity on understanding what those lines mean.
(Especially since, later on, the exact number and placement of lines is
optional.)
What's the rule for when to use triple bars ||| and when to use double
bars || or a single bar | ? It's a mystery to me. At least with matching
left and right delimiters [ ] I can match them up to see where they
begin and end.
> The rule for the number of dimensions is just the highest-specified
> dimension. So these are equivalent:
>
> a = [| 0, 1, 2 ||
> 3, 4, 5 |]
>
> b = [|| 0, 1, 2 ||
> 3, 4, 5 ||]
Okay, now I'm completely lost. Doesn't the first example with a single
vertical bar | mean that it is a 1D array? What's the "highest-specified
dimension"? Are you suggesting that we have to count vertical bars to
work out the dimension?
> This also means you would only strictly need to set the dimensions at one
> end. That means these are equivalent, although the second and third case
> should be discouraged:
>
> a = [|| 0, 1, 2 ||]
>
> b = [| 0, 1, 2 ||]
>
> c = [|| 0, 1, 2 |]
This strikes me as a HUGE bug magnet. More like a bug black hole
actually, sucking in bugs from all through the universe and inserting
them into your arrays... *wink*
Effectively, what you are saying is that *as an intentional feature*, a
stray | accidentally inserted into your array will not cause a syntax
error, but will instead increase the number of dimensions of the array.
So instead of having a 17*10*30 array as you expected, you have a
1*17*10*30 or 17*10*30*1 array, which may or may not fail deep in your
code with some more or less unexpected and hard to diagnose error.
This (anti-)feature also makes syntax highlighting of matching bars
impossible, instead of merely fiendishly difficult. Since it isn't an
error for the bars not to match, you can't even count the bars to work
out which ones are supposed to match. You have to somehow intuit or
guess what the dimensions of the array are supposed to be, then reason
backwards to see whether the right number of bars are in the right
places to be compatible with those dimensions, and if not, your guess of
the dimensions might be wrong... or not.
> At least in my opinion, this sort of approach really shines when making
> higher-dimensional arrays.
You should compare your approach to that of mathematicians and other
programming languages.
Mathematicians don't really use multi-dimensional arrays. They have
vectors, which are 1D, and matrices which are 2D, then they have tensors
which confuse me, but they don't seem to use anything which corresponds
to a simple higher-dimension analog of matrices. Tensors come close, but
they don't seem to have anything like matrix-notation for tensors.
(Given that tensors are often infinite dimensional, I'm hardly
surprised.)
Matlab has syntax for 2D arrays, which can be expanded into 3D:
A = [1 2; 3 4];
A(:,:,2) = [5 6; 7 8]
R has an array function:
    > array(1:8, c(2,2,2))
    , , 1

         [,1] [,2]
    [1,]    1    3
    [2,]    2    4

    , , 2

         [,1] [,2]
    [1,]    5    7
    [2,]    6    8
Differences in ordering (row first or column first) aside, they are
equivalent to Python's:
[[[1, 2], [3, 4]],
[[5, 6], [7, 8]],
]
My HP-48 calculator uses square brackets for matrices, with the
convenience that in the calculator interface I only need to close the
first pair of brackets:
2D: I can enter the keystrokes:
[[1 2] 3 4
to get the 2D matrix:
[[ 1 2 ]
[ 3 4 ]]
but it has no support for 3D arrays.
Here's how C# does it:
https://msdn.microsoft.com/en-us/library/2yd9wwz4.aspx
> a = [|||| 48, 11, 141, 13, -60, -37, 58, -52, -29, 134
> || -6, 96, -66, 137, -59, -147, -118, -104, -123, -7
> ||| -103, 50, -89, -12, 28, -12, 119, -131, -73, 21
> || -58, 105, 25, -138, -106, -118, -29, -49, -63, -56
> |||| -43, -34, 101, -115, 41, 121, 3, -117, 101, -145
> || 100, -128, 76, 128, -113, -90, 52, -91, -72, -15
> ||| 22, -65, -118, 134, -58, 55, -73, -118, -53, -60
> || -85, -136, 83, -66, -35, -117, -71, 115, -56, 133
> ||||]
I wouldn't even want to guess what dimensions that is supposed to be. 10
columns, because I can count them, but everything else is a mystery.
> Compared to the current approach:
>
> a = np.ndarray([[[[48, 11, 141, 13, -60, -37, 58, -52, -29, 134],
> [-6, 96, -66, 137, -59, -147, -118, -104, -123, -7]],
> [[-103, 50, -89, -12, 28, -12, 119, -131, -73, 21],
> [-58, 105, 25, -138, -106, -118, -29, -49, -63, -56]]],
> [[[-43, -34, 101, -115, 41, 121, 3, -117, 101, -145],
> [100, -128, 76, 128, -113, -90, 52, -91, -72, -15]],
> [[22, -65, -118, 134, -58, 55, -73, -118, -53, -60],
> [-85, -136, 83, -66, -35, -117, -71, 115, -56, 133]]]])
But that's easy! Look at the nested brackets. The opening sequence tells
you that there are four dimensions:
[[[[
I can count the ten columns (and if I align them, I can visually verify
that each row has exactly ten columns). Looking at the nested lists, I
see:
[[[[ten columns],
[ten columns]],
so that's two rows by ten, then continuing:
[2 x 10]],
which closes another layer, so that's 2 items in the third dimension,
then we have another dimension:
[2 x 10 x 2]]
and the array is closed, giving us in total:
2 x 10 x 2 x 2
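The bracket counting above can be mechanised. This is only a sketch: the helper name shape is made up for the illustration and is not part of any library. It reports the sizes outermost first, so the 4D example would come out as (2, 2, 2, 10), the same sizes as counted above but listed in a different order. Because every opening bracket has a matching close, a ragged or mis-nested array can be detected rather than silently changing the dimensions.

```python
def shape(nested):
    """Return the size of each nesting level, outermost first.

    Raises ValueError if sibling sublists disagree (a ragged array).
    """
    if not isinstance(nested, list):
        return ()
    if not nested:
        return (0,)
    inner = [shape(item) for item in nested]
    if any(s != inner[0] for s in inner):
        raise ValueError("ragged nesting: %r" % (inner,))
    return (len(nested),) + inner[0]


small = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(shape(small))                     # (2, 2, 2)
print(shape([[0, 1, 2], [3, 4, 5]]))    # (2, 3)
```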
In my opinion anyone trying to write out a single 4D array like this is
on a hiding to nothing, even with clear nesting and
matching open/close delimiters. Since we don't have 4D text files, it's
better to write:
L = [48, 11, 141, 13, -60, -37, 58, -52, -29, 134,
-6, 96, -66, 137, -59, -147, -118, -104, -123, -7,
-103, 50, -89, -12, 28, -12, 119, -131, -73, 21,
-58, 105, 25, -138, -106, -118, -29, -49, -63, -56,
-43, -34, 101, -115, 41, 121, 3, -117, 101, -145,
100, -128, 76, 128, -113, -90, 52, -91, -72, -15,
22, -65, -118, 134, -58, 55, -73, -118, -53, -60,
-85, -136, 83, -66, -35, -117, -71, 115, -56, 133]
assert len(L) == 2*10*2*2
arr = array(L, dim=(2,10,2,2))
or something similar, and let the array constructor resize as needed.
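For what it's worth, here is one way such a constructor could work, as a pure-Python sketch. The function name reshape and its dims argument are assumptions made for this example, not an existing API, and dimensions are taken outermost first. With numpy, the equivalent would be along the lines of numpy.array(L).reshape(...).

```python
from math import prod


def reshape(flat, dims):
    """Build nested lists with the given dimensions (outermost first)."""
    if not dims:
        raise ValueError("dims must contain at least one dimension")
    if len(flat) != prod(dims):
        raise ValueError("cannot reshape %d items into %r" % (len(flat), dims))
    if len(dims) == 1:
        return list(flat)
    # Number of flat items that belong to each outermost slice.
    step = prod(dims[1:])
    return [reshape(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]


values = list(range(8))
arr = reshape(values, (2, 2, 2))
print(arr)    # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```

A length mismatch raises ValueError up front, which does the same job as the assert in the example above.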
--
Steve