Mapping with contiguous ranges of keys
tomuxiong at gmx.com
Fri Dec 16 13:04:05 EST 2016
On 12/15/2016 11:57 PM, Terry Reedy wrote:
> On 12/15/2016 4:30 PM, Thomas Nyberg wrote:
>> On 12/15/2016 12:48 PM, Terry Reedy wrote:
>>> A sparse array has at least half missing values. This one has none on
>>> the defined domain, but contiguous duplicates.
>> I'm sorry for devolving into semantics, but there certainly isn't a
>> single definition of "sparse array" out there. For example, the
>> definition in wikipedia (https://en.wikipedia.org/wiki/Sparse_array)
>> doesn't agree with you:
> I think it does ;-).
>> "In computer science, a sparse array is an array in which most of the
>> elements have the default value (usually 0 or null)."
>> Personally my usage of sparse arrays in scipy has _always_ had all
>> defined values it's just that the default value was 0. I never deal
>> with "missing" values.
> Lucky you. In statistics and analysis of real, experimental data,
> missing values are often a possibility. They are always a nuisance,
> sometimes a major one. When the data are represented by arrays, rather
> than by some compacted form, some value has to be chosen to represent
> the absence of data.
I thought that the definitions were different because yours said "at
least half" and the other said "most" in addition to your saying
"missing values" and the other said "default value". Ignoring the first
minor point, if by "missing value" you basically mean "default value",
then yes I agree that the definition is the same. Also I didn't mean
that I was "lucky" to never have dealt with missing values (strictly
speaking I have, I just do it rarely), but my point was that I deal with
sparse matrices/arrays/structures quite often, but I rarely deal with
"missing values" like nulls/nans/Nones. But that point is moot given
that you apparently meant the same thing with "missing value" as I did
with "default value" so I don't think we disagree here.
Taking a step back, the real point of sparse anything is: can you
represent it in a space/speed efficient manner which is "stable" under
whatever operations you care about. I.e. if you have a default value of
0, then you have the property that 0 + x = x and 0 * x = 0 which means
that sparse matrices with default value 0 are stable under addition and
multiplication. If you have a default value of nan (or None or null) you
usually have the convention that nan * x = nan (possibly depends on the
sign of x) and nan + x = nan which means that a sparse matrix with nans
as default is also stable under addition and multiplication. If you
chose a default value of (say) 1 you would run into issues with the
stability of these operations. It wouldn't mean it's not "sparse", but
it would mean that the operations you care about might not work as well.
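To make the stability point concrete, here is a minimal sketch (the function name and representation are mine, not any library's): a sparse vector stored as a dict of {index: value}, where every unstored position implicitly holds the default. Adding two such vectors shows why default 0 is stable but default 1 is not:

```python
def add_sparse(a, b, default):
    """Add two sparse vectors stored as {index: value}; positions
    absent from both dicts implicitly hold `default`."""
    # Every position touched by neither vector ends up holding this:
    result_default = default + default
    out = {}
    for i in set(a) | set(b):
        v = a.get(i, default) + b.get(i, default)
        if v != result_default:
            out[i] = v
    return out, result_default

a = {2: 5.0}
b = {7: 3.0}
# default 0: 0 + 0 = 0, so the default is preserved -- stable.
print(add_sparse(a, b, 0))   # ({2: 5.0, 7: 3.0}, 0)
# default 1: 1 + 1 = 2, so the implicit value drifts -- not stable.
print(add_sparse(a, b, 1))   # ({2: 6.0, 7: 4.0}, 2)
```

With default 0 the result is still a sparse vector with the same default; with default 1 the "default" of the result has changed out from under you, which is exactly the instability described above.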
The extension in the thread at hand is just that now you have multiple
default values and the fact that they are not assigned at random, but
are instead runs of constant values means that you can put a sparse
structure on this (with a suitable definition of "sparse").
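One way to sketch that "runs of constant values" structure (the class and names here are mine, not from the thread): store only the key at which each run starts, and answer lookups with bisect.

```python
from bisect import bisect_right

class RunMapping:
    """Sketch of a mapping over contiguous ranges of integer keys:
    each (start_key, value) pair holds until the next start_key."""
    def __init__(self, runs):
        # runs must be sorted by start_key.
        self.starts = [s for s, _ in runs]
        self.values = [v for _, v in runs]

    def __getitem__(self, key):
        # Find the rightmost run starting at or before `key`.
        i = bisect_right(self.starts, key) - 1
        if i < 0:
            raise KeyError(key)
        return self.values[i]

m = RunMapping([(0, 'a'), (100, 'b'), (250, 'c')])
print(m[0], m[99], m[100], m[251])   # a a b c
```

Storage is proportional to the number of runs rather than the number of keys, and lookup is O(log runs), which is the space/speed efficiency the paragraph above is after.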
> Let's devolve to a memory-based language like C. An physical array
> consisting of sequential memory locations must have *some* bit pattern
> in every byte and hence in every multibyte block. If I remember
> correctly, C malloc initialized bytes to all 0 bits, which is an int
> value of 0 also. If there is no meaningful value, there must be a
> default value that means 'missing'. 0 may mean 'missing'. Null values
> are by definition non-values. Python uses None as a Null. If the
> example had been populated largely with Nones, I would have called it
It's not really super pertinent to this discussion, but malloc does not
guarantee that the values are zeroed. That is guaranteed by calloc, though.
Also the point with a sparse representation is that your default value
wouldn't exist in memory anywhere and that instead the operations would
understand that it exists by other factors. For example, a sparse matrix
with all 0s might* be represented by rows of the form (i, j, v),
where i and j are the row and column indices and v is the value at the
position. So in this representation we would have as many rows as we
would non-default values.
*I say might because there are usually more compact forms than this.
This isn't the internal representation of scipy.sparse.csr_matrix, for example.
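A toy version of that (i, j, v) "coordinate" form, written out in plain Python (the helper is mine for illustration; it is not how scipy stores things internally): only non-default entries are kept, and everything else is implicitly the default.

```python
def to_coo(dense, default=0):
    """Convert a dense list-of-lists matrix into (i, j, v) triplets,
    keeping only entries that differ from the default value."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != default]

dense = [[0, 0, 3],
         [0, 0, 0],
         [7, 0, 0]]
print(to_coo(dense))   # [(0, 2, 3), (2, 0, 7)]
```

Nine stored positions collapse to two triplets; the default value 0 never appears in memory, which is the point made above.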
Regardless, I think I wrote too much to say basically that I don't think
we're really disagreeing except possibly slightly on perspective.