Hi all,

Consider Python's built-in `int` type: it can be as large as memory allows. np.ndarray, on the other hand, is optimized for vectorization via strides, memory layout, and many other things that I probably don't know about. The point is that it is convenient and efficient for many things compared to a Python list of integers.

So I am wondering whether something in between exists? (And obviously something more clever than np.array(dtype=object).)

Probably something similar to `StringDType`, but for integers and floats. (That's just my guess. I don't know anything about `StringDType`, but I assume it must be better than np.array(dtype=object) combined with np.vectorize.)

Regards,
dgpb
It is possible to do this using the new DType system. Sebastian wrote a sketch for a DType backed by the GNU multiprecision float library (MPFR): https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype

Storing data outside the array buffer adds a significant amount of complexity, and it introduces the possibility of use-after-free and dangling-reference errors that are impossible when an array does not use embedded references. That's the main reason it hasn't been done much.
Thank you for this. I am just starting to think about these things, so I appreciate your patience.

But isn't it still true that all elements of an array are of the same size in memory? I am thinking along the lines of per-element dynamic memory management, such that if I had the array [0, 1e10000], the first element would default to a reasonably small size in memory.
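For a concrete sense of per-element sizing, Python's own int already behaves this way; the byte counts below are rough and depend on the CPython build:

    import sys

    sys.getsizeof(0)          # ~28 bytes on a 64-bit CPython
    sys.getsizeof(10**10000)  # a few kilobytes, growing with the magnitude of the value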
Yes, an array of references still has a fixed element width in the array buffer. You can think of each entry in the array as a pointer to some other memory on the heap, which can be a dynamic memory allocation.

There's no way in NumPy to support variable-sized elements in the array buffer itself, since fixed-size elements are the assumption that lets NumPy implement strided ufuncs and broadcasting.
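A quick illustration of that reference layout (assuming a 64-bit build, where a reference is 8 bytes):

    import numpy as np

    a = np.array([0, 10**10000], dtype=object)
    a.itemsize          # 8: each slot in the buffer is just a pointer-sized reference
    a[1].bit_length()   # the big integer itself lives elsewhere, on the heap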
By the way, I think I am referring to integer arrays (or the integer part of floats). I don't think what I am saying sensibly applies to floats as they are, although a new float type could base its integer part on such a concept.

—

Where I am coming from: I started to hit the maximum bounds of integer arrays, where most values are very small and some become very large, and I am hitting memory limits. I don't have many zeros, so sparse arrays aren't an option.

Approximately:
- 90% of my arrays could fit into `np.uint8`
- 1% requires `np.uint64`
- the remaining 9% are in between

And there is no predictable order of where is what, so splitting is not an option either.
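A rough back-of-envelope of what a variable-width layout could save, assuming those fractions apply per element and that the in-between values fit in 4 bytes (the 4-byte figure is an assumption, not from the numbers above):

    avg_bytes = 0.90 * 1 + 0.09 * 4 + 0.01 * 8   # ~1.3 bytes per element
    # versus a flat 8 bytes per element for np.uint64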
I am not sure what kind of scheme would support variable-sized native ints. Any scheme that puts pointers in the array is going to be worse: the pointers will be 64-bit. You could store offsets into the data, but then you would need to store both the offsets and the contiguous data, nearly doubling your storage. What shape are your arrays? That would determine the minimum size of the offsets.

Matti
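To make that trade-off concrete, a toy sketch of an offsets-plus-packed-data layout (the names and the 64-bit offset choice are illustrative assumptions):

    import numpy as np

    values = [3, 70000, 255, 2**40]

    data = bytearray()
    offsets = [0]
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)   # minimal little-endian width
        data += v.to_bytes(nbytes, "little")
        offsets.append(len(data))
    offsets = np.asarray(offsets, dtype=np.int64)    # 8 bytes of overhead per element

    def get(i):
        # Random access still works: slice the packed buffer between two offsets.
        return int.from_bytes(bytes(data[offsets[i]:offsets[i + 1]]), "little")

    assert get(3) == 2**40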
This might be a good application of Awkward Array (https://awkward-array.org), which provides a NumPy-like interface to arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged), a restriction of that to only variable-length lists, but satisfying the Array API standard.

The variable-length data in Awkward Array hasn't been used to represent arbitrary-precision integers, though. It might be a good application of "behaviors," which are documented here: https://awkward-array.org/doc/main/reference/ak.behavior.html In principle, it would be possible to define methods and overload NumPy ufuncs to interpret variable-length lists of int8 as arbitrary-precision integers. Numba might help accelerate that if normal NumPy-style vectorization is insufficient.

If you're interested in following this route, I can help with first implementations of that arbitrary-precision integer behavior. (It's an interesting application!)

Jim
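A sketch of what the underlying data could look like (layout only; the helper names are made up, and the ufunc-overloading "behavior" itself is left out):

    import awkward as ak

    def to_digits(n, base=2**8):
        # Little-endian byte-sized "digits" of a non-negative Python int.
        digits = []
        while True:
            n, r = divmod(n, base)
            digits.append(r)
            if n == 0:
                return digits

    big = ak.Array([to_digits(7), to_digits(10**30), to_digits(255)])
    # big is a ragged array of variable-length digit lists; a "behavior" could
    # teach ufuncs to treat each list as one arbitrary-precision integer.

    def from_digits(digits, base=2**8):
        return sum(int(d) * base**i for i, d in enumerate(digits))

    assert from_digits(big[1]) == 10**30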
After sending that email, I realize that I have to take it back: your motivation is to minimize memory use. The variable-length lists in Awkward Array (and therefore in ragged as well) are implemented using offset arrays, and those are at minimum 32-bit. The scheme is more cache-coherent (less "pointer chasing"), but it doesn't reduce the size.

The offsets are 32-bit so that individual values can be selected from the array in constant time. If you use a smaller integer size, like uint8, then they have to be the number of elements in each list rather than offsets (the cumulative sum of those counts), and to find a single value you have to add up counts from the beginning of the array.

A standard way to store variable-length integers is to put the indicator of whether you've seen the whole integer yet in a high bit (so each byte effectively contributes 7 bits). That's also inherently non-random access.

But if random access is not a requirement, how about Blosc and bcolz? That's a library that uses a very lightweight compression algorithm on the arrays and decompresses them on the fly (fast enough to be practical). That sounds like it would fit your use case better...

Jim
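For reference, a minimal sketch of that high-bit (LEB128-style) scheme for non-negative integers; decoding has to walk the bytes, which is exactly why it isn't random access:

    def encode_varint(n):
        # 7 payload bits per byte; the high bit means "more bytes follow".
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_varint(buf, pos=0):
        result, shift = 0, 0
        while True:
            b = buf[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result, pos
            shift += 7

    assert decode_varint(encode_varint(300))[0] == 300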
Thanks for this. Random access is unfortunately a requirement. By the way, what is the difference between awkward and ragged?
Awkward is more general: it has all the same data types as (and is zero-copy compatible with) Apache Arrow. ragged is only lists (of lists) of numbers, so it can be described by a shape and a dtype. ragged adheres to the Array API standard, like NumPy 2.0 (am I right in that)? So ragged is a useful subset.
Yup, good point. My array sizes in this case are around 3e8 elements, so 32-bit offsets would be needed; it is not a solution for this case.

Nevertheless, such a concept would still be worthwhile for cases where integers are, say, at most 256 bits (or unlimited), even if memory addresses or offsets are 64-bit. This would both:
a) save memory if many of the values in the array are much smaller than 256 bits
b) provide a standard for dynamically unlimited-size values

—

For now, a temporary solution for me would be a type that saturates at its minimum/maximum when it goes below/above its bounds. Integer types don't work here at all: np.uint8(255) + 2 = 1, which is totally unacceptable. Floats are a bit better: np.float16(65500) + 100 = np.float16(inf). At least it didn't wrap around, and it went in the right direction (just a bit too far).
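One possible workaround for the saturating behaviour is to compute in a wider dtype and clip back; a sketch (uint16 is just an example width):

    import numpy as np

    def saturating_add(a, values, dtype=np.uint16):
        hi = np.iinfo(dtype).max
        res = a.astype(np.int64) + values          # wide enough not to overflow
        return np.clip(res, 0, hi).astype(dtype)   # saturate instead of wrapping

    a = np.array([65000, 10], dtype=np.uint16)
    saturating_add(a, 1000)   # array([65535,  1010], dtype=uint16)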
In principle one could encode the individual offsets in a smarter way, using just the minimal number of bits required, but again that would make random access impossible or very expensive – probably more or less amounting to what smart compression algorithms are already doing.

Another approach might be to use the mask approach after all (or just flag all of your uint8 data valued 2**8 - 1 as overflows) and store the correct (uint64 or whatever) values and their indices in a second array. That may still not vectorise very efficiently with just NumPy if your typical operations are non-local.

Derek
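A toy sketch of that second-array idea (the sentinel value and the names are illustrative, not from the thread):

    import numpy as np

    SENTINEL = np.iinfo(np.uint8).max                     # 255 means "look in the overflow table"
    small = np.array([3, SENTINEL, 7, SENTINEL], dtype=np.uint8)
    overflow_idx = np.array([1, 3], dtype=np.int64)       # sorted indices of the large values
    overflow_val = np.array([10**6, 2**40], dtype=np.uint64)

    def get(i):
        j = np.searchsorted(overflow_idx, i)
        if j < len(overflow_idx) and overflow_idx[j] == i:
            return int(overflow_val[j])
        return int(small[i])

    assert get(0) == 3 and get(3) == 2**40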
My array is growing in a manner of:

    array[slice] += values

so for now I will just clip values:

    # Accumulate in int64 so the addition itself cannot overflow, then
    # saturate anything that exceeds the target range at MAX_UINT16.
    res = np.add(array[slice], values, dtype=np.int64)
    array[slice] = res
    mask = res > MAX_UINT16
    array[slice][mask] = MAX_UINT16

For this case, these large values do not have that much impact, and the extra operation overhead is acceptable.

---

And I am adding a more involved project to my TODOs for the future. After all, it would be good to have an array which (at preferably as minimal a cost as possible) could handle anything you throw at it, with near-optimal memory consumption and sensible precision handling, while keeping all the benefits of numpy. Time will tell if that is achievable. If anyone has any good ideas regarding this, I am all ears.

Much thanks to you all for the information and ideas.
dgpb
So that this doesn't get lost amid the discussion: https://www.blosc.org/python-blosc2/python-blosc2.html

Blosc is on-the-fly compression, which is a more extreme way of making variable-sized integers. The compression is in small chunks that fit into CPU cache lines, such that it's random access per chunk. The compression is lightweight enough that it can be faster to decompress, edit, and recompress a chunk than it is to copy it from RAM, edit it, and copy it back to RAM. (The extra cost of compression is paid for by moving less data between RAM and CPU; that's why I say "can be," because it depends on the entropy of the data.) Since you have to copy data from RAM to CPU and back anyway as part of any operation on an array, this can be a net win.

What you're trying to do with variable-length integers is a kind of compression algorithm, an extremely lightweight one. That's why I think Blosc would fit your use case: it's doing the same kind of thing, but with years of development behind it.

(Earlier, I recommended bcolz, which was a Python array based on Blosc, but I now see that it has been deprecated. However, the link above goes to the current version of the Python interface to Blosc, so I'd expect it to cover the same use cases.)

-- Jim
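A rough sketch of what that could look like with python-blosc2's compressed NDArray container, assuming the `blosc2.asarray` / NumPy-style indexing interface described in the docs linked above (exact names and defaults may differ between versions):

    import numpy as np
    import blosc2   # pip install blosc2

    # Mostly-small integers stored in a wide dtype compress very well.
    data = np.random.randint(0, 256, size=10_000_000, dtype=np.int64)

    carr = blosc2.asarray(data)        # chunked + compressed, random access per chunk
    print(carr.schunk.cratio)          # compression ratio actually achieved

    block = carr[5_000:5_010]          # decompresses only the chunks touched
    block += 1
    carr[5_000:5_010] = block          # write back; the chunk is recompressed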
Thanks for reiterating, this looks promising!
Does the new DType system in NumPy 2 make something like this more possible? I would suspect that the user would have to write a lot of code to get reasonable performance if it were.

Kevin

On Wed, Mar 13, 2024 at 3:55 PM Nathan <nathan.goldbaum@gmail.com> wrote:
Yes, an array of references still has a fixed width in the array buffer. You can think of each entry in the array as a pointer to some other memory on the heap, which can be a dynamic memory allocation.
There's no way in NumPy to support variable-sized array elements in the array buffer, since that fixed-size assumption is key to how numpy implements strided ufuncs and broadcasting.
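That fixed-width-reference layout is easy to see with the existing object dtype (the itemsize of 8 assumes a 64-bit build):

    import numpy as np

    # Each slot in the buffer is a pointer of fixed size; the big integer itself
    # lives outside the array buffer, on the heap.
    a = np.array([0, 10**10000], dtype=object)
    print(a.dtype.itemsize)      # 8: one pointer per element, regardless of value
    print(a[1].bit_length())     # ~33220 bits of integer stored elsewhere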
participants (6)

- Dom Grigonis
- Homeier, Derek
- Jim Pivarski
- Kevin Sheppard
- Matti Picus
- Nathan