An article on numpy data types

Hi everyone,

I'm almost done with the article about numpy types, something I haven't covered in Numpy Illustrated. Would someone please have a look to confirm I haven't written anything anti-climatic there?

https://axil.github.io/numpy-data-types.html

--
Best regards,
Lev

PS Earlier today I mistakenly sent an email with the wrong link.

Dear Lev,

thank you a lot! Something like this should be part of the Numpy documentation. I like the diagram, looks very nice! Also, I've opened an issue regarding data types: https://github.com/numpy/numpy/issues/20662

Some feedback from my side:

1. When calling numpy.array([1,2,3,4]) it gives me an int64 data type most of the time (two x86_64 systems, one arm64 system). The only time I've got int32 was on a Raspberry Pi, which is a software limitation, since the CPU is 64 bit and they have even replaced their so-far 32bit only Raspberry Pi Zero with a 64bit version (yes, one day Raspberry OS with 64 bit might actually become the default!). I don't know what machine you are working on, but int64 should be the default.
2. x64 refers to the obsolete Intel Itanium architecture (mentioned once). Should be x86_64.
3. np.errstate looks nice, I could use that for my pull request as well.

Many thanks & best regards,
Michael
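
The platform dependence raised in point 1 can be checked directly. The following is a minimal sketch, not taken from the article; the variable names are illustrative. It compares the default integer dtype that NumPy picks for a list of Python ints with an explicitly requested fixed-width dtype. The default depends on the platform and on the NumPy version (NumPy 2.0 and later use a 64-bit default on 64-bit Windows as well), so no particular output is claimed for the first line.

```python
import numpy as np

values = [1, 2, 3, 4]

# Default integer dtype: chosen by NumPy, depends on platform and NumPy version.
default_arr = np.array(values)

# Explicit fixed-width dtype: the same on every platform.
explicit_arr = np.array(values, dtype=np.int64)

print("default dtype: ", default_arr.dtype)
print("explicit dtype:", explicit_arr.dtype)
print("np.int_ maps to:", np.dtype(np.int_))
```

Requesting the width explicitly is the simplest way to get identical results on Linux, macOS and Windows.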

Dear Michael,

Thank you for your feedback! I've fixed the x86_64 typo. I'll think about how to reformulate the int32 part.

I work on debian x86_64 and windows 10 64bit. Constructing an array with np.array([1,2,3]) as well as np.array([1,2,3], dtype=np.int_) gives me int64 dtype on linux, and int32 on windows.

As suggested by Matti, I've put the rst source (and images) into a separate github repository: https://github.com/axil/numpy-data-types
PRs are welcome. My primary concern is to exclude serious typos/mistakes that might mislead/harm the readers if used.

My personal preference is towards explicit width types like np.int32, but from reading the docs I have a feeling there's a trend of migrating towards the c-style notation.

Best regards,
Lev

On Sun, Dec 26, 2021 at 7:05 PM Michael Siebert <michael.siebert2k@gmail.com> wrote:

Hey Lev,

I've forgotten to mention my MacBook M1, it's also int64 there.

Python on Windows is and is supposed to be, as far as I get it, a dying platform. A billion things are broken there (HDF comes to my mind) and it seems even Microsoft wants developers to move away from native Windows with their introduction of WSL (Windows Subsystem for Linux). Its latest version, WSL2, even comes with an actual Linux kernel and, since Windows 11, it has support for graphical applications (Xorg) out of the box. With Visual Studio Code (also Microsoft) and its remote capabilities, one does not even feel a difference between developing in an Ubuntu in a WSL in Windows and an actual Ubuntu.

As for the "traditional" C data types, fixed-width types, and prioritizing them in the Numpy documentation, that's what my issue (see below) is about. I think they have summarized it nicely in https://matt.sh/howto-c

Best regards,
Michael

Hi, Friedrich

> There seems to be missing an "a" before "more".

Thank you. Fixed. This is a draft. It will be (more or less) professionally proofread thereafter.

> on my machine it runs::

Which OS does your machine run on?

> FloatingPointError written instead of RuntimeWarning

This is most certainly a typo. Thanks.

> *only once*:

Good point! Added.

Thank you for your feedback! Looking forward to reading the next part of the review.

Best regards,
Lev

On Sun, Dec 26, 2021 at 8:45 PM Michael Siebert <michael.siebert2k@gmail.com> wrote:

On 26/12/21 3:44 pm, Michael Siebert wrote:
Your statement is the first time I have heard this. Of those who answered the 2020 Python Developers Survey [0], 68% use Linux, 48% Windows and 29% macOS (yes, the total is more than 100%: users could tick more than one box), which was up slightly from 2018 [1], where Windows was 47%. I couldn't find a line in the survey about WSL, but the people I know still want to work directly on Windows.

Matti

[0] https://www.jetbrains.com/lp/python-developers-survey-2020/, search for "Operating system"
[1] https://www.jetbrains.com/research/python-developers-survey-2018/ search for "Operating system"

Hi Matti, hi Lev,

that's cool there are numbers on Python usage! According to those, Windows might still be around quite a while. It would be interesting to include options like "Windows (native)" and "Windows (WSL)" in future surveys.

Windows is the main operating system I'm working with most of the time because I need Office almost daily, and it is not too much of an impact thanks to WSL. Many might be locked in to Windows by company policies (my company is fortunately not too strict). That might explain a good portion of the high Windows usage. So meanwhile we might need to engineer everything at least twice: Windows and Linux (and some modifications for MacOS once in a while). Or engineer something and wait until someone else complains.

I can check Lev's article on a Windows Python later. Let's see, maybe they have meanwhile fixed that HDF issue. int32 would probably be the default if it's a 32 bit Python installation. There are still so many 32 bit programs around on Windows, which is quite scary in 2021, when probably almost every smartphone is 64 bit.
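
Whether a given Python installation is a 32 bit or a 64 bit build can be checked from within the interpreter, independently of the operating system name. A minimal sketch using only standard-library calls plus NumPy (the variable names are illustrative):

```python
import struct
import sys

import numpy as np

# Size of a C pointer in the running interpreter, in bits (32 or 64).
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on 64-bit builds of Python.
is_64bit_python = sys.maxsize > 2**32

# Width of NumPy's default integer type on this installation.
default_int_bits = np.dtype(np.int_).itemsize * 8

print("interpreter pointer width:", pointer_bits)
print("64-bit Python build:      ", is_64bit_python)
print("np.int_ width:            ", default_int_bits)
```

Note that the last two values need not agree: as reported elsewhere in this thread, a 64 bit Python on Windows gave a 32 bit default integer.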

Hi Michael,

> Python on Windows is and is supposed to be, as far as I get it, a dying platform.

I would join Matti in thinking that it is a misconception. Have you heard of the enormous daily updated unofficial repository <https://www.lfd.uci.edu/~gohlke/pythonlibs/> of the binary windows compilations of almost 600 python libraries by Christoph Gohlke? (numpy and libs depending on it are built with MKL there) It is there for a reason.

If you look at the stats such as these (Matti already mentioned them while I was writing this text),
https://www.jetbrains.com/research/python-developers-survey-2018/
https://www.jetbrains.com/lp/python-developers-survey-2020/
you'll see (in addition to the fact that numpy is the #1 library in data science ;) ) that in recent years the percentage of windows users among the developers is quite high:

69% linux - 47% windows - 32% macos (2018)
68% linux - 48% windows - 29% macos (2020)

So it looks as if it is rather growing than dying. This is due to the popularity of the above mentioned data science and AI, which have skyrocketed in the last 10 years. And the vast majority of data scientists work on windows.

Windows as a platform for developers as a whole is also quite flourishing today. According to the stackoverflow 2021 developer survey <https://insights.stackoverflow.com/survey/2021#most-popular-technologies-op-...> 45% of the respondents use Windows (25% linux, 25% macos). Among the professional developers the numbers are 41% for windows, 30% macos, 26% linux.

Also the primary audience of the tutorials like mine (as well as of stackoverflow?) are windows users. Linux users can easily figure things described there on their own, through the docstrings, source code or, as a last resort, through the docs )

I wouldn't like to go into holy wars, though. I'm equally literate in both unix and windows (somewhat less in macos) and in my opinion the interests of all the users of the three operating systems should be taken into account in both the code of the library and the docs. The documentation is sometimes pretty ignorant of mac/windows users, btw:

As for the particular issue of the difference in the default integer types, in my opinion the default choice of int32 on windows for array [1,2,3] fits the description
better than int64 on linux/macos.

Best regards,
Lev

On Sun, Dec 26, 2021 at 8:45 PM Michael Siebert <michael.siebert2k@gmail.com> wrote:

Okay, a little modification to my last mail: many Android smartphones are still 32 bit, but according to https://www.androidauthority.com/arm-32-vs-64-bit-explained-1232065/ from 2023 on, all (or at least many) new ARM processors will be 64 bit only. Apple's iPhone has been 64 bit only for quite a while already (since September 2017, with the iOS 11 release).

Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] on win32
                                                                      ^^ this is relevant ^^^^ this is not
Type "help", "copyright", "credits" or "license" for more information.
On Sun, Dec 26, 2021 at 11:42 PM Michael Siebert < michael.siebert2k@gmail.com> wrote:

I've tried to take into account all the suggestions from this thread.

https://axil.github.io/numpy-data-types.html shows the new version now and https://github.com/axil/numpy-data-types/commit/14d9da053fd67e5569436faa1f58... displays most of the changes.

As for the inheritance diagram, I think it is perfectly fine to add it to the documentation as is, except that I'd put back the 'void' type I originally omitted to keep it simple.

Btw, is anyone aware why 'U' is missing from np.typecodes['Character']?

On Sun, Dec 26, 2021 at 11:57 PM Lev Maximov <lev.maximov@gmail.com> wrote:

Very nice overview! One question and one suggestion:

1. Is integer wraparound guaranteed for signed ints, or is it an implementation detail? For unsigned ints, sure, it's straight from the C standard; what about signed types, however?
2. It'd be nice to explicitly stress that dtype=float corresponds to a C double, not a C float type. This frequently trips up people trying to interface with C or Cython (in my experience).

Tue, Dec 28, 2021, 11:12 Lev Maximov <lev.maximov@gmail.com>:
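
To illustrate the second point of the message above (dtype=float is a C double, not a C float), here is a minimal sketch. It uses only standard NumPy names; np.double and np.single are NumPy's aliases for the C double and C float types.

```python
import numpy as np

a = np.zeros(3, dtype=float)       # Python float: 64-bit, i.e. a C double
b = np.zeros(3, dtype=np.float32)  # 32-bit single precision, i.e. a C float

print(a.dtype, a.itemsize)
print(b.dtype, b.itemsize)

# dtype=float compares equal to np.double, not to np.single.
same_as_double = np.dtype(float) == np.dtype(np.double)
same_as_single = np.dtype(float) == np.dtype(np.single)
print(same_as_double, same_as_single)
```

When passing a buffer to C or Cython code that expects `float *`, the array therefore has to be created with dtype=np.float32 explicitly.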

On Wed, Dec 29, 2021 at 9:59 AM Charles R Harris <charlesr.harris@gmail.com> wrote:
Chuck
Yes, according to the C standard, signed integer overflow is undefined behavior. So, does NumPy guarantee wraparound for signed int overflow (at least provided that the platform is two's complement)?

There is an open issue, "Document behavior of casts and overflows for signed integer types" #17982:
https://github.com/numpy/numpy/issues/17982
There is some discussion, but no definitive answer.

As a side note, Rust has both checked/unchecked wraparound arithmetic and saturating arithmetic as specialized methods:

    pub const fn saturating_add(self, rhs: u32) -> u32
    pub fn saturating_add_signed(self, rhs: i32) -> u32  (experimental)
    pub const fn saturating_mul(self, rhs: u32) -> u32

etc.

Best wishes,
Lev

On Thu, Dec 30, 2021 at 4:12 AM Lev Maximov <lev.maximov@gmail.com> wrote:
There have been discussions about overflow behavior. The main problem is performance when there is no hardware support. There used to be architectures that offered that (VAX), but it has fallen out of favor. NumPy doesn't have an official policy that I know of, but it is currently pretty much two's complement with overflow wrap. Which is not to say that things will never change, but it isn't a priority. Chuck
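
The "two's complement with overflow wrap" behaviour described above can be observed with a short sketch like the following. It shows current behaviour for array arithmetic only; as the message says, it is not an official guarantee, so the signed case should be read as an observation rather than a documented contract.

```python
import numpy as np

info = np.iinfo(np.int64)

# Signed case: array arithmetic performs no overflow check and wraps silently.
a = np.array([info.max], dtype=np.int64)
wrapped = a + 1

# Unsigned case: wraparound modulo 2**8.
u = np.array([0], dtype=np.uint8)
u_wrapped = u - 1

print(wrapped[0] == info.min)
print(u_wrapped[0])
```

Code that must not depend on this can widen the dtype first or compare against np.iinfo limits before the operation.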

On Fri, Dec 31, 2021 at 12:12 AM Charles R Harris <charlesr.harris@gmail.com> wrote:
Yes, that reflects my expectations. My primary concern right now is how to formulate this for the readers of the article so as not to mislead them. Can they rely on the wraparound for the signed ints in NumPy? Or is it rather 'use at your own risk'? Btw, is signed integer wrapping covered by regression tests?

Lev

On Wed, Dec 29, 2021 at 12:45 AM Eric Firing <efiring@hawaii.edu> wrote:
In pandas it is a different data type ('Int64' vs 'int64'), so it makes perfect sense for me to mention it in this article, given that I missed it in the main Numpy Illustrated one. I haven't had a chance to use masked arrays myself, and I've heard from some people that they don't use them either, for some reason. Do you know if Pandas uses ma internally or do they have their own implementation?

Thanks and regards,
Lev

On 2021/12/28 10:54 AM, Lev Maximov wrote:
Pandas has some support for importing numpy masked arrays, but internally it uses flag values rather than the parallel mask array as in a numpy MaskedArray. A DataFrame initialized from a MaskedArray is exported using the ".to_numpy" method as a plain ndarray, not a MaskedArray. Eric
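
The two representations contrasted above, a parallel mask array versus flag values stored in the data itself, can be sketched with NumPy alone. This is only an illustration of the two ideas: NaN is used here as one possible flag value, and nothing in the sketch is a claim about pandas internals.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])

# Parallel-mask approach (numpy.ma): the data stays intact, a separate
# boolean array records which entries are invalid.
masked = np.ma.masked_array(data, mask=[False, True, False])

# Flag-value approach: the invalid entry is overwritten with a sentinel (NaN).
flagged = data.copy()
flagged[1] = np.nan

mask_mean = masked.mean()        # ignores the masked entry
flag_mean = np.nanmean(flagged)  # ignores the NaN entry

print(masked.mask, masked.data)
print(mask_mean, flag_mean)
```

With the mask approach the original value 2.0 is still available in masked.data, whereas the flag approach loses it; on the other hand, a NaN flag needs no second array.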

Am Sa., 25. Dez. 2021 um 10:03 Uhr schrieb Lev Maximov <lev.maximov@gmail.com>:
> Speaking of zero-dimensional arrays more realistic example where you can run into them is when you iterate over a numpy array with nditer:

There seems to be missing an "a" before "more".

Overflow warning: Instead of

    np.array([2**63-1])[0] + 1
    FloatingPointError: overflow encountered in longlong_scalars

on my machine it runs::

    numpy.array([2 ** 63 - 1])[0] + 1
    <stdin>:1: RuntimeWarning: overflow encountered in long_scalars

There are also some more significant unclarities remaining:

1. The RuntimeWarning is issued *only once*:

2. And I do not get the difference here:

The only apparent difference I can get hold of is that:

but:

While writing this down I realise that *a* is a zero-dimensional array, while *b* is an int64 scalar. This can also be seen from the beginning:

So, unclarity resolved, but maybe I am not the only one stumbling over this. Maybe the idiom ``>>> c = numpy.int64(2 ** 63 - 1)`` can be used? I never used this, so I am unsure about the exact semantics of such a statement.

I am stopping studying your document here. Might be that I continue later.

Friedrich
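
The distinction that resolves the confusion above, a zero-dimensional array versus an int64 scalar, can be sketched as follows, including the ``numpy.int64(2 ** 63 - 1)`` idiom asked about. The names a, b and c follow the message; the explicit dtype arguments and the use of np.errstate are choices made for this sketch, not quotations from the article.

```python
import numpy as np

big = 2**63 - 1

a = np.array(big, dtype=np.int64)       # zero-dimensional array
b = np.array([big], dtype=np.int64)[0]  # int64 scalar obtained by indexing
c = np.int64(big)                       # int64 scalar constructed directly

print(type(a).__name__, a.ndim)
print(type(b).__name__, type(c).__name__)

# Scalar arithmetic checks for overflow; with over="raise" the check
# surfaces as FloatingPointError instead of a RuntimeWarning.
try:
    with np.errstate(over="raise"):
        c + 1
    scalar_overflow_raised = False
except FloatingPointError:
    scalar_overflow_raised = True

# The zero-dimensional array goes through the array code path,
# which does not check and wraps around silently.
wrapped = a + 1

print(scalar_overflow_raised, wrapped == np.iinfo(np.int64).min)
```

So ``numpy.int64(...)`` gives the same kind of object as indexing a one-element array, and both differ from a zero-dimensional array in how overflow is reported.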

participants (10)
- Brock Mendel
- Charles R Harris
- Eric Firing
- Evgeni Burovski
- Friedrich Romstedt
- Juan Nunez-Iglesias
- Lev Maximov
- Matti Picus
- Michael Siebert
- Warren Weckesser