Proposal to support __format__
Hi everyone!

I want to discuss adding support for __format__ in ndarray and I am willing to contribute code-wise once consensus has been reached. It was briefly discussed on GitHub two years ago (https://github.com/numpy/numpy/issues/5543) and I will re-iterate some of the points made there and build off of that. I have been thinking about this a lot in the last few weeks and my thoughts turned into a fairly fleshed out proposal. The discussion should probably start more high-level, so I apologize if the level of detail is inappropriate at this point in time.

I decided on a gist, since the email got too long and clear formatting helps:

https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069

OK, those are my thoughts for now. What do you think?

Cheers,
Gustav
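For readers unfamiliar with the protocol under discussion, here is a minimal sketch of how `__format__` is dispatched. `Vec` is a hypothetical stand-in for an array, not NumPy's ndarray, and the element-wise behaviour shown is only one possible design, assumed here for illustration.

```python
class Vec:
    """Hypothetical stand-in for an array; not NumPy's ndarray."""

    def __init__(self, values):
        self.values = list(values)

    def __str__(self):
        return "[" + " ".join(str(v) for v in self.values) + "]"

    def __format__(self, spec):
        # An empty spec falls back to str(), as it does for built-in types.
        if not spec:
            return str(self)
        # Otherwise apply the spec to every element.
        return "[" + " ".join(format(v, spec) for v in self.values) + "]"


v = Vec([1.0, 2.5, 3.14159])
print("{}".format(v))      # [1.0 2.5 3.14159]
print("{:.2f}".format(v))  # [1.00 2.50 3.14]
```

`"{:.2f}".format(v)` and `format(v, ".2f")` both end up calling `Vec.__format__(v, ".2f")`, which is the hook the proposal wants ndarray to implement.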
On Tue, Feb 14, 2017 at 3:34 PM, Gustav Larsson <larsson@cs.uchicago.edu> wrote:

> Hi everyone!
>
> I want to discuss adding support for __format__ in ndarray and I am willing to contribute code-wise once consensus has been reached. It was briefly discussed on GitHub two years ago (https://github.com/numpy/numpy/issues/5543) and I will re-iterate some of the points made there and build off of that. I have been thinking about this a lot in the last few weeks and my thoughts turned into a fairly fleshed out proposal. The discussion should probably start more high-level, so I apologize if the level of detail is inappropriate at this point in time.
>
> I decided on a gist, since the email got too long and clear formatting helps:
>
> https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069

This is a lovely and clearly written document. Thanks for taking the time to think through this! I encourage you to submit it as a pull request to the NumPy repository as a "NumPy Enhancement Proposal", either now or after we've discussed it: https://docs.scipy.org/doc/numpy-dev/neps/index.html

> OK, those are my thoughts for now. What do you think?

Two thoughts for now:

1. For object arrays, I would default to calling format on each element (your "map principle") rather than raising an error.
2. It's absolutely OK to leave functionality unimplemented and not immediately nail down every edge case. As a default, I would suggest raising errors whenever non-empty type specifications are provided rather than raising errors in every case.
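A minimal sketch of what the "map principle" could look like, using a plain Python list in place of an object array. `format_each` is a hypothetical helper, not a NumPy function; the point is only that the built-in `format` is applied per element, so an element that does not understand the spec raises on its own.

```python
def format_each(items, spec):
    """Apply one format spec to every element (the "map principle")."""
    return [format(item, spec) for item in items]


mixed = [12.3, True, "string"]

# Floats and bools both accept the 'f' format type.
print(format_each(mixed[:2], ".1f"))  # ['12.3', '1.0']

try:
    format_each(mixed, ".1f")
except ValueError as exc:
    # str does not support the 'f' format type, so the error comes
    # from the element itself, not from any up-front validation.
    print("ValueError:", exc)
```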
On Tue, Feb 14, 2017 at 3:59 PM, Stephan Hoyer <shoyer@gmail.com> wrote:

> I encourage you to submit it as a pull request to the NumPy repository as a "NumPy Enhancement Proposal", either now or after we've discussed it: https://docs.scipy.org/doc/numpy-dev/neps/index.html

OK, I will let it go through one iteration of comments and then I'll submit one. Thanks!

> 1. For object arrays, I would default to calling format on each element (your "map principle") rather than raising an error.

I'm glad you brought this up as a possibility. It might be possible, but there are some issues that would need to be resolved. First of all, {} and {:} always work and give the same result they currently do. So, this only affects the situation where the format spec is non-empty. I think there are two main issues:

Heterogeneity: Let's say we have x = np.array([12.3, True, 'string', Foo(10)], dtype=np.object). Then, presumably {:.1f} should cause a ValueError since the string does not support format type 'f'. This could create a lot of ValueError land mines for the user. For x[:2] however it should work and produce something like [12.3 1.0].

Note, the "map principle" still can't be strictly true. Let's say we have an array with type object and mostly string-like elements. Then {:5s} will still not produce exactly {:5s} element-wise, because the string representations need to be repr-based inside the array (otherwise it could break for newlines and things like that and produce spaces that make the boundary between elements ambiguous). This brings me to the next issue.

Str vs. repr: If we have a homogeneous object-array with types Foo and Foo implements __format__, it would be great if this worked. However, one issue is that Foo.__format__ might return things like newline (or spaces), which would break (or confuse) the printed output (unless it is made incredibly smart to support "vertical alignment"). This issue is essentially the same as for strings in general, which is why they use repr instead. I can think of two solutions:

1) Try to sanitize (or repr-ify) the string returned by __format__ somehow;
2) Put the responsibility on the user and simply let the rendering break if Foo.__format__ does not play well.

> 2. It's absolutely OK to leave functionality unimplemented and not immediately nail down every edge case. As a default, I would suggest raising errors whenever non-empty type specifications are provided rather than raising errors in every case.

I agree.

Gustav
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Feb 14, 2017 at 5:35 PM, Gustav Larsson <larsson@cs.uchicago.edu> wrote:

>> 1. For object arrays, I would default to calling format on each element (your "map principle") rather than raising an error.
>
> I'm glad you brought this up as a possibility. It might be possible, but there are some issues that would need to be resolved. First of all, {} and {:} always work and give the same result they currently do. So, this only affects the situation where the format spec is non-empty. I think there are two main issues:
>
> Heterogeneity: Let's say we have x = np.array([12.3, True, 'string', Foo(10)], dtype=np.object). Then, presumably {:.1f} should cause a ValueError since the string does not support format type 'f'. This could create a lot of ValueError land mines for the user.

Things will absolutely break if you try to do complex operations on inhomogeneously typed arrays. I would put the onus on the user in such a case.

> For x[:2] however it should work and produce something like [12.3 1.0]. Note, the "map principle" still can't be strictly true. Let's say we have an array with type object and mostly string-like elements. Then {:5s} will still not produce exactly {:5s} element-wise, because the string representations need to be repr-based inside the array (otherwise it could break for newlines and things like that and produce spaces that make the boundary between elements ambiguous). This brings me to the next issue.

Indeed, this will be a departure from the behavior without a format string, which just uses repr. In my mind, this is the strongest argument against using the map principle here, because there is a discontinuous shift between providing and not providing a format string.

> Str vs. repr: If we have a homogeneous object-array with types Foo and Foo implements __format__, it would be great if this worked. However, one issue is that Foo.__format__ might return things like newline (or spaces), which would break (or confuse) the printed output (unless it is made incredibly smart to support "vertical alignment"). This issue is essentially the same as for strings in general, which is why they use repr instead. I can think of two solutions: 1) Try to sanitize (or repr-ify) the string returned by __format__ somehow; 2) Put the responsibility on the user and simply let the rendering break if Foo.__format__ does not play well.

I wouldn't do anything fancy here to worry about line breaks. It's basically impossible to get this right for edge cases, so I would certainly put the responsibility on the user.

On another note, about Python 2 vs 3: I would definitely take the approach of copying the Python 3 behavior on all versions of NumPy (when feasible) and not being too concerned about compatibility with format on Python 2. The future is Python 3.
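To make the str-vs-repr trade-off concrete, here is a small sketch. `Foo` is a hypothetical user type whose `__format__` deliberately emits a line break, and a plain list stands in for the object array. The two renderings correspond to the two options discussed above: leaving the formatted strings untouched, or repr-ifying them.

```python
class Foo:
    """Hypothetical user type whose __format__ emits a line break."""

    def __init__(self, value):
        self.value = value

    def __format__(self, spec):
        return "Foo\n{}".format(format(self.value, spec))


items = [Foo(1), Foo(2)]

# Responsibility on the user: use the formatted strings as they are.
raw = "[" + " ".join(format(x, "d") for x in items) + "]"

# Sanitizing alternative: repr-ify each formatted string.
safe = "[" + " ".join(repr(format(x, "d")) for x in items) + "]"

print(raw)   # spans several lines; element boundaries are hard to see
print(safe)  # ['Foo\n1' 'Foo\n2']
```

The repr-ified form stays on one line but no longer shows exactly what `Foo.__format__` returned, which is the discontinuity the thread is weighing.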
Hi Gustav,

This is great! A few quick comments (mostly echoing Stephan's).

1. You basically have a NEP already! Making a PR from it allows line-by-line comments, so it would help!
2. Don't worry about supporting Python 2 specifics; just try to ensure it doesn't break; I would not say more about it!
3. On `set_printoptions` -- ideally, it will become possible to use this as a context (i.e., `with set_printoptions(...)`). It might make sense to have an `override_format` keyword argument to it.
4. Otherwise, my main suggestion is to start small with the more obvious ones, and not worry too much about format validation, but rather about getting the simple ones to work well (e.g., for an object array, just apply the format given; if it doesn't work, it will error out on its own, which is OK).
5. One bit of detail: the "g" one does confuse me.

All the best,
Marten
On Wed, Feb 15, 2017 at 8:03 AM, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:

> This is great!

Thanks! Glad to be met by enthusiasm about this.

> 1. You basically have a NEP already! Making a PR from it allows line-by-line comments, so it would help!

I will do this soon.

> 2. Don't worry about supporting Python 2 specifics; just try to ensure it doesn't break; I would not say more about it!

Sounds good to me.

> 3. On `set_printoptions` -- ideally, it will become possible to use this as a context (i.e., `with set_printoptions(...)`). It might make sense to have an `override_format` keyword argument to it.

Having a `with np.printoptions(...)` context manager is a great idea. It does sound orthogonal to __format__ though, so it could be addressed separately.

> 4. Otherwise, my main suggestion is to start small with the more obvious ones, and not worry too much about format validation, but rather about getting the simple ones to work well (e.g., for an object array, just apply the format given; if it doesn't work, it will error out on its own, which is OK).

Sounds good to me. I was thinking of approaching the implementation by writing unit tests first and grouping them into different priority tiers. That way, the unit tests can go through another review before implementation gets going. I agree that __format__ doesn't have to validate the format spec if a ValueError is going to be raised anyway by sub-calls.

> 5. One bit of detail: the "g" one does confuse me.

I will re-write this a bit to make it clearer. Basically, the 'g' with the mix of 'e'/'f' depending on max/min>1000 is all from the current numpy behavior, so it is not something I had much creative input on at all. Although, as it is written right now it may seem so. That is, the goal is to have {:} == {:g} for float arrays, analogous to how {:} == {:g} for built-in floats. Then, if the user departs a bit, like {:.2g}, it will simply be identical to calling np.set_printoptions(precision=2) first.

Gustav
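The 'e'/'f' switch described for 'g' on float arrays can be sketched as follows. This is a simplified illustration, not NumPy's actual printing code: only the max/min > 1000 ratio comes from the thread, the helper names `choose_float_style` and `format_floats` are hypothetical, and the real logic may apply further thresholds.

```python
def choose_float_style(values, ratio_limit=1000.0):
    """Pick 'e' or 'f' for a whole sequence of floats at once."""
    magnitudes = [abs(v) for v in values if v != 0]
    if not magnitudes:
        return "f"
    # A wide dynamic range would make fixed-point output unreadable.
    if max(magnitudes) / min(magnitudes) > ratio_limit:
        return "e"
    return "f"


def format_floats(values, precision=2):
    """Format every element with one shared style and precision."""
    style = choose_float_style(values)
    spec = ".{}{}".format(precision, style)
    return "[" + " ".join(format(v, spec) for v in values) + "]"


print(format_floats([1.0, 2.5, 10.0]))     # [1.00 2.50 10.00]
print(format_floats([1.0, 2.5, 50000.0]))  # [1.00e+00 2.50e+00 5.00e+04]
```

The key difference from 'g' on a built-in float is that the choice is made once for the whole array, so all elements share the same notation.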
On the last item, do we really have to follow those strange `d`, `g` and so on conventions for formatting? With all respect to the humongous historical baggage, I think that notation is pretty archaic and terminal-like. If being pythonic is a concern here, maybe it is better to use a more verbose syntax. Just throwing out an idea after 15 seconds of thought (so by no means an alternative suggestion):

eng:6i5d -> engineering notation (always powers of ten whose exponents are multiples of 3), 6 integral digits and 5 decimal digits
float (whatever the default is)
float:4i2d (you get the idea)
etc.

FULL DISCLOSURE: I am a very displeased customer of `fprintf` of matlab (and others) and this archaic formatting. I never got the hang of it, so it might be the case that I don't quite get the rationale behind it, and I almost always get it wrong. Maybe at least the rationale can be clarified.

Lastly, repeating what others mentioned: thank you for this well-prepared initiative.

On Wed, Feb 15, 2017 at 10:48 PM, Gustav Larsson <larsson@cs.uchicago.edu> wrote:
On Wed, Feb 15, 2017 at 4:05 PM, Ilhan Polat <ilhanpolat@gmail.com> wrote:

> On the last item, do we really have to follow those strange `d`, `g` and so on conventions for formatting? With all respect to the humongous historical baggage, I think that notation is pretty archaic and terminal-like. If being pythonic is a concern here, maybe it is better to use a more verbose syntax. Just throwing out an idea after 15 seconds of thought (so by no means an alternative suggestion):
>
> eng:6i5d -> engineering notation (always powers of ten whose exponents are multiples of 3), 6 integral digits and 5 decimal digits
> float (whatever the default is)
> float:4i2d (you get the idea)
> etc.

While I agree with you that printf format codes are arcane, unfortunately they need to be used here since they are supported by Python: https://docs.python.org/3.1/library/string.html#formatspec
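For reference, this is how the format codes in question behave on built-in Python numbers today. The snippet uses only the built-in `format` function; the variable names are just for illustration.

```python
x = 1234.5678

specs = (".2f", ".2e", ".2g", "g")
results = {spec: format(x, spec) for spec in specs}

for spec, text in results.items():
    print(repr(spec), "->", text)
# '.2f' -> 1234.57    fixed-point, 2 digits after the decimal point
# '.2e' -> 1.23e+03   exponent notation, 2 digits after the decimal point
# '.2g' -> 1.2e+03    general format, 2 significant digits
# 'g' -> 1234.57      general format, default of 6 significant digits

padded = format(42, "5d")  # integer, right-aligned in a width of 5
print(repr(padded))        # '   42'
```

Any `ndarray.__format__` would receive these same spec strings, which is why the proposal builds on them rather than defining a new syntax.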
participants (5)
- Gustav Larsson
- Ilhan Polat
- Marten van Kerkwijk
- Nathan Goldbaum
- Stephan Hoyer