Add Scalar to collections.abc with str, bytes, etc. being both Scalar and Sequence (therefore also Container)

This is a far more modest alternative to https://mail.python.org/archives/list/python-ideas@python.org/thread/ZP2OKYE... and is also something that could be reasonably & easily implemented in 3.x. Add `Scalar` to `collections.abc`. The meaning of `Scalar` is that the object is generally to be treated as an atomic value, regardless of whether it is also technically a container. The obvious application of an object being both a `Container` and a `Scalar` is to stringlike objects, but there might also be other kinds of standard object for which it would be useful and that I have not thought of. One reason for wanting to designate strings as `Scalar` is for navigating a hierarchy of collections in which, even though a string is a collection, we want to treat it as a leaf of the tree rather than drilling down into its characters (each of which is represented as a string of 1 character — `type('xyz'[1])` -> `str`) Example implementation (for addition to _collections_abc.py): class Scalar(metaclass=ABCMeta): __slots__ = () @classmethod def __subclasshook__(cls, C): if cls is Scalar: if not issubclass(C, Collection): return True return NotImplemented Scalar.register(str) Scalar.register(bytes) Scalar.register(bytearray)

It is usually quite meaningful to drill down into other kinds of immutable collections. A data structure consisting of a hierarchy of tuples would be a good example of that. Besides the drill-down, cases, there are times when it is valuable to test whether a collection or a "scalar" was received as an argument. That is to be avoided more often than not, but there are times (such as when implementing a DSL where it is useful enough to want to do. def __setattr__(self, value_or_vl): # When value_or_vl is a non-string sequence, then it is VERY likely # to be a tuple (which is immutable). if isinstance(Sequence, value_or_vl) and not isinstance(Scalar, value_or_vl): value, label = value_or_vl else: value = value_or_vl label = None self.values.add(value) self.value_labels['value'] = label or self.value_labels.get(value) If the code above were to specifically check for `str`, `bytes`, or `bytearray`, that would be more fragile. it would not accommodate any new stringlike types in newer python versions, and there would be no way to designate other custom classes as being logically scalar even though supporting sequence behavior.

On Mon, Oct 14, 2019 at 8:48 AM Steve Jorgensen <stevej@stevej.name> wrote:
I'm not really sure that there's a general problem to be solved here. You want to drill down into all containers except strings, so why not just say that? Scalar = (str, bytes) If a future Python version introduces a new string-like type, you would have to make a decision as to whether it should be treated as a string or as a collection of strings. (Assuming it isn't actually a subclass of str or bytes, in which case it's been made clear.) Whether something should be treated as atomic or iterable depends on usage, not just the inherent attributes of the type; for instance, a namedtuple is iterable (as a subclass of tuple), but many namedtuple instances should be treated as atomic, and not drilled down into. (Classic example: a point, defined as (x,y), should almost certainly be seen as atomic. Even more so if you define a point as (R, theta).) Alternatively, do what str.startswith/str.endswith do and mandate that a pair be one of a specific set of types. That's usually the easiest plan when working with both sequences and mappings; you can define sequence behaviour as applying ONLY to tuples and lists (and maybe sets), mapping behaviour applying ONLY to dicts, and every other object will be treated as a scalar. It's a small restriction on your caller, and a massive simplicity. But if you truly want to support arbitrary sequences (or perhaps arbitrary containers, or arbitrary iterables), then just special-case whichever string types make sense to you, and run with it. ChrisA

Chris Angelico wrote:
A scalar is not just str or bytes though. It is also anything else that's not a collection, so numbers, dates, booleans, and instances of any other classes (standard or custom) that are not collections. A string is just one of the few examples of something that is a collection but should be treated as a scalar by anything that asks whether it is one. f we were still talking about Python 2, then `unicode` would be another such type, but in Python 3 we do have `bytes`, `bytearray`, and `memoryview`. In addition, we might encounter subclasses of `collections.UserString` (which do not identify as subclasses of `str`). There is no guarantee that stringlike classes are the only cases anyone would ever want to be treated this way. In fact, you pointed out `namedtuple` below, so that's another case. Also, there can be custom classes that one might want to register as `Scalar`. For instance, Someone has a date-like class that can be treated as sequences of year, mont, day values, those members are still primarily attributes and secondarily sequence items. This would be something one might want to register as `Scalar` (even if not implemented as a `namedtuple` or a subclass thereof.
Well, if such a new type is truly stringlike, and `Scalar` has been added to the standard `collections.abc`, then clearly it should be registered as `Scalar` in `collections.abc` just as `str` would be. As I mentioned above, I think that `namedtuple` also makes sense to be scalar since its members are ore often primarily attributes than primarily values in a sequence.
Parsing out 2 different points from that… Assuming `Scalar` can't clean up every case we can imagine for this kind of ambiguity can come up with doesn't mean it's not potentially very helpful. After the discussion so far, I happen to still feel like this is a compelling idea that I'd like to see implemented. The problem with special-casing is when interfacing between different bodies of code. That is to say, using libraries or writing libraries for others to use. Having a central `Scalar` class that different bodies of code can register their classes to be provides a clean way for bodies of code to share this information without having to actually know ahead of time which other bodies of code in an integrated application might need to use that information.

On 14/10/2019 12:13, Steve Jorgensen wrote:
In practice I find that's almost never what I want. In the incredibly rare event that I'm throwing a string at an overly generic interface, I almost always want it to be treated as a sequence of characters. If I'm the one writing the overly generic interface, I usually apply the Well Don't Do That Then principle :-) -- Rhodri James *-* Kynesim Ltd

On Mon, Oct 14, 2019 at 10:14 PM Steve Jorgensen <stevej@stevej.name> wrote:
Assuming `Scalar` can't clean up every case we can imagine for this kind of ambiguity can come up with doesn't mean it's not potentially very helpful. After the discussion so far, I happen to still feel like this is a compelling idea that I'd like to see implemented.
The problem with special-casing is when interfacing between different bodies of code. That is to say, using libraries or writing libraries for others to use. Having a central `Scalar` class that different bodies of code can register their classes to be provides a clean way for bodies of code to share this information without having to actually know ahead of time which other bodies of code in an integrated application might need to use that information.
You're assuming that everyone's definition of Scalar will be the same, though. I haven't seen that proven. In many cases, the correct check is "is iterable but is not str". In others, it's "is iterable but is not (str,bytes)". Some will see an os.stat() result as atomic, some won't. Some will see a tuple as atomic, some won't. Which ones should be registered as Scalar? What exactly is the type's definition? ChrisA

On Oct 13, 2019, at 14:42, Steve Jorgensen <stevej@stevej.name> wrote:
It is usually quite meaningful to drill down into other kinds of immutable collections. A data structure consisting of a hierarchy of tuples would be a good example of that.
Sure, sometimes it makes sense to represent a tree as a tuple of recursively tuples instead of creating a Node class or similar. But in that case, wouldn’t you want to drill down only into tuples, not all sequences? And in the cases where you _did_ want to drill down into all sequences except strings, wouldn’t you also want to drill down into iterables that weren’t sequences? If someone passes you an array or an instance of an old-style-sequence-api type or even an iterator, why not drill down into that? It seems like this would just be a way to make it easier to write fragile and unintuitive interfaces to data structures that you could just as easily write better interfaces for without needing any new features.
I’m not sure what this code is intended to do. First, __setattr__ is always called with a name as well as a value: `spam.eggs = 2` calls `type(span).__setattr__(spam, 'eggs', 2)`. What syntax are you envisioning that would call it with just spam and 2, and what would it be used to? Then, at the end, you add the value to a set of values, and then you replace the label for the name 'value' in a dict with either the passed-in label or the existing label assigned to the value, which will presumably never exist unless the value happens to be the string 'value'. What is this supposed to be doing? Also, why does it matter that tuples are immutable here? All you’re doing with any Sequence is unpacking it into two values—which works just as well with a 2-element list as a 2-element tuple, and fails just as badly for a 3-element immutable sequence as a mutable one. And is there a reason you want to treat non-sequence iterables of 2 elements as scalars here? And even ignoring all of that, whatever this is supposed to do, it seems like the only reason you’re checking for non-string Sequence is that it’s “VERY likely to be a tuple”, and therefore a good approximation of checking for a tuple? Why not just check for a tuple instead, which is exactly rather than approximately what you want, and a lot simpler? I can see the vague sense for why Scalar might be useful, but I can’t think of any practical examples. So I’m at least +0 if you can think of any examples that would work, and should be implemented this way, but I don’t think either of these qualifies. Maybe it would help to come up with something concrete rather than an abstract toy example?

I completely agree that there is a problem to be solved here. Unfortunately I don't think the Scalar ABC is the solution. I'm not sure that there is a generic solution, since different applications will have different ideas of what needs to be drilled down into. E.g. a tuple representing coordinates (1, 2) might be treated as an atomic value, rather than a collection. Perhaps a useful partial solution would be to define a "String-like" ABC, consisting of str, bytes, UserString, possibly even bytearray and memoryview. But given that (1) UserString is rarely used, and (2) its pretty easy to simply say `isinstance(obj, (str, bytes))`, I'm not sure that it is worth the effort of predefining it. Speaking of UserString, does anyone know why it isn't registered as a virtual subclass of str? py> from collections import UserString py> issubclass(UserString, str) False -- Steven

On Oct 14, 2019, at 07:15, Steven D'Aprano <steve@pearwood.info> wrote:
Speaking of UserString, does anyone know why it isn't registered as a virtual subclass of str?
None of the concrete classes register virtual subclasses, so this would be a unique exception. Also, since 2.3, if you want a subclass of str you can just subclass str; if you use UserString instead, you have some reason for doing so, and that reason might include not being a subclass of str. If this were a common problem, you’d probably want to add a collections.abc.String, which would register both str and UserString, and allow people to register independent string types, and then people would check with `(String, ByteString)` instead of `(str, ByteString)`. But I think the need for that just doesn’t come up often enough for anyone to ask for it. Personally, I use third-party string types–mostly bridge types like objc.NSString or js.String—a lot more often than I build strings out of UserString.

Andrew Barnert wrote:
See the final lines in https://github.com/python/cpython/blob/3.7/Lib/_collections_abc.py MutableSequence.register(list) MutableSequence.register(bytearray) # Multiply inheriting, see ByteString [snip]

On Oct 14, 2019, at 13:28, Steve Jorgensen <stevej@stevej.name> wrote:
But that’s the exact opposite: it registers list and bytearray as virtual subclasses of the ABC MutableSequence, it doesn’t register MutableSequence as a virtual subclass of the concrete classes list and bytearray. Doing the latter would not only (like Steven’s proposal) be unprecedented and require changes to builtin types, but also (completely unlike Steven’s proposal) be just wrong—it’s certainly not true in any useful sense that every mutable sequence is usable as a bytearray.

It is usually quite meaningful to drill down into other kinds of immutable collections. A data structure consisting of a hierarchy of tuples would be a good example of that. Besides the drill-down, cases, there are times when it is valuable to test whether a collection or a "scalar" was received as an argument. That is to be avoided more often than not, but there are times (such as when implementing a DSL where it is useful enough to want to do. def __setattr__(self, value_or_vl): # When value_or_vl is a non-string sequence, then it is VERY likely # to be a tuple (which is immutable). if isinstance(Sequence, value_or_vl) and not isinstance(Scalar, value_or_vl): value, label = value_or_vl else: value = value_or_vl label = None self.values.add(value) self.value_labels['value'] = label or self.value_labels.get(value) If the code above were to specifically check for `str`, `bytes`, or `bytearray`, that would be more fragile. it would not accommodate any new stringlike types in newer python versions, and there would be no way to designate other custom classes as being logically scalar even though supporting sequence behavior.

On Mon, Oct 14, 2019 at 8:48 AM Steve Jorgensen <stevej@stevej.name> wrote:
I'm not really sure that there's a general problem to be solved here. You want to drill down into all containers except strings, so why not just say that? Scalar = (str, bytes) If a future Python version introduces a new string-like type, you would have to make a decision as to whether it should be treated as a string or as a collection of strings. (Assuming it isn't actually a subclass of str or bytes, in which case it's been made clear.) Whether something should be treated as atomic or iterable depends on usage, not just the inherent attributes of the type; for instance, a namedtuple is iterable (as a subclass of tuple), but many namedtuple instances should be treated as atomic, and not drilled down into. (Classic example: a point, defined as (x,y), should almost certainly be seen as atomic. Even more so if you define a point as (R, theta).) Alternatively, do what str.startswith/str.endswith do and mandate that a pair be one of a specific set of types. That's usually the easiest plan when working with both sequences and mappings; you can define sequence behaviour as applying ONLY to tuples and lists (and maybe sets), mapping behaviour applying ONLY to dicts, and every other object will be treated as a scalar. It's a small restriction on your caller, and a massive simplicity. But if you truly want to support arbitrary sequences (or perhaps arbitrary containers, or arbitrary iterables), then just special-case whichever string types make sense to you, and run with it. ChrisA

Chris Angelico wrote:
A scalar is not just str or bytes though. It is also anything else that's not a collection, so numbers, dates, booleans, and instances of any other classes (standard or custom) that are not collections. A string is just one of the few examples of something that is a collection but should be treated as a scalar by anything that asks whether it is one. f we were still talking about Python 2, then `unicode` would be another such type, but in Python 3 we do have `bytes`, `bytearray`, and `memoryview`. In addition, we might encounter subclasses of `collections.UserString` (which do not identify as subclasses of `str`). There is no guarantee that stringlike classes are the only cases anyone would ever want to be treated this way. In fact, you pointed out `namedtuple` below, so that's another case. Also, there can be custom classes that one might want to register as `Scalar`. For instance, Someone has a date-like class that can be treated as sequences of year, mont, day values, those members are still primarily attributes and secondarily sequence items. This would be something one might want to register as `Scalar` (even if not implemented as a `namedtuple` or a subclass thereof.
Well, if such a new type is truly stringlike, and `Scalar` has been added to the standard `collections.abc`, then clearly it should be registered as `Scalar` in `collections.abc` just as `str` would be. As I mentioned above, I think that `namedtuple` also makes sense to be scalar since its members are ore often primarily attributes than primarily values in a sequence.
Parsing out 2 different points from that… Assuming `Scalar` can't clean up every case we can imagine for this kind of ambiguity can come up with doesn't mean it's not potentially very helpful. After the discussion so far, I happen to still feel like this is a compelling idea that I'd like to see implemented. The problem with special-casing is when interfacing between different bodies of code. That is to say, using libraries or writing libraries for others to use. Having a central `Scalar` class that different bodies of code can register their classes to be provides a clean way for bodies of code to share this information without having to actually know ahead of time which other bodies of code in an integrated application might need to use that information.

On 14/10/2019 12:13, Steve Jorgensen wrote:
In practice I find that's almost never what I want. In the incredibly rare event that I'm throwing a string at an overly generic interface, I almost always want it to be treated as a sequence of characters. If I'm the one writing the overly generic interface, I usually apply the Well Don't Do That Then principle :-) -- Rhodri James *-* Kynesim Ltd

On Mon, Oct 14, 2019 at 10:14 PM Steve Jorgensen <stevej@stevej.name> wrote:
Assuming `Scalar` can't clean up every case we can imagine for this kind of ambiguity can come up with doesn't mean it's not potentially very helpful. After the discussion so far, I happen to still feel like this is a compelling idea that I'd like to see implemented.
The problem with special-casing is when interfacing between different bodies of code. That is to say, using libraries or writing libraries for others to use. Having a central `Scalar` class that different bodies of code can register their classes to be provides a clean way for bodies of code to share this information without having to actually know ahead of time which other bodies of code in an integrated application might need to use that information.
You're assuming that everyone's definition of Scalar will be the same, though. I haven't seen that proven. In many cases, the correct check is "is iterable but is not str". In others, it's "is iterable but is not (str,bytes)". Some will see an os.stat() result as atomic, some won't. Some will see a tuple as atomic, some won't. Which ones should be registered as Scalar? What exactly is the type's definition? ChrisA

On Oct 13, 2019, at 14:42, Steve Jorgensen <stevej@stevej.name> wrote:
It is usually quite meaningful to drill down into other kinds of immutable collections. A data structure consisting of a hierarchy of tuples would be a good example of that.
Sure, sometimes it makes sense to represent a tree as a tuple of recursively tuples instead of creating a Node class or similar. But in that case, wouldn’t you want to drill down only into tuples, not all sequences? And in the cases where you _did_ want to drill down into all sequences except strings, wouldn’t you also want to drill down into iterables that weren’t sequences? If someone passes you an array or an instance of an old-style-sequence-api type or even an iterator, why not drill down into that? It seems like this would just be a way to make it easier to write fragile and unintuitive interfaces to data structures that you could just as easily write better interfaces for without needing any new features.
I’m not sure what this code is intended to do. First, __setattr__ is always called with a name as well as a value: `spam.eggs = 2` calls `type(span).__setattr__(spam, 'eggs', 2)`. What syntax are you envisioning that would call it with just spam and 2, and what would it be used to? Then, at the end, you add the value to a set of values, and then you replace the label for the name 'value' in a dict with either the passed-in label or the existing label assigned to the value, which will presumably never exist unless the value happens to be the string 'value'. What is this supposed to be doing? Also, why does it matter that tuples are immutable here? All you’re doing with any Sequence is unpacking it into two values—which works just as well with a 2-element list as a 2-element tuple, and fails just as badly for a 3-element immutable sequence as a mutable one. And is there a reason you want to treat non-sequence iterables of 2 elements as scalars here? And even ignoring all of that, whatever this is supposed to do, it seems like the only reason you’re checking for non-string Sequence is that it’s “VERY likely to be a tuple”, and therefore a good approximation of checking for a tuple? Why not just check for a tuple instead, which is exactly rather than approximately what you want, and a lot simpler? I can see the vague sense for why Scalar might be useful, but I can’t think of any practical examples. So I’m at least +0 if you can think of any examples that would work, and should be implemented this way, but I don’t think either of these qualifies. Maybe it would help to come up with something concrete rather than an abstract toy example?

I completely agree that there is a problem to be solved here. Unfortunately I don't think the Scalar ABC is the solution. I'm not sure that there is a generic solution, since different applications will have different ideas of what needs to be drilled down into. E.g. a tuple representing coordinates (1, 2) might be treated as an atomic value, rather than a collection. Perhaps a useful partial solution would be to define a "String-like" ABC, consisting of str, bytes, UserString, possibly even bytearray and memoryview. But given that (1) UserString is rarely used, and (2) its pretty easy to simply say `isinstance(obj, (str, bytes))`, I'm not sure that it is worth the effort of predefining it. Speaking of UserString, does anyone know why it isn't registered as a virtual subclass of str? py> from collections import UserString py> issubclass(UserString, str) False -- Steven

On Oct 14, 2019, at 07:15, Steven D'Aprano <steve@pearwood.info> wrote:
Speaking of UserString, does anyone know why it isn't registered as a virtual subclass of str?
None of the concrete classes register virtual subclasses, so this would be a unique exception. Also, since 2.3, if you want a subclass of str you can just subclass str; if you use UserString instead, you have some reason for doing so, and that reason might include not being a subclass of str. If this were a common problem, you’d probably want to add a collections.abc.String, which would register both str and UserString, and allow people to register independent string types, and then people would check with `(String, ByteString)` instead of `(str, ByteString)`. But I think the need for that just doesn’t come up often enough for anyone to ask for it. Personally, I use third-party string types–mostly bridge types like objc.NSString or js.String—a lot more often than I build strings out of UserString.

Andrew Barnert wrote:
See the final lines in https://github.com/python/cpython/blob/3.7/Lib/_collections_abc.py MutableSequence.register(list) MutableSequence.register(bytearray) # Multiply inheriting, see ByteString [snip]

On Oct 14, 2019, at 13:28, Steve Jorgensen <stevej@stevej.name> wrote:
But that’s the exact opposite: it registers list and bytearray as virtual subclasses of the ABC MutableSequence, it doesn’t register MutableSequence as a virtual subclass of the concrete classes list and bytearray. Doing the latter would not only (like Steven’s proposal) be unprecedented and require changes to builtin types, but also (completely unlike Steven’s proposal) be just wrong—it’s certainly not true in any useful sense that every mutable sequence is usable as a bytearray.
participants (6)
-
Andrew Barnert
-
Chris Angelico
-
David Mertz
-
Rhodri James
-
Steve Jorgensen
-
Steven D'Aprano