Mailman 3 strings as iterables - from str.startswith taking any iterator instead of just tuple - Python-ideas


strings as iterables - from str.startswith taking any iterator instead of just tuple

Alexander Heger

3 Jan 2014

3:54 a.m.

By designing an API that doesn't require such overloading.

On Thursday, January 2, 2014, Alexander Heger wrote:

...
...
...
isinstance(x, Iterable) and not isinstance(x, str)

If you find yourself typing that a lot I think you have a bigger problem though.

How do you replace this?

for my applications this seemed the most natural way - have the method deal with what it is fed, which could be strings or any kind of collections or iterables of strings. But never would I want to disassemble strings into characters. From the previous message I gather that I am not the only one with this application case.

Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units. They already start out being not mutable. I think it would be a positive design change for Python 4 to make them units instead of being iterables. At least for me, there is much fewer applications where the latter is useful than where it requires extra code. Overall, it makes the language less clean that a string is an iterable; a special case we always have to code around.

I know it will break a lot of existing code, but so did the string change from py2 to 3. (It would break very few of my codes, though.)

-Alexander
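The "strings or iterables of strings" case described above is often handled by normalizing the argument once at the API boundary. The following is only a minimal sketch of that idea, not something proposed in the thread; the helper name `as_string_list` is hypothetical.

```python
from collections.abc import Iterable


def as_string_list(value):
    """Return a list of strings from one string or an iterable of strings.

    Hypothetical helper: a lone str is treated as a single unit and is
    never split into characters.
    """
    if isinstance(value, str):
        return [value]
    if isinstance(value, Iterable):
        return list(value)
    raise TypeError(
        "expected str or iterable of str, got {!r}".format(type(value).__name__)
    )


print(as_string_list("abc"))
print(as_string_list(("abc", "def")))
```

Checking for str first keeps the special case in one place, so the rest of the code can assume it always receives a list.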


Chris Angelico

3 Jan 2014

3:59 a.m.

On Fri, Jan 3, 2014 at 2:54 PM, Alexander Heger <python@2sn.net> wrote:

...

Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units.

What this would mean is that any time you want to iterate over the characters, you'd have to iterate over string.split('') instead. So the question is, is that common enough to be a problem?

The other point that comes to mind is that iteration and indexing are closely related. I think most people would agree that "abcde"[1] should be 'b' (granted, there's room for debate as to whether that should be a one-character string or an integer with the Unicode codepoint, but either way); it's possible to iterate over anything by indexing it with 0, then 1, then 2, etc, until it raises IndexError. For a string to not be iterable, that identity would have to be broken.

ChrisA

Terry Reedy

9:23 a.m.

On 1/2/2014 10:59 PM, Chris Angelico wrote:

...

On Fri, Jan 3, 2014 at 2:54 PM, Alexander Heger <python@2sn.net> wrote:

...
Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units.

What this would mean is that any time you want to iterate over the characters, you'd have to iterate over string.split('') instead. So the question is, is that common enough to be a problem?

The other point that comes to mind is that iteration and indexing are closely related.

def iter(collection):  # is something like this (ignoring the two-parameter form)
    if hasattr(collection, '__iter__'):
        return collection.__iter__()
    elif hasattr(collection, '__getitem__'):
        return iterator(collection)  # the generic index-based iterator

In 2.x, str does *not* have .__iter__, so the second branch is taken.

>>> iter('ab')
<iterator object at 0x0000000002ED56D8>

In 3.x, str *does* have .__iter__.

>>> iter('ab')
<str_iterator object at 0x00000000037D2EB8>

If .__iter__ were removed, strings would revert to using the generic iterator and would *still* be iterable.
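The fallback described here can be observed in Python 3 with any class that defines only `__getitem__`. This is a small sketch; the class name `Letters` is made up for illustration.

```python
class Letters:
    """Defines __getitem__ but not __iter__ (hypothetical example class)."""

    def __init__(self, text):
        self._text = text

    def __getitem__(self, index):
        # Indexing past the end raises IndexError, which ends the iteration.
        return self._text[index]


letters = Letters("ab")
print(hasattr(letters, "__iter__"))  # False
print(list(letters))                 # ['a', 'b']
```

Here `iter()` finds no `__iter__` and falls back to calling `__getitem__` with 0, 1, 2, ... until IndexError is raised.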

...

I think most people would agree that "abcde"[1] should be 'b' (granted, there's room for debate as to whether that should be a one-character string or an integer with the Unicode codepoint, but either way); it's possible to iterate over anything by indexing it with 0, then 1, then 2, etc, until it raises IndexError. For a string to not be iterable, that identity would have to be broken.

Which, to me, would be really ugly ;-). -- Terry Jan Reedy

Alexander Heger

4 Jan 2014

4:23 a.m.

...

On Fri, Jan 3, 2014 at 2:54 PM, Alexander Heger <python@2sn.net> wrote:

...
Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units.

What this would mean is that any time you want to iterate over the characters, you'd have to iterate over string.split('') instead. So the question is, is that common enough to be a problem?

you could still have had str.iter()

...

The other point that comes to mind is that iteration and indexing are closely related. I think most people would agree that "abcde"[1] should be 'b' (granted, there's room for debate as to whether that should be a one-character string or an integer with the Unicode codepoint, but either way); it's possible to iterate over anything by indexing it with 0, then 1, then 2, etc, until it raises IndexError. For a string to not be iterable, that identity would have to be broken.

OK, I admit that not being able to iterate over something that can be indexed may be confusing. Though indexing of strings is somewhat special in many languages. -Alexander

Chris Angelico

5:32 a.m.

On Sat, Jan 4, 2014 at 3:23 PM, Alexander Heger <python@2sn.net> wrote:

...

...
The other point that comes to mind is that iteration and indexing are closely related. I think most people would agree that "abcde"[1] should be 'b' (granted, there's room for debate as to whether that should be a one-character string or an integer with the Unicode codepoint, but either way); it's possible to iterate over anything by indexing it with 0, then 1, then 2, etc, until it raises IndexError. For a string to not be iterable, that identity would have to be broken.

OK, I admit that not being able to iterate over something that can be indexed may be confusing. Though indexing of strings is somewhat special in many languages.

I don't know that it's particularly special. In some languages, a string is simply an array of small integers (maybe bytes, maybe Unicode codepoints), so when you index into one, you get the integers. Python deems that the elements of a string are themselves strings, which is somewhat special I suppose, but only because the representation of a character is a short string. And of course, there are languages that treat strings as simple atomic scalars, no subscripting allowed at all - I don't think that's an advantage over either of the above. :) When you index a string, you get a character. Whatever the language uses to represent a character, that's what you get. I don't think this is particularly esoteric, but maybe that's just me. ChrisA
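Python 3 happens to contain both models mentioned above, which makes the difference easy to see side by side. A short sketch using only built-in types:

```python
text = "abcde"
data = b"abcde"

# Indexing a str gives a length-1 str.
print(repr(text[1]))   # 'b'

# Indexing a bytes object gives a small integer.
print(data[1])         # 98

# Iteration follows indexing in both cases.
print(list(text[:2]))  # ['a', 'b']
print(list(data[:2]))  # [97, 98]
```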

Mark Lawrence

3 Jan 2014

4:27 a.m.

On 03/01/2014 03:54, Alexander Heger wrote:

...

Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units. They already start out being not mutable. I think it would be a positive design change for Python 4 to make them units instead of being iterables. At least for me, there is much fewer applications where the latter is useful than where it requires extra code. Overall, it makes the language less clean that a string is an iterable; a special case we always have to code around.

I find your terminology misleading. A string is a sequence in the same way that list, tuple, range, bytes, bytearray and memoryview are.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence

spir

10:19 a.m.

On 01/03/2014 04:54 AM, Alexander Heger wrote:

...

...
By designing an API that doesn't require such overloading.

On Thursday, January 2, 2014, Alexander Heger wrote:

...
...
...
isinstance(x, Iterable) and not isinstance(x, str)

If you find yourself typing that a lot I think you have a bigger problem though.

How do you replace this?

for my applications this seemed the most natural way - have the method deal with what it is fed, which could be strings or any kind of collections or iterables of strings. But never would I want to disassemble strings into characters. From the previous message I gather that I am not the only one with this application case.

Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units. They already start out being not mutable. I think it would be a positive design change for Python 4 to make them units instead of being iterables. At least for me, there is much fewer applications where the latter is useful than where it requires extra code. Overall, it makes the language less clean that a string is an iterable; a special case we always have to code around.

I know it will break a lot of existing code, but so did the string change from py2 to 3. (It would break very few of my codes, though.)

I agree there is an occasional need, which I have also met in real code: it was parse result data, which can be a string (terminal patterns, that really "eat" part of the source) or a list (or otherwise a "true" iterable collection, for composite or repetitive patterns). But the case is rare because it requires a coincidence of conditions:

* both strings and collections may come as input
* both are valid, from the point of view of the app's logic
* one wants to iterate collections, but not strings

On the other hand, I find you much too quick to dismiss the real and very common need to iterate strings (on the lowest units of code points), apparently on the sole basis that in your own programming practice you don't need/want it.

We should not make iterating strings a special case (e.g. by requiring an explicit call to an iterator, like for ucode in s.ucodes()) because the case is so common. Instead we may consider finding a way to exclude strings in some collection traversal idiom (for which I have no good proposal: the obvious one would be .items(), but it's used for a different meaning), which would for instance yield an exception on strings because they don't match the idiom ("str object has no 'items' attribute").

Denis

Nick Coghlan

11:41 a.m.

On 3 January 2014 20:19, spir <denis.spir@gmail.com> wrote:

...

On 01/03/2014 04:54 AM, Alexander Heger wrote:

...
...
By designing an API that doesn't require such overloading.

On Thursday, January 2, 2014, Alexander Heger wrote:

...
...
...
isinstance(x, Iterable) and not isinstance(x, str)

If you find yourself typing that a lot I think you have a bigger problem though.

How do you replace this?

for my applications this seemed the most natural way - have the method deal with what it is fed, which could be strings or any kind of collections or iterables of strings. But never would I want to disassemble strings into characters. From the previous message I gather that I am not the only one with this application case.

Generally, I find strings being iterables of characters as useful as if integers were iterables of bits. They should just be units. They already start out being not mutable. I think it would be a positive design change for Python 4 to make them units instead of being iterables. At least for me, there is much fewer applications where the latter is useful than where it requires extra code. Overall, it makes the language less clean that a string is an iterable; a special case we always have to code around.

I know it will break a lot of existing code, but so did the string change from py2 to 3. (It would break very few of my codes, though.)

I agree there is an occasional need, which I have also met in real code: it was parse result data, which can be a string (terminal patterns, that really "eat" part of the source) or a list (or otherwise a "true" iterable collection, for composite or repetitive patterns). But the case is rare because it requires a coincidence of conditions:

* both strings and collections may come as input
* both are valid, from the point of view of the app's logic
* one wants to iterate collections, but not strings

On the other hand, I find you much too quickly dismiss real and very common need to iterate strings (on the lowest units of code points), apparently on the only base that in your own programming practice you don't need/want it.

We should not make iterating strings a special case (e.g. by requiring an explicit call to an iterator, like for ucode in s.ucodes()) because the case is so common. Instead we may consider finding a way to exclude strings in some collection traversal idiom (for which I have no good proposal: the obvious one would be .items(), but it's used for a different meaning), which would for instance yield an exception on strings because they don't match the idiom ("str object has no 'items' attribute").

The underlying problem is that strings have a dual nature: you can view them as either a sequence of code points (which is how Python models them), or else you can view them as an opaque chunk of text (which is often how you want to treat them in code that accepts either containers or atomic values and treats them differently).

This has some interesting implications for API design.

"def f(*args)" handles the constraint fairly well, as f("astring") is treated as a single value and f(*"string") is an unlikely mistake for anyone to make.

"def f(iterable)" has problems in many cases, since f("string") is treated as an iterable of code points, even if you'd prefer an immediate error.

"def f(iterable_or_atomic)" also has problems, since strings will use the "iterable" path, even if the atomic handling would be more appropriate.

Algorithms that recursively descend into containers also need to deal with the fact that doing so with strings causes an infinite loop (since iterating over a string produces length 1 strings).

This is a genuine problem, which is why the question of how to cleanly deal with these situations keeps coming up every couple of years, and the current state of the art answer is "grit your teeth and use isinstance(obj, str)" (or a configurable alternative).

However, I'm wondering if it might be reasonable to add a new entry in collections.abc for 3.5:

>>> from abc import ABC
>>> from collections.abc import Iterable
>>> class Atomic(ABC):
...     @classmethod
...     def __subclasshook__(cls, subclass):
...         if not issubclass(subclass, Iterable):
...             return True
...         return NotImplemented
...
>>> Atomic.register(str)
<class 'str'>
>>> Atomic.register(bytes)
<class 'bytes'>
>>> Atomic.register(bytearray)
<class 'bytearray'>
>>> isinstance(1, Atomic)
True
>>> isinstance(1.0, Atomic)
True
>>> isinstance(1j, Atomic)
True
>>> isinstance("Hello", Atomic)
True
>>> isinstance(b"Hello", Atomic)
True
>>> isinstance((), Atomic)
False
>>> isinstance([], Atomic)
False
>>> isinstance({}, Atomic)
False

Any type which wasn't iterable would automatically be considered atomic, while some types which *are* iterable could *also* be registered as atomic (with str, bytes and bytearray being the obvious candidates, as shown above).

Armed with such an ABC, you could then write an "iter_non_atomic" helper function as:

    def iter_non_atomic(iterable):
        if isinstance(iterable, Atomic):
            raise TypeError("{!r} is considered atomic".format(iterable.__class__.__name__))
        return iter(iterable)

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
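To illustrate the recursive-descent problem mentioned in this message, here is a self-contained sketch that repeats the proposed ABC and uses it in a flattening helper. `Atomic` is only the proposal from this thread, not an existing entry in collections.abc, and the name `flatten` is hypothetical.

```python
from abc import ABC
from collections.abc import Iterable


class Atomic(ABC):
    """Sketch of the ABC proposed above; not part of collections.abc."""

    @classmethod
    def __subclasshook__(cls, subclass):
        # Anything that is not iterable counts as atomic automatically.
        if not issubclass(subclass, Iterable):
            return True
        return NotImplemented


# Iterable types that should still be treated as single values.
for text_type in (str, bytes, bytearray):
    Atomic.register(text_type)


def flatten(obj):
    """Hypothetical helper: yield the atomic leaves of nested containers."""
    if isinstance(obj, Atomic):
        yield obj
    else:
        for item in obj:
            yield from flatten(item)


print(list(flatten([1, "ab", (2.0, ["cd", b"ef"])])))
# [1, 'ab', 2.0, 'cd', b'ef']
```

Without the Atomic check, flatten("ab") would never terminate, because iterating "a" yields "a" again.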

Masklinn

12:12 p.m.


On 2014-01-03, at 12:41 , Nick Coghlan <ncoghlan@gmail.com> wrote:

...

"def f(iterable_or_atomic)" also has problems, since strings will use the "iterable" path, even if the atomic handling would be more appropriate.

Algorithms that recursively descend into containers also need to deal with the fact that doing so with strings causes an infinite loop (since iterating over a string produces length 1 strings).

This is a genuine problem, which is why the question of how to cleanly deal with these situations keeps coming up every couple of years, and the current state of the art answer is "grit your teeth and use isinstance(obj, str)" (or a configurable alternative).

However, I'm wondering if it might be reasonable to add a new entry in collections.abc for 3.5:

>>> from abc import ABC
>>> from collections.abc import Iterable
>>> class Atomic(ABC):
...     @classmethod
...     def __subclasshook__(cls, subclass):
...         if not issubclass(subclass, Iterable):
...             return True
...         return NotImplemented
...

I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

>>> Atomic.register(str)
<class 'str'>
>>> Atomic.register(bytes)
<class 'bytes'>
>>> Atomic.register(bytearray)
<class 'bytearray'>
>>> isinstance(1, Atomic)
True
>>> isinstance(1.0, Atomic)
True
>>> isinstance(1j, Atomic)
True
>>> isinstance("Hello", Atomic)
True
>>> isinstance(b"Hello", Atomic)
True
>>> isinstance((), Atomic)
False
>>> isinstance([], Atomic)
False
>>> isinstance({}, Atomic)
False

Nick Coghlan

12:30 p.m.

On 3 January 2014 22:12, Masklinn <masklinn@masklinn.net> wrote:

...

On 2014-01-03, at 12:41 , Nick Coghlan <ncoghlan@gmail.com> wrote:

...
"def f(iterable_or_atomic)" also has problems, since strings will use the "iterable" path, even if the atomic handling would be more appropriate.

Algorithms that recursively descend into containers also need to deal with the fact that doing so with strings causes an infinite loop (since iterating over a string produces length 1 strings).

This is a genuine problem, which is why the question of how to cleanly deal with these situations keeps coming up every couple of years, and the current state of the art answer is "grit your teeth and use isinstance(obj, str)" (or a configurable alternative).

However, I'm wondering if it might be reasonable to add a new entry in collections.abc for 3.5:

>>> from abc import ABC
>>> from collections.abc import Iterable
>>> class Atomic(ABC):
...     @classmethod
...     def __subclasshook__(cls, subclass):
...         if not issubclass(subclass, Iterable):
...             return True
...         return NotImplemented
...

I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

Yeah, that makes sense. I believe the NumPy folks run into a somewhat similar issue with the subtle distinction between treating scalars as scalars and treating them as zero-dimensional arrays. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Joshua Landau

2:17 p.m.

On 3 January 2014 12:12, Masklinn <masklinn@masklinn.net> wrote:

...

On 2014-01-03, at 12:41 , Nick Coghlan <ncoghlan@gmail.com> wrote: I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

OTOH, to many non-mathematical people I hardly expect "is this scalar" to feel nearly as meaningful a question as "is this atomic". To bike-shed, how about "unitary". Nevertheless, I like the idea and the problem is a real one.

Bruce Leban

6:11 p.m.

On Fri, Jan 3, 2014 at 6:17 AM, Joshua Landau <joshua@landau.ws> wrote:

...

OTOH, to many non-mathematical people I hardly expect "is this scalar" to feel nearly as meaningful a question as "is this atomic".

To bike-shed, how about "unitary".

"atomic" has the wrong meaning since it says it doesn't have any component parts. Scalar has the right meaning. As to the idea of making strings not iterable, that would break my code. I write a lot of code to manipulate words (to create puzzles) and iterating over strings is fundamental. In fact, I'd like to have strings as results of iteration operations on strings:

>>> sorted('string')
'ginrst'
>>> list(itertools.permutations('bar'))
['bar', 'bra', 'abr', 'arb', 'rba', 'rab']

instead I have to write

>>> ''.join(sorted('string'))
>>> [''.join(s) for s in itertools.permutations('bar')]

This would probably break less code than making strings non-iterable, but realize that there's approximately 0% chance this would ever change and there's no easy way to cover every iteration operation. And it would confuse people if sometimes:

    (x.upper() for x in s)

returned an iterator and sometimes it returned a string.

--- Bruce
My guest puzzle for Puzzles Live: http://www.puzzazz.com/puzzles-live/10

spir

4 Jan 2014

10:22 a.m.

On 01/03/2014 07:11 PM, Bruce Leban wrote:

...

As to the idea of making strings not iterable, that would break my code. I write a lot of code to manipulate words (to create puzzles) and iterating over strings is fundamental. In fact, I'd like to have strings as results of iteration operations on strings:

>>> sorted('string')
'ginrst'
>>> list(itertools.permutations('bar'))
['bar', 'bra', 'abr', 'arb', 'rba', 'rab']

instead I have to write

>>> ''.join(sorted('string'))
>>> [''.join(s) for s in itertools.permutations('bar')]

Maybe we just need a 'cat' or 'concat' [1] method for lists:

    sorted('string').cat()
    (s for s in itertools.permutations('bar')).cat()

(Then, a hard choice: should cat crash when items are not strings, or automagically stringify its operands? I wish join would do the latter.)

Denis

[1] I have not yet understood why "concatenation", instead of just "catenation". It literally means chaining (things) together; but I'm still trying to figure out how one can chain things apart ;-) As if strings were called "withstrings" or "stringtogethers", more or less. Enlightenment welcome. (Same for "concatenative languages"... of which one is called "cat"!)

Steven D'Aprano

10:59 p.m.


On Sat, Jan 04, 2014 at 11:22:16AM +0100, spir wrote:

...

On 01/03/2014 07:11 PM, Bruce Leban wrote:

...
As to the idea of making strings not iterable, that would break my code. I write a lot of code to manipulate words (to create puzzles) and iterating over strings is fundamental. In fact, I'd like to have strings as results of iteration operations on strings:

>>> sorted('string')
'ginrst'
>>> list(itertools.permutations('bar'))
['bar', 'bra', 'abr', 'arb', 'rba', 'rab']

That would be nice to have.

...

...
instead I have to write

>>> ''.join(sorted('string'))
>>> [''.join(s) for s in itertools.permutations('bar')]

Which is a slight inconvenience, but not a great one. You can always save three characters by creating a helper function:

    join = ''.join

...

Maybe we just need a 'cat' or 'concat' [1] method for lists:

    sorted('string').cat()
    (s for s in itertools.permutations('bar')).cat()

-1 Lists are general collections, giving them a method that depends on a specific kind of item is ugly. Adding that same method to generator expressions is even worse. We don't have list.sum() for adding lists of numbers, we have a sum() function that takes a list.

...

(Then, a hard choice: should cat crash when items are not strings, or automagically stringify its operands? I wish join would do the latter.)

-1 Joining what you think is a list of strings but actually isn't is an error. The right thing to do in the face of an error is to raise an exception, not to silently hide the error. If you want to automatically convert arbitrary items into strings, it is better to explicitly do so:

    ''.join(str(x) for x in items)

than to have it magically, and incorrectly, happen implicitly.
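The behaviour described here is easy to check. A short sketch with a made-up list that mixes strings and an integer:

```python
items = ["a", 1, "b"]  # illustrative data, not from the thread

try:
    "".join(items)
except TypeError as exc:
    # str.join refuses non-string items instead of converting them.
    print("join refused:", exc)

# Explicit conversion makes the intent visible at the call site.
print("".join(str(x) for x in items))  # a1b
```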

...

[1] I have not yet understood why "concatenation", instead of just "catenation". It literally means chaining (things) together; but I'm still trying to figure out how one can chain things apart ;-)

Chain your left arm to the wall on your left, and your right arm to the wall on your right. Your arms are now chained apart. http://www.vlvstamps.com/man-chained-to-wall.html (Safe for work.) -- Steven

Masklinn

11:30 p.m.


On 2014-01-04, at 23:59 , Steven D'Aprano <steve@pearwood.info> wrote:

...

On Sat, Jan 04, 2014 at 11:22:16AM +0100, spir wrote:

...
On 01/03/2014 07:11 PM, Bruce Leban wrote:

...
As to the idea of making strings not iterable, that would break my code. I write a lot of code to manipulate words (to create puzzles) and iterating over strings is fundamental. In fact, I'd like to have strings as results of iteration operations on strings:

>>> sorted('string')
'ginrst'
>>> list(itertools.permutations('bar'))
['bar', 'bra', 'abr', 'arb', 'rba', 'rab']

That would be nice to have.

More generally, it would be nice if a sequence type could specify how to derive a new instance of itself (from an iterable for instance). Constructors don't necessarily work (e.g. str's constructor). Clojure has such a concept through the IPersistentCollection protocol: empty(coll) creates a new (empty) instance of coll (clojure's collections being immutable, it makes sense to create an empty collection then add stuff into it via into() or conj())

Amber Yust

5 Jan 2014

12:08 a.m.

__fromiter__, anyone?

On Sat Jan 04 2014 at 3:31:59 PM, Masklinn <masklinn@masklinn.net> wrote:

...

On 2014-01-04, at 23:59 , Steven D'Aprano <steve@pearwood.info> wrote:

...
On Sat, Jan 04, 2014 at 11:22:16AM +0100, spir wrote:

...
On 01/03/2014 07:11 PM, Bruce Leban wrote:

...
As to the idea of making strings not iterable, that would break my code. I write a lot of code to manipulate words (to create puzzles) and iterating over strings is fundamental. In fact, I'd like to have strings as results of iteration operations on strings:

>>> sorted('string')
'ginrst'
>>> list(itertools.permutations('bar'))
['bar', 'bra', 'abr', 'arb', 'rba', 'rab']

That would be nice to have.

More generally, it would be nice if a sequence type could specify how to derive a new instance of itself (from an iterable for instance). Constructors don't necessarily work (e.g. str's constructor). Clojure has such a concept through the IPersistentCollection protocol: empty(coll) creates a new (empty) instance of coll (clojure's collections being immutable, it makes sense to create an empty collection then add stuff into it via into() or conj())

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Joshua Landau

12:50 a.m.

On Jan 5, 2014 12:08 AM, "Amber Yust" <amber.yust@gmail.com> wrote:

...

__fromiter__, anyone?

I'm unconvinced that it should be a dunder method. Do you expect it to be used like fromiter(str, characters) ? However, +1 on the name, +0 on the idea.
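For concreteness, the free-function spelling asked about here could look roughly like the sketch below. Both `fromiter` and the `__fromiter__` hook are hypothetical names taken from this discussion; neither exists in Python.

```python
def fromiter(cls, iterable):
    """Hypothetical function: build an instance of cls from an iterable.

    A hypothetical __fromiter__ classmethod is tried first. str needs a
    special case, because str(iterable) would give the printed form of
    the iterable rather than its joined items.
    """
    hook = getattr(cls, "__fromiter__", None)
    if hook is not None:
        return hook(iterable)
    if issubclass(cls, str):
        return cls("".join(iterable))
    return cls(iterable)


print(fromiter(str, sorted("string")))    # ginrst
print(fromiter(tuple, sorted("string")))  # ('g', 'i', 'n', 'r', 's', 't')
```

The str branch shows why a plain constructor call is not enough, which is the point made earlier about str's constructor.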

Amber Yust

2:10 a.m.

I'm thinking of it being analogous to the __getstate__ and __setstate__ dunders used by Pickle to allow customization of object creation.

On Sat Jan 04 2014 at 4:50:11 PM, Joshua Landau <joshua.landau.ws@gmail.com> wrote:

...

On Jan 5, 2014 12:08 AM, "Amber Yust" <amber.yust@gmail.com> wrote:

...
__fromiter__, anyone?

I'm unconvinced that it should be a dunder method. Do you expect it to be used like

fromiter(str, characters)

?

However, +1 on the name, +0 on the idea.

Guido van Rossum

7:24 a.m.

Is this thread still about strings vs. other iterables?

First of all, the motivation for making strings iterable is that they are indexable and sliceable, which means they act like sequences.

Historically, indexing and slicing predated the concept of iterators in Python. Many other languages (starting with Pascal and C) also treat strings as arrays; while many of those have a separate character type, a few languages follow Python's example (or the other way around, I don't feel like tracking the influences exactly, or even finding examples -- I do know they exist). There are also languages where strings are *not* considered arrays (I think this is the case in Ruby and Perl). In such languages string manipulation is typically done using regular expressions or similar APIs, although there are usually also non-array APIs to get characters or substrings using indexes, but those APIs may not be O(1), e.g. for reasons having to do with decoding UTF-8 on the fly.

All in all I am happy with Python's string-as-array semantics and I don't want to change this.

While I would like to encourage API designs that don't require distinguishing between strings and other iterables (just like I prefer APIs that don't require distinguishing between sequences and mappings, or between callables and "plain values"), I realize that pragmatically people are going to want to write such code, and an ABC seems a good choice.

However, if "Atomic" is still under consideration, I would strongly argue against that particular term. Given that a string is an array of characters, calling it an "atom" (== indivisible) seems particularly out of order. (And yes, I know that the use of the term in physics is also a misnomer -- let's not repeat that mistake. :-)

Alas, I don't have a better name, but I'm sure the thesauriers will find something. We have until Python 3.5 is released to agree on a name. :-)

-- 
--Guido van Rossum (python.org/~guido)

Amber Yust

6:17 p.m.

For ABC names, perhaps "IndependentSequence" or "UnaffiliatedSequence"?

On Sat Jan 04 2014 at 11:25:23 PM, Guido van Rossum <guido@python.org> wrote:

...

Is this thread still about strings vs. other iterables?

First of all, the motivation for making strings iterable is that they are indexable and sliceable, which means they act like sequences.

Historically, indexing and slicing predated the concept of iterators in Python. Many other languages (starting with Pascal and C) also treat strings as arrays; while many of those have a separate character type, a few languages follow Python's example (or the other way around, I don't feel like tracking the influences exactly, or even finding examples -- I do know they exist). There are also languages where strings are *not* considered arrays (I think this is the case in Ruby and Perl). In such languages string manipulation is typically done using regular expressions or similar APIs, although there are usually also non-array APIs to get characters or substrings using indexes, but those APIs may not be O(1), e.g. for reasons having to do with decoding UTF-8 on the fly.

All in all I am happy with Python's string-as-array semantics and I don't want to change this.

While I would like to encourage API designs that don't require distinguishing between strings and other iterables (just like I prefer APIs that don't require distinguishing between sequences and mappings, or between callables and "plain values"), I realize that pragmatically people are going to want to write such code, and an ABC seems a good choice.

However, if "Atomic" is still under consideration, I would strongly argue against that particular term. Given that a string is an array of characters, calling it an "atom" (== indivisible) seems particularly out of order. (And yes, I know that the use of the term in physics is also a misnomer -- let's not repeat that mistake. :-)

Alas, I don't have a better name, but I'm sure the thesauriers will find something. We have until Python 3.5 is released to agree on a name. :-)

-- --Guido van Rossum (python.org/~guido)

spir

3 Jan 2014

2:17 p.m.

On 01/03/2014 01:12 PM, Masklinn wrote:

...

I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

I used to use, for non highly educated folks, "element" or "elementary" (considering "scalar" too rare a term, and "atomic" potentially misleading). Denis

Stephen J. Turnbull

3:54 p.m.


Masklinn writes:

...

I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

Sure, but if you're a Unicode geek "scalar" essentially means "character", so a string ain't that! Seriously, all the good words have been taken two or three times already in some other field. Pick one and don't worry about the overloading -- learning to spell English is *much* harder.

spir

4:31 p.m.


On 01/03/2014 04:54 PM, Stephen J. Turnbull wrote:

...

Masklinn writes:

...
I’ve used some sort of ad-hoc version of it enough that I think it’s a good idea, although I’d suggest “scalar”: “atomic” also exists (with very different semantics) in concurrency contexts, whereas I believe scalar always means single-value (non-compound) data type.

Sure, but if you're a Unicode geek "scalar" essentially means "character", so a string ain't that!

Unfortunately in unicode slang "character" does not mean character ;-) (but, say, whatever a code point happens to represent)

...

Seriously, all the good words have been taken two or three times already in some other field. Pick one and don't worry about the overloading -- learning to spell English is *much* harder.

Thankfully no one needs spelling english corectly to program --except for keywords... Denis

spir

2:21 p.m.

On 01/03/2014 12:41 PM, Nick Coghlan wrote:

...

The underlying problem is that strings have a dual nature: you can view them as either a sequence of code points (which is how Python models them), or else you can view them as an opaque chunk of text (which is often how you want to treat them in code that accepts either containers or atomic values and treats them differently).

This has some interesting implications for API design.

"def f(*args)" handles the constraint fairly well, as f("astring") is treated as a single value and f(*"string") is an unlikely mistake for anyone to make.

"def f(iterable)" has problems in many cases, since f("string") is treated as an iterable of code points, even if you'd prefer an immediate error.

"def f(iterable_or_atomic)" also has problems, since strings will use the "iterable" path, even if the atomic handling would be more appropriate.

Algorithms that recursively descend into containers also need to deal with the fact that doing so with strings causes an infinite loop (since iterating over a string produces length 1 strings).

This is a genuine problem, which is why the question of how to cleanly deal with these situations keeps coming up every couple of years, and the current state of the art answer is "grit your teeth and use isinstance(obj, str)" (or a configurable alternative).
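A minimal sketch of the recursion problem and the usual workaround described above. The `flatten` name is illustrative and does not come from the thread; it assumes Python 3.3+ for `yield from`. Without the `str`/`bytes` guard, `flatten("ab")` would keep descending into length-1 strings until it hit a RecursionError.

```python
from collections.abc import Iterable


def flatten(obj):
    """Yield the leaves of a nested iterable.

    str and bytes are treated as leaves; this is the
    "grit your teeth and use isinstance" workaround.
    """
    if isinstance(obj, (str, bytes)) or not isinstance(obj, Iterable):
        yield obj
        return
    for item in obj:
        yield from flatten(item)


print(list(flatten([1, "ab", [2, ("cd", [3.0])]])))
# prints [1, 'ab', 2, 'cd', 3.0]
```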

However, I'm wondering if it might be reasonable to add a new entry in collections.abc for 3.5:

...
...
...
...
...
    >>> from abc import ABC
    >>> from collections.abc import Iterable
    >>> class Atomic(ABC):
    ...     @classmethod
    ...     def __subclasshook__(cls, subclass):
    ...         if not issubclass(subclass, Iterable):
    ...             return True
    ...         return NotImplemented
    ...
    >>> Atomic.register(str)
    <class 'str'>
    >>> Atomic.register(bytes)
    <class 'bytes'>
    >>> Atomic.register(bytearray)
    <class 'bytearray'>
    >>> isinstance(1, Atomic)
    True
    >>> isinstance(1.0, Atomic)
    True
    >>> isinstance(1j, Atomic)
    True
    >>> isinstance("Hello", Atomic)
    True
    >>> isinstance(b"Hello", Atomic)
    True
    >>> isinstance((), Atomic)
    False
    >>> isinstance([], Atomic)
    False
    >>> isinstance({}, Atomic)
    False

Any type which wasn't iterable would automatically be considered atomic, while some types which *are* iterable could *also* be registered as atomic (with str, bytes and bytearray being the obvious candidates, as shown above).

Armed with such an ABC, you could then write an "iter_non_atomic" helper function as:

    def iter_non_atomic(iterable):
        if isinstance(iterable, Atomic):
            raise TypeError("{!r} is considered atomic".format(iterable.__class__.__name__))
        return iter(iterable)
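For readers who want to try the proposal, here is a self-contained script version of the session and helper quoted above. `Atomic` is only the ABC proposed in this thread, not something `collections.abc` provides, and the two calls at the bottom are added purely for illustration.

```python
from abc import ABC
from collections.abc import Iterable


class Atomic(ABC):
    """Proposed ABC from this thread (not part of the standard library)."""

    @classmethod
    def __subclasshook__(cls, subclass):
        # Anything that is not iterable counts as atomic automatically.
        if not issubclass(subclass, Iterable):
            return True
        return NotImplemented


# Iterable types that should nevertheless be treated as single values.
for string_like in (str, bytes, bytearray):
    Atomic.register(string_like)


def iter_non_atomic(iterable):
    if isinstance(iterable, Atomic):
        raise TypeError(
            "{!r} is considered atomic".format(iterable.__class__.__name__)
        )
    return iter(iterable)


print(list(iter_non_atomic(["a", "b"])))  # prints ['a', 'b']
try:
    iter_non_atomic("ab")
except TypeError as exc:
    print(exc)  # prints 'str' is considered atomic
```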

I like this solution. But would live with checking for type (usually str). The point is that, while not that uncommon, when the issue arises one has to deal with it at one or at most a few places in code (typically at the start of one or a few methods of a given type). It is not as if we had to carry an unneeded overload about everywhere. Denis

Nick Coghlan

2:39 p.m.

On 4 January 2014 00:21, spir <denis.spir@gmail.com> wrote:

...

On 01/03/2014 12:41 PM, Nick Coghlan wrote:

...
Armed with such an ABC, you could then write an "iter_non_atomic" helper function as:

    def iter_non_atomic(iterable):
        if isinstance(iterable, Atomic):
            raise TypeError("{!r} is considered atomic".format(iterable.__class__.__name__))
        return iter(iterable)

I like this solution. But would live with checking for type (usually str).

The ducktyping variant I've also used on occasion is "hasattr(obj, 'encode')" rather than an instance check against a concrete type (it also has the benefit of picking up both str and unicode in Python 2 when writing 2/3 compatible code that can't rely on basestring, as well as UserString instances)
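A rough sketch of that duck-typed check, assuming Python 3. The helper names `is_text_like` and `as_items` are hypothetical and chosen only for illustration. Note one limitation: Python 3 `bytes` has `decode` but no `encode`, so this check does not catch it.

```python
from collections import UserString


def is_text_like(obj):
    """Hypothetical helper: duck-typed test for text-like objects."""
    return hasattr(obj, "encode")


def as_items(value):
    """Wrap a lone text-like value in a list; listify any other iterable."""
    if is_text_like(value):
        return [value]
    return list(value)


print(as_items("abc"))                  # prints ['abc']
print(as_items(("a", "b")))             # prints ['a', 'b']
print(is_text_like(UserString("abc")))  # prints True
print(is_text_like(b"abc"))             # prints False (bytes has no encode in Python 3)
```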

...

The point is that, while not that uncommon, when the issue arises one has to deal with it at one or at most a few places in code (typically at the start of one or a few methods of a given type). It is not as if we had to carry an unneeded overload about everywhere.

Right, I see it as very similar to the "is that a sequence or a mapping?" question that was one of the key motivations for adding the ABC machinery in the first place. For that case, people historically used a check like "hasattr(obj, 'keys')" (and I think we still do that in a couple of places).

Here, the distinction is between true container types like sets, dicts and lists, and more structured iterables like strings, where the whole is substantially more than the sum of its parts.

Actually, that would be another way of carving out the distinction - rather than trying to cover *all* Atomic types, just have an AtomicIterable ABC that indicated any structure where applying operations like "flatten" doesn't make sense. In addition to str, bytes and bytearray, memoryview and namedtuple instances would also be appropriate candidates. The Iterable suffix would indicate directly that this wasn't related to concurrency.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
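One way the AtomicIterable idea might look in practice. This is a sketch under assumptions, not an existing API: `AtomicIterable` is the hypothetical ABC named above, and `Point` and `flatten` are illustrative names. Each namedtuple class would presumably have to be registered individually, since namedtuple types share no dedicated base class.

```python
from abc import ABC
from collections import namedtuple
from collections.abc import Iterable


class AtomicIterable(ABC):
    """Hypothetical marker ABC: iterable, but not sensible to flatten."""


for builtin in (str, bytes, bytearray, memoryview):
    AtomicIterable.register(builtin)

# Illustrative namedtuple type, registered explicitly as a structured record.
Point = namedtuple("Point", "x y")
AtomicIterable.register(Point)


def flatten(obj):
    """Descend into true containers only; yield everything else as a leaf."""
    if not isinstance(obj, Iterable) or isinstance(obj, AtomicIterable):
        yield obj
        return
    for item in obj:
        yield from flatten(item)


print(list(flatten([Point(1, 2), ["ab", [3]]])))
# prints [Point(x=1, y=2), 'ab', 3]
```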

spir

4:39 p.m.

On 01/03/2014 03:39 PM, Nick Coghlan wrote:

...

Here, the distinction is between true container types like sets, dicts and lists, and more structured iterables like strings, where the whole is substantially more than the sum of its parts.

That's it: the unique property of strings is that composing & combining are the same operation, while for true containers they are distinct: when combining sets (union), one gets a set at the same complexity level, whatever the items are, while when composing sets one gets a set of sets.

...

Actually, that would be another way of carving out the distinction - rather than trying to cover *all* Atomic types, just have an AtomicIterable ABC that indicated any structure where applying operations like "flatten" doesn't make sense. In addition to str, bytes and bytearray, memoryview and namedtuple instances would also be appropriate candidates.

Yes, maybe it's more practical; but an ABC type common to strings (and the like) and atomic types also makes sense. Denis

PS: I had another common use case at times, with trees whose leaves may be strings, or not (esp. for their str and repr methods).

Andrew Barnert

5:27 p.m.


On Jan 3, 2014, at 6:39, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

The Iterable suffix would indicate directly that this wasn't related to concurrency.

I don't know; something whose iter was guaranteed to return an iterator that I could next without synchronizing could be pretty handy. ;)

More seriously, I think a strength of your original version was having a single abstract type for both non-iterables and things that are iterable but you sometimes don't want to treat that way. A flatten function that uses "not isinstance(x, Iterable) or isinstance(x, AtomicIterable)" is less obvious than one that just uses "isinstance(x, Atomic)", and will be a source of 10x as many stupid "oops I used and instead of or" type bugs. If there really is no acceptable name for the easier concept, the tradeoff could be worth it anyway, but I think it's worth trying harder for one.

One last question to bring up: Is there a reasonable/common use case where you do want to flatten multi-char strings to single-char strings, but then want to treat single-char strings as atoms? I can certainly imagine toy cases like that, but it could easily be so rarely useful that it's ok to leave that clumsy to write.
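To make the "and instead of or" slip concrete, here is a small sketch. `AtomicIterable` is again the hypothetical ABC from this thread, and `is_leaf` / `is_leaf_buggy` are illustrative names. With the buggy predicate nothing is ever classified as a leaf, so a flatten built on it would try to iterate integers and recurse endlessly into strings.

```python
from abc import ABC
from collections.abc import Iterable


class AtomicIterable(ABC):
    """Hypothetical marker ABC for iterables that should not be flattened."""


AtomicIterable.register(str)


def is_leaf(x):
    # Intended test: non-iterables, plus iterables flagged as atomic.
    return not isinstance(x, Iterable) or isinstance(x, AtomicIterable)


def is_leaf_buggy(x):
    # The slip: "and" in place of "or".
    return not isinstance(x, Iterable) and isinstance(x, AtomicIterable)


samples = [1, "ab", [1, 2]]
print([is_leaf(s) for s in samples])        # prints [True, True, False]
print([is_leaf_buggy(s) for s in samples])  # prints [False, False, False]
```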

Alexander Heger

4 Jan

4:08 a.m.

Dear Nick, yes, defining an ABC for this case would be an excellent solution. Thanks. -Alexander

...

However, I'm wondering if it might be reasonable to add a new entry in collections.abc for 3.5:

...
...
...
    >>> from abc import ABC
    >>> from collections.abc import Iterable
    >>> class Atomic(ABC):
    ...     @classmethod
    ...     def __subclasshook__(cls, subclass):
    ...         if not issubclass(subclass, Iterable):
    ...             return True
    ...         return NotImplemented
    ...
    >>> Atomic.register(str)
    <class 'str'>
    >>> Atomic.register(bytes)
    <class 'bytes'>
    >>> Atomic.register(bytearray)
    <class 'bytearray'>
    >>> isinstance(1, Atomic)
    True
    >>> isinstance(1.0, Atomic)
    True
    >>> isinstance(1j, Atomic)
    True
    >>> isinstance("Hello", Atomic)
    True
    >>> isinstance(b"Hello", Atomic)
    True
    >>> isinstance((), Atomic)
    False
    >>> isinstance([], Atomic)
    False
    >>> isinstance({}, Atomic)
    False

Any type which wasn't iterable would automatically be considered atomic, while some types which *are* iterable could *also* be registered as atomic (with str, bytes and bytearray being the obvious candidates, as shown above).

Armed with such an ABC, you could then write an "iter_non_atomic" helper function as:

    def iter_non_atomic(iterable):
        if isinstance(iterable, Atomic):
            raise TypeError("{!r} is considered atomic".format(iterable.__class__.__name__))
        return iter(iterable)

Cheers, Nick.

