discontinue iterable strings

Standard Python should stop treating strings as iterables of characters, i.e. length-1 strings. I see this as one of the biggest design flaws of Python. It may have seemed genius at the time, but it has outlived its usefulness for practical language use. For example, numpy has no issues:

>>> np.array('abc')
array('abc', dtype='<U3')

whereas, as all know,

>>> list('abc')
['a', 'b', 'c']

Numpy was of course designed a lot later, with more experience of practical use in mind. Maybe a starting point for a transition, so that the latter operation also returns ['abc'] in the long run, could be to have an explicit split operator as the recommended use, e.g.,

'abc'.split()
'abc'.split('')
'abc'.chars()
'abc'.items()

The latter two could return an iterator, whereas the former two return lists (currently raise exceptions). Similar for bytes, etc.

This would introduce a major inconsistency. To do this, you would also need to strip strings of their status as sequences (in collections.abc, Sequence is a subclass of Iterable). Thus, making strings no longer iterable would also mean you could no longer take the length or a slice of a string. While I believe your proposal was well intentioned, IMHO it would cause a giant inconsistency in Python (why would one of our core sequences not be iterable?)

- Ed

On 20 August 2016 at 13:24, Edward Minnix <egregius313@gmail.com> wrote:
> This would introduce a major inconsistency. To do this, you would also need to strip strings of their status as sequences (in collections.abc, Sequence is a subclass of Iterable). Thus, making strings no longer iterable would also mean you could no longer take the length or a slice of a string.

You can always define __len__ and __getitem__ independently. I do this for many objects. But it is a point I had not considered.

> While I believe your proposal was well intentioned, IMHO it would cause a giant inconsistency in Python (why would one of our core sequences not be iterable?)

Yes, I am aware it will cause a lot of backward incompatibilities, but this is based on all the lengthy discussions about "string but not iterable" type determinations. If strings were not iterable, a lot of things would also be easier. You could also ask why an integer cannot be iterated over its bits. As has been noted, a string is one of the few objects whose component can be the object itself:

'a'[0] == 'a'

I do not iterate over strings so often that it could not be done using, e.g., str.chars():

for c in str.chars():
    print(c)

On Sat, Aug 20, 2016 at 4:28 PM, Alexander Heger <python@2sn.net> wrote:
> Yes, I am aware it will cause a lot of backward incompatibilities...

Tell me, would you retain the ability to subscript a string to get its characters?

>>> "asdf"[0]
'a'

If not, you break a ton of code. If you do, they are automatically iterable *by definition*. Watch:

class IsThisIterable:
    def __getitem__(self, idx):
        if idx < 5:
            return idx*idx
        raise IndexError
So you can't lose iteration without also losing subscripting. ChrisA
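A minimal sketch of what "Watch" is pointing at, reusing the IsThisIterable class from the message: the class defines only __getitem__, yet iter() accepts it, because Python falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised. The variable name `squares` is only for illustration.

```python
class IsThisIterable:
    def __getitem__(self, idx):
        if idx < 5:
            return idx * idx
        raise IndexError


# No __iter__ is defined, but the legacy sequence protocol makes this iterable.
squares = list(IsThisIterable())
print(squares)  # [0, 1, 4, 9, 16]
```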

On Sat, Aug 20, 2016 at 3:48 AM Chris Angelico <rosuav@gmail.com> wrote:
A separate character type would solve that issue. While Alexander Heger was advocating for a "monolithic object," and may in fact not want subscripting, I think he's more frustrated by the fact that iterating over a string gives other strings. If instead a 1-length string were a different, non-iterable type, that might avoid some problems. However, special-casing a character as a different type would bring its own problems. Note the annoyance of iterating over bytes and getting integers. In case it's not clear, I should add that I disagree with this proposal and do not want any change to strings.

On Sat, Aug 20, 2016 at 10:31 PM, Michael Selik <michael.selik@gmail.com> wrote:
Agreed. One of the handy traits of cross-platform code is that MANY languages let you subscript a double-quoted string to get a single-quoted character. Compare these blocks of code:

if ("asdf"[0] == 'a') write("The first letter of asdf is a.\n");

if ("asdf"[0] == 'a'): print("The first letter of asdf is a.")

if ("asdf"[0] == 'a') console.log("The first letter of asdf is a.")

if ("asdf"[0] == 'a') printf("The first letter of asdf is a.\n");

if ("asdf"[0] == 'a') echo("The first letter of asdf is a.\n");

Those are Pike, Python, JavaScript/ECMAScript, C/C++, and PHP, respectively. Two of them treat single-quoted and double-quoted strings identically (Python and JS). Two use double quotes for strings and single quotes for character (aka integer) constants (Pike and C). One has double quotes for interpolated and single quotes for non-interpolated strings (PHP). And just to mess you up completely, two (or three) of these define strings to be sequences of bytes (C/C++ and PHP, plus Python 2), two as sequences of Unicode codepoints (Python and Pike), and one as sequences of UTF-16 code units (JS). But in all five, subscripting a double-quoted string yields a single-quoted character.

I'm firmly of the opinion that this should not change. Code clarity is not helped by creating a brand-new "character" type and not having a corresponding literal for it, and the one obvious literal, given the amount of prior art using it, would be some form of quote character - probably the apostrophe. Since that's not available, I think a character type would be a major hurdle to get over.

ChrisA

That would require strings to also not be sequences, or to totally drop the sequence protocol. These are non-starters. They *will not* happen. Not they shouldn’t happen, or they probably won’t happen. They cannot and will not happen. That is a much bigger break than they were even willing to make between 2 and 3.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Alexander Heger
Sent: Saturday, August 20, 2016 4:52 PM
To: Chris Angelico <rosuav@gmail.com>
Cc: python-ideas <python-ideas@python.org>
Subject: Re: [Python-ideas] discontinue iterable strings

I was not proposing a character type, only that strings are not iterable:

for i in 'abc':
    print(i)

TypeError: 'str' object is not iterable

On Sat, Aug 20, 2016 at 4:57 PM Alexander Heger <python@2sn.net> wrote:
You can quibble with the original design choice, but unless you borrow Guido's time machine, there's not much point to that discussion. Instead, let's talk about the benefits and problems that your change proposal would cause.

Benefits:
- no more accidentally using str as an iterable

Problems:
- any code that subscripts, slices, or iterates over a str will break

Did I leave anything out? How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?

Just to be clear, at the time it was designed, it surely was a genius idea with its obvious simplicity. I spend much of my time refactoring code and interfaces from previous "genius" ideas, as usage matures.

I would try to keep indexing and slicing, but not iterating. Though there have been comments that this may not be straightforward to implement. I am not sure whether strings would need to acquire a "substring" attribute that can be indexed and sliced.

> Did I leave anything out?
> How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?

First, one needs to add the extension string attributes like split()/split(''), chars(), and substring[] (Python 3.7). When indexing becomes disallowed (Python 3.10 / 4.0), attempts to iterate (or slice) will raise TypeError. The fixes overall will be a lot easier and more obvious than the introduction of unicode as the default string type in Python 3.0. It could already be used/tested starting with Python 3.7 using `from future import __monolythic_strings__`.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Elazar
Sent: Saturday, August 20, 2016 5:56 PM
To: python-ideas <python-ideas@python.org>
Subject: Re: [Python-ideas] discontinue iterable strings

On Sun, Aug 21, 2016 at 12:28 AM Alexander Heger <python@2sn.net> wrote:
> > Did I leave anything out? How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?
>
> First, one needs to add the extension string attributes like split()/split(''), chars(), and substring[] (Python 3.7). When indexing becomes disallowed (Python 3.10 / 4.0), attempts to iterate (or slice) will raise TypeError. The fixes overall will be a lot easier and more obvious than the introduction of unicode as the default string type in Python 3.0. It could already be used/tested starting with Python 3.7 using `from future import __monolythic_strings__`.

Is there any equivalent __future__ import with such deep semantic implications? Most imports I can think of are mainly syntactic. And what would it do? Change the type of string literals? Change the behavior of str methods locally in this module? Globally? How will this play with 3rd party libraries? Sounds like it will break stuff in a way that cannot be locally fixed.

~Elazar

from __future__ import unicode_literals outright changes the type of object string literals make (in python 2). If you were to create a non-iterable, non-sequence text type (a horrible idea, IMO) the same thing can be done for that.

On Sun, Aug 21, 2016 at 6:08 PM, <tritium-list@sdamon.com> wrote:
> from __future__ import unicode_literals outright changes the type of object string literals make (in python 2). If you were to create a non-iterable, non-sequence text type (a horrible idea, IMO) the same thing can be done for that.

It could; but that just changes what *literals* make. But what about other sources of strings - str()? bytes.decode()? format()? repr()? Which ones get changed, and which don't? There's no easy way to do this.

ChrisA

On Sat, Aug 20, 2016 at 5:27 PM Alexander Heger <python@2sn.net> wrote:
> > - any code that subscripts, slices, or iterates over a str will break
>
> I would try to keep indexing and slicing, but not iterating.

So anything that wants to loop over a string character by character would need to construct a new object, like ``for c in list(s)``? That seems inefficient. I suppose you might advocate for a new type, some sort of stringview that would allow iteration over a string, but avoid allocating so much space as a list does, but that might bring us back to where we started.

> The fixes overall will be a lot easier and obvious than introduction of unicode as default string type in Python 3.0.

That's a bold claim. Have you considered what's at stake if that's not true? Anyway, why don't you write a proof of concept module for a non-iterable string, throw it on PyPI, and see if people like using it?
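A proof of concept along those lines could start very small. The following is only a sketch under stated assumptions: the class name MonolithicStr is hypothetical, chars() is the method name floated earlier in the thread, and the `__iter__ = None` opt-out it relies on requires Python 3.6 or later. Indexing and slicing are inherited from str unchanged and return ordinary str objects.

```python
class MonolithicStr(str):
    """Hypothetical str subclass: indexable and sliceable, but not directly iterable."""

    __iter__ = None  # iter() and for-loops raise TypeError (Python 3.6+)

    def chars(self):
        """Explicitly iterate over the characters, one length-1 str at a time."""
        for index in range(len(self)):
            yield self[index]


s = MonolithicStr("abc")
first = s[0]               # indexing is kept
tail = s[1:]               # slicing is kept (the result is a plain str)
letters = list(s.chars())  # explicit, lazy character iteration

try:
    list(s)                # implicit iteration is refused
except TypeError:
    direct_iteration_blocked = True
else:
    direct_iteration_blocked = False

print(first, tail, letters, direct_iteration_blocked)
```

Because chars() is a generator, it does not allocate a list the way ``for c in list(s)`` would; whether the extra method call is an acceptable price is exactly the question such an experiment would answer.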

On Sun, Aug 21, 2016 at 12:34:02AM +0000, Michael Selik wrote:
If this was ten years ago, and we were talking about backwards incompatible changes for the soon-to-be-started Python 3000, I might be more responsive to changing strings to be an atomic type (like ints, floats, etc) with a .chars() view that iterates over the characters. Or something like what Go does (I think), namely to distinguish between Chars and Strings: indexing a string gives you a Char, and Chars are not indexable and not iterable. But even then, the change probably would have required a PEP.
Saying that these so-called "fixes" (we haven't established yet that Python's string behaviour is a bug that needs fixing) will be easier and more obvious than the change to Unicode is not that bold a claim. Pretty much everything is easier and more obvious than changing to Unicode. :-) (Possibly not bringing peace to the Middle East.)

I think that while the suggestion does bring some benefit, the benefit isn't enough to make up for the code churn and disruption it would cause. But I encourage the OP to go through the standard library, pick a couple of modules, and re-write them to see how they would look using this proposal.

-- Steve

On Sun, Aug 21, 2016 at 12:52 PM, Steven D'Aprano <steve@pearwood.info> wrote:
And yet it's so simple. We can teach novice programmers about two's complement [1] representations of integers, and they have no trouble comprehending that the abstract concept of "integer" is different from the concrete representation in memory. We can teach intermediate programmers how hash tables work, and how to improve their performance on CPUs with 64-byte cache lines - again, there's no comprehension barrier between "mapping from key to value" and "puddle of bytes in memory that represent that mapping". But so many programmers are entrenched in the thinking that a byte IS a character.
Python still has a rule that you can iterate over anything that has __getitem__, and it'll be called with 0, 1, 2, 3... until it raises IndexError. So you have two options: Remove that rule, and require that all iterable objects actually define __iter__; or make strings non-subscriptable, which means you need to do something like "asdf".char_at(0) instead of "asdf"[0]. IMO the second option is a total non-flyer - good luck convincing anyone that THAT is an improvement. The first one is possible, but dramatically broadens the backward-compatibility issue. You'd have to search for any class that defines __getitem__ and not __iter__.

If that *does* get considered, it wouldn't be too hard to have a compatibility function, maybe in itertools.

def subscript(self):
    i = 0
    try:
        while "moar indexing":
            yield self[i]
            i += 1
    except IndexError:
        pass

class Demo:
    def __getitem__(self, item):
        ...
    __iter__ = itertools.subscript

But there'd have to be the full search of "what will this break", even before getting as far as making strings non-iterable.

ChrisA

[1] Not "two's compliment", although I'm told that Two can say some very nice things.

On 2016-08-20 21:10, Chris Angelico wrote:
Isn't the rule that __getitem__ iteration is available only if __iter__ is not explicitly defined? So there is a third option: retain __getitem__ but give this new modified string type an explicit __iter__ that raises TypeError.

That said, I'm not sure I really support the overall proposal to change the behavior of strings. I agree that it is annoying that sometimes when you try to iterate over something you accidentally end up iterating over the characters of a string, but it's been that way for quite a while and changing it would be a significant behavior change. It seems like the main practical problem might be solved by just providing a standard library function iter_or_string or whatever, that just returns a one-item iterator if its argument is a string, or the normal iterator if not. It seems that gazillions of libraries already define such a function, and the main problem is just that, because there is no standard one, many people don't realize they need it until they accidentally iterate over a string and their code goes awry.

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
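Such a helper is only a few lines. The sketch below uses the name iter_or_string from the message; no function of that name exists in the standard library, and treating bytes the same way as str is an added assumption, not part of the suggestion.

```python
def iter_or_string(obj):
    """Return an iterator over obj, treating a string as a single item.

    Assumption: bytes are given the same scalar treatment as str.
    """
    if isinstance(obj, (str, bytes)):
        return iter((obj,))  # one-item iterator holding the whole string
    return iter(obj)         # the normal iterator for everything else


print(list(iter_or_string("abc")))       # ['abc']
print(list(iter_or_string(["a", "b"])))  # ['a', 'b']
```

A function that accepts "a name or a list of names" can then loop over iter_or_string(arg) without special-casing strings at every call site.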

On Sun, Aug 21, 2016 at 3:06 PM, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
Hmm. It would somehow need to be recognized as "not iterable". I'm not sure how this detection is done; is it based on the presence/absence of __iter__, or is it by calling that method and seeing what comes back? If the latter, then sure, an __iter__ that raises would cover that. ChrisA

On Sun, Aug 21, 2016 at 5:27 AM, Chris Angelico <rosuav@gmail.com> wrote:
PyObject_GetIter calls __iter__ (i.e. tp_iter) if it's defined. To get a TypeError, __iter__ can return an object that's not an iterator, i.e. an object that doesn't have a __next__ method (i.e. tp_iternext). For example:

>>> class C:
...     def __iter__(self): return self
...     def __getitem__(self, index): return 42
...
>>> iter(C())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type 'C'

If __iter__ isn't defined but __getitem__ is defined, then PySeqIter_New is called to get a sequence iterator.

>>> class D:
...     def __getitem__(self, index): return 42
...
>>> it = iter(D())
>>> type(it)
<class 'iterator'>
>>> next(it)
42

On 21 August 2016 at 16:02, eryk sun <eryksun@gmail.com> wrote:
I believe Chris's concern was that "isinstance(obj, collections.abc.Iterable)" would still return True. That's actually a valid concern, but Python 3.6 generalises the previously __hash__ specific "__hash__ = None" anti-registration mechanism to other protocols, including __iter__: https://hg.python.org/cpython/rev/72b9f195569c

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Aug 21, 2016 at 1:27 AM Chris Angelico <rosuav@gmail.com> wrote:
The detection of not hashable via __hash__ set to None was necessary, but not desirable. Better to have never defined the method/attribute in the first place. Since __iter__ isn't present on ``object``, we're free to use the better technique of not defining __iter__ rather than defining it as None, NotImplemented, etc. This is superior, because we don't want __iter__ to show up in a dir(), help(), or other tools.

On Sun, Aug 21, 2016 at 6:34 AM, Michael Selik <michael.selik@gmail.com> wrote:
The point is to be able to define __getitem__ without falling back on the sequence iterator. I wasn't aware of the recent commit that allows anti-registration of __iter__. This is perfect:

>>> class C:
...     __iter__ = None
...     def __getitem__(self, index): return 42
...
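A short sketch of the effect, assuming Python 3.6 or later: with `__iter__ = None`, subscripting keeps working while iter() refuses the object instead of falling back to the sequence iterator. The variable names `value` and `raised` are only for illustration.

```python
class C:
    __iter__ = None  # opt out of iteration, including the __getitem__ fallback

    def __getitem__(self, index):
        return 42


value = C()[0]  # subscripting is unaffected

try:
    iter(C())
except TypeError:
    raised = True   # iter() refuses the object
else:
    raised = False

print(value, raised)  # 42 True
```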

On Sun, Aug 21, 2016 at 2:46 AM eryk sun <eryksun@gmail.com> wrote:
For that to make sense, Iterable should be a parent of C, or C should be a subclass of something registered as an Iterable. Otherwise it'd be creating a general recommendation to say ``__iter__ = None`` on every non-Iterable class, which would be silly.

On Sun, Aug 21, 2016 at 2:53 AM Michael Selik <michael.selik@gmail.com> wrote:
I see your point for avoiding iterability when having __getitem__, but I hope that's seen as an anti-pattern that reduces flexibility. And I should learn to stop hitting the send button halfway through my email.

On Sun, Aug 21, 2016 at 6:53 AM, Michael Selik <michael.selik@gmail.com> wrote:
Iterable is a one-trick pony ABC that formerly just checked for an __iter__ method using any("__iter__" in B.__dict__ for B in C.__mro__). It was mentioned that the default __getitem__ iterator can be avoided by defining __iter__ as a callable that either directly or indirectly raises a TypeError, but that's an instance of Iterable, which is misleading. In 3.6 you can instead set `__iter__ = None`. At the low-level, slot_tp_iter has been updated to look for this with the following code:

func = lookup_method(self, &PyId___iter__);
if (func == Py_None) {
    Py_DECREF(func);
    PyErr_Format(PyExc_TypeError,
                 "'%.200s' object is not iterable",
                 Py_TYPE(self)->tp_name);
    return NULL;
}

At the high level, Iterable.__subclasshook__ calls _check_methods(C, "__iter__"):

def _check_methods(C, *methods):
    mro = C.__mro__
    for method in methods:
        for B in mro:
            if method in B.__dict__:
                if B.__dict__[method] is None:
                    return NotImplemented
                break
        else:
            return NotImplemented
    return True
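The difference between the two approaches shows up in the ABC check. A minimal sketch, assuming Python 3.6 or later; the class names RaisesInIter and IterIsNone are made up for the comparison:

```python
from collections.abc import Iterable


class RaisesInIter:
    """Blocks iteration at call time, but still defines an __iter__ method."""

    def __iter__(self):
        raise TypeError("not iterable")

    def __getitem__(self, index):
        return 42


class IterIsNone:
    """Blocks iteration by anti-registration."""

    __iter__ = None

    def __getitem__(self, index):
        return 42


looks_iterable = isinstance(RaisesInIter(), Iterable)  # True: the misleading case
opted_out = isinstance(IterIsNone(), Iterable)         # False

print(looks_iterable, opted_out)  # True False
```

Both classes make iter() raise TypeError; only the second one also tells isinstance() the truth.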

On 21 August 2016 at 14:10, Chris Angelico <rosuav@gmail.com> wrote:
That's not actually true - any type that defines __getitem__ can prevent iteration just by explicitly raising TypeError from __iter__. It would be *weird* to do so, but it's entirely possible. However, the real problem with this proposal (and the reason why the switch from 8-bit str to "bytes are effectively a tuple of ints" in Python 3 was such a pain), is that there are a lot of bytes and text processing operations that *really do* operate code point by code point. Scanning a path for directory separators, scanning a CSV (or other delimited format) for delimiters, processing regular expressions, tokenising according to a grammar, analysing words in a text for character popularity, answering questions like "Is this a valid identifier?" all involve looking at each character in a sequence individually, rather than looking at the character sequence as an atomic unit. The idiomatic pattern for doing that kind of "item by item" processing in Python is iteration (whether through the Python syntax and builtins, or through the CPython C API). Now, if we were designing a language from scratch today, there's a strong case to be made that the *right* way to represent text is to have a stream-like interface (e.g. StringIO, BytesIO) around an atomic type (e.g. CodePoint, int). But we're not designing a language from scratch - we're iterating on one with a 25 year history of design, development, and use. There may also be a case to be made for introducing an AtomicStr type into Python's data model that works like a normal string, but *doesn't* support indexing, slicing, or iteration, and is instead an opaque blob of data that nevertheless supports all the other usual string operations. 
(Similar to the way that types.MappingProxyType lets you provide a read-only view of an otherwise mutable mapping, and that collections.KeysView, ValuesView and ItemsView provide different interfaces for a common underlying mapping) But changing the core text type itself to no longer be suitable for use in text processing tasks? Not gonna happen :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 21 August 2016 at 15:22, Nick Coghlan <ncoghlan@gmail.com> wrote:
Huh, prompted by Brendan Barnwell's comment, I just realised that a discussion I was having with Graham Dumpleton at PyCon Australia about getting the wrapt module (or equivalent functionality) into Python 3.7 (not 3.6 just due to the respective timelines) is actually relevant here: given wrapt.ObjectProxy (see http://wrapt.readthedocs.io/en/latest/wrappers.html#object-proxy ) it shouldn't actually be that difficult to write an "atomic_proxy" implementation that wraps arbitrary container objects in a proxy that permits most operations, but actively *prevents* them from being treated as collections.abc.Container instances of any kind. So if folks are looking for a way to resolve the perennial problem of "How do I tell container processing algorithms to treat *this particular container* as an opaque blob?" that arise most often with strings and binary data, I'd highly recommend that as a potentially fruitful avenue to start exploring. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Aug 21, 2016 1:23 AM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
Thought: A string, in compsci, is, sort of by definition, a sequence of characters. It is short for "string of characters", isn't it? If you were to create a new language, and you don't want to think of strings as char sequences, you might have a type called Text instead. Programmers could be required to call functions to get iterables, such as myText.chars(), myText.lines(), and even myText.words(). Thus, the proposal makes str try to be a Text type rather than the related but distinct String type.

Nick Coghlan writes:
Sure, but code points aren't strings in any language I use except Python. And AFAIK strings are the only case in Python where a singleton *is* an element, and an element *is* a singleton. (Except it isn't: "ord('ab')" is a TypeError, even though "type('a')" returns "<class 'str'>". <sigh/>)

I thought this was cute when I first encountered it (it happens that I was studying how you can embed a set of elements into the semigroup of sequences of such elements in algebra at the time), but it has *never* been of practical use to me that indexing or iterating a str returns str (rather than a code point). "''.join(list('abc'))" being an identity is an interesting, and maybe useful, fact, but I've never missed it in languages that distinguish characters from strings. Perhaps that's because they generally have a split function defined so that "''.join('abc'.split(''))" is also available for that identity. (N.B. Python doesn't accept an empty separator, but Emacs Lisp does, where "'abc'.split('')" returns "['', 'a', 'b', 'c', '']". I guess it's too late to make this change, though.)

The reason that switching to bytes is a pain is that we changed the return type of indexing bytes to something requiring conversion of literals. You can't write "bytething[i] == b'a'", you need to write "bytething[i] == ord(b'a')", and "b''.join(list(b'abc'))" is an error, not an identity. Of course the world broke!
But we're not designing a language from scratch - we're iterating on one with a 25 year history of design, development, and use.
+1 to that.
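The str/bytes asymmetry described in this message can be checked directly in Python 3. A minimal sketch; the variable names are only for illustration:

```python
text = "abc"
data = b"abc"

# Iterating a str yields length-1 str objects, so join() round-trips.
str_round_trip = "".join(list(text)) == text

# Indexing bytes yields an int, so comparing against a bytes literal fails.
first = data[0]
equals_literal = first == b"a"    # False: int versus bytes
equals_ord = first == ord(b"a")   # True

# Iterating bytes yields ints, which bytes.join() rejects.
try:
    b"".join(list(data))
except TypeError:
    bytes_join_failed = True
else:
    bytes_join_failed = False

# A length-1 slice, unlike an index, keeps the bytes type.
slice_equals_literal = data[0:1] == b"a"

print(str_round_trip, equals_literal, equals_ord,
      bytes_join_failed, slice_equals_literal)
```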

On 22 August 2016 at 19:47, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Sure, but the main concern at hand ("list(strobj)" giving a broken out list of individual code points rather than TypeError) isn't actually related to the fact those individual items are themselves length-1 strings, it's related to the fact that Python normally considers strings to be a sequence type rather than a scalar value type. str is far from the only builtin container type that NumPy gives the scalar treatment when sticking it into an array:
(Interestingly, both bytearray and memoryview get interpreted as "uint8" arrays, unlike the bytes literal - presumably the latter discrepancy is a requirement for compatibility with NumPy's str/unicode handling in Python 2) That's why I suggested that a scalar proxy based on wrapt.ObjectProxy that masked all container related protocols could be an interesting future addition to the standard library (especially if it has been battle-tested on PyPI first). "I want to take this container instance, and make it behave like it wasn't a container, even if other code tries to use it as a container" is usually what people are after when they find str iteration inconvenient, but "treat this container as a scalar value, but otherwise expose all of its methods" is an operation with applications beyond strings. Not-so-coincidentally, that approach would also give us a de facto "code point" type: it would be the result of applying the scalar proxy to a length 1 str instance. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
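To make the "scalar proxy" idea concrete, here is a deliberately small pure-Python sketch. It is not wrapt.ObjectProxy and does not use wrapt at all; the class name ScalarProxy, the helper refuses(), and the choice to forward only public attribute names are assumptions made for illustration. It relies on the fact that implicit special-method lookups (iter(), len(), subscripting) go through the type rather than through __getattr__, so the wrapped object's container protocols stay hidden.

```python
class ScalarProxy:
    """Hypothetical wrapper that exposes ordinary methods but no container protocols."""

    __slots__ = ("_wrapped",)
    __iter__ = None  # explicit opt-out of iteration (Python 3.6+)

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Reached only when normal lookup fails; forward public names only.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._wrapped, name)

    def __repr__(self):
        return "ScalarProxy({!r})".format(self._wrapped)


def refuses(operation, obj):
    """Return True if operation(obj) raises TypeError."""
    try:
        operation(obj)
    except TypeError:
        return True
    return False


word = ScalarProxy("abc")
print(word.upper())                        # ABC (a plain str, not a proxy)
print(refuses(iter, word))                 # True
print(refuses(len, word))                  # True
print(refuses(lambda obj: obj[0], word))   # True
```

A production version would also have to decide which operators, comparisons, and hashing behaviour to forward, which is where a battle-tested proxy library would earn its keep.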

On 21.08.2016 04:52, Steven D'Aprano wrote:
Agreed. Especially those "we need to distinguish between char and string" calls are somewhat irritating. I need to work with such languages at work sometimes and honestly: it sucks (but that may just be me). Furthermore, I don't see much benefit at all. First, the initial run and/or the first test will reveal the wrong behavior. Second, it just makes sense if people use a generic variable (say 'var') for different types of objects. But, well, people shouldn't do that in the first place. Third, it would make iterating over a string more cumbersome. Especially the last point makes me -1 on this proposal. My 2 cents, Sven

On Fri, Aug 19, 2016, at 23:13, Alexander Heger wrote:
> Numpy was of course designed a lot later, with more experience of practical use in mind.

The meaning of np.array('abc') is a bit baffling to someone with no experience in numpy. It doesn't seem to be a one-dimensional array containing 'abc', as your next statement suggests. It seems to be a zero-dimensional array?

> Maybe a starting point for a transition, so that the latter operation also returns ['abc'] in the long run

Just to be clear, are you proposing a generalized list(obj: non-iterable) constructor that returns [obj]?

This is a feature, not a flaw.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Alexander Heger
Sent: Friday, August 19, 2016 11:14 PM
To: python-ideas <python-ideas@python.org>
Subject: [Python-ideas] discontinue iterable strings

On Aug 19, 2016 11:14 PM, "Alexander Heger" <python@2sn.net> wrote:
> Standard Python should stop treating strings as iterables of characters, i.e. length-1 strings. I see this as one of the biggest design flaws of Python. It may have seemed genius at the time, but it has outlived its usefulness for practical language use.

I'm bothered by it whenever I want to write code that takes a sequence and returns a sequence of the same type. But I don't think that the answer is to remove the concept of strings as sequences. And I don't want strings to be sequences of character code points, because that's forcing humans to think on the implementation level. Please explain the problem with the status quo, preferably with examples where it goes wrong.

> np.array('abc')
> array('abc', dtype='<U3')

That says, "This is a 0-dimensional array of 3-char Unicode strings." Numpy doesn't recognize the string as a specification of an array. Try `np.array(4.)` and you'll get (IIRC) `array(4., dtype='float')`, which has shape `()`. Numpy probably won't let you index either one. What can you even do with it? (By the way, notice that the string size is part of the dtype.)

> Numpy was of course designed a lot later, with more experience of practical use in mind.

Numpy is for numbers. It was designed with numbers in mind. Numpy's relevant experience here is waaaay less than general Python's.

my apologies about the confusion, the dim=() was not the point, but rather that numpy treats the string as a monolithic object rather than disassembling it as it would do with other iterables. I was just trying to give the simplest possible example. Numpy still does
but not here
-Alexander

On 20 August 2016 at 15:47, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
> On Aug 19, 2016 11:14 PM, "Alexander Heger" <python@2sn.net> wrote:
> > Standard Python should stop treating strings as iterables of characters, i.e. length-1 strings. I see this as one of the biggest design flaws of Python. It may have seemed genius at the time, but it has outlived its usefulness for practical language use.
>
> I'm bothered by it whenever I want to write code that takes a sequence and returns a sequence of the same type. But I don't think that the answer is to remove the concept of strings as sequences. And I don't want strings to be sequences of character code points, because that's forcing humans to think on the implementation level. Please explain the problem with the status quo, preferably with examples where it goes wrong.
>
> That says, "This is a 0-dimensional array of 3-char Unicode strings." Numpy doesn't recognize the string as a specification of an array. Try `np.array(4.)` and you'll get (IIRC) `array(4., dtype='float')`, which has shape `()`. Numpy probably won't let you index either one. What can you even do with it? (By the way, notice that the string size is part of the dtype.)

It is a generalisation of n-dimensional arrays; you can index it using '()'. The point is that it does not try to disassemble it into elements as it would do with other iterables.

> Numpy is for numbers. It was designed with numbers in mind. Numpy's relevant experience here is waaaay less than general Python's.

But it does deal with strings as monolithic objects, doing away with many of the pitfalls of strings in Python. And yes, it does a lot about memory management, so it is fully aware of strings and bytes ...

This would introduce a major inconsistency. To do this, you would need to also strip strings of their status as sequences (in collections.abc, Sequence is a subclass of Iterable). Thus, making strings no longer iterable would also mean you could no longer take the length or slice of a string.
You can always define __len__ and __index__ independently. I do this for many objects. But it is a point I have not considered.
While I believe your proposal was well intentioned, IMHO it would cause a giant inconsistency in Python (why would one of our core sequences not be iterable?)
Yes, I am aware it will cause a lot of backward incompatibilities, but this is based on all the lengthy discussions about "string but not iterable" type determinations. If a string was not iterable, a lot of things would also be easier. You could also argue why an integer cannot be iterated over its bits? As has been noted, a string is one of the few objects whose component can be the object itself.

    'a'[0] == 'a'

I do not iterate over strings so often that it could not be done using, e.g., str.chars():

    for c in str.chars():
        print(c)

On 20 August 2016 at 13:24, Edward Minnix <egregius313@gmail.com> wrote:

On Sat, Aug 20, 2016 at 4:28 PM, Alexander Heger <python@2sn.net> wrote:
Yes, I am aware it will cause a lot of backward incompatibilities...
Tell me, would you retain the ability to subscript a string to get its characters?

    >>> "asdf"[0]
    'a'

If not, you break a ton of code. If you do, they are automatically iterable *by definition*. Watch:

    class IsThisIterable:
        def __getitem__(self, idx):
            if idx < 5: return idx*idx
            raise IndexError

So you can't lose iteration without also losing subscripting.

ChrisA
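
The fallback described in this message can be checked directly. The following is a minimal sketch that reuses the class from the message; the `list()` call and the `squares` name are additions for illustration only.

```python
class IsThisIterable:
    # Only __getitem__ is defined; there is no __iter__ method.
    def __getitem__(self, idx):
        if idx < 5:
            return idx * idx
        raise IndexError


# iter() falls back to calling __getitem__ with 0, 1, 2, ...
# and stops when IndexError is raised.
squares = list(IsThisIterable())
print(squares)  # [0, 1, 4, 9, 16]
```

This is the legacy sequence-iteration protocol, which is why keeping subscripting while dropping iteration needs an explicit opt-out rather than simply omitting `__iter__`.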

On Sat, Aug 20, 2016 at 3:48 AM Chris Angelico <rosuav@gmail.com> wrote:
A separate character type would solve that issue. While Alexander Heger was advocating for a "monolithic object," and may in fact not want subscripting, I think he's more frustrated by the fact that iterating over a string gives other strings. If instead a 1-length string were a different, non-iterable type, that might avoid some problems. However, special-casing a character as a different type would bring its own problems. Note the annoyance of iterating over bytes and getting integers. In case it's not clear, I should add that I disagree with this proposal and do not want any change to strings.

On Sat, Aug 20, 2016 at 10:31 PM, Michael Selik <michael.selik@gmail.com> wrote:
Agreed. One of the handy traits of cross-platform code is that MANY languages let you subscript a double-quoted string to get a single-quoted character. Compare these blocks of code:

    if ("asdf"[0] == 'a') write("The first letter of asdf is a.\n");

    if ("asdf"[0] == 'a'): print("The first letter of asdf is a.")

    if ("asdf"[0] == 'a') console.log("The first letter of asdf is a.")

    if ("asdf"[0] == 'a') printf("The first letter of asdf is a.\n");

    if ("asdf"[0] == 'a') echo("The first letter of asdf is a.\n");

Those are Pike, Python, JavaScript/ECMAScript, C/C++, and PHP, respectively. Two of them treat single-quoted and double-quoted strings identically (Python and JS). Two use double quotes for strings and single quotes for character (aka integer) constants (Pike and C). One has double quotes for interpolated and single quotes for non-interpolated strings (PHP). And just to mess you up completely, two (or three) of these define strings to be sequences of bytes (C/C++ and PHP, plus Python 2), two as sequences of Unicode codepoints (Python and Pike), and one as sequences of UTF-16 code units (JS). But in all five, subscripting a double-quoted string yields a single-quoted character.

I'm firmly of the opinion that this should not change. Code clarity is not helped by creating a brand-new "character" type and not having a corresponding literal for it, and the one obvious literal, given the amount of prior art using it, would be some form of quote character - probably the apostrophe. Since that's not available, I think a character type would be a major hurdle to get over.

ChrisA

That would require strings to also not be sequences, or to totally drop the sequence protocol. These are non-starters. They *will not* happen. Not they shouldn’t happen, or they probably won’t happen. They cannot and will not happen. That is a much bigger break than they were even willing to make between 2 and 3.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Alexander Heger Sent: Saturday, August 20, 2016 4:52 PM To: Chris Angelico <rosuav@gmail.com> Cc: python-ideas <python-ideas@python.org> Subject: Re: [Python-ideas] discontinue iterable strings

I was not proposing a character type, only that strings are not iterable:

    for i in 'abc':
        print(i)

    TypeError: 'str' object is not iterable

On Sat, Aug 20, 2016 at 4:57 PM Alexander Heger <python@2sn.net> wrote:
You can quibble with the original design choice, but unless you borrow Guido's time machine, there's not much point to that discussion. Instead, let's talk about the benefits and problems that your change proposal would cause.

Benefits:
- no more accidentally using str as an iterable

Problems:
- any code that subscripts, slices, or iterates over a str will break

Did I leave anything out? How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?
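
For readers who have not run into the benefit named in this message, the accidental case usually looks like the sketch below. The `shout` helper and its inputs are hypothetical and exist only to illustrate the failure mode.

```python
def shout(words):
    """Upper-case every item in an iterable of words (hypothetical helper)."""
    return [w.upper() for w in words]


intended = shout(["spam", "eggs"])
# A bare string is silently accepted and split into characters,
# because str is itself an iterable of length-1 strings.
accidental = shout("spam")

print(intended)    # ['SPAM', 'EGGS']
print(accidental)  # ['S', 'P', 'A', 'M']
```

No exception is raised in the second call, so the mistake tends to surface later, far from where the string was passed in.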

Just to be clear, at the time it was designed, it surely was a genius idea with its obvious simplicity. I spend much of my time refactoring codes and interfaces from previous "genius" ideas, as usage matures.
I would try to keep indexing and slicing, but not iterating. Though there have been comments that this may not be straightforward to implement. Not sure if strings would need to acquire a "substring" attribute that can be indexed and sliced.
Did I leave anything out?
How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?
First one needs to add the extension string attributes like split()/split(''), chars(), and substring[] (Python 3.7). When indexing becomes disallowed (Python 3.10 / 4.0) attempts to iterate (or slice) will raise TypeError. The fixes overall will be a lot easier and more obvious than the introduction of unicode as default string type in Python 3.0. It could already be used/tested starting with Python 3.7 using `from future import __monolythic_strings__`.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of אלעזר Sent: Saturday, August 20, 2016 5:56 PM To: python-ideas <python-ideas@python.org> Subject: Re: [Python-ideas] discontinue iterable strings

On Sun, Aug 21, 2016 at 12:28 AM Alexander Heger <mailto:python@2sn.net> wrote:
Did I leave anything out? How would you weigh the benefits against the problems? How would you manage the upgrade path for code that's been broken?
First one needs to add the extension string attributes like split()/split(''), chars(), and substring[] (Python 3.7). When indexing becomes disallowed (Python 3.10 / 4.0) attempts to iterate (or slice) will raise TypeError. The fixes overall will be a lot easier and more obvious than the introduction of unicode as default string type in Python 3.0. It could already be used/tested starting with Python 3.7 using `from future import __monolythic_strings__`.

Is there any equivalent __future__ import with such deep semantic implications? Most imports I can think of are mainly syntactic. And what would it do? Change the type of string literals? Change the behavior of str methods locally in this module? Globally? How will this play with 3rd party libraries? Sounds like it will break stuff in a way that cannot be locally fixed.
~Elazar

from __future__ import unicode_literals outright changes the type of object string literals make (in python 2). If you were to create a non-iterable, non-sequence text type (a horrible idea, IMO) the same thing can be done for that.

On Sun, Aug 21, 2016 at 6:08 PM, <tritium-list@sdamon.com> wrote:
from __future__ import unicode_literals outright changes the type of object string literals make (in python 2). If you were to create a non-iterable, non-sequence text type (a horrible idea, IMO) the same thing can be done for that.
It could; but that just changes what *literals* make. But what about other sources of strings - str()? bytes.decode()? format()? repr()? Which ones get changed, and which don't? There's no easy way to do this.

ChrisA

On Sat, Aug 20, 2016 at 5:27 PM Alexander Heger <python@2sn.net> wrote:
- any code that subscripts, slices, or iterates over a str will break
I would try to keep indexing and slicing, but not iterating.
So anything that wants to loop over a string character by character would need to construct a new object, like ``for c in list(s)``? That seems inefficient. I suppose you might advocate for a new type, some sort of stringview that would allow iteration over a string, but avoid allocating so much space as a list does, but that might bring us back to where we started.
The fixes overall will be a lot easier and obvious than introduction of unicode as default string type in Python 3.0.
That's a bold claim. Have you considered what's at stake if that's not true? Anyway, why don't you write a proof of concept module for a non-iterable string, throw it on PyPI, and see if people like using it?
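
A proof of concept along these lines can be quite small. The sketch below is one hypothetical way to start, assuming Python 3.6 or later; `NonIterableStr` is an invented name, not an existing PyPI package or a class proposed in the thread.

```python
class NonIterableStr(str):
    """Hypothetical str subclass: indexing and slicing work, iteration does not."""

    # Python 3.6+: a None-valued __iter__ makes iter() raise TypeError
    # instead of using the inherited str iterator.
    __iter__ = None


s = NonIterableStr("abc")
first = s[0]     # indexing still works
tail = s[1:]     # slicing still works, but returns a plain str
length = len(s)

try:
    iter(s)
    iteration_blocked = False
except TypeError:
    iteration_blocked = True

print(first, tail, length, iteration_blocked)
```

Note that the slice comes back as an ordinary, iterable `str`, which is a small instance of the "other sources of strings" problem raised elsewhere in the thread: a subclass cannot change what the built-in methods return.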

On Sun, Aug 21, 2016 at 12:34:02AM +0000, Michael Selik wrote:
If this was ten years ago, and we were talking about backwards incompatible changes for the soon-to-be-started Python 3000, I might be more responsive to changing strings to be an atomic type (like ints, floats, etc) with a .chars() view that iterates over the characters. Or something like what Go does (I think), namely to distinguish between Chars and Strings: indexing a string gives you a Char, and Chars are not indexable and not iterable. But even then, the change probably would have required a PEP.
Saying that these so-called "fixes" (we haven't established yet that Python's string behaviour is a bug that needs fixing) will be easier and more obvious than the change to Unicode is not that bold a claim. Pretty much everything is easier and more obvious than changing to Unicode. :-) (Possibly not bringing peace to the Middle East.)
I think that while the suggestion does bring some benefit, the benefit isn't enough to make up for the code churn and disruption it would cause. But I encourage the OP to go through the standard library, pick a couple of modules, and re-write them to see how they would look using this proposal.

-- Steve

On Sun, Aug 21, 2016 at 12:52 PM, Steven D'Aprano <steve@pearwood.info> wrote:
And yet it's so simple. We can teach novice programmers about two's complement [1] representations of integers, and they have no trouble comprehending that the abstract concept of "integer" is different from the concrete representation in memory. We can teach intermediate programmers how hash tables work, and how to improve their performance on CPUs with 64-byte cache lines - again, there's no comprehension barrier between "mapping from key to value" and "puddle of bytes in memory that represent that mapping". But so many programmers are entrenched in the thinking that a byte IS a character.
Python still has a rule that you can iterate over anything that has __getitem__, and it'll be called with 0, 1, 2, 3... until it raises IndexError. So you have two options: Remove that rule, and require that all iterable objects actually define __iter__; or make strings non-subscriptable, which means you need to do something like "asdf".char_at(0) instead of "asdf"[0]. IMO the second option is a total non-flyer - good luck convincing anyone that THAT is an improvement. The first one is possible, but dramatically broadens the backward-compatibility issue. You'd have to search for any class that defines __getitem__ and not __iter__.

If that *does* get considered, it wouldn't be too hard to have a compatibility function, maybe in itertools.

    def subscript(self):
        i = 0
        try:
            while "moar indexing":
                yield self[i]
                i += 1
        except IndexError:
            pass

    class Demo:
        def __getitem__(self, item):
            ...
        __iter__ = itertools.subscript

But there'd have to be the full search of "what will this break", even before getting as far as making strings non-iterable.

ChrisA

[1] Not "two's compliment", although I'm told that Two can say some very nice things.

On 2016-08-20 21:10, Chris Angelico wrote:
Isn't the rule that __getitem__ iteration is available only if __iter__ is not explicitly defined? So there is a third option: retain __getitem__ but give this new modified string type an explicit __iter__ that raises TypeError.

That said, I'm not sure I really support the overall proposal to change the behavior of strings. I agree that it is annoying that sometimes when you try to iterate over something you accidentally end up iterating over the characters of a string, but it's been that way for quite a while and changing it would be a significant behavior change. It seems like the main practical problem might be solved by just providing a standard library function iter_or_string or whatever, that just returns a one-item iterator if its argument is a string, or the normal iterator if not. It seems that gazillions of libraries already define such a function, and the main problem is just that, because there is no standard one, many people don't realize they need it until they accidentally iterate over a string and their code goes awry.

-- Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
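
The helper suggested in this message is easy to sketch. The version below is an assumption about what such a function might look like, not an existing standard library API; treating `bytes` the same way as `str` is also an assumption.

```python
def iter_or_string(obj):
    """Hypothetical helper: a string counts as one item, anything else is iterated."""
    if isinstance(obj, (str, bytes)):
        # Wrap the string in a one-element tuple so it is yielded whole.
        return iter((obj,))
    return iter(obj)


from_string = list(iter_or_string("abc"))
from_list = list(iter_or_string(["a", "b"]))

print(from_string)  # ['abc']
print(from_list)    # ['a', 'b']
```

Code that accepts "one name or many names" can call this at its boundary and then iterate without a special case.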

On Sun, Aug 21, 2016 at 3:06 PM, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
Hmm. It would somehow need to be recognized as "not iterable". I'm not sure how this detection is done; is it based on the presence/absence of __iter__, or is it by calling that method and seeing what comes back? If the latter, then sure, an __iter__ that raises would cover that. ChrisA

On Sun, Aug 21, 2016 at 5:27 AM, Chris Angelico <rosuav@gmail.com> wrote:
PyObject_GetIter calls __iter__ (i.e. tp_iter) if it's defined. To get a TypeError, __iter__ can return an object that's not an iterator, i.e. an object that doesn't have a __next__ method (i.e. tp_iternext). For example:

    >>> class C:
    ...     def __iter__(self): return self
    ...     def __getitem__(self, index): return 42
    ...
    >>> iter(C())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: iter() returned non-iterator of type 'C'

If __iter__ isn't defined but __getitem__ is defined, then PySeqIter_New is called to get a sequence iterator.

    >>> class D:
    ...     def __getitem__(self, index): return 42
    ...
    >>> it = iter(D())
    >>> type(it)
    <class 'iterator'>
    >>> next(it)
    42

On 21 August 2016 at 16:02, eryk sun <eryksun@gmail.com> wrote:
I believe Chris's concern was that "isinstance(obj, collections.abc.Iterable)" would still return True. That's actually a valid concern, but Python 3.6 generalises the previously __hash__ specific "__hash__ = None" anti-registration mechanism to other protocols, including __iter__: https://hg.python.org/cpython/rev/72b9f195569c

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Aug 21, 2016 at 1:27 AM Chris Angelico <rosuav@gmail.com> wrote:
The detection of not hashable via __hash__ set to None was necessary, but not desirable. Better to have never defined the method/attribute in the first place. Since __iter__ isn't present on ``object``, we're free to use the better technique of not defining __iter__ rather than defining it as None, NotImplemented, etc. This is superior, because we don't want __iter__ to show up in a dir(), help(), or other tools.

On Sun, Aug 21, 2016 at 6:34 AM, Michael Selik <michael.selik@gmail.com> wrote:
The point is to be able to define __getitem__ without falling back on the sequence iterator. I wasn't aware of the recent commit that allows anti-registration of __iter__. This is perfect:

    >>> class C:
    ...     __iter__ = None
    ...     def __getitem__(self, index): return 42
    ...

On Sun, Aug 21, 2016 at 2:46 AM eryk sun <eryksun@gmail.com> wrote:
For that to make sense, Iterable should be a parent of C, or C should be a subclass of something registered as an Iterable. Otherwise it'd be creating a general recommendation to say ``__iter__ = None`` on every non-Iterable class, which would be silly.

On Sun, Aug 21, 2016 at 2:53 AM Michael Selik <michael.selik@gmail.com> wrote:
I see your point for avoiding iterability when having __getitem__, but I hope that's seen as an anti-pattern that reduces flexibility. And I should learn to stop hitting the send button halfway through my email.

On Sun, Aug 21, 2016 at 6:53 AM, Michael Selik <michael.selik@gmail.com> wrote:
Iterable is a one-trick pony ABC that formerly just checked for an __iter__ method using any("__iter__" in B.__dict__ for B in C.__mro__). It was mentioned that the default __getitem__ iterator can be avoided by defining __iter__ as a callable that either directly or indirectly raises a TypeError, but that's an instance of Iterable, which is misleading.

In 3.6 you can instead set `__iter__ = None`. At the low-level, slot_tp_iter has been updated to look for this with the following code:

    func = lookup_method(self, &PyId___iter__);
    if (func == Py_None) {
        Py_DECREF(func);
        PyErr_Format(PyExc_TypeError,
                     "'%.200s' object is not iterable",
                     Py_TYPE(self)->tp_name);
        return NULL;
    }

At the high level, Iterable.__subclasshook__ calls _check_methods(C, "__iter__"):

    def _check_methods(C, *methods):
        mro = C.__mro__
        for method in methods:
            for B in mro:
                if method in B.__dict__:
                    if B.__dict__[method] is None:
                        return NotImplemented
                    break
            else:
                return NotImplemented
        return True
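
The three ways of handling `__iter__` discussed in this sub-thread can be compared side by side. This is a minimal sketch assuming Python 3.6 or later; the class names `Fallback`, `Raising` and `Blocked` are invented for the comparison.

```python
from collections.abc import Iterable


class Fallback:
    # No __iter__: iter() uses the legacy __getitem__ protocol.
    def __getitem__(self, index):
        if index < 3:
            return index
        raise IndexError


class Raising(Fallback):
    # __iter__ exists, so the ABC check passes even though it always fails.
    def __iter__(self):
        raise TypeError("not iterable")


class Blocked(Fallback):
    # Python 3.6+ anti-registration: iter() raises and the ABC check fails.
    __iter__ = None


def can_iterate(obj):
    """Return True if iter(obj) succeeds, False if it raises TypeError."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


results = {
    cls.__name__: (can_iterate(cls()), isinstance(cls(), Iterable))
    for cls in (Fallback, Raising, Blocked)
}
print(results)
# {'Fallback': (True, False), 'Raising': (False, True), 'Blocked': (False, False)}
```

Only the `__iter__ = None` form makes the runtime behaviour and the `isinstance` answer agree for an object that defines `__getitem__` but should not be iterated.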

On 21 August 2016 at 14:10, Chris Angelico <rosuav@gmail.com> wrote:
That's not actually true - any type that defines __getitem__ can prevent iteration just by explicitly raising TypeError from __iter__. It would be *weird* to do so, but it's entirely possible.

However, the real problem with this proposal (and the reason why the switch from 8-bit str to "bytes are effectively a tuple of ints" in Python 3 was such a pain), is that there are a lot of bytes and text processing operations that *really do* operate code point by code point. Scanning a path for directory separators, scanning a CSV (or other delimited format) for delimiters, processing regular expressions, tokenising according to a grammar, analysing words in a text for character popularity, answering questions like "Is this a valid identifier?" all involve looking at each character in a sequence individually, rather than looking at the character sequence as an atomic unit.

The idiomatic pattern for doing that kind of "item by item" processing in Python is iteration (whether through the Python syntax and builtins, or through the CPython C API).

Now, if we were designing a language from scratch today, there's a strong case to be made that the *right* way to represent text is to have a stream-like interface (e.g. StringIO, BytesIO) around an atomic type (e.g. CodePoint, int). But we're not designing a language from scratch - we're iterating on one with a 25 year history of design, development, and use.

There may also be a case to be made for introducing an AtomicStr type into Python's data model that works like a normal string, but *doesn't* support indexing, slicing, or iteration, and is instead an opaque blob of data that nevertheless supports all the other usual string operations. (Similar to the way that types.MappingProxyType lets you provide a read-only view of an otherwise mutable mapping, and that collections.KeysView, ValuesView and ItemsView provide different interfaces for a common underlying mapping)

But changing the core text type itself to no longer be suitable for use in text processing tasks? Not gonna happen :)

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 21 August 2016 at 15:22, Nick Coghlan <ncoghlan@gmail.com> wrote:
Huh, prompted by Brendan Barnwell's comment, I just realised that a discussion I was having with Graham Dumpleton at PyCon Australia about getting the wrapt module (or equivalent functionality) into Python 3.7 (not 3.6 just due to the respective timelines) is actually relevant here: given wrapt.ObjectProxy (see http://wrapt.readthedocs.io/en/latest/wrappers.html#object-proxy ) it shouldn't actually be that difficult to write an "atomic_proxy" implementation that wraps arbitrary container objects in a proxy that permits most operations, but actively *prevents* them from being treated as collections.abc.Container instances of any kind.

So if folks are looking for a way to resolve the perennial problem of "How do I tell container processing algorithms to treat *this particular container* as an opaque blob?" that arises most often with strings and binary data, I'd highly recommend that as a potentially fruitful avenue to start exploring.

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Aug 21, 2016 1:23 AM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
Thought: A string, in compsci, is, sort of by definition, a sequence of characters. It is short for "string of characters", isn't it? If you were to create a new language, and you don't want to think of strings as char sequences, you might have a type called Text instead. Programmers could be required to call functions to get iterables, such as myText.chars(), myText.lines(), and even myText.words(). Thus, the proposal makes str try to be a Text type rather than the related but distinct String type.

Nick Coghlan writes:
Sure, but code points aren't strings in any language I use except Python. And AFAIK strings are the only case in Python where a singleton *is* an element, and an element *is* a singleton. (Except it isn't: "ord('ab')" is a TypeError, even though "type('a')" returns "<class 'str'>". <sigh/>)

I thought this was cute when I first encountered it (it happens that I was studying how you can embed a set of elements into the semigroup of sequences of such elements in algebra at the time), but it has *never* been of practical use to me that indexing or iterating a str returns str (rather than a code point). "''.join(list('abc'))" being an identity is an interesting, and maybe useful, fact, but I've never missed it in languages that distinguish characters from strings. Perhaps that's because they generally have a split function defined so that "''.join('abc'.split(''))" is also available for that identity. (N.B. Python doesn't accept an empty separator, but Emacs Lisp does, where "'abc'.split('')" returns "['', 'a', 'b', 'c', '']". I guess it's too late to make this change, though.)

The reason that switching to bytes is a pain is that we changed the return type of indexing bytes to something requiring conversion of literals. You can't write "bytething[i] == b'a'", you need to write "bytething[i] == ord(b'a')", and "b''.join(list(b'abc'))" is an error, not an identity. Of course the world broke!
But we're not designing a language from scratch - we're iterating on one with a 25 year history of design, development, and use.
+1 to that.
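
The str/bytes asymmetry described in this message can be reproduced in a few lines. This is a small sketch for Python 3; the variable names are chosen for illustration only.

```python
text = "abc"
data = b"abc"

# Indexing str gives a length-1 str; indexing bytes gives an int.
text_item = text[0]
data_item = data[0]
print(repr(text_item), repr(data_item))  # 'a' 97

# For str, splitting into items and joining again is an identity.
str_roundtrip = "".join(list(text)) == text

# For bytes, the items are ints, so join() rejects them.
try:
    b"".join(list(data))
    bytes_roundtrip = True
except TypeError:
    bytes_roundtrip = False

# Comparing an item against a bytes literal needs a conversion.
naive_compare = data[0] == b"a"
converted_compare = data[0] == ord(b"a")

# Python rejects an empty separator for split().
try:
    text.split("")
    empty_separator_ok = True
except ValueError:
    empty_separator_ok = False

print(str_roundtrip, bytes_roundtrip, naive_compare, converted_compare, empty_separator_ok)
```

A length-1 slice such as `data[0:1]` does return `b'a'`, which is the usual workaround when a bytes-typed item is wanted.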

On 22 August 2016 at 19:47, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Sure, but the main concern at hand ("list(strobj)" giving a broken out list of individual code points rather than TypeError) isn't actually related to the fact those individual items are themselves length-1 strings, it's related to the fact that Python normally considers strings to be a sequence type rather than a scalar value type. str is far from the only builtin container type that NumPy gives the scalar treatment when sticking it into an array:
(Interestingly, both bytearray and memoryview get interpreted as "uint8" arrays, unlike the bytes literal - presumably the latter discrepancy is a requirement for compatibility with NumPy's str/unicode handling in Python 2) That's why I suggested that a scalar proxy based on wrapt.ObjectProxy that masked all container related protocols could be an interesting future addition to the standard library (especially if it has been battle-tested on PyPI first). "I want to take this container instance, and make it behave like it wasn't a container, even if other code tries to use it as a container" is usually what people are after when they find str iteration inconvenient, but "treat this container as a scalar value, but otherwise expose all of its methods" is an operation with applications beyond strings. Not-so-coincidentally, that approach would also give us a de facto "code point" type: it would be the result of applying the scalar proxy to a length 1 str instance. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
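
To make the scalar proxy idea concrete, here is a deliberately simplified sketch. It does not use wrapt and is not the implementation suggested in this message; `ScalarProxy` is a hypothetical class that relies only on the fact that implicit special-method lookups bypass `__getattr__`.

```python
class ScalarProxy:
    """Hypothetical, simplified stand-in for a wrapt-based scalar proxy."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Called only for names not found on the proxy itself, so ordinary
        # methods such as upper() are forwarded to the wrapped object.
        if name == "_wrapped":
            # Guard against recursion if __init__ has not run yet.
            raise AttributeError(name)
        return getattr(self._wrapped, name)

    def __repr__(self):
        return "ScalarProxy({!r})".format(self._wrapped)


def is_blocked(operation, obj):
    """Return True if operation(obj) raises TypeError."""
    try:
        operation(obj)
    except TypeError:
        return True
    return False


p = ScalarProxy("abc")
upper = p.upper()                           # forwarded method call
iter_blocked = is_blocked(iter, p)          # no __iter__ or __getitem__ on the type
len_blocked = is_blocked(len, p)            # no __len__ on the type
index_blocked = is_blocked(lambda o: o[0], p)

print(upper, iter_blocked, len_blocked, index_blocked)
```

Because `iter()`, `len()` and subscripting look up special methods on the type rather than the instance, the container protocols are hidden without any explicit blocking, while named methods still reach the wrapped string. A production version would also need to forward comparisons, hashing and formatting, which this sketch leaves out.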

On 21.08.2016 04:52, Steven D'Aprano wrote:
Agreed. Especially those "we need to distinguish between char and string" calls are somewhat irritating. I need to work with such languages at work sometimes and honestly: it sucks (but that may just be me). Furthermore, I don't see much benefit at all. First, the initial run and/or the first test will reveal the wrong behavior. Second, it just makes sense if people use a generic variable (say 'var') for different types of objects. But, well, people shouldn't do that in the first place. Third, it would make iterating over a string more cumbersome. Especially the last point makes me -1 on this proposal. My 2 cents, Sven

On Fri, Aug 19, 2016, at 23:13, Alexander Heger wrote:
Numpy was of course designed a lot later, with more experience in practical use (in mind).
The meaning of np.array('abc') is a bit baffling to someone with no experience in numpy. It doesn't seem to be a one-dimensional array containing 'abc', as your next statement suggests. It seems to be a zero-dimensional array?
Maybe a starting point for transition that latter operation also returns ['abc'] in the long run
Just to be clear, are you proposing a generalized list(obj: non-iterable) constructor that returns [obj]?

This is a feature, not a flaw.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of Alexander Heger Sent: Friday, August 19, 2016 11:14 PM To: python-ideas <python-ideas@python.org> Subject: [Python-ideas] discontinue iterable strings

standard python should discontinue to see strings as iterables of characters - length-1 strings. I see this as one of the biggest design flaws of python. It may have seemed genius at the time, but it has outlived its usefulness for practical language use. For example, numpy has no issues

    >>> np.array('abc')
    array('abc', dtype='<U3')

whereas, as all know,

    >>> list('abc')
    ['a', 'b', 'c']

Numpy was of course designed a lot later, with more experience in practical use (in mind). Maybe a starting point for a transition, so that the latter operation also returns ['abc'] in the long run, could be to have an explicit split operator as recommended use, e.g.,

    'abc'.split()
    'abc'.split('')
    'abc'.chars()
    'abc'.items()

the latter two could return an iterator whereas the former two return lists (currently raise exceptions). Similar for bytes, etc.

On Aug 19, 2016 11:14 PM, "Alexander Heger" <python@2sn.net> wrote:
standard python should discontinue to see strings as iterables of characters - length-1 strings. I see this as one of the biggest design flaws of python. It may have seemed genius at the time, but it has outlived its usefulness for practical language use.
I'm bothered by it whenever I want to write code that takes a sequence and returns a sequence of the same type. But I don't think that the answer is to remove the concept of strings as sequences. And I don't want strings to be sequences of character code points, because that's forcing humans to think on the implementation level. Please explain the problem with the status quo, preferably with examples where it goes wrong.
That says, "This is a 0-length array of 3-char Unicode strings." Numpy doesn't recognize the string as a specification of an array. Try `np.array(4.)` and you'll get (IIRC) `array(4., dtype='float')`, which has shape `()`. Numpy probably won't let you index either one. What can you even do with it? (By the way, notice that the string size is part of the dtype.)
Numpy is for numbers. It was designed with numbers in mind. Numpy's relevant experience here is waaaay less than general Python's.

my apologies about the confusion, the dim=() was not the point, but rather that numpy treats the string as a monolithic object rather than disassembling it as it would do with other iterables. I was just trying to give the simplest possible example. Numpy still does
but not here
-Alexander

participants (14)
- Alexander Heger
- Brendan Barnwell
- Chris Angelico
- Edward Minnix
- eryk sun
- Franklin? Lee
- Michael Selik
- Nick Coghlan
- Random832
- Stephen J. Turnbull
- Steven D'Aprano
- Sven R. Kunze
- tritium-list@sdamon.com
- אלעזר