[Typing-sig] A proposal for str vs. Iterable[str]

June 14, 2019

Hi everyone,
I know there was previously some discussion about ways to resolve str vs.
Iterable[str] sometimes causing issues (
https://github.com/python/typing/issues/256), and I wanted to bring it up
again because I have a suggestion that appears to have not been explored
significantly in the discussions.

A lot of the requests are something like "special case length-1 strings at
runtime", which is tricky.

But in that issue, a few people propose another option: have str.__iter__
return a type that isn't str. This returned type should itself be
non-iterable. I think that alternative didn't quite get explored enough, so
I wanted to bring it up for more discussion.

This has some nice properties compared to a lot of the other suggestions
that make it easier to implement: it requires no changes to
python-the-language, and no changes to mypy or other type-checkers. It only
(I think) requires changes to the type stubs for the standard library and
builtins. Whether this stricter behavior should hide behind a flag is
another question that I don't address.

So the problem is that, given a function like

from typing import Sequence

def many_strs(strs: Sequence[str]) -> str:
    return strs[0]

the call `many_strs('aaa')` will typecheck successfully, even though it's
exceedingly unlikely, given the annotations, that this is behavior the user
wanted or expected.
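To make the problem concrete, here is a small runnable sketch. The function is the one above; the variable names and the runtime check are illustrative additions. A str counts as a Sequence[str] for the type checker and at runtime alike:

```python
from collections.abc import Sequence as RuntimeSequence
from typing import Sequence


def many_strs(strs: Sequence[str]) -> str:
    return strs[0]


# Intended use: a sequence of several strings.
first_of_list = many_strs(["aaa", "bbb"])

# Accidental use: a bare string is accepted too, and yields one character.
first_of_str = many_strs("aaa")

# str is registered as a Sequence at runtime, and each element is again a str.
str_is_sequence = isinstance("aaa", RuntimeSequence)
```

The accidental call returns a single character instead of raising, which is why this kind of mistake tends to surface far from where it was made.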

So let's add a new type, which I'll refer to as "OpaqueStr", since it's
meant to be non-iterable and non-container-like (since most instances are
just going to be a length-1 string from str.__iter__). It should only
implement a small subset of the str methods (capitalize, add, join, upper,
lower, all of the isXYZ), but not any of the container-like ones: no
__iter__, no __len__, no strip, translate, format, find, etc. Those don't
make sense for a type like this.
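A rough runtime sketch of what such a type might look like follows. The actual proposal only touches type stubs, so this class is a stand-in for illustration. The return types chosen here (upper/lower returning OpaqueStr, __add__ returning str) are assumptions, not something the design above pins down:

```python
from typing import Union


class OpaqueStr:
    """Runtime stand-in for the proposed type (assumption: wraps one str)."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    # A few of the non-container methods that would be kept.
    def upper(self) -> "OpaqueStr":
        return OpaqueStr(self._value.upper())

    def lower(self) -> "OpaqueStr":
        return OpaqueStr(self._value.lower())

    def isdigit(self) -> bool:
        return self._value.isdigit()

    def isalpha(self) -> bool:
        return self._value.isalpha()

    def __add__(self, other: Union[str, "OpaqueStr"]) -> str:
        # Concatenation can produce more than one character, so return str.
        return self._value + str(other)

    def __str__(self) -> str:
        return self._value

    # Deliberately no __iter__, __len__, __getitem__ or __contains__.
```

Because none of the container protocols are defined, `iter()` and `len()` on an instance raise TypeError, which is the runtime analogue of the type error a checker would report.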

Similarly, other functions, like ord, str.join, str.format, etc. need to be
updated to also accept an OpaqueStr. There are also unicode-related changes,
and some additional type-plumbing that I haven't encountered. Auditing
current places where `str` is accepted, and replacing them with `str or
OpaqueStr`, is probably the most annoying thing about this, but it forces
people to opt in to the ambiguous behavior.
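As a sketch of what that widening could look like, the snippet below models it with ordinary functions. `StrLike`, `code_point` and `join_all` are hypothetical names standing in for the updated signatures of ord and str.join, and the minimal OpaqueStr class is again only a stand-in:

```python
from typing import Iterable, Union


class OpaqueStr:
    """Hypothetical stand-in for the proposed type (one wrapped character)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def __str__(self) -> str:
        return self._value


# What "str or OpaqueStr" might look like when spelled out in a stub.
StrLike = Union[str, OpaqueStr]


def code_point(c: StrLike) -> int:
    """Models a widened ord(): accepts either kind of value."""
    return ord(str(c))


def join_all(sep: str, parts: Iterable[StrLike]) -> str:
    """Models a widened str.join(): elements may be str or OpaqueStr."""
    return sep.join(str(p) for p in parts)
```

A function that keeps a plain `str` parameter would then reject characters obtained by iterating a string, so accepting them becomes an explicit choice in the signature.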

The result of this is that as soon as you iterate over a string, you get an
OpaqueStr, which isn't in the normal str hierarchy; it's unrelated (though
it might be a `basestring` in python2?). So attempts to iterate over the
OpaqueStr, or to use it in non-intuitive ways, fail. Similarly, since str is
now defined as Iterable/Container[OpaqueStr] instead of Iterable[str], you
avoid the recursive type problem.
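The recursion point can be modelled at runtime. In this sketch, `Str` and `nesting_depth` are hypothetical names: `Str` plays the role of str with the changed `__iter__`, and `nesting_depth` counts how many times a value can be iterated into before iteration fails (capped at 5, since a real str never bottoms out):

```python
from typing import Any, Iterator


class OpaqueStr:
    """Hypothetical element type: wraps one character and is not iterable."""

    def __init__(self, value: str) -> None:
        self._value = value

    def __str__(self) -> str:
        return self._value


class Str:
    """Hypothetical model of str as Iterable[OpaqueStr], not Iterable[str]."""

    def __init__(self, value: str) -> None:
        self._value = value

    def __iter__(self) -> Iterator[OpaqueStr]:
        for ch in self._value:
            yield OpaqueStr(ch)


def nesting_depth(value: Any) -> int:
    """Count how often we can descend into the first element (capped at 5)."""
    depth = 0
    while depth < 5:
        try:
            value = next(iter(value))  # descend into the first element
        except (TypeError, StopIteration):
            break
        depth += 1
    return depth
```

With a built-in str the descent only stops at the cap, because every element is again a str; with the modelled type it stops after one level.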

If this sounds relatively concrete, it's because I wrote an initial
implementation after encountering this problem, which basically follows the
design I outlined above. It's by no means perfect (likely erring on the
side of false-negatives). I was also able to run it on a lot of the code at
Google. Here's what I found:

 - The only false positives I saw were from functions that could return
either str or Iterable[str], based on, say, a flag or another argument. I
may have missed some; there were some suspicious cases, but most of those
were in situations where I couldn't obviously say that

 - The breakages were, I think, a mostly even mix of logical issues in the
code and misleading type annotations. For example, I saw a function that
could only ever return `False` due to a logical error, and this revealed
it. Similarly, I saw logically correct functions with bad annotations. One
example was (paraphrased, in a class)

@abstractmethod
def _GetThing(self) -> str:
    pass

def DoThing(self):
    x, y = self._GetThing()

The subclass correctly overrode the type signature of _GetThing to return
Tuple[str, str].

 - Mis-typing Dict[str, List[str]] as Dict[str, str] is weirdly common. I
think this was the cause of maybe a third of the changes I had to make,
although it's hard to say with certainty, since in many cases I just
disabled the type errors.

 - I ran this over a bunch of code (not sure exactly what I can say, but it
was a lot), and while I haven't finished all of them, it looks like there
are, maybe, 100 locations that need updating, including within the standard
library (re, hexlify, etc.) and open source libraries that I could detect,
as well as within Google's Python code. Suppressing the warnings took me
all of a few hours; fixing them will obviously take longer, but much of
that is because the code in question is subtly broken and needs fixing.
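The Dict[str, List[str]] vs. Dict[str, str] mix-up mentioned above is easy to reproduce. In this sketch `all_values` and `index` are made-up names, and the annotation is wrong on purpose. Current checkers accept the body because iterating a str yields str:

```python
from typing import Dict, List


def all_values(index: Dict[str, str]) -> List[str]:
    # The annotation above is wrong on purpose: callers really pass
    # Dict[str, List[str]]. The body still type-checks today.
    out: List[str] = []
    for key in index:
        for item in index[key]:  # str is Iterable[str], so item is a str
            out.append(item)
    return out


# What callers actually pass (does not match the annotation).
from_lists = all_values({"k": ["ab", "cd"]})  # type: ignore[dict-item]

# What the annotation claims is passed: the loop silently splits the string.
from_strs = all_values({"k": "ab"})
```

Under the proposed stubs, `item` would presumably be an OpaqueStr, so appending it to a List[str] would be reported and the wrong annotation would come to light.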

tl;dr: Add OpaqueStr, which is like str, but not a container. str.__iter__
returns OpaqueStr. This catches bugs.

I'm looking forward to any feedback or thoughts you all have. Thanks,
Josh

Joshua Morton