data:image/s3,"s3://crabby-images/65987/659874f88a84e07c92a09d7d230ac17da6f747f4" alt=""
Hi everyone, I know there was previously some discussion about ways to resolve str vs. Iterable[str] sometimes causing issues ( https://github.com/python/typing/issues/256), and I wanted to bring it up again because I have a suggestion that appears to have not been explored significantly in the discussions. A lot of the requests are something like "special case length-1 strings at runtime", which is tricky. But in that issue, a few people propose another option: have str.__iter__ return a type that isn't str. This returned type should itself be non-iterable. I think that alternative didn't quite get explored enough, so I wanted to bring it up for more discussion. This has some nice properties compared to a lot of the other suggestions that make it easier to implement: it requires no changes to python-the-language, and even requires no changes to MyPy or other type-checkers, it only (I think) requires changes to type stubs for the standard library and builtins. Whether this stricter behavior should hide behind a flag is another question that I don't address. So the problem is that, given a function like def many_strs(strs: Sequence[str]) -> str: return strs[0] the call `many_strs('aaa')` will typecheck successfully, even though its exceedingly unlikely, given the annotations, that this is behavior the user wanted or expected. So let's add a new type, which I'll refer to as "OpaqueStr", since it's meant to be non-iterable and non-container-like (since most instances are just going to be a length-1 string from str.__iter__). It should only implement a small subset of the str methods (capitalize, add, join, upper, lower, all of the isXYZ), but not any of the container like ones: no __iter__, no __len__, no strip, translate, format, find, etc. Those don't make sense for a type like this. Similarly, other functions, like ord, str.join, str.format, etc. need to be updated to also accept an OpaqueStr. There's also unicode related changes, and some additional type-plumbing that I haven't encountered. Auditing current places where `str` is accepted, and replacing them with `str or OpaqueStr` is probably the most annoying thing about this, but it forces people to opt-in to the ambiguous behavior. The result of this is that as soon as you iterate over a string, you get an OpaqueStr, which isn't in the normal str hierarchy, it's unrelated (though may be `basestring`? in python2). So attempts to iterate over or use the OpaqueStr in non-intuitive ways fail. Similarly, since str is now defined as Iterable/Container[OpaqueStr], instead of Iterable[str], you avoid the recursive type problem. If this sounds relatively concrete, it's because I wrote an initial implementation after encountering this problem, that basically follows the design I outlined above. It's by no means perfect (likely erring on the side of false-negatives). I was also able to run it on a lot of the code at Google. Here's what I found: - The only false positives I saw were from functions that could return either str or Iterable[str], based on, say, a flag or another argument. I may have missed some, there were some suspicious cases, but most of those were in situations where I couldn't obviously say that - The breakages were, I think, a mostly even mix of logical issues in the code, and misleading type annotations, for example, I saw a function that could only ever return `False` due to a logical error, and this revealed it, but similarly, I saw logically corrected functions with bad annotations. One example was (paraphrased, in a class) @abstractmethod def _GetThing(self) -> str: pass def DoThing(self): x, y = self.GetThing() the subclass correctly overrode the type signature of _GetThing to return Tuple[str, str]. - Mis-typing Dict[str, List[str]] as Dict[str, str] is weirdly common, I think this was the cause of maybe a third of the changes I had to make, although it's hard to say with certainty, since in many cases I just disabled the type errors. - I ran this over a bunch of code (not sure exactly what I can say, but it was a lot), and while I haven't finished all of them, it looks like there are, maybe, 100 locations that need updating, including within the standard library (re, hexlify, etc.) and open source libraries that I could detect, as well as within Google's python code. Suppressing the warnings took me all of a few hours, fixing them will obviously take longer, but much of that is because the code in question is subtly broken and needs fixing. tl;dr: Add OpaqueStr, which is like str, but not a container. str.__iter__ returns OpaqueStr. This catches bugs. I'm looking forward to any feedback or thoughts you all have. Thanks, Josh