A proposal for str vs. Iterable[str]
Hi everyone, I know there was previously some discussion about ways to resolve str vs. Iterable[str] sometimes causing issues ( https://github.com/python/typing/issues/256), and I wanted to bring it up again because I have a suggestion that appears to have not been explored significantly in the discussions. A lot of the requests are something like "special case length-1 strings at runtime", which is tricky. But in that issue, a few people propose another option: have str.__iter__ return a type that isn't str. This returned type should itself be non-iterable. I think that alternative didn't quite get explored enough, so I wanted to bring it up for more discussion. This has some nice properties compared to a lot of the other suggestions that make it easier to implement: it requires no changes to python-the-language, and even requires no changes to MyPy or other type-checkers, it only (I think) requires changes to type stubs for the standard library and builtins. Whether this stricter behavior should hide behind a flag is another question that I don't address. So the problem is that, given a function like def many_strs(strs: Sequence[str]) -> str: return strs[0] the call `many_strs('aaa')` will typecheck successfully, even though its exceedingly unlikely, given the annotations, that this is behavior the user wanted or expected. So let's add a new type, which I'll refer to as "OpaqueStr", since it's meant to be non-iterable and non-container-like (since most instances are just going to be a length-1 string from str.__iter__). It should only implement a small subset of the str methods (capitalize, add, join, upper, lower, all of the isXYZ), but not any of the container like ones: no __iter__, no __len__, no strip, translate, format, find, etc. Those don't make sense for a type like this. Similarly, other functions, like ord, str.join, str.format, etc. need to be updated to also accept an OpaqueStr. There's also unicode related changes, and some additional type-plumbing that I haven't encountered. Auditing current places where `str` is accepted, and replacing them with `str or OpaqueStr` is probably the most annoying thing about this, but it forces people to opt-in to the ambiguous behavior. The result of this is that as soon as you iterate over a string, you get an OpaqueStr, which isn't in the normal str hierarchy, it's unrelated (though may be `basestring`? in python2). So attempts to iterate over or use the OpaqueStr in non-intuitive ways fail. Similarly, since str is now defined as Iterable/Container[OpaqueStr], instead of Iterable[str], you avoid the recursive type problem. If this sounds relatively concrete, it's because I wrote an initial implementation after encountering this problem, that basically follows the design I outlined above. It's by no means perfect (likely erring on the side of false-negatives). I was also able to run it on a lot of the code at Google. Here's what I found: - The only false positives I saw were from functions that could return either str or Iterable[str], based on, say, a flag or another argument. I may have missed some, there were some suspicious cases, but most of those were in situations where I couldn't obviously say that - The breakages were, I think, a mostly even mix of logical issues in the code, and misleading type annotations, for example, I saw a function that could only ever return `False` due to a logical error, and this revealed it, but similarly, I saw logically corrected functions with bad annotations. One example was (paraphrased, in a class) @abstractmethod def _GetThing(self) -> str: pass def DoThing(self): x, y = self.GetThing() the subclass correctly overrode the type signature of _GetThing to return Tuple[str, str]. - Mis-typing Dict[str, List[str]] as Dict[str, str] is weirdly common, I think this was the cause of maybe a third of the changes I had to make, although it's hard to say with certainty, since in many cases I just disabled the type errors. - I ran this over a bunch of code (not sure exactly what I can say, but it was a lot), and while I haven't finished all of them, it looks like there are, maybe, 100 locations that need updating, including within the standard library (re, hexlify, etc.) and open source libraries that I could detect, as well as within Google's python code. Suppressing the warnings took me all of a few hours, fixing them will obviously take longer, but much of that is because the code in question is subtly broken and needs fixing. tl;dr: Add OpaqueStr, which is like str, but not a container. str.__iter__ returns OpaqueStr. This catches bugs. I'm looking forward to any feedback or thoughts you all have. Thanks, Josh
Fascinating. Maybe that type should be called Character? I assume it should be a figment of the type checkers's imagination? I.e. only in stubs. Great that you were able to test this on a large code base. On Thu, Jun 13, 2019 at 14:49 Joshua Morton via Typing-sig < typing-sig@python.org> wrote:
Hi everyone, I know there was previously some discussion about ways to resolve str vs. Iterable[str] sometimes causing issues ( https://github.com/python/typing/issues/256), and I wanted to bring it up again because I have a suggestion that appears to have not been explored significantly in the discussions.
A lot of the requests are something like "special case length-1 strings at runtime", which is tricky.
But in that issue, a few people propose another option: have str.__iter__ return a type that isn't str. This returned type should itself be non-iterable. I think that alternative didn't quite get explored enough, so I wanted to bring it up for more discussion.
This has some nice properties compared to a lot of the other suggestions that make it easier to implement: it requires no changes to python-the-language, and even requires no changes to MyPy or other type-checkers, it only (I think) requires changes to type stubs for the standard library and builtins. Whether this stricter behavior should hide behind a flag is another question that I don't address.
So the problem is that, given a function like
def many_strs(strs: Sequence[str]) -> str: return strs[0]
the call `many_strs('aaa')` will typecheck successfully, even though its exceedingly unlikely, given the annotations, that this is behavior the user wanted or expected.
So let's add a new type, which I'll refer to as "OpaqueStr", since it's meant to be non-iterable and non-container-like (since most instances are just going to be a length-1 string from str.__iter__). It should only implement a small subset of the str methods (capitalize, add, join, upper, lower, all of the isXYZ), but not any of the container like ones: no __iter__, no __len__, no strip, translate, format, find, etc. Those don't make sense for a type like this.
Similarly, other functions, like ord, str.join, str.format, etc. need to be updated to also accept an OpaqueStr. There's also unicode related changes, and some additional type-plumbing that I haven't encountered. Auditing current places where `str` is accepted, and replacing them with `str or OpaqueStr` is probably the most annoying thing about this, but it forces people to opt-in to the ambiguous behavior.
The result of this is that as soon as you iterate over a string, you get an OpaqueStr, which isn't in the normal str hierarchy, it's unrelated (though may be `basestring`? in python2). So attempts to iterate over or use the OpaqueStr in non-intuitive ways fail. Similarly, since str is now defined as Iterable/Container[OpaqueStr], instead of Iterable[str], you avoid the recursive type problem.
If this sounds relatively concrete, it's because I wrote an initial implementation after encountering this problem, that basically follows the design I outlined above. It's by no means perfect (likely erring on the side of false-negatives). I was also able to run it on a lot of the code at Google. Here's what I found:
- The only false positives I saw were from functions that could return either str or Iterable[str], based on, say, a flag or another argument. I may have missed some, there were some suspicious cases, but most of those were in situations where I couldn't obviously say that
- The breakages were, I think, a mostly even mix of logical issues in the code, and misleading type annotations, for example, I saw a function that could only ever return `False` due to a logical error, and this revealed it, but similarly, I saw logically corrected functions with bad annotations. One example was (paraphrased, in a class)
@abstractmethod def _GetThing(self) -> str: pass
def DoThing(self): x, y = self.GetThing()
the subclass correctly overrode the type signature of _GetThing to return Tuple[str, str].
- Mis-typing Dict[str, List[str]] as Dict[str, str] is weirdly common, I think this was the cause of maybe a third of the changes I had to make, although it's hard to say with certainty, since in many cases I just disabled the type errors.
- I ran this over a bunch of code (not sure exactly what I can say, but it was a lot), and while I haven't finished all of them, it looks like there are, maybe, 100 locations that need updating, including within the standard library (re, hexlify, etc.) and open source libraries that I could detect, as well as within Google's python code. Suppressing the warnings took me all of a few hours, fixing them will obviously take longer, but much of that is because the code in question is subtly broken and needs fixing.
tl;dr: Add OpaqueStr, which is like str, but not a container. str.__iter__ returns OpaqueStr. This catches bugs.
I'm looking forward to any feedback or thoughts you all have. Thanks, Josh _______________________________________________ Typing-sig mailing list -- typing-sig@python.org To unsubscribe send an email to typing-sig-leave@python.org https://mail.python.org/mailman3/lists/typing-sig.python.org/
-- --Guido (mobile)
So far in fixing things up, I haven't found a place where I'd need to cast to char, but I think inline annotations of char might be reasonable in some (rare) cases. For example, there's some code I saw that did some character by character string processing. Something like while i < len(mystr): c = mystr[i] if some_f(c): accumulator += c That's weird, but I saw it. `some_f` in that case is `some_f(c: char) -> bool`, and being able to annotate it that way would be nice, I think? You could probably get by with upcasting everything to a str when crossing method boundaries, so `some_f(str(c))`, but I'm not clear that that's better than just exposing typing.Char. And yes, I originally implemented this under the name char/unichar, but there was some discussion of that not necessarily being the right name, so I proposed something intentionally bad :) --Josh On Thu, Jun 13, 2019 at 3:05 PM Guido van Rossum <guido@python.org> wrote:
Fascinating. Maybe that type should be called Character? I assume it should be a figment of the type checkers's imagination? I.e. only in stubs.
Great that you were able to test this on a large code base.
On Thu, Jun 13, 2019 at 14:49 Joshua Morton via Typing-sig <typing-sig@python.org> wrote:
Hi everyone, I know there was previously some discussion about ways to resolve str vs. Iterable[str] sometimes causing issues (https://github.com/python/typing/issues/256), and I wanted to bring it up again because I have a suggestion that appears to have not been explored significantly in the discussions.
A lot of the requests are something like "special case length-1 strings at runtime", which is tricky.
But in that issue, a few people propose another option: have str.__iter__ return a type that isn't str. This returned type should itself be non-iterable. I think that alternative didn't quite get explored enough, so I wanted to bring it up for more discussion.
This has some nice properties compared to a lot of the other suggestions that make it easier to implement: it requires no changes to python-the-language, and even requires no changes to MyPy or other type-checkers, it only (I think) requires changes to type stubs for the standard library and builtins. Whether this stricter behavior should hide behind a flag is another question that I don't address.
So the problem is that, given a function like
def many_strs(strs: Sequence[str]) -> str: return strs[0]
the call `many_strs('aaa')` will typecheck successfully, even though its exceedingly unlikely, given the annotations, that this is behavior the user wanted or expected.
So let's add a new type, which I'll refer to as "OpaqueStr", since it's meant to be non-iterable and non-container-like (since most instances are just going to be a length-1 string from str.__iter__). It should only implement a small subset of the str methods (capitalize, add, join, upper, lower, all of the isXYZ), but not any of the container like ones: no __iter__, no __len__, no strip, translate, format, find, etc. Those don't make sense for a type like this.
Similarly, other functions, like ord, str.join, str.format, etc. need to be updated to also accept an OpaqueStr. There's also unicode related changes, and some additional type-plumbing that I haven't encountered. Auditing current places where `str` is accepted, and replacing them with `str or OpaqueStr` is probably the most annoying thing about this, but it forces people to opt-in to the ambiguous behavior.
The result of this is that as soon as you iterate over a string, you get an OpaqueStr, which isn't in the normal str hierarchy, it's unrelated (though may be `basestring`? in python2). So attempts to iterate over or use the OpaqueStr in non-intuitive ways fail. Similarly, since str is now defined as Iterable/Container[OpaqueStr], instead of Iterable[str], you avoid the recursive type problem.
If this sounds relatively concrete, it's because I wrote an initial implementation after encountering this problem, that basically follows the design I outlined above. It's by no means perfect (likely erring on the side of false-negatives). I was also able to run it on a lot of the code at Google. Here's what I found:
- The only false positives I saw were from functions that could return either str or Iterable[str], based on, say, a flag or another argument. I may have missed some, there were some suspicious cases, but most of those were in situations where I couldn't obviously say that
- The breakages were, I think, a mostly even mix of logical issues in the code, and misleading type annotations, for example, I saw a function that could only ever return `False` due to a logical error, and this revealed it, but similarly, I saw logically corrected functions with bad annotations. One example was (paraphrased, in a class)
@abstractmethod def _GetThing(self) -> str: pass
def DoThing(self): x, y = self.GetThing()
the subclass correctly overrode the type signature of _GetThing to return Tuple[str, str].
- Mis-typing Dict[str, List[str]] as Dict[str, str] is weirdly common, I think this was the cause of maybe a third of the changes I had to make, although it's hard to say with certainty, since in many cases I just disabled the type errors.
- I ran this over a bunch of code (not sure exactly what I can say, but it was a lot), and while I haven't finished all of them, it looks like there are, maybe, 100 locations that need updating, including within the standard library (re, hexlify, etc.) and open source libraries that I could detect, as well as within Google's python code. Suppressing the warnings took me all of a few hours, fixing them will obviously take longer, but much of that is because the code in question is subtly broken and needs fixing.
tl;dr: Add OpaqueStr, which is like str, but not a container. str.__iter__ returns OpaqueStr. This catches bugs.
I'm looking forward to any feedback or thoughts you all have. Thanks, Josh _______________________________________________ Typing-sig mailing list -- typing-sig@python.org To unsubscribe send an email to typing-sig-leave@python.org https://mail.python.org/mailman3/lists/typing-sig.python.org/
-- --Guido (mobile)
participants (2)
-
Guido van Rossum
-
Joshua Morton