
On Sat, May 14, 2016 at 8:28 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Executive summary:
Inference of str vs bytes (in particular, which component of a "sum" type in general) is not always possible, and it's not obvious how beneficial it is. Proposal: have typecheckers recognize "suspicious" unions, and complain if they are not "converted" to a component type before passing to a polymorphic function.
Maybe you mean to propose that the concept of sum types should be introduced in typing/mypy. I don't think we want that, because annotating types with them would probably be clumsy, and there is too much overlap with things that are already available (unless of course some of that gets thrown away in favor of Sum).
Koos Zevenhoven writes:
We'll need to address this, unless we want the type checker to not know whether os.path.* etc. return str or bytes and to carry around Union[str, bytes]. In theory, it would be possible to infer whether it is str or bytes, as described.
I'm -0.5 = "more trouble than it's worth". Rationale:
I could imagine that being the conclusion at this point. I just feel bad for reducing the precision in the "type-hintability" of things like os.path.* as a consequence of the new fspath PEP, which I'm otherwise already quite happy with. That's the problem I'm trying to solve. I'm perfectly fine with it if the conclusion is that it's not worth it. I just don't want to say I broke the type hinting for os.path.* for people that want mypy to catch their errors of mixing bytes and str paths in the wrong way, or errors of mixing str/bytes-polymorphic path code with str concatenation etc. in the wrong way. But if I'm told Unions are sufficient, I'll be happy with that, because I'm not a significant user of type hints anyway. So please, someone tell me that. In my OP in the original thread, I describe how the os.path functions could presently be hinted (I guess someone may have already done this, but I did not check). In the future, It may be preferable to tell the type checker that passing a PurePath into os.path functions results in str being returned, as I also explain using a parametrizable type.
*Sometimes* it would be possible. OTOH, modules like os.path *will* end up carrying around Union[str, bytes], precisely because they will be called in both contexts, and that's not predictable at the time of type-checking os.path.
My proposal is aimed at making it possible to *always* have mypy know the types.
So what you really want is a Sum type constructor, where Sum is in the sense of category theory's sum functor. That is, not only is the result Type a (disjoint) union, but you're also guaranteed that each operation respects the component type.
(Side note: I don't know why you capitalize T in "Type" here, it does not seem to make sense in the context that Guido has just been describing as Type. Is this confusion an argument against calling Type "Type"?) My point was to parametrize the type (hint) for path objects [pretty much exactly (like) a generic], which roughly means that the type checker can be told how one disjoint union as the argument type gets turned into another disjoint union as the return type. I suppose one might write this in English as "In a typical os.path.* function such as os.path.dirname, Union[PathABC[str], str] and Union[PathABC[bytes], bytes] get turned into str and bytes, *respectively*." Mypy/PEP483-4 approaches this problem by using Generics and TypeVars. Parametrizing the path type by the underlying type such as str in PathABC[str] would allow using a TypeVar to indicate "respectively". However, there is also @overload to indicate this.
I could argue that this is why Guido was right to remove the restriction that os.fspath return str. The __fspath__ method is part of the "sum category os.path"[1], which is a collection of operations on bytes and on str that parallel each other because they are equivalent representations of filesystem paths. os.fspath therefore is also part of this category. If you want to take the result out of this category and make it str, use os.fsdecode (which is *not* part of this category *because* the codomain is str, not Sum[bytes,str]).
Indeed, as mentioned in my previous post, you seem to be describing exactly my earlier arguments, but using a different (or additional) set of terminology.
Note that Union is still a useful type if we have Sum. For example, the operator division / : Tuple(Union(Float,Int),Union(Float,Int)) -> Union(Float,Int) might take 1.0/1.0 -> 1 (in Common Lisp it does, although in Python it doesn't). So the "sum" requirement that the map respect "original" type is a language design question, not a restriction we should apply for the fun of it.
Yes, Unions will always be needed.
Now, how would the type checker treat __fspath__? Borrowing a type from Ethan, it needs to observe that there was bytes object in the main function that got passed to the Antipathy Sum[str,bytes] -> Path constructor. It would then tag that Path p as a bytes-y Path, and further tag the value of os.fspath(p) as bytes. So we would need to add the concept of a ComponentType-y object of Type SumType to the typing module. That sounds complicated to me, and typically not all that useful: people living at the level of bytes will be swimming in bytes, ditto str-people. Polymorphism "just works" for them. People working at the interface will be using functions like os.fsdecode that ensure type in the nature of things. They have to be more careful, but use of a Union[str, bytes] as an argument to foo(b: bytes) will already be flagged.
Indeed, as mentioned, there are already approaches for this.
I think TRT here is to provide a way to tell the type checker that certain Unions should be converted as soon as possible, and not allow passing them even to polymorphic functions. So:
def foo(d: DirEntry) -> Int: s = os.fspath(d) with open(s) as f: # Flag use of s as Union[str, bytes]. # open() doesn't care, but the type # checker knows that such Unions # should only be passed to conversion # functions. return len(f)
while
def foo(d: DirEntry) -> Int: s = os.fspath(d) s = os.fsdecode(s) # Converted to str immediately, clear since # fsdecode Type is "SuspiciousUnion -> str". with open(s) as f: return len(f)
There is nothing suspicious about that Union there. "Union[str, bytes] -> str" is a perfectly good type hint for os.fsdecode. You can't get any more precise than that. Some nitpicking: Your function foo should probably be hinted to work with any path, not just DirEntry, besides, there seems to be no right way to currently import DirEntry, you won't actually be able to use that as a static type hint. I'm not sure this example has anything to do with what you (or I) are proposing, except that os.fsdecode and os.fsencode indeed make sure that the output type is no longer a Union. So maybe you are just calling Unions in general suspicious, because without a context you can't know exactly which runtime type it should be. Luckily our PEP will turn your example into: def foo(d: TheTypeHintForAnyPath) -> int: with open(d) as f: return len(f)
is OK. This is on the "catch bugs early" principle, and respects the generally accepted principle that in "most" applications, encoded bytes should be decoded to str "at the boundary". Of course this would be optional (eg, it would be off in a standalone check of os.path for type sanity), and not all Unions are subject to this treatment (eg, it would make no sense for numbers).
One definitely should not use os.fsdecode or os.fsencode just to help the type checker. I hope that is not what you meant.
On the other hand, *inside* a function that is respectfully polymorphic, the annotation foo(s: Sum[str, bytes]) -> Sum[str, bytes] would tell the typechecker to ensure that in foo's implementation, if bytes go in, bytes come out, if str goes in, str comes out. That might be very useful (but I don't have a strong opinion).
Indeed, this is almost the same as TypeVar or an overload.
A nit: what would len(s: Sum[str, bytes]) -> Int mean? In this case it's obviously equivalent to len(s: Union[str, bytes]) -> Int, but I'm not sure it's all that obvious in other possible cases.
Also here. - Koos
Finally, in many cases Sum types just are not going to be useful. You might want to alias Number to Sum[Float, Int] because (in Python) Float + Float -> Float and Int + Int -> Int, but then you have two problems. First, you can't do inference on mixed-type arithmetic, Sum doesn't permit that.[2] Second, you can't do inference on Int / Int.
Footnotes: [1] The quotes mean "I think I could make this precise but it would be boring. Bear with me, and don't be distracted by the fact that lots of stuff in this 'category' aren't in the os.path *module* -- eg, open() and __fspath__ itself." I hope that the categorical language will be a useful metaphor for those who understand category theory, but really the whole thing is a hand-wavy path to "what implementation would look like".
[2] This is also a handwavy statement. Technically, the arithmetic we're talking about is binary: Tuple[Sum, Sum] -> Sum and it could do anything with mixed types, as Sum can only provide restrictions on unary: Sum -> Sum operation. So a more accurate way to put the point is that it's not really obvious what properties Tuple[Sum, Sum] should have, especially in the context of a function whose value is of the same Sum Type. And of course the problem with Int/Int is real -- it's hard to imagine how that could be handled by a general rule about Tuple[Sum, Sum] -> Sum: surely if both operands are of the same type, the value should be of that type! But we decided otherwise when designing Python 3.