
Executive summary: Inference of str vs bytes (in particular, which component of a "sum" type in general) is not always possible, and it's not obvious how beneficial it is. Proposal: have typecheckers recognize "suspicious" unions, and complain if they are not "converted" to a component type before passing to a polymorphic function. Koos Zevenhoven writes:
We'll need to address this, unless we want the type checker to not know whether os.path.* etc. return str or bytes and to carry around Union[str, bytes]. In theory, it would be possible to infer whether it is str or bytes, as described.
I'm -0.5 = "more trouble than it's worth". Rationale: *Sometimes* it would be possible. OTOH, modules like os.path *will* end up carrying around Union[str, bytes], precisely because they will be called in both contexts, and that's not predictable at the time of type-checking os.path. So what you really want is a Sum type constructor, where Sum is in the sense of category theory's sum functor. That is, not only is the result Type a (disjoint) union, but you're also guaranteed that each operation respects the component type. I could argue that this is why Guido was right to remove the restriction that os.fspath return str. The __fspath__ method is part of the "sum category os.path"[1], which is a collection of operations on bytes and on str that parallel each other because they are equivalent representations of filesystem paths. os.fspath therefore is also part of this category. If you want to take the result out of this category and make it str, use os.fsdecode (which is *not* part of this category *because* the codomain is str, not Sum[bytes,str]). Note that Union is still a useful type if we have Sum. For example, the operator division / : Tuple(Union(Float,Int),Union(Float,Int)) -> Union(Float,Int) might take 1.0/1.0 -> 1 (in Common Lisp it does, although in Python it doesn't). So the "sum" requirement that the map respect "original" type is a language design question, not a restriction we should apply for the fun of it. Now, how would the type checker treat __fspath__? Borrowing a type from Ethan, it needs to observe that there was bytes object in the main function that got passed to the Antipathy Sum[str,bytes] -> Path constructor. It would then tag that Path p as a bytes-y Path, and further tag the value of os.fspath(p) as bytes. So we would need to add the concept of a ComponentType-y object of Type SumType to the typing module. That sounds complicated to me, and typically not all that useful: people living at the level of bytes will be swimming in bytes, ditto str-people. Polymorphism "just works" for them. People working at the interface will be using functions like os.fsdecode that ensure type in the nature of things. They have to be more careful, but use of a Union[str, bytes] as an argument to foo(b: bytes) will already be flagged. I think TRT here is to provide a way to tell the type checker that certain Unions should be converted as soon as possible, and not allow passing them even to polymorphic functions. So: def foo(d: DirEntry) -> Int: s = os.fspath(d) with open(s) as f: # Flag use of s as Union[str, bytes]. # open() doesn't care, but the type # checker knows that such Unions # should only be passed to conversion # functions. return len(f) while def foo(d: DirEntry) -> Int: s = os.fspath(d) s = os.fsdecode(s) # Converted to str immediately, clear since # fsdecode Type is "SuspiciousUnion -> str". with open(s) as f: return len(f) is OK. This is on the "catch bugs early" principle, and respects the generally accepted principle that in "most" applications, encoded bytes should be decoded to str "at the boundary". Of course this would be optional (eg, it would be off in a standalone check of os.path for type sanity), and not all Unions are subject to this treatment (eg, it would make no sense for numbers). On the other hand, *inside* a function that is respectfully polymorphic, the annotation foo(s: Sum[str, bytes]) -> Sum[str, bytes] would tell the typechecker to ensure that in foo's implementation, if bytes go in, bytes come out, if str goes in, str comes out. That might be very useful (but I don't have a strong opinion). A nit: what would len(s: Sum[str, bytes]) -> Int mean? In this case it's obviously equivalent to len(s: Union[str, bytes]) -> Int, but I'm not sure it's all that obvious in other possible cases. Finally, in many cases Sum types just are not going to be useful. You might want to alias Number to Sum[Float, Int] because (in Python) Float + Float -> Float and Int + Int -> Int, but then you have two problems. First, you can't do inference on mixed-type arithmetic, Sum doesn't permit that.[2] Second, you can't do inference on Int / Int. Footnotes: [1] The quotes mean "I think I could make this precise but it would be boring. Bear with me, and don't be distracted by the fact that lots of stuff in this 'category' aren't in the os.path *module* -- eg, open() and __fspath__ itself." I hope that the categorical language will be a useful metaphor for those who understand category theory, but really the whole thing is a hand-wavy path to "what implementation would look like". [2] This is also a handwavy statement. Technically, the arithmetic we're talking about is binary: Tuple[Sum, Sum] -> Sum and it could do anything with mixed types, as Sum can only provide restrictions on unary: Sum -> Sum operation. So a more accurate way to put the point is that it's not really obvious what properties Tuple[Sum, Sum] should have, especially in the context of a function whose value is of the same Sum Type. And of course the problem with Int/Int is real -- it's hard to imagine how that could be handled by a general rule about Tuple[Sum, Sum] -> Sum: surely if both operands are of the same type, the value should be of that type! But we decided otherwise when designing Python 3.