On Wed, Aug 13, 2014 at 6:39 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
On Wednesday, August 13, 2014 12:45 PM, Guido van Rossum <guido@python.org> wrote:


>  def word_count(input: List[str]) -> Dict[str, int]:
>      result = {}  #type: Dict[str, int]
>      for line in input:
>          for word in line.split():
>              result[word] = result.get(word, 0) + 1
>      return result


I just realized why this bothers me.

This function really, really ought to be taking an Iterable[String] (except that we don't have a String ABC). If you hadn't statically typed it, it would work just fine with, say, a text file—or, for that matter, a binary file. By restricting it to List[str], you've made it a lot less usable, for no visible benefit.

And, while this is less serious, I don't think it should be guaranteeing that the result is a Dict rather than just some kind of Mapping. If you want to change the implementation tomorrow to return some kind of proxy or a tree-based sorted mapping, you can't do so without breaking all the code that uses your function.

I see this as a matter of programming style. In a library module, I'd usually use types that are about as general as feasible (without making them overly complex). However, if we have just a simple utility function that's only used within a single program, declaring everything using abstract types buys you little, IMHO, and may make things much more complicated. You can always refactor the code to use more general types if the need arises. Using simple, concrete types seems to decrease the cognitive load, but that's just my experience.

Also, programmers don't always read documentation/annotations and can rely on the concrete return type of a function (they can figure this out easily by using repr()/type()). In general, as long as dynamically typed programs may call your function, changing the concrete return type of a library function risks breaking code that makes too many assumptions. Thus I'd rather use concrete types for function return types -- but of course everybody is free to not follow this convention.
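
To make that concrete, here is a minimal sketch with a general argument type and a concrete return type. The report() caller is hypothetical, made up only to show the kind of code that leans on dict-specific methods:

```python
from typing import Dict, Iterable


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # General argument type, concrete return type.
    result = {}  # type: Dict[str, int]
    for line in lines:
        for word in line.split():
            result[word] = result.get(word, 0) + 1
    return result


def report(lines: Iterable[str]) -> str:
    # Hypothetical caller that assumes the result is a real dict.
    counts = word_count(lines).copy()  # the Mapping ABC has no copy()
    # setdefault() is only available on mutable mappings.
    counts.setdefault("total", sum(counts.values()))
    return repr(sorted(counts.items()))
```

If word_count() were later changed to return a read-only Mapping, report() would break even though nothing in it looks unusual.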


And if even Guido, in the motivating example for this feature, is needlessly restricting the usability and future flexibility of a function, I suspect it may be a much bigger problem in practice.


This example also shows exactly what's wrong with simple generics: if this function takes an Iterable[String], it doesn't just return a Mapping[String, int], it returns a Mapping of _the same String type_. If your annotations can't express that, any value that passes through this function loses type information. 

If I define a subclass X of str, split() still returns a List[str] rather than List[X], unless I override something, so this wouldn't work with the above example:

>>> class X(str): pass
...
>>> type(X('x y').split()[0])
<class 'str'>


And not being able to tell whether the keys in word_count(f) are str or bytes *even if you know that f was a text file* seems like a pretty major loss.

Mypy considers bytes incompatible with str, and vice versa. The annotation Iterable[str] says that Iterable[bytes] (such as a binary file) would not be a valid argument. Text files and binary files have different types, though the return type of open(...) is not inferred correctly right now. It would be easy to fix this for the most common cases, though.
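
As a quick runtime illustration of the text/binary distinction, here is a small sketch using in-memory stand-ins for files (io.StringIO and io.BytesIO are my choice for the example, not something the checker requires):

```python
import io

# In-memory stand-ins for a text file and a binary file.
text_file = io.StringIO("spam eggs\nspam\n")
binary_file = io.BytesIO(b"spam eggs\nspam\n")

# Iterating a text file yields str lines; a binary file yields bytes lines.
text_line_types = {type(line) for line in text_file}
binary_line_types = {type(line) for line in binary_file}
```

So a text file behaves like an Iterable[str] and a binary file like an Iterable[bytes], and an argument annotated as Iterable[str] only matches the former.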

You could use AnyStr to make the example work with bytes as well:

  def word_count(input: Iterable[AnyStr]) -> Dict[AnyStr, int]:
      result = {}  # type: Dict[AnyStr, int]
      for line in input:
          for word in line.split():
              result[word] = result.get(word, 0) + 1
      return result
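
A usage sketch for the AnyStr version, with the imports it needs; the sample inputs are invented for illustration:

```python
import io
from typing import AnyStr, Dict, Iterable


def word_count(input: Iterable[AnyStr]) -> Dict[AnyStr, int]:
    result = {}  # type: Dict[AnyStr, int]
    for line in input:
        for word in line.split():
            result[word] = result.get(word, 0) + 1
    return result


# The key type follows the input: str lines give str keys,
# bytes lines give bytes keys.
text_counts = word_count(io.StringIO("spam eggs\nspam\n"))
bytes_counts = word_count(io.BytesIO(b"spam eggs\nspam\n"))
```

With this signature a checker can tell that text_counts is a Dict[str, int] and bytes_counts is a Dict[bytes, int].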

Again, if this is just a simple utility function that you use once or twice, I see no reason to spend a lot of effort in coming up with the most general signature. Types are an abstraction and they can't express everything precisely -- there will always be a lot of cases where you can't express the most general type. However, I think that relatively simple types work well enough most of the time, and give the most bang for the buck.

Jukka