[Python-ideas] Proposal: Use mypy syntax for function annotations

Fri Aug 15 07:35:28 CEST 2014

Guido van Rossum wrote on 15.08.2014 at 06:34:
> On Thu, Aug 14, 2014 at 9:12 PM, Stefan Behnel wrote:
>> Guido van Rossum wrote on 14.08.2014 at 07:24:
>>> On Wed, Aug 13, 2014 at 9:06 PM, Jukka Lehtosalo wrote:
>>>> You could use AnyStr to make the example work with bytes as well:
>>>>
>>>>   def word_count(input: Iterable[AnyStr]) -> Dict[AnyStr, int]:
>>>>       result = {}  #type: Dict[AnyStr, int]
>>>>
>>>>       for line in input:
>>>>           for word in line.split():
>>>>               result[word] = result.get(word, 0) + 1
>>>>       return result
>>>>
>>>> Again, if this is just a simple utility function that you use once or
>>>> twice, I see no reason to spend a lot of effort in coming up with the most
>>>> general signature. Types are an abstraction and they can't express
>>>> everything precisely -- there will always be a lot of cases where you can't
>>>> express the most general type. However, I think that relatively simple
>>>> types work well enough most of the time, and give the most bang for the
>>>> buck.
>>>
>>> I heartily agree. But just for the type theorists amongst us, if I really
>>> wanted to write the most general type, how would I express that the AnyStr
>>> in the return type matches the one in the argument? (I think pytypedecl
>>> would use something like T <= AnyStr.)
>>
>> That's how Cython's "fused types" (generics) work, at least. They go by
>> name: same name of the type, same type. Otherwise, use alias names, which
>> make the types independent from each other.
>>
>> http://docs.cython.org/src/userguide/fusedtypes.html
>>
>> While it's a matter of definition what way to go here (same type or not),
>> practice has shown that it's clearly the right decision to make identical
>> types the default.
> 
> I don't understand those docs at all

I'm not surprised. ;)

The main idea is that you declare (typedef) a "fused" type that means "any
of the following list of types". Then you use it in a function signature
and the compiler explodes it into multiple specialised implementations that
get separately optimised for the specific type(s) they use. Compile time
generic functions, essentially. You get the cross product of all different
fused types that your function uses, but in practice, you almost always
want only one specialisation for each type, regardless of how often you
used it in the argument list.
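
To make the counting concrete, here is a rough sketch in plain Python (not
Cython) of how the number of specialisations comes about. The helper
specialisations() and the fused type names "number" and "other" are
hypothetical, made up for this illustration. It only enumerates signatures and
is not how the compiler is implemented.

```python
from itertools import product

def specialisations(arg_types, fused):
    """Enumerate the concrete signatures for a list of argument type names.

    `fused` maps a fused type name to the list of types it stands for.
    """
    # Each distinct fused name is expanded once, however often it is used.
    names = sorted({t for t in arg_types if t in fused})
    signatures = []
    for combo in product(*(fused[name] for name in names)):
        binding = dict(zip(names, combo))
        # Non-fused argument types pass through unchanged.
        signatures.append(tuple(binding.get(t, t) for t in arg_types))
    return signatures

# Hypothetical fused types: same member types, different names.
fused = {"number": ["int", "double"], "other": ["int", "double"]}

same_name = specialisations(["number", "number"], fused)  # one fused type used twice
two_names = specialisations(["number", "other"], fused)   # two independent fused types
```

Using the same name twice gives two signatures (both arguments int, or both
double), while two independently named fused types give the full cross product
of four.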

> but I do think I understand the rule
> "same name, same type" and I think I like it. Let me be clear -- in this
> example:
> 
> def word_count(input: Iterable[AnyStr]) -> Mapping[AnyStr, int]:
>     ...
> 
> the implication would be that if the input is Iterable[bytes] the output is
> Mapping[bytes, int] while if the input is Iterable[str] the output is
> Mapping[str, int]. Have I got that right? I hope so, because I think it is
> a nice simplifying rule that covers a lot of cases in practice.

Yes, absolutely. One caveat for Python: static analysis tools (including
Cython) will usually have the AST available and thus see the type name
used. Once the annotation is in the __annotations__ dict, however, the name
is lost and all that remains is the underlying type object. So renaming
types would have to be an explicit operation that the type object knows
about. Otherwise, you'd lose semantics at runtime (not sure it matters much
in practice, but it would when used for documentation purposes). Not very
DRY, but as I said, by far the most common case is to want all types the same.
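
A minimal sketch of both points, assuming the typing module as it exists in
current Python (it postdates this thread). S stands in for AnyStr, and Alias
and pair() are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, TypeVar

# S plays the role of AnyStr: either str or bytes, but the same one throughout.
S = TypeVar("S", str, bytes)

# A plain rebinding: a new *name* in the source, but the very same object.
Alias = S

def word_count(lines: Iterable[S]) -> Dict[S, int]:
    """Count words; the key type follows the type of the input lines."""
    result = {}  # type: Dict[S, int]
    for line in lines:
        for word in line.split():
            result[word] = result.get(word, 0) + 1
    return result

def pair(a: S, b: Alias) -> None:
    """Annotated with two different names in the source code."""

# At runtime only the objects are stored, so the alias name is gone.
annotations = pair.__annotations__
alias_is_same_object = annotations["a"] is annotations["b"]
```

A tool reading the source sees "S" and "Alias" as distinct names, whereas
anything inspecting __annotations__ at runtime sees one object twice. Making
the two independent would need a second, explicitly created type variable.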

> BTW there are a lot of messy things to consider around bytes, and IIUC mypy
> currently doesn't really cover them. Often when you write code that accepts
> a bytes instance, in practice it will accept anything that supports the
> buffer protocol (e.g. bytearray and memoryview).

Yes, totally. I've been teaching people that for years now, but it's so
much easier for them to write "x: bytes" than to remember to be forgiving
about input and think about what that means for their specific code. Not
typing the input at all is actually the best solution in many cases, but
getting that into the heads of users who are just discovering the beauty of
an optionally typed Python language is a true uphill battle. Sometimes I
even encourage them to use memory views instead of expecting "real" byte
string input, even though that can make working with the data less
"stringish". But it's what the users of their code will want.

Cython essentially uses NumPy-ish syntax for (compile time) memory views,
i.e. you'd write "int[:] x" to unpack a 1-dimensional buffer of item type C
int, or "unsigned char[:] b" for a plain uchar buffer. Here's a bunch of
examples:

http://docs.cython.org/src/userguide/memoryviews.html

This is a very well established syntax by now, with lots of code out there
using it. Makes working with arbitrary buffer providers a charm. Note that
Cython has its own C level memory view implementation, so this is way more
efficient than Python's generic memoryview objects (but PEP 3118 based, so
compatible).
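
In plain Python, being forgiving about input might look like the following
sketch. The function byte_sum() is a hypothetical example, and it uses
Python's generic memoryview object rather than Cython's compile time views.

```python
def byte_sum(data) -> int:
    """Sum the bytes of any object that supports the buffer protocol."""
    # memoryview() raises TypeError for objects without buffer support
    # (for example str), so no explicit isinstance() check is needed.
    # cast("B") exposes the buffer as unsigned bytes; it requires a
    # C-contiguous buffer.
    view = memoryview(data).cast("B")
    return sum(view)

from_bytes = byte_sum(b"abc")
from_bytearray = byte_sum(bytearray(b"abc"))
from_view = byte_sum(memoryview(b"abc"))
```

Leaving the parameter unannotated, as here, accepts every buffer provider. An
annotation of "bytes" would document a narrower contract than the code
actually needs.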

> Except when you are going
> to use it as a dict key, then bytearray won't work. And if you say that you
> are returning bytes, you probably shouldn't be returning a memoryview or
> bytearray.

Right. Hashability, strict output, all that.
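
A small sketch of the dict key problem and one possible way around it, namely
copying each buffer into a bytes object before using it as a key. The function
count_chunks() is a hypothetical name, and the copy is a design choice that
trades memory for hashability.

```python
def count_chunks(chunks):
    """Count equal chunks, accepting any mix of buffer providers."""
    counts = {}
    for chunk in chunks:
        # bytes() copies the buffer into an immutable, hashable object.
        key = bytes(chunk)
        counts[key] = counts.get(key, 0) + 1
    return counts

# bytearray is mutable and therefore not hashable.
try:
    hash(bytearray(b"a"))
    bytearray_is_hashable = True
except TypeError:
    bytearray_is_hashable = False

counts = count_chunks([b"a", bytearray(b"a"), memoryview(b"b")])
```

The keys of the returned dict are always bytes, which also keeps the output
type strict even though the input is permissive.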

> I don't expect that any type system we can come up with will be
> quite precise enough to cover all the cases, so we probably shouldn't lose
> too much sleep over this.

Well, Cython's type system has pretty much all you'd need. But it's linked
to the compiler in the sense that some features exist because Cython can do
them at compile time. Not everything can be done in pure Python at runtime
or import time.

Stefan