[Python-ideas] Add support for external annotations in the typing module

Till till.varoquaux at gmail.com
Thu Jan 17 17:33:57 EST 2019


We started a discussion in https://github.com/python/typing/issues/600
about adding support for extra annotations in the typing module.

Since this is probably going to turn into a PEP, I'm transferring the
discussion here to have more visibility.

The document below has been modified a bit from the one in GH to reflect
the feedback I got:

 + Added a small blurb about how ``Annotated`` should support being used as
an alias

Things that were raised but are not reflected in this document:

 + The dataclass example is confusing. I kept it for now because
dataclasses often come up in conversations about why we might want to
support annotations in the typing module. Maybe I should rework the
section.

 + `...` as a valid parameter for the first argument (if you want to add an
annotation but use the type inferred by your type checker). This is an
interesting idea; it's probably worth adding support for it if and only if
we decide to support it in other places (cf.
https://github.com/python/typing/issues/276).

Thanks,

Add support for external annotations in the typing module
==========================================================

We propose adding an ``Annotated`` type to the typing module to decorate
existing types with context-specific metadata. Specifically, a type ``T``
can be annotated with metadata ``x`` via the typehint ``Annotated[T, x]``.
This metadata can be used for either static analysis or at runtime. If a
library (or tool) encounters a typehint ``Annotated[T, x]`` and has no
special logic for metadata ``x``, it should ignore it and simply treat the
type as ``T``. Unlike the ``no_type_check`` functionality that currently
exists in the ``typing`` module, which completely disables typechecking
annotations on a function or a class, the ``Annotated`` type allows both
static typechecking of ``T`` (e.g., via MyPy or Pyre, which can safely
ignore ``x``) and runtime access to ``x`` within a specific
application. We believe that the introduction of this type would address a
diverse set of use cases of interest to the broader Python community.
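
For instance (``ValueRange`` here is a hypothetical marker class, used only
for illustration), a type checker would verify the argument of the function
below as a plain ``int``, while a runtime validator could additionally read
the allowed range from the annotation::

  def clamp_percent(value: Annotated[int, ValueRange(0, 100)]) -> int:
      ...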

Motivating examples:
~~~~~~~~~~~~~~~~~~~~

reading binary data
+++++++++++++++++++

The ``struct`` module provides a way to read and write C structs directly
from their byte representation. It currently relies on a string
representation of the C type to read in values::

  from struct import unpack

  record = b'raymond   \x32\x12\x08\x01\x08'
  name, serialnum, school, gradelevel = unpack('<10sHHb', record)

The struct documentation [struct-examples]_ suggests using a named tuple to
unpack the values and make this a bit more tractable::

  from collections import namedtuple
  Student = namedtuple('Student', 'name serialnum school gradelevel')
  Student._make(unpack('<10sHHb', record))
  # Student(name=b'raymond   ', serialnum=4658, school=264, gradelevel=8)


However, this recommendation is somewhat problematic; as we add more
fields, it's going to get increasingly tedious to match the properties in
the named tuple with the arguments in ``unpack``.

Instead, annotations can provide better interoperability with a type
checker or an IDE without adding any special logic outside of the
``struct`` module::

  from typing import Annotated, NamedTuple

  UnsignedShort = Annotated[int, struct.ctype('H')]
  SignedChar = Annotated[int, struct.ctype('b')]

  @struct.packed
  class Student(NamedTuple):
    # MyPy typechecks the 'name' field as 'str'
    name: Annotated[str, struct.ctype("<10s")]
    serialnum: UnsignedShort
    school: UnsignedShort
    gradelevel: SignedChar

  # 'unpack' only uses the metadata within the type annotations
  Student.unpack(record)
  # Student(name=b'raymond   ', serialnum=4658, school=264, gradelevel=8)
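
A minimal sketch of how such a ``packed`` decorator could consume the
metadata, assuming the extra arguments of ``Annotated`` are reachable at
runtime through a hypothetical ``__metadata__`` attribute (``ctype`` below
stands in for the hypothetical ``struct.ctype`` marker used above)::

  import struct

  class ctype:
    """Marker carrying the struct format string for one field."""
    def __init__(self, fmt):
      self.fmt = fmt

  def packed(cls):
    """Derive an 'unpack' constructor from the per-field ctype metadata."""
    fmt = '<' + ''.join(
      meta.fmt.lstrip('@=<>!')  # drop byte-order prefixes from field formats
      for hint in cls.__annotations__.values()
      # '__metadata__' is an assumed runtime representation of the extra args
      for meta in getattr(hint, '__metadata__', ())
      if isinstance(meta, ctype)
    )

    def unpack(record):
      return cls._make(struct.unpack(fmt, record))

    cls.unpack = staticmethod(unpack)
    return cls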



dataclasses
++++++++++++

Here's an example with dataclasses [dataclass]_ that is problematic from
the typechecking standpoint::

  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class C:
    myint: int = 0
    # the 'field' call tells the @dataclass decorator that the default action
    # in the constructor of this class is to set "self.mylist = list()"
    mylist: List[int] = field(default_factory=list)

Even though one might expect that ``mylist`` is a class attribute
accessible via ``C.mylist`` (like ``C.myint`` is) due to the assignment
syntax, that is not the case. Instead, the ``@dataclass`` decorator strips
out the assignment to this attribute, leading to an ``AttributeError`` upon
access::

  C.myint  # Ok: 0
  C.mylist  # AttributeError: type object 'C' has no attribute 'mylist'


This can lead to confusion for newcomers to the library who may not expect
this behavior. Furthermore, the typechecker needs to understand the
semantics of dataclasses and know not to treat the above example as an
assignment operation (which translates to additional complexity).

It makes more sense to move the information contained in ``field`` to an
annotation::

  @dataclass
  class C:
      myint: int = 0
      mylist: Annotated[List[int], field(default_factory=list)]

  # now, the AttributeError is more intuitive because there is no
  # assignment operator
  C.mylist  # AttributeError

  # the constructor knows how to use the annotations to set the 'mylist'
  # attribute
  c = C()
  c.mylist  # []

The main benefit of writing annotations like this is that it provides a way
for clients to gracefully degrade when they don't know what to do with the
extra annotations (by just ignoring them). If you used a typechecker that
didn't have any special handling for dataclasses and the ``field``
annotation, you would still be able to run checks as though the type were
simply::

  class C:
      myint: int = 0
      mylist: List[int]


lowering barriers to developing new types
+++++++++++++++++++++++++++++++++++++++++

Typically when adding a new type, we need to upstream that type to the
typing module and change MyPy [MyPy]_, PyCharm [PyCharm]_, Pyre [Pyre]_,
pytype [pytype]_, etc. This is particularly important when working on
open-source code that makes use of our new types, since the code would
not be immediately transportable to other developers' tools without
additional logic (this is a limitation of MyPy plugins [MyPy-plugins]_,
which allow for extending MyPy but would require a consumer of new
typehints to be using MyPy and to have the same plugin installed). As a
result, there is a high cost to developing and trying out new types in a
codebase. Ideally, we should be able to introduce new types in a manner
that allows for graceful degradation when clients do not have a custom MyPy
plugin, which would lower the barrier to development and ensure some degree
of backward compatibility.

For example, suppose that we wanted to add support for tagged unions
[tagged-unions]_ to Python. One way to accomplish this would be to annotate
a ``TypedDict`` in Python such that only one field is allowed to be set::

  Currency = Annotated[
    TypedDict('Currency', {'dollars': float, 'pounds': float}, total=False),
    TaggedUnion,
  ]

This is a somewhat cumbersome syntax, but it allows us to iterate on this
proof-of-concept and have people with non-patched IDEs work in a codebase
with tagged unions. We could easily test this proposal and iron out the
kinks before trying to upstream tagged unions to ``typing``, MyPy, etc.
Moreover, tools that do not have support for parsing the ``TaggedUnion``
annotation would still be able to treat ``Currency`` as a ``TypedDict``,
which is still a close approximation (slightly less strict).
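
For illustration (the ``TaggedUnion`` marker itself is hypothetical), a
checker that understands the annotation could reject the second value below,
while a checker that only understands ``TypedDict`` would accept both::

  price: Currency = {'dollars': 3.5}       # accepted either way
  price = {'dollars': 3.5, 'pounds': 2.8}  # rejected only by a TaggedUnion-aware checker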


Details of proposed changes to ``typing``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

syntax
++++++

``Annotated`` is parameterized with a type and an arbitrary list of Python
values that represent the annotations. Here are the specific details of the
syntax:

* The first argument to ``Annotated`` must be a valid ``typing`` type or
``...`` (to use the inferred type).

* Multiple type annotations are supported (Annotated supports variadic
arguments): ``Annotated[int, ValueRange(3, 10), ctype("char")]``

* ``Annotated`` must be called with at least two arguments
(``Annotated[int]`` is not valid)

* The order of the annotations is preserved and matters for equality
checks::

   Annotated[int, ValueRange(3, 10), ctype("char")] != \
    Annotated[int, ctype("char"), ValueRange(3, 10)]

* Nested ``Annotated`` types are flattened, with metadata ordered starting
with the innermost annotation::

   Annotated[Annotated[int, ValueRange(3, 10)], ctype("char")] ==\
    Annotated[int, ValueRange(3, 10), ctype("char")]

* Duplicated annotations are not removed: ``Annotated[int, ValueRange(3,
10)] != Annotated[int, ValueRange(3, 10), ValueRange(3, 10)]``

* ``Annotated`` can be used as a higher-order alias::

    T = TypeVar('T')
    Vec = Annotated[List[Tuple[T, T]], MaxLen(10)]
    # Vec[int] == Annotated[List[Tuple[int, int]], MaxLen(10)]



consuming annotations
++++++++++++++++++++++

Ultimately, the decision of how to interpret the annotations (if at all)
rests with the tool or library encountering the ``Annotated`` type. Such a
tool or library can scan through the annotations to determine if they are
of interest (e.g., using ``isinstance``).
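
As a rough sketch, assuming (as in the sketch above) that the extra
arguments of ``Annotated`` are reachable at runtime through a hypothetical
``__metadata__`` attribute, such a scan could look like::

  def annotations_of(hint, kind):
      """Collect the metadata entries of a given type; anything else is ignored."""
      return [m for m in getattr(hint, '__metadata__', ()) if isinstance(m, kind)]

  # e.g. annotations_of(Annotated[int, ValueRange(3, 10), ctype("char")], ctype)
  # returns [ctype("char")] and ignores the ValueRange annotation.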

**Unknown annotations**
  When a tool or a library does not support annotations or encounters an
unknown annotation, it should just ignore it and treat the annotated type as
the underlying type. For example, if we were to add an annotation that is
not an instance of ``struct.ctype`` to the annotation for ``name`` (e.g.,
``Annotated[str, 'foo', struct.ctype("<10s")]``), the ``unpack`` method
should ignore it.

**Namespacing annotations**
  We do not need namespaces for annotations since the class used by the
annotations acts as a namespace.

**Multiple annotations**
  It's up to the tool consuming the annotations to decide whether the
client is allowed to have several annotations on one type and how to merge
those annotations.

  Since the ``Annotated`` type allows you to put several annotations of the
same (or different) type(s) on any node, the tools or libraries consuming
those annotations are in charge of dealing with potential duplicates. For
example, if you are doing value range analysis you might allow this::

    T1 = Annotated[int, ValueRange(-10, 5)]
    T2 = Annotated[T1, ValueRange(-20, 3)]

  Flattening nested annotations, this translates to::

    T2 = Annotated[int, ValueRange(-10, 5), ValueRange(-20, 3)]

  An application consuming this type might choose to reduce these
annotations via an intersection of the ranges, in which case ``T2`` would
be treated equivalently to ``Annotated[int, ValueRange(-10, 3)]``.

  An alternative application might reduce these via a union, in which case
``T2`` would be treated equivalently to ``Annotated[int, ValueRange(-20,
5)]``.

  Other applications may decide to not support multiple annotations and
throw an exception.
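
  As one possible sketch of the intersection policy (``ValueRange`` is the
  hypothetical marker used in the examples above)::

    from dataclasses import dataclass

    @dataclass
    class ValueRange:
        lo: int
        hi: int

    def merge_ranges(ranges):
        """Intersect all ValueRange annotations found on a type."""
        lo = max(r.lo for r in ranges)
        hi = min(r.hi for r in ranges)
        if lo > hi:
            raise ValueError("annotations describe an empty range")
        return ValueRange(lo, hi)

    # merge_ranges([ValueRange(-10, 5), ValueRange(-20, 3)]) == ValueRange(-10, 3)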

References
===========

.. [struct-examples]
   https://docs.python.org/3/library/struct.html#examples

.. [dataclass]
   https://docs.python.org/3/library/dataclasses.html

.. [MyPy]
   https://github.com/python/mypy

.. [MyPy-plugins]
   https://mypy.readthedocs.io/en/latest/extending_mypy.html#extending-mypy-using-plugins

.. [PyCharm]
   https://www.jetbrains.com/pycharm/

.. [Pyre]
   https://pyre-check.org/

.. [pytype]
   https://github.com/google/pytype

.. [tagged-unions]
   https://en.wikipedia.org/wiki/Tagged_union