[Python-ideas] Type hints for text/binary data in Python 2+3 code

Andrey Vlasovskikh andrey.vlasovskikh at gmail.com
Fri Mar 18 21:45:25 EDT 2016


With the addition of the comment-based syntax [3] for Python 2.7 + 3 in PEP 0484, having a Python 2/3 compatible way of adding type hints for text and binary values becomes important.

Following issue #1141 at the Mypy GitHub site [1], I've come up with a draft proposal based on those ideas that I'd like to discuss here.


# Abstract

This proposal contains recommendations on how to annotate text/binary data in
the newly added PEP 0484 comment-based type hints in order to make them Python
2/3 compatible when the single-source approach to porting from 2 to 3 is used.

It introduces a new type `typing.Text` that represents text data in both Python
2 and 3, deprecates the `str -> unicode` promotion used in type checkers,
suggests an approach for type checkers to find implicit conversion errors by
tracking ASCII text/binary values, and recommends that type checkers warn about
`unicode` in the 2+3 mode.


# Rationale

With the addition of the comment-based syntax for Python 2.7 + 3, having a Python
2/3 compatible way of annotating types of text and binary values becomes
important. Currently, having a single-source code base is the main approach to
2/3 compatibility, so it is highly desirable to have 2/3 compatible
comment-based type hints that would help with porting code from 2 to 2+3 to 3.

While migrating their code from Python 2 to 3, users are most likely to discover
the following types of text/binary errors (presumably in descending order of
their frequency in typical code):

1. Implicit text/binary conversions removed in Python 3
2. Calling changed APIs that accept or return text/binary data
3. Calling removed/changed methods of text/binary types
4. Overriding special text/binary methods and using the related built-ins
   (`str()`, `repr()`, `unicode()`)

Only the first two types of errors -- implicit conversions and calling changed
text/binary APIs -- depend on being able to express the semantics of Python 2+3
compatible text/binary interfaces using type hints.

PEP 0484 doesn't contain any recommendations on how to document various typical
cases in text/binary APIs in order to make type hints 2+3 compatible.


# Proposal

This document is based on some text/binary handling options, and the problems
associated with them, proposed at python/mypy#1141 by Jukka Lehtosalo, Guido van
Rossum, and others [1]. It also takes into account the experience of the PyCharm
team with their pre-PEP484 notation for type hints [2] and with handling Python
2/3 issues reported by users in PyCharm code inspections.


## Handling removed implicit conversions

In addition to the existing types (`bytes`, `str`, `unicode`, `typing.AnyStr`)
let's introduce a new type for *2+3 compatible text data* -- `typing.Text` (should
we add a fake built-in `unicode` type for Python 3 to type checkers instead of
introducing a new name?):

* `typing.Text`: text data
    * Python 2: `unicode`
    * Python 3: `str`
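
As a rough illustration of the mapping above, a `Text` alias could behave at
runtime along the following lines. This is only a sketch of the idea; how the
`typing` module would actually define the name is not specified in this
proposal.

```python
import sys

# Sketch only: pick the text type of the running interpreter.
if sys.version_info[0] >= 3:
    Text = str          # Python 3: text data is `str`
else:
    Text = unicode      # noqa: F821 (Python 2: text data is `unicode`)

# Text literals are instances of Text in both versions,
# binary literals are not.
text_ok = isinstance(u'abc', Text)
bytes_ok = isinstance(b'abc', Text)
```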

As a reminder, the semantics of the existing types are:

* `bytes`: binary data
    * Python 2: `bytes` (== `str`)
    * Python 3: `bytes`
* `str`: "native" string, `type('foo')`
    * Python 2: `str`
    * Python 3: `str`
* `unicode`: Python 2-only text data
    * Python 2: `unicode`
    * Python 3: error
* `typing.AnyStr`: type variable constrained to both text and binary data

With the addition of `typing.Text` it is possible to express the type analogous
to `typing.AnyStr` that doesn't impose any type constraints (should we call it
`typing.BaseString`?):

* `typing.Union[typing.Text, bytes]`: both text and binary data when a type
                                      variable isn't needed
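
To make the difference concrete, here is a small sketch that contrasts the two
spellings using the comment-based syntax. The function names `concat` and
`total_len` are invented for this illustration, and the code assumes that the
proposed `typing.Text` is importable.

```python
from typing import AnyStr, Text, Union


def concat(a, b):
    # type: (AnyStr, AnyStr) -> AnyStr
    # AnyStr is a type variable: both arguments must be text,
    # or both must be binary, and the result has the same type.
    return a + b


def total_len(a, b):
    # type: (Union[Text, bytes], Union[Text, bytes]) -> int
    # No type variable: each argument may independently be
    # text or binary.
    return len(a) + len(b)
```

With these hints a checker would accept `total_len(u'ab', b'cde')` but reject
`concat(u'ab', b'cde')`.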

Using only `typing.Text`, `bytes`, `str`, and `typing.AnyStr` in the type hints
for an API would mean that this API is Python 2 and 3 compatible with respect to
implicit text/binary conversions.

For Python 2 we should *not* have the implicit `str` -> `unicode` promotion
since it hides errors related to implicit conversions.
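
To illustrate what the promotion hides: adding non-ASCII binary data to text
passes a checker that promotes `str` to `unicode`, yet it fails at runtime in
both versions (`UnicodeDecodeError` in Python 2, `TypeError` in Python 3). The
helper name `concat_error` is made up for this sketch.

```python
def concat_error(text, data):
    """Return the name of the exception raised by `text + data`, or None."""
    try:
        text + data
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__
    return None


# Text + non-ASCII binary data fails in both Python 2 and Python 3.
mixed = concat_error(u'caf\xe9 ', b'\xff')

# Text + text works everywhere.
same = concat_error(u'caf\xe9 ', u'au lait')
```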

For 7-bit ASCII string literals in Python 2, type checkers should infer special
internal types `typing._AsciiStr` and `typing._AsciiUnicode` that are compatible
with both `str` and `unicode` (a *special type-checking rule* is needed):

    class _AsciiStr(str):
        pass

    class _AsciiUnicode(unicode):
        pass

The details of inferring ASCII types are up to specific type checkers.
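
One possible way to classify literals is sketched below as a standalone
function. The name `infer_literal_type` and the string results are hypothetical;
a real checker would work on its own internal type representation rather than
on runtime values.

```python
def infer_literal_type(value):
    """Classify the value of a Python 2 string literal (hypothetical helper).

    `value` is binary data for a `str` literal and text for a
    `unicode` literal.
    """
    if isinstance(value, bytes):
        # bytearray() yields ints in both Python 2 and Python 3.
        is_ascii = all(b < 128 for b in bytearray(value))
        return '_AsciiStr' if is_ascii else 'str'
    is_ascii = all(ord(ch) < 128 for ch in value)
    return '_AsciiUnicode' if is_ascii else 'unicode'
```

Only literals whose every byte or code point is below 128 receive the special
ASCII types; everything else keeps its ordinary type.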

In the 2+3 mode, type checkers should show errors when comment-based or
stub-based type hints contain `unicode`.
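
A deliberately naive sketch of such a check for comment-based hints follows.
The function name `find_unicode_hints` is hypothetical, and the regex-based
scan is an assumption made for brevity: a real checker would inspect parsed
type comments rather than raw source lines.

```python
import re

# Matches PEP 484 type comments such as "# type: (unicode) -> str".
_TYPE_COMMENT = re.compile(r'#\s*type:\s*(.*)$')
_UNICODE_NAME = re.compile(r'\bunicode\b')


def find_unicode_hints(source):
    """Return 1-based line numbers whose type comment mentions `unicode`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        match = _TYPE_COMMENT.search(line)
        if match and _UNICODE_NAME.search(match.group(1)):
            hits.append(lineno)
    return hits
```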


## Examples of typical 2+3 functions

A function that accepts "native" strings. It uses implicit ASCII
unicode-to-str conversion at runtime in Python 2 and accepts only text data in
Python 3:

    def getattr(o: Any, name: str, default: Any = None) -> Any: ...

A function that does implicit str-to-unicode conversion at runtime in Python 2
and accepts only text data in Python 3:

    def hello_rus(name: Text) -> Text:
        return u'Привет, ' + name
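
Since Python 2 has no annotation syntax, the same function in single-source
2+3 code would use the comment-based syntax [3]. A minimal sketch, assuming the
proposed `typing.Text` is importable:

```python
# -*- coding: utf-8 -*-
from typing import Text


def hello_rus(name):
    # type: (Text) -> Text
    return u'Привет, ' + name
```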

A function that transforms text-to-text or binary-to-binary or handles both text
and binary data in some other way in both Python 2 and 3:

    def listdir(path: AnyStr) -> AnyStr: ...

A function that works with both text and binary data in Python 2 and 3, where
the author of the function for some reason doesn't want to have a type variable
associated with `AnyStr`:

    def upper_len(s: Union[bytes, Text]) -> int:
        return len(s.upper())

A PEP-3333 compatible WSGI app function that uses "native" strings for environ
and headers data while returning an iterable over binary data in both Python 2
and 3:

    def app(environ: Dict[str, Any],
            start_response: Callable[[str, List[Tuple[str, str]]], None]) \
            -> Iterable[bytes]: ...
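
For reference, a runnable body for such an app might look like the sketch
below, written with the comment-based syntax. The response contents are
invented for illustration, and the `start_response` signature is the simplified
one from the hint above.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def app(environ, start_response):
    # type: (Dict[str, Any], Callable[[str, List[Tuple[str, str]]], None]) -> Iterable[bytes]
    # Status and header values are "native" strings in both versions.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    # The response body is always binary data.
    return [b'Hello, world\n']
```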

A type inference example that features a type checker being able to infer
`typing._AsciiStr` or `typing._AsciiUnicode` types for Python 2 using the
functions defined above:

    method_name = u'update'   # _AsciiUnicode
    getattr({}, method_name)  # OK, implicit ASCII-only unicode-to-bytes in Py2

    nonascii_data = b'\xff'   # str (non-ASCII, so not _AsciiStr)
    hello_rus(nonascii_data)  # Type checker warning
                              # Non-ASCII bytes are not compatible with Text

                              # _AsciiUnicode + bytes
    u'foo' + b'\xff'          # Type checker warning
                              # Non-ASCII bytes are not compatible with Text

    def f(x: AnyStr, y: AnyStr) -> AnyStr:
        return os.path.join('base', x, y)  # _AsciiStr compatible with AnyStr
                                           # since it's compatible with
                                           # both str and unicode

There are cases mentioned in [1] where more advanced type inference rules are
required in order to be able to handle ASCII types. It remains unclear if
these rules would be easy enough to implement in type checkers.


## Handling other types of text/binary errors

No new types besides `typing.Text` are needed in order to find the other types
of errors listed in the Rationale section.

Based on the type hints that use the above text / binary types, type checkers
in the 2+3 mode should show errors when the user accesses attributes of these
types that are not available in both Python 2 and Python 3.
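
A toy sketch of such an attribute check is shown below. The table
`PY2_ONLY_ATTRS` and the function `check_attr` are hypothetical, and the table
is hand-written and far from complete; a real checker would derive this
information from the Python 2 and Python 3 stubs.

```python
# Illustrative, incomplete table of attributes that exist in
# Python 2 but not in Python 3.
PY2_ONLY_ATTRS = {
    # Python 2 unicode has decode(); Python 3 str does not.
    'Text': {'decode'},
    # Python 2 str has encode() and format(); Python 3 bytes has neither.
    'bytes': {'encode', 'format'},
}


def check_attr(type_name, attr):
    """Return a warning message, or None if `attr` is not in the table."""
    if attr in PY2_ONLY_ATTRS.get(type_name, set()):
        return '%s.%s is not available in Python 3' % (type_name, attr)
    return None
```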


  [1]: https://github.com/python/mypy/issues/1141
  [2]: https://github.com/JetBrains/python-skeletons#types
  [3]: https://www.python.org/dev/peps/pep-0484/#suggested-syntax-for-python-2-7-and-straddling-code


-- 
Andrey Vlasovskikh

Web: http://pirx.ru/


