Type hints for text/binary data in Python 2+3 code

With the addition of the comment-based syntax [3] for Python 2.7 + 3 in PEP 0484, having a Python 2/3 compatible way of adding type hints for text and binary values becomes important. Following issue #1141 at the Mypy GitHub site [1], I've come up with a draft proposal based on those ideas that I'd like to discuss here.

# Abstract

This proposal contains recommendations on how to annotate text/binary data in the newly added PEP 0484 comment-based type hints in order to make them Python 2/3 compatible when the single-source approach to porting from 2 to 3 is used. It introduces a new type `typing.Text` that represents text data in both Python 2 and 3, deprecates the `str -> unicode` promotion used in type checkers, suggests an approach for type checkers to find implicit conversion errors by tracking ASCII text/binary values, and recommends that type checkers warn about `unicode` in the 2+3 mode.

# Rationale

With the addition of the comment-based syntax for Python 2.7 + 3, having a Python 2/3 compatible way of annotating types of text and binary values becomes important. Currently having a single-source code base is the main approach to 2/3 compatibility, so it is highly desirable to have 2/3 compatible comment-based type hints that would help porting code from 2 to 2+3 to 3.

While migrating their code from Python 2 to 3, users are most likely to discover the following types of text/binary errors (presumably, in descending order of their frequency in typical code):

1. Implicit text/binary conversions removed in Python 3
2. Calling changed APIs that accept or return text/binary data
3. Calling removed/changed methods of text/binary types
4. Overriding special text/binary methods and using the related built-ins (`str()`, `repr()`, `unicode()`)

Only the first two types of errors -- implicit conversions and calling changed text/binary APIs -- depend on being able to express the semantics of Python 2+3 compatible text/binary interfaces using type hints.
PEP 0484 doesn't contain any recommendations on how to document various typical cases in text/binary APIs in order to make type hints 2+3 compatible.

# Proposal

This document is based on some text/binary handling options, and the problems associated with them, proposed at python/mypy#1141 by Jukka Lehtosalo, Guido van Rossum, and others [1]. It also takes into account the experience of the PyCharm team with their pre-PEP484 notation for type hints [2] and with handling Python 2/3 issues reported by users in PyCharm code inspections.

## Handling removed implicit conversions

In addition to the existing types (`bytes`, `str`, `unicode`, `typing.AnyStr`) let's introduce a new type for *2+3 compatible text data* -- `typing.Text` (should we add a fake built-in `unicode` type for Python 3 to type checkers instead of introducing a new name?):

* `typing.Text`: text data
    * Python 2: `unicode`
    * Python 3: `str`

As a reminder, here are the semantics of the existing types:

* `bytes`: binary data
    * Python 2: `bytes` (== `str`)
    * Python 3: `bytes`
* `str`: "native" string, `type('foo')`
    * Python 2: `str`
    * Python 3: `str`
* `unicode`: Python 2-only text data
    * Python 2: `unicode`
    * Python 3: error
* `typing.AnyStr`: type variable constrained to both text and binary data

With the addition of `typing.Text` it is possible to express a type analogous to `typing.AnyStr` that doesn't impose any type constraints (should we call it `typing.BaseString`?):

* `typing.Union[typing.Text, bytes]`: both text and binary data when a type variable isn't needed

Using only `typing.Text`, `bytes`, `str`, and `typing.AnyStr` in the type hints for an API would mean that this API is Python 2 and 3 compatible with respect to implicit text/binary conversions.

For Python 2 we should *not* have the implicit `str` -> `unicode` promotion since it hides errors related to implicit conversions.
For 7-bit ASCII string literals in Python 2, type checkers should infer special internal types `typing._AsciiStr` and `typing._AsciiUnicode` that are compatible with both `str` and `unicode` (a *special type-checking rule* is needed):

    class _AsciiStr(str): pass
    class _AsciiUnicode(unicode): pass

The details of inferring ASCII types are up to specific type checkers.

In the 2+3 mode type checkers should show errors when comment- or stub-based type hints contain `unicode`.

## Examples of typical 2+3 functions

A function that accepts "native" strings. It uses implicit ASCII unicode-to-str conversion at runtime in Python 2 and accepts only text data in Python 3:

    def getattr(o: Any, name: str, default: Any = None) -> Any: ...

A function that does implicit str-to-unicode conversion at runtime in Python 2 and accepts only text data in Python 3:

    def hello_rus(name: Text) -> Text:
        return u'Привет, ' + name

A function that transforms text-to-text or binary-to-binary or handles both text and binary data in some other way in both Python 2 and 3:

    def listdir(path: AnyStr) -> AnyStr: ...

A function that works with both text and binary data in Python 2 and 3, where the author of the function for some reason doesn't want to have a type variable associated with `AnyStr`:

    def upper_len(s: Union[bytes, Text]) -> int:
        return len(s.upper())

A PEP-3333 compatible WSGI app function that uses "native" strings for environ and headers data while returning an iterable over binary data in both Python 2 and 3:

    def app(environ: Dict[str, Any],
            start_response: Callable[[str, List[Tuple[str, str]]], None]) \
            -> Iterable[bytes]:
        ...
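The examples above use the Python 3 annotation syntax, while the proposal targets the comment-based syntax. As a minimal sketch, the same kinds of signatures could be written in comment form so that one source file runs on both versions. The function names `hello` and `double` are hypothetical and only serve as illustrations; `typing.Text` is assumed to be available from the `typing` module (or its backport on Python 2).

```python
from typing import AnyStr, Text, Union


def hello(name):
    # type: (Text) -> Text
    """Text in, text out: unicode on Python 2, str on Python 3."""
    return u'Hello, ' + name


def double(value):
    # type: (AnyStr) -> AnyStr
    """Text or binary data; the result has the same type as the argument."""
    return value + value


def upper_len(s):
    # type: (Union[bytes, Text]) -> int
    """Either kind of data when no type variable is needed."""
    return len(s.upper())
```

The type comments are ignored at runtime, so the file stays importable on an interpreter that knows nothing about annotations.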
A type inference example that features a type checker being able to infer `typing._AsciiStr` or `typing._AsciiUnicode` types for Python 2 using the functions defined above:

    method_name = u'update'  # _AsciiUnicode
    getattr({}, method_name)  # OK, implicit ASCII-only unicode-to-bytes in Py2

    nonascii_data = b'\xff'  # plain bytes, not _AsciiStr (non-ASCII literal)
    hello_rus(nonascii_data)  # Type checker warning
                              # Non-ASCII bytes are not compatible with Text

    # _AsciiUnicode + bytes
    u'foo' + b'\xff'  # Type checker warning
                      # Non-ASCII bytes are not compatible with Text

    def f(x: AnyStr, y: AnyStr) -> AnyStr:
        return os.path.join('base', x, y)  # _AsciiStr compatible with AnyStr
                                           # since it's compatible with
                                           # both str and unicode

There are cases mentioned in [1] where more advanced type inference rules are required in order to be able to handle ASCII types. It remains unclear whether these rules would be easy enough to implement in type checkers.

## Handling other types of text/binary errors

No new types besides `typing.Text` are needed in order to find the other types of errors listed in the Rationale section.

Based on the type hints that use the above text/binary types, type checkers in the 2+3 mode should show errors when the user accesses attributes of these types that are not available in both Python 2 and Python 3.

[1]: https://github.com/python/mypy/issues/1141
[2]: https://github.com/JetBrains/python-skeletons#types
[3]: https://www.python.org/dev/peps/pep-0484/#suggested-syntax-for-python-2-7-an...

-- Andrey Vlasovskikh Web: http://pirx.ru/

This sounds like a more correct approach, thanks.

Looking at MarkupSafe (and, now, f-strings), would/will it be possible to use `typing.Text` as a base class for even-more abstract string types ("strypes"), e.g. XML, XHTML, HTML4, HTML5, HTML5.1, SQL? There are implicit casts and contextual adaptations/transformations (which MarkupSafe specs a bit).

(I've no real code here, just a general idea that we're not tracking enough string metadata to be safe here.)

On Mar 18, 2016 8:45 PM, "Andrey Vlasovskikh" <andrey.vlasovskikh@gmail.com> wrote:

I believe having separate string types for XML or SQL content is out of the scope of this proposal. In PyCharm we already treat the contents of string literals with SQL as a separate SQL syntax tree and we understand basic string operations like concatenation or formatting. Going beyond that with the help of XML/SQL/etc. string types is possible, but I doubt we need a standard for that. -- Andrey Vlasovskikh Web: http://pirx.ru/

On Mar 22, 2016 4:36 PM, "Andrey Vlasovskikh" <andrey.vlasovskikh@gmail.com> wrote:
> > to use Typing.Text as a base class for even-more abstract string types ("strypes") e.g. XML, XHTML, HTML4, HTML5, HTML5.1, SQL? There are implicit casts and contextual adaptations/transformations (which MarkupSafe specs a bit). (I've no real code here, just a general idea that we're not tracking enough string metadata to be safe here)
>
> I believe having separate string types for XML or SQL content is out of the scope of this proposal.
>
> In PyCharm we already treat the contents of string literals with SQL as a separate SQL syntax tree and we understand basic string operations like concatenation or formatting. Going beyond that with the help of XML/SQL/etc. string types is possible, but I doubt we need a standard for that.

At the least, it would be helpful to either have:

a) a slot / attribute for additional string type metadata (is this an object subclass that I can just add attrs to?)
b) a minimal Text base class

SQL is harder because dialects.

... OT (I'm finished): https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py On Mar 22, 2016 4:40 PM, "Wes Turner" <wes.turner@gmail.com> wrote:

* Text.encoding
* Text.lang (urn:ietf:rfc:3066)
* ...

IRIs and RDF literals may be useful test cases here:

* https://en.m.wikipedia.org/wiki/Comparison_of_Unicode_encodings
* https://en.m.wikipedia.org/wiki/Control_character
* https://en.wikipedia.org/wiki/Internationalized_resource_identifier
* is this already punycoded?
* http://rdflib.readthedocs.org/en/stable/rdf_terms.html
* http://rdflib.readthedocs.org/en/stable/apidocs/rdflib.html#rdflib.term.Lite... (value, datatype, lang (RFC 3066))

On Mar 22, 2016 4:40 PM, "Wes Turner" <wes.turner@gmail.com> wrote:

I like the way this is going. I think it needs to be a separate PEP; PEP 484 is already too long and this topic deserves being written up carefully (like you have done here).

I have a few remarks.

* Do we really need _AsciiUnicode? I see the point of _AsciiStr, because Python 2 accepts 'x' + u'' but fails '\xff' + u'', so 'x' needs to be of type _AsciiStr while '\xff' should not (it should be just str). However there's no difference in how u'x' is treated from how u'\u1234' or u'\xff' are treated -- none of them can be concatenated to '\xff' and all of them can be concatenated to 'x'.

* It would be helpful to spell out exactly what is and isn't allowed when different core types (bytes, str, unicode, Text) meet in Python 2 and in Python 3. Something like a table with a row and a column for each and the type of x+y (or "error") in each of the cells.

* I propose that Python 2+3 mode is just the intersection of what Python 2 and Python 3 mode allow. (In mypy, I don't think we'll implement this -- users will just have to run mypy twice, with and without --py2. But for PyCharm it makes sense to be able to declare this. Yet I think it would be good not to have to spell out separately which rules it uses; defining it as the intersection of 2 and 3 is all we need.)

On Fri, Mar 18, 2016 at 6:45 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)
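One way to obtain the x+y table asked for above is to let the interpreter fill it in. The following is only a sketch under stated assumptions: it runs on Python 3 as written, so it covers just `bytes` and `str`; the names `SAMPLES`, `concat_type` and `build_table` are hypothetical. Running the same idea under Python 2 would need a `unicode` sample added and would also have to catch `UnicodeDecodeError` for non-ASCII samples.

```python
import itertools

# Sample values for each core type that exists on Python 3.
SAMPLES = [('bytes', b'x'), ('str', u'x')]


def concat_type(left, right):
    """Return the type name of left + right, or 'error' if it raises TypeError."""
    try:
        return type(left + right).__name__
    except TypeError:
        return 'error'


def build_table(samples):
    """Map (row type, column type) to the outcome of row + column."""
    return {
        (row_name, col_name): concat_type(row_value, col_value)
        for (row_name, row_value), (col_name, col_value)
        in itertools.product(samples, repeat=2)
    }


TABLE = build_table(SAMPLES)
```

On Python 3 the mixed cells come out as `'error'`, which is the behaviour the thread relies on when it suggests checking code in both modes.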

One thing I can do without committing to too much is to add a Text type to typing.py. It would have no (defined) behavior but it could be imported and used in annotations. Then mypy and other checkers could start using it and we could even experiment with different proposals without having to make more changes to typing.py (which we've found are hard to push out, because it's in the 3.5 stdlib -- it's provisional so we can change it, but we can't easily change what's already in 3.5.0 or 3.5.1). Should it be possible to subclass Text, and what should it mean? Or perhaps at runtime (i.e. in typing.py) Text would just be an alias for str in Python 3 and an alias for unicode in Python 2? That's easiest.

On Mar 22, 2016, at 10:58, Guido van Rossum <guido@python.org> wrote:
It seems like the worry is that you may need to change it again (e.g., to some "virtual type" that's like unicode in 2.7 except that it doesn't have a constructor from str or methods that accept str), and changing something that's in the stdlib (even in a provisional module) is hard? If so, could you define it as an alias for str in Python 3.5+, and leave it up to backports to define appropriately? Of course you're writing the 2.7 backport, and you'll define it as an alias for unicode--but if you later decide that was too inflexible, you can change the backport without having any effect on the PEP, 3.5 docs, or 3.5 stdlib.

Defining typing.Text as an alias to str in Python 3 and unicode for Python 2 (the way six.text_type is defined) looks like a good idea. I would recommend prohibiting subclassing of typing.Text for the moment in the module docs and in PEP 484. We can always allow subclassing it later, but right now it's not clear whether it's safe or not, given the fact that it's defined conditionally for 2/3. -- Andrey Vlasovskikh Web: http://pirx.ru/
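To make the alias idea concrete, here is a small sketch of what it implies at runtime on Python 3, where `typing.Text` is the built-in `str` itself rather than a new class. The helper name `describe` is hypothetical.

```python
from typing import Text


def describe(value):
    # type: (object) -> str
    """Classify a value as text or not, using the alias in isinstance()."""
    return 'text' if isinstance(value, Text) else 'not text'


# On Python 3 the alias and the built-in are the same object.
ALIAS_IS_STR = Text is str
```

Because the alias is just another name for the built-in, a class written as `class Markup(Text)` would simply be a `str` subclass on Python 3 and a `unicode` subclass on Python 2, which is part of why the thread hesitates to allow subclassing yet.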

I would like to experiment with various text/binary types for Python 2 and 3 for some time before coming up with a PEP about it. And I would like everybody interested in 2/3 compatible type hints to join the discussion. My perspective (mostly PyCharm-specific) might be a bit narrow here.
I was concerned with UnicodeEncodeErrors in Python 2 during implicit conversions from unicode to bytes: `getattr(obj, u'Non-ASCII-name')`. There are several places in the Python 2 API where these ASCII-based unicode->bytes conversions take place, so the _AsciiUnicode type comes to mind.
Agreed. I'll try to come up with specific rules for handling text/binary types (bytes, str, unicode, Text, _Ascii*) in Python 2 and 3. For me the rules for dealing with _Ascii* look the most problematic at the moment, as it's unclear how these types should propagate via text-handling functions.
Yes, there is no need for a specific 2+3 mode; I was really referring to the intersection of the Python 2 and 3 APIs when the user accesses a text/binary method not available in both. -- Andrey Vlasovskikh Web: http://pirx.ru/

On Tue, Mar 22, 2016 at 3:18 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
As you wish!
OK, so you want the type of u'hello' to be _AsciiUnicode but the type of u'Здравствуйте' to be just unicode, right? And getattr()'s second argument would be typed as... What?
You can try that out at runtime though.
Cool. -- --Guido van Rossum (python.org/~guido)

On Wed, Mar 23, 2016 at 2:39 PM, Guido van Rossum <guido@python.org> wrote:
AIUI, getattr's second argument is simply 'str'; but in Python 2, _AsciiUnicode (presumably itself a subclass of unicode) can be implicitly promoted to str. A non-ASCII attribute name works fine, but getattr converts unicode to str using the 'ascii' codec. ChrisA
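The implicit step described here can be emulated explicitly, which makes it easier to see which names are safe. This is a sketch, not Python 2's actual implementation: the helper names are hypothetical, and the code simply calls the 'ascii' codec by hand so that it also runs on Python 3.

```python
def py2_style_native_name(name):
    """Encode a text name with the 'ascii' codec, as Python 2 does implicitly."""
    return name.encode('ascii')


def is_safe_attribute_name(name):
    """Return True if the ASCII conversion of a text name would succeed."""
    try:
        py2_style_native_name(name)
    except UnicodeEncodeError:
        return False
    return True
```

A type such as `_AsciiUnicode` would in effect record, at type-checking time, that `is_safe_attribute_name` is known to hold for a given literal.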

Right. I'm not sure that a non-ASCII attribute name is fine in Python 2 though. -- Andrey Vlasovskikh Web: http://pirx.ru/

On Wed, Mar 23, 2016 at 6:45 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
It's legal. I don't know that it's a good idea, but it is legal.

    rosuav@sikorsky:~$ python
    Python 2.7.11+ (default, Feb 22 2016, 16:38:42)
    [GCC 5.3.1 20160220] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
ChrisA

The type of the second argument would be str, the "native string" type. If people use `from __future__ import unicode_literals` then there are many places in Python 2 where str is expected but an ASCII-unicode literal is given. Having the internal _AsciiUnicode type that inherits from unicode while being compatible with str (and bytes) would solve this issue. -- Andrey Vlasovskikh Web: http://pirx.ru/

Upon further investigation of the problem I've come up with an alternative idea that looks simpler and yet is still capable of finding most text/binary conversion errors.

Here is a rendered Markdown version: https://gist.github.com/vlasovskikh/1a8d5effe95d5944b919

## TL;DR

* Introduce `typing.Text` for text data in Python 2+3
* `bytes`, `str`, `unicode`, `typing.Text` in type hints mean whatever they mean at runtime for Python 2 or 3
* Allow `str -> unicode` and `unicode -> str` promotions for Python 2
* Type checking for Python 2 *and* Python 3 actually finds most text/binary errors
* A few false negatives for Python 2 are not worth special handling besides possible ad-hoc handling of non-ASCII literal conversions

## Summary for Python users

If you want your code to be Python 2+3 compatible:

* Write text/binary type hints in 2+3 compatible comments
* Use `typing.Text` for text data, `bytes` for binary data
* Use `str` only for rare cases of "native strings"
* Don't use `unicode` since it's absent in Python 3
* Run a type checker for *both* Python 2 and Python 3

## Summary for authors of type checkers

The semantics of the types `bytes`, `str`, `unicode`, `typing.Text` and the type checking rules for them should match the *runtime behavior* of these types in Python 2 and Python 3, depending on the Python 2 or 3 mode. Using the runtime semantics for the types is easy to understand while still catching most errors. The Python 2+3 compatibility mode is just a sum of Python 2 and Python 3 warnings.

Type checkers *should* promote `str`/`bytes` to `unicode`/`Text` and `unicode`/`Text` to `str`/`bytes` for Python 2. Most text/binary conversion errors can be found by running a type checker for Python 2 *and* for Python 3.

## typing.Text: Python 2+3 compatible type for text data

The `typing.Text` type is a Python 2+3 compatible type for text data.
It's defined as follows:

    import sys

    if sys.version_info < (3,):
        Text = unicode
    else:
        Text = str

For a Python 2+3 compatible type for binary data use `bytes`, which is available in both 2 and 3.

## Implicit text/binary conversions

In Python 2 text data is implicitly converted to binary data and vice versa using the ASCII encoding. Only if the data isn't ASCII-compatible is a `UnicodeEncodeError` or a `UnicodeDecodeError` raised. This results in many programs that aren't well-tested regarding non-ASCII data handling.

In Python 3 there is no such implicit conversion: mixing text and binary data in an operation such as concatenation raises a `TypeError`. A type checker run in the Python 3 mode will find most Python 2 implicit conversion errors.

## Checking for Python 2+3 compatibility

In order to be Python 2+3 compatible a program has to pass *both* Python 2 and Python 3 type checking. In other words, the warnings found in the Python 2+3 compatible mode are a simple sum of Python 2 warnings and Python 3 warnings.

## Runtime type compatibility

Here is a table of types whose values are compatible at runtime. Columns are the expected types, rows are the actual types:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | * F   | * .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | * F   | * F   | . F

Each cell contains two characters: the result in Python 2 and in Python 3 respectively. Abbreviations:

* `.`: types are compatible
* `F`: types are not compatible
* `*`: types are compatible, ignoring implicit ASCII conversions

At runtime in Python 2 `str` is compatible with `unicode` and vice versa (ignoring possible implicit ASCII conversion errors).

Using `unicode` in Python 3 is always an error since there is no `unicode` name in Python 3.

As you can see from the table above, many implicit ASCII conversion errors in a Python 2 program can be found just by running a type checker in the Python 3 mode.
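The "pass both modes" rule can be read straight off the runtime table. As a sketch, the table is transcribed below as data (the names `TYPES`, `RUNTIME_TABLE` and `compatible_in_2_and_3` are hypothetical), and a pair of types counts as 2+3 compatible when neither the Python 2 nor the Python 3 character of its cell is `F`.

```python
# Column order of the table: the expected types.
TYPES = ('Text', 'bytes', 'str', 'unicode')

# Rows are the actual types; each cell is "<Python 2> <Python 3>".
RUNTIME_TABLE = {
    'Text':    ('. .', '* F', '* .', '. F'),
    'bytes':   ('* F', '. .', '. F', '* F'),
    'str':     ('* .', '. F', '. .', '* F'),
    'unicode': ('. F', '* F', '* F', '. F'),
}


def compatible_in_2_and_3(actual, expected):
    """True if the pair is accepted in both the Python 2 and Python 3 modes."""
    py2, py3 = RUNTIME_TABLE[actual][TYPES.index(expected)].split()
    return py2 != 'F' and py3 != 'F'


# All (actual, expected) pairs that pass both modes.
PASSING = sorted(
    (actual, expected)
    for actual in TYPES
    for expected in TYPES
    if compatible_in_2_and_3(actual, expected)
)
```

Under this reading only five pairs pass, and the two mixed ones (`Text` to `str` and `str` to `Text`) are exactly the cells whose Python 2 character is `*`.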
The only problematic conversions that may result in errors are `Text` to `str` and vice versa in Python 2.

Example 1. `Text` to `str`

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo(..., u'привет')  # False negative: no warning for non-ASCII in Python 2

Example 2. `str` to `Text`

    def foo(x):
        # type: (Text) -> Any
        return u'Привет, ' + x

    foo('Мир')  # False negative: no warning for non-ASCII in Python 2

For non-ASCII text literals passed to functions that expect `Text` or `str` in Python 2, a type checker can analyze the contents of the literal and show additional warnings based on this information. For non-ASCII data coming from sources other than literals this check would be more complicated.

To summarize, with this type compatibility table in place, a type checker run for *both* Python 2 and Python 3 is able to find *almost all errors* related to text and binary data except for a few text to "native string" conversions and vice versa in Python 2.

## Current Mypy type compatibility (non-runtime semantics)

Mypy implicitly promotes `str` to `unicode` for Python 2, but it doesn't promote `unicode` to `str`. Here is an example of a Python 2 program that is correct given the runtime type compatibility semantics shown in the table above, but is incorrect for Mypy:

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo({}, u'upper')  # False positive warning in Mypy for ASCII in Python 2

Here is the type compatibility table for the current version of Mypy:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | F F   | F .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | F F   | F F   | . F

Running the Mypy type checker in Python 2 mode *and* Python 3 mode for the same program would find almost all implicit ASCII conversion errors except for `str` to `Text` conversions.
To summarize, the current Mypy type compatibility table covers almost all text and binary data handling errors when used for *both* Python 2 and Python 3. But it doesn't notice errors in "native string" to text conversions in Python 2 and produces *false warnings* for text to "native string" conversions in Python 2. -- Andrey Vlasovskikh Web: http://pirx.ru/

On Mar 24, 2016, at 17:00, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
> The only problematic conversions that may result in errors are `Text` to `str` and vice versa in Python 2.
So any time you use Text strings together with strings from sys.argv, sys.stdin/raw_input(), os.listdir(), ZipFile, csv.reader, etc., all of which are native str, they'll pass as valid in a 2+3 test, even though they're not actually valid in 2.x?

Yes, these errors will go unnoticed, unfortunately. But this guarantees that there will be no false positive warnings related to text/binary types. And a model of text/binary types that matches the runtime semantics is easier for users. These kinds of errors would have been more important to find if users had been expected to port their code from Python 3 back to Python 2 more often than from 2 to 3.

Speaking of ways to actually find these errors, one idea discussed in the issue tracker of Mypy [1] was to have a separate _AsciiStr type for things that are certainly ASCII-compatible. However, treating all str values as non-ASCII by default would result in false positive warnings. We could have a reverse type, say, _NonAsciiStr (there should be a better name for that), not compatible with Text, for things we know are non-ASCII for sure:

* Non-ASCII str literals
* Functions like those you mentioned above

There will be false negatives in cases not covered by _NonAsciiStr, but at least there will be a way of documenting non-ASCII native str interfaces for the users who care about this kind of Python 2 error. The downside is that _NonAsciiStr is harder to understand and apply correctly than str.

[1]: https://github.com/python/typing/issues/19

-- Andrey Vlasovskikh Web: http://pirx.ru/
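For the literal case, the check could be sketched with the standard `ast` module. This is an illustration only, not how any existing type checker works: the function names are hypothetical, the source is parsed with the Python 3 grammar of the running interpreter, and it merely lists string and bytes literals containing non-ASCII data, which is the set of values a `_NonAsciiStr`-like type would start from.

```python
import ast


def _is_ascii(value):
    """True if every character (str) or byte (bytes) is in the 7-bit range."""
    if isinstance(value, bytes):
        return all(byte < 128 for byte in value)
    return all(ord(char) < 128 for char in value)


def non_ascii_literals(source):
    """Return (line number, value) pairs for non-ASCII str/bytes literals."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, (str, bytes)):
            if not _is_ascii(node.value):
                found.append((node.lineno, node.value))
    # ast.walk() does not guarantee source order, so sort by line number.
    found.sort(key=lambda item: item[0])
    return found
```

Data that reaches a function from `sys.argv`, files or similar sources never appears as a literal, so a check of this kind cannot see it.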

On Fri, Mar 25, 2016 at 12:00 AM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
> ...

I'm against this, as it would seem to make str and unicode pretty much the same type in Python 2, and thus Python 2 mode seems much weaker than necessary. I wrote a more detailed reply in the mypy issue tracker (https://github.com/python/mypy/issues/1141#issuecomment-201799761). I'm not copying it all here since much of that is somewhat mypy-specific and related to the rest of the discussion on that issue, but I'll summarize my main points here.

I prefer the idea of doing better type checking in Python 2 mode for str and unicode, though I suspect we need to implement a prototype to decide whether it will be practical.

> * Type checking for Python 2 *and* Python 3 actually finds most text/binary errors

This may be true, but I'm worried about usability for Python 2 code bases. Also, the effort needed to pass type checking in both modes (which is likely pretty close to the effort of a full Python 3 migration, if the entire code will be annotated) might be impractical for a large Python 2 code base.

> ## Summary for authors of type checkers

At least for mypy, the Python 2+3 compatibility mode would likely take twice as much CPU to run, which is a pretty high cost as type checking speed is one of the biggest open issues we have right now.

> ## Runtime type compatibility
> ...
> Each cell contains two characters: the result in Python 2 and in Python 3 respectively. Abbreviations:
> ...
> * `*`: types are compatible, ignoring implicit ASCII conversions

Am I reading this right if I understand this as "considered valid during type checking but may fail at runtime"?

> For non-ASCII text literals passed to functions that expect `Text` or `str`

I wonder what the check would look like in the latter case? I can't imagine how this would work for non-literals.

Jukka

This sounds like a more correct approach, thanks. Looking at MarkupSafe (and, now, f-strings), would/will it be possible to use Typing.Text as a base class for even-more abstract string types ("strypes") e.g. XML, XHTML, HTML4, HTML5, HTML5.1, SQL? There are implicit casts and contextual adaptations/transformations (which MarkupSafe specs a bit). (I've no real code here, just a general idea that we're not tracking enough string metadata to be safe here) On Mar 18, 2016 8:45 PM, "Andrey Vlasovskikh" <andrey.vlasovskikh@gmail.com> wrote:

I believe having separate string types for XML or SQL content is out of the scope of this proposal. In PyCharm we already treat the contents of string literals with SQL as a separate SQL syntax tree and we understand basic string operations like concatenation or formatting. Going beyond that with the help of XML/SQL/etc. string types is possible, but I doubt we need a standard for that. -- Andrey Vlasovskikh Web: http://pirx.ru/

On Mar 22, 2016 4:36 PM, "Andrey Vlasovskikh" <andrey.vlasovskikh@gmail.com> wrote:
to use Typing.Text as a base class for even-more abstract string types ("strypes") e.g. XML, XHTML, HTML4, HTML5, HTML5.1, SQL? There are implicit casts and contextual adaptations/transformations (which MarkupSafe specs a bit). (I've no real code here, just a general idea that we're not tracking enough string metadata to be safe here)
I believe having separate string types for XML or SQL content is out of
the scope of this proposal.
In PyCharm we already treat the contents of string literals with SQL as a
separate SQL syntax tree and we understand basic string operations like concatenation or formatting. Going beyond that with the help of XML/SQL/etc. string types is possible, but I doubt we need a standard for that. At the least, it would be helpful to either have: a) a slot / attribute for additional string type metadata (is this an object subclass that I can just add attrs to) b) a minimal Text base class SQL is harder because dialects.

... OT (I'm finished): https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py On Mar 22, 2016 4:40 PM, "Wes Turner" <wes.turner@gmail.com> wrote:

* Text.encoding * Text.lang (urn:ietf:rfc:3066) * ... IRIs and RDF literals may be useful test cases here: * https://en.m.wikipedia.org/wiki/Comparison_of_Unicode_encodings * https://en.m.wikipedia.org/wiki/Control_character * https://en.wikipedia.org/wiki/Internationalized_resource_identifier * is this already punycoded? * http://rdflib.readthedocs.org/en/stable/rdf_terms.html * http://rdflib.readthedocs.org/en/stable/apidocs/rdflib.html#rdflib.term.Lite... (value, datatype, lang (RFC 3066)) On Mar 22, 2016 4:40 PM, "Wes Turner" <wes.turner@gmail.com> wrote:

I like the way this is going. I think it needs to be a separate PEP; PEP 484 is already too long and this topic deserves being written up carefully (like you have done here). I have a few remarks. * Do we really need _AsciiUnicode? I see the point of _AsciiStr, because Python 2 accepts 'x' + u'' but fails '\xff' + u'', so 'x' needs to be of type _AsciiStr while '\xff' should not (it should be just str). However there's no difference in how u'x' is treated from how u'\u1234' or u'\xff' are treated -- none of them can be concatenated to '\xff' and all of them can be concatenated to _'x'. * It would be helpful to spell out exactly what is and isn't allowed when different core types (bytes, str, unicode, Text) meet in Python 2 and in Python 3. Something like a table with a row and a column for each and the type of x+y (or "error") in each of the cells. * I propose that Python 2+3 mode is just the intersection of what Python 2 and Python 3 mode allow. (In mypy, I don't think we'll implement this -- users will just have to run mypy twice with and without --py2. But for PyCharm it makes sense to be able to declare this. Yet I think it would be good not to have to spell out separately which rules it uses, defining it as the intersection of 2 and 3 is all we need. On Fri, Mar 18, 2016 at 6:45 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

One thing I can do without committing to too much is to add a Text type to typing.py. It would have no (defined) behavior but it could be imported and used in annotations. Then mypy and other checkers could start using it and we could even experiment with different proposals without having to make more changes to typing.py (which we've found are hard to push out, because it's in the 3.5 stdlib -- it's provisional so we can change it, but we can't easily change what's already in 3.5.0 or 3.5.1). Should it be possible to subclass Text, and what should it mean? Or perhaps at runtime (i.e. in typing.py) Text would just be an alias for str in Python 3 and an alias for unicode in Python 2? That's easiest.

On Mar 22, 2016, at 10:58, Guido van Rossum <guido@python.org> wrote:
It seems like the worry is that you may need to change it again (e.g., to some "virtual type" that's like unicode in 2.7 except that it doesn't have a constructor from str or methods that accept str), and changing something that's in the stdlib (even in a provisional module) is hard? If so, could you define it as an alias for str in Python 3.5+, and leave it up to backports to define appropriately? Of course you're writing the 2.7 backport, and you'll define it as an alias for unicode--but if you later decide that was too inflexible, you can change the backport without having any effect on the PEP, 3.5 docs, or 3.5 stdlib.

Defining typing.Text as an alias to str in Python 3 and unicode for Python 2 (the way six.text_type is defined) looks like a good idea. I would recommend to prohibit subclassing typing.Text at the moment in the module docs and in PEP 484. We can always allow subclassing it later, but right now it's not clear wether it's safe or not given the fact that it's defined conditionally for 2/3. -- Andrey Vlasovskikh Web: http://pirx.ru/

I would like to experiment with various text/binary types for Python 2 and 3 for some time before coming up with a PEP about it. And I would like everybody interested in 2/3 compatible type hints join the discussion. My perspective (mostly PyCharm-specific) might be a bit narrow here.
I was concerned with UnicodeEncodeErrors in Python 2 during implicit conversions from unicode to bytes: getattr(obj, u'Non-ASCII-name') There are several places in the Python 2 API where these ASCII-based unicode->bytes conversions take place, so the _AsciiUnicode type comes to mind.
Agreed. I'll try to come up with specific rules for handling text/binary types (bytes, str, unicode, Text, _Ascii*) in Python 2 and 3. For me the rules for dealing with _Ascii* look the most problematic at the moment as it's unclear how these types should propagate via text-handling functions.
Yes, there is no need for a specific 2+3 mode. I was really referring to the intersection of the Python 2 and 3 APIs, when the user accesses a text/binary method that is not available in both.

-- Andrey Vlasovskikh Web: http://pirx.ru/

On Tue, Mar 22, 2016 at 3:18 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
As you wish!
OK, so you want the type of u'hello' to be _AsciiUnicode but the type of u'Здравствуйте' to be just unicode, right? And getattr()'s second argument would be typed as... What?
You can try that out at runtime though.
Cool.

--Guido van Rossum (python.org/~guido)

On Wed, Mar 23, 2016 at 2:39 PM, Guido van Rossum <guido@python.org> wrote:
AIUI, getattr's second argument is simply 'str'; but in Python 2, _AsciiUnicode (presumably itself a subclass of unicode) can be implicitly promoted to str. A non-ASCII attribute name works fine, but getattr converts unicode to str using the 'ascii' codec.

ChrisA
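As a rough illustration of the conversion described above, the following sketch mimics what Python 2 does when a unicode name is passed where a native str is expected. It is written so that it also runs on Python 3. The helper names `implicit_ascii_encode` and `can_convert_implicitly` are hypothetical and exist only for this example; the real conversion happens inside the interpreter.

```python
# -*- coding: utf-8 -*-


def implicit_ascii_encode(name):
    """Mimic Python 2's implicit unicode -> str conversion (hypothetical helper)."""
    # Python 2 does the equivalent of name.encode('ascii') when a unicode
    # object is passed to an API that requires a native str.
    return name.encode('ascii')


def can_convert_implicitly(name):
    """Return True if the implicit ASCII conversion would succeed."""
    try:
        implicit_ascii_encode(name)
    except UnicodeEncodeError:
        return False
    return True


ascii_ok = can_convert_implicitly(u'upper')
non_ascii_ok = can_convert_implicitly(u'привет')
```

Under these assumptions an ASCII-only name converts cleanly, while a non-ASCII name fails with a `UnicodeEncodeError`.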

Right. I'm not sure that a non-ASCII attribute name is fine in Python 2 though.

-- Andrey Vlasovskikh Web: http://pirx.ru/

On Wed, Mar 23, 2016 at 6:45 PM, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
It's legal. I don't know that it's a good idea, but it is legal.

    rosuav@sikorsky:~$ python
    Python 2.7.11+ (default, Feb 22 2016, 16:38:42)
    [GCC 5.3.1 20160220] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
ChrisA

The type of the second argument would be str, the "native string" type. If people use from __future__ import unicode_literals, then there are many places in Python 2 where str is expected but an ASCII-only unicode literal is given. Having an internal _AsciiUnicode type that inherits from unicode while being compatible with str (and bytes) would solve this issue.

-- Andrey Vlasovskikh Web: http://pirx.ru/
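One way to picture how a checker might assign the proposed internal type to literals is sketched below. This is only an assumption about how such a rule could look, not how any existing checker works; `literal_type` and `accepted_as_native_str` are hypothetical names, and the returned strings are just labels for the types discussed in this thread.

```python
# -*- coding: utf-8 -*-


def literal_type(value):
    """Return a label for the static type a text literal might be given."""
    # A literal made only of code points below 128 survives the implicit
    # ASCII conversion, so it could be typed as the internal _AsciiUnicode.
    if all(ord(ch) < 128 for ch in value):
        return '_AsciiUnicode'
    return 'unicode'


def accepted_as_native_str(value):
    """Would the literal be accepted where Python 2 expects a native str?"""
    return literal_type(value) == '_AsciiUnicode'


hello_type = literal_type(u'hello')
greeting_type = literal_type(u'Здравствуйте')
```

With this rule `u'hello'` would be accepted by an API such as getattr(), while `u'Здравствуйте'` would be reported.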

Upon further investigation of the problem I've come up with an alternative idea that looks simpler and yet is still capable of finding most text/binary conversion errors.

Here is a rendered Markdown version: https://gist.github.com/vlasovskikh/1a8d5effe95d5944b919

## TL;DR

* Introduce `typing.Text` for text data in Python 2+3
* `bytes`, `str`, `unicode`, `typing.Text` in type hints mean whatever they mean at runtime for Python 2 or 3
* Allow `str -> unicode` and `unicode -> str` promotions for Python 2
* Type checking for Python 2 *and* Python 3 actually finds most text/binary errors
* A few false negatives for Python 2 are not worth special handling besides possible ad-hoc handling of non-ASCII literal conversions

## Summary for Python users

If you want your code to be Python 2+3 compatible:

* Write text/binary type hints in 2+3 compatible comments
* Use `typing.Text` for text data, `bytes` for binary data
* Use `str` only for rare cases of "native strings"
* Don't use `unicode` since it's absent in Python 3
* Run a type checker for *both* Python 2 and Python 3

## Summary for authors of type checkers

The semantics of types `bytes`, `str`, `unicode`, `typing.Text` and the type checking rules for them should match the *runtime behavior* of these types in Python 2 and Python 3, depending on whether the checker runs in Python 2 or Python 3 mode. Using the runtime semantics for the types is easy to understand while it still allows catching most errors. The Python 2+3 compatibility mode is just a sum of Python 2 and Python 3 warnings.

Type checkers *should* promote `str`/`bytes` to `unicode`/`Text` and `unicode`/`Text` to `str`/`bytes` for Python 2. Most text/binary conversion errors can be found by running a type checker for Python 2 *and* for Python 3.

## typing.Text: Python 2+3 compatible type for text data

The `typing.Text` type is a Python 2+3 compatible type for text data.
It's defined as follows:

    import sys

    if sys.version_info < (3,):
        Text = unicode
    else:
        Text = str

For a Python 2+3 compatible type for binary data use `bytes`, which is available in both 2 and 3.

## Implicit text/binary conversions

In Python 2 text data is implicitly converted to binary data and vice versa using the ASCII encoding. A `UnicodeEncodeError` or a `UnicodeDecodeError` is raised only if the data isn't ASCII-compatible. This results in many programs that aren't well-tested regarding non-ASCII data handling.

In Python 3 an attempt to implicitly convert text data to binary data always raises a `TypeError`. A type checker run in the Python 3 mode will find most Python 2 implicit conversion errors.

## Checking for Python 2+3 compatibility

In order to be Python 2+3 compatible a program has to pass *both* Python 2 and Python 3 type checking. In other words, the warnings found in the Python 2+3 compatible mode are a simple sum of Python 2 warnings and Python 3 warnings.

## Runtime type compatibility

Here is a table of types whose values are compatible at runtime. Columns are the expected types, rows are the actual types:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | * F   | * .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | * F   | * F   | . F

Each cell contains two characters: the result in Python 2 and in Python 3 respectively. Abbreviations:

* `.`: types are compatible
* `F`: types are not compatible
* `*`: types are compatible, ignoring implicit ASCII conversions

At runtime in Python 2 `str` is compatible with `unicode` and vice versa (ignoring possible implicit ASCII conversion errors). Using `unicode` in Python 3 is always an error since there is no `unicode` name in Python 3.

As you can see from the table above, many implicit ASCII conversion errors in a Python 2 program can be found just by running a type checker in the Python 3 mode.
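To make the "sum of Python 2 and Python 3 warnings" rule concrete, here is a minimal sketch that encodes the runtime compatibility table as data and looks up a pair of types. The names `RUNTIME_TABLE`, `check` and `compatible_2and3` are hypothetical; a real type checker would not be structured around a lookup table like this.

```python
# Expected types, in the column order of the table above.
TYPES = ('Text', 'bytes', 'str', 'unicode')

# Rows are actual types. Each cell holds the Python 2 result followed by
# the Python 3 result: '.' compatible, 'F' not compatible, '*' compatible
# ignoring implicit ASCII conversions.
RUNTIME_TABLE = {
    'Text':    ('..', '*F', '*.', '.F'),
    'bytes':   ('*F', '..', '.F', '*F'),
    'str':     ('*.', '.F', '..', '*F'),
    'unicode': ('.F', '*F', '*F', '.F'),
}


def check(actual, expected):
    """Look up one cell and split it into per-version results."""
    py2, py3 = RUNTIME_TABLE[actual][TYPES.index(expected)]
    return {
        'py2_ok': py2 != 'F',
        'py3_ok': py3 != 'F',
        # True when acceptance relies on an implicit ASCII conversion.
        'ascii_only': '*' in (py2, py3),
    }


def compatible_2and3(actual, expected):
    """A pair passes 2+3 checking only if it passes in both versions."""
    result = check(actual, expected)
    return result['py2_ok'] and result['py3_ok']


text_to_bytes = compatible_2and3('Text', 'bytes')
text_to_str = compatible_2and3('Text', 'str')
```

In this sketch passing `Text` where `bytes` is expected is rejected (it fails in Python 3), while `Text` where `str` is expected passes in both modes and is flagged only as relying on an ASCII conversion in Python 2.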
The only problematic conversions that may result in errors are `Text` to `str` and vice versa in Python 2.

Example 1. `Text` to `str`

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo(..., u'привет')  # False negative warning for non-ASCII in Python 2

Example 2. `str` to `Text`

    def foo(x):
        # type: (Text) -> Any
        return u'Привет, ' + x

    foo('Мир')  # False negative warning for non-ASCII in Python 2

For non-ASCII text literals passed to functions that expect `Text` or `str` in Python 2 a type checker can analyze the contents of the literal and show additional warnings based on this information. For non-ASCII data coming from sources other than literals this check would be more complicated.

To summarize, with this type compatibility table in place, a type checker run for *both* Python 2 and Python 3 is able to find *almost all errors* related to text and binary data except for a few text to "native string" conversions and vice versa in Python 2.

## Current Mypy type compatibility (non-runtime semantics)

Mypy implies `str` to `unicode` promotion for Python 2, but it doesn't promote `unicode` to `str`. Here is an example of a Python 2 program that is correct given the runtime type compatibility semantics shown in the table above, but is incorrect for Mypy:

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo({}, u'upper')  # False positive warning in Mypy for ASCII in Python 2

Here is the type compatibility table for the current version of Mypy:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | F F   | F .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | F F   | F F   | . F

Running the Mypy type checker in Python 2 mode *and* Python 3 mode for the same program would find almost all implicit ASCII conversion errors except for `str` to `Text` conversions.

To summarize, the current Mypy type compatibility table covers almost all text and binary data handling errors when used for *both* Python 2 and Python 3. But it doesn't notice errors in "native string" to text conversions in Python 2 and produces *false warnings* for text to "native string" conversions in Python 2.

-- Andrey Vlasovskikh Web: http://pirx.ru/
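The two tables in the message above differ in only a few cells. The sketch below parses both tables from their text form and lists the (actual, expected) pairs where they disagree. The names `parse` and `differences` are hypothetical helpers for this comparison, and the table bodies are copied from the message.

```python
COLUMNS = ('Text', 'bytes', 'str', 'unicode')

# Table bodies copied from the message: rows are actual types, each cell
# is the Python 2 result followed by the Python 3 result.
RUNTIME = """
Text    | . . | * F | * . | . F
bytes   | * F | . . | . F | * F
str     | * . | . F | . . | * F
unicode | . F | * F | * F | . F
"""

MYPY = """
Text    | . . | F F | F . | . F
bytes   | * F | . . | . F | * F
str     | * . | . F | . . | * F
unicode | . F | F F | F F | . F
"""


def parse(table):
    """Map (actual, expected) to a (py2, py3) tuple of result marks."""
    cells = {}
    for line in table.strip().splitlines():
        parts = [part.strip() for part in line.split('|')]
        actual = parts[0]
        for expected, cell in zip(COLUMNS, parts[1:]):
            cells[(actual, expected)] = tuple(cell.split())
    return cells


def differences(first, second):
    """Return the sorted list of pairs whose cells differ."""
    return sorted(key for key in first if first[key] != second[key])


runtime_cells = parse(RUNTIME)
mypy_cells = parse(MYPY)
changed = differences(runtime_cells, mypy_cells)
```

Every differing cell has `Text` or `unicode` as the actual type and `bytes` or `str` as the expected type, and in each one the runtime table says `*` for Python 2 where the Mypy table says `F`. These are the text to "native string" false positives mentioned in the summary.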

On Mar 24, 2016, at 17:00, Andrey Vlasovskikh <andrey.vlasovskikh@gmail.com> wrote:
The only problematic conversions that may result in errors are `Text` to `str` and vice versa in Python 2.
So any time you use Text strings together with strings from sys.argv, sys.stdin/raw_input(), os.listdir(), ZipFile, csv.reader, etc., all of which are native str, they'll pass as valid in a 2+3 test, even though they're not actually valid in 2.x?

Yes, these errors will go unnoticed, unfortunately. But this guarantees that there will be no false positive warnings related to text/binary types. And a model of text/binary types that matches the runtime semantics is easier for users. This kind of error would have been more important to find if users had been expected to port their code from Python 3 back to Python 2 more often than from 2 to 3.

Speaking of ways to actually find these errors, one idea discussed in the issue tracker of Mypy [1] was to have a separate _AsciiStr type for things that are certainly ASCII-compatible. However, treating all str values as non-ASCII by default would result in false positive warnings. We could have a reverse type, say, _NonAsciiStr (there should be a better name for that) not compatible with Text for things we know are non-ASCII for sure:

* Non-ASCII str literals
* Functions like those you mentioned above

There will be false negatives in cases not covered by _NonAsciiStr, but at least there will be a way of documenting non-ASCII native str interfaces for the users who care about this kind of Python 2 error. The downside is that _NonAsciiStr is harder to understand and apply correctly than str.

[1]: https://github.com/python/typing/issues/19

-- Andrey Vlasovskikh Web: http://pirx.ru/

On Fri, Mar 25, 2016 at 12:00 AM, Andrey Vlasovskikh < andrey.vlasovskikh@gmail.com> wrote:
...
I'm against this, as it would seem to make str and unicode pretty much the same type in Python 2, and thus Python 2 mode seems much weaker than necessary. I wrote a more detailed reply in the mypy issue tracker (https://github.com/python/mypy/issues/1141#issuecomment-201799761). I'm not copying it all here since much of that is somewhat mypy-specific and related to the rest of the discussion on that issue, but I'll summarize my main points here. I prefer the idea of doing better type checking in Python 2 mode for str and unicode, though I suspect we need to implement a prototype to decide whether it will be practical.

* Type checking for Python 2 *and* Python 3 actually finds most text/binary errors

This may be true, but I'm worried about usability for Python 2 code bases. Also, the effort needed to pass type checking in both modes (which is likely pretty close to the effort of a full Python 3 migration, if the entire code base is annotated) might be impractical for a large Python 2 code base.

## Summary for authors of type checkers

At least for mypy, the Python 2+3 compatibility mode would likely take twice as much CPU to run, which is a pretty high cost as type checking speed is one of the biggest open issues we have right now.

## Runtime type compatibility
...
Each cell contains two characters: the result in Python 2 and in Python 3 respectively. Abbreviations:
...
* `*` — types are compatible, ignoring implicit ASCII conversions
Am I reading this right if I understand this as "considered valid during type checking but may fail at runtime"?

For non-ASCII text literals passed to functions that expect `Text` or `str`

I wonder what the check would look like in the latter case? I can't imagine how this would work for non-literals.

Jukka
participants (6)
- Andrew Barnert
- Andrey Vlasovskikh
- Chris Angelico
- Guido van Rossum
- Jukka Lehtosalo
- Wes Turner