Arbitrary Literal Strings

Literal Types ([PEP 586](https://www.python.org/dev/peps/pep-0586/)) allow us to type a specific literal string like `x: Literal["foo"] = "foo"`. This is useful when we know exactly which string or set of strings we want to accept. However, I’ve run into use cases where we'd want to accept *any* literal string such as “foo”, “bar”, etc.

For example, we might have a custom format string function. For security reasons, we would want the typechecker to enforce that the format string is a literal, not an arbitrary string. Otherwise, an attacker could read or write arbitrary data by changing the format string (the so-called “format string attack” [1]):

```
def my_format_string(s: str, *args: FormatArgument) -> str:
    ...

my_format_string("hello: %A", a)  # OK
my_format_string(user_controlled_string, a)  # BAD
```

Likewise, if we have a custom shell execution command like `my_execute`, we might want to enforce that the command name is a literal, not an arbitrary string. Otherwise, an attacker might be able to insert arbitrary shell code in the string and execute it:

```
def my_execute(command: str, *args: str) -> None:
    ...

my_execute("ls", file1, file2)  # OK

command = input()
my_execute(command, file1, file2)  # BAD
```

There is no way to specify the above in the current type system.

# Proposal

We can allow `Literal[str]`, which would represent *any* literal string:

```
from typing import Literal

def my_format_string(s: Literal[str], *args: FormatArgument) -> str:
    ...

my_format_string("hello: %A: %B", a, b)  # OK because it is a literal string.
my_format_string(user_controlled_string, sensitive_data)  # Type error: Expected Literal[str], got str.
```

The same goes for the shell command function:

```
def my_execute(command: Literal[str], *args: CommandArgument) -> None:
    ...

my_execute("ls", files)  # OK
my_execute(arbitrary_string, files)  # Type error: Expected Literal[str], got str.
```

Other usage will work as expected:

```
from typing import Literal, TypeVar

# Type variable that accepts only literal strings.
TLiteral = TypeVar("TLiteral", bound=Literal[str])

def identity(s: TLiteral) -> TLiteral: ...

y = identity("hello")
reveal_type(y)  # Literal["hello"]

s: Literal[str]
y2 = identity(s)
reveal_type(y2)  # Literal[str]

literal_string: Literal[str]
s: str = literal_string  # OK

literal_string: Literal[str] = s  # Type error

literal_string: Literal[str] = "hello"  # OK

x = "hello"
literal_string: Literal[str] = x  # OK
```

## Backward compatibility

**Backward compatibility**: `Literal[str]` is acceptable at runtime, so this doesn’t require any changes to Python itself.

**Reference Implementation**: This was quite easy to implement and is available in Pyre v0.9.3.

**Rejected alternatives**: `T = TypeVar("T", bound=Literal[Any])` isn’t something allowed in PEP 586 and would anyway be too broad. It would also allow literal bools to be passed in when we want only literal strings.

## Other uses for Literal[str]

Other places where it might be useful to statically enforce literal strings for safety and readability:

```
# struct
struct.unpack("<I", self.read(n))

# datetime
datetime.now().strftime('%B %d, %Y - %X')

# builtins
open(path, encoding='utf-8')
my_string.encode('latin-1')

# PyTorch
self.register_buffer("weight", torch.zeros(a, b))

# argparse
parser.add_argument("--my-flag", action="store_true")
```

The same idea would apply to `Literal[int]` and `Literal[bool]`, but I don’t have compelling use cases for them yet. I suspect `Literal[int]` will be useful for Tensor types in the future. Others might have run into use cases in the wild.

Thoughts? Opinions?

[1]: https://owasp.org/www-community/attacks/Format_string_attack

-- S Pradeep Kumar

Setting aside the question of whether it can easily be determined whether a given string is a literal or not (I don't know, but would be interested in knowing the answer)...

Strings are immutable, so an attacker can't "change" the string as you suggest. A variable can point to a new string, and that string could also be a string literal, and could be malformed/malicious. So I'm not getting what the benefit would be here. If I want to dynamically assemble a malicious string literal, all I would have to do is generate code that can be evaluated to produce a string literal.
    >>> eval("".join(['"', "evil", " ", "string", " ", "literal", '"']))
    'evil string literal'

On Wed, 2021-08-04 at 12:52 -0700, S Pradeep Kumar wrote:

As per PEP 586:
In practice, a typechecker will consider a string to be a literal string if it is in quotes. Performing an operation like appending a string will make the typechecker no longer treat it as a literal string. Here is an example [1]. So, a malicious string dynamically assembled (by appending, etc.) will not be treated as a literal string and will be flagged by the typechecker when passed to something that expects `Literal[str]`. I hope that makes sense.
You're right, my wording was confusing. I meant that an attacker could pass in an arbitrary string, not that he could mutate it.
Note that `eval` is typed as returning `Any`, which means we lose all safety, so that's a whole different problem :)

The security use cases I was talking about were innocuous inputs that happened to be user-controlled, such as a field in a request JSON that got passed or appended to the format string or shell command function.

[1]: https://mypy-play.net/?mypy=latest&python=3.8&gist=4aadb9d2778ec6fe7566382e0a68fbd3 <https://mypy-play.net/?mypy=latest&python=3.8&gist=db831b584ec7b1fff5fd0c35d4a18b98>

-- S Pradeep Kumar

On Wed, 2021-08-04 at 14:14 -0700, S Pradeep Kumar wrote:
Right, so we're on the same page. I want to confirm that there should be no rule dictating how 3 was arrived at. It could have been the result of addition. I expect Literal["aaa", "bbb"] would currently accept the string "aaa" regardless of whether it was "a"*3, "a" + "aa", or "aaa".
It makes sense. I'm still dubious about the benefits though.
1. I think passing or appending a format string from user input is (or should be!) a security anti-pattern, just like you should also not eval raw user input. This strikes me as more a job for a source code vulnerability scanner rather than a static type checking library.
2. If the consensus is that this proposal has enough merit to be added, my request would be to allow for the detection of a string's "literalness" at runtime, not just through graphing it in a static type checker.

Jelle: Thanks, that's another good use case! I'm open to other ways to denote arbitrary literal strings if you have ideas. On Wed, Aug 4, 2021 at 3:29 PM Paul Bryan <pbryan@anode.ca> wrote:
Paul: No, only "aaa" is accepted. That's what I was trying to show with the Mypy snippet I linked :) -- S Pradeep Kumar
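The point in this exchange, that literal-ness is a property of the inferred static type rather than of the runtime value, can be illustrated with a minimal sketch. The function name `takes_aaa` and the variables are hypothetical, and the checker behavior described in the comments is what PEP 586 style inference is expected to produce, not something the code itself verifies.

```python
from typing import Literal


# Hypothetical function that accepts only the literal type Literal["aaa"].
def takes_aaa(value: Literal["aaa"]) -> str:
    return value.upper()


# Accepted by a static checker: the argument is written as a literal.
direct = takes_aaa("aaa")

# Same runtime value, but computed. A static checker is expected to infer
# plain `str` for this expression and flag the call below. At runtime the
# call still succeeds, because Literal is not enforced during execution.
computed = "a" * 3
indirect = takes_aaa(computed)
```

Both calls return the same string when executed; only the type checker distinguishes them, which is also why Paul's request for runtime detection of "literalness" is a separate feature from the one proposed.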

conveniently enough this came up last week in a little challenge I posted as part of my stream discord -- here's a proof of concept mypy plugin that I came up with (more on that in the video I posted: https://www.youtube.com/watch?v=KWjGyDflNKQ )

it could probably be generalized to some specific syntax or decorator or Annotated as well but this was just a small proof of concept

```python
from __future__ import annotations

from typing import Callable

from mypy.plugin import Plugin
from mypy.plugin import FunctionContext
from mypy.types import Instance
from mypy.types import LiteralType
from mypy.types import Type


class CustomPlugin(Plugin):
    def _require_only_literals(self, func: FunctionContext) -> Type:
        # invalid signature, but already handled
        if len(func.arg_types) != 1 or len(func.arg_types[0]) != 1:
            return func.default_return_type

        tp = func.arg_types[0][0]
        if (
                not isinstance(tp, Instance) or
                not isinstance(tp.last_known_value, LiteralType)
        ):
            func.api.fail('expected string literal for argument 1', func.context)
            return func.default_return_type

        # already handled by signature
        # if f'{tp.type.module_name}.{tp.type.name}' != 'builtins.str':

        return func.default_return_type

    def get_function_hook(
            self,
            name: str,
    ) -> Callable[[FunctionContext], Type] | None:
        if name == 't.somefunc':
            return self._require_only_literals
        else:
            return None


def plugin(version: str) -> type[Plugin]:
    return CustomPlugin
```

Anthony

On Wed, Aug 4, 2021 at 7:15 PM Paul Bryan <pbryan@anode.ca> wrote:

I have implemented a similar check in our internal tools: you can only do a database query with a statically inferable string literal. I agree the feature would be useful more generally.

I'm not sure I like the `Literal[str]` syntax though: to me, that means you can only pass the `str` type, not that you can pass any literal string.

On Wed, Aug 4, 2021 at 12:53, S Pradeep Kumar (<gohanpra@gmail.com>) wrote:

On Thu, Aug 5, 2021 at 1:50 AM Guido van Rossum <guido@python.org> wrote:
I think you are looking for the concept named “tainting” in other languages.
Guido: No, I wasn't looking for full-fledged taint analysis for finding security vulnerabilities. (We do that with Pysa [1], for example :) ) Format strings, explicit database queries, etc. are places where we simply don't want to allow any non-literals. As Jelle had pointed out, people already write ad hoc tools to check for literalness. I think the concept of arbitrary literal strings gives us a simple, readable way to express our intentions. [1]: https://pyre-check.org/docs/pysa-basics -- S Pradeep Kumar

On Thu, Aug 5, 2021 at 6:41 PM S Pradeep Kumar <gohanpra@gmail.com> wrote:
Okay, fine. Format strings, explicit database queries, etc. are places where we simply
don't want to allow any non-literals.
Still, you intend this as a security check, right? (Quoting your first message: "my_format_string(user_controlled_string, sensitive_data)".) And one could certainly write a taint-clearing function like this (following your identity() example):

    def clear_or_die(a: str) -> L[str]:
        if not verify(a):
            raise ValueError(...)
        return cast(L[str], a)

Your examples didn't *quite* show this, but I do think you'd allow this, right?

    def f(a: L[str]): ...

    def g(a: L[str]):
        f(a)  # Allowed, even though 'a' isn't a literal -- its type is still literal

    g("hello")

I do agree with Jelle that the notation Literal[str] feels inconsistent with Literal["hello"].

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Yes, you're right. That's the intention here. Other uses are for readability where we want to insist that the user pass in a literal (`struct.unpack("<I", ...)`). def clear_or_die(a: str) -> L[str]:
Yes, that's certainly allowed since it's still a literal type. The same goes for `"a" if foo else "b"`, etc. Basically, if `reveal_type()` shows that it's a literal string (either `Literal["hello"]` or `Literal[str]`), it is acceptable.
I do agree with Jelle that the notation Literal[str] feels inconsistent with Literal["hello"].
I agree that it's different from the existing uses in that you can't assign `str` to `Literal[str]`. Open to alternatives. Perhaps something like `LiteralString`? The disadvantages were that: 1. You'd need a separate import for `LiteralString` whereas we can just import the familiar `Literal`. 2. `LiteralString` doesn't show the close relation to `Literal["hello"]` whereas `Literal[str]` does. I read both of them as "literal string" but ymmv. It's also a straightforward generalization for `Literal[int]`, `Literal[bool]`, etc. Other options are `Literal[AnyLiteralString]`, but this still requires a separate import and `AnyLiteralString` by itself wouldn't be a valid type, which seems janky.
-- S Pradeep Kumar

I think that supporting this feature would be nice and it is something that I have come across several times when thinking about other features like tensor typing, and in my opinion supporting this would make the current literals feel more "complete". However I did not try to make a concrete proposal because I expected this discussion to have some rough edges. Ideally, we should be able to use the concept of Literal[str] (or whatever is the syntax) for typing Literals in typeshed.

I find this proposal confusing. It appears to be based on a misunderstanding that the only way to generate a literal type is through the use of a literal expression. Literal types and literal expressions are not the same thing. A literal expression does evaluate to a literal type, but other (non-literal) expressions can also evaluate to literal types.

In type calculus, the union of all possible literal types is equivalent to the wider class upon which those literals are based. For example, the type `bool` is equivalent to the union of `Literal[False]` and `Literal[True]` and vice versa. An enum type is the union of its literal element types and vice versa. Type checkers are allowed to compose and decompose these types during type narrowing and type merging. For example:

```python
class Color(Enum):
    red = 0
    green = 1
    blue = 2

def func(x: Color):
    if x is not Color.red and x is not Color.blue:
        reveal_type(x)  # Literal[Color.green]
```

Or consider this:

```python
v1: Literal[False] = False
reveal_type(v1)  # Literal[False]

v2: Literal[True] = True
reveal_type(v2)  # Literal[True]

v3 = v1 or v2
reveal_type(v3)  # Literal[True]

v4 = v1 and v2
reveal_type(v4)  # Literal[False]

v5 = v1 if random() > 0.5 else v2
reveal_type(v5)  # bool
```

In the case of `int` and `str`, the number of enumerable literal types is very large, but "the union of all int literals" is still equivalent to `int` and vice versa. So I don't see the need to add the notion of `AnyLiteral[int]`. From a type perspective, that's the same as `int`.

-Eric

---
Eric Traut
Contributor to pyright & pylance
Microsoft Corp.

(1)

There is a nuance here. That "equivalence" is not true when it comes to checking overloads. [1]

For example, just because a `bool` can only be a `True` or a `False` doesn't mean these two overloads are enough to match any given bool [2]:

```
@overload
def foo(x: Literal[True]) -> Foo: ...
@overload
def foo(x: Literal[False]) -> Bar: ...

def bar(y: bool) -> None:
    foo(y)  # Type error
```

Likewise, there's an observable difference between `Union[Literal[True], Literal[False]]` and `bool` - compatibility doesn't go both ways. The same concept applies to enums, strings, etc.

The type system currently allows us to express:

1. a known literal (`Literal["a"]`)
2. a union of known literals (`Literal["a", "b"]`)
3. arbitrary strings (`str`)

There is no way to express a union of arbitrary, unknown literals. Hence this proposal.

(2)

Joining of types is fine. As mentioned earlier in the thread, `"a" if foo else "b"` is still compatible with `Literal[str]` because we join the two types to get `Literal["a", "b"]`.

Narrowing is also fine (such as the enum or bool cases you mentioned). As long as the revealed type is a literal string, it should be acceptable.

Even for strings, let's say a typechecker narrows the type based on a literal equality check (`message == "hello, %A"`). That's still acceptable for our purposes because we can't arbitrarily go from a str to a Literal[str] without some explicit narrowing in the code. Likewise for the `clear_or_die` narrowing function that Guido shared.

```
def my_format_string(s: Literal[str], x: object) -> str: ...

def foo(message: str, a: object) -> None:
    if message == "hello, %A":
        print(my_format_string(message, a))  # OK if the revealed type is a literal type

    print(my_format_string(message, a))  # Not OK
```

(3)

Lastly, if you still feel it is unnecessary, it would help to share how you would handle the motivating examples without `Literal[str]`. Specifically:

```
def my_format_string(s: Literal[str], x: object) -> str: ...

def print_request(user_request: Dict[str, str]) -> None:
    # OK because the type of the format string is a literal string
    print(my_format_string("hello, %A", user_request["name"]))

    # Not OK because the type is an arbitrary str
    print(my_format_string(user_request["message_format"], user_request["message"]))
```

[1]: https://www.python.org/dev/peps/pep-0586/#interactions-with-overloads
[2]: Mypy snippet for overloads: https://mypy-play.net/?mypy=latest&python=3.8&gist=92f7c56aae6678c7a3fc13aaa19be309

On Fri, Aug 6, 2021 at 9:47 AM Eric Traut <eric@traut.com> wrote:
-- S Pradeep Kumar

I still assert there is no difference between `Literal[False, True]` and `bool`. They are completely equivalent, and to treat them differently would be inconsistent.
That "equivalence" is not true when it comes to checking overloads
If a type checker deconstructs a union argument type when matching overloads (a behavior that is not specified in PEP 484 but is implemented by most Python type checkers), it would also be appropriate for it to deconstruct a `bool` into `Literal[True]` and `Literal[False]` when matching overloads. I don’t think any of the popular Python type checkers do this currently because it would be expensive, and overloads generally include an explicit fallback as recommended in PEP 586. But from the perspective of the type system, this would be perfectly defensible and consistent for `bool` and enum literals, both of which have a closed number of variants. Fallback overloads are always needed for `str` or `int` literals.

Here’s an example to consider.

```python
def func1(val: Literal[True, False]):
    pass

def func2(val: bool):
    func1(val)  # mypy generates an error here; pyright does not
```

In pyright, the above sample type checks fine. Mypy emits an error. I consider that a bug. It probably hasn’t ever been reported because it would be odd to annotate a parameter as `Literal[False, True]`.

PEP 586 introduced literals to Python, and it explicitly says that it borrowed the idea from TypeScript. As you might expect, TypeScript treats `boolean` as equivalent to `false | true`. To do otherwise would introduce an odd and unnecessary inconsistency into the type system.

```typescript
function func1(a: false | true) { }

function func2(a: boolean) {
    func1(a); // TypeScript is fine with this
}
```

PEP 586 doesn’t explicitly say whether a union that includes all literal variants is equivalent to its non-literal counterpart. It does indicate that deconstruction and merging are possible when the type has a closed number of literal variants, which implies that they are equivalent in those cases. Absent any explicit statement, it seems reasonable to conclude that they are treated consistently, as they are in TypeScript.
it would help to share how you would handle the motivating examples
I guess I don’t find the motivating examples very motivating. :)

The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it through an escape transform. None of these techniques will produce a literal-typed value.

Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?

It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.

-Eric

--
Eric Traut
Contributor to pyright & pylance
Microsoft Corp.

Thinking more about this, I actually agree with that statement. Bools and enums should be treated as equivalent to unions of the component literals. However, note that functions for them still need a fallback overload. This is not because of performance reasons or lack of typechecker support like you mentioned, but because we need to handle the case where the input type is a union (e.g., `bool`). That is, we need an overload each for `Literal[True]`, `Literal[False]`, and `bool`. This is true even in TypeScript [1]. This doesn't affect the proposal here, though. I've already addressed the joining and narrowing concerns. The fact remains that there is no way in Python (or TypeScript) to represent a union of arbitrary literal types.
I guess I don't find the motivating examples very motivating. :)
That's fair enough. But it's not the same as saying that unions of arbitrary literals can be expressed with current types :) Regarding how format strings can be unsafe in Python and why we'd want to address these in the typechecker instead of a full-fledged taint analysis, I've asked a security engineer (Graham Bleaney) to chime in on this thread. [1]: TypeScript boolean overloads: https://www.typescriptlang.org/play?ts=4.2.0-dev.20201221#code/CYUwxgNghgTiA... On Sat, Aug 7, 2021 at 10:43 AM Eric Traut <eric@traut.com> wrote:
-- S Pradeep Kumar

The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it though an escape transform. None of these techniques will produce a literal-typed value.
You're 100% right! There are lots of ways to make sure a given string is safe to use in a given command. The problem is, many of the ways that you suggested (regex filtering, scanning for dangerous characters, escape transforms) can have subtle implementation flaws that allow an attacker to bypass them. Additionally, users can entirely forget to make those checks. As a security engineer, I have to assume that someone somewhere along the line is going to mess up their ad-hoc regex check (or forget to write it) and let a special character slip through.

`Literal[str]` gives me another way though. It lets me create *opinionated* APIs that say the user *must* supply a literal string, which the type checker can then tell me with 100% certainty was not dynamically created with user controlled input.

Let's make things concrete and talk about SQL injection. The canonical way to prevent it is to use parameterized queries:

```
def get_data(value: str):
    SQL = "SELECT * FROM table WHERE col = %s"
    return conn.query(SQL, value)
```

The problem is, nothing stops a developer from inserting a dynamically created SQL string into that first parameter and creating a SQL injection vulnerability despite the availability of parameterization:

```
def get_data(value: str):
    SQL = f"SELECT * FROM table WHERE col = '{value}'"
    return conn.query(SQL)
```

If I change the interface of `query` to require that the first argument is a literal, I can prevent this SQL injection issue from happening.
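To see the difference between the two `get_data` variants in action, here is a small self-contained sketch using the standard-library `sqlite3` module. The table, its rows, the function names, and the payload are made up for illustration, and `sqlite3` uses `?` placeholders rather than the `%s` style of the hypothetical `conn.query` API above.

```python
import sqlite3

# In-memory database with illustrative data (table and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def get_data_parameterized(value: str) -> list:
    # The SQL text is a fixed literal; the value travels separately as data.
    sql = "SELECT name FROM users WHERE name = ?"
    return conn.execute(sql, (value,)).fetchall()


def get_data_interpolated(value: str) -> list:
    # The value is spliced into the SQL text, so it can change the command.
    sql = f"SELECT name FROM users WHERE name = '{value}'"
    return conn.execute(sql).fetchall()


# A classic injection payload: closes the quote and adds an always-true clause.
payload = "nobody' OR '1'='1"

safe_rows = get_data_parameterized(payload)    # matches no user
unsafe_rows = get_data_interpolated(payload)   # matches every user
```

With a `Literal[str]`-style annotation on the query parameter, the f-string in `get_data_interpolated` is the call a type checker would be expected to reject, since its type is `str` rather than a literal.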
Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?
My concerns have nothing to do with memory or type safety, and everything to do with preventing the confusion of data and commands. As you've alluded to, many APIs take a string and run it as code (`pickle`, `eval`, etc. run Python code, SQL APIs run SQL code, `os.system` runs shell commands, `python-ldap`'s APIs let you run LDAP queries, etc.). Sometimes the commands they run need data (i.e., a value to insert into the table). If the commands and data are a part of the same string, injection vulnerabilities can occur. Using `Literal[str]`, API designers can enforce the separation of commands and data by requiring that the commands be literals within the Python program, rather than coming from some external source that is user-controlled data.
It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.
Full taint analysis is definitely useful, and I actually spend the majority of my time working on Pysa, which is a taint analysis tool built on top of Pyre. To me, the reasons to want this in a type checker are:

1) It's way faster to run and give feedback to developers. Pysa will take an hour+ to report an issue to a developer on a massive codebase, whereas Pyre can do it in a second.
2) Taint analysis requires that you're able to track sources of user controlled data into the dangerous function. There is always a risk of false negatives there, whereas I can't imagine a case of a false negative coming from a type check for `Literal` (outside of explicit lint suppression).

On Mon, Aug 9, 2021 at 4:03 PM Paul Bryan <pbryan@anode.ca> wrote:
Yes and no. Yes in that I agree -- SQL queries tend to have too much dynamic nature to be simple literals. No in that -- SQL does show having the type system aid you in preventing queries with untrusted contents is quite useful. IIRC, one of the type checkers out there does exactly this?

I think the focus on literal-ness is a bit misleading. In the print_request example, that it is literal isn't the concern so much as it adhering to the requirement that it's a format string with %A in it (or w/e), and being able to know that from static analysis.

    msg_format = "Hello, "
    msg_format += "%A"

isn't a literal literal, but would also be valid usage (I'd expect it to be, anyways), and statically knowable, and thus should be allowed.

All this reminds me of Google's Java Errorprone and its CompileTimeConstant <https://github.com/google/error-prone/blob/master/annotations/src/main/java/...> annotation. It's an annotation applied to a field (which already has its own type), so it's like a "trait" or "state" or "requirement" (for lack of a better term) of an instance of a type, rather than a type in its own right. I've seen this used in APIs that accept a "literal" and then return a "trusted" type, and then all the e.g. query() APIs only accept that trusted type. It works pretty well as a middle ground between "ban all dynamic strings" and "any string is allowed".

The overall intent also seems like a more formal way to write the sort of thing that taint-checking is doing internally: accept instances of type str, but only if they have the "has_percent_A" bit set. We can imagine all sorts of cases like this: JS/HTML/etc strings, objects that had __enter__() called previously, objects that haven't had closed() called, objects that had a post_init() called, etc, etc.

Doesn't Annotated fit this case nicely? e.g. sql: Annotated[str, SafeForSql] or user_home: Annotated[LatLong, SensitiveData], etc. And then it's up to a type checker to implement that state tracking.
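As a rough illustration of the `Annotated` idea, the sketch below attaches a marker to a `str` parameter and reads it back with `typing.get_type_hints`. `SafeForSql` and `run_query` are hypothetical names taken from or invented for this message; nothing here enforces the marker, which is exactly the state tracking a checker or separate analysis would have to supply.

```python
from typing import Annotated, get_type_hints


class SafeForSql:
    """Hypothetical marker: the annotated string is trusted as SQL text."""


def run_query(sql: Annotated[str, SafeForSql]) -> str:
    # Placeholder body; a real API would execute the query.
    return sql


# Tools that care about the marker can recover it from the signature.
hints_with_extras = get_type_hints(run_query, include_extras=True)
markers = hints_with_extras["sql"].__metadata__

# Tools that do not care see the parameter as a plain `str`.
plain_hints = get_type_hints(run_query)
```

Because the metadata is invisible unless a tool asks for it, a standard type checker treats `Annotated[str, SafeForSql]` like `str` and will accept any string argument.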
Well, sort of. Most, if not all, of the Python type checkers have easy ways to arbitrarily cast, disable, or otherwise subvert the type checker for a given call site. A stray Any can sneak in and subvert everything, too. As a library author, you can't force downstream users to use a type checker, either. My point is, a type checker is focused on enforcing types and their relationships, not so much looking for security flaws. It can help, but it's not a panacea. Giving API authors the tools to prevent inadvertent misuse is still a good goal, though.

I agree with Richard. The name `Literal[str]` may be a bit misleading -- the intention here is not to have a type that just represents "literal strings". Instead, what we are really looking for is a notion of "statically-determinable string" or "compile-time constant string", in which case conditionally-built SQL queries are perfectly acceptable as long as the type checker can show that the query string is built from statically determinable constants no matter what the conditions are.

@Eric I do think under this kind of interpretation, things like `Union[Literal[False], Literal[True]]` would no longer be equivalent to `bool` -- the union should be equivalent to `Literal[bool]` instead. Previously, we did not really differentiate between "statically-determinable values" and "runtime-determinable values" at the type level, and hence `Union[Literal[False], Literal[True]]` and `bool` would be indistinguishable. But once we make that distinction explicit with Pradeep's proposal, the difference between `Literal[bool]` and `bool` starts to matter: `Literal[bool]` would be a subtype of `bool` (as any statically determinable value is also runtime-determinable), but *not* the other way around. And type checkers would need to adjust their implementation of subtyping relations accordingly. Of course, whether the additional capability of type-level compile-time constant tracking is worth the cost of these implementation changes is indeed debatable. But I don't think there's a fundamental reason why we can't make that distinction and can't deviate from what TypeScript does.

@Richard The comparison with Errorprone's CompileTimeConstant is interesting. But when it comes to replicating that with Python's `Annotated` type, my read of PEP 593 is that `Annotated` is more geared towards special-purpose runtime/compile-time analyses than a general-purpose type system feature. Type checkers are not supposed to assign/interpret the meanings of those annotations, and I think it makes more sense that way since it's not clear how things like `Union[Annotated[str, X], Annotated[str, Y]]` should be addressed -- is it a subtype/supertype of `str`, or `Annotated[str, [X,Y]]`, `Annotated[str, {X, Y}]`, or some other options? And once we start to hard-code certain interpretations of annotations in the type checker, what happens when our interpretation conflicts with that of other special-purpose analyses?

Therefore, my take on this is that if we want the type checker to take on this kind of analysis then it's better to have a dedicated type rather than relying on `Annotated`. On the other hand, if we decide that the type checker is not the best place to implement the feature, then a special-purpose static analysis that relies on `Annotated` could be one potential option to explore.

- Jia

many non-trivial queries (e.g. conditionally including clauses) require generating SQL statements dynamically.
Yup, you're totally right. And a dynamically created SQL query that uses only strings that are statically knowable (read: is just a bunch of `Literal[str]`s concatenated together) is going to be safe for use in a SQL API in all but the most contrived scenarios. And we can still get a `Literal[str]` on a dynamic SQL string built in this way if we add an overload on `str.__add__`:

```
def __add__(self: Literal[str], s: Literal[str]) -> Literal[str]: ...
```

The majority of dynamic strings I build are arrays of strings that I join to a single string at the end. Will these be statically knowable?
`List` is already generic, so I think that `List[Literal[str]]` should just work out of the box. Similar to `__add__`, I think we could overload `join` to support this use case:

```
def join(self, __iterable: Iterable[Literal[str]]) -> Literal[str]: ...
```

I thought more about this issue and did some brainstorming with my colleagues. Here are a couple of other solutions worth considering.

1. Rather than use "literal" as a proxy for "validated input", use NewType(str). This creates a speed bump so developers who are calling an API know that they are responsible for validating the str input beforehand. A library could even provide functions that perform validation of str inputs and convert them to the new "validated" type.
2. Browsers implement a solution called "trusted types", and some of those ideas are potentially applicable here. See https://web.dev/trusted-types/ for details.

I continue to assert that using "literal" in the proposed manner is a misuse of the Python type system that would have negative consequences. I encourage you to consider alternative solutions to this problem.

-Eric

--
Eric Traut
Contributor to pylance & pyright
Microsoft Corp.

Where would you propose this version of join be implemented?
I'm suggesting that we provide the `join` method I wrote out as an overload for the `str.join` method in typeshed.
1. Rather than use "literal" as a proxy for "validated input", use NewType(str). This creates a speed bump so developers who are calling an API know that they are responsible for validating the str input beforehand. A library could even provide functions that perform validation of str inputs and convert them to the new "validated" type.
We actually do have an API like that, but the same problem persists: developers can pass any argument into the constructor of the `NewType`. The problem is that no matter what API one designs, it has to start with a `str` at some point, and there is no way to know whether that `str` could contain user input or not. Requiring that it be a literal prevents some valid uses of the API, but also prevents all but the most contrived invalid uses of the API.
2. Browsers implement a solution called "trusted types", and some of those ideas are potentially applicable here.
Trusted types is a really cool concept, and believe it or not it's actually at the beginning of a ~2 year long road that ended with this proposal. Without getting *really* invasive into the runtime and changing how strings themselves are handled (you'd effectively need runtime taint tracking to make sure a string didn't come from the network, filesystem, process IO, etc), trusted types basically devolves back into what you suggested in 1), and has the same fundamental limit.
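To make the `NewType` limitation described above concrete, here is a minimal sketch. `ValidatedSql` and `run_query` are hypothetical names invented for illustration; they are not part of any real API mentioned in this thread.

```python
from typing import NewType

# Hypothetical "validated" wrapper type, in the spirit of the NewType suggestion.
ValidatedSql = NewType("ValidatedSql", str)


def run_query(sql: ValidatedSql) -> str:
    # Stand-in for a real database call; it only echoes the query text.
    return f"executing: {sql}"


# Pretend this value arrived in a request body.
user_input = "1; DROP TABLE users"

# A type checker accepts this call: any str may be wrapped, validated or not.
unvalidated = ValidatedSql("SELECT * FROM t WHERE id = " + user_input)
result = run_query(unvalidated)
```

The wrapper documents intent, but nothing in the type system ties it to an actual validation step, which is the limit the reply points at.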

On Wed, Aug 11, 2021 at 9:45 AM <gbleaney@gmail.com> wrote:
This bit may require some more thought. The join string itself (on which the “join” method is called) may be a literal or a dynamic string, and this overload is valid only if it is a literal. But I don’t think we have a way to define a method overload only on `Literal[str]` and not on `str`? Maybe we could use self-types or something for this, but it’s something that would require additional support I think.

Carl

Graham, Paul, and Richard brought up a frequent use case for SQL queries: adding literal strings based on a flag or joining an array of literals. For example:

```
SQL = "SELECT * FROM table WHERE col = %s"
if limit:
    SQL += " LIMIT 1"
```

The idea is that we want strings constructed from statically-known literal strings, as Jia pointed out. These are either

- a literal string expression ("foo")
- conditional branches (`if b: return "foo"; else: return "bar"`)
- narrowing using a guard (`if s == "foo":`)
- `+` on literals
- `join` on literals

We want to exclude strings that are constructed from some expression of type `str` since they are not statically-known to have been constructed from literals.

Bikeshedding:

- I'm ok with `AnyLiteral[str]` since `Literal[str]` can be somewhat confusing. The extra import is fine since this is mostly going to be used in libraries, not in user-written code, and we want the `[str]` parameter since we also want this for `[int]`, etc.
- Another option is `MadeFromLiterals[str]`, which is the most accurate description, but is also a bit long.
- Finally, we could have separate classes like `AnyStrLiteral`, `AnyIntLiteral`, etc. This would allow us to subclass `str` and override any methods explicitly, but may be harder to explain. Typecheckers would need to look up any method calls on `Literal["hello"]` on `AnyStrLiteral`, not `str`.

Compatibility: This goes as `AnyLiteral["foo"] <: AnyLiteral[str] <: str` but not in the other direction. (`A <: B` means A is compatible with B.)

Potential objection: Can an attacker contrive a malicious literal string by giving an input that traverses different branches of benign code and appends known literal strings in some order? Yes, but this seems a remote possibility. Based on discussions with security folks, I think that’s a tradeoff we can accept given that disallowing `+` would rule out natural idioms like appending a “LIMIT 1” to a query.
Carl: Sure, we can add an overload specifically for literals with an annotation for `self`. That will require literal strings for both the input list and the delimiter. This looks like:

```
class str(Sequence[str]):
    ...
    @overload
    def join(self: Literal[str], iterable: Iterable[Literal[str]]) -> Literal[str]: ...
    @overload
    def join(self, iterable: Iterable[str]) -> str: ...

    @overload
    def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
    @overload
    def __add__(self, other: str) -> str: ...

from typing import Literal

def connection_query(sql: Literal[str], value: str) -> None: ...

def my_query(value: str, limit: bool) -> None:
    SQL = "SELECT * FROM table WHERE col = %s"
    if limit:
        SQL += " LIMIT 1"
    connection_query(SQL, value)  # OK
    connection_query(SQL + value, value)  # Error: Expected Literal[str], got str.

def foo(s: str) -> None:
    y = ", ".join(["a", "b", "c"])
    reveal_type(y)  # => Literal[str]

    y2 = ", ".join(["a", "b", s])
    reveal_type(y2)  # => str

    xs: list[Literal[str]]
    y3 = ", ".join(xs)
    reveal_type(y3)  # => Literal[str]

    y4 = s.join(xs)
    reveal_type(y4)  # => str because the delimiter is `str`.
```

The above example works with Pyre after changing the stub for `str`. Another option is to special-case `+` and `join` within typecheckers like you said in case we don’t want to change the `str` stubs. (Or, if we go down the `AnyStrLiteral` route, these specialized stubs would live in that class.)

-- S Pradeep Kumar

Setting aside the question of whether it can easily be determined whether a given string is a literal or not (I don't know, but would be interested in knowing the answer)... Strings are immutable, so an attacker can't "change" the string as you suggest. A variable can point to a new string, that string could also be a string literal, and it could be malformed/malicious. So I'm not getting what the benefit would be here. If I want to dynamically assemble a malicious string literal, all I would have to do is generate code that can be evaluated to produce a string literal.
eval("".join(['"', "evil", " ", "string", " ", "literal", '"']))
'evil string literal'
On Wed, 2021-08-04 at 12:52 -0700, S Pradeep Kumar wrote:

As per PEP 586:
In practice, a typechecker will consider a string to be a literal string if it is in quotes. Performing an operation like appending a string will make the typechecker no longer treat it as a literal string. Here is an example [1]. So, a malicious string dynamically assembled (by appending, etc.) will not be treated as a literal string and will be flagged by the typechecker when passed to something that expects `Literal[str]`. I hope that makes sense.
You're right, my wording was confusing. I meant that an attacker could pass in an arbitrary string, not that he could mutate it.
Note that `eval` is typed as returning `Any`, which means we lose all safety, so that's a whole different problem :) The security use cases I was talking about were innocuous inputs that happened to be user-controlled, such as a field in a request JSON that got passed or appended to the format string or shell command function.

[1]: https://mypy-play.net/?mypy=latest&python=3.8&gist=4aadb9d2778ec6fe7566382e0a68fbd3 <https://mypy-play.net/?mypy=latest&python=3.8&gist=db831b584ec7b1fff5fd0c35d4a18b98>

-- S Pradeep Kumar

On Wed, 2021-08-04 at 14:14 -0700, S Pradeep Kumar wrote:
Right, so we're on the same page. I want to confirm that there should be no rule dictating how 3 was arrived at. It could have been the result of addition. I expect Literal["aaa", "bbb"] would currently accept the string "aaa" regardless of whether it was "a" * 3, "a" + "aa", or "aaa".
It makes sense. I'm still dubious about the benefits though.
1. I think passing or appending a format string from user input is (or should be!) a security anti-pattern, just like you should also not eval raw user input. This strikes me as more a job for a source code vulnerability scanner rather than a static type checking library.
2. If the consensus is that this proposal has enough merit to be added, my request would be to allow for the detection of a string's "literalness" at runtime, not just through graphing it in a static type checker.

Jelle: Thanks, that's another good use case! I'm open to other ways to denote arbitrary literal strings if you have ideas. On Wed, Aug 4, 2021 at 3:29 PM Paul Bryan <pbryan@anode.ca> wrote:
Paul: No, only "aaa" is accepted. That's what I was trying to show with the Mypy snippet I linked :) -- S Pradeep Kumar
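The distinction discussed above, between a literal expression and a string that merely has the same value, can be sketched as follows. `takes_aaa` is a hypothetical function, and the checker behavior noted in the comments is the mypy behavior described in this thread; other checkers or versions may infer more.

```python
from typing import Literal


def takes_aaa(value: Literal["aaa"]) -> str:
    # Nothing is enforced at runtime; the annotation is only checked statically.
    return value


# Accepted: the argument is a literal expression of type Literal["aaa"].
quoted = takes_aaa("aaa")

# The product is inferred as plain `str`, even though its value is "aaa".
repeated = "a" * 3

# Flagged by the checker as described in the thread, yet it runs fine.
computed = takes_aaa(repeated)
```

Both calls return the same value at runtime; only the statically inferred type of the argument differs.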

conveniently enough this came up last week in a little challenge I posted as part of my stream discord -- here's a proof of concept mypy plugin that I came up with (more on that in the video I posted: https://www.youtube.com/watch?v=KWjGyDflNKQ )

it could probably be generalized to some specific syntax or decorator or Annotated as well but this was just a small proof of concept

```python
from __future__ import annotations

from typing import Callable

from mypy.plugin import Plugin
from mypy.plugin import FunctionContext
from mypy.types import Instance
from mypy.types import LiteralType
from mypy.types import Type


class CustomPlugin(Plugin):
    def _require_only_literals(self, func: FunctionContext) -> Type:
        # invalid signature, but already handled
        if len(func.arg_types) != 1 or len(func.arg_types[0]) != 1:
            return func.default_return_type

        tp = func.arg_types[0][0]
        if (
                not isinstance(tp, Instance) or
                not isinstance(tp.last_known_value, LiteralType)
        ):
            func.api.fail('expected string literal for argument 1', func.context)
            return func.default_return_type

        # already handled by signature
        # if f'{tp.type.module_name}.{tp.type.name}' != 'builtins.str':

        return func.default_return_type

    def get_function_hook(
            self,
            name: str,
    ) -> Callable[[FunctionContext], Type] | None:
        if name == 't.somefunc':
            return self._require_only_literals
        else:
            return None


def plugin(version: str) -> type[Plugin]:
    return CustomPlugin
```

Anthony

On Wed, Aug 4, 2021 at 7:15 PM Paul Bryan <pbryan@anode.ca> wrote:

I have implemented a similar check in our internal tools: you can only do a database query with a statically inferable string literal. I agree the feature would be useful more generally. I'm not sure I like the `Literal[str]` syntax though: to me, that means you can only pass the `str` type, not that you can pass any literal string.

On Wed, Aug 4, 2021 at 12:53, S Pradeep Kumar (<gohanpra@gmail.com>) wrote:

On Thu, Aug 5, 2021 at 1:50 AM Guido van Rossum <guido@python.org> wrote:
I think you are looking for the concept named “tainting” in other languages.
Guido: No, I wasn't looking for full-fledged taint analysis for finding security vulnerabilities. (We do that with Pysa [1], for example :) ) Format strings, explicit database queries, etc. are places where we simply don't want to allow any non-literals. As Jelle had pointed out, people already write ad hoc tools to check for literalness. I think the concept of arbitrary literal strings gives us a simple, readable way to express our intentions. [1]: https://pyre-check.org/docs/pysa-basics -- S Pradeep Kumar

On Thu, Aug 5, 2021 at 6:41 PM S Pradeep Kumar <gohanpra@gmail.com> wrote:
Okay, fine.
Format strings, explicit database queries, etc. are places where we simply don't want to allow any non-literals.
Still, you intend this as a security check, right? (Quoting your first message: "my_format_string(user_controlled_string, sensitive_data)".) And one could certainly write a taint-clearing function like this (following your identity() example):

    def clear_or_die(a: str) -> L[str]:
        if not verify(a):
            raise ValueError(...)
        return cast(L[str], a)

Your examples didn't *quite* show this, but I do think you'd allow this, right?

    def f(a: L[str]): ...

    def g(a: L[str]):
        f(a)  # Allowed, even though 'a' isn't a literal -- its type is still literal

    g("hello")

I do agree with Jelle that the notation Literal[str] feels inconsistent with Literal["hello"].

-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
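A runnable version of the `clear_or_die` idea above might look like the sketch below. The `verify` rule and the `SAFE_FORMAT` pattern are assumptions made up for illustration, and `Literal[str]` is the syntax proposed in this thread rather than an accepted type form, so it is written as a string annotation that nothing evaluates at runtime.

```python
import re
from typing import Literal, cast

# Hypothetical allow-list of characters permitted in a format string.
SAFE_FORMAT = re.compile(r"[A-Za-z0-9 ,:%]+")


def verify(a: str) -> bool:
    # Assumed validation rule: every character must be on the allow-list.
    return SAFE_FORMAT.fullmatch(a) is not None


def clear_or_die(a: str) -> "Literal[str]":
    if not verify(a):
        raise ValueError(f"rejected format string: {a!r}")
    # cast() is a no-op at runtime; it only changes what the checker believes.
    return cast("Literal[str]", a)
```

The cast is the single explicit escape hatch here, so any audit of "who turned a `str` into a literal type" reduces to searching for calls to this one function.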

Yes, you're right. That's the intention here. Other uses are for readability where we want to insist that the user pass in a literal (`struct.unpack("<I", ...)`).
def clear_or_die(a: str) -> L[str]:
Yes, that's certainly allowed since it's still a literal type. The same goes for `"a" if foo else "b"`, etc. Basically, if `reveal_type()` shows that it's a literal string (either `Literal["hello"]` or `Literal[str]`), it is acceptable.
I do agree with Jelle that the notation Literal[str] feels inconsistent with Literal["hello"].
I agree that it's different from the existing uses in that you can't assign `str` to `Literal[str]`. Open to alternatives. Perhaps something like `LiteralString`? The disadvantages were that:

1. You'd need a separate import for `LiteralString` whereas we can just import the familiar `Literal`.
2. `LiteralString` doesn't show the close relation to `Literal["hello"]` whereas `Literal[str]` does. I read both of them as "literal string" but ymmv. It's also a straightforward generalization for `Literal[int]`, `Literal[bool]`, etc.

Other options are `Literal[AnyLiteralString]`, but this still requires a separate import and `AnyLiteralString` by itself wouldn't be a valid type, which seems janky.
-- S Pradeep Kumar

I think that supporting this feature would be nice and it is something that I have come across several times when thinking about other features like tensor typing, and in my opinion supporting this would make the current literals feel more "complete". However I did not try to make a concrete proposal because I expected this discussion to have some rough edges. Ideally, we should be able to use the concept of Literal[str] (or whatever is the syntax) for typing Literals in typeshed.

I find this proposal confusing. It appears to be based on a misunderstanding that the only way to generate a literal type is through the use of a literal expression. Literal types and literal expressions are not the same thing. A literal expression does evaluate to a literal type, but other (non-literal) expressions can also evaluate to literal types.

In type calculus, the union of all possible literal types is equivalent to the wider class upon which those literals are based. For example, the type `bool` is equivalent to the union of `Literal[False]` and `Literal[True]` and vice versa. An enum type is the union of its literal element types and vice versa. Type checkers are allowed to compose and decompose these types during type narrowing and type merging. For example:

```python
class Color(Enum):
    red = 0
    green = 1
    blue = 2

def func(x: Color):
    if x is not Color.red and x is not Color.blue:
        reveal_type(x)  # Literal[Color.green]
```

Or consider this:

```python
v1: Literal[False] = False
reveal_type(v1)  # Literal[False]

v2: Literal[True] = True
reveal_type(v2)  # Literal[True]

v3 = v1 or v2
reveal_type(v3)  # Literal[True]

v4 = v1 and v2
reveal_type(v4)  # Literal[False]

v5 = v1 if random() > 0.5 else v2
reveal_type(v5)  # bool
```

In the case of `int` and `str`, the number of enumerable literal types is very large, but "the union of all int literals" is still equivalent to `int` and vice versa. So I don't see the need to add the notion of `AnyLiteral[int]`. From a type perspective, that's the same as `int`.

-Eric --- Eric Traut Contributor to pyright & pylance Microsoft Corp.

(1)
There is a nuance here. That "equivalence" is not true when it comes to checking overloads. [1] For example, just because a `bool` can only be a `True` or a `False` doesn't mean these two overloads are enough to match any given bool [2]:

```
@overload
def foo(x: Literal[True]) -> Foo: ...
@overload
def foo(x: Literal[False]) -> Bar: ...

def bar(y: bool) -> None:
    foo(y)  # Type error
```

Likewise, there's an observable difference between `Union[Literal[True], Literal[False]]` and `bool` - compatibility doesn't go both ways. The same concept applies to enums, strings, etc. The type system currently allows us to express:

1. a known literal (`Literal["a"]`)
2. a union of known literals (`Literal["a", "b"]`)
3. arbitrary strings (`str`)

There is no way to express a union of arbitrary, unknown literals. Hence this proposal.

(2)
Joining of types is fine. As mentioned earlier in the thread, `"a" if foo else "b"` is still compatible with `Literal[str]` because we join the two types to get `Literal["a", "b"]`. Narrowing is also fine (such as the enum or bool cases you mentioned). As long as the revealed type is a literal string, it should be acceptable. Even for strings, let's say a typechecker narrows the type based on a literal equality check (`message == "hello, %A"`). That's still acceptable for our purposes because we can't arbitrarily go from a str to a Literal[str] without some explicit narrowing in the code. Likewise for the `clear_or_die` narrowing function that Guido shared.

```
def my_format_string(s: Literal[str], x: object) -> str: ...

def foo(message: str, a: object) -> None:
    if message == "hello, %A":
        print(my_format_string(message, a))  # OK if the revealed type is a literal type

    print(my_format_string(message, a))  # Not OK
```

(3)
Lastly, if you still feel it is unnecessary, it would help to share how you would handle the motivating examples without `Literal[str]`. Specifically:

```
def my_format_string(s: Literal[str], x: object) -> str: ...

def print_request(user_request: Dict[str, str]) -> None:
    # OK because the type of the format string is a literal string
    print(my_format_string("hello, %A", user_request["name"]))

    # Not OK because the type is an arbitrary str
    print(my_format_string(user_request["message_format"], user_request["message"]))
```

[1]: https://www.python.org/dev/peps/pep-0586/#interactions-with-overloads
[2]: Mypy snippet for overloads: https://mypy-play.net/?mypy=latest&python=3.8&gist=92f7c56aae6678c7a3fc13aaa19be309

On Fri, Aug 6, 2021 at 9:47 AM Eric Traut <eric@traut.com> wrote:
-- S Pradeep Kumar

I still assert there is no difference between `Literal[False, True]` and `bool`. They are completely equivalent, and to treat them differently would be inconsistent.
That "equivalence" is not true when it comes to checking overloads
If a type checker deconstructs a union argument type when matching overloads (a behavior that is not specified in PEP 484 but is implemented by most Python type checkers), it would also be appropriate for it to deconstruct a `bool` into `Literal[True]` and `Literal[False]` when matching overloads. I don’t think any of the popular Python type checkers do this currently because it would be expensive, and overloads generally include an explicit fallback as recommended in PEP 586. But from the perspective of the type system, this would be perfectly defensible and consistent for `bool` and enum literals, both of which have a closed number of variants. Fallback overloads are always needed for `str` or `int` literals.

Here’s an example to consider.

```python
def func1(val: Literal[True, False]):
    pass

def func2(val: bool):
    func1(val)  # mypy generates an error here; pyright does not
```

In pyright, the above sample type checks fine. Mypy emits an error. I consider that a bug. It probably hasn’t ever been reported because it would be odd to annotate a parameter as `Literal[False, True]`.

PEP 586 introduced literals to Python, and it explicitly says that it borrowed the idea from TypeScript. As you might expect, TypeScript treats `boolean` as equivalent to `false | true`. To do otherwise would introduce an odd and unnecessary inconsistency into the type system.

```typescript
function func1(a: false | true) { }

function func2(a: boolean) {
    func1(a);  // TypeScript is fine with this
}
```

PEP 586 doesn’t explicitly say whether a union that includes all literal variants is equivalent to its non-literal counterpart. It does indicate that deconstruction and merging are possible when the type has a closed number of literal variants, which implies that they are equivalent in those cases. Absent any explicit statement, it seems reasonable to conclude that they are treated consistently, as they are in TypeScript.
it would help to share how you would handle the motivating examples
I guess I don’t find the motivating examples very motivating. :) The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it through an escape transform. None of these techniques will produce a literal-typed value.

Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?

It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.

-Eric -- Eric Traut Contributor to pylance & pyright Microsoft Corp.

Thinking more about this, I actually agree with that statement. Bools and enums should be treated as equivalent to unions of the component literals. However, note that functions for them still need a fallback overload. This is not because of performance reasons or lack of typechecker support like you mentioned, but because we need to handle the case where the input type is a union (e.g., `bool`). That is, we need an overload each for `Literal[True]`, `Literal[False]`, and `bool`. This is true even in TypeScript [1]. This doesn't affect the proposal here, though. I've already addressed the joining and narrowing concerns. The fact remains that there is no way in Python (or TypeScript) to represent a union of arbitrary literal types.
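The three-overload pattern mentioned above (one overload per literal plus a `bool` fallback) can be sketched like this. `describe` and `report` are hypothetical names used only for illustration.

```python
from typing import Literal, overload


@overload
def describe(flag: Literal[True]) -> Literal["on"]: ...
@overload
def describe(flag: Literal[False]) -> Literal["off"]: ...
@overload
def describe(flag: bool) -> str: ...  # fallback for a non-literal bool


def describe(flag: bool) -> str:
    # Single runtime implementation behind the three overloads.
    return "on" if flag else "off"


def report(flag: bool) -> str:
    # `flag` is a plain bool here, so this call relies on the fallback
    # overload unless the checker expands bool into its two literals.
    return describe(flag)
```

With only the first two overloads, the call inside `report` is the one a checker may reject, which is the case the fallback exists to cover.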
I guess I don't find the motivating examples very motivating. :)
That's fair enough. But it's not the same as saying that unions of arbitrary literals can be expressed with current types :) Regarding how format strings can be unsafe in Python and why we'd want to address these in the typechecker instead of a full-fledged taint analysis, I've asked a security engineer (Graham Bleaney) to chime in on this thread.

[1]: TypeScript boolean overloads: https://www.typescriptlang.org/play?ts=4.2.0-dev.20201221#code/CYUwxgNghgTiA...

On Sat, Aug 7, 2021 at 10:43 AM Eric Traut <eric@traut.com> wrote:
-- S Pradeep Kumar

The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it through an escape transform. None of these techniques will produce a literal-typed value.
You're 100% right! There are lots of ways to make sure a given string is safe to use in a given command. The problem is, many of the ways that you suggested (regex filtering, scanning for dangerous characters, escape transforms) can have subtle implementation flaws that allow an attacker to bypass them. Additionally, users can entirely forget to make those checks. As a security engineer, I have to assume that someone somewhere along the line is going to mess up their ad-hoc regex check (or forget to write it) and let a special character slip through. `Literal[str]` gives me another way though. It lets me create *opinionated* APIs that say the user *must* supply a literal string, which the type checker can then tell me with 100% certainty was not dynamically created with user controlled input.

Let's make things concrete and talk about SQL injection. The canonical way to prevent it is to use parameterized queries:

```
def get_data(value: str):
    SQL = "SELECT * FROM table WHERE col = %s"
    return conn.query(SQL, value)
```

The problem is, nothing stops a developer from inserting a dynamically created SQL string into that first parameter and creating a SQL injection vulnerability despite the availability of parameterization:

```
def get_data(value: str):
    SQL = f"SELECT * FROM table WHERE col = '{value}'"
    return conn.query(SQL)
```

If I change the interface of `query` to require that the first argument is a literal, I can prevent this SQL injection issue from happening.
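The difference between the two `get_data` variants above can be observed with a small self-contained sketch. It uses the standard-library `sqlite3` module as a stand-in for the unspecified `conn.query` API, so the placeholder is `?` rather than `%s`; the `users` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Attacker-controlled value that tries to widen the WHERE clause.
value = "nobody' OR '1'='1"

# Parameterized: the SQL text is a literal and the value travels separately.
safe_rows = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (value,)
).fetchall()

# Interpolated: the value becomes part of the SQL text itself.
unsafe_rows = conn.execute(
    f"SELECT secret FROM users WHERE name = '{value}'"
).fetchall()
```

In the parameterized call the value is only ever compared against `name`, so no row matches. In the interpolated call the quote characters rewrite the condition and every row comes back. The first argument of the safe call is a literal string, which is the property the proposed annotation would let an API require.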
Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?
My concerns have nothing to do with memory or type safety, and everything to do with preventing the confusion of data and commands. As you've alluded to, many APIs take a string and run it as code (`pickle`, `eval`, etc. run Python code, SQL APIs run SQL code, `os.system` runs shell commands, `python-ldap`'s APIs let you run LDAP queries, etc.). Sometimes the commands they run need data (i.e., a value to insert into the table). If the commands and data are a part of the same string, injection vulnerabilities can occur. Using `Literal[str]`, API designers can enforce the separation of commands and data by requiring that the commands be literals within the Python program, rather than coming from some external source that is user controlled data.
It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.
Full taint analysis is definitely useful, and I actually spend the majority of my time working on Pysa, which is a taint analysis tool built on top of Pyre. To me, the reasons to want this in a type checker are:

1) It's way faster to run and give feedback to developers. Pysa will take an hour+ to report an issue to a developer on a massive codebase, whereas Pyre can do it in a second.
2) Taint analysis requires that you're able to track sources of user controlled data into the dangerous function. There is always a risk of false negatives there, whereas I can't imagine a case of a false negative coming from a type check for `Literal` (outside of explicit lint suppression).

On Mon, Aug 9, 2021 at 4:03 PM Paul Bryan <pbryan@anode.ca> wrote:
Yes and no. Yes in that I agree -- SQL queries tend to have too much dynamic nature to be simple literals. No in that -- SQL does show having the type system aid you in preventing queries with untrusted contents is quite useful. IIRC, one of the type checkers out there does exactly this?

I think the focus on literal-ness is a bit misleading. In the print_request example, that it is literal isn't the concern so much as it adhering to the requirement that it's a format string with %A in it (or w/e), and being able to know that from static analysis.

    msg_format = "Hello, "
    msg_format += "%A"

Isn't a literal literal, but would also be valid usage (I'd expect it to be, anyways), and statically knowable, and thus should be allowed.

All this reminds me of Google's Java Errorprone and its CompileTimeConstant <https://github.com/google/error-prone/blob/master/annotations/src/main/java/...> annotation. It's an annotation applied to a field (which already has its own type), so it's like a "trait" or "state" or "requirement" (for lack of a better term) of an instance of a type, rather than a type in its own right. I've seen this used in APIs that accept a "literal" and then return a "trusted" type, and then all the e.g. query() apis only accept that trusted type. It works pretty well as a middle ground between "ban all dynamic strings" and "any string is allowed".

The overall intent also seems like a more formal way to write the sort of thing that taint-checking is doing internally: accept instances of type str, but only if they have the "has_percent_A" bit set. We can imagine all sorts of cases like this: JS/HTML/etc strings, objects that had __enter__() called previously, objects that haven't had closed() called, objects that had a post_init() called, etc, etc.

Doesn't Annotated fit this case nicely? e.g. sql: Annotated[str, SafeForSql] or user_home: Annotated[LatLong, SensitiveData], etc. And then it's up to a type checker to implement that state tracking.
Well, sort of. Most, if not all, of the Python type checkers have easy ways to arbitrarily cast, disable, or otherwise subvert the type checker for a given call site. A stray Any can sneak in and subvert everything, too. As a library author, you can't force downstream users to use a type checker, either. My point is, a type checker is focused on enforcing types and their relationships, not so much looking for security flaws. It can help, but it's not a panacea. Giving API authors the tools to prevent inadvertent misuse is still a good goal, though.

I agree with Richard. The name `Literal[str]` may be a bit misleading -- the intention here is not to have a type that just represents "literal strings". Instead, what we are really looking for is a notion of "statically-determinable string" or "compile-time constant string", in which case conditionally-built SQL queries are perfectly acceptable as long as the type checker can show that the query string is built from statically determinable constants no matter what the conditions are.

@Eric I do think under this kind of interpretation, things like `Union[Literal[False], Literal[True]]` would no longer be equivalent to `bool` -- the union should be equivalent to `Literal[bool]` instead. Previously, we did not really differentiate between "statically-determinable values" and "runtime-determinable values" at the type level, and hence `Union[Literal[False], Literal[True]]` and `bool` would be indistinguishable. But once we make that distinction explicit with Pradeep's proposal, the difference between `Literal[bool]` and `bool` starts to matter: `Literal[bool]` would be a subtype of `bool` (as any statically determinable value is also runtime-determinable), but *not* the other way around. And type checkers would need to adjust their implementation of subtyping relations accordingly. Of course, whether the additional capability of type-level compile-time constant tracking is worth the cost of these implementation changes is indeed debatable. But I don't think there's a fundamental reason why we can't make that distinction and can't deviate from what TypeScript does.

@Richard The comparison with Errorprone's CompileTimeConstant is interesting. But when it comes to replicating that with Python's `Annotated` type, my read of PEP 593 is that `Annotated` is more geared towards special-purpose runtime/compile-time analyses than a general-purpose type system feature.
Type checkers are not supposed to assign/interpret the meanings of those annotations, and I think it makes more sense that way since it's not clear how things like `Union[Annotated[str, X], Annotated[str, Y]]` should be addressed -- is it a subtype/supertype of `str`, or `Annotated[str, [X,Y]]`, `Annotated[str, {X, Y}]`, or some other options? And once we start to hard-code a certain interpretation of annotations in the type checker, what happens when our interpretation conflicts with that of other special-purpose analyses? Therefore, my take on this is that if we want the type checker to take on this kind of analysis then it's better to have a dedicated type rather than relying on `Annotated`. On the other hand, if we decide that the type checker is not the best place to implement the feature, then a special-purpose static analysis that relies on `Annotated` could be one potential option to explore. - Jia
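The point about `Annotated` being metadata for special-purpose tools rather than something the type system interprets can be illustrated with a short sketch. `SafeForSql` is the marker name floated earlier in the thread; its definition here as an empty class, and the `query` function, are assumptions for illustration.

```python
from typing import Annotated, get_type_hints


class SafeForSql:
    """Hypothetical marker; it carries no behaviour of its own."""


def query(sql: Annotated[str, SafeForSql]) -> None:
    ...


# By default the metadata is stripped and only the underlying type remains.
plain_hints = get_type_hints(query)

# A special-purpose tool can opt in to seeing the metadata.
rich_hints = get_type_hints(query, include_extras=True)
markers = rich_hints["sql"].__metadata__
```

Here `plain_hints["sql"]` is just `str`, which mirrors how a general-purpose checker treats the parameter, while `markers` holds the `SafeForSql` class for any dedicated analysis that chooses to give it meaning.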

many non-trivial queries (e.g. conditionally including clauses) require generating SQL statements dynamically.
Yup, you're totally right. And a dynamically created SQL query that uses only strings that are statically knowable (read: is just a bunch of `Literal[str]`s concatenated together) is going to be safe for use in a SQL API in all but the most contrived scenarios. And we can still get a `Literal[str]` on a dynamic SQL string built in this way if we add an override on `str.__add__`: ``` def __add__(self: Literal[str], s: Literal[str]) -> Literal[str]: ... ```

The majority of dynamic strings I build are arrays of strings that I join to a single string at the end. Will these be statically knowable?
`List` is already generic, so I think that `List[Literal[str]]` should just work out of the box. Similar to `__add__`, I think we could overload `join` to support this use case:

```
def join(self, __iterable: Iterable[Literal[str]]) -> Literal[str]: ...
```
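As a rough sketch of the join-at-the-end pattern being asked about, the function below collects only string literals into a list and joins them on a literal delimiter. The function name and the table and column names are made up for illustration, and the list is annotated as `List[str]` because `Literal[str]` is only a proposal at this point.

```python
from typing import List


def build_query(include_order: bool, include_limit: bool) -> str:
    # Every element appended here is a string literal, so under the proposal
    # the list could be typed as List[Literal[str]] rather than List[str].
    clauses: List[str] = ["SELECT *", "FROM users", "WHERE id = %s"]
    if include_order:
        clauses.append("ORDER BY created_at")
    if include_limit:
        clauses.append("LIMIT 1")
    # With the suggested overload, joining literal pieces on a literal
    # delimiter would itself be typed as Literal[str].
    return " ".join(clauses)
```

The user-supplied value never enters the list; it stays a separate `%s` parameter for the database API to bind.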

I thought more about this issue and did some brainstorming with my colleagues. Here are a couple of other solutions worth considering.

1. Rather than use "literal" as a proxy for "validated input", use NewType(str). This creates a speed bump so developers who are calling an API know that they are responsible for validating the str input beforehand. A library could even provide functions that perform validation of str inputs and convert them to the new "validated" type.
2. Browsers implement a solution called "trusted types", and some of those ideas are potentially applicable here. See https://web.dev/trusted-types/ for details.

I continue to assert that using "literal" in the proposed manner is a misuse of the Python type system that would have negative consequences. I encourage you to consider alternative solutions to this problem.

-Eric

--
Eric Traut
Contributor to pylance & pyright
Microsoft Corp.

Where would you propose this version of join be implemented?
I'm suggesting that we provide the `join` method I wrote out as an overload for the `str.join` method in typeshed.
1. Rather than use "literal" as a proxy for "validated input", use NewType(str). This creates a speed bump so developers who are calling an API know that they are responsible for validating the str input beforehand. A library could even provide functions that perform validation of str inputs and convert them to the new "validated" type.
We actually do have an API like that, but the same problem persists that developers can pass any argument in to the constructor of `NewType`. The problem is that no matter what API one designs, it has to start with a `str` at some point, and there is no way to know whether that `str` could contain user input or not. Requiring that it be a literal prevents some valid uses of the API, but also prevents all but the most contrived invalid uses of the API.
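A minimal sketch of that limitation, using hypothetical names (`ValidatedSQL`, `run_query`) rather than any real API:

```python
from typing import NewType

# Hypothetical "validated" wrapper type, as in suggestion 1.
ValidatedSQL = NewType("ValidatedSQL", str)


def run_query(sql: ValidatedSQL) -> str:
    # Stand-in for a real database call.
    return f"executing: {sql}"


user_input = "1; DROP TABLE users"  # pretend this came from the network

# Nothing stops a caller from wrapping an arbitrary string. This type-checks
# and runs, because the NewType constructor accepts any `str`.
unsafe = ValidatedSQL("SELECT * FROM users WHERE id = " + user_input)
output = run_query(unsafe)
```

The wrapper documents intent, but the type checker cannot tell whether the wrapped string was actually validated.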
2. Browsers implement a solution called "trusted types", and some of those ideas are potentially applicable here.
Trusted types is a really cool concept, and believe it or not it's actually at the beginning of a ~2-year-long road that ended with this proposal. Without getting *really* invasive into the runtime and changing how strings themselves are handled (you'd effectively need runtime taint tracking to make sure a string didn't come from the network, filesystem, process IO, etc.), trusted types basically devolves back into what you suggested in 1), and has the same fundamental limit.

On Wed, Aug 11, 2021 at 9:45 AM <gbleaney@gmail.com> wrote:
This bit may require some more thought. The join string itself (on which the “join” method is called) may be a literal or a dynamic string, and this overload is valid only if it is a literal. But I don’t think we have a way to define a method overload only on `Literal[str]` and not on `str`? Maybe we could use self-types or something for this, but it’s something that would require additional support, I think.

Carl

Graham, Paul, and Richard brought up a frequent use case for SQL queries: adding literal strings based on a flag or joining an array of literals. For example:

```
SQL = "SELECT * FROM table WHERE col = %s"
if limit:
    SQL += " LIMIT 1"
```

The idea is that we want strings constructed from statically-known literal strings, as Jia pointed out. These are either

- a literal string expression ("foo")
- conditional branches (`if b: return "foo"; else: return "bar"`)
- narrowing using a guard (`if s == "foo":`)
- `+` on literals
- `join` on literals

We want to exclude strings that are constructed from some expression of type `str` since they are not statically-known to have been constructed from literals.

Bikeshedding:

- I'm ok with `AnyLiteral[str]` since `Literal[str]` can be somewhat confusing. The extra import is fine since this is mostly going to be used in libraries, not in user-written code, and we want the `[str]` parameter since we also want this for `[int]`, etc.
- Another option is `MadeFromLiterals[str]`, which is the most accurate description, but is also a bit long.
- Finally, we could have separate classes like `AnyStrLiteral`, `AnyIntLiteral`, etc. This would allow us to subclass `str` and override any methods explicitly, but may be harder to explain. Typecheckers would need to look up any method calls on `Literal["hello"]` on `AnyStrLiteral`, not `str`.

Compatibility: This goes as `AnyLiteral["foo"] <: AnyLiteral[str] <: str` but not in the other direction. (`A <: B` means A is compatible with B.)

Potential objection: Can an attacker contrive a malicious literal string by giving an input that traverses different branches of benign code and appends known literal strings in some order? Yes, but this seems a remote possibility. Based on discussions with security folks, I think that’s a tradeoff we can accept given that disallowing `+` would rule out natural idioms like appending a “LIMIT 1” to a query.
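One way to picture the tradeoff in the potential objection: input can influence which branches run, but the result is always one of a finite set of strings assembled from literals in the source. The sketch below enumerates that set for a small, hypothetical `build_sql` function (the flags and clauses are invented for illustration).

```python
from itertools import product


def build_sql(by_name: bool, limit: bool) -> str:
    # Only string literals are ever concatenated; the flags merely select
    # which literals are used.
    sql = "SELECT * FROM table"
    if by_name:
        sql += " WHERE name = %s"
    else:
        sql += " WHERE id = %s"
    if limit:
        sql += " LIMIT 1"
    return sql


# Every query an attacker could steer the function into producing.
reachable = {
    build_sql(by_name, limit)
    for by_name, limit in product([False, True], repeat=2)
}
```

An attacker who controls both flags can pick among these four queries but cannot introduce any text that is not already in the program.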
Carl: Sure, we can add an overload specifically for literals with an annotation for `self`. That will require literal strings for both the input list and the delimiter. This looks like:

```
class str(Sequence[str]):
    ...
    @overload
    def join(self: Literal[str], iterable: Iterable[Literal[str]]) -> Literal[str]: ...
    @overload
    def join(self, iterable: Iterable[str]) -> str: ...

    @overload
    def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
    @overload
    def __add__(self, other: str) -> str: ...


from typing import Literal

def connection_query(sql: Literal[str], value: str) -> None:
    ...

def my_query(value: str, limit: bool) -> None:
    SQL = "SELECT * FROM table WHERE col = %s"
    if limit:
        SQL += " LIMIT 1"

    connection_query(SQL, value)  # OK
    connection_query(SQL + value, value)  # Error: Expected Literal[str], got str.

def foo(s: str) -> None:
    y = ", ".join(["a", "b", "c"])
    reveal_type(y)  # => Literal[str]

    y2 = ", ".join(["a", "b", s])
    reveal_type(y2)  # => str

    xs: list[Literal[str]]
    y3 = ", ".join(xs)
    reveal_type(y3)  # => Literal[str]

    y4 = s.join(xs)
    reveal_type(y4)  # => str because the delimiter is `str`.
```

The above example works with Pyre after changing the stub for `str`.

Another option is to special-case `+` and `join` within typecheckers like you said in case we don’t want to change the `str` stubs. (Or, if we go down the `AnyStrLiteral` route, these specialized stubs would live in that class.)

--
S Pradeep Kumar
participants (11)

- Alfonso L. Castaño
- Anthony Sottile
- Carl Meyer
- Eric Traut
- gbleaney@gmail.com
- Guido van Rossum
- Jelle Zijlstra
- Jia Chen
- Paul Bryan
- Richard Levasseur
- S Pradeep Kumar