On Mon, Aug 9, 2021 at 4:03 PM Paul Bryan wrote:
I don't think the SQL query is a good use case; many non-trivial queries (e.g. conditionally including clauses) require generating SQL statements dynamically.
Yes and no. Yes in that I agree: SQL queries tend to have too much dynamic nature to be simple literals. No in that SQL does show that having the type system help you prevent queries with untrusted contents is quite useful. IIRC, one of the type checkers out there does exactly this?

I think the focus on literal-ness is a bit misleading. In the print_request example, the concern isn't so much that the string is literal as that it adheres to the requirement of being a format string with %A in it (or whatever), and that this can be known from static analysis.

```
msg_format = "Hello, "
msg_format += "%A"
```

isn't literally a literal, but would also be valid usage (I'd expect it to be, anyway), is statically knowable, and thus should be allowed.

All this reminds me of Google's Java Errorprone and its CompileTimeConstant annotation: https://github.com/google/error-prone/blob/master/annotations/src/main/java/... It's an annotation applied to a field (which already has its own type), so it's like a "trait" or "state" or "requirement" (for lack of a better term) of an instance of a type, rather than a type in its own right. I've seen this used in APIs that accept a "literal" and then return a "trusted" type, so that all the e.g. query() APIs only accept that trusted type. It works pretty well as a middle ground between "ban all dynamic strings" and "any string is allowed".

The overall intent also seems like a more formal way to write the sort of thing that taint-checking is doing internally: accept instances of type str, but only if they have the "has_percent_A" bit set. We can imagine all sorts of cases like this: JS/HTML/etc. strings, objects that have had __enter__() called previously, objects that haven't had close() called, objects that have had post_init() called, etc.

Doesn't Annotated fit this case nicely? E.g. sql: Annotated[str, SafeForSql] or user_home: Annotated[LatLong, SensitiveData], etc. Then it's up to a type checker to implement that state tracking.
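As a sketch of the Annotated idea above (SafeForSql is a hypothetical marker class here, not an existing API; a checker would have to implement the actual state tracking):

```python
from typing import Annotated


class SafeForSql:
    """Hypothetical marker: a checker tracks whether a str has this 'trait'."""


# The annotation carries the extra requirement; the runtime type is still str.
SqlText = Annotated[str, SafeForSql]


def query(sql: SqlText) -> str:
    # A real driver would execute the statement; we echo for illustration.
    return f"running: {sql}"


print(query("SELECT * FROM t"))  # a checker implementing the tracking could verify this
```

At runtime, Annotated is a no-op wrapper around str, which is exactly the appeal: the "trusted" state lives purely in static analysis.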
On Mon, 2021-08-09 at 22:23 +0000, gbleaney@gmail.com wrote:
The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it though an escape transform. None of these techniques will produce a literal-typed value.
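For concreteness, a caller-side check of the regex-filter sort described above might look like this (a sketch; `validate_column` is a hypothetical helper, and note its result is an ordinary str, not a literal type):

```python
import re

# Only plain identifiers pass; anything an attacker could use to smuggle
# SQL syntax (quotes, semicolons, spaces) is rejected.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def validate_column(name: str) -> str:
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


validate_column("user_id")                    # passes
# validate_column("id; DROP TABLE users")     # would raise ValueError
```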
You're 100% right! There are lots of ways to make sure a given string is safe to use in a given command. The problem is, many of the ways that you suggested (regex filtering, scanning for dangerous characters, escape transforms) can have subtle implementation flaws that allow an attacker to bypass them. Additionally, users can entirely forget to make those checks. As a security engineer, I have to assume that someone somewhere along the line is going to mess up their ad-hoc regex check (or forget to write it) and let a special character slip through. `Literal[str]` gives me another way, though. It lets me create *opinionated* APIs that say the user *must* supply a literal string, which the type checker can then tell me with 100% certainty was not dynamically created with user-controlled input.
Well, sort of. Most, if not all, of the Python type checkers have easy ways to arbitrarily cast, disable, or otherwise subvert the type checker for a given call site. A stray Any can sneak in and subvert everything, too. As a library author, you can't force downstream users to use a type checker, either. My point is, a type checker is focused on enforcing types and their relationships, not so much looking for security flaws. It can help, but it's not a panacea. Giving API authors the tools to prevent inadvertent misuse is still a good goal, though.
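For instance, any literal-only API can be defeated at the call site (a sketch; `query` is a stand-in for such an API, and at runtime the annotation enforces nothing):

```python
from typing import Any, cast


def query(sql: str) -> None:
    # Imagine the real signature demands a literal-typed first argument.
    pass


user_input: str = "1; DROP TABLE users"

query(cast(Any, user_input))       # a cast silences the checker entirely
query(user_input)  # type: ignore  # so does a suppression comment
```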
Let's make things concrete and talk about SQL injection. The canonical way to prevent it is to use parameterized queries:
```
def get_data(value: str):
    SQL = "SELECT * FROM table WHERE col = %s"
    return conn.query(SQL, value)
```
The problem is, nothing stops a developer from inserting a dynamically created SQL string into that first parameter and creating a SQL injection vulnerability despite the availability of parameterization:
```
def get_data(value: str):
    SQL = f"SELECT * FROM table WHERE col = '{value}'"
    return conn.query(SQL)
```
If I change the interface of `query` to require that the first argument is a literal, I can prevent this SQL injection issue from happening.
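A sketch of such an interface, assuming a `LiteralString`-style annotation (available as `typing.LiteralString` on Python 3.11+; the runtime fallback and the echoing body are illustrative stand-ins, not a real driver):

```python
# Sketch only: LiteralString is typing.LiteralString on Python 3.11+;
# the str fallback keeps the example runnable on older versions.
try:
    from typing import LiteralString
except ImportError:
    LiteralString = str  # runtime stand-in for older Pythons


def query(sql: LiteralString, *params: object) -> str:
    # A real driver would bind params server-side; we echo for illustration.
    return f"{sql} <- params {params!r}"


def get_data(value: str) -> str:
    # OK: the SQL text is a literal; user data travels separately.
    return query("SELECT * FROM table WHERE col = %s", value)


# query(f"SELECT * FROM table WHERE col = '{value}'")  # a checker would reject this
```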
Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?
My concerns have nothing to do with memory or type safety, and everything to do with preventing the confusion of data and commands. As you've alluded to, many APIs take a string and run it as code (`pickle`, `eval`, etc. run Python code, SQL APIs run SQL code, `os.system` runs shell commands, `python-ldap`'s APIs let you run LDAP queries, etc.). Sometimes the commands they run need data (e.g., a value to insert into a table). If the commands and data are part of the same string, injection vulnerabilities can occur. Using `Literal[str]`, API designers can enforce the separation of commands and data by requiring that the commands be literals within the Python program, rather than coming from some external source that is user-controlled data.
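The same separation applies to shell commands; a small illustration (the filename value is a hypothetical attacker-controlled input):

```python
import shlex

filename = "report.txt; rm -rf /"  # hypothetical attacker-controlled data

# Dangerous: splicing data into the command string; a shell would treat
# ';' as a command separator and run 'rm -rf /' as a second command.
dangerous = f"cat {filename}"

# Safer: the command stays fixed and the data is quoted as a single argument.
safe = f"cat {shlex.quote(filename)}"
print(safe)  # cat 'report.txt; rm -rf /'
```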
It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.
Full taint analysis is definitely useful, and I actually spend the majority of my time working on Pysa, which is a taint analysis tool built on top of Pyre. To me, the reasons to want this in a type checker are:
1) It's way faster to run and give feedback to developers. Pysa will take an hour or more to report an issue to a developer on a massive codebase, whereas Pyre can do it in a second.
2) Taint analysis requires that you're able to track sources of user-controlled data into the dangerous function. There is always a risk of false negatives there, whereas I can't imagine a case of a false negative coming from a type check for `Literal` (outside of explicit lint suppression).
_______________________________________________
Typing-sig mailing list -- typing-sig@python.org
To unsubscribe send an email to typing-sig-leave@python.org
https://mail.python.org/mailman3/lists/typing-sig.python.org/
Member address: pbryan@anode.ca