On Mon, Aug 9, 2021 at 4:03 PM Paul Bryan wrote:
I don't think the SQL query is a good use case; many non-trivial queries (e.g. conditionally including clauses) require generating SQL statements dynamically.
Yes and no. Yes in that I agree: SQL queries tend to have too much dynamic nature to be simple literals. No in that SQL does show that having the type system help you prevent queries with untrusted contents is quite useful. IIRC, one of the type checkers out there does exactly this?

I think the focus on literal-ness is a bit misleading. In the print_request example, the concern isn't so much that the string is literal as that it adheres to the requirement of being a format string with %A in it (or whatever), and that this can be known from static analysis.

```
msg_format = "Hello, "
msg_format += "%A"
```

isn't literally a literal, but would also be valid usage (I'd expect it to be, anyway), is statically knowable, and thus should be allowed.

All this reminds me of Google's Java Errorprone and its CompileTimeConstant annotation: https://github.com/google/error-prone/blob/master/annotations/src/main/java/... It's an annotation applied to a field (which already has its own type), so it's like a "trait" or "state" or "requirement" (for lack of a better term) of an instance of a type, rather than a type in its own right. I've seen this used in APIs that accept a "literal" and then return a "trusted" type, so that all the e.g. query() APIs only accept that trusted type. It works pretty well as a middle ground between "ban all dynamic strings" and "any string is allowed".

The overall intent also seems like a more formal way to write the sort of thing that taint-checking is doing internally: accept instances of type str, but only if they have the "has_percent_A" bit set. We can imagine all sorts of cases like this: JS/HTML/etc. strings, objects that have had __enter__() called previously, objects that haven't had close() called, objects that have had post_init() called, etc.

Doesn't Annotated fit this case nicely? E.g. sql: Annotated[str, SafeForSql] or user_home: Annotated[LatLong, SensitiveData], etc. Then it's up to a type checker to implement that state tracking.
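As a sketch of the Annotated idea above (SafeForSql is a hypothetical marker class here, not an existing API; a checker would have to implement the actual state tracking):

```python
from typing import Annotated


class SafeForSql:
    """Hypothetical marker: a checker tracks whether a str has this 'trait'."""


# The annotation carries the extra requirement; the runtime type is still str.
SqlText = Annotated[str, SafeForSql]


def query(sql: SqlText) -> str:
    # A real driver would execute the statement; we echo for illustration.
    return f"running: {sql}"


print(query("SELECT * FROM t"))  # a checker implementing the tracking could verify this
```

At runtime, Annotated is a no-op wrapper around str, which is exactly the appeal: the "trusted" state lives purely in static analysis.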
On Mon, 2021-08-09 at 22:23 +0000, gbleaney@gmail.com wrote:
The proposal is based on the assumption that a literal type is a good proxy for “a value that has been validated as safe by the caller”. That assumption seems tenuous. There are many ways the caller can perform such validation. For example, it could use a regex filter, compare against a table of known good values, scan for dangerous character sequences, or pass it though an escape transform. None of these techniques will produce a literal-typed value.
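For concreteness, a caller-side check of the regex-filter sort described above might look like this (a sketch; `validate_column` is a hypothetical helper, and note its result is an ordinary str, not a literal type):

```python
import re

# Only plain identifiers pass; anything an attacker could use to smuggle
# SQL syntax (quotes, semicolons, spaces) is rejected.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def validate_column(name: str) -> str:
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


validate_column("user_id")                    # passes
# validate_column("id; DROP TABLE users")     # would raise ValueError
```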
You're 100% right! There are lots of ways to make sure a given string is safe to use in a given command. The problem is, many of the ways that you suggested (regex filtering, scanning for dangerous characters, escape transforms) can have subtle implementation flaws that allow an attacker to bypass them. Additionally, users can entirely forget to make those checks. As a security engineer, I have to assume that someone somewhere along the line is going to mess up their ad-hoc regex check (or forget to write it) and let a special character slip through. `Literal[str]` gives me another way, though. It lets me create *opinionated* APIs that say the user *must* supply a literal string, which the type checker can then tell me with 100% certainty was not dynamically created with user-controlled input.
Well, sort of. Most, if not all, of the Python type checkers have easy ways to arbitrarily cast, disable, or otherwise subvert the type checker for a given call site. A stray Any can sneak in and subvert everything, too. As a library author, you can't force downstream users to use a type checker, either. My point is, a type checker is focused on enforcing types and their relationships, not so much looking for security flaws. It can help, but it's not a panacea. Giving API authors the tools to prevent inadvertent misuse is still a good goal, though.
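For instance, any literal-only API can be defeated at the call site (a sketch; `query` is a stand-in for such an API, and at runtime the annotation enforces nothing):

```python
from typing import Any, cast


def query(sql: str) -> None:
    # Imagine the real signature demands a literal-typed first argument.
    pass


user_input: str = "1; DROP TABLE users"

query(cast(Any, user_input))       # a cast silences the checker entirely
query(user_input)  # type: ignore  # so does a suppression comment
```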
Let's make things concrete and talk about SQL injection. The canonical way to prevent it is to use parameterized queries:
```
def get_data(value: str):
    SQL = "SELECT * FROM table WHERE col = %s"
    return conn.query(SQL, value)
```
The problem is, nothing stops a developer from inserting a dynamically created SQL string into that first parameter and creating a SQL injection vulnerability despite the availability of parameterization:
```
def get_data(value: str):
    SQL = f"SELECT * FROM table WHERE col = '{value}'"
    return conn.query(SQL)
```
If I change the interface of `query` to require that the first argument is a literal, I can prevent this SQL injection issue from happening.
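A sketch of such an interface, assuming a `LiteralString`-style annotation (available as `typing.LiteralString` on Python 3.11+; the runtime fallback and the echoing body are illustrative stand-ins, not a real driver):

```python
# Sketch only: LiteralString is typing.LiteralString on Python 3.11+;
# the str fallback keeps the example runnable on older versions.
try:
    from typing import LiteralString
except ImportError:
    LiteralString = str  # runtime stand-in for older Pythons


def query(sql: LiteralString, *params: object) -> str:
    # A real driver would bind params server-side; we echo for illustration.
    return f"{sql} <- params {params!r}"


def get_data(value: str) -> str:
    # OK: the SQL text is a literal; user data travels separately.
    return query("SELECT * FROM table WHERE col = %s", value)


# query(f"SELECT * FROM table WHERE col = '{value}'")  # a checker would reject this
```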
Python is a runtime-type-safe language, so it already prevents format string attacks from accessing stack or heap locations outside of the target object. Are you primarily concerned about cases where Python code invokes code written in a different language that potentially has vulnerabilities because of a lack of runtime type safety?
My concerns have nothing to do with memory or type safety, and everything to do with preventing the confusion of data and commands. As you've alluded to, many APIs take a string and run it as code (`pickle`, `eval`, etc. run Python code, SQL APIs run SQL code, `os.system` runs shell commands, `python-ldap`'s APIs let you run LDAP queries, etc.). Sometimes the commands they run need data (e.g., a value to insert into a table). If the commands and data are part of the same string, injection vulnerabilities can occur. Using `Literal[str]`, API designers can enforce the separation of commands and data by requiring that the commands be literals within the Python program, rather than coming from some external source that is user-controlled data.
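The same separation applies to shell commands; a small illustration (the filename value is a hypothetical attacker-controlled input):

```python
import shlex

filename = "report.txt; rm -rf /"  # hypothetical attacker-controlled data

# Dangerous: splicing data into the command string; a shell would treat
# ';' as a command separator and run 'rm -rf /' as a second command.
dangerous = f"cat {filename}"

# Safer: the command stays fixed and the data is quoted as a single argument.
safe = f"cat {shlex.quote(filename)}"
print(safe)  # cat 'report.txt; rm -rf /'
```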
It sounds like you are looking for some form of taint analysis, but this doesn’t strike me as a good solution to that problem.
Full taint analysis is definitely useful, and I actually spend the majority of my time working on Pysa, which is a taint analysis tool built on top of Pyre. To me, the reasons to want this in a type checker are:
1) It's way faster to run and give feedback to developers. Pysa will take an hour or more to report an issue to a developer on a massive codebase, whereas Pyre can do it in a second.
2) Taint analysis requires that you're able to track sources of user-controlled data into the dangerous function. There is always a risk of false negatives there, whereas I can't imagine a case of a false negative coming from a type check for `Literal` (outside of explicit lint suppression).
_______________________________________________
Typing-sig mailing list -- typing-sig@python.org
To unsubscribe send an email to typing-sig-leave@python.org
https://mail.python.org/mailman3/lists/typing-sig.python.org/
Member address: pbryan@anode.ca