Use of marshal in the sandbox: is stdlib marshal OK?

The sandbox uses pypy's own implementation of marshal. In pypy/translator/sandbox/sandlib.py is this comment: # Note: we use lib_pypy/marshal.py instead of the built-in marshal # for two reasons. The built-in module could be made to segfault # or be attackable in other ways by sending malicious input to # load(). Also, marshal.load(f) blocks with the GIL held when # f is a pipe with no data immediately avaialble, preventing the # _waiting_thread to run. I'd like to remove as many dependencies as possible from the sandbox code, so I'd like to explore the possibility of using the standard library marshal module. The first reason above is about crashing marshal with malicious input. To my thinking, we are in control of what data is marshaled, so we don't have to worry about malicious input. The untrusted Python code running in the sandbox doesn't have a way of sending marshaled data, so we don't have to worry that it will be used to attack the marshal module. The stdout of the untrusted Python code will become a string that is marshaled, but that doesn't provide a way for the untrusted code to attack the marshal module. Or have I missed something? The second reason I can't address, is this still a problem? What bad effects will we see if it is? --Ned.

it will become an issue if there is a bug in the marshal code inside pypy-c-sandbox which is /creating/ the marshalled data, a bug that would allow a sandboxed program to alter the marshalled data in such a way that it can exploit the vulnerability of the stdlib marshal. Doesn't sound too likely, but in the spirit of having as many layers of security as possible, I propose simply bundling pypy's marshal.py with the sandbox. -- lahwran On Tue, Dec 27, 2011 at 7:30 PM, Ned Batchelder <ned@nedbatchelder.com> wrote:
The sandbox uses pypy's own implementation of marshal. In pypy/translator/sandbox/sandlib.py is this comment:
# Note: we use lib_pypy/marshal.py instead of the built-in marshal # for two reasons. The built-in module could be made to segfault # or be attackable in other ways by sending malicious input to # load(). Also, marshal.load(f) blocks with the GIL held when # f is a pipe with no data immediately avaialble, preventing the # _waiting_thread to run.
I'd like to remove as many dependencies as possible from the sandbox code, so I'd like to explore the possibility of using the standard library marshal module.
The first reason above is about crashing marshal with malicious input. To my thinking, we are in control of what data is marshaled, so we don't have to worry about malicious input. The untrusted Python code running in the sandbox doesn't have a way of sending marshaled data, so we don't have to worry that it will be used to attack the marshal module. The stdout of the untrusted Python code will become a string that is marshaled, but that doesn't provide a way for the untrusted code to attack the marshal module. Or have I missed something?
The second reason I can't address, is this still a problem? What bad effects will we see if it is?
--Ned. _______________________________________________ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev

I guess that is a possibility, but another principle is to use well-used and widely-reviewed code where possible, no? I guess the problem is that built-in marshal isn't trying hard to protect itself against malicious data? The problem with "bundling pypy's marshal.py" is that it pulls in a lot of infrastructure modules, which bulks up the calling process. Maybe there's some low-hanging fruit there that we can trim. Any thoughts on the second issue? --Ned. On 12/27/2011 10:09 PM, lahwran wrote:
it will become an issue if there is a bug in the marshal code inside pypy-c-sandbox which is /creating/ the marshalled data, a bug that would allow a sandboxed program to alter the marshalled data in such a way that it can exploit the vulnerability of the stdlib marshal. Doesn't sound too likely, but in the spirit of having as many layers of security as possible, I propose simply bundling pypy's marshal.py with the sandbox.
-- lahwran
On Tue, Dec 27, 2011 at 7:30 PM, Ned Batchelder<ned@nedbatchelder.com> wrote:
The sandbox uses pypy's own implementation of marshal. In pypy/translator/sandbox/sandlib.py is this comment:
# Note: we use lib_pypy/marshal.py instead of the built-in marshal # for two reasons. The built-in module could be made to segfault # or be attackable in other ways by sending malicious input to # load(). Also, marshal.load(f) blocks with the GIL held when # f is a pipe with no data immediately avaialble, preventing the # _waiting_thread to run.
I'd like to remove as many dependencies as possible from the sandbox code, so I'd like to explore the possibility of using the standard library marshal module.
The first reason above is about crashing marshal with malicious input. To my thinking, we are in control of what data is marshaled, so we don't have to worry about malicious input. The untrusted Python code running in the sandbox doesn't have a way of sending marshaled data, so we don't have to worry that it will be used to attack the marshal module. The stdout of the untrusted Python code will become a string that is marshaled, but that doesn't provide a way for the untrusted code to attack the marshal module. Or have I missed something?
The second reason I can't address, is this still a problem? What bad effects will we see if it is?
--Ned. _______________________________________________ pypy-dev mailing list pypy-dev@python.org http://mail.python.org/mailman/listinfo/pypy-dev

Hi Ned, On Wed, Dec 28, 2011 at 15:34, Ned Batchelder <ned@nedbatchelder.com> wrote:
The problem with "bundling pypy's marshal.py" is that it pulls in a lot of infrastructure modules, which bulks up the calling process.
Unsure what you mean. It seems to me that lib_pypy/marshal.py just imports lib_pypy/_marshal.py, which itself doesn't import any non-standard Python module. A bientôt, Armin.

Hi Ned,
On Wed, Dec 28, 2011 at 15:34, Ned Batchelder<ned@nedbatchelder.com> wrote:
The problem with "bundling pypy's marshal.py" is that it pulls in a lot of infrastructure modules, which bulks up the calling process. Unsure what you mean. It seems to me that lib_pypy/marshal.py just imports lib_pypy/_marshal.py, which itself doesn't import any non-standard Python module. True, but we use py to import it, why is that? Perhaps I'm barking up
On 12/28/2011 11:49 AM, Armin Rigo wrote: the wrong tree trying to reduce the size of the pypy source tree I need alongside my sandbox. --Ned.
A bientôt,
Armin.

On Thu, Dec 29, 2011 at 2:36 PM, Ned Batchelder <ned@nedbatchelder.com> wrote:
On 12/28/2011 11:49 AM, Armin Rigo wrote:
Hi Ned,
On Wed, Dec 28, 2011 at 15:34, Ned Batchelder<ned@nedbatchelder.com> wrote:
The problem with "bundling pypy's marshal.py" is that it pulls in a lot of infrastructure modules, which bulks up the calling process.
Unsure what you mean. It seems to me that lib_pypy/marshal.py just imports lib_pypy/_marshal.py, which itself doesn't import any non-standard Python module.
True, but we use py to import it, why is that? Perhaps I'm barking up the wrong tree trying to reduce the size of the pypy source tree I need alongside my sandbox.
No, it's actually good to keep sandbox relatively separate from the pypy tree.

Hi, On Thu, Dec 29, 2011 at 17:00, Maciej Fijalkowski <fijall@gmail.com> wrote:
True, but we use py to import it, why is that? Perhaps I'm barking up the wrong tree trying to reduce the size of the pypy source tree I need alongside my sandbox.
No, it's actually good to keep sandbox relatively separate from the pypy tree.
Yes, indeed. We should do that, maybe moving sandbox-the-external-process into a separate repository. This would give people more confidence hacking the sandbox code, knowing that it's all regular Python code. A bientôt, Armin.
participants (4)
-
Armin Rigo
-
lahwran
-
Maciej Fijalkowski
-
Ned Batchelder