[Twisted-Python] Safe Pickling using banana and jelly
Hi to all! Maybe this question has been answered elsewhere, but it doesn't seem to be stated anywhere in the Twisted HOWTO's or any other documentation safe the overall stance that jelly and banana are "replacements for pickle aimed at safety, not speed": Is unpickling _untrusted_ network data using banana and jelly a safe thing? After a length check on the data has been done, discarding all messages that are over 50k in size, of course... :) Thanks for any reply. :) Heiko.
Heiko Wundram wrote:
Is unpickling _untrusted_ network data using banana and jelly a safe thing? After a length check on the data has been done, discarding all messages that are over 50k in size, of course... :)
Having only used Twisted for about a day, cumulative, I am not the best person to answer that. However, it does seem that it has a security hole I pointed out in Python's pickle package, which is one of the reasons pickle is not to be trusted. In brief, jelly will unjelly anything, including objects which do destructive acts in the deallocator. And some exist in the standard Python libs. Here's an example.
from twisted.spread import jelly import tempfile import StringIO x = tempfile._TemporaryFileWrapper(StringIO.StringIO(""), "/blah") del x Exception exceptions.OSError: (2, 'No such file or directory', '/blah') in <bound method _TemporaryFileWrapper.__del__ of <tempfile._TemporaryFileWrapper instance at 0x626c60>> ignored x = tempfile._TemporaryFileWrapper(StringIO.StringIO(""), "/blah") jelly.jelly(x) ['tempfile._TemporaryFileWrapper', ['dictionary', ['close_called', ['boolean', 'false']], ['name', '/blah'], ['file', ['StringIO.StringIO', ['dictionary', ['softspace', 0], ['buflist', ['list']], ['pos', 0], ['len', 0], ['closed', 0], ['buf', '']]]]]] q = _ del x Exception exceptions.OSError: (2, 'No such file or directory', '/blah') in <bound method _TemporaryFileWrapper.__del__ of <tempfile._TemporaryFileWrapper instance at 0x626c60>> ignored jelly.unjelly(q) <tempfile._TemporaryFileWrapper instance at 0x626c60> 1 Exception exceptions.OSError: (2, 'No such file or directory', '/blah') in <bound method _TemporaryFileWrapper.__del__ of <tempfile._TemporaryFileWrapper instance at 0x626c60>> ignored 1
However, I don't know enough about how jellied data structures are handled when they come over the wire to know if they are indeed prone to this sort of attack. Eg, one solution is to state that only certain objects can be unpickled, which is the suggested solution for Python's stock pickles. Andrew dalke@dalkescientific.com
On Mon, 26 May 2003 14:41:34 -0600 Andrew Dalke <dalke@dalkescientific.com> wrote:
Having only used Twisted for about a day, cumulative, I am not the best person to answer that. However, it does seem that it has a security hole I pointed out in Python's pickle package, which is one of the reasons pickle is not to be trusted.
In brief, jelly will unjelly anything, including objects which do destructive acts in the deallocator. And some exist in the standard Python libs. Here's an example.
This is... inaccurate. Jelly has security policies. The one used in the jelly module's jelly() and unjelly() module-level functions is setup by default for allowing anything, so as to make using it easy. However, the security policy for jelly in the network protocol PB only allows deserializing classes which have been explicitly approved by the user. -- Itamar Shtull-Trauring http://itamarst.org/ http://www.zoteca.com -- Python & Twisted consulting
Itamar Shtull-Trauring:
This is... inaccurate. Jelly has security policies. The one used in the jelly module's jelly() and unjelly() module-level functions is setup by default for allowing anything, so as to make using it easy.
However, the security policy for jelly in the network protocol PB only allows deserializing classes which have been explicitly approved by the user.
Mmmm, I see the security code, and it looks safe enough against my attack. (I saw the code originally, but I went the wrong way down an if-statement. I saw a __class__ being created, but hadn't realized what it meant to be unjellyable.) Here's another attack: what happens if I pass back a string which encodes [[[[[[[[[[ ... ]]]]]]]]]] where there are enough []s to break the recursion limit?
banana.encode(jelly.jelly([])) '\x01\x80\x04\x82list' banana.encode(jelly.jelly([[]])) '\x02\x80\x04\x82list\x01\x80\x04\x82list' banana.encode(jelly.jelly([[[]]])) '\x02\x80\x04\x82list\x02\x80\x04\x82list\x01\x80\x04\x82list'
The safe unjelly code uses a recursive call, so there can be at most 1,000 levels. Each level takes 8 characters, so I can break the recursion limit at 8,000 bytes, and generate a RuntimeError. It looks like this exception is caught in an 'except:' in _recvMessage, so that's okay (except for a few OSes with a small stack size) ... except what happens if the connection then closes and answerRequired is set and so _sendError is called, which calls Banana's sendEncoded which does a transport.write which fails .... ahh, that's too deep into the way Twisted works for me to follow any further, but it looks like write is guaranteed to catch all underlying errors. So my first pass suggests the code can be used in a resonable secure way. BTW, in my review I saw that jelly.py uses this code: for i in dir(module): i_ = getattr(module, i) if type(i_) == types.ClassType: if issubclass(i_, baseClass): setUnjellyableForClass('%s%s' % (prefix, i), i_) Is there any reason this code doesn't use "for i, i_ in module.__dict__.items()"? I point it out because formally speaking the definition of "dir()" is documented as "primarily as a convenience for use at an interactive prompt" and "its detailed behavior may change across releases." OTOH, it ain't broke, so no need to fix it. For completeness, % fgrep 'dir(' */*.py | grep -v '[a-z]dir' im/gtkcommon.py: for n in dir(base): # the behaviour of dir has changed since this was created. # I think in newer Pythons it isn't needed. But it doesn't # break and is needed for older ones. manhole/explorer.py: for i in dir(instance): # The method is documented as # Note these are only the *instance* methods and members -- # if you want the class methods, you'll have to look up the class. # # However, with the change of dir behaviour in newer Pythons, this # is no longer true -- all names are returned. Fix doc or fix code? manhole/explorer.py: for i in dir(theClass): # not sure which behaviour is desired names/authority.py: for record in [x for x in dir(dns) if x.startswith('Record_')]: # should be # for record in [v for k,v in dns.__dict__.items() if k.startswith('Record_')]: # and get rid of the next line spread/jelly.py: for i in dir(module): spread/newjelly.py: for i in dir(module): # mentioned above trial/unittest.py: names = dir(module) % fgrep 'dir(' */*/*.py | grep -v '[a-z]dir' conch/ssh/connection.py:for v in dir(connection): conch/ssh/transport.py:for v in dir(transport): # connection.__dict__.names() also works conch/ssh/userauth.py:for v in dir(userauth): # userauth.__dict__ also works spread/ui/gtk2util.py: for k in dir(self): # not sure, but here's the context # # # mold can go away when we get a newer pygtk (post 1.99.14) # mold = {} # for k in dir(self): # mold[k] = getattr(self, k) # # Should it go away? Andrew dalke@dalkescientific.com
On Mon, 26 May 2003 17:29:06 -0600 Andrew Dalke <dalke@dalkescientific.com> wrote:
Is there any reason this code doesn't use "for i, i_ in module.__dict__.items()"?
Indeed, these are probably all bugs. Thanks for taking the time to point this out. I've seen bugs caused by this already, so please, everyone, use twisted.python.reflect or python's inspect module, not dir(). -- Itamar Shtull-Trauring http://itamarst.org/ http://www.zoteca.com -- Python & Twisted consulting
On 2003.05.26 16:41, Andrew Dalke wrote:
Heiko Wundram wrote:
Is unpickling _untrusted_ network data using banana and jelly a safe thing? After a length check on the data has been done, discarding all messages that are over 50k in size, of course... :)
Having only used Twisted for about a day, cumulative, I am not the best person to answer that. However, it does seem that it has a security hole I pointed out in Python's pickle package, which is one of the reasons pickle is not to be trusted.
In brief, jelly will unjelly anything, including objects which do destructive acts in the deallocator. And some exist in the standard Python libs. Here's an example.
Well, by default PB (which I assume is what Heiko is using) does not allow sending of arbitrary objects, only objects that have been registered -- and it's easy to make jelly disable arbitrary objects as well. -- Twisted | Christopher Armstrong: International Man of Twistery Radix | Release Manager, Twisted Project ---------+ http://twistedmatrix.com/users/radix.twistd/
On Tue, 2003-05-27 at 00:31, Christopher Armstrong wrote:
Well, by default PB (which I assume is what Heiko is using) does
No, I'm not using PB, just using jelly and banana on their own for encoding network packets. As I thought that it was way too insecure to have to code my own object checking function, I've written my own serializer in the mean time... :) The arguments that Andrew put forth reminded me of Python pickle, and that just doesn't work over an insecure transport, with the remote ends not even being known (maybe). What my serializer now basically does is only allow standard Python objects to be serialized/unserialized, and classes that have been explicitly registered with the Serializer. These classes implement __serialize and __unserialize, to return their internal state as a base Python object (or classes that have been registered), which are then dumped to the stream. The serializer plays nice with class inheritance, serializing all base classes, and unserializing them in the reverse order. __unserialize can e.g. do object checking after being unserialized, or change parameters that have been received over the network. The serializer currently _cannot_ handle recursive objects, but I guess it could be extended to handle that too. If there's any interest in the code, feel free to mail me. :) Maybe even a remote chance to get it included in Twisted (oh what a hypocrite I am). Heiko.
On Tue, May 27, 2003 at 02:38:07AM +0200, Heiko Wundram wrote:
On Tue, 2003-05-27 at 00:31, Christopher Armstrong wrote:
Well, by default PB (which I assume is what Heiko is using) does
No, I'm not using PB, just using jelly and banana on their own for encoding network packets. As I thought that it was way too insecure to have to code my own object checking function, I've written my own serializer in the mean time... :)
[...]
If there's any interest in the code, feel free to mail me. :) Maybe even a remote chance to get it included in Twisted (oh what a hypocrite I am).
While Twisted has a long history of inventing new persistence systems, I don't think we need yet another serialisation module right at the moment :) -Andrew.
On 2003.05.26 20:38, Heiko Wundram wrote:
On Tue, 2003-05-27 at 00:31, Christopher Armstrong wrote:
Well, by default PB (which I assume is what Heiko is using) does
No, I'm not using PB, just using jelly and banana on their own for encoding network packets. As I thought that it was way too insecure to have to code my own object checking function, I've written my own serializer in the mean time... :)
The arguments that Andrew put forth reminded me of Python pickle, and that just doesn't work over an insecure transport, with the remote ends not even being known (maybe).
It seems you've missed other posts in this thread (or my own? I don't remember now), that point out that jelly can indeed restrict instantiation of arbitrary classes. Implementing another half-broken serialization scheme is definitely not a good solution to any problem :-) Can you be more specific about how jelly is reminding you of Python pickle? The recursion problem he mentioned is indeed, as pointed out, caught, and an error is returned to the other end. btw, thanks to Andrew Dalke for an *excellent* beginning of a security audit for jelly. :-) -- Twisted | Christopher Armstrong: International Man of Twistery Radix | Release Manager, Twisted Project ---------+ http://twistedmatrix.com/users/radix.twistd/
On Tue, 2003-05-27 at 10:01, Christopher Armstrong wrote:
It seems you've missed other posts in this thread (or my own? I don't remember now), that point out that jelly can indeed restrict instantiation of arbitrary classes. Implementing another half-broken serialization scheme is definitely not a good solution to any problem :-)
The thing being: I don't like any serialization scheme that just says: Okay, all data is fine with us, we'll coredump if we deconstruct anything bad or obnoxious... (sort'a like that) I guess that's not the point when doing serialization for data received over the net... My intention behind creating my own serialization scheme were the following: 1. I can control how the data is actually stored in the pickle (classes aren't pickled as is, but instead get the chance to __serialize themselves; that's just mildly important, but... Hey. :)) And foremost: I can control exactly which classes are okay for serialization/unserialization. 2. Both Jelly and Banana can't handle a special case where I have an instance of class y (derived from class x), and want the serialization stream to show that class x was sent. Examples of this: My current program has a Host class, and a LocalHost class which is derived from it. The LocalHost, when sent to other hosts, should just get serialized as a Host. Normally you'd have to "cast" the LocalHost to a host object, and then serialize that. My scheme can handle "alias classes", which get serialized using the other's name, making sure that the other class doesn't even get a chance to serialize private attributes. 3. Support for different serialization protocols. I guess I've not made the mistake that most other pickle-like module authors made, that they didn't implement support for more than one protocol right from the start... (hypocrisy... ;)) This e.g. means that object-ids (which basically are names for the pickled class) can have different meanings, when the protocol changes. 4. And most important for me (except 1): I can control exactly what data gets put into the unserialized classes, as all __unserialize functions are called in reverse order (when having classes which are bases to other classes, etc.). The point being: when I load a Host from network, I have to reduce several timeouts as "local" timeouts are much higher than timeouts for instances received over the net. As I don't reduce them when I send them out, I just reduce them when they come in. As such, the Host class is derived from another class, which is called TimeoutBase, which controls this timeout stuff, as there are other classes which also need timeout behaviour. With all "normal" picklers, the Host instance would be created without it ever having the chance to change the private data of the TimeoutBase (yeah, I use private variables for all class data), except by calling some kind of utility function like checkTimeout() (yuck...) or hacking by changing private attributes (even more yuck...). I think this is pretty crufty. When the unserializer just calls all bases __unserialize in turn, I can just have the TimeoutBase store (just) it's values to network when __serialize is called, and when __unserialize is called load them again, and check/reduce them right as they are being put into the class. This also protects against broken objects which are received from the network, which I would have to catch using some kind of checkInstance() logic too... (objects which are loaded with incomplete data, eg.) 5. My pickles are signed data items, and while signing banana'd packets is no problem, I have integrated signature/digesting algorithms with the stream class, which allows for checking of SHA-digests and the like as they are being written out. I don't really know any other pickler that takes care of checksumming packets right in the pickler. You can of course turn off this functionality... :) 6. Without being hypocrite, I guess I can say that the serialization scheme is so simple, that there couldn't be much holes an attacker could come through... Tests like the things Andrew did will certainly not work, and the burden is put on the programmer to make sure that the classes he registers as being serializable are really classes that are okay to serialize (destroy, etc.). All base classes are safe to serialize. And actually only int, string, unicode, complex, long and float ever get written out to the stream, all other objects being abstractions of a _DataStream (which is basically a list kind of object).
btw, thanks to Andrew Dalke for an *excellent* beginning of a security audit for jelly. :-)
Thanks here too... Made me finally give myself the turn to actually write my own serializer... ;) I've wanted to do that for a long time, but I've always feared that it would be an overly complex task. Yesterday proved me wrong about that... :) I needed above functionality 1+4 more than once, when it wasn't available, and that always produced pretty crufty code... Heiko.
On Wed, May 28, 2003 at 06:21:52AM +0200, Heiko Wundram wrote:
On Tue, 2003-05-27 at 10:01, Christopher Armstrong wrote:
It seems you've missed other posts in this thread (or my own? I don't remember now), that point out that jelly can indeed restrict instantiation of arbitrary classes. Implementing another half-broken serialization scheme is definitely not a good solution to any problem :-)
The thing being: I don't like any serialization scheme that just says: Okay, all data is fine with us, we'll coredump if we deconstruct anything bad or obnoxious... (sort'a like that) I guess that's not the point when doing serialization for data received over the net...
Twisted Spread is intended to be safe and secure, because data from the net could be from anywhere, so it has to be robust in the face of dangerous data.
My intention behind creating my own serialization scheme were the following:
1. I can control how the data is actually stored in the pickle (classes aren't pickled as is, but instead get the chance to __serialize themselves; that's just mildly important, but... Hey. :)) And foremost: I can control exactly which classes are okay for serialization/unserialization.
That's what __getstate__ and __setstate__ are for with pickle; IIRC Jelly supports that and also it's own 'getStateToCopyFor' or something along those lines.
2. Both Jelly and Banana can't handle a special case where I have an instance of class y (derived from class x), and want the serialization stream to show that class x was sent. Examples of this: My current program has a Host class, and a LocalHost class which is derived from it. The LocalHost, when sent to other hosts, should just get serialized as a Host. Normally you'd have to "cast" the LocalHost to a host object, and then serialize that. My scheme can handle "alias classes", which get serialized using the other's name, making sure that the other class doesn't even get a chance to serialize private attributes.
I'm sure PB can handle this. Have you read "PB Copyable: Passing Complex Types", http://twistedmatrix.com/documents/howto/pb-copyable?
3. Support for different serialization protocols. I guess I've not made the mistake that most other pickle-like module authors made, that they didn't implement support for more than one protocol right from the start... (hypocrisy... ;)) This e.g. means that object-ids (which basically are names for the pickled class) can have different meanings, when the protocol changes.
I'm not quite sure what you're saying here.
4. And most important for me (except 1): I can control exactly what data gets put into the unserialized classes, as all __unserialize functions are called in reverse order (when having classes which are bases to other classes, etc.). The point being: when I load a Host from network, I have to reduce several timeouts as "local" timeouts are much higher than timeouts for instances received over the net. As I don't reduce [...]
Again, read http://twistedmatrix.com/documents/howto/pb-copyable. With PB you can control exactly what data is sent over the wire. You also get Referenceables, Copyables and Cacheables, which is considerably richer than a simple object protocol that just passes the occasional object-graph.
5. My pickles are signed data items, and while signing banana'd packets is no problem, I have integrated signature/digesting algorithms with the stream class, which allows for checking of SHA-digests and the like as they are being written out. I don't really know any other pickler that takes care of checksumming packets right in the pickler. You can of course turn off this functionality... :)
TCP guarantees the data isn't corrupted on the way through. Use SSL to guarantee that the traffic isn't being sniffed or tampered with in transit. PB has an authentication step so that you can know who the clients are. Between all this, I'm not sure what problem your signatures are solving that isn't already solved?
6. Without being hypocrite, I guess I can say that the serialization scheme is so simple, that there couldn't be much holes an attacker could come through... Tests like the things Andrew did will certainly not work, and the burden is put on the programmer to make sure that the classes he registers as being serializable are really classes that are okay to serialize (destroy, etc.). All base classes are safe to serialize. And actually only int, string, unicode, complex, long and float ever get written out to the stream, all other objects being abstractions of a _DataStream (which is basically a list kind of object).
Banana only understands a very small number of primitives too. PB has been implemented in languages other than Python, so I don't think it's overly complex -- and I believe Brian Warner is working on a rewrite that might make it simpler, faster and more flexible.
btw, thanks to Andrew Dalke for an *excellent* beginning of a security audit for jelly. :-)
Thanks here too... Made me finally give myself the turn to actually write my own serializer... ;) I've wanted to do that for a long time, but I've always feared that it would be an overly complex task. Yesterday proved me wrong about that... :) I needed above functionality 1+4 more than once, when it wasn't available, and that always produced pretty crufty code...
As far as I can see (although I'm no PB expert), PB satisfies 1 and 4. What makes you think it doesn't? -Andrew.
On Wed, 2003-05-28 at 06:45, Andrew Bennetts wrote:
Twisted Spread is intended to be safe and secure, because data from the net could be from anywhere, so it has to be robust in the face of dangerous data.
I know, I was just being a little smartass... ;)
That's what __getstate__ and __setstate__ are for with pickle; IIRC Jelly supports that and also it's own 'getStateToCopyFor' or something along those lines.
Sure it is, but both of them are called only for the topmost object from the inheritance tree, which, when I use private variables, has little chance to see the variables that are defined below in a way that's suitable. And returning __dict__ is a little strange, as reassigning to __dict__ in the __setstate__ mehod won't fix what I need...
I'm sure PB can handle this. Have you read "PB Copyable: Passing Complex Types", http://twistedmatrix.com/documents/howto/pb-copyable?
This is not about SenderPonds and ReceiverPonds, but about the simple thing that LocalHost implements a few other data members, which should not get sent over the wire, ever! (as e.g. private key, etc.) Of course, I could use a getStateToCopyFor() to return only the data members that I want to be returned, and be available in the remote object, but I think that's kind of a hack... (remember, all my data elemens in all classes are private) I think a solution which calls only the base classes __serialize methods (in this example starting with Host) is cleaner. (That's what aliases are all about.)
3. Support for different serialization protocols. I guess I've not made I'm not quite sure what you're saying here.
The serializer supports different protocols right from the start. The datastream is fixed format, but objects can choose to give unpickled data storage different meaning according to the serialization protocol that's chosen. What this works for: I have serialization to network and serialization to database. To database shouldn't change timeouts, to network should (what I stated last time). To database should store private keys, to network shouldn't. Easy solution: Protocol 1 is the network protocol, which has a alias from LocalHost to Host as specified under 2, Protocol 2 is the Database protocol, which doesn't have this alias.
Again, read http://twistedmatrix.com/documents/howto/pb-copyable. With PB you can control exactly what data is sent over the wire. You also get Referenceables, Copyables and Cacheables, which is considerably richer than a simple object protocol that just passes the occasional object-graph.
I'm not using PB, and I don't want to in this project. PB is fine, when you have to deal with complex objects (just as Pyro is in that case), but normally, you only have to send simple objects over the wire, and I guess you're much more worried about the fact that objects might be instantiated wrongly/have to do some parameter checking/get corrupted on transport/willfully. And control over this "dark side of networking" is exactly what this pickler tries to achieve by giving you full control over the pickling and unpickling process.
TCP guarantees the data isn't corrupted on the way through. Use SSL to guarantee that the traffic isn't being sniffed or tampered with in transit. PB has an authentication step so that you can know who the clients are. Between all this, I'm not sure what problem your signatures are solving that isn't already solved?
The transport I use isn't TCP, but UDP. And rewriting PB to use UDP would be a little strange... :) Anyway, what I actually meant by signatures: You can use public key encryption/signing to encrypt members of your data stream. Classes can request to have certain members encrypted/signed in the serialization stream, just as you can for the whole pickle, and when unserializing, you can specify what to do when non-decryptable packets or invalid signature packets are found, without discarding the whole pickle (which is useful, if you need to pass around some structures which have private "text" content, but should nevertheless be readable as to timeout and the like, the class being called "Data").
As far as I can see (although I'm no PB expert), PB satisfies 1 and 4. What makes you think it doesn't?
PB could certainly be crafted to satisfy 1+4, but it certainly doesn't do UDP transport, which I need... And I don't think any PB code which handles most of the points above is as clean as it is with my own serializer. Which isn't bad, of course! PB was just designed with completely different ideas in mind... Again just my 5 cents... :) Heiko.
participants (6)
-
Andrew Bennetts
-
Andrew Dalke
-
Christopher Armstrong
-
Heiko Wundram
-
Heiko Wundram
-
Itamar Shtull-Trauring