[Twisted-Python] Compressing PB communication
I am interested in compressing PB objects after serialization and unpacking them when they are received at the other side. I am using pb.PBClientFactory() and the ServerFactory from the documentation. Where is the most proper place to hack this? What is possible? I need to do this after the object becomes a string, so I won't have to pickle a string. Is it possible to do something like factory.protocol.serialize=mySerialize, where mySerialize uses the banana.encode function to encode the jelly list itself and returns a new list containing the zipped and encoded value, which is then sent as usual? This way I am not interfering with the jelly logic.
-- Regards, Tzahi. -- Tzahi Fadida Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info WARNING TO SPAMMERS: see at http://members.lycos.co.uk/my2nis/spamwarning.html
On Thu, 29 Jun 2006 00:42:41 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
I am interested in compressing PB objects after serialization and unpacking them when they are received at the other side. I am using pb.PBClientFactory() and the ServerFactory from the documentation. Where is the most proper place to hack this? What is possible? I need to do this after the object becomes a string, so I won't have to pickle a string. Is it possible to do something like factory.protocol.serialize=mySerialize, where mySerialize uses the banana.encode function to encode the jelly list itself and returns a new list containing the zipped and encoded value, which is then sent as usual? This way I am not interfering with the jelly logic.
Attached a demonstration of doing this. However, I doubt it's actually a useful idea. Jean-Paul
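The attachment itself is not preserved in this archive. As a rough illustration of the general idea (compressing the byte stream below PB, after serialization), here is a minimal sketch using only the standard library. The class name StreamCompressor and its method names are hypothetical; in a Twisted setup the two methods would presumably be called from the write() and dataReceived() of a protocol wrapper, but that wiring is an assumption and is not shown.

```python
import zlib


class StreamCompressor:
    """Hypothetical helper: compress outgoing bytes, decompress incoming bytes."""

    def __init__(self, level=6):
        # One compressor and one decompressor per connection, so the zlib
        # dictionary is shared across all writes on that connection.
        self._compressor = zlib.compressobj(level)
        self._decompressor = zlib.decompressobj()

    def outgoing(self, data):
        # Z_SYNC_FLUSH pushes out everything buffered so far, so the peer can
        # decode this write without waiting for later writes.
        return self._compressor.compress(data) + self._compressor.flush(zlib.Z_SYNC_FLUSH)

    def incoming(self, data):
        return self._decompressor.decompress(data)


sender = StreamCompressor()
receiver = StreamCompressor()

# Three repetitive messages standing in for serialized PB traffic.
messages = [(b"record-%d;" % i) * 200 for i in range(3)]
wire = [sender.outgoing(m) for m in messages]
received = [receiver.incoming(chunk) for chunk in wire]
```

Because the stream is flushed after every write, each message can be decoded as soon as it arrives, at the cost of a slightly worse compression ratio than compressing everything in one go.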
10x! Exactly what I needed. It is useful for very big transfers (around 20 MB). The clients have a limited bandwidth of 750 kbits or less, so it would be very useful to compress very big messages. Which leads me to ask whether it is possible to associate a .write call with a specific callRemote or return value, or at least to mark the data I wish to send so that it is compressed after serialization in the callRemote. Obviously there is no need to compress everything. More so, it is possible to add meta information per call, like encryption or tags for tracing etc... or if the server is loaded or, for example, use a server in between to handle compression by looking at the meta tag. Could be very useful. I'll need to do some testing to see what is better.
On Thursday 29 June 2006 02:00, Jean-Paul Calderone wrote:
On Thu, 29 Jun 2006 00:42:41 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
I am interested in compressing the PB objects after serialization and
Attached a demonstration of doing this.
However, I doubt it's actually a useful idea.
Jean-Paul
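To put the bandwidth figures from this message in perspective, here is a small back-of-the-envelope calculation. It assumes "20 MB" means 20,000,000 bytes and "750 kbits" means 750,000 bits per second; the compression ratios are hypothetical placeholders, not measurements of PB data.

```python
def transfer_seconds(size_bytes, link_bits_per_second, ratio=1.0):
    """Seconds needed to send size_bytes when compressed to `ratio` of its size."""
    return size_bytes * ratio * 8 / link_bits_per_second


SIZE = 20 * 1000 * 1000   # "around 20 MB", taken as decimal megabytes (assumption)
LINK = 750 * 1000         # 750 kbit/s (assumption about the unit)

# Hypothetical ratios: uncompressed, compressed to half, compressed to a quarter.
for ratio in (1.0, 0.5, 0.25):
    print("ratio %.2f: %6.1f s" % (ratio, transfer_seconds(SIZE, LINK, ratio)))
```

Uncompressed, the transfer takes a little over three and a half minutes under these assumptions, which is why the achievable ratio on the real data is worth measuring first.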
On Thu, 29 Jun 2006 11:21:40 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
More so, it is possible to add meta information per call, like encryption or tags for tracing etc... or if the server is loaded or, for example, use a server in between to handle compression by looking at the meta tag. Could be very useful. I'll need to do some testing to see what is better.
No. PB doesn't have any facility like that (nor for per-message compression). It tries to abstract the wire as much as possible; its communication model is between objects with methods, not byte streams or message processors. You could probably add a type-byte to jelly to do something like this but I certainly don't have the motivation personally :). Patches accepted...
On Thu, 29 Jun 2006 00:42:41 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
I am interested in compressing the PB objects after serialization and unpacking when received at the other side.
Have you actually measured the size of the PB data you are thinking about compressing? PB's serialization tends to be extremely tight.
On Thursday 29 June 2006 02:21, glyph@divmod.com wrote:
On Thu, 29 Jun 2006 00:42:41 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
I am interested in compressing the PB objects after serialization and unpacking when received at the other side.
Have you actually measured the size of the PB data you are thinking about compressing? PB's serialization tends to be extremely tight.
Yes, it's about 20 megabytes of data (~20000 records). Although, each client only gets this once when they log on and each time their cache is wiped. This is one of the reasons I backed off from using a solution like XUL together with XML-RPC and JavaScript; it would not hold (I tried). My only choice is to use pyGTK with either Pyro or Twisted, and in the end I decided to go with Twisted because of security concerns, weighed against performance. This is why I was very insistent on compressing after serialization, so all the security features will be kept intact. 10x.
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On Thu, 29 Jun 2006 09:41:09 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
On Thursday 29 June 2006 02:21, glyph@divmod.com wrote:
Have you actually measured the size of the PB data you are thinking about compressing? PB's serialization tends to be extremely tight.
Yes, it's about 20 megabytes of data (~20000 records).
You are aware that PB has a 640kb-per-message limit, right? You'll have to be breaking up that data set.
On Thursday 29 June 2006 21:33, glyph@divmod.com wrote:
On Thu, 29 Jun 2006 09:41:09 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
On Thursday 29 June 2006 02:21, glyph@divmod.com wrote:
Have you actually measured the size of the PB data you are thinking about compressing? PB's serialization tends to be extremely tight.
Yes, it's about 20 megabytes of data (~20000 records).
You are aware that PB has a 640kb-per-message limit, right? You'll have to be breaking up that data set.
No, that is new to me. I cannot break it up. Can it be done using writeSequence or write in the ProtocolWrapper? I don't want to handle this on the application/PB level. I can see, though, that this is confined to banana.py and cBanana; I can change it there as a last resort, right? 640kb looks arbitrary to me. Times are changing; what was considered large in the past is small in the present. At the very least it should be configurable.
On Thursday 29 June 2006 22:31, Itamar Shtull-Trauring wrote:
On Thu, 2006-06-29 at 22:01 +0300, Tzahi Fadida wrote:
I don't want to handle this on the application/PB level.
twisted.spread.util.Pager and subclasses thereof will do it for you.
First, the objects themselves hold, for example, a list of records. The records themselves are probably never going to exceed 640kb. Looking at banana, it appears to check the limit per string field. I.e. even if the message is 1gb, as long as each field in the message, like a list of strings [str1,str2,...], does not exceed 640kb, then there is no problem, right?
Assuming there is a problem: I don't see how using Pager is not going to involve circumventing a PB call. This should be done below the level of performing a call to the other side. As I said, I cannot break up the arbitrary objects I am sending, since I cannot predict each and every one of them in advance. The proper way to do it, it seems to me, is in the underlying communication.
I think that what should have been done was to send the size of the transfer in advance and only then say "ok, this transfer is going to be over 640kb, don't want it", or "ok, the transmission has exceeded 640kb => error". And, according to this number and the policy I enforce for limiting the size of the expanded message, pass a parameter to banana.
For now, I think I will set the limit parameter for banana.py:
pb.banana.SIZE_LIMIT=20*1024*1024
However, for cBanana I will have to either hardwire it or change it to accept it as a parameter. This is not a variable in cBanana; it just says 640*1024 instead of SIZE_LIMIT. How can I pass parameters to cBanana (at least at initialization time)? I am not sure if it is even on, though I am seeing it in the directory: /usr/lib/python2.4/site-packages/twisted/spread/cBanana.so
Maybe there should be something like pb.changeMessageSizeLimit(numBytes)
How can i pass parameters to cBanana (at least at initialization time), i am not sure if it is even on.
http://twistedmatrix.com/pipermail/twisted-python/2005-September/011343.html
http://twistedmatrix.com/pipermail/twisted-python/2004-December/009158.html
-- Nicola Larosa - http://www.tekNico.net/
[Python] insists upon arranging the cupboard into strict rows, but you stop noticing after a while, and eventually you come to prefer your shelves organized this way. Your friends think this is weird until they start dating Python too. -- Meredith L. Patterson, March 2006
On Thu, 29 Jun 2006 15:31:17 -0400, Itamar Shtull-Trauring <itamar@itamarst.org> wrote:
On Thu, 2006-06-29 at 22:01 +0300, Tzahi Fadida wrote:
I don't want to handle this on the application/PB level.
twisted.spread.util.Pager and subclasses thereof will do it for you.
I should emphasize that not only do these classes help with this sort of functionality; every PB application I'm aware of that had to transfer bulk data over the main channel worked this way. It's not like you're going to be in any new territory or implementing a large raft of functionality yourself.
On Thu, 29 Jun 2006 22:01:22 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
No, that is new to me. I cannot break it up. Can it be done using writeSequence or write in the ProtocolWrapper? I don't want to handle this on the application/PB level. I can see, though, that this is confined to banana.py and cBanana; I can change it there as a last resort, right? 640kb looks arbitrary to me. Times are changing; what was considered large in the past is small in the present. At the very least it should be configurable.
Not really. PB is optimized (heavily optimized) for lots of small control-channel messages. Serializing 20MB of data with Jelly would probably stop your process for a good 30 seconds, and due to the way that the original Jelly was designed, it cannot be processed incrementally; you will end up allocating something like 100MB of memory just to get the data serialized and the packet sent, if you raise the limit. Feel free to do so; if multi-second messages and multi-hundred-megabyte memory costs are acceptable to you then you can just change the arbitrary limit. I won't accept patches to make it easily configurable in Twisted though; IMHO that is just bad design. You should instead be factoring your data to be produced on demand in a series of messages rather than one giant message. In addition to lower overall memory cost and more immediate feedback, that has the additional advantage of providing you with a way to interleave other messages on the same channel.
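A minimal sketch of what "a series of messages rather than one giant message" can look like at the byte level follows. It is plain Python and does not use PB; the function name split_into_messages and the 64 KiB default are hypothetical, and SIZE_LIMIT simply reuses the 640*1024 figure quoted earlier in the thread.

```python
SIZE_LIMIT = 640 * 1024  # per-message limit as quoted in the thread, in bytes


def split_into_messages(payload, chunk_size=64 * 1024):
    """Yield consecutive slices of payload, each at most chunk_size bytes."""
    if not 0 < chunk_size <= SIZE_LIMIT:
        raise ValueError("chunk_size must be between 1 and SIZE_LIMIT")
    # Slices are produced lazily, so only one chunk needs to be in flight
    # at a time and other messages can be interleaved between chunks.
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


payload = bytes(2 * 1024 * 1024 + 5)   # dummy 2 MiB + 5 byte payload
chunks = list(split_into_messages(payload))
reassembled = b"".join(chunks)
```

The receiver only has to concatenate the chunks in order, which works as long as the channel delivers messages in the order they were sent.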
On Friday 30 June 2006 03:44, glyph@divmod.com wrote:
On Thu, 29 Jun 2006 22:01:22 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
No, that is new to me. I cannot break it up. Can it be done using writeSequence or write in the ProtocolWrapper? I don't want to handle this on the application/PB level. I can see, though, that this is confined to banana.py and cBanana; I can change it there as a last resort, right? 640kb looks arbitrary to me. Times are changing; what was considered large in the past is small in the present. At the very least it should be configurable.
Not really. PB is optimized (heavily optimized) for lots of small control-channel messages. Serializing 20MB of data with Jelly would probably stop your process for a good 30 seconds, and due to the way that the original Jelly was designed, it cannot be processed incrementally; you will end up allocating something like 100MB of memory just to get the data serialized and the packet sent, if you raise the limit.
30 seconds of doing what? Serializing? Let me try something else: what if I want to replace jelly with pickle for a special Server and Client factory, for sending messages from the server to the client only? If the client sends a message to the server, it must be jellied. The idea is that you are consciously saying that the client completely trusts the server. I think this is a good idea for a security model that also wants performance and resource savings.
Feel free to do so; if multi-second messages and multi-hundred-megabyte memory costs are acceptable to you then you can just change the arbitrary limit. I won't accept patches to make it easily configurable in Twisted though; IMHO that is just bad design. You should instead be factoring your data to be produced on demand in a series of messages rather than one giant message. In addition to lower overall memory cost and more immediate feedback, that has the additional advantage of providing you with a way to interleave other messages on the same channel.
I don't understand. Are you saying that Twisted does not send portions, i.e. it is blocking? That doesn't sound right to me; even if I send 20mb of data it should be portioned, letting other transfers also get a chance. I was under the impression that Twisted is multiplexing even on the channel level. I can, of course, always open 2 channels, but this is evil.
On Fri, 30 Jun 2006 12:24:43 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
On Friday 30 June 2006 03:44, glyph@divmod.com wrote:
On Thu, 29 Jun 2006 22:01:22 +0300, Tzahi Fadida <tzahi.ml@gmail.com> wrote:
Not really. PB is optimized (heavily optimized) for lots of small control-channel messages. Serializing 20MB of data with Jelly would probably stop your process for a good 30 seconds, and due to the way that the original Jelly was designed, it cannot be processed incrementally; you will end up allocating something like 100MB of memory just to get the data serialized and the packet sent, if you raise the limit.
30 seconds of doing what? serializing?
Yeah. This is just an estimate, of course, but passing 20 megabytes of structured data through jelly is going to be really slow. It definitely should be faster, but nobody really has the inclination; people with harder performance requirements tend to just go use other protocols rather than improve PB, as Bruce mentioned with STOMP.
Let me try something else: what if I want to replace jelly with pickle for a special Server and Client factory, for sending messages from the server to the client only? If the client sends a message to the server, it must be jellied. The idea is that you are consciously saying that the client completely trusts the server. I think this is a good idea for a security model that also wants performance and resource savings.
I don't think this is particularly a good idea, but then, it's an idea that has less and less to do with PB all the time. You're talking about implementing a new protocol that has about two dozen features that PB does not have: support for pluggable serialization mechanisms, message tagging, on the fly compression, chunked encoding of large messages. Some of these features would require that you change the way ordering guarantees work in PB and the way concurrency interacts with its API. Maybe you could use some small portion of PB to build this monster of a protocol, but when you're done, you would not be using PB.
I can't see why you need the protocol machinery to do all of this for you. If I were building this application I'd certainly just use streaming producers and send PB messages (over an existing, unmodified PB) that were of a reasonable size until all the data had been transferred.
Also, designing a protocol where you "completely trust" the other end of the wire, even if it is the "server", is a bad idea. You should only trust the other end of the wire if every message that it sends is encrypted and signed with a verified certificate, and even that is a stretch. Using pickle at the protocol level means that during the verification process, you are vulnerable to attacks that use the pickled signature exchange to send you an exploit.
I don't understand. Are you saying that Twisted does not send portions, i.e. it is blocking? That doesn't sound right to me; even if I send 20mb of data it should be portioned, letting other transfers also get a chance. I was under the impression that Twisted is multiplexing even on the channel level. I can, of course, always open 2 channels, but this is evil.
If you send 20mb of data in a single write() call, that means you have at least 20mb of data sitting in the outbound write buffer until it can all be written, not to mention that the jelly serialization process is going to copy all of your data at least twice.
Well, let's reset the discussion. You have convinced me. I'll move all the complexity to the application level. You are right about the scalability issue, so I am inclined to agree with you.
The idea is this: first, I will use pickled/jellied strings to send objects to the client, all the while using Pagers. If the pickled/jellied string is too large, I will dump it to disk so it won't be in memory. The main reason I need to use pickle or jelly is that it is hard to know what the eventual size of a jellied/pickled object will be (so that message<640kb), and I can't use Pager on arbitrary objects.
I will portion the big objects using some kind of series of updating tasks. A bit of a headache, though. This is on the application level and will probably need to be specified explicitly. However, I was going to do that anyway, for synchronization purposes after sending the initial huge message. Sending from client to server will have to be done very explicitly: for example, an act for transferring documents, an act for sending files, an act for mass updates of records, etc. Another headache.
Is there a very, very simple example of a client and server using PBServerFactory and ClientFactory that also uses a Pager to send a big string? 10x.
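No such example appears in this thread. As a stand-in, here is a plain-Python sketch of the paging pattern described above: a big string is compressed, sent page by page to a collector, and reassembled when paging ends. It does not use Twisted or twisted.spread.util.Pager; PageCollector, send_in_pages and their method names are hypothetical, and in a real PB application each got_page call would be a remote call rather than a local one.

```python
import zlib


class PageCollector:
    """Hypothetical receiver: gathers pages, then hands the result to a callback."""

    def __init__(self, on_complete):
        self._pages = []
        self._on_complete = on_complete

    def got_page(self, page):
        # Each page stays well below the per-message size limit.
        self._pages.append(page)

    def ended_paging(self):
        # Join the pages in arrival order and undo the compression.
        self._on_complete(zlib.decompress(b"".join(self._pages)))


def send_in_pages(data, collector, page_size=8192):
    """Compress data, deliver it in page_size slices, return the page count."""
    compressed = zlib.compress(data)
    count = 0
    for start in range(0, len(compressed), page_size):
        collector.got_page(compressed[start:start + page_size])
        count += 1
    collector.ended_paging()
    return count


results = []
big_string = b"".join(b"record %d\n" % i for i in range(20000))
pages_sent = send_in_pages(big_string, PageCollector(results.append))
```

Compressing before paging keeps the compression at the application level, so PB itself stays unmodified, which is the direction the discussion settled on.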
participants (5)
- glyph@divmod.com
- Itamar Shtull-Trauring
- Jean-Paul Calderone
- Nicola Larosa
- Tzahi Fadida