[IPython-dev] Message spec draft more fleshed out

Wed Aug 11 03:03:12 EDT 2010

On Tue, Aug 10, 2010 at 17:23, Fernando Perez <fperez.net at gmail.com> wrote:

> Hey Min,
>
> On Tue, Aug 10, 2010 at 11:55 AM, MinRK <benjaminrk at gmail.com> wrote:
> > This is great.
> > There are a few additional functionalities I need on top of this, that I
> > have added to the message spec I use in my parallel code.
> > I have multiple clients, and need unique message ids, so clearly ints are
> > inadequate.  I switched msg_id to also be a uuid. I could certainly
> > generate unique msg ids in the controller by combining the msg id and the
> > session id, which should be unique.
>
> OK, should we just switch to these right away?  If so, does this sound
> like the right way to make them: ?
>

It's nice to have int access at the client level, and I have that built in to
my client, but the real IDs used by the system are all uuids. This is fairly
easy to implement.

>
> uuid.uuid1(os.getpid())
>
> We'd obviously cache the pid, but this lets us seed the uuid1 call
> with the pid of each client.  Alternatively we can call uuid4(), and
> trust that the probability of collisions is low enough. I sort of like
> better the idea of seeding with a known quantity; we could combine
> hostid and pid if we want to be extra safe, but I don't think it's
> worth worrying about that level of low-probability of collision, is
> it?
>

uuid1(pid) is nice because it makes collision impossible on a single
machine.

However, the pid seed replaces a 48b section of the UUID, and the rest is
time based. PIDs generally exist in a very small range (most likely low
thousands unless you have infinite uptime, in which case it is approximately
a random 15b number). If you are on many machines, rather than many
processes in one machine, the PID section is a dramatic reduction in
randomness, as is the time-based segment, which should be quite similar
across machines.

UUID Sections (from RFC 4122):
timestamp : 60b, resolution = 0.1us
version   : 4b, constant
clock_seq : 16b, treat as random
node      : 48b, assigned from 15b PID range
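
The sections listed above can be inspected with Python's standard uuid
module. This is only a sketch: the variable names are made up for the
example, and RFC 4122 reserves 2 of the 16 clock_seq bits for the variant,
so the clock_seq attribute reports a 14-bit value.

```python
import os
import uuid

pid = os.getpid()              # cached once per process
u1 = uuid.uuid1(node=pid)      # the pid replaces the 48-bit node field
u4 = uuid.uuid4()              # random apart from the version/variant bits

print(u1.version, u4.version)  # version field of each uuid
print(u1.node == pid)          # the node section carries the pid as given
print(u1.time)                 # 60-bit timestamp, counted in 0.1us ticks
print(u1.clock_seq)            # 14-bit value (variant bits excluded)
```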

So for two uuids generated on different machines within 1ms (relative
internal clock time), the probability of collision is at least:
(likelihood of timestamp match) * (clock_seq match) * (PID match)
1:(1e4 * 2^16 * 2^15) ~ 1 in 1e13.

Treating uuid4 as fully random (124b), the likelihood of two uuids being
identical is
1:2^124 ~ 1 in 1e37.

Much less likely.
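
The two estimates above can be rechecked with a few lines of Python. This
is a sketch of the same arithmetic, not a rigorous collision model; the
second pair of figures is an added assumption that excludes the 2 variant
bits RFC 4122 reserves (leaving a 14b clock_seq and 122 random bits in
uuid4), which shifts both results by less than an order of magnitude.

```python
import math

# Figures as used in the estimate above.
timestamp_ticks = 10**4                       # 1ms window / 0.1us resolution
uuid1_space = timestamp_ticks * 2**16 * 2**15
uuid4_space = 2**124

# Same estimate with the 2 variant bits excluded (assumption, see lead-in).
uuid1_space_rfc = timestamp_ticks * 2**14 * 2**15
uuid4_space_rfc = 2**122

for label, n in [
    ("uuid1(pid), 16b clock_seq", uuid1_space),
    ("uuid4, 124b", uuid4_space),
    ("uuid1(pid), 14b clock_seq", uuid1_space_rfc),
    ("uuid4, 122b", uuid4_space_rfc),
]:
    print(f"{label}: 1 in about 10^{math.log10(n):.1f}")
```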

Running on a single machine with 8 engines (using my zmq IPython cluster), I
generated 100k UUIDs on each, as fast as I could, first with uuid1(1), and
second with uuid4().  I reliably had at least 1, and up to 5 collisions with
the uuid1 case. I never encountered a collision with uuid4.
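
A minimal sketch of how such a collision count could be done. It does not
reproduce the cluster run: the helper names (make_batch, count_collisions)
are hypothetical, and here all batches are generated in one process, whereas
in the test described above each batch would come from a separate engine.

```python
import uuid

def make_batch(n, factory):
    """Generate n uuids as hex strings using a zero-argument factory."""
    return [factory().hex for _ in range(n)]

def count_collisions(batches):
    """Count ids that duplicate an id seen earlier, across all batches."""
    seen = set()
    collisions = 0
    for batch in batches:
        for msg_id in batch:
            if msg_id in seen:
                collisions += 1
            else:
                seen.add(msg_id)
    return collisions

# Stand-in for 8 engines; swap in lambda: uuid.uuid1(1) to try the seeded case.
batches = [make_batch(1000, uuid.uuid4) for _ in range(8)]
print(count_collisions(batches))
```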

I also discovered that uuid1 with a specified seed is notably faster than
uuid4 (22us vs 33us on my machine).
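
A timing comparison like this can be sketched with the standard timeit
module. The call count is an arbitrary choice, and the absolute numbers
(and possibly even the ordering) will differ with the machine and the
Python version.

```python
import timeit
import uuid

n = 10000  # number of calls per measurement (arbitrary)

# Average seconds per call for each generator.
t1 = timeit.timeit(lambda: uuid.uuid1(1), number=n) / n
t4 = timeit.timeit(uuid.uuid4, number=n) / n

print(f"uuid1(1): {t1 * 1e6:.1f}us per call")
print(f"uuid4():  {t4 * 1e6:.1f}us per call")
```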

> > Since I need to inspect messages on the way, and don't want to have to
> > unpack the content of the message, I can't send the whole message as one
> > json object. For this, I split it, such that the headers and content are
> > sent separately. msg_type is added to the header for this.
>
> Should we list the multipart spec separately?  This would be only for
> 'data carrying' messages, or would it be for all communications?
>

It's not just for data carrying messages, it's relevant for all snooped
messages passing through a controller. The controller should never need to
unpack the content of a message, since it could be a massive code block in
an execute_request or a big fat reply. Currently, all messages are sent this
way in my code, but that doesn't need to be the case. It does need to be the
case for all messages sent from the client to the kernel, since those are
the ones whose headers are inspected.
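
A minimal sketch of the split described above, assuming JSON for both
parts. The function names (pack, peek_header, unpack) and the exact header
fields are hypothetical and are not the code in my fork; with zmq the two
frames would presumably be sent as one multipart message.

```python
import json
import uuid

def pack(msg_type, content, session):
    """Serialize a message as two frames: [header, content]."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "session": session,
        "msg_type": msg_type,   # kept in the header so it can be inspected
    }
    return [
        json.dumps(header).encode("utf-8"),
        json.dumps(content).encode("utf-8"),
    ]

def peek_header(frames):
    """What a controller would do: decode only the first frame."""
    return json.loads(frames[0].decode("utf-8"))

def unpack(frames):
    """What the final recipient would do: decode both frames."""
    header = peek_header(frames)
    content = json.loads(frames[1].decode("utf-8"))
    return header, content

# A large code block stays untouched in the second frame while routing.
frames = pack("execute_request", {"code": "a = 1\n" * 1000},
              session=uuid.uuid4().hex)
print(peek_header(frames)["msg_type"])
```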

>
> > I need to be able to send data without copying, and for that I added a
> > 'buffers' element at the top level of a message.  I also added an
> > apply_message type, for using Brian's apply model. I will write up how
> > the apply stuff works later (I expect there will be some discussion and
> > rearrangement of some of it).
> > I also added, but no longer use, a subheader, which allows senders of
> > messages to extend the header.  I needed this because when the Controller
> > parses a message destined for an engine, it shouldn't unpack the content
> > of the message, only the header. Since the routing is now handled purely
> > in zmq, I don't currently have a need for the subheader, but I can
> > certainly imagine it being useful.  This is not so much a part of the
> > root message format as it is a part of the session.send() api.
>
> I'm finishing up the doc, it would be good if you could write up these
> ideas into it so we have all the design in one place...  I'll ping
> soon with the finished draft.
>

I will write them up (and have already done some in my fork). I am
travelling now, but will be back in Berkeley Thursday. I might have some
good writing time on the plane, though.

>
> Cheers,
>
> f
>