I don't mind replying to the mailing list, unless it annoys someone? Maybe some people could be interested in this discussion. You have a lot of questions! :) My answers are inline.

2010/8/5 Kevin Ar18 <kevinar18@hotmail.com>:
Note: Gabriel, do you think we should discuss this on another mailing list (or in private)? I'm not sure this is related to PyPy dev anymore.
Anyways, what are your future plans for the project? Is it just an experiment for school, maybe in the hopes that others would maintain it if it was found to be interesting? Or are you planning actual future development, maintenance, and promotion of it yourself?
Based on the interest and time that I and other people will have, I plan to debug this as much as possible. If people are interested in joining in after my thesis, I'll be more than open to welcoming them into the project. Right now, I'm writing my report and I'm also looking for a job. I won't have much time to touch the code again before next month, when I'll prepare it for my presentation along with a lot of examples and use cases.
-----------
On a personal note... the concept has a lot of similarities to what I am exploring. However, I would have to make so many additional modifications. Perhaps you can give some thoughts on whether it would take me a long time to add such things?
Alright, my plan was to provide all the needed lower-level constructs that can be used to build more complex things. For example, a mix of tasklets and sync channels could be wrapped in an API to create async channels. I know this is far from complete and I have a few ideas on how it could be improved in the future, but it's currently not needed for my project. For now, the idea was to stay as close as possible to standard Stackless Python and only add the APIs and functionality needed to support distributing tasklets between multiple interpreters.
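As a rough sketch of what I mean (untested, and using only the standard stackless channel API; the class and method names here are just for illustration), an async channel could be built like this:

    import stackless

    class AsyncChannel:
        # Sketch: a non-blocking send() built on top of a sync channel
        # plus a helper tasklet that drains an internal buffer.
        def __init__(self):
            self._chan = stackless.channel()    # sync rendezvous channel
            self._buf = []                      # pending messages
            self._wakeup = stackless.channel()  # signals the pump tasklet
            stackless.tasklet(self._pump)()

        def send(self, value):
            # Queue the value and wake the pump if it is sleeping;
            # the caller never blocks.
            self._buf.append(value)
            if self._wakeup.balance < 0:
                self._wakeup.send(None)

        def receive(self):
            # Blocks like a normal channel receive.
            return self._chan.receive()

        def _pump(self):
            while True:
                if not self._buf:
                    self._wakeup.receive()          # sleep until data arrives
                self._chan.send(self._buf.pop(0))   # rendezvous with a receiver

The pump tasklet only runs inside the normal scheduler (stackless.run() or explicit scheduling), so everything stays cooperative.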
Some examples:
* Two additional message passing styles (in addition to your own)

Queues - multiple tasklets can push onto a queue, but only one tasklet can pop. Multiple tasklets can access a property to find out if there is any data in the queue. Queues can be set to an infinite size or capped at a maximum number of entries.
This could easily be implemented using a standard channel and by starting multiple tasklets to send data. With some helper methods on a channel, it would be possible to know how many tasklets are waiting to send their data. A channel already has a built-in queue for send/receive requests; this queue contains a list of all tasklets waiting for a send/receive operation. Tasklets are supposed to be lightweight enough to support something like this.
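For example, a minimal illustration with the standard stackless API (the balance check at the end is the kind of helper I mean):

    import stackless

    ch = stackless.channel()

    def producer(name):
        for i in range(3):
            ch.send((name, i))  # blocks until the consumer pops

    def consumer(count):
        for _ in range(count):
            item = ch.receive()
            print(item)

    for name in ("A", "B", "C"):   # many pushers, one popper
        stackless.tasklet(producer)(name)
    stackless.tasklet(consumer)(9)
    stackless.run()

    # A positive ch.balance means that many tasklets are blocked
    # waiting to send, i.e. "there is data in the queue".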
Streams - I'm not sure of the exact name, but kind of like an infinite stream/buffer, useful for passing infinite amounts of data. Only one tasklet can write/add data; only one tasklet can read/extract data.
Like a UNIX pipe()? Async? Again, some code wrapping standard channels could be used for this.
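Something in this direction, maybe (again just a sketch on top of a standard channel, with made-up names; buffering could be layered on the same way as in the async channel sketch above):

    import stackless

    class Stream:
        # Sketch: one writer, one reader, a sentinel marks end-of-stream.
        _EOF = object()

        def __init__(self):
            self._chan = stackless.channel()

        def write(self, data):      # writer side, like a pipe's write end
            self._chan.send(data)

        def close(self):
            self._chan.send(self._EOF)

        def read(self):             # reader side; None signals EOF
            data = self._chan.receive()
            return None if data is self._EOF else data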
* Message passing

When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and specify whether it extracts data from them or writes data to them (it can only either extract or write, as noted above). The tasklet's global namespace has access to these queues or streams and can extract data from or add data to them.
In my case, I look at message passing from the perspective of the tasklet. A tasklet can be assigned a certain number of "in ports" and a certain number of "out ports." In this case, the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() end of a queue or stream.
Sorry, I don't really understand what you're trying to explain here. Maybe an example could be helpful? :)
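For instance, if by "ports" you mean channel ends handed to a tasklet when it is created, is it something like this? (A guess only; all the names here are made up.)

    import stackless

    def spawn_with_ports(func, n_in, n_out):
        # Made-up helper: create a tasklet with fixed "in" and "out"
        # ports (plain channels) assigned at creation time.
        in_ports = [stackless.channel() for _ in range(n_in)]
        out_ports = [stackless.channel() for _ in range(n_out)]
        stackless.tasklet(func)(in_ports, out_ports)
        return in_ports, out_ports

    def doubler(in_ports, out_ports):
        while True:
            value = in_ports[0].receive()  # read end of a "queue/stream"
            out_ports[0].send(value * 2)   # write end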
* Scheduler

For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler.
Couldn't all those ports (channels) be read one at a time, and the processing done afterwards? I don't really see the need to play with the scheduler: channels are blocking, so a tasklet will be unscheduled anyway when it tries to read from a channel in which no data is available.
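As a sketch (process() here is only a placeholder for whatever the tasklet actually computes):

    def process(inputs):
        return sum(inputs)  # placeholder for the real computation

    def worker(in_ports, out_port):
        while True:
            # Each receive() blocks until that port has data, so the body
            # only runs once every in-port has delivered a value -- no
            # custom scheduler hook is needed.
            inputs = [port.receive() for port in in_ports]
            out_port.send(process(inputs))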
------------

On another note, I am curious how you handled the issue of "nested" objects. Consider the send() and receive() that you use to pass objects around in your project. Am I correct that these objects cannot contain references outside of themselves? Also, how do you handle extracting an object out of the tree and making sure there are no references outside the object?
Right now, I did not really dig too far into this problem. With a local communication, a reference to the object is sent through a channel. The receiver tasklet will have the same access to the object and all its sub-objects as the sender tasklet.

For remote communications, pickling is involved. The object to send must be picklable; this excludes any I/O object unless the programmer creates a custom pickling protocol for it. A copy of the whole object tree will then be made. Sometimes that's good (small objects), sometimes it's bad (really complex or big objects, I/O objects, etc.).

This is why I added the concept of ref_object(), using PyPy's proxy object space. For such objects, a proxy can be made and only a reference object will be sent to the remote side. This object will have the same type as the original object, but all operations will be forwarded to the host node. All replies will also be wrapped by proxies when sent back to the remote reference object. The only case where a proxy object is not created is with atomic types (string, int, float, etc.); it's useless for those because they are immutable anyway, and remote access to them would only introduce useless latency. With ref_object(), the object tree always stays on the initial node. A move() operation will also be added to ref_object()s so they can be moved between interpreters if needed.
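To illustrate the idea, here is a usage sketch only (the exact ref_object()/move() API in my code may differ, and BigMutableTree and some_channel are stand-ins):

    # Node A owns a big mutable object and shares it by reference:
    tree = BigMutableTree()        # stand-in for any large mutable object
    ref = ref_object(tree)         # wrap via PyPy's proxy object space
    some_channel.send(ref)         # only the reference crosses to node B

    # On node B, the received object has the same type as `tree`, but
    # every operation is forwarded back to node A, where the tree lives:
    remote = some_channel.receive()
    remote.insert(42)              # executed on node A; reply proxied back

    # Atomic immutables (str, int, float, ...) are sent by value instead,
    # since proxying them would only add latency.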
For example, consider the following object, where "->" means it has a reference to that object
Object 1 -> Object 2
Object 2 -> Object 3
Object 2 -> Object 4
Object 4 -> Object 2
Now, let's say I have a tasklet like the following:
.... -> incoming data = pointer/reference to Object 1
1. read incoming data (get Object 1 reference)
2. remove Object 3
3. send Object 3 to tasklet B
4. send Object 1 to tasklet C
Result: tasklet B now has this object: pointer/reference to Object 1, which contains the following tree:

Object 1 -> Object 2
Object 2 -> Object 4
Object 4 -> Object 2

tasklet C now has this object: pointer/reference to Object 3, which contains the following tree:

Object 3
I think you swapped tasklet B and tasklet C for the end result! ;)
On the other hand, consider the following scenario:
1. read incoming data (get Object 1 reference)
2. remove Object 4

ERROR: this would not be possible, as it refers to Object 2
Why isn't it possible? By removing "Object 4" I guess you mean removing this link: Object 2 -> Object 4? This is the only way Object 4 could be removed.
Sorry for the late answer, I was unavailable in the last few days.
About send() and receive(), it depends on whether the communication is local or not. For a local communication, anything can be passed, since only the reference is sent; this is the base model for Stackless channels. For a remote communication (between two interpreters), any picklable object can be passed (a copy will then be made), including channels and tasklets (for which a reference will automatically be created).
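A quick way to see the local reference semantics (plain Stackless, runnable as-is):

    import stackless

    ch = stackless.channel()

    def sender():
        data = {"items": []}
        ch.send(data)             # local send: only the reference moves
        data["items"].append(1)   # this mutation is visible to the receiver

    def receiver():
        obj = ch.receive()
        stackless.schedule()      # let the sender run again
        print(obj["items"])       # [1] -- same object, no copy was made

    stackless.tasklet(receiver)()
    stackless.tasklet(sender)()
    stackless.run()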
The use of the PyPy proxy object space is to make remote communication more Stackless-like by passing objects by reference. If a ref_object is made, only a reference will be passed when a tasklet is moved or when the object is sent on a channel; the object always resides where it was created. A move() operation will also be implemented on those objects so they can be moved around like tasklets.
I hope it helps,
Gabriel
2010/7/29 Kevin Ar18 <kevinar18@hotmail.com>:
Hello Kevin,

I don't know if it can be a solution to your problem, but for my Master's thesis I'm working on making Stackless Python distributed. What I did is working but not complete, and I'm right now in the process of writing the thesis (in French, unfortunately). My code currently works with PyPy's "stackless" module only and uses some PyPy-specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter.
- Each tasklet has its own global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way.
- Distributed channels: all requests are known by all nodes using the channel.
- Distributed objects: when a reference is sent to a remote node, the object is not copied; a reference is created using PyPy's proxy object space.
- Automated dependency recovery when an object or a tasklet is loaded on another interpreter.
With a proper scheduler, many tasklets could automatically be spread across multiple interpreters to use multiple cores, or across multiple computers. A bit like the N:M threading model, where N lightweight threads/coroutines are executed on M threads.
Was able to have a look at the API... If others don't mind my asking this on the mailing list:
* .send() and .receive()

What type of data can you send and receive between the tasklets? Can you pass entire Python objects?
* .send() and .receive() memory model

When you send data between tasklets (pass messages, or whatever you want to call it), how is this implemented under the hood? Does it use shared memory, or does it involve a more costly copying of the data? I realize that if it is on another machine you have to copy the data, but what about between two threads? You mentioned PyPy's proxy object... guess I'll need to read up on that.
--
Gabriel Lavoie
glavoie@gmail.com
By the way, if you come to #pypy on FreeNode, I'm WildChild! I'm always there, though not always available. I'm in the EST timezone (UTC-5).

See ya,

Gabriel

--
Gabriel Lavoie
glavoie@gmail.com