Hi all --
I am interested in making some serious ongoing contributions around
multiprocessing.
My inspiration, first and foremost, comes from the current documentation
for multiprocessing. There is great material there but I believe it is
being presented in a way that hinders adoption and understanding. I've
taken some initial baby-steps to propose specific changes:
http://bugs.python.org/issue22952
http://bugs.python.org/issue23100
The first, issue22952, can reasonably be tackled with a patch like I've
submitted. Continuing with patches for issue23100 can also be made to
work. I realize that reviewing such patches takes non-trivial time from
volunteers yet I'm interested in submitting a series of patches to
hopefully make the documentation for multiprocessing much more consistent
with other module docs and much more accessible to end users. I don't want
to simply create more work for other volunteers -- I'd like to volunteer to
reduce / share some of their work as well.
Beyond the documentation, there is currently a backlog of 186 issues
mentioning multiprocessing, some with patches on offer, some without. I'd
like to volunteer my time reviewing and triaging these issues. Hopefully
you can already get a sense of my voice on issues from what I wrote in
those two issues above.
Rather than me simply walking through that backlog, offering comments or
encouragement here and there on issues, it makes more sense for me to ask:
what is the right way for me to proceed? What is the next step towards me
helping triage issues? Is there a bridge-keeper with at least three, no
more than five questions for me?
Thanks,
Davin
Newbie first post on this list; apologies if what follows is out of context ...
Hi all,
I'm struggling with the issue in the subject line. I've read different threads
and issue http://bugs.python.org/issue15443, which was opened in 2012 and is
still open as of today.
Isn't there a legitimate case for nanosecond support? It's all over the place
in 'struct timespec', and, maybe wrongly, I have always found Python and C to
be good neighbors. That's the notional aspect.
More practically, aren't we close enough with current hardware, PTP and the
like, that this deserves more consideration?
Maybe this has been mentioned before, but the limiting factor isn't just
getting nanoseconds: anything sub-microsecond won't work with the current
format. OPC UA, which I was looking at just now, has tenth-of-a-microsecond
resolution, so it really cares about 100 ns, but datetime's 1 µs simply
won't cut it.
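As a minimal illustration of the limitation (the 0.1 µs step below just stands
in for the OPC UA tick):

from datetime import datetime, timedelta

t0 = datetime(2015, 1, 1)
step = timedelta(microseconds=0.1)   # intended: one 100 ns tick

print(step)             # 0:00:00 -- rounded away; timedelta only carries whole microseconds
print(t0 + step == t0)  # True: the sub-microsecond step is silently lost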
Regards,
Matthieu
This is a bit long, as I wrote it as if it were a blog post to try to give
background info on my thinking, etc. The TL;DR folks should start at the
"Ideal Scenario" section and read to the end.
P.S.: This is in Markdown and I have put it up at
https://gist.github.com/brettcannon/a9c9a5989dc383ed73b4 if you want a
nicer formatted version for reading.
# History lesson
Since I signed up for the python-dev mailing list way back in June 2002,
there seems to be a cycle where we as a group come to a realization that
our current software development process has not kept up with modern
practices and could stand an update. For me this was first shown when
we moved from SourceForge to our own infrastructure, then again when we
moved from Subversion to Mercurial (I led both of these initiatives, so
it's somewhat of a tradition/curse that I find myself in this position yet again).
And so we again find ourselves at the point of realizing that we are not
keeping up with current practices and thus need to evaluate how we can
improve our situation.
# Where we are now
Now it should be realized that we have two sets of users of our development
process: contributors and core developers (the latter of whom can play both
roles). If you take a rough outline of our current, recommended process, it
goes something like this:
1. Contributor clones a repository from hg.python.org
2. Contributor makes desired changes
3. Contributor generates a patch
4. Contributor creates account on bugs.python.org and signs the
[contributor agreement](https://www.python.org/psf/contrib/contrib-form/)
5. Contributor creates an issue on bugs.python.org (if one does not already
exist) and uploads a patch
6. Core developer evaluates patch, possibly leaving comments through our
[custom version of Rietveld](http://bugs.python.org/review/)
7. Contributor revises patch based on feedback and uploads new patch
8. Core developer downloads patch and applies it to a clean clone
9. Core developer runs the tests
10. Core developer does one last `hg pull -u` and then commits the changes
to various branches
I think we can all agree it works to some extent, but isn't exactly smooth.
There are multiple steps in there -- in full or in part -- that can be
automated. There is room to improve everyone's lives.
And we can't forget the people who help keep all of this running as well.
There are those that manage the SSH keys, the issue tracker, the review
tool, hg.python.org, and the email system that lets us know when stuff
happens on any of these other systems. The impact on them needs to also be
considered.
## Contributors
I see two scenarios for contributors to optimize for. There are the simple
spelling-mistake patches and then there are the code-change patches. The
former is the kind of thing that you can do in a browser without much
effort and should be a no-brainer commit/reject decision for a core
developer. This is what the GitHub/Bitbucket camps have been promoting
their solutions for, while leaving the cpython repo alone.
Unfortunately the bulk of our documentation is in the Doc/ directory of
cpython. While it's nice to think about moving the devguide, peps, and even
breaking out the tutorial to repos hosted on Bitbucket/GitHub, everything
else is in Doc/ (language reference, howtos, stdlib, C API, etc.). So
unless we want to completely break all of Doc/ out of the cpython repo and
have core developers willing to edit two separate repos when making changes
that impact code **and** docs, moving only a subset of docs feels like a
band-aid solution that ignores the big elephant in the room: the
cpython repo, which the bulk of patches target.
For the code change patches, contributors need an easy way to get a hold of
the code and get their changes to the core developers. After that it's
things like letting contributors know that their patch doesn't apply
cleanly, doesn't pass tests, etc. As of right now getting the patch into
the issue tracker is a bit manual but nothing crazy. The real issue in this
scenario is core developer response time.
## Core developers
There is a finite amount of time that core developers get to contribute to
Python and it fluctuates greatly. This means that if a process can be found
which allows core developers to spend less time doing mechanical work and
more time doing things that can't be automated -- namely code reviews --
then the throughput of patches being accepted/rejected will increase. This
also impacts any increased patch submission rate that comes from improving
the situation for contributors because if the throughput doesn't change
then there will simply be more patches sitting in the issue tracker and
that doesn't benefit anyone.
# My ideal scenario
If I had an infinite amount of resources (money, volunteers, time, etc.),
this would be my ideal scenario:
1. Contributor gets code from wherever; easiest to just say "fork on GitHub
or Bitbucket" as they would be official mirrors of hg.python.org and are
updated after every commit, but could clone hg.python.org/cpython if they
wanted
2. Contributor makes edits; if they cloned on Bitbucket or GitHub then they
have browser edit access already
3. Contributor creates an account at bugs.python.org and signs the CLA
4. The contributor creates an issue at bugs.python.org (probably the one
piece of infrastructure we all agree is better than the other options,
although its workflow could use an update)
5. If the contributor used Bitbucket or GitHub, they send a pull request
with the issue # in the PR message
6. bugs.python.org notices the PR, grabs a patch for it, and puts it on
bugs.python.org for code review (a rough sketch of this glue follows the list)
7. CI runs on the patch based on what Python versions are specified in the
issue tracker, letting everyone know if it applied cleanly, passed tests on
the OSs that would be affected, and also got a test coverage report
8. Core developer does a code review
9. Contributor updates their code based on the code review, the updated
patch gets pulled by bugs.python.org automatically, and CI runs again
10. Once the patch is acceptable and assuming the patch applies cleanly to
all versions to commit to, the core developer clicks a "Commit" button,
fills in a commit message and NEWS entry, and everything gets committed (if
the patch can't apply cleanly then the core developer does it the
old-fashioned way, or maybe auto-generates a new PR which can be manually
touched up so it does apply cleanly?)
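To make step 6 slightly more concrete, here is a rough, hypothetical sketch of
the PR-to-tracker glue. Fetching a pull request's patch through its `.patch`
URL is an existing GitHub feature; everything on the tracker side
(`upload_to_tracker`, the issue-number convention in the PR message) is made up
purely for illustration and would still have to be designed and written.

```python
import re
import urllib.request

def fetch_pr_patch(repo, pr_number):
    """Download the patch GitHub publishes for a pull request."""
    url = "https://github.com/{}/pull/{}.patch".format(repo, pr_number)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def issue_from_message(message):
    """Find an issue number like '#23100' or 'issue 23100' in the PR message."""
    match = re.search(r"(?:#|issue\s*)(\d+)", message, re.IGNORECASE)
    return int(match.group(1)) if match else None

def upload_to_tracker(issue_number, patch_text):
    """Placeholder: attach the patch to the tracker issue and trigger review/CI."""
    raise NotImplementedError("the bugs.python.org side of this does not exist yet")

def handle_pull_request(repo, pr_number, pr_message):
    issue = issue_from_message(pr_message)
    if issue is None:
        return  # no issue referenced; a human would have to follow up
    upload_to_tracker(issue, fetch_pr_patch(repo, pr_number))
```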
Basically the ideal scenario lets contributors use whatever tools and
platforms that they want and provides as much automated support as possible
to make sure their code is tip-top before and during code review while core
developers can review and commit patches so easily that they can do their
job from a beach with a tablet and some WiFi.
## Where the current proposed solutions seem to fall short
### GitHub/Bitbucket
Basically GitHub/Bitbucket is a win for contributors but doesn't buy core
developers that much. GitHub/Bitbucket gives contributors the easy cloning,
drive-by patches, CI, and PRs. Core developers get a code review tool --
I'm counting Rietveld as deprecated after Guido's comments about the code's
maintenance issues -- and push-button commits **only for single branch
changes**. But for any patch that crosses branches we don't really gain
anything. At best core developers tell a contributor "please send your PR
against 3.4", push-button merge it, update a local clone, merge from 3.4 to
default, do the usual stuff, commit, and then push; that still keeps me off
the beach, though, so that doesn't get us the whole way. You could force
people to submit two PRs, but I don't see that flying. Maybe some tool
could be written that automatically handles the merge/commit across
branches once the initial PR is in? Or automatically create a PR that core
developers can touch up as necessary and then accept that as well?
Regardless, some solution is necessary to handle branch-crossing PRs.
As for GitHub vs. Bitbucket, I personally don't care. I like GitHub's
interface more, but that's personal taste. I like hg more than git, but
that's also personal taste (and I consider a transition from hg to git a
hassle but not a deal-breaker but also not a win). It is unfortunate,
though, that under this scenario we would have to choose only one platform.
It's also unfortunate both are closed-source, but that's not a
deal-breaker, just a knock against them if the decision is close.
### Our own infrastructure
The shortcoming here is the need for developers, developers, developers!
Everything outlined in the ideal scenario is totally doable on our own
infrastructure with enough code and time (donated/paid-for infrastructure
shouldn't be an issue). But historically that code and time has not
materialized. Our code review tool is a fork that probably should be
replaced as only Martin von Löwis can maintain it. Basically Ezio Melotti
maintains the issue tracker's code. We don't exactly have a ton of people
constantly going "I'm so bored because everything for Python's development
infrastructure gets sorted so quickly!" A perfect example is that R. David
Murray came up with a nice update for our workflow after PyCon but then ran
out of time after mostly defining it and nothing ever became of it (maybe
we can rectify that at PyCon?). Eric Snow has pointed out how he has
written similar code for pulling PRs from I think GitHub to another code
review tool, but that doesn't magically make it work in our infrastructure
or get someone to write it and help maintain it (no offense, Eric).
IOW our infrastructure can do anything, but it can't run on hopes and
dreams. Commitments from many people to making this happen by a certain
deadline will be needed so as to not allow it to drag on forever. People
would also have to commit to continued maintenance to make this viable
long-term.
# Next steps
I'm thinking first draft PEPs by February 1 to know who's all-in (8 weeks
away), all details worked out in final PEPs and whatever is required to
prove to me it will work by the PyCon language summit (4 months away). I
make a decision by May 1, and
then implementation aims to be done by the time 3.5.0 is cut so we can
switch over shortly thereafter (9 months away). Sound like a reasonable
timeline?
Greetings.
I'm sorry if I'm too insistent, but it's not truly rewarding to
constantly improve a patch that no one appears to need. Again, I
understand people are busy working and/or reviewing critical patches,
but 2 months of inactivity is not right. Yes, I posted a message
yesterday, but no one seemed to be bothered. In any case, I'll respect
your decision about this patch and will never ask for a review of this
patch again.
Regards, Dmitry.
Help with finding tutors for Python, Linux, R, Perl, Octave, MATLAB and/or
Cytoscape for yeast microarray analysis, next generation sequencing and
constructing gene interaction networks
Hi
I am a visually impaired bioinformatics graduate student using microarray
data for my master’s thesis aimed at deciphering the mechanism by which the
yeast wild type can suppress the rise of free reactive oxygen species (ROS)
induced by caloric restriction (CR) but the Atg15 and Erg6 knockout mutant
cannot.
Since my remaining vision is very limited I need very high magnification.
But that makes my visual field very small. Therefore I need somebody to
teach me how to use these programming environments, especially for
microarray analysis, next generation sequencing and constructing gene and
pathway interaction networks. This is very difficult for me to figure out
without assistance because Zoomtext, my magnification and text to speech
software, on which I am depending because I am almost blind, has problems
reading many programming-related websites aloud to me. And even those
websites it can read, it can only read sequentially from left to right and
then from top to bottom. Unfortunately, this way of acquiring, finding,
selecting and processing new information and answering questions is too
tiresome, exhausting, ineffective and especially way too time consuming for
graduating with a PhD in bioinformatics before my funding runs out despite
being severely limited by my visual disability. I would also need help
with writing a good literature review and applying the described techniques
to my own yeast Affymetrix microarray dataset because I cannot see well
enough to find all relevant publications on my own.
Some examples for specific tasks I urgently need help with are:
1. Analyzing and comparing the three publicly available microarray
datasets that can be accessed at:
A. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41860
B. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE38635
C. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9217
2. Learning how to use the Affymetrix microarray analysis software for
the Yeast 2 chip, which can be found at
http://www.affymetrix.com/support/technical/libraryfilesmain.affx
3. For Cytoscape I need somebody who can teach me how to execute the
tutorials at the following links, because due to my very limited visual
field I cannot see the tutorial and the program interface simultaneously.
A.
http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Introduction_to_Cytosc…
B.
http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Filtering_and_Editing_…
C.
http://cytoscape.org/manual/Cytoscape2_8Manual.html#Import%20Fixed-Format%2…
D. http://wiki.cytoscape.org/Cytoscape_User_Manual/Network_Formats
4. Learning how to use the TopGo R package to perform statistical
analysis on GO enrichments.
Since I am legally blind the rehab agency is giving me money to pay tutors
for this purpose. Could you please help me get in touch regarding this
with anybody, who could potentially be interested in teaching me one on one
thus saving me time for acquiring new information and skills, which I need
to finish my thesis on time, so that I can remain eligible for funding to
continue in my bioinformatics PhD program despite being almost blind? The
tutoring can be done remotely via TeamViewer 5 and Skype. Hence, it does
not matter where my tutors are physically located. Currently I have tutors
in Croatia and UK. But since they both work full time jobs while working
on their PhD dissertation they only have very limited time to teach me
online. Could you therefore please forward this request for help to
anybody, who could potentially be interested or, who could connect me to
somebody, who might be, because my graduation and career depend on it? Who
else would you recommend I contact regarding this? Where else could I
post this, since I am in urgent need of help?
Could you please contact me directly via email at Thomas.F.Hahn2(a)gmail.com
and/or Skype at tfh002, because my text-to-speech software has problems
reading this website aloud to me?
I thank you very much in advance for your thoughts, ideas, suggestions,
recommendations, time, help, efforts and support.
With very warm regards,
Thomas Hahn
1) Graduate student in the Joint Bioinformatics Program at the
University of Arkansas at Little Rock (UALR) and the University of Arkansas
Medical Sciences (UAMS) &
2) Research & Industry Advocate, Founder and Board Member of RADISH
MEDICAL SOLUTIONS, INC. (http://www.radishmedical.com/thomas-hahn/)
Primary email: Thomas.F.Hahn2(a)gmail.com
Cell phone: 318 243 3940
Office phone: 501 682 1440
Office location: EIT 535
Skype ID: tfh002
Virtual Google Voice phone to reach me while logged into my email (i.e.
Thomas.F.Hahn2(a)gmail.com), even when having no cell phone reception,
e.g. in big massive buildings: (501) 301-4890
Web links:
1) https://ualr.academia.edu/ThomasHahn
2) https://www.linkedin.com/pub/thomas-hahn/42/b29/42
3) http://facebook.com/Thomas.F.Hahn
4) https://twitter.com/Thomas_F_Hahn
The current memory layout for dictionaries is
unnecessarily inefficient. It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is currently stored as:
entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]
Instead, the data should be organized as follows:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
Only the data layout needs to change. The hash table
algorithms would stay the same. All of the current
optimizations would be kept, including key-sharing
dicts and custom lookup functions for string-only
dicts. There is no change to the hash functions, the
table search order, or collision statistics.
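To make that concrete, here is a rough pure-Python sketch of a lookup against
the two-table layout above. The probing is simplified to linear probing purely
for illustration, and deleted slots (which would need a dummy marker) are
ignored; the actual perturbation-based probe sequence and collision behavior
are unchanged by this proposal.

FREE = None   # marker for an unused slot in the sparse index table

def lookup(indices, entries, key):
    # indices: sparse table, probed exactly like today's table
    # entries: dense list of [hash, key, value] triples
    mask = len(indices) - 1
    h = hash(key)
    i = h & mask
    while True:
        idx = indices[i]
        if idx is FREE:
            raise KeyError(key)
        entry_hash, entry_key, value = entries[idx]
        if entry_hash == h and entry_key == key:
            return value
        i = (i + 1) & mask   # simplified probing for illustration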
The memory savings are significant (from 30% to 95%
compression depending on how full the table is).
Small dicts (size 0, 1, or 2) get the most benefit.
For a sparse table of size t with n entries, the sizes are:
curr_size = 24 * t
new_size = 24 * n + sizeof(index) * t
In the above timmy/barry/guido example, the current
size is 192 bytes (eight 24-byte entries) and the new
size is 80 bytes (three 24-byte entries plus eight
1-byte indices). That gives 58% compression.
Note, sizeof(index) can be as small as a single
byte for small dicts, two bytes for bigger dicts, and
up to sizeof(Py_ssize_t) for huge dicts.
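Here is a tiny calculation following the formulas above. The width cutoffs are
only an assumption about how an implementation might pick the index size; the
exact thresholds are not part of this proposal.

def index_width(table_size):
    # assumed cutoffs, for illustration only
    if table_size <= 0xFF:
        return 1
    if table_size <= 0xFFFF:
        return 2
    if table_size <= 0xFFFFFFFF:
        return 4
    return 8   # sizeof(Py_ssize_t) on a 64-bit build

def current_size(t):
    return 24 * t

def new_size(n, t):
    return 24 * n + index_width(t) * t

print(current_size(8))   # 192 bytes for the timmy/barry/guido example
print(new_size(3, 8))    # 80 bytes, about 58% smaller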
In addition to space savings, the new memory layout
makes iteration faster. Currently, keys(), values(), and
items() loop over the sparse table, skipping over free
slots in the hash table. Now, keys/values/items can
loop directly over the dense table, using fewer memory
accesses.
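Sketched out, iteration becomes a plain walk over the dense list with no
free-slot checks:

def iter_items(entries):
    # every slot in the dense table holds a live [hash, key, value] triple
    for entry_hash, key, value in entries:
        yield key, value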
Another benefit is that resizing is faster and
touches fewer pieces of memory. Currently, every
hash/key/value entry is moved or copied during a
resize. In the new layout, only the indices are
updated. For the most part, the hash/key/value entries
never move (except for an occasional swap to fill a
hole left by a deletion).
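A sketch of that resize, again with simplified probing: the dense entries
list is left untouched and only the sparse index table is rebuilt.

def rebuild_indices(entries, new_table_size):
    indices = [None] * new_table_size
    mask = new_table_size - 1
    for dense_pos, (entry_hash, key, value) in enumerate(entries):
        i = entry_hash & mask
        while indices[i] is not None:   # simplified probing for illustration
            i = (i + 1) & mask
        indices[i] = dense_pos
    return indices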
With the reduced memory footprint, we can also expect
better cache utilization.
For those wanting to experiment with the design,
there is a pure Python proof-of-concept here:
http://code.activestate.com/recipes/578375
YMMV: Keep in mind that the above size statistics assume a
build with 64-bit Py_ssize_t and 64-bit pointers. The
space savings percentages are a bit different on other
builds. Also, note that in many applications, the size
of the data dominates the size of the container (i.e.
the weight of a bucket of water is mostly the water,
not the bucket).
Raymond