10ᵀᴴ Advanced Scientific Programming in Python
a Summer School by the G-Node and the Municipality of Sithonia
Scientists spend more and more time writing, maintaining, and debugging software. While techniques for doing this efficiently have evolved, only a few scientists have been trained to use them. As a result, instead of doing their research, they spend far too much time writing deficient code and reinventing the wheel. In this course we will present a selection of advanced programming techniques and best practices that are standard in industry, but especially tailored to the needs of a programming scientist. Lectures are designed to be interactive and to give the students enough time to acquire direct hands-on experience with the material. Students will work in pairs throughout the school and will team up to practice the newly learned skills in a real programming project: an entertaining computer game.
We use the Python programming language for the entire course. Python works as a simple programming language for beginners, but more importantly, it also works great in scientific simulations and data analysis. We show how clean language design, ease of extensibility, and the great wealth of open source libraries for scientific computing and data visualization are driving Python to become a standard tool for the programming scientist.
This school is targeted at Master or PhD students and Post-docs from all areas of science. Competence in Python or in another language such as Java, C/C++, MATLAB, or Mathematica is absolutely required. Basic knowledge of Python and of a version control system such as git, subversion, mercurial, or bazaar is assumed. Participants without any prior experience with Python and/or git should work through the proposed introductory material before the course.
We are striving hard to get a pool of students that is international and gender-balanced.
You can apply online: https://python.g-node.org
Application deadline: 23:59 UTC, May 31, 2017. There will be no deadline extension, so be sure to apply on time ;-)
Be sure to read the FAQ before applying.
Participation is free of charge, i.e. no fee is charged! Participants, however, should cover their own travel, living, and accommodation expenses.
Date & Location
August 28 to September 2, 2017. Nikiti, Sithonia, Halkidiki, Greece
→ Best Programming Practices
• Best practices for scientific programming
• Version control with git and how to contribute to open source projects with GitHub
• Best practices in data visualization
→ Software Carpentry
• Test-driven development
• Debugging with a debugger
• Profiling code
→ Scientific Tools for Python
• Advanced NumPy
→ Advanced Python
• Context managers
→ The Quest for Speed
• Writing parallel applications
• Interfacing to C with Cython
• Memory-bound problems and memory profiling
• Data containers: storage and fast access to large data
→ Practical Software Development
• Group project
• Francesc Alted, freelance consultant, author of Blosc, Castelló de la Plana, Spain
• Pietro Berkes, NAGRA Kudelski, Lausanne, Switzerland
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, VA USA
• Eilif Muller, Blue Brain Project, École Polytechnique Fédérale de Lausanne, Switzerland
• Juan Nunez-Iglesias, Victorian Life Sciences Computation Initiative, University of Melbourne, Australia
• Rike-Benjamin Schuppner, Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Germany
• Nicolas P. Rougier, Inria Bordeaux Sud-Ouest, Institute of Neurodegenerative Disease, University of Bordeaux, France
• Bartosz Teleńczuk, European Institute for Theoretical Neuroscience, CNRS, Paris, France
• Stéfan van der Walt, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Nelle Varoquaux, Berkeley Institute for Data Science, UC Berkeley, CA USA
• Tiziano Zito, freelance consultant, Berlin, Germany
For the German Neuroinformatics Node of the INCF (G-Node) Germany:
• Tiziano Zito, freelance consultant, Berlin, Germany
• Zbigniew Jędrzejewski-Szmek, Krasnow Institute, George Mason University, Fairfax, USA
• Jakob Jordan, Institute of Neuroscience and Medicine (INM-6), Forschungszentrum Jülich GmbH, Germany
• Etienne Roesch, Centre for Integrative Neuroscience and Neurodynamics, University of Reading, UK
I just re-read the "UTF-8 Everywhere" manifesto, and it helped me clarify my thoughts:
1) most of it is focused on utf-8 vs utf-16. And that is a strong argument
-- utf-16 is the worst of both worlds.
2) it isn't really addressing how to deal with fixed-size string storage as
needed by numpy.
It does bring up Python's current approach to Unicode:
This lead to software design decisions such as Python’s string O(1) code
point access. The truth, however, is that Unicode is inherently more
complicated and there is no universal definition of such thing as *Unicode
character*. We see no particular reason to favor Unicode code points over
Unicode grapheme clusters, code units or perhaps even words in a language
My thoughts on that-- it's technically correct, but practicality beats
purity, and the character concept is pretty darn useful for at least some
(commonly used in the computing world) languages.
In any case, whether the top-level API is character focused doesn't really
have a bearing on the internal encoding, which is very much an
implementation detail in py 3 at least.
And Python has made its decision about that.
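(A quick way to see that implementation detail in action: CPython 3.3+ picks the
narrowest internal representation it can for each string. The exact byte counts
printed below vary by interpreter version, so treat this only as an illustration.)

    import sys

    for s in ("a" * 10,              # ASCII text: 1 byte per code point internally
              "\u00e9" * 10,         # latin-1 range: still 1 byte per code point
              "\u03a9" * 10,         # BMP character (Omega): 2 bytes per code point
              "\U0001F40D" * 10):    # outside the BMP: 4 bytes per code point
        print(len(s), sys.getsizeof(s))   # same length, very different memory use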
So what are the numpy use-cases?
I see essentially two:
1) Use with/from Python -- both creating and working with numpy arrays.
In this case, we want something compatible with Python's string (i.e. with
full Unicode support), and I think it should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).
However, there is a challenge here: numpy requires fixed-number-of-bytes
dtypes. And full Unicode support with a fixed number of bytes matching a
fixed number of characters is only possible with UCS-4 -- hence the current
implementation. And this is actually just fine! I know we all want to be
efficient with data storage, but really -- in the early days of Unicode,
when folks thought 16 bits were enough, doubling the memory usage for
western-language storage was considered fine -- how long in computer
lifetime does it take to double your memory? But now, when memory, disk
space, bandwidth, etc. are all literally orders of magnitude larger, we
can't handle a factor of 4 increase in "wasted" space?
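(To put a number on that factor of 4, using nothing beyond the dtypes numpy
already has:)

    import numpy as np

    np.dtype('S10').itemsize   # 10 bytes -- one byte per "character"
    np.dtype('U10').itemsize   # 40 bytes -- UCS-4, four bytes per character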
Alternatively, Robert's suggestion of having essentially an object array,
where the objects were known to be python strings is a pretty nice idea --
it gives the full power of python strings, and is a perfect one-to-one
match with the python text data model.
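(That option already works today as a plain object array, with the usual
object-array trade-offs in speed and memory locality:)

    import numpy as np

    a = np.array(["hello", "\u03a9mega"], dtype=object)
    a[0] = "a much longer string than before"   # no fixed width, no truncation
    type(a[1])                                  # <class 'str'> -- full Python strings throughout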
But as scientific text data often is 1-byte compatible, a one-byte-per-char
dtype is a fine idea, too -- and we pretty much have that already with the
existing string type -- that could simply be enhanced by enforcing the
encoding to be latin-9 (or latin-1, if you don't want the Euro symbol).
This would get us what scientists expect from strings in a way that is
properly compatible with Python's string type. You'd get encoding errors if
you tried to stuff anything else in there, and that's that.
Yes, it would have to be a "new" dtype for backwards compatibility.
2) Interchange with other systems: passing the raw binary data back and
forth between numpy arrays and other code written in C, Fortran, etc., or
stored in binary file formats.
This is a key use-case for numpy -- I think the key to its enormous
success. But how important is it for text? Certainly any data set I've ever
worked with has had gobs of binary numerical data, and a small smattering
of text. So in that case, if, for instance, h5py had to encode/decode text
when transferring between HDF files and numpy arrays, I don't think I'd
ever see the performance hit. As for code complexity -- it would mean more
complex code in interface libs, and less complex code in numpy itself.
(though numpy could provide utilities to make it easy to write the
interface code.)
If we do want to support direct binary interchange with other libs, then we
should probably simply go for it, and support any encoding that Python
supports -- as long as you are dealing with multiple encodings, why try to
decide up front which ones to support?
But how do we expose this to numpy users? I still don't like having
non-fixed-width encoding under the hood, but what can you do? Other than
that, having the encoding be a selectable part of the dtype works fine --
and in that case the number of bytes should be the "length" specifier.
This, however, creates a bit of an impedance mismatch with the
"character-focused" approach of the python string type. And it requires
the user to understand something about the encoding in order to even know
how many bytes they need -- a utf-8-100 string will hold a different
"length" of string than a utf-16-100 string.
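(The "utf-8-100" and "utf-16-100" spellings above are only hypothetical dtype
names; the underlying point is easy to demonstrate with plain Python:)

    # How many characters fit in a 100-byte buffer depends on both the
    # encoding and the text itself:
    ascii_text = "a" * 100
    greek_text = "\u03a9" * 50           # 50 Greek capital Omegas

    len(ascii_text.encode("utf-8"))      # 100 bytes for 100 characters
    len(greek_text.encode("utf-8"))      # 100 bytes for only 50 characters
    len(ascii_text.encode("utf-16-le"))  # 200 bytes -- the same 100 characters
                                         # no longer fit in 100 bytes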
So -- I think we should address the use-cases separately -- one for
"normal" python use and simple interoperability with python strings, and
one for interoperability at the binary level. And an easy way to convert
between the two.
For Python use -- a pointer to a Python string would be nice.
Then use a native flexible-encoding dtype for everything else.
Thinking out loud -- another option would be to set defaults for the
multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
with the python string type -- and make folks make an effort to get
anything else.
One more note: if a user tries to assign a value to a numpy string array
that doesn't fit, they should get an error:
EncodingError if it can't be encoded into the defined encoding.
ValueError if it is too long -- it should not be silently truncated.
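(For contrast, here is what happens today with the existing fixed-width dtype,
plus what an encoding failure looks like at the plain-Python level;
"EncodingError" above is a proposed name, not an existing numpy exception:)

    import numpy as np

    a = np.zeros(1, dtype='U5')
    a[0] = "hello world"
    print(a[0])                 # 'hello' -- silently truncated today; the
                                # proposal is to raise ValueError instead

    "\u03a9mega".encode("latin-1")   # raises UnicodeEncodeError -- the kind of
                                     # failure a latin-1 dtype would surface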
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R, Seattle, WA
As you probably know, numpy does not deal well with strings in Python 3.
The np.string type is actually zero-terminated bytes and not a string.
In Python 2 this happened to work out, as it treats bytes and strings the
same way. But in Python 3 this type is pretty hard to work with, as each
time you get an item from a numpy bytes array it needs decoding to give
you back a string.
The only string type available in Python 3 is np.unicode, which uses the
4-byte UTF-32 encoding and is deemed to use too much memory to actually
see much use.
What people apparently want is a string type for Python 3 which uses less
memory for the common science use case, which rarely needs more than
ASCII or latin1.
As we have been told, we cannot change the np.string type to actually be
strings, as existing programs do interpret its content as bytes despite
this being very broken due to its null-terminating property (it will
ignore all trailing nulls).
Also, 8 years of third parties working around numpy's poor Python 3
string-support decisions probably make the 'return bytes' behaviour
impossible to change now.
So we need a new dtype that can represent strings in numpy arrays and is
smaller than the existing 4-byte UTF-32.
To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units via metadata.
E.g.: 'U10[latin1]' is 10 characters in latin1 encoding
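(The bracketed-metadata spelling would mirror what datetime64 already does with
its time units; 'U10[latin1]' itself is only the proposed syntax and does not
exist yet:)

    import numpy as np

    np.dtype('datetime64[ms]')   # existing precedent: the unit lives in the
    np.dtype('datetime64[ns]')   # dtype metadata

    # proposed analogue (not implemented): the encoding as string-dtype metadata
    # np.dtype('U10[latin1]')    -> 10 characters, 1 byte each
    # np.dtype('U10[utf-32]')    -> 10 characters, 4 bytes each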
Encodings we should support are:
- latin1 (1 byte):
it is compatible with ascii and adds the extra characters used in the
western world
- utf-32 (4 bytes):
can represent every character, equivalent to np.unicode
Encodings we should maybe support:
- utf-16 with explicitly disallowed surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation
- utf-8 (4 bytes):
variable-length encoding with a minimum size of 1 byte, but we would need
to assume the worst case of 4 bytes, so it would not save anything
compared to utf-32; it may, however, allow third parties to replace an
encoding step with trailing-null trimming on serialization.
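(A rough illustration of the per-encoding sizes for the same short text; the
sample word is arbitrary:)

    sample = "\u00c5ngstr\u00f6m"    # "Ångström": 8 characters, two outside ASCII

    for enc in ("latin-1", "utf-32-le", "utf-16-le", "utf-8"):
        print(enc, len(sample.encode(enc)))
    # latin-1     8  (1 byte/char, only for latin-1 text)
    # utf-32-le  32  (4 bytes/char, always)
    # utf-16-le  16  (2 bytes/char without surrogate pairs)
    # utf-8      10  (1-4 bytes/char; fixed-width storage must assume the worst)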
To actually do this we have two options, both of which break our ABI when
done without ugly hacks.
- Add a new dtype, e.g. np.realstring
By not modifying an existing type, the only things that break are programs
using the deprecated NPY_CHAR; the most notable case of this is f2py.
It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete, and it is more busywork to implement.
- Modify np.unicode to have encoding metadata
This allows us to reuse all the type boilerplate, so it is more convenient
to implement, and by extending an existing type instead of making one
obsolete it results in a much nicer API.
The big drawback is that it will explicitly break any third party that
receives an array with a new encoding and assumes that the buffer of an
array of type np.unicode will have a character itemsize of 4 bytes.
To ease this problem we would need to add APIs to numpy now for getting
the itemsize and encoding, so third parties can error out cleanly.
The implementation of it is not that big a deal, I have already created
a prototype for adding latin1 metadata to np.unicode which works quite
well. It is imo realistic to get this into 1.14 should we be able to
make a decision on which way to implement it.
Do you have comments on how to go forward, in particular with regard to a
new dtype vs. modifying np.unicode?
Currently numpy master has a bogus stride that will cause an error when
downstream projects misuse it. That is done in order to help smoke out
errors. Previously that bogus stride has been fixed up for releases, but
that requires a special patch to be applied after each version branch is
made. At this point I'd like to pick one or the other option and make the
development and release branches the same in this regard. The question is:
which option to choose? Keeping the fixup in master will remove some code
and keep things simple, while not fixing up the release will possibly lead
to more folks finding errors. At this point in time I am favoring applying
the fixup in master.
It may be early to discuss dropping support for Python 2.7, but there is a
disturbance in the force that suggests that it might be worth looking
forward to the year 2020 when Python itself will drop support for 2.7.
There is also a website, http://www.python3statement.org, where several
projects in the scientific python stack have pledged to be Python 2.7 free
by that date. Given that, a preliminary discussion of the subject might be
interesting, if only to gather information on where the community
currently stands.
We need to deprecate the NPY_CHAR typenumber in order to enable us to add
new core dtypes without adding ugly hacks to our ABI.
Technically the typenumber was deprecated way back in 1.6 when it
accidentally broke our ABI. But due to lack of time f2py never got
updated to actually follow through.
In order to unblock our dtype development cleanly we want to finally do
the deprecation properly.
As nobody really knows how f2py works and there are no existing unit
tests covering the char dtype, the change is very likely to break
something.
The change is available here:
It attempts to map the NPY_CHAR dtype to the equivalent NPY_STRING with
itemsize 1. I have only been able to come up with a test that covers one
of the changed places.
So if you have an f2py use case that in some way involves passing arrays
of strings back and forth between python and fortran, please test that
branch or post a reproducible example here.
I would like to try to reach a consensus about a long-standing
inconsistency in the behavior of reduceat(), reported and discussed here
In summary, it seems an elegant and logical design choice, which all
users would expect, for
out = ufunc.reduceat(a, indices)
to produce, for all indices j (except for the last one), the following:
out[j] = ufunc.reduce(a[indices[j]:indices[j+1]])
However, the current documented and actual behavior is, for the case
indices[j] >= indices[j+1]
to return simply
out[j] = a[indices[j]]
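A concrete example of the two behaviors (the values here are mine, just for
illustration):

    import numpy as np

    a = np.arange(8)
    indices = [0, 4, 1, 5]          # note indices[1] >= indices[2]

    np.add.reduceat(a, indices)     # array([ 6,  4, 10, 18])
    # out[1] == a[4] == 4 under the current rule, whereas the behavior proposed
    # above would give np.add.reduce(a[4:1]) == 0, the reduction of an empty
    # slice (i.e. the ufunc identity).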
I cannot see any application where this behavior is useful or where this
choice makes sense. This seems just a bug that should be fixed.
What do people think?
PS: A quick fix for the current implementation is
out = ufunc.reduceat(a, indices)
out[:-1] *= np.diff(indices) > 0