Hello,
I coincidentally started my own implementation of a system to manage
intermediate results last week, which I called jug. I wasn't planning to make
such an alpha version public just yet, but it seems to be on topic.
The main idea is to use hashes to map function arguments to paths on the
filesystem, which store the result (nothing extraordinary here). I also added
the capability for tasks (the basic unit) to take the results of other tasks
as arguments, defining an implicit dependency DAG. A simple locking mechanism
enables light-weight task-level parallelisation (this was the second of my
goals: to help me make my stuff parallel).
A trick that helps is that I don't actually use the argument values for
hashing (which would be unwieldy for big arrays). Instead, I hash the
computation path (e.g., this is the value obtained from
*f(g('something'),2)*). Since, at least in my problems, things tend to always
map back to simple file-system paths, computing the hash doesn't even require
loading the intermediate results.
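In code, the idea is roughly the following (a simplified sketch, not jug's
actual implementation; the *hash()* method on tasks is a stand-in for
whatever the real interface looks like):

::

    import hashlib

    def hash_task(f, args):
        # Hash the *computation path*: the function's name plus, for each
        # argument, either its literal representation or, if the argument
        # is itself a task, that task's own hash. Big intermediate values
        # never need to be loaded from disk to compute this.
        h = hashlib.sha1()
        h.update(f.__name__)
        for arg in args:
            if hasattr(arg, 'hash'):    # the argument is another task
                h.update(arg.hash())
            else:                       # a literal value
                h.update(repr(arg))
        return h.hexdigest()            # also used as the path on disk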
I will make the git repository publicly available once I figure out how to do
that.
I append the tutorial I wrote, which explains the system.
HTH,
Luís Pedro Coelho
PhD Student in Computational Biology
Carnegie Mellon University
============
Jug Tutorial
============
What is jug?
------------
Jug is a simple way to write easily parallelisable programs in Python. It also
handles intermediate results for you.
Example
-------
This is a simple worked-through example which illustrates what jug does.
Problem
~~~~~~~
Assume that I want to do the following to a collection of images:
(1) for each image, compute some features
(2) cluster these features using k-means. In order to find out the number
of clusters, I try several values and pick the best result. For each value of
k, because of the random initialisation, I run the clustering 10 times.
I could write the following simple code:
::
    imgs = glob('*.png')
    features = [computefeatures(img,parameter=2) for img in imgs]
    clusters = []
    bics = []
    for k in xrange(2,200):
        for repeat in xrange(10):
            clusters.append(kmeans(features,k=k,random_seed=repeat))
            bics.append(compute_bic(clusters[-1]))
    Nr_clusters = argmin(bics) // 10
Very simple and solves the problem. However, if I want to take advantage of
the obvious parallelisation of the problem, then I need to write much more
complicated code. My traditional approach is to break this down into smaller
scripts. I'd have one to compute features for some images, I'd have another to
merge all the results together and do some of the clustering, and, finally, one
to merge all the results of the different clusterings. These would need to be
called with different parameters to explore different areas of the parameter
space, so I'd have a couple of scripts just for calling the main computation
scripts. Intermediate results would be saved and loaded by the different
processes.
This has several problems. The biggest are
(1) The need to manage intermediate files. These are normally files with
long names like *features_for_img_0_with_parameter_P.pp*.
(2) The code gets much more complex.
There are minor issues with having to issue several jobs (and having the
cluster be idle in the meanwhile), or deciding on how to partition the jobs so
that they take roughly the same amount of time, but the two above are the main
ones.
Jug solves all these problems!
Tasks
~~~~~
The main unit of jug is a Task. Any function can be used to generate a Task. A
Task can depend on the results of other Tasks.
The original idea for jug was a Makefile-like environment for declaring Tasks.
I have moved beyond that, but it might help you think about what Tasks are.
You create a Task by giving it a function which performs the work and its
arguments. The arguments can be either literal values or other tasks (in which
case, the function will be called with the *result* of those tasks!). Jug also
understands lists of tasks (all standard Python containers will be supported
in a later version). For example, the following code declares the necessary
tasks for our problem:
::
    imgs = glob('*.png')
    feature_tasks = [Task(computefeatures,img,parameter=2) for img in imgs]
    cluster_tasks = []
    bic_tasks = []
    for k in xrange(2,200):
        for repeat in xrange(10):
            cluster_tasks.append(Task(kmeans,feature_tasks,k=k,random_seed=repeat))
            bic_tasks.append(Task(compute_bic,cluster_tasks[-1]))
    def nr_clusters(bics):
        return argmin(bics) // 10
    Nr_clusters = Task(nr_clusters,bic_tasks)
Task Generators
~~~~~~~~~~~~~~~
In the code above, there is a lot of code of the form *Task(function,args)*;
it would be nicer to write just *function(args)*. A simple helper function
makes this possible:
::
    from jug.task import Task

    def TaskGenerator(function):
        def gen(*args,**kwargs):
            return Task(function,*args,**kwargs)
        return gen

    computefeatures = TaskGenerator(computefeatures)
    kmeans = TaskGenerator(kmeans)
    compute_bic = TaskGenerator(compute_bic)

    @TaskGenerator
    def Nr_clusters(bics):
        return argmin(bics) // 10

    imgs = glob('*.png')
    features = [computefeatures(img,parameter=2) for img in imgs]
    clusters = []
    bics = []
    for k in xrange(2,200):
        for repeat in xrange(10):
            clusters.append(kmeans(features,k=k,random_seed=repeat))
            bics.append(compute_bic(clusters[-1]))
    Nr_clusters(bics)
You can see that this code is almost identical to our original sequential
code, except for the declarations at the top and the fact that *Nr_clusters*
is now a function (actually a TaskGenerator; note the use of a decorator).
This file is called the jugfile (you should name it *jugfile.py* on the
filesystem) and specifies your problem. Of course, *TaskGenerator* is already
part of jug, so those first few lines could have read
::
    from jug.task import TaskGenerator
Jug
~~~
So far, we have achieved seemingly little. We have turned a simple piece of
sequential code into something that generates Task objects, but does not
actually perform any work. The final piece is jug. Jug takes these Task objects
and runs them. Its main loop is basically
::
    while len(tasks) > 0:
        for t in tasks:
            if can_run(t): # ensures that all dependencies have been run
                if need_to_run(t) and not is_running(t):
                    t.run()
                tasks.remove(t)
If you run jug on the script above, you will simply have reproduced the
original code with the added benefit of having all the intermediate results
saved.
The interesting part is what happens when you run several instances of jug at
the same time. They will all start running Tasks, but each instance will run
different tasks. This allows you to take advantage of multiple processors in a
way that keeps them all occupied as long as there is work to be done, handles
the implicit dependencies, and passes functions the right values. Note also
that, unlike more traditional parallel processing frameworks (like MPI), jug
has no problem with the number of participating processors varying throughout
the job.
Behind the scenes, jug is using the filesystem to both save intermediate
results (which get passed around) and to lock running tasks so that each task
is only run once (the actual main loop is thus a bit more complex than shown
above).
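The lock itself can be as simple as an atomic file creation (again, a sketch
of the idea rather than jug's exact code; the on-disk layout is invented):

::

    import os

    def acquire_lock(lockdir, task_hash):
        # O_CREAT|O_EXCL guarantees that the create succeeds in exactly
        # one process, so each task is claimed by at most one jug instance.
        try:
            fd = os.open(os.path.join(lockdir, task_hash + '.lock'),
                         os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except OSError:
            return False

    def release_lock(lockdir, task_hash):
        os.remove(os.path.join(lockdir, task_hash + '.lock'))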
Intermediate and Final Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can obtain the final results of your computation by setting up a task that
saves them to disk and loading them from there. If the results of your
computation are simple enough, this might be the simplest way.
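For example, something along these lines would save all the BIC values from
the example above (*save_results* is just a name made up for this sketch):

::

    import pickle

    @TaskGenerator
    def save_results(filename, results):
        # This is a task like any other, so jug will only run it once
        # all of its dependencies (here, the bic tasks) have completed.
        output = open(filename, 'wb')
        pickle.dump(results, output)
        output.close()

    save_results('bics.pkl', bics)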
Another way, which is also the way to access the intermediate results if you
want them, is to run the jug script and then call the *load()* method on
Tasks. For example,
::
    imgs = glob('*.png')
    features = [computefeatures(img,parameter=2) for img in imgs]
    ...
    feature_values = [feat.load() for feat in features]
If the values are not accessible, this raises an exception.
Advantages
----------
jug is an attempt to get something that works in the setting I have found
myself in: code that is *embarrassingly parallel* with a couple of points
where all the results of previous processing are merged, often in a simple
way. It is also a way for me to manage both the explosion of temporary files
that plagued my code and the brittleness of making sure that all results from
separate processors are merged correctly in my *ad hoc* scripts.
Limitations
-----------
This is not an attempt to replace MPI in any way. For code that has more merge
points, this won't do. It also won't do if the individual tasks are so small
that the overhead of managing them swamps the performance gains of
parallelisation. In my code, most of the time, each task takes twenty seconds
to a few minutes: long enough to make the management time irrelevant, but
short enough that the main job can be broken into thousands of small pieces.
The system makes it too easy to save all intermediate results and run out of
disk space.
This is still Python, not a true parallel programming language. The
abstraction will sometimes leak through, for example, if you try to pass a
Task to a function which expects a real value. Recall how we had to re-write
the line *Nr_clusters = argmin(bics) // 10* above.
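Concretely, with the example above:

::

    # bics is a list of Task objects, not of numbers, so this would
    # compute the argmin over Task objects -- not what we want:
    Nr_clusters = argmin(bics) // 10

    # The computation has to become a task of its own, so that it only
    # runs once the BIC values actually exist:
    @TaskGenerator
    def Nr_clusters(bics):
        return argmin(bics) // 10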
Planned Capabilities
--------------------
Here are a couple of simple improvements I plan to make at some point:
* jug.py cleanup: removes left-over locks, temporary files, and unused
  results.
* Stop & re-start. Currently, jug processes will exit if they can't make
  any progress for a while. In the future, I'd like other jug processes to
  be able to unblock them when progress becomes possible again.
* No-result tasks: Task-like objects that don't save intermediate results.
* Have tasks be passed inside *sets* and *dictionaries*. Maybe even
  *numpy* arrays! This will make jug even more like a real parallel
  programming language.
* If the original arguments are files on disk, then jug should check their
  modification dates and invalidate subsequent results.