Hello everyone!
We have been encountering several deadlocks in a threaded Python
application which calls subprocess.Popen (i.e. fork()) in some of its
threads.
This has occurred on Python 2.4.1 on a 2.4.27 Linux kernel.
Preliminary analysis of the hang shows that the child process blocks
upon entering the execvp function, in which the import_lock is acquired
due to the following line:
def _execvpe(file, args, env=None):
    from errno import ENOENT, ENOTDIR
    ...
It is known that when forking from a pthreaded application, any attempt in
the child to acquire a lock that another thread held at the moment fork()
was called will deadlock: only the forking thread survives in the child, so
the lock can never be released.
Because of this, we were wondering whether it would be better to move the
import line above out of _execvpe, so that the forked child never attempts
to acquire the import lock.
Another workaround could be to re-assign a fresh lock to import_lock after
the fork (as is already done for the global interpreter lock), either in
PyOS_AfterFork or via a pthread_atfork handler.
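For illustration, a minimal sketch of the first workaround (hypothetical --
the real change would go in Lib/os.py):

    from errno import ENOENT, ENOTDIR   # hoisted to module level, imported once

    def _execvpe(file, args, env=None):
        # body otherwise unchanged; with no import statement here, the forked
        # child never needs to acquire the interpreter's import lock
        pass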
We'd appreciate any opinions you might have on the subject.
Thanks in advance,
Yair and Rotem
On Wed, 10 Nov 2004, John P Speno wrote:
Hi, sorry for the delayed response.
> While using subprocess (aka popen5), I came across one potential gotcha. I've had
> exceptions ending like this:
>
> File "test.py", line 5, in test
> cmd = popen5.Popen(args, stdout=PIPE)
> File "popen5.py", line 577, in __init__
> data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
> OSError: [Errno 4] Interrupted system call
>
> (on Solaris 9)
>
> Would it make sense for subprocess to use a more robust read() function
> which can handle these cases, i.e. when the parent's read on the pipe
> to the child's stderr is interrupted by a system call, and returns EINTR?
> I imagine it could catch EINTR and EAGAIN and retry the failed read().
I assume you are using signals in your application? The os.read above is
not the only system call that can fail with EINTR. subprocess.py is full
of other system calls that can fail, and I suspect that many other Python
modules are as well.
I've made a patch (attached) to subprocess.py (and test_subprocess.py)
that should guard against EINTR, but I haven't committed it yet. It's
quite large.
Are Python modules supposed to handle EINTR? Why not let the C code handle
this? Or, perhaps the signal module should provide a sigaction function,
so that users can use SA_RESTART.
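For what it's worth, the retry pattern can also be factored into a single
helper; a minimal sketch (the attached patch uses per-call wrappers instead):

    import errno, os

    def retry_on_eintr(func, *args):
        """Call func(*args), retrying as long as it fails with EINTR."""
        while True:
            try:
                return func(*args)
            except (OSError, IOError), e:
                if e.errno != errno.EINTR:
                    raise

    # e.g.: data = retry_on_eintr(os.read, errpipe_read, 1048576)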
Index: subprocess.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/subprocess.py,v
retrieving revision 1.8
diff -u -r1.8 subprocess.py
--- subprocess.py 7 Nov 2004 14:30:34 -0000 1.8
+++ subprocess.py 17 Nov 2004 19:42:30 -0000
@@ -888,6 +888,50 @@
pass
+ def _read_no_intr(self, fd, buffersize):
+ """Like os.read, but retries on EINTR"""
+ while True:
+ try:
+ return os.read(fd, buffersize)
+ except OSError, e:
+ if e.errno == errno.EINTR:
+ continue
+ else:
+ raise
+
+
+ def _read_all(self, fd, buffersize):
+ """Like os.read, but retries on EINTR, and reads until EOF"""
+ all = ""
+ while True:
+ data = self._read_no_intr(fd, buffersize)
+ all += data
+ if data == "":
+ return all
+
+
+ def _write_no_intr(self, fd, s):
+ """Like os.write, but retries on EINTR"""
+ while True:
+ try:
+ return os.write(fd, s)
+ except OSError, e:
+ if e.errno == errno.EINTR:
+ continue
+ else:
+ raise
+
+ def _waitpid_no_intr(self, pid, options):
+ """Like os.waitpid, but retries on EINTR"""
+ while True:
+ try:
+ return os.waitpid(pid, options)
+ except OSError, e:
+ if e.errno == errno.EINTR:
+ continue
+ else:
+ raise
+
def _execute_child(self, args, executable, preexec_fn, close_fds,
cwd, env, universal_newlines,
startupinfo, creationflags, shell,
@@ -963,7 +1007,7 @@
exc_value,
tb)
exc_value.child_traceback = ''.join(exc_lines)
- os.write(errpipe_write, pickle.dumps(exc_value))
+ self._write_no_intr(errpipe_write, pickle.dumps(exc_value))
# This exitcode won't be reported to applications, so it
# really doesn't matter what we return.
@@ -979,7 +1023,7 @@
os.close(errwrite)
# Wait for exec to fail or succeed; possibly raising exception
- data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
+ data = self._read_all(errpipe_read, 1048576) # Exceptions limited to 1 MB
os.close(errpipe_read)
if data != "":
child_exception = pickle.loads(data)
@@ -1003,7 +1047,7 @@
attribute."""
if self.returncode == None:
try:
- pid, sts = os.waitpid(self.pid, os.WNOHANG)
+ pid, sts = self._waitpid_no_intr(self.pid, os.WNOHANG)
if pid == self.pid:
self._handle_exitstatus(sts)
except os.error:
@@ -1015,7 +1059,7 @@
"""Wait for child process to terminate. Returns returncode
attribute."""
if self.returncode == None:
- pid, sts = os.waitpid(self.pid, 0)
+ pid, sts = self._waitpid_no_intr(self.pid, 0)
self._handle_exitstatus(sts)
return self.returncode
@@ -1049,27 +1093,33 @@
stderr = []
while read_set or write_set:
- rlist, wlist, xlist = select.select(read_set, write_set, [])
+ try:
+ rlist, wlist, xlist = select.select(read_set, write_set, [])
+ except select.error, e:
+ if e[0] == errno.EINTR:
+ continue
+ else:
+ raise
if self.stdin in wlist:
# When select has indicated that the file is writable,
# we can write up to PIPE_BUF bytes without risk
# blocking. POSIX defines PIPE_BUF >= 512
- bytes_written = os.write(self.stdin.fileno(), input[:512])
+ bytes_written = self._write_no_intr(self.stdin.fileno(), input[:512])
input = input[bytes_written:]
if not input:
self.stdin.close()
write_set.remove(self.stdin)
if self.stdout in rlist:
- data = os.read(self.stdout.fileno(), 1024)
+ data = self._read_no_intr(self.stdout.fileno(), 1024)
if data == "":
self.stdout.close()
read_set.remove(self.stdout)
stdout.append(data)
if self.stderr in rlist:
- data = os.read(self.stderr.fileno(), 1024)
+ data = self._read_no_intr(self.stderr.fileno(), 1024)
if data == "":
self.stderr.close()
read_set.remove(self.stderr)
Index: test/test_subprocess.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/test/test_subprocess.py,v
retrieving revision 1.14
diff -u -r1.14 test_subprocess.py
--- test/test_subprocess.py 12 Nov 2004 15:51:48 -0000 1.14
+++ test/test_subprocess.py 17 Nov 2004 19:42:30 -0000
@@ -7,6 +7,7 @@
import tempfile
import time
import re
+import errno
mswindows = (sys.platform == "win32")
@@ -35,6 +36,16 @@
fname = tempfile.mktemp()
return os.open(fname, os.O_RDWR|os.O_CREAT), fname
+ def read_no_intr(self, obj):
+ while True:
+ try:
+ return obj.read()
+ except IOError, e:
+ if e.errno == errno.EINTR:
+ continue
+ else:
+ raise
+
#
# Generic tests
#
@@ -123,7 +134,7 @@
p = subprocess.Popen([sys.executable, "-c",
'import sys; sys.stdout.write("orange")'],
stdout=subprocess.PIPE)
- self.assertEqual(p.stdout.read(), "orange")
+ self.assertEqual(self.read_no_intr(p.stdout), "orange")
def test_stdout_filedes(self):
# stdout is set to open file descriptor
@@ -151,7 +162,7 @@
p = subprocess.Popen([sys.executable, "-c",
'import sys; sys.stderr.write("strawberry")'],
stderr=subprocess.PIPE)
- self.assertEqual(remove_stderr_debug_decorations(p.stderr.read()),
+ self.assertEqual(remove_stderr_debug_decorations(self.read_no_intr(p.stderr)),
"strawberry")
def test_stderr_filedes(self):
@@ -186,7 +197,7 @@
'sys.stderr.write("orange")'],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
- output = p.stdout.read()
+ output = self.read_no_intr(p.stdout)
stripped = remove_stderr_debug_decorations(output)
self.assertEqual(stripped, "appleorange")
@@ -220,7 +231,7 @@
stdout=subprocess.PIPE,
cwd=tmpdir)
normcase = os.path.normcase
- self.assertEqual(normcase(p.stdout.read()), normcase(tmpdir))
+ self.assertEqual(normcase(self.read_no_intr(p.stdout)), normcase(tmpdir))
def test_env(self):
newenv = os.environ.copy()
@@ -230,7 +241,7 @@
'sys.stdout.write(os.getenv("FRUIT"))'],
stdout=subprocess.PIPE,
env=newenv)
- self.assertEqual(p.stdout.read(), "orange")
+ self.assertEqual(self.read_no_intr(p.stdout), "orange")
def test_communicate(self):
p = subprocess.Popen([sys.executable, "-c",
@@ -305,7 +316,8 @@
'sys.stdout.write("\\nline6");'],
stdout=subprocess.PIPE,
universal_newlines=1)
- stdout = p.stdout.read()
+
+ stdout = self.read_no_intr(p.stdout)
if hasattr(open, 'newlines'):
# Interpreter with universal newline support
self.assertEqual(stdout,
@@ -343,7 +355,7 @@
def test_no_leaking(self):
# Make sure we leak no resources
- max_handles = 1026 # too much for most UNIX systems
+ max_handles = 10 # too much for most UNIX systems
if mswindows:
max_handles = 65 # a full test is too slow on Windows
for i in range(max_handles):
@@ -424,7 +436,7 @@
'sys.stdout.write(os.getenv("FRUIT"))'],
stdout=subprocess.PIPE,
preexec_fn=lambda: os.putenv("FRUIT", "apple"))
- self.assertEqual(p.stdout.read(), "apple")
+ self.assertEqual(self.read_no_intr(p.stdout), "apple")
def test_args_string(self):
# args is a string
@@ -457,7 +469,7 @@
p = subprocess.Popen(["echo $FRUIT"], shell=1,
stdout=subprocess.PIPE,
env=newenv)
- self.assertEqual(p.stdout.read().strip(), "apple")
+ self.assertEqual(self.read_no_intr(p.stdout).strip(), "apple")
def test_shell_string(self):
# Run command through the shell (string)
@@ -466,7 +478,7 @@
p = subprocess.Popen("echo $FRUIT", shell=1,
stdout=subprocess.PIPE,
env=newenv)
- self.assertEqual(p.stdout.read().strip(), "apple")
+ self.assertEqual(self.read_no_intr(p.stdout).strip(), "apple")
def test_call_string(self):
# call() function with string argument on UNIX
@@ -525,7 +537,7 @@
p = subprocess.Popen(["set"], shell=1,
stdout=subprocess.PIPE,
env=newenv)
- self.assertNotEqual(p.stdout.read().find("physalis"), -1)
+ self.assertNotEqual(self.read_no_intr(p.stdout).find("physalis"), -1)
def test_shell_string(self):
# Run command through the shell (string)
@@ -534,7 +546,7 @@
p = subprocess.Popen("set", shell=1,
stdout=subprocess.PIPE,
env=newenv)
- self.assertNotEqual(p.stdout.read().find("physalis"), -1)
+ self.assertNotEqual(self.read_no_intr(p.stdout).find("physalis"), -1)
def test_call_string(self):
# call() function with string argument on Windows
/Peter Åstrand <astrand(a)lysator.liu.se>
This may seem like it's coming out of left field for a minute, but
bear with me.
There is no doubt that Ruby's success is a concern for anyone who
sees it as diminishing Python's status. One of the reasons for
Ruby's success is certainly the notion (originally advocated by Bruce
Tate, if I'm not mistaken) that it is the "next Java" -- the language
and environment that mainstream Java developers are, or will, look to
as a natural next step.
One thing that would help Python in this "debate" (or, perhaps simply
put it in the running, at least as a "next Java" candidate) would be
if Python had an easier migration path for Java developers that
currently rely upon various third-party libraries. The wealth of
third-party libraries available for Java has always been one of its
great strengths. Ergo, if Python had an easy-to-use, recommended way
to use those libraries within the Python environment, that would be a
significant advantage to present to Java developers and those who
would choose Ruby over Java. Platform compatibility is always a huge
motivator for those looking to migrate or upgrade.
In that vein, I would point to JPype (http://jpype.sourceforge.net).
JPype is a module that gives "python programs full access to java
class libraries". My suggestion would be to either:
(a) include JPype in the standard library, or barring that,
(b) make a very strong push to support JPype
(a) might be difficult or cumbersome technically, as JPype does need
to build against Java headers, which may or may not be possible given
the way that Python is distributed, etc.
However, (b) is very feasible. I can't really say what "supporting
JPype" means exactly -- maybe GvR and/or other heavyweights in the
Python community make public statements regarding its existence and
functionality, maybe JPype gets a strong mention or placement on
python.org....all those details are obviously not up to me, and I
don't know the workings of the "official" Python organizations enough
to make serious suggestions.
Regardless of the form of support, I think raising people's awareness
of JPype and what it adds to the Python environment would be a Good
Thing (tm).
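To give a flavor of what that migration path looks like, here is a rough
sketch of calling into a Java class library from Python via JPype (based on
the JPype 0.x API -- startJVM, JPackage -- with the JVM path and options
obviously varying by installation):

    from jpype import startJVM, shutdownJVM, getDefaultJVMPath, JPackage

    startJVM(getDefaultJVMPath(), "-ea")          # boot an in-process JVM
    java = JPackage("java")                       # root of the java.* namespace
    java.lang.System.out.println("Hello from the JVM")
    shutdownJVM()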
For our part, we've used JPype to make PDFTextStream (our previously
Java-only PDF text extraction library) available and supported for
Python. You can read some about it here:
http://snowtide.com/PDFTextStream.Python
And I've blogged about how PDFTextStream.Python came about, and how
we worked with Steve Ménard, the maintainer of JPype, to make it all
happen (watch out for this URL wrapping):
http://blog.snowtide.com/2006/08/21/working-together-pythonjava-open-
sourcecommercial
Cheers,
Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction
cemerick(a)snowtide.com
http://snowtide.com | +1 413.519.6365
Phillip.eby wrote:
> Author: phillip.eby
> Date: Tue Apr 18 02:59:55 2006
> New Revision: 45510
>
> Modified:
> python/trunk/Lib/pkgutil.py
> python/trunk/Lib/pydoc.py
> Log:
> Second phase of refactoring for runpy, pkgutil, pydoc, and setuptools
> to share common PEP 302 support code, as described here:
>
> http://mail.python.org/pipermail/python-dev/2006-April/063724.html
Shouldn't this new module be named "pkglib" to be in line with
the naming scheme used for all the other utility modules, e.g. httplib,
imaplib, poplib, etc. ?
> pydoc now supports PEP 302 importers, by way of utility functions in
> pkgutil, such as 'walk_packages()'. It will properly document
> modules that are in zip files, and is backward compatible to Python
> 2.3 (setuptools installs for Python <2.5 will bundle it so pydoc
> doesn't break when used with eggs.)
Are you saying that the installation of setuptools in Python 2.3
and 2.4 will then overwrite the standard pydoc included with
those versions ?
I think that's the wrong way to go if not made an explicit
option in the installation process or a separate installation
altogether.
I'm bothered by the fact that installing setuptools actually changes
the standard Python installation by either overriding stdlib modules
or monkey-patching them at setuptools import time.
> What has not changed is that pydoc command line options do not support
> zip paths or other importer paths, and the webserver index does not
> support sys.meta_path. Those are probably okay as limitations.
>
> Tasks remaining: write docs and Misc/NEWS for pkgutil/pydoc changes,
> and update setuptools to use pkgutil wherever possible, then add it
> to the stdlib.
Add setuptools to the stdlib ? I'm still missing the PEP for this
along with the needed discussion touching among other things,
the change of the distutils standard "python setup.py install"
to install an egg instead of a site package.
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Apr 18 2006)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
I'm interested in how builtins could be more efficient. I've read over
some of the PEPs having to do with making global variables more
efficient (search for "global"):
http://www.python.org/doc/essays/pepparade.html
But I think the problem can be simplified by focusing strictly on
builtins.
One of my assumptions is that only a small fraction of modules override
the default builtins with something like:
import mybuiltins
__builtins__ = mybuiltins
As you probably know, each access of a builtin currently requires two hash
table lookups: the name is first looked up (and not found) in the module's
globals dict, and is then found in the builtins dict.
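In pseudo-code, what LOAD_GLOBAL does for a builtin name like "len" is
roughly this (a simplified sketch, ignoring error handling and the actual
C implementation):

    def load_global(name, module_globals, builtins_dict):
        try:
            return module_globals[name]     # first hash lookup: usually misses
        except KeyError:
            return builtins_dict[name]      # second hash lookup: finds the builtin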
Why not have a means of referencing the default builtins with some sort
of index the way the LOAD_FAST op code currently works? In other words,
by default each module gets the default set of builtins indexed (where
the index indexes into an array) in a certain order. The version stored
in the pyc file would be bumped each time the set of default builtins
is changed.
I don't have very strong feelings whether things like True = (1 == 1)
would be a syntax error, but assigning to a builtin could just do the
equivalent of STORE_FAST. I also don't have very strong feelings about
whether the array of default builtins would be shared between modules.
To simulate the current behavior, where assigning to a builtin actually
alters that module's globals dict, a separate array of builtins could be
used for each module.
As to assigning to __builtins__ (like I mentioned at the beginning of
this post) perhaps it could assign to the builtin array for those items
that have a name that matches a default builtin (such as "True" or
"len"). Those items that don't match a default builtin would just
create global variables.
Perhaps what I'm suggesting isn't feasible for reasons that have already
been discussed. But it seems like it should be possible to make "while
True" as efficient as "while 1".
--
-----------------------------------------------------------------------
| Steven Elliott | selliott4(a)austin.rr.com |
-----------------------------------------------------------------------
Should GeneratorExit inherit from Exception or BaseException?
Currently, a generator that catches Exception and continues on to yield
another value can't be closed properly (you get a runtime error pointing out
that the generator ignored GeneratorExit).
The only decent reference I could find to it in the old PEP 348/352
discussions is Guido writing [1]:
> when GeneratorExit or StopIteration
> reach the outer level of an app, it's a bug like all the others that
> bare 'except:' WANTS to catch.
(at that point in the conversation, I believe bare except was considered the
equivalent of "except Exception:")
While I agree with what Guido says about GeneratorExit being a bug if it
reaches the outer level of an app, it seems like a bit of a trap that a
correctly written generator can't write "except Exception:" without preceding
it with an "except GeneratorExit:" that reraises the exception. Isn't that
exactly the idiom we're trying to get rid of for SystemExit and KeyboardInterrupt?
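A minimal illustration of the trap (Python 2.5 semantics, where GeneratorExit
inherits from Exception):

    def gen():
        while True:
            try:
                yield 1
            except Exception:
                continue          # swallows GeneratorExit as well

    g = gen()
    g.next()
    try:
        g.close()
    except RuntimeError, e:
        print e                   # "generator ignored GeneratorExit"

    # The idiom a correct generator currently needs:
    def well_behaved():
        while True:
            try:
                yield 1
            except GeneratorExit:
                raise             # let close() work
            except Exception:
                continue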
Regards,
Nick.
[1] http://mail.python.org/pipermail/python-dev/2005-August/055173.html
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
I was looking for a good pairing_heap implementation and came across
one that had apparently been checked in a couple years ago (!). Here
is the full link:
http://svn.python.org/view/sandbox/trunk/collections/pairing_heap.py?rev=40…
I was just wondering about the status of this implementation. The api
looks pretty good to me -- it's great that the author decided to have
the insert method return a node reference which can then be passed to
delete and adjust_key. It's a bit of a pain to implement that
functionality, but it's extremely useful for a number of applications.
If that project is still alive, I have a couple api suggestions:
* Add a method which nondestructively yields the top K elements of the
heap (see the sketch after this list). This would work by popping the top K
elements of the heap into a list, then reinserting those elements in reverse
order. By reinserting the sorted elements in reverse order, the top of the
heap is essentially a sorted linked list, so if the exact operation is
repeated again, the removals take constant time rather than amortized
logarithmic.
* So, for example: if we have a min heap, the topK method would pop
K elements from the heap, say they are {1, 3, 5, 7}, then do
insert(7), followed by insert(5), ... insert(1).
* Even better might be if this operation avoided having to allocate
new heap nodes, and just reused the old ones.
* I'm not sure if adjust_key should throw an exception if the key
adjustment is in the wrong direction. Perhaps it should just fall back
on deleting and reinserting that node?
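For illustration, here is the top-K idea sketched against the stdlib heapq
module (only to show the interface; the constant-time-on-repeat argument
above is specific to pairing heaps, and the sandbox pairing_heap API may
differ):

    import heapq

    def top_k_nondestructive(heap, k):
        """Return the k smallest items without permanently removing them."""
        popped = [heapq.heappop(heap) for _ in xrange(min(k, len(heap)))]
        for item in reversed(popped):   # reinsert smallest last, so it ends on top
            heapq.heappush(heap, item)
        return popped

    h = [7, 3, 1, 5]
    heapq.heapify(h)
    print top_k_nondestructive(h, 3)    # [1, 3, 5]; h still holds all four items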
Paul
I've never liked the "".join([]) idiom for string concatenation; in my
opinion it violates the principles "Beautiful is better than ugly." and
"There should be one-- and preferably only one --obvious way to do it.".
(And perhaps several others.) To that end I've submitted patch #1569040
to SourceForge:
http://sourceforge.net/tracker/index.php?func=detail&aid=1569040&group_id=5…
This patch speeds up using + for string concatenation. It's been in
discussion on c.l.p for about a week, here:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/b8a8f20bc…
I'm not a Python guru, and my initial benchmark had many mistakes. With
help from the community correct benchmarks emerged: + for string
concatenation is now roughly as fast as the usual "".join() idiom when
appending. (It appears to be *much* faster for prepending.) The
patched Python passes all the tests in regrtest.py for which I have
source; I didn't install external packages such as bsddb and sqlite3.
My approach was to add a "string concatenation" object; I have since
learned this is also called a "rope". Internally, a
PyStringConcatationObject is exactly like a PyStringObject but with a
few extra members taking an additional thirty-six bytes of storage.
When you add two PyStringObjects together, string_concat() returns a
PyStringConcatationObject which contains references to the two strings.
Concatenating any mixture of PyStringObjects and
PyStringConcatationObjects works similarly, though there are some
internal optimizations.
These changes are almost entirely contained within
Objects/stringobject.c and Include/stringobject.h. There is one major
externally-visible change in this patch: PyStringObject.ob_sval is no
longer a char[1] array, but a char *. Happily, this only requires a
recompile, because the CPython source is *marvelously* consistent about
using the macro PyString_AS_STRING(). (One hopes extension authors are
as consistent.) I only had to touch two other files (Python/ceval.c and
Objects/codeobject.c) and those were one-line changes. There is one
remaining place that still needs fixing: the self-described "hack" in
Mac/Modules/MacOS.c. Fixing that is beyond my pay grade.
I changed the representation of ob_sval for two reasons: first, because it is
initially NULL for a string concatenation object, and second, because it
may point to separately-allocated memory. That's where the speedup came
from--it doesn't render the string until someone asks for the string's
value. It is telling to see my new implementation of
PyString_AS_STRING, as follows (casts and extra parentheses removed for
legibility):
#define PyString_AS_STRING(x) \
    ( x->ob_sval ? x->ob_sval : PyString_AsString(x) )
This adds a layer of indirection for the string and a branch, adding a
tiny (but measurable) slowdown to the general case. Again, because the
changes to PyStringObject are hidden by this macro, external users of
these objects don't notice the difference.
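For readers who don't want to dig through the C, here is a Python-level
sketch of the lazy-concatenation ("rope") behaviour described above
(illustrative only; the real patch lives in Objects/stringobject.c):

    class LazyConcat(object):
        def __init__(self, left, right):
            self.left, self.right = left, right
            self.rendered = None      # plays the role of the initially-NULL ob_sval

        def __str__(self):
            # Flatten only when somebody asks for the value, mirroring
            # PyString_AS_STRING falling back to PyString_AsString().
            if self.rendered is None:
                self.rendered = str(self.left) + str(self.right)
            return self.rendered

    s = LazyConcat(LazyConcat("spam", "eggs"), "ham")
    print str(s)     # rendered only here: "spameggsham"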
The patch is posted, and I have donned the thickest skin I have handy.
I look forward to your feedback.
Cheers,
/larry/
PEP: <unassigned>
Title: Adding data-type objects to the standard library
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <oliphant(a)ee.byu.edu>
Status: Draft
Type: Standards Track
Created: 05-Sep-2006
Python-Version: 2.6
Abstract
This PEP proposes adapting the data-type objects from NumPy for
inclusion in standard Python, to provide a consistent and standard
way to discuss the format of binary data.
Rationale
There are many situations crossing multiple areas where an
interpretation is needed of binary data in terms of fundamental
data-types such as integers, floating-point, and complex
floating-point values. Having a common object that carries
information about binary data would be beneficial to many
people. The creation of data-type objects in NumPy to carry the
load of describing what each element of the array contains
represents an evolution of a solution that began with the
PyArray_Descr structure in Python's own array object. These
data-type objects can represent arbitrary byte data. Currently
such information is usually constructed using strings and
character codes which is unwieldy when a data-type consists of
nested structures.
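For comparison, this is roughly what describing binary data with "strings
and character codes" looks like today using the struct module; nested
structures have no natural spelling in this notation:

    import struct

    # one record: a 4-byte signed int followed by an 8-byte double, native order
    packed = struct.pack('=id', 7, 3.14)
    print struct.unpack('=id', packed)    # (7, 3.14)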
Proposal
Add a PyDatatypeObject in Python (adapted from NumPy's dtype
object which evolved from the PyArray_Descr structure in Python's
array module) that holds information about a data-type. This object
will allow packages to exchange information about binary data in
a uniform way (see the extended buffer protocol PEP for an application
to exchanging information about array data).
Specification
The datatype is an object that specifies how a certain block of
memory should be interpreted as a basic data-type. In addition to
being able to describe basic data-types, the data-type object can
describe a data-type that is itself an array of other data-types
as well as a data-type that contains arbitrary "fields" (structure
members) which are located at specific offsets. In its most basic
form, however, a data-type is of a particular kind (bit, bool,
int, uint, float, complex, object, string, unicode, void) and size.
Datatype objects can be created using either a type-object, a
string, a tuple, a list, or a dictionary according to the following
constructors:
Type-object:
For a select set of type-objects, a data-type object describing that
basic type can be constructed:
Examples:
>>> datatype(float)
datatype('float64')
>>> datatype(int)
datatype('int32') # on 32-bit platform (64 if c-long is 64-bits)
Tuple-object
A tuple of length 2 can be used to specify a data-type that is
an array of another kind of basic data-type (this array always
describes a C-contiguous array).
Examples:
>>> datatype((int, 5))
datatype(('int32', (5,)))
# describes a 5*4=20-byte block of memory laid out as
# a[0], a[1], a[2], a[3], a[4]
>>> datatype((float, (3,2)))
datatype(('float64', (3,2)))
# describes a 3*2*8=48 byte block of memory that should be
# interpreted as 6 doubles laid out as arr[0,0], arr[0,1],
# arr[1,0], arr[1,1], arr[2,0], arr[2,1]
String-object:
The basic format is '%s%s%s%d' % (endian, shape, kind, itemsize)
kind : one of the basic array kinds given below.
itemsize : the number of bytes (or bits for 't' kind) for
this data-type.
endian : either '', '=' (native), '|' (doesn't matter),
'>' (big-endian) or '<' (little-endian).
shape : either '', or a shape-tuple describing a data-type that
is an array of the given shape.
A string can also be a comma-separated sequence of basic
formats. The result will be a data-type with default field
names: 'f0', 'f1', ..., 'fn'.
Examples:
>>> datatype('u4')
datatype('uint32')
>>> datatype('f4')
datatype('float32')
>>> datatype('(3,2)f4')
datatype(('float32', (3,2)))
>>> datatype('(5,)i4, (3,2)f4, S5')
datatype([('f0', '<i4', (5,)), ('f1', '<f4', (3, 2)), ('f2', '|S5')])
List-object:
A list should be a list of tuples where each tuple describes a
field. Each tuple should contain (name, datatype{, shape}) or
((meta-info, name), datatype{, shape}) in order to specify the
data-type.
This list must fully specify the data-type (no memory holes). If
you would like to return a data-type with memory holes where the
compiler would place them, then pass the keyword align=1 to the
constructor. This will result in un-named fields of Void kind of
the correct size interspersed where needed.
Examples:
datatype([( ([1,2],'coords'), 'f4', (3,6)), ('address', 'S30')])
A data-type that could represent the structure
float coords[3*6] /* Has [1,2] associated with this field */
char address[30]
datatype([( 'simple', 'i4'), ('nested', [('name', 'S30'),
('addr', 'S45'),
('amount', 'i4')])])
Can represent the memory layout of
struct {
    int simple;
    struct nested {
        char name[30];
        char addr[45];
        int amount;
    } nested;
};
There is no formal limit to the nesting that is possible.
datatype('i2, i4, i1, f8', align=1)
datatype([('f0', '<i2'), ('', '|V2'), ('f1', '<i4'),
('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])
# Notice the padding bytes placed in the structure to make sure
# f1 ('<i4') and f3 ('<f8') are aligned correctly on a 32-bit system.
Dictionary-object:
Sometimes, you are only concerned about a few fields in a larger
memory structure. The dictionary object allows specification of
a data-type with fields using a dictionary with names as keys and
tuples as values. The value tuples are
(data-type, offset{, meta-info}). The offset is the offset in
bytes (or bits when data-type is 't') from the beginning of the
structure to the field data-type.
Example:
datatype({'f3' : ('f8', 12), 'f2': ('i1', 8)})
datatype([('', '|V8'), ('f2', '|i1'), ('', '|V3'), ('f3', '<f8')])
Attributes
byteorder -- returns the byte-order of this data-type
isnative -- returns True if this data-type is in correct byte-order
for the platform.
descr -- returns a description of this data-type as a list of
tuples (name or (name, meta), datatype{, shape})
itemsize -- returns the total size of the data-type.
kind -- returns the basic "kind" of the data-type. The basic kinds
are:
't' - bit,
'b' - bool,
'i' - signed integer,
'u' - unsigned integer,
'f' - floating point,
'c' - complex floating point,
'S' - string (fixed-length sequence of char),
'U' - fixed length sequence of UCS4,
'O' - pointer to PyObject,
'V' - Void (anything else).
names -- returns a list of names (keys to the fields dictionary) in
offset-order.
fields -- returns a read-only dictionary indicating the fields or
None if this data-type has no fields. The dictionary
is keyed by the field name and each entry contains
a tuple of (data-type, offset{, meta-object}). The
offset indicates the byte-offset (or bit-offset for 't')
from the beginning of the data-type to the data-type
indicated.
hasobject -- returns True if this data-type is an "object" data-type
or has "object" fields.
name -- returns a 'name'-bitwidth description of data-type.
base -- returns self unless this data-type is an array of some
other data-type and then it returns that basic
data-type.
shape -- returns the shape of this data-type (for data-types
that are arrays of other data-types) or () if there
is no array.
str -- returns the type-string of this data-type which is the
basic kind followed by the number of bytes (or bits
for 't')
alignment -- returns alignment needed for this data-type on platform
as determined by the compiler.
Methods
newbyteorder ({endian})
create a new data-type with byte-order changed in any and all
fields (including deeply nested ones), to {endian}. If endian is
not given, then swap all byte-orders.
__len__(self)
equivalent to len(self.fields)
__getitem__(self, name)
get the field named [name]. Equivalent to self.fields[name].
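Illustrative pseudo-usage of the proposed object, with names taken from the
attribute and method lists above (datatype does not exist in the stdlib
today, so this is only a sketch of the intended API):

    dt = datatype([('simple', 'i4'), ('pos', 'f8', (3,))])
    print dt.itemsize          # bytes per record: 4 + 3*8 = 28
    print dt.names             # field names in offset order: ['simple', 'pos']
    print dt.fields['pos']     # (data-type, offset) tuple for the 'pos' field
    print len(dt)              # number of fields, i.e. len(dt.fields)
    print dt['simple']         # shorthand for dt.fields['simple']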
C-functions : These are function pointers attached in a C-structure
connected with the data-type object that perform specific
functions.
setitem (PyObject *datatype, void *data, PyObject *obj)
set a Python object into memory of this data-type
at the given memory location.
getitem (PyObject *datatype, void *data)
get a Python object from memory of this data-type.
Implementation
A reference implementation (with more features than are proposed
here) is available in NumPy and will be adapted if this PEP is
accepted.
Questions:
There should probably be a limited C-API so that data-type objects
can be returned and sent through the extended buffer protocol (see
extended buffer protocol PEP).
Should bit-fields be handled by re-interpreting the offsets as
bit-values, use some other mechanism for handling the offset, or
should they be unsupported?
NumPy supports "string" and "unicode" data-types. The unicode
data-type in NumPy always means UCS4 (but it is translated
back and forth to Python unicode scalars as needed for narrow
builds). With Python 3.0 looming, we should probably support
different encodings as data-types and drop the string type for a
bytes type. Some help in understanding what to do here is
appreciated.
Copyright
This PEP is placed in the public domain
Attached is my PEP for extending the buffer protocol to allow array data
to be shared.
PEP: <unassigned>
Title: Extending the buffer protocol to include the array interface
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <oliphant(a)ee.byu.edu>
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 2.6
Abstract
This PEP proposes extending the tp_as_buffer structure to include
function pointers that incorporate information about the intended
shape and data-format of the provided buffer. In essence this will
place something akin to the array interface directly into Python.
Rationale
Several extensions to Python utilize the buffer protocol to share
the location of a data-buffer that is really an N-dimensional
array. However, there is no standard way to exchange the
additional N-dimensional array information so that the data-buffer
is interpreted correctly. The NumPy project introduced an array
interface (http://numpy.scipy.org/array_interface.shtml) through a
set of attributes on the object itself. While this approach
works, it requires attribute lookups which can be expensive when
sharing many small arrays.
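For reference, the attribute-based interface mentioned above looks like this
from Python (sketch; requires NumPy):

    import numpy as np

    a = np.zeros((3, 2), dtype='<f8')
    info = a.__array_interface__
    print info['shape']      # (3, 2)
    print info['strides']    # None for C-contiguous data
    print info['typestr']    # '<f8'
    print info['data']       # (address-as-int, read-only flag)
    # Each item above costs a Python-level lookup; the proposed bf_getarrayinfo
    # slot would expose the same information through a single C call.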
One of the key reasons that users often request to place something
like NumPy into the standard library is so that it can be used as
standard for other packages that deal with arrays. This PEP
provides a mechanism for extending the buffer protocol (which
already allows data sharing) to add the additional information
needed to understand the data. This should be of benefit to all
third-party modules that want to share memory through the buffer
protocol such as GUI toolkits, PIL, PyGame, CVXOPT, PyVoxel,
PyMedia, audio libraries, video libraries etc.
Proposal
Add a bf_getarrayinfo function pointer to the buffer protocol to
allow objects to share additional information about the returned
memory pointer. Add the TP_HAS_EXT_BUFFER flag to types that
define the extended buffer protocol.
Specification:
static int
bf_getarrayinfo (PyObject *obj, Py_intptr_t **shape,
                 Py_intptr_t **strides, PyObject **dataformat)
Inputs:
obj -- The Python object being questioned.
Outputs:
[function result] -- the number of dimensions (n)
*shape -- A C-array of 'n' integers indicating the
shape of the array. Can be NULL if n==0.
*strides -- A C-array of 'n' integers indicating
the number of bytes to jump to get to the next
element in each dimension. Can be NULL if the
array is C-contiguous (or n==0).
*dataformat -- A Python object describing the data-format
each element of the array should be
interpreted as.
Discussion Questions:
1) How is data-format information supposed to be shared? A companion
proposal suggests returning a data-format object which carries the
information about the buffer area.
2) Should the single function pointer call be extended into
multiple calls, or should its arguments be compressed into a structure
that is filled in?
3) Should a C-API function (or functions) be created which wraps calls to this
function pointer, much as is done now with the buffer protocol? What should
the interface of this function (or these functions) be?
4) Should a mask (for missing values) be shared as well?
Reference Implementation
Supplied when the PEP is accepted.
Copyright
This document is placed in the public domain.