I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
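For reference, the classic escape that makes __subclasses__ dangerous
looks something like this (a sketch, assuming a Python 2-era runtime
like the one secure.py targets):

    # Walk from a harmless object up to `object` and back down to every
    # loaded class; in Python 2 the built-in `file` type is reachable.
    for cls in ().__class__.__bases__[0].__subclasses__():
        if cls.__name__ == 'file':
            print(cls('/etc/passwd').read())  # escapes a naive sandbox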
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Alright, I will re-submit with the contents pasted. I never use double
backquotes as I think them rather ugly; that is the work of an editor
or some automated program in the chain. Plus, it also messed up my
line formatting and now I have lines with one word on them... Anyway,
the contents of PEP 3145:
PEP: 3145
Title: Asynchronous I/O For subprocess.Popen
Author: (James) Eric Pruitt, Charles R. McCreary, Josiah Carlson
Type: Standards Track
Content-Type: text/plain
Created: 04-Aug-2009
Python-Version: 3.2
Abstract:
In its present form, the subprocess.Popen implementation is prone to
deadlocking and blocking of the parent Python script while waiting on
data from the child process.
Motivation:
A search for "python asynchronous subprocess" will turn up numerous
accounts of people wanting to execute a child process and communicate with
it from time to time, reading only the data that is available instead of
blocking to wait for the program to produce data [1] [2] [3]. The current
behavior of the subprocess module is that when a user sends or receives
data via the stdin, stderr and stdout file objects, deadlocks are common
and documented [4] [5]. While communicate can be used to alleviate some of
the buffering issues, it will still cause the parent process to block while
attempting to read data when none is available to be read from the child
process.
Rationale:
There is a documented need for asynchronous, non-blocking functionality in
subprocess.Popen [6] [7] [2] [3]. Inclusion of the code would improve the
utility of the Python standard library on both Unix-based and Windows
builds of Python. Practically every I/O object in Python has a file-like
wrapper of some sort. Sockets already act as such, and for strings there
is StringIO. Popen can be made to act like a file by simply using the
methods attached to the subprocess.Popen.stderr, stdout and stdin
file-like objects. But when using the read and write methods of those
objects, you do not have the benefit of asynchronous I/O. In the proposed
solution, the wrapper wraps the asynchronous methods to mimic a file
object.
Reference Implementation:
I have been maintaining a Google Code repository that contains all of my
changes including tests and documentation [9] as well as a blog detailing
the problems I have come across in the development process [10].
I have been working on implementing non-blocking asynchronous I/O in the
subprocess.Popen module as well as a wrapper class for subprocess.Popen
that makes it so that an executed process can take the place of a file by
duplicating all of the methods and attributes that file objects have.
There are two base functions that have been added to the subprocess.Popen
class: Popen.send and Popen._recv, each with two separate implementations,
one for Windows and one for Unix-based systems. The Windows
implementation uses ctypes to access the functions needed to control pipes
in the kernel32 DLL in an asynchronous manner. On Unix-based systems,
the Python interface for file control serves the same purpose. The
different implementations of Popen.send and Popen._recv have identical
arguments to make code that uses these functions work across multiple
platforms.
Since Popen._recv requires the pipe name to be passed as an argument,
the Popen.recv function exists to select stdout as the pipe for
Popen._recv by default, and Popen.recv_err selects stderr by default.
"Popen.recv" and "Popen.recv_err" are much easier to read and understand
than "Popen._recv('stdout' ..." and "Popen._recv('stderr' ..."
respectively.
Since the Popen._recv function does not wait on data to be produced
before returning a value, it may return empty bytes. Popen.asyncread
handles this issue by returning all data read over a given time
interval.
The ProcessIOWrapper class uses the asyncread and asyncwrite functions to
allow a process to act like a file so that there are no blocking issues
that can arise from using the stdout and stdin file objects produced from
a subprocess.Popen call.
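To illustrate, a hedged usage sketch of the proposed API (the child
command is arbitrary, and the asyncread signature is my assumption for
illustration, not part of the specification above):

    import subprocess

    p = subprocess.Popen(['cat'], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.send(b'hello\n')       # proposed non-blocking write to stdin
    data = p.recv()          # proposed non-blocking read of stdout;
                             # may return empty bytes if nothing is ready
    err = p.recv_err()       # same, but reads stderr
    more = p.asyncread(1.0)  # assumed: gather output for ~1 second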
References:
[1] [ python-Feature Requests-1191964 ] asynchronous Subprocess
    http://mail.python.org/pipermail/python-bugs-list/2006-December/036524.html
[2] Daily Life in an Ivory Basement : /feb-07/problems-with-subprocess
http://ivory.idyll.org/blog/feb-07/problems-with-subprocess
[3] How can I run an external command asynchronously from Python? - Stack Overflow
    http://stackoverflow.com/questions/636561/how-can-i-run-an-external-command-asynchronously-from-python
[4] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
[5] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.kill
[6] Issue 1191964: asynchronous Subprocess - Python tracker
http://bugs.python.org/issue1191964
[7] Module to allow Asynchronous subprocess use on Windows and Posix platforms - ActiveState Code
    http://code.activestate.com/recipes/440554/
[8] subprocess.rst - subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev/source/browse/doc/subprocess.rst?spec=s…
[9] subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev
[10] Python Subprocess Dev
http://subdev.blogspot.com/
Copyright:
This PEP is licensed under the Open Publication License:
http://www.opencontent.org/openpub/
On Tue, Sep 8, 2009 at 22:56, Benjamin Peterson <benjamin(a)python.org> wrote:
> 2009/9/7 Eric Pruitt <eric.pruitt(a)gmail.com>:
>> Hello all,
>>
>> I have been working on adding asynchronous I/O to the Python
>> subprocess module as part of my Google Summer of Code project. Now
>> that I have finished documenting and pruning the code, I present PEP
>> 3145 for its inclusion into the Python core code. Any and all feedback
>> on the PEP (http://www.python.org/dev/peps/pep-3145/) is appreciated.
>
> Hi Eric,
> One of the reasons you're not getting many responses is that you've not
> pasted the contents of the PEP in this message. Pasting them would make
> it really easy for people to comment on various sections.
>
> BTW, it seems like you were trying to use reST formatting with the
> text PEP layout. Double backquotes only mean something in reST.
>
>
> --
> Regards,
> Benjamin
>
Which I noticed since it's cited in the BeOpen license we still refer
to in LICENSE. Since pythonlabs.com itself is still up, it probably
isn't much work to make the logos.html URI work again, but I don't know
who maintains that page.
cheers,
Georg
--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.
Hello everyone.
I see several problems with the two hex-conversion function pairs that
Python offers:
1. binascii.hexlify and binascii.unhexlify
2. bytes.fromhex and bytes.hex
Problem #1:
bytes.hex is not implemented, although it was specified in PEP 358.
This means there is no symmetrical function to accompany bytes.fromhex.
Problem #2:
Both pairs perform the same function, although the Zen of Python
suggests that "There should be one-- and preferably only one --obvious
way to do it."
I do not understand why PEP 358 specified the bytes function pair although
it mentioned the binascii pair...
Problem #3:
bytes.fromhex may receive spaces in the input string, although
binascii.unhexlify may not.
I see no good reason for these two functions to have different features.
Problem #4:
binascii.unhexlify may receive both input types: strings or bytes, whereas
bytes.fromhex raises an exception when given a bytes parameter.
Again there is no reason for these functions to be different.
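A quick interactive illustration of problems #3 and #4 (Python 3
behavior as I understand it; exact error messages may differ):

    >>> bytes.fromhex("de ad be ef")       # spaces accepted
    b'\xde\xad\xbe\xef'
    >>> import binascii
    >>> binascii.unhexlify("de ad be ef")  # spaces rejected
    binascii.Error: Non-hexadecimal digit found
    >>> binascii.unhexlify(b"deadbeef")    # bytes input accepted here...
    b'\xde\xad\xbe\xef'
    >>> bytes.fromhex(b"deadbeef")         # ...but not here
    TypeError: fromhex() argument must be str, not bytes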
Problem #5:
binascii.hexlify returns a bytes type, although ideally converting to
hex should always return string types and converting from hex should
always return bytes. IMO there is no meaning in bytes as the output of
hexlify, since the output is a representation of other bytes. This is
also the suggested behavior of bytes.hex in PEP 358.
Problems #4 and #5 call for a decision about the input and output of the
functions being discussed:
Option A : Strict input and output
unhexlify (and bytes.fromhex) may only receive strings and may only
return bytes
hexlify (and bytes.hex) may only receive bytes and may only return
strings
Option B : Robust input and strict output
unhexlify (and bytes.fromhex) may receive bytes and strings and may only
return bytes
hexlify (and bytes.hex) may receive bytes or strings and may only return
strings
Of course we may also consider a third option, which will allow the return
type of
all functions to be robust (perhaps specified in a keyword argument), but as
I wrote in
the description of problem #5, I see no sense in that.
Note that PEP 3137 describes: "... the more strict definitions of encoding
and decoding in
Python 3000: encoding always takes a Unicode string and returns a bytes
sequence, and decoding
always takes a bytes sequence and returns a Unicode string." - suggesting
option A.
To repeat problems #4 and #5, the current behavior does not match any
option:
* The return type of binascii.hexlify should be string, and this is not the
current behavior.
As for the input:
* Option A is not the current behavior because binascii.unhexlify may
receive both input types.
* Option B is not the current behavior because bytes.fromhex does not allow
bytes as input.
To fix these issues, three changes should be applied:
1. Deprecate bytes.fromhex. This fixes the following problems:
#4 (go with option B and remove the function that does not allow bytes
input)
#2 (the binascii functions will be the only way to "do it")
#1 (bytes.hex should not be implemented)
2. In order to keep the functionality that bytes.fromhex has over unhexlify,
the latter function should be able to handle spaces in its input (fix #3)
3. binascii.hexlify should return str as its return type (fix #5)
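A rough sketch of what the fixed behavior could look like, expressed as
wrappers (the function names are invented for illustration):

    import binascii

    def unhexlify_robust(data):
        # Accept str or bytes and tolerate spaces (fixes #3 and #4);
        # always return bytes.
        if isinstance(data, str):
            data = data.encode("ascii")
        return binascii.unhexlify(data.replace(b" ", b""))

    def hexlify_str(data):
        # Always return a str rather than bytes (fix #5).
        return binascii.hexlify(data).decode("ascii")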
Hello there,
The last couple of days I've been working on an experimental rewrite of
the GIL. Since the work has turned out rather successful (or, at
least, not totally useless and crashing!) I thought I'd announce it
here.
First I want to stress this is not about removing the GIL. There still
is a Global Interpreter Lock which serializes access to most parts of
the interpreter. These protected parts haven't changed either, so Python
doesn't become really better at extracting computational parallelism out
of several cores.
Goals
-----
The new GIL (which is also the name of the sandbox area I've committed
it in, "newgil") addresses the following issues :
1) Switching by opcode counting. Counting opcodes is a very crude way of
estimating times, since the time spent executing a single opcode can
vary wildly. Literally, an opcode can be as short as a handful of
nanoseconds (think something like "... is not None") or as long as a
fraction of a second, or even longer (think calling a heavy
non-GIL-releasing C function, such as re.search()). Therefore, releasing
the GIL every 100 opcodes, regardless of their length, is a very poor
policy.
The new GIL does away with this by ditching _Py_Ticker entirely and
instead using a fixed interval (by default 5 milliseconds, but settable)
after which we ask the main thread to release the GIL and let another
thread be scheduled.
2) GIL overhead and efficiency in contended situations. Apparently, some
OSes (OS X mainly) have problems with lock performance when the lock is
already taken: the system calls are heavy. This is the "Dave Beazley
effect", where he takes a very trivial loop, therefore made of very short
opcodes and therefore releasing the GIL very often (probably 100000
times a second), and runs it in one or two threads on an OS with poor
lock performance (OS X). He sees a 50% increase in runtime when using
two threads rather than one, in what is admittedly a pathological case
(a minimal repro sketch appears at the end of this section).
Even on better platforms such as Linux, eliminating the overhead of many
GIL acquires and releases (since the new GIL is released on a fixed time
basis rather than on an opcode counting basis) yields slightly better
performance (read: a smaller performance degradation :-)) when there are
several pure Python computation threads running.
3) Thread switching latency. The traditional scheme merely releases the
GIL for a couple of CPU cycles, and reacquires it immediately.
Unfortunately, this doesn't mean the OS will automatically switch to
another, GIL-awaiting thread. In many situations, the same thread will
continue running. This, with the opcode counting scheme, is the reason
why some people have been complaining about latency problems when an I/O
thread competes with a computational thread (the I/O thread wouldn't be
scheduled right away when e.g. a packet arrives; or rather, it would be
scheduled by the OS, but unscheduled immediately when trying to acquire
the GIL, and it would be scheduled again only much later).
The new GIL improves on this by combining two mechanisms:
- forced thread switching, which means that when the switching interval
has elapsed (mentioned in 1) and the GIL is released, we will force
one of the threads waiting on the GIL to be scheduled instead of the
formerly GIL-holding thread. Which thread exactly is an OS decision,
however: the goal here is not to have our own scheduler (this could be
discussed, but I wanted the design to remain simple :-) After all,
man-years of work have been invested in scheduling algorithms by kernel
programming teams).
- priority requests, which are an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (ahead of any other
threads). This is meant to be used by GIL-releasing methods such as
read() on files and sockets. The scheme, again, is very simple: when a
priority request is made by a thread, the GIL is released as soon as
possible by the thread holding it (including in the eval loop), and then
the thread making the priority request is forcibly scheduled (by making
all other GIL-awaiting threads wait in the meantime).
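As an aside, the pathological case described in 2) is easy to reproduce
with something like the following sketch (timings will vary by machine
and OS; the loop size is arbitrary):

    import threading, time

    def count(n=10000000):
        # Trivial loop: very short opcodes, so under the old
        # opcode-counting scheme the GIL is released very often.
        while n:
            n -= 1

    t0 = time.time()
    count(); count()
    print("sequential: %.2fs" % (time.time() - t0))

    t0 = time.time()
    threads = [threading.Thread(target=count) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("two threads: %.2fs" % (time.time() - t0))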
Implementation
--------------
The new GIL is implemented using a couple of mutexes and condition
variables. A {mutex, condition} pair is used to protect the GIL itself,
which is a mere variable named `gil_locked` (there are a couple of other
variables for bookkeeping). Another {mutex, condition} pair is used for
forced thread switching (described above). Finally, a separate mutex is
used for priority requests (described above).
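As a rough Python-level analogy of this scheme (only the basic
{mutex, condition} pair; forced switching and priority requests are
omitted, and this is an illustration, not the actual C code):

    import threading

    class ToyGIL:
        def __init__(self, interval=0.005):
            self.cond = threading.Condition()  # protects gil_locked
            self.gil_locked = False
            self.interval = interval           # 5 ms default

        def acquire(self):
            with self.cond:
                while self.gil_locked:
                    # In the real code, timing out here is what triggers
                    # the request for the holder to release the GIL.
                    self.cond.wait(timeout=self.interval)
                self.gil_locked = True

        def release(self):
            with self.cond:
                self.gil_locked = False
                self.cond.notify()             # wake one awaiting thread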
The code is in the sandbox:
http://svn.python.org/view/sandbox/trunk/newgil/
The file of interest is Python/ceval_gil.h. Changes in other files are
very minimal, except for priority requests which have been added at
strategic places (some methods of I/O modules). Also, the code remains
rather short, while of course being less trivial than the old one.
NB : this is a branch of py3k. There should be no real difficulty
porting it back to trunk, provided someone wants to do the job.
Platforms
---------
I've implemented the new GIL for POSIX and Windows (tested under Linux
and Windows XP (running in a VM)). Judging by what I can read in the
online MSDN docs, the Windows support should include everything from
Windows 2000, and probably recent versions of Windows CE.
Other platforms aren't implemented, because I don't have access to the
necessary hardware. Besides, I must admit I'm not very motivated in
working on niche/obsolete systems. I've e-mailed Andrew MacIntyre in
private to ask him if he'd like to do the OS/2 support.
Supporting a new platform is not very difficult: it's a matter of
writing the 50-or-so lines of necessary platform-specific macros at the
beginning of Python/ceval_gil.h.
The reason I couldn't use the existing thread support
(Python/thread_*.h) is that these abstractions are too poor. Mainly,
they don't provide:
- events, conditions or an equivalent thereof
- the ability to acquire a resource with a timeout
Measurements
------------
Before starting this work, I wrote ccbench (*), a little benchmark
script ("ccbench" being a shorthand for "concurrency benchmark") which
measures two things:
- computation throughput with one or several concurrent threads
- latency to external events (I use a UDP socket) when there are zero,
one, or several background computation threads running
(*) http://svn.python.org/view/sandbox/trunk/ccbench/
The benchmark involves several computation workloads with different GIL
characteristics. By default there are 3 of them:
A- one pure Python workload (computation of a number of digits of pi):
that is, something which spends its time in the eval loop
B- one mostly C workload where the C implementation doesn't release the
GIL (regular expression matching)
C- one mostly C workload where the implementation does release the GIL
(bz2 compression)
In the ccbench directory you will find benchmark results, under Linux,
for two different systems I have here. The new GIL shows roughly similar
but slightly better throughput results than the old one. And it is much
better in the latency tests, especially in workload B (going down from
almost a second of average latency with the old GIL, to a couple of
milliseconds with the new GIL). This is the combined result of using a
time-based scheme (rather than opcode-based) and of forced thread
switching (rather than relying on the OS to actually switch threads when
we speculatively release the GIL).
As a side note, I might mention that single-threaded performance is not
degraded at all. It is, actually, theoretically a bit better because the
old ticker check in the eval loop becomes simpler; however, this goes
mostly unnoticed.
Now what remains to be done?
Having other people test it would be fine. Even better if you have an
actual multi-threaded py3k application. But ccbench results for other
OSes would be nice too :-)
(I get good results under the Windows XP VM but I feel that a VM is not
an ideal setup for a concurrency benchmark)
Of course, studying and reviewing the code is welcome. As for
integrating it into the mainline py3k branch, I guess we have to answer
these questions:
- is the approach interesting? (we could decide that it's just not worth
it, and that a good GIL can only be a dead (removed) GIL)
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows
support should cover most bases, but the fate of OS/2 support depends on
Andrew)?
Regards
Antoine.
Hi,
recently I wrote an algorithm in which I very often had to get an arbitrary
element from a set without removing it.
Three possibilities came to mind:
1. x = some_set.pop()
   some_set.add(x)
2. for x in some_set:
       break
3. x = iter(some_set).next()
Of course, the third should be the fastest. It nevertheless goes through all
the iterator creation stuff, which costs some time. I wondered, why the builtin
set does not provide a more direct and efficient way for retrieving some element
without removing it. Is there any reason for this?
I imagine something like
x = some_set.get()
or
x = some_set.pop(False)
and am thinking about providing a patch against setobject.c (preferring
the .get() solution, essentially a stripped-down pop()).
Before, I would like to know whether I have overlooked something or whether
this can be done in an already existing way.
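For reference, the three approaches are easy to compare with timeit
(a sketch; next(iter(s)) is the 2.6+/3.x spelling of option 3):

    import timeit

    setup = "s = set(range(1000))"
    for stmt in ("x = s.pop(); s.add(x)",   # option 1
                 "for x in s: break",       # option 2
                 "x = next(iter(s))"):      # option 3
        print("%-25s %.3f" % (stmt, timeit.timeit(stmt, setup=setup)))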
Thanks,
wr
Hello,
Since the addition of PEP 370 (per-user site packages), site.py and
distutils/command/install.py *both* provide the various installation
directories for Python, depending on the system and the Python version.
We have also started to discuss lately in various Mailing Lists the
addition of new schemes for IronPython and Jython, meaning that we
might add some more in both places.
I would like to suggest a simplification by adding a dedicated module
to manage these installation schemes in one single place in the
stdlib.
This new independent module would be used by site.py and distutils and
would also make it easier for third party code to work with these
schemes.
Of course this new module would be rather simple and would not add any
new import statement, to avoid any overhead when Python starts and loads
site.py.
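To make the idea concrete, such a module might look roughly like this
(all names and templates are hypothetical, a sketch of the shape rather
than a proposed API):

    # install_schemes.py -- hypothetical module
    SCHEMES = {
        'posix_prefix': {
            'purelib': '{base}/lib/python{py_version_short}/site-packages',
            'scripts': '{base}/bin',
        },
        'nt': {
            'purelib': '{base}/Lib/site-packages',
            'scripts': '{base}/Scripts',
        },
    }

    def get_path(name, scheme='posix_prefix', vars=None):
        # Expand one installation path template with the given variables.
        return SCHEMES[scheme][name].format(**(vars or {}))

    # e.g. get_path('purelib', vars={'base': '/usr',
    #                                'py_version_short': '2.7'})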
Regards
Tarek
Is there any possibility of backporting support for the nonlocal keyword
into a 2.x release? I see it's not in 2.6, but I don't know if that was an
intentional design choice or due to a lack of demand / round tuits. I'm
also not sure if this would fall under the scope of the proposed
moratorium on new language features (although my first impression was
that it could be allowed since it already exists in Python 3).
One of my motivations for asking is a recent blog post by Fernando Perez of
IPython fame that describes an interesting decorator-based idiom inspired by
Apple's Grand Central Dispatch which would allow many interesting
possibilities for expressing parallelization and other manipulations of
execution context for blocks of python code. Unfortunately, using the
technique to its fullest extent requires the nonlocal keyword.
The blog post is here:
https://cirl.berkeley.edu/fperez/py4science/decorators.html
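For readers who haven't seen the idiom, a minimal Python 3 example of
why it needs nonlocal (the decorator name is invented for illustration):

    def run_now(func):
        # Hypothetical decorator: execute the decorated block immediately.
        func()
        return func

    def outer():
        result = None
        @run_now
        def block():
            nonlocal result  # rebinds outer's variable; no 2.x equivalent
            result = 42
        return result

    print(outer())  # prints 42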
Mike
I propose the following PEP for inclusion to Python 3.1.
Please comment.
Regards,
Martin
Abstract
========
Namespace packages are a mechanism for splitting a single Python
package across multiple directories on disk. In current Python
versions, an algorithm to compute the package's __path__ must be
formulated. With the enhancement proposed here, the import machinery
itself will construct the list of directories that make up the
package.
Terminology
===========
Within this PEP, the term package refers to Python packages as defined
by Python's import statement. The term distribution refers to
separately installable sets of Python modules as stored in the Python
package index, and installed by distutils or setuptools. The term
vendor package refers to groups of files installed by an operating
system's packaging mechanism (e.g. Debian or Red Hat packages installed
on Linux systems).
The term portion refers to a set of files in a single directory (possibly
stored in a zip file) that contribute to a namespace package.
Namespace packages today
========================
Python currently provides the pkgutil.extend_path to denote a package as
a namespace package. The recommended way of using it is to put::
from pkgutil import extend_path
__path__ = extend_path(__path__, __name__)
in the package's ``__init__.py``. Every distribution needs to provide
the same contents in its ``__init__.py``, so that extend_path is
invoked independent of which portion of the package gets imported
first. As a consequence, the package's ``__init__.py`` cannot
practically define any names, since which portion is imported first
depends on the order of the package fragments on sys.path. As a special
feature, extend_path reads files named ``*.pkg`` which allow additional
portions to be declared.
setuptools provides a similar function pkg_resources.declare_namespace
that is used in the form::
import pkg_resources
pkg_resources.declare_namespace(__name__)
In the portion's __init__.py, no assignment to __path__ is necessary,
as declare_namespace modifies the package __path__ through sys.modules.
As a special feature, declare_namespace also supports zip files, and
registers the package name internally so that future additions to sys.path
by setuptools can properly add additional portions to each package.
setuptools allows declaring namespace packages in a distribution's
setup.py, so that distribution developers don't need to put the
magic __path__ modification into __init__.py themselves.
Rationale
=========
The current imperative approach to namespace packages has led to
multiple slightly-incompatible mechanisms for providing namespace
packages. For example, pkgutil supports ``*.pkg`` files; setuptools
doesn't. Likewise, setuptools supports inspecting zip files, and
supports adding portions to its _namespace_packages variable, whereas
pkgutil doesn't.
In addition, the current approach causes problems for system vendors.
Vendor packages typically must not provide overlapping files, and an
attempt to install a vendor package that has a file already on disk
will fail or cause unpredictable behavior. As vendors might choose to
package distributions such that they will end up all in a single
directory for the namespace package, all portions would contribute
conflicting __init__.py files.
Specification
=============
Rather than using an imperative mechanism for importing packages, a
declarative approach is proposed here, as an extension to the existing
``*.pkg`` mechanism.
The import statement is extended so that it directly considers ``*.pkg``
files during import; a directory is considered a package if it either
contains a file named __init__.py, or a file whose name ends with
".pkg".
In addition, the format of the ``*.pkg`` file is extended: a line with
the single character ``*`` indicates that the entire sys.path will
be searched for portions of the namespace package at the time the
namespace package is imported.
Importing a package will immediately compute the package's __path__;
the ``*.pkg`` files are not considered anymore after the initial import.
If a ``*.pkg`` file contains an asterisk, this asterisk is prepended
to the package's __path__ to indicate that the package is a namespace
package (and that thus further extensions to sys.path might also
want to extend __path__). At most one such asterisk gets prepended
to the path.
extend_path will be extended to recognize namespace packages according
to this PEP, and avoid adding directories twice to __path__.
No other change to the importing mechanism is made; searching
modules (including __init__.py) will continue to stop at the first
module encountered.
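For illustration, a hypothetical on-disk layout under this proposal (the
distribution and package names are invented)::

    site-packages/
        zope/
            zope.interface.pkg    (contains the single line: *)
            interface/
                __init__.py

The ``zope`` directory has no ``__init__.py`` but is still recognized as
a package because it contains a ``*.pkg`` file; the asterisk in that
file causes all of sys.path to be searched for further ``zope``
portions.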
Discussion
==========
With the addition of ``*.pkg`` files to the import mechanism, namespace
packages can stop filling out the namespace package's __init__.py.
As a consequence, extend_path and declare_namespace become obsolete.
It is recommended that distributions put a file <distribution>.pkg
into their namespace packages, with a single asterisk. This allows
vendor packages to install multiple portions of namespace package
into a single directory, with no risk of overlapping files.
Namespace packages can start providing non-trivial __init__.py
implementations; to do so, it is recommended that a single distribution
provides a portion with just the namespace package's __init__.py
(and potentially other modules that belong to the namespace package
proper).
The mechanism is mostly compatible with the existing namespace
mechanisms. extend_path will be adjusted to this specification;
any other mechanism might cause portions to get added twice to
__path__.
Copyright
=========
This document has been placed in the public domain.
Another summit, another potential time to see if people want to change
anything about the issue tracker. I would bring up:
- Dropping Stage in favor of some keywords (e.g. 'needs unit test', 'needs
docs')
- Adding a freestyle text box to delineate which, if any, stdlib module is
the cause of a bug and tie that into Misc/maintainers.rst; would potentially
scale back the Component box
-Brett