I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Alright, I will re-submit with the contents pasted. I never use double
backquotes as I think them rather ugly; that is the work of an editor
or some automated program in the chain. Plus, it also messed up my
line formatting and now I have lines with one word on them... Anyway,
the contents of PEP 3145:
PEP: 3145
Title: Asynchronous I/O For subprocess.Popen
Author: (James) Eric Pruitt, Charles R. McCreary, Josiah Carlson
Type: Standards Track
Content-Type: text/plain
Created: 04-Aug-2009
Python-Version: 3.2
Abstract:
In its present form, the subprocess.Popen implementation is prone to
dead-locking and blocking of the parent Python script while waiting on data
from the child process.
Motivation:
A search for "python asynchronous subprocess" will turn up numerous
accounts of people wanting to execute a child process and communicate with
it from time to time reading only the data that is available instead of
blocking to wait for the program to produce data [1] [2] [3]. The current
behavior of the subprocess module is that when a user sends or receives
data via the stdin, stderr and stdout file objects, dead locks are common
and documented [4] [5]. While communicate can be used to alleviate some of
the buffering issues, it will still cause the parent process to block while
attempting to read data when none is available to be read from the child
process.
Rationale:
There is a documented need for asynchronous, non-blocking functionality in
subprocess.Popen [6] [7] [2] [3]. Inclusion of the code would improve the
utility of the Python standard library that can be used on Unix based and
Windows builds of Python. Practically every I/O object in Python has a
file-like wrapper of some sort. Sockets already act as such and for
strings there is StringIO. Popen can be made to act like a file by simply
using the methods attached the the subprocess.Popen.stderr, stdout and
stdin file-like objects. But when using the read and write methods of
those options, you do not have the benefit of asynchronous I/O. In the
proposed solution the wrapper wraps the asynchronous methods to mimic a
file object.
Reference Implementation:
I have been maintaining a Google Code repository that contains all of my
changes including tests and documentation [9] as well as blog detailing
the problems I have come across in the development process [10].
I have been working on implementing non-blocking asynchronous I/O in the
subprocess.Popen module as well as a wrapper class for subprocess.Popen
that makes it so that an executed process can take the place of a file by
duplicating all of the methods and attributes that file objects have.
There are two base functions that have been added to the subprocess.Popen
class: Popen.send and Popen._recv, each with two separate implementations,
one for Windows and one for Unix based systems. The Windows
implementation uses ctypes to access the functions needed to control pipes
in the kernel 32 DLL in an asynchronous manner. On Unix based systems,
the Python interface for file control serves the same purpose. The
different implementations of Popen.send and Popen._recv have identical
arguments to make code that uses these functions work across multiple
platforms.
When calling the Popen._recv function, it requires the pipe name be
passed as an argument so there exists the Popen.recv function that passes
selects stdout as the pipe for Popen._recv by default. Popen.recv_err
selects stderr as the pipe by default. "Popen.recv" and "Popen.recv_err"
are much easier to read and understand than "Popen._recv('stdout' ..." and
"Popen._recv('stderr' ..." respectively.
Since the Popen._recv function does not wait on data to be produced
before returning a value, it may return empty bytes. Popen.asyncread
handles this issue by returning all data read over a given time
interval.
The ProcessIOWrapper class uses the asyncread and asyncwrite functions to
allow a process to act like a file so that there are no blocking issues
that can arise from using the stdout and stdin file objects produced from
a subprocess.Popen call.
References:
[1] [ python-Feature Requests-1191964 ] asynchronous Subprocess
http://mail.python.org/pipermail/python-bugs-list/2006-December/
036524.html
[2] Daily Life in an Ivory Basement : /feb-07/problems-with-subprocess
http://ivory.idyll.org/blog/feb-07/problems-with-subprocess
[3] How can I run an external command asynchronously from Python? - Stack
Overflow
http://stackoverflow.com/questions/636561/how-can-i-run-an-external-
command-asynchronously-from-python
[4] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
[5] 18.1. subprocess - Subprocess management - Python v2.6.2 documentation
http://docs.python.org/library/subprocess.html#subprocess.Popen.kill
[6] Issue 1191964: asynchronous Subprocess - Python tracker
http://bugs.python.org/issue1191964
[7] Module to allow Asynchronous subprocess use on Windows and Posix
platforms - ActiveState Code
http://code.activestate.com/recipes/440554/
[8] subprocess.rst - subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev/source/browse/doc/subprocess.rst?spec=s…
[9] subprocdev - Project Hosting on Google Code
http://code.google.com/p/subprocdev
[10] Python Subprocess Dev
http://subdev.blogspot.com/
Copyright:
This P.E.P. is licensed under the Open Publication License;
http://www.opencontent.org/openpub/.
On Tue, Sep 8, 2009 at 22:56, Benjamin Peterson <benjamin(a)python.org> wrote:
> 2009/9/7 Eric Pruitt <eric.pruitt(a)gmail.com>:
>> Hello all,
>>
>> I have been working on adding asynchronous I/O to the Python
>> subprocess module as part of my Google Summer of Code project. Now
>> that I have finished documenting and pruning the code, I present PEP
>> 3145 for its inclusion into the Python core code. Any and all feedback
>> on the PEP (http://www.python.org/dev/peps/pep-3145/) is appreciated.
>
> Hi Eric,
> One of the reasons you're not getting many response is that you've not
> pasted the contents of the PEP in this message. That makes it really
> easy for people to comment on various sections.
>
> BTW, it seems like you were trying to use reST formatting with the
> text PEP layout. Double backquotes only mean something in reST.
>
>
> --
> Regards,
> Benjamin
>
Hi,
Python code should not depend upon the ordering of items in a dict.
Unfortunately it seems that a number of tests in the standard library do
just that.
Changing PyDict_MINSIZE from 8 to either 4 or 16 causes the following
tests to fail:
test_dis test_email test_inspect test_nntplib test_packaging
test_plistlib test_pprint test_symtable test_trace
test_sys also fails, but this is a legitimate failure in sys.getsizeof()
Changing the collision resolution function from f(n) = 5n + 1 to
f(n) = n + 1 results in the same failures, except for test_packaging and
test_symtable which pass.
Finally, changing the seed in unicode_hash() from (implicit) 0 to an
arbitrary value (12345678) causes the above tests to fail plus:
test_json test_set test_ttk_textonly test_urllib test_urlparse
I think this is a real issue as the unicode_hash() function is likely to
change soon due to http://bugs.python.org/issue13703.
Should I:
1. Submit one big bug report?
2. Submit a bug report for each "failing" test separately?
3. Ignore it, since the tests only fail when I start messing about?
Cheers,
Mark.
In reviewing a fix for the metaclass calculation in __build_class__
[1], I realised that PEP 3115 poses a potential problem for the common
practice of using "type(name, bases, ns)" for dynamic class creation.
Specifically, if one of the base classes has a metaclass with a
significant __prepare__() method, then the current idiom will do the
wrong thing (and most likely fail as a result), since "ns" will
probably be an ordinary dictionary instead of whatever __prepare__()
would have returned.
Initially I was going to suggest making __build_class__ part of the
language definition rather than a CPython implementation detail, but
then I realised that various CPython specific elements in its
signature made that a bad idea.
Instead, I'm thinking along the lines of an
"operator.prepare(metaclass, bases)" function that does the metaclass
calculation dance, invoking __prepare__() and returning the result if
it exists, otherwise returning an ordinary dict. Under the hood we
would refactor this so that operator.prepare and __build_class__ were
using a shared implementation of the functionality at the C level - it
may even be advisable to expose that implementation via the C API as
PyType_PrepareNamespace().
The correct idiom for dynamic type creation in a PEP 3115 world would then be:
from operator import prepare
cls = type(name, bases, prepare(type, bases))
Thoughts?
Cheers,
Nick.
[1] http://bugs.python.org/issue1294232
--
Nick Coghlan | ncoghlan(a)gmail.com | Brisbane, Australia
Hi,
Now that issue 13703 has been largely settled,
I want to propose my new dictionary implementation again.
It is a little more polished than before.
https://bitbucket.org/markshannon/hotpy_new_dict
Object-oriented benchmarks use considerably less memory and are
sometimes faster (by a small amount).
(I've only benchmarked on my old 32bit machine)
E.g 2to3 No speed change -28% memory
GCbench +10% speed -47% memory
Other benchmarks show little or no change in behaviour,
mainly minor memory savings.
If an application is OO and uses lots of memory
the new dict will save a lot of memory and maybe boost performance.
Other applications will be largely unaffected.
It passes all the tests.
(I had to change a couple that relied on dict repr() ordering)
Cheers,
Mark.
Hi,
I'm working on the hash collision issue since 2 or 3 weeks. I
evaluated all solutions and I think that I have now a good knowledge
of the problem and how it should be solved. The major issue is to have
a minor or no impact on applications (don't break backward
compatibility). I saw three major solutions:
- use a randomized hash
- use two hashes, a randomized hash and the actual hash kept for
backward compatibility
- count collisions on dictionary lookup
Using a randomized hash does break a lot of tests (e.g. tests relying
on the representation of a dictionary). The patch is huge, too big to
backport it directly on stable versions. Using a randomized hash may
also break (indirectly) real applications because the application
output is also somehow "randomized". For example, in the Django test
suite, the HTML output is different at each run. Web browsers may
render the web page differently, or crash, or ... I don't think that
Django would like to sort attributes of each HTML tag, just because we
wanted to fix a vulnerability.
Randomized hash has also a major issue: if the attacker is able to
compute the secret, (s)he can easily compute collisions and exploit
the hash collision vulnerability again. I don't know exactly how
complex it is to compute the secret, but our hash function is weak (it
is far from being cryptographic, it is really simple to run it
backward). If someone writes a fast function to compute the secret, we
will go back to the same point.
IMO using two hashes has the same disavantages of the randomized hash
solution, whereas it is more complex to implement.
The last solution is very simple: count collision and raise an
exception if it hits a limit. The path is something like 10 lines
whereas the randomized hash is more close to 500 lines, add a new
file, change Visual Studio project file, etc. First I thaught that it
would break more applications than the randomized hash, but I tried on
Django: the test suite fails with a limit of 20 collisions, but not
with a limit of 50 collisions, whereas the patch uses a limit of 1000
collisions. According to my basic tests, a limit of 35 collisions
requires a dictionary with more than 10,000,000 integer keys to raise
an error. I am not talking about the attack, but valid data.
More details about my tests on the Django test suite:
http://bugs.python.org/issue13703#msg151620
--
I propose to solve the hash collision vulnerability by counting
collisions because it does fix the vulnerability with a minor or no
impact on applications or backward compatibility. I don't see why we
should use a different fix for Python 3.3. If counting collisons
solves the issue for stable versions, it is also enough for Python
3.3. We now know all issues of the randomized hash solution, and I
think that there are more drawbacks than advantages. IMO the
randomized hash is overkill to fix the hash collision issue.
I just have some requests on Marc Andre Lemburg patch:
- the limit should be configurable: a new function in the sys module
should be enough. It may be private (or replaced by an environment
variable?) in stable versions
- the set type should also be patched (I didn't check if it is
vulnerable or not using the patch)
- the patch has no test! (a class with a fixed hash should be enough
to write a test)
- the limit must be documented somwhere
- the exception type should be different than KeyError
Victor
Hello everyone,
In effort to get a fix out before Perl 6 goes mainstream, Barry and I
have decided to pronounce on what we want for our stable releases.
What we have decided is that
1. Simple hash randomization is the way to go. We think this has the
best chance of actually fixing the problem while being fairly
straightforward such that we're comfortable putting it in a stable
release.
2. It will be off by default in stable releases and enabled by an
envar at runtime. This will prevent code breakage from dictionary
order changing as well as people depending on the hash stability.
--
Regards,
Benjamin
Hi,
In issues #13882 and #11457, I propose to add an argument to functions
returning timestamps to choose the timestamp format. Python uses float
in most cases whereas float is not enough to store a timestamp with a
resolution of 1 nanosecond. I added recently time.clock_gettime() to
Python 3.3 which has a resolution of a nanosecond. The (first?) new
timestamp format will be decimal.Decimal because it is able to store
any timestamp in any resolution without loosing bits. Instead of
adding a boolean argument, I would prefer to support more formats. My
last patch provides the following formats:
- "float": float (used by default)
- "decimal": decimal.Decimal
- "datetime": datetime.datetime
- "timespec": (sec, nsec) tuple # I don't think that we need it, it
is just another example
The proposed API is:
time.time(format="datetime")
time.clock_gettime(time.CLOCK_REALTIME, format="decimal")
os.stat(path, timestamp="datetime)
etc.
This API has an issue: importing the datetime or decimal object is
implicit, I don't know if it is really an issue. (In my last patch,
the import is done too late, but it can be fixed, it is not really a
matter.)
Alexander Belopolsky proposed to use
time.time(format=datetime.datetime) instead.
--
The first step would be to add an argument to functions returning
timestamps. The second step is to accept these new formats (Decimal?)
as input, for datetime.datetime.fromtimestamp() and os.utime() for
example.
(Using decimal.Decimal, we may remove os.utimens() and use the right
function depending on the timestamp resolution.)
--
I prefer Decimal over a dummy tuple like (sec, nsec) because you can
do arithmetic on it: t2-t1, a+b, t/k, etc. It stores also the
resolution of the clock: time.time() and time.clock_gettime() have for
example different resolution (sec, ms, us for time.time() and ns for
clock_gettime()).
The decimal module is still implemented in Python, but there is
working implementation in C which is much faster. Store timestamps as
Decimal can be a motivation to integrate the C implementation :-)
--
Examples with the time module:
$ ./python
Python 3.3.0a0 (default:52f68c95e025+, Jan 26 2012, 21:54:31)
>>> import time
>>> time.time()
1327611705.948446
>>> time.time('decimal')
Decimal('1327611708.988419')
>>> t1=time.time('decimal'); t2=time.time('decimal'); t2-t1
Decimal('0.000550')
>>> t1=time.time('float'); t2=time.time('float'); t2-t1
5.9604644775390625e-06
>>> time.clock_gettime(time.CLOCK_MONOTONIC, 'decimal')
Decimal('1211833.389740312')
>>> time.clock_getres(time.CLOCK_MONOTONIC, 'decimal')
Decimal('1E-9')
>>> time.clock()
0.12
>>> time.clock('decimal')
Decimal('0.120000')
Examples with os.stat:
$ ./python
Python 3.3.0a0 (default:2914ce82bf89+, Jan 30 2012, 23:07:24)
>>> import os
>>> s=os.stat("setup.py", timestamp="datetime")
>>> s.st_mtime - s.st_ctime
datetime.timedelta(0)
>>> print(s.st_atime - s.st_ctime)
52 days, 1:44:06.191293
>>> os.stat("setup.py", timestamp="timespec").st_ctime
(1323458640, 702327236)
>>> os.stat("setup.py", timestamp="decimal").st_ctime
Decimal('1323458640.702327236')
Victor
Hi everyone,
I think Py3.3 would be a good milestone for cleaning up the stdlib support
for XML. Note upfront: you may or may not know me as the maintainer of
lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post
was triggered by the following kind of conversation that I keep having with
new XML users in Python (mostly on c.l.py), which hints at some serious
flaw in the stdlib.
User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
Me: What library are you using? Could you show us some code?
User: My code looks like this snippet: ...
Me: You are using minidom which is known to be hard to use, slow and uses
lots of memory. Use the xml.etree.ElementTree package instead, or rather
its C implementation cElementTree, also in the stdlib.
User (coming back after a while): thanks, that was exactly what [I didn't
know] I was looking for.
What does this tell us?
1) MiniDOM is what new users find first. It's highly visible because there
are still lots of ancient "Python and XML" web pages out there that date
back from the time before Python 2.5 (or rather something like 2.2), when
it was the only XML tree library in the stdlib. It's also the first hit
from the top when you search for "XML" on the stdlib docs page and contains
the (to some people) familiar word "DOM", which lets users stop their
search and start writing code, not expecting to find a separate alternative
in the same stdlib, way further down. And the description as "mini",
"simple" and "lightweight" suggests to users that it's going to be easy to
use and efficient.
2) MiniDOM is not what users want. It leads to complicated, unpythonic code
and lots of problems. It is neither easy to use, nor efficient, nor
"lightweight", "simple" or "mini", not in absolute numbers (see
http://bugs.python.org/issue11379#msg148584 and following for a recent
discussion). It's also badly maintained in the sense that its performance
characteristics could likely be improved, but no-one is seriously
interested in doing that, because it would not lead to something that
actually *is* fast or memory friendly compared to any of the 'real'
alternatives that are available right now.
3) ElementTree is what users should use, MiniDOM is not. ElementTree was
added to the stdlib in Py2.5 on popular demand, exactly because it is very
easy to use, very fast, and very memory friendly. And because users did not
want to use MiniDOM any more. Today, ElementTree has a rather straight
upgrade path towards lxml.etree if more XML features like validation or
XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.
4) In the stdlib, cElementTree is independent of ElementTree, but totally
hidden in the documentation. In conversations like the above, it's
unnecessarily complex to explain to users that there is ElementTree (which
is documented in the stdlib), but that what they want to use is really
cElementTree, which has the same API but does not have a stdlib
documentation page that I can send them to. Note that the other Python
implementations simply provide cElementTree as an alias for ElementTree.
That leaves CPython as the only Python implementation that really has these
two separate modules.
So, there are many problems here. And I think they make it unnecessarily
complicated for users to process XML in Python and that the current
situation helps in turning away new users from Python as a language for XML
processing. Python does have impressively great tools for working with XML.
It's just that the stdlib and its documentation do not reflect or even
appreciate that.
What should change?
a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording that
it uses now, it should be honest about the performance characteristics of
MiniDOM and should actively suggest that those who don't know what to
choose (or even *that* they can choose) should not use MiniDOM in the first
place. I created a ticket (issue11379) for a minor step in this direction,
but given the responses, I'm rather convinced that there's a lot more that
can be done and should be done, and that it should be done now, right for
the next release.
b) cElementTree should finally loose it's "special" status as a separate
library and disappear as an accelerator module behind ElementTree. This has
been suggested a couple of times already, and AFAIR, there was some
opposition because 1) ET was maintained outside of the stdlib and 2) the
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2
was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent
Xicluna (who is doing a good job with it), and ET 1.3 has basically made
the APIs of both implementations compatible again. So, 3.3 would be the
right milestone for fixing the "two libs for one" quirk.
Given that this is the third time during the last couple of years that I'm
suggesting to finally fix the stdlib and its documentation, I won't provide
any further patches before it has finally been accepted that a) this is a
problem and b) it should be fixed, thus allowing the patches to actually
serve a purpose. If we can agree on that, I'll happily help in making this
change happen.
Stefan
Hello,
I've noted that distutils manages depends in a way I cannot understand.
Suppose I have a minimal setup.py:
from distutils.core import setup, Extension
setup(
name='foo',
version='1.0',
ext_modules=[
Extension('foo',
sources=['foo.c'],
depends=['fop.conf'] # <---- note the typo foo->fop
),
]
)
Now setup.py will rebuild all every time, this is because the policy of
newer_group in build_extension is to consider 'newer' any missing file.
http://bit.ly/build_ext_471
def build_extension(self, ext):
...
depends = sources + ext.depends
if not (self.force or newer_group(depends, ext_path, 'newer')):
logger.debug("skipping '%s' extension (up-to-date)", ext.name)
return
else:
logger.info("building '%s' extension", ext.name)
...
Can someone suggest me the reason of this choice instead of
missing='error' (at least for ext.depends)?
Cheers,
Matteo