I've received some enthusiastic emails from someone who wants to
revive restricted mode. He started out with a bunch of patches to the
CPython runtime using ctypes, which he attached to an App Engine bug:
http://code.google.com/p/googleappengine/issues/detail?id=671
Based on his code (the file secure.py is all you need, included in
secure.tar.gz) it seems he believes the only security leaks are
__subclasses__, gi_frame and gi_code. (I have since convinced him that
if we add "restricted" guards to these attributes, he doesn't need the
functions added to sys.)
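For context, the kind of leak in question: from any object, restricted
code can climb back up to object and enumerate every loaded subclass,
reaching types it was never supposed to see. A minimal sketch (Python 2):
    # Climb from an empty tuple up to <type 'object'>...
    base = ().__class__.__bases__[0]
    # ...then enumerate every subclass the interpreter knows about.
    for cls in base.__subclasses__():
        if cls.__name__ == 'file':   # Python 2 only
            print(cls)               # a handle on the file type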
I don't recall the exploits that Samuele once posted that caused the
death of rexec.py -- does anyone recall, or have a pointer to the
threads?
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
I propose the following PEP for inclusion to Python 3.1.
Please comment.
Regards,
Martin
Abstract
========
Namespace packages are a mechanism for splitting a single Python
package across multiple directories on disk. In current Python
versions, an algorithm to compute the package's __path__ must be
formulated. With the enhancement proposed here, the import machinery
itself will construct the list of directories that make up the
package.
Terminology
===========
Within this PEP, the term package refers to Python packages as defined
by Python's import statement. The term distribution refers to
separately installable sets of Python modules as stored in the Python
package index, and installed by distutils or setuptools. The term
vendor package refers to groups of files installed by an operating
system's packaging mechanism (e.g. Debian or Redhat packages install
on Linux systems).
The term portion refers to a set of files in a single directory (possibly
stored in a zip file) that contribute to a namespace package.
Namespace packages today
========================
Python currently provides pkgutil.extend_path to denote a package as
a namespace package. The recommended way of using it is to put::
    from pkgutil import extend_path
    __path__ = extend_path(__path__, __name__)
in the package's ``__init__.py``. Every distribution needs to provide
the same contents in its ``__init__.py``, so that extend_path is
invoked independently of which portion of the package gets imported
first. As a consequence, the package's ``__init__.py`` cannot
practically define any names, since which portion is imported first
depends on the order of the package fragments on sys.path. As a
special feature, extend_path reads files named ``*.pkg``, which allow
additional portions to be declared.
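As an illustration (with purely hypothetical paths), a file
``zope.pkg`` placed next to the ``zope`` directory in a sys.path entry
could declare extra directories, one per line; each line is appended
to the package's __path__, and lines starting with ``#`` are ignored::
    # additional portions of the zope package
    /usr/lib/zope-extras/zope
    /opt/zope-addons/zope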
setuptools provides a similar function pkg_resources.declare_namespace
that is used in the form::
    import pkg_resources
    pkg_resources.declare_namespace(__name__)
In the portion's __init__.py, no assignment to __path__ is necessary,
as declare_namespace modifies the package __path__ through sys.modules.
As a special feature, declare_namespace also supports zip files, and
registers the package name internally so that future additions to sys.path
by setuptools can properly add additional portions to each package.
setuptools allows declaring namespace packages in a distribution's
setup.py, so that distribution developers don't need to put the
magic __path__ modification into __init__.py themselves.
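As an illustration (with a hypothetical distribution name), such a
declaration in setup.py looks like::
    from setuptools import setup, find_packages
    setup(
        name='zope.interface',
        version='1.0',
        packages=find_packages(),
        namespace_packages=['zope'],
    )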
Rationale
=========
The current imperative approach to namespace packages has led to
multiple slightly-incompatible mechanisms for providing namespace
packages. For example, pkgutil supports ``*.pkg`` files; setuptools
doesn't. Likewise, setuptools supports inspecting zip files, and
supports adding portions to its _namespace_packages variable, whereas
pkgutil doesn't.
In addition, the current approach causes problems for system vendors.
Vendor packages typically must not provide overlapping files, and an
attempt to install a vendor package that has a file already on disk
will fail or cause unpredictable behavior. As vendors might choose to
package distributions such that they will end up all in a single
directory for the namespace package, all portions would contribute
conflicting __init__.py files.
Specification
=============
Rather than using an imperative mechanism for importing packages, a
declarative approach is proposed here, as an extension to the existing
``*.pkg`` mechanism.
The import statement is extended so that it directly considers ``*.pkg``
files during import; a directory is considered a package if it either
contains a file named __init__.py, or a file whose name ends with
".pkg".
In addition, the format of the ``*.pkg`` file is extended: a line with
the single character ``*`` indicates that the entire sys.path will
be searched for portions of the namespace package at the time the
namespace package is imported.
Importing a package will immediately compute the package's __path__;
the ``*.pkg`` files are not considered anymore after the initial import.
If a ``*.pkg`` file contains an asterisk, this asterisk is prepended
to the package's __path__ to indicate that the package is a namespace
package (and that thus further extensions to sys.path might also
want to extend __path__). At most one such asterisk gets prepended
to the path.
extend_path will be extended to recognize namespace packages according
to this PEP, and avoid adding directories twice to __path__.
No other change to the importing mechanism is made; searching
modules (including __init__.py) will continue to stop at the first
module encountered.
Discussion
==========
With the addition of ``*.pkg`` files to the import mechanism,
distributions can stop filling out the namespace package's __init__.py.
As a consequence, extend_path and declare_namespace become obsolete.
It is recommended that distributions put a file <distribution>.pkg
into their namespace packages, with a single asterisk. This allows
vendor packages to install multiple portions of a namespace package
into a single directory, with no risk of overlapping files.
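As an illustration, a distribution named ``zope.interface``
(hypothetical name) would install the following layout, and any number
of sibling portions could share the ``zope/`` directory::
    zope/
        zope.interface.pkg    # contains a single line: *
        interface/
            __init__.py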
Namespace packages can start providing non-trivial __init__.py
implementations; to do so, it is recommended that a single distribution
provides a portion with just the namespace package's __init__.py
(and potentially other modules that belong to the namespace package
proper).
The mechanism is mostly compatible with the existing namespace
mechanisms. extend_path will be adjusted to this specification;
any other mechanism might cause portions to get added twice to
__path__.
Copyright
=========
This document has been placed in the public domain.
Hello everyone!
We have been encountering several deadlocks in a threaded Python
application which calls subprocess.Popen (i.e. fork()) in some of its
threads.
This has occurred on Python 2.4.1 on a 2.4.27 Linux kernel.
Preliminary analysis of the hang shows that the child process blocks
upon entering the execvp function, in which the import_lock is acquired
due to the following line:
    def _execvpe(file, args, env=None):
        from errno import ENOENT, ENOTDIR
        ...
It is known that when forking from a pthreaded application, the child
will deadlock if it tries to acquire a lock that another thread held
at the moment fork() was called.
Due to these oddities we were wondering if it would be better to extract
the above import line from the execvpe call, to prevent lock
acquisition attempts in such cases.
Another workaround could be re-assigning a new lock to import_lock
(such a thing is done with the global interpreter lock) at PyOS_AfterFork or
pthread_atfork.
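For what it's worth, the failure mode can be reproduced with a minimal
sketch (POSIX only; imp.acquire_lock stands in for a thread that is in
the middle of an import when fork() happens):
    import imp, os, threading, time
    def holder():
        imp.acquire_lock()   # simulate a thread holding the import lock
        time.sleep(5)
        imp.release_lock()
    threading.Thread(target=holder).start()
    time.sleep(0.5)          # make sure the lock is held before forking
    pid = os.fork()
    if pid == 0:
        # The child inherits the locked import lock but not the thread
        # that holds it, so this import blocks forever.
        from errno import ENOENT
        os._exit(0)
    os.waitpid(pid, 0)       # the parent hangs here, demonstrating the bug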
We'd appreciate any opinions you might have on the subject.
Thanks in advance,
Yair and Rotem
hey, has anyone investigated compiling python2.5 using winegcc, under wine?
i'm presently working my way through it, just for kicks, and was
wondering if anyone would like to pitch in or stare at the mess under
a microscope.
it's not as crazed as it sounds. cross-compiling python2.5 for win32
with mingw32 is an absolute miserable bitch of a job that goes
horribly wrong when you actually try to use the minimalist compiler to
do any real work.
so i figured that it would be easier to get python compiled using
wine. i _have_ got some success - a python script and a python.exe.so
(which is winegcc's friendly way of telling you you have something
that stands a chance of working) as well as a libpython25.dll.so.
what i _don't_ yet have is an _md5.dll (or should it be _md5.lib?)
i.e. the standard modules are a bit... iffy. the _winreg.o is
compiled; the _md5.o is compiled; the winreg.lib is not. whoops.
plus, it's necessary to enable dl_nt.c which is in PC/ _not_ in
Modules/.
one of the key issues that's a bit of a bitch is that python is
compiled up for win32 with a hard-coded pyconfig.h which someone went
to a _lot_ of trouble to create by hand instead of using autoconf. oh
- and it uses visualstudio so there's not even a Makefile. ignoring
that for the time-being was what allowed me to get as far as actually
having a python interpreter (with no c-based modules).
so there's a whole _stack_ of stuff that needs dragging kicking and
screaming into the 21st century.
there _is_ a reason why i want to do this. actually, there's two.
firstly, i sure as shit do _not_ want to buy, download, install _or_
run visual studio. i flat-out refuse to run an MS os and visual
studio runs like a dog under wine.
secondly, i want a python25.lib which i can use to cross-compile
modules for poor windows users _despite_ sticking to my principles and
keeping my integrity as a free software developer.
thirdly i'd like to cross-compile pywebkitgtk for win32
fourthly i'd like to compile and link applications to the extremely
successful and well wicked MSHTML.DLL... in the _wine_ project :) not
the one in windows (!) i want to experiment with DOM model
manipulation - from python - similar to the OLPC HulaHop project -
_but_ i want to compile or cross-compile everything from linux, not
windows (see 1 above)
fifthly i'd like to see COM (DCOM) working and pywin32 compiled and
useable under wine, even if it means having to get a license to use
dcom98 and oleauth.lib and oleauth.h etc. and all the developer files
needed to link DCOM applications under windows. actually what i'd
_really_ like to see is FreeDCE's DCOM work actually damn well
finished, it's only been eight years since wez committed the first
versions of the IDL and header files, and it's only been over fifteen
years since microsoft began its world domination using COM and DCOM.
... but that's another story :)
so that's ... five reasons not two. if anyone would like to
collaborate on a crazed project with someone who can't count, i'm
happy to make available what i've got up to so far, on github.org.
l.
Hi all,
In an attempt to figure out some twisted.web code, I was reading
through the Python Standard Library’s mimetypes module today, and
was shocked at the poor quality of the code. I wonder how the
mimetypes code made it into the standard library, and whether anyone
has ever bothered to read it or update it: it is an embarrassment.
Much of the code is redundant, portions fail to execute, control
flow is routed through a horribly confusing mess of spaghetti, and
most of the complexity has no clear benefit as far as I can tell. I
probably should drop the subject and get back to work, but as a good
citizen, it’s hard to just ignore this sort of thing.
mimetypes.py stores its types in a pair of dictionaries, one for
"strict" use, and the other for "non-standard types". It creates the
strict dictionary by default out of apache's mime.types file, and
then overrides the entries it finds with a set of exceptions. Then
it creates the non-standard dictionary, which is set to match if the
strict parameter is set to False when guessing types. Just in this
basic design, and in the list of types in the file, there are
several problems:
* Various apache mime types files are read, if found, but the
ordering of the files is such that older versions of apache are
sometimes read after newer ones, overriding updated mime types
with out-of-date versions if multiple versions of apache are
installed on the system.
* The vast majority of types declared in mimetypes.py are
duplicates of types already declared by Apache. In a few cases
this is to change the apache default (make an exception, that
is), but in most cases the mime type and extension are
completely identical. This huge number of redundant types makes
the file substantially harder to follow. No comments are
provided to explain why various sets of exceptions are made to
Apache's default mime types, and in several cases mimetypes.py
seems to just be out of date as compared to recent versions of
Apache, for instance not knowing about the 'text/troff' type
which was registered in January 2006 in RFC 4263.
* The 'non-standard' type dictionary is nearly useless, because
all of the types it declares are already in apache's mime.types
file, meaning that types are, as far as I can tell trying to
follow ugly program flow, *never* drawn from the non-strict
dictionary, except in the improbable situation where the
mimetypes module is initialized with a custom set of
apache-mime.types–like files, which does not include those
'non-standard' types. I personally cannot see a use case for
initializing the module with a custom set of mime types, but
then leaving the very few types included as non-strict to the
defaults: this seems like a fragile and pathological use case.
Given this, I don’t see any benefit to dragging the 'strict'
parameter along all the way through the code, and would advise
getting rid of it altogether. Does anyone know of any code that
uses the mimetypes module with strict set to False, where the
non-strict code path ever *actually* is executed?
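For reference, this is the code path in question (a small usage
sketch; the exact results depend on which mime.types files exist on
the system):
    import mimetypes
    # Strict lookup consults only the 'strict' dictionary.
    print(mimetypes.guess_type('report.pdf'))
    # -> ('application/pdf', None)
    # strict=False additionally consults the 'non-standard' dictionary,
    # which, per the above, rarely changes the answer in practice.
    print(mimetypes.guess_type('drawing.pict', strict=False))
    # -> ('image/pict', None)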
But though these problems, which affect actual use of the code and
are therefore probably most important, are significant, they really
pale in comparison to the awful quality of implementation. I'll try
to briefly outline my understanding of how code flows in
mimetypes.py, and what the problems are. I haven't stepped through
the code in a debugger, this is just from reading it, so I apologize
in advance if I get something wrong. This is, however, some of the
worst code I’ve seen in the standard library or anywhere else.
* It defines __all__: I didn’t even realize __all__ could be used
for single-file modules (w/o submodules), but it definitely
shouldn’t be here. This specific __all__ oddly does not include
all of the documented variables and functions in the mimetypes
class. It’s not clear why someone calling import * here wouldn’t
want the bits not included.
* It creates a _default_mime_types() function which declares a
bunch of global variables, and then immediately calls
_default_mime_types() below the definition. There is literally
no difference in result between this and just putting those
variables at the top level of the file, so I have no idea why
this function exists, except to make the code more confusing.
* It allows command line usage: I don’t think this is necessary
for a part of the standard library like this. There are better
tools for finding mime types from the command line which ship
with most operating systems.
* Its API is pretty poorly designed. It offers 6 functions when
about 3 are needed, and it takes a couple of read-throughs of the
code to figure out exactly what any of them are supposed to do.
* The operation is crazy: It defines a MimeTypes class which
actually stores the type mappings, but this class is designed to
be a singleton. The way that such a design is enforced is
through the use of the module-global 'init' function, which
makes an instance of the class, and then maps all of the
functions in the module global namespace to instance methods.
But confusingly, all such functions are also defined
independently of the init function, with definitions such as:
    def guess_type(url, strict=True):
        if not inited:
            init()
        return guess_type(url, strict)
I’d be amazed if anyone could guess what that code was trying to
do. I did a double-take when I saw it.
Of course, that return call is only ever reached the first time
this function is called, if init() has not happened yet. This
was all presumably done for lazy initialization, so that the
type information would only be loaded when needed. Needless to
say, there are more pythonic ways to accomplish such a goal.
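For instance, here is a sketch of one conventional approach: a private
accessor that builds the shared database on first use. This is
illustrative only, not the stdlib's actual code:
    _db = None
    def _get_db():
        global _db
        if _db is None:
            _db = MimeTypes()   # the module's own MimeTypes class
        return _db
    def guess_type(url, strict=True):
        # The public function is never rebound; it just delegates.
        return _get_db().guess_type(url, strict)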
Oh, also, the other good one here is that it means that someone
who writes `from mimetypes import guess_type` gets something
different than someone who writes:
`import mimetypes; guess_type = mimetypes.guess_type`. In the
former case, this wrapper function is saved as guess_type, which
on each call just calls the (rebound after init())
mimetypes.guess_type function. This caused a performance
nightmare before March of this year, when there was no check for
`if not inited` before running init() (amazing!?).
* Because the type datastore is set up to be a singleton, any time
init() is called in one section of code, it resets any types
which have been added manually: this means that if init() is
called by different pieces of code in the same python program,
they will interfere with each other's type databases, and break
each other. This is extremely fragile and, in my opinion, crazy.
It is hard for me to imagine any use case that would benefit
from this ability to clobber custom type mappings, and I very
much doubt that any code calling the mimetypes module realizes
that the contract of the API is so flimsy by definition. In
practice, I would not advise consumers of this API to ever call
init() manually, or to ever add custom mime type mappings,
because they are setting themselves up for hard-to-track bugs
down the line.
* The 'inited' flag is a documented part of the interface, in the
standard library documentation. I cannot imagine any reason to
set this flag manually: setting it to false when it was true
will have no effect, because the top-level functions have
already been replaced by instance methods of the 'db' MimeTypes
instance. Setting it to true when it was false will make the
code just break outright.
* In python 3, this has been changed a bit. There's still an
inited flag, and it is still in the docs, but now the awful code
from above has been changed slightly, to:
    def guess_type(url, strict=True):
        if _db is None:
            init()
        return _db.guess_type(url, strict)
Which is still embarrassingly confusing. On the upside, the
inited flag now does literally nothing, but remains defined, and
in the docs.
* The 'types_map' and 'common_types' (for 'strict' and
'common' types, respectively) dictionaries are also a documented
part of the interface. When init() is called, a new MimeTypes
instance makes a (different) types_map which is a tuple of two
dictionaries, for 'strict' and 'common' types. Then this
instance reads the apache mime.types files and adds the types to
its pair of self.types_map dictionaries, and then after that
looks at the global types_map and common_types dictionaries and
adds *those* types to its self.types_map. Then at the end it
replaces the global types_map with self.types_map[True] and
replaces common_types with self.types_map[False]. Unfortunately,
while changing these dictionaries will have an effect on the
operation of the library, it will not update the types_map_inv
mapping, so inverse lookups will not behave as the changer
expects. If these dictionaries are going to remain documented,
the documentation should clearly describe them as read-only, to
avoid very confusing bugs.
* Speaking of these dictionaries, .copy() is called on those two
and a few others inside MimeTypes.__init__(), which happens every
time the global init() function is called, but then init() puts
the copies back in the global namespace, meaning that the
original is discarded. Basically the only reason for the .copy()
is to make sure that the correct updates are applied to the
apache mimetype defaults, but the code will gladly re-read all
of the apache files even after its mapped types are already in
these dictionaries, essentially making re-initializing a (very
expensive) no-op. All we’re doing is a lot of unnecessary extra
disk reads and memory allocations and deallocations. The only
time this has any effect is when a non-singleton MimeTypes
instance is created, as in the read_mime_types function.
* And that read_mime_types function is a doozy. It tries to open a
filename, spits back None if there’s an IOError (instead of
raising the exception as it should), and then creates a new
MimeTypes instance (remember, this is identical to the singleton
MimeTypes instance because it starts itself from that one’s
mappings), adds any new types it finds in the file with that
name, and then returns the 'strict' types_map from it. I’m not
sure whether any sane user of this API would expect it to return
the existing type mappings *plus* the extra ones in the provided
filename, but I really can’t imagine this function ever being
particularly useful: it requires that you are reading mime types in
apache format, but not the apache mime type files you already
looked at, and then the only way to find out what new mappings
were defined is to take the difference of the default mappings
with the result of the function.
* The code itself, on a line-by-line basis, is unpythonic and
unnecessarily verbose, confusing, and slow. The code should be
rewritten to use python 2.3–2.6 features: even leaving its
functionality identical it could be cut to about half the number
of lines, and made clearer.
In case the above doesn’t make this clear: this code is extremely
confusing. Trying to read it has caused all the people around me to
look up as I shout "what the fuck??!" at the screen every few
minutes, as each new revelation gives another surprise. I’m not
convinced that I completely understand what the code does, because
it has been quite effectively obfuscated, but I understand enough to
want to throw the whole thing out, and start essentially from
scratch.
So the question is, what should be done about this? I’d like to hear
how people use the mimetypes module, and how they expect it to work,
to figure out the sanest possible mostly-backwards-compatible
replacement which could be dropped in (ideally this would just allow
the use of default mimetypes and rip out the ability to alter the
default datastore: or is there some easy way to change this away
from a singleton without breaking code which calls these methods?),
and then extend that replacement to support a somewhat saner model
for anyone who actually wants to extend the set of mappings. My
guess is that replacement code could actually fix subtle bugs in
existing uses of this module, by people who had a sane expectation
of how it was supposed to work.
At the very least, it would be a good idea to figure out exactly
which exceptions to Apache's set of default types are actually
useful, and I'd maybe even recommend including an up-to-date copy
of Apache’s mime.types file in the Python Standard Library, and then
only overriding its definitions for future versions of Apache (and
then overriding the combination of both of those with further
exceptions deemed useful for python, with comments explaining why
each exception), so that we’re not bothering to look up horribly
out-of-date types in multiple locations from Apache 1, 1.2, 1.3,
etc. I’d also recommend making the API for overriding definitions be
the same as the code used to declare the default overrides, because
as it is there are three ways to define types: a) in a mime.types
formatted file, b) in a python dictionary that gets initialized with
a confusing bit of code, and c) through the add_type function.
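Of those three, only (c) is a real API; for reference, a quick usage
sketch with a made-up type and extension:
    import mimetypes
    mimetypes.add_type('text/x-example', '.xmpl')
    print(mimetypes.guess_type('notes.xmpl'))
    # -> ('text/x-example', None)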
Does anyone else have thoughts about this, or maybe some good (it
had better be *really* good) explanations why this code is the way
it is? I'd be happy to try to rewrite it, but I think I’d need a bit
of help figuring out how to make the rewrite backwards-compatible.
Note: someone else has had fun with this module:
<http://lucumr.pocoo.org/2009/3/1/the-1000-speedup-or-the-stdlib-sucks>
<http://lucumr.pocoo.org/2009/7/24/singletons-and-their-problems-in-python>
Cheers,
Jacob Rus
Please review this. I'm worried that there are cases where
convertitem() returns a string that really should be overridden by the
argument "help string", and that this change will therefore get rid of
useful messages (given via the "; help string" format) where there
otherwise wouldn't be any.
PyArg_ParseTuple, when handling the string format "s", raises a
TypeError when the passed string contains a NUL. However, the
TypeError doesn't contain useful information.
For example:
    syslog.syslog('hello\0there')
    TypeError: [priority,] message string
This seems to be a thinko in Python/getargs.c at line 331:
    msg = convertitem(PyTuple_GET_ITEM(args, i), &format, p_va,
                      flags, levels, msgbuf,
                      sizeof(msgbuf), &freelist);
    if (msg) {
        seterror(i+1, msg, levels, fname, message);   <<< Line 331
        return cleanreturn(0, freelist);
    }
This also applies to Python 3 trunk in line 390.
I think that's supposed to be "msg" instead of "message" in the last
argument. If I change it, I get:
    >>> import syslog; syslog.syslog('hello\0there')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: must be string without null bytes, not str
I think it's safe to change "message" to "msg" because:
- "message" is the "help" string "[priority,] message string", but
  "msg" contains much more useful information.
- "msg" is in the "fall back if the last argument doesn't contain
  useful information" argument position, but "msg" is never NULL.
- "message" is only NULL if the format string doesn't contain ";".
In every case I can find in the code, convertitem() is returning a
much more useful string than the help string.
Or perhaps it should do something like:
    if (msg) {
        seterror(i+1, msg, levels, fname, '%s (%s)' % ( msg, message ));
Pardon my mixed C+Python, but you get the idea.
Thoughts?
Thanks,
Sean
--
[...] Premature optimization is the root of all evil.
-- Donald Knuth
Sean Reifschneider, Member of Technical Staff <jafo(a)tummy.com>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability
> On Tue, 28 Jul 2009 04:06:45 am Eric Pruitt wrote:
> > I am implementing the file wrapper using changes to subprocess.Popen
> > that also make it asynchronous and non-blocking so implementing "r+"
> > should be trivial to do. How about handling stderr? I have the
> > following ideas: leave out support for reading from stderr, make it
> > so that there is an optional additional argument like "outputstderr =
> > False", create another function that toggles / sets whether stderr or
> > stdout is returned or mix the two outputs.
>
> Leaving it out is always an option.
>
> As I see it, fundamentally you can either read from stdout and stderr as
> two different streams, or you can interleave (mix) them. To me, that
> suggests three functions:
>
> ProcessIOWrapper() # read from stdout (or write to stdin etc.)
> ProcessIOWrapperStdErr() # read/write from stderr
> ProcessIOWrapper2() # read from mixed stdout and stderr
>
> I don't like a function to toggle between one and the other: that smacks
> of relying on a global setting in a bad way. I suppose you could add an
> optional argument to ProcessIOWrapper() to select between stdout,
> stderr, or both together.
>
>
>
> --
> Steven D'Aprano
How would having a toggle function rely on a global setting? Each class would
simply have its own member variable like "self.readsstderr." It's a moot point
though as I've decided to use separate functions as you suggested. With
separate functions, the user doesn't have to worry about modifying the mode
keyword if stderr is needed and as an added bonus, it also makes the code
a lot more readable.
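For what it's worth, the shape of that design maps naturally onto the
stdlib; here's a simplified, blocking sketch (the real wrapper is
asynchronous, and these helper names are made up):
    import subprocess
    def read_stdout(cmd):
        # Read only the child's stdout.
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        out, _ = p.communicate()
        return out
    def read_stderr(cmd):
        # Read only the child's stderr.
        p = subprocess.Popen(cmd, stderr=subprocess.PIPE)
        _, err = p.communicate()
        return err
    def read_mixed(cmd):
        # Interleave stderr into stdout -- the "mixed" variant.
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        out, _ = p.communicate()
        return out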
Eric
I'm writing a C Python extension that needs to generate PyTypeObjects
dynamically. Unfortunately, the Py_TPFLAGS_HEAPTYPE flag is overloaded
in a way that makes it difficult to achieve this goal.
The documentation for Py_TPFLAGS_HEAPTYPE says:
    Py_TPFLAGS_HEAPTYPE
        This bit is set when the type object itself is allocated
        on the heap. In this case, the ob_type field of its
        instances is considered a reference to the type, and the
        type object is INCREF'ed when a new instance is created,
        and DECREF'ed when an instance is destroyed (this does not
        apply to instances of subtypes; only the type referenced
        by the instance's ob_type gets INCREF'ed or DECREF'ed).
This sounds like exactly what I want. I want my type object INCREF'd
and DECREF'd by its instances so it doesn't leak or get deleted
prematurely. If this were all that Py_TPFLAGS_HEAPTYPE did, it would
work great for me.
Unfortunately, Py_TPFLAGS_HEAPTYPE is also overloaded to mean
"user-defined type" (as opposed to a built-in type). It controls
numerous subtle behaviors such as:
- whether the type's name is module.type or just type.
- whether you're allowed to set __name__, __module__, or __bases__ on the type.
- whether you're allowed to set __class__ on instances of this type.
- whether the module name comes from the type name or the __module__ attribute.
- whether it will use type->tp_doc as the docstring
- whether its repr() calls it a "class" or a "type".
- whether you can set attributes of the type.
- whether someone is attempting the Carlo Verre hack.
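Several of these differences are visible from pure Python; for
instance, attribute assignment on types (a minimal demonstration):
    # Classes created by class statements are heap types, so
    # Py_TPFLAGS_HEAPTYPE is set and attribute assignment works.
    class Heap(object):
        pass
    Heap.greeting = 'hi'        # allowed
    # Built-in (non-heap) types refuse attribute assignment -- one
    # of the behaviors gated on Py_TPFLAGS_HEAPTYPE.
    try:
        int.greeting = 'hi'
    except TypeError as e:
        print(e)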
So I'm stuck with an unenviable choice. I think the lesser of two evils
is to *not* specify Py_TPFLAGS_HEAPTYPE, because the worst that will
happen is that my types will leak. This is not as bad as having someone
set __class__ on one of my instances, or set attributes on my type, etc.
Ideally the interpreter would have a separate flag like
Py_TPFLAGS_BUILTIN that would trigger all of the above behaviors, but
still make it possible to have dynamically generated built-in types get
garbage collected appropriately.
At the very least, the documentation I cited above should make it clear
that Py_TPFLAGS_HEAPTYPE controls more than just whether the type gets
INCREF'd and DECREF'd. Based on the list of behaviors I discovered
above, it is almost certainly not correct for a C extension type to be
declared with Py_TPFLAGS_HEAPTYPE.
Josh