[Python-ideas] Module aliases and/or "real names"

Sun Jan 9 07:39:24 CET 2011

On Sat, Jan 8, 2011 at 7:06 PM, Ron Adam <rrr at ronadam.com> wrote:
> On 01/06/2011 09:28 PM, Nick Coghlan wrote:
>> My original suggestion was along those lines, but I've come to the
>> conclusion that it isn't sufficiently granular - when existing code
>> tinkers with "__module__" it tends to do it at the object level rather
>> than by modifying __name__ in the module globals.
>
> What do you mean by *tinkers with "__module__"* ?
>
> Do you have an example where/when that is needed?

>>> from inspect import getsource
>>> from functools import partial
>>> partial.__module__
'functools'
>>> getsource(partial)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/inspect.py", line 689, in getsource
    lines, lnum = getsourcelines(object)
  File "/usr/lib/python2.6/inspect.py", line 678, in getsourcelines
    lines, lnum = findsource(object)
  File "/usr/lib/python2.6/inspect.py", line 552, in findsource
    raise IOError('could not find class definition')
IOError: could not find class definition

partial is actually implemented in C in the _functools module, hence
the failure of the getsource call. However, it officially lives in
functools for pickling purposes (other implementations aren't obliged
to provide _functools at all), so __module__ is adjusted
appropriately.
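
To make the pickling side of that concrete, here is a minimal sketch.
The module names "_shapes" and "shapes" and the class Point are
hypothetical stand-ins (for _functools, functools and partial), and the
modules are built in memory purely for illustration:

```python
import pickle
import sys
import types

# Hypothetical names: "_shapes" plays the private implementation module,
# "shapes" the public module that users are told to import from.
impl = types.ModuleType("_shapes")
public = types.ModuleType("shapes")

# Define the class inside the private module's namespace.
exec(
    "class Point:\n"
    "    def __init__(self, x, y):\n"
    "        self.x, self.y = x, y\n",
    impl.__dict__,
)
Point = impl.Point
implicit_module = Point.__module__  # taken from the defining module's __name__

# Re-export from the public module and adjust __module__ by hand,
# which is what existing code has to do today.
public.Point = Point
Point.__module__ = "shapes"
sys.modules["_shapes"] = impl
sys.modules["shapes"] = public

# Protocol 0 is used only because it stores the module name as readable text.
payload = pickle.dumps(Point(1, 2), protocol=0)
restored = pickle.loads(payload)

print(implicit_module)        # the name the class received by default
print(b"_shapes" in payload)  # whether the private name leaked into the pickle
print(restored.x, restored.y)
```

With the override in place the pickle only mentions the public name, so
it stays loadable on an implementation that has no private module at
all. The cost is that Point.__module__ no longer says where the class
was really defined.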

The other examples I have been using are the _datetime C acceleration
module and the unittest pseudo-package.

>> 1. Implicit configuration of __module__ attributes is updated to check
>> for a definition of "__import_name__" at the module level. If found,
>> then this is used as the value for the __module__ attribute.
>> Otherwise, __module__ is set to __name__ as usual.
>
> If __import_name__ is going to match __module__ everywhere else, why not
> just call it __module__ every where?

Because the module level attributes for identifying the module don't
serve the same purpose as the attributes identifying where functions
and classes are defined. That said, calling it "__module__" would
probably work, and make the naming logic a bit more intuitive. The
precedent for that attribute name to refer to a string rather than a
module object was set a long time ago, after all.

> Would __package__ be changed in any way?

To look for __module__ before checking __name__? No, since doing that
would make it unnecessarily difficult to use relative imports inside
pseudo-packages.

>> 2. Any code that currently sets a __module__ attribute (i.e. function
>> and class definitions) will also set an __impl_module__ attribute.
>> This attribute will always be set to the value of __name__.
>
> So we will have:  __package__, __module__, __import_name__, __impl_name__,
>  and if you also include __file__ and __path__, that makes six different
> attributes for describing where something came from.
>
> I don't know about you, but this bothers me a bit. :-/

It bothers me a lot, since I probably could have avoided at least some
of it by expanding the scope of PEP 366. However, it does help to
split them out into the different contexts and look at how each of
them is used, since it makes it clear that there are a lot of
attributes because there is a fair bit of information that is used in
different ways.

Module level attributes relating to location in the external environment:
  __file__: typically refers to a source file, but is not required to
(see PEP 302)
  __path__: package attribute used to identify the directory (or
directories) searched for submodules
  __loader__: PEP 302 loader reference (may not exist for ordinary
filesystem imports)
  __cached__: if it exists, refers to a compiled bytecode file (see PEP 3149)

  It is important to understand that ever since PEP 302, *there is no
loader independent mapping* between any of these external environment
related attributes and the module namespace. Some Python standard
library code (e.g. multiprocessing) currently assumes such a mapping
exists and it is broken on Windows right now as a direct result of
that incorrect assumption (other code explicitly disclaims support for
PEP 302 loaded modules and only works with actual files and
directories).

Module level attributes relating to location within the module namespace:
  __name__: actual name of current module in the current interpreter
instance. Best choice for introspection of the current interpreter.
  __module__ (*new*): "official" portable name for module contents
(components should never include leading underscores). Best choice for
information that should be portable to other interpreters (e.g. for
pickling and other serialisation formats)
  __package__: optional attribute used specifically to control
handling of relative imports. May be explicitly set (e.g. by runpy),
otherwise implicitly set to "__name__.rpartition('.')[0]" by the first
relative import.

  Most of the time, __name__ is consistent across all 3 use cases, in
which case __package__ and the new module level __module__ (proposed
earlier as __import_name__) are redundant. However, when __name__ is
wrong for some reason (e.g. including an implementation detail, or
adjusted to "__main__" for execution as a script), then __package__
allows relative imports to be fixed, while the module level __module__
will allow pickling and other operations that should hide
implementation details to be fixed.
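
The default rule for __package__ quoted above is simple enough to
sketch. The helper name implied_package is hypothetical, and the sketch
only covers ordinary modules (a package is its own __package__):

```python
def implied_package(module_name):
    """Package name implied by a dotted module name (ordinary modules only)."""
    # rpartition splits on the LAST dot; element 0 is everything before it,
    # or the empty string when the name contains no dot at all.
    return module_name.rpartition(".")[0]


names = ("unittest.case", "xml.etree.ElementTree", "functools", "__main__")
examples = {name: implied_package(name) for name in names}

for name, package in examples.items():
    print(repr(name), "->", repr(package))
```

Note that "__main__" yields an empty package name, so the original
location of a script run from inside a package is lost unless
__package__ is set explicitly (as runpy does).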

Object level attributes relating to location of class and function definitions:
  __module__ (*updated*): refers to __module__ from originating module
(if defined), otherwise to __name__
  __impl_module__ (*new*): refers to __name__ from originating module

Looking at that write-up, I do quite like the idea of reusing
__module__ for the new module level attribute.
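
As a rough sketch of the object level rules listed above (this models
the proposal only; the interpreter does not behave this way today, and
the helper name definition_names is made up for the example):

```python
def definition_names(module_globals):
    """Return (__module__, __impl_module__) for an object defined in a module.

    Hypothetical helper modelling the proposed implicit configuration.
    """
    # __impl_module__ would always record the real defining module.
    impl_name = module_globals["__name__"]
    # __module__ would use the module level override when one is defined,
    # and fall back to __name__ otherwise.
    public_name = module_globals.get("__module__", impl_name)
    return public_name, impl_name


# Globals of an acceleration module that declares its public home.
accelerated = {"__name__": "_functools", "__module__": "functools"}
# Globals of an ordinary module that declares nothing.
ordinary = {"__name__": "inspect"}

print(definition_names(accelerated))
print(definition_names(ordinary))
```

For the vast majority of modules the two values are identical, so
nothing changes unless a module opts in.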

> Also consider having virtual modules, where objects in it may have come from
> different *other* locations. A virtual module would need a way to keep track
> of that. (I'm not sure this is a good idea.)

It's too late, code already does that. This is precisely the use case
I am trying to fix (objects like functools.partial that deliberately
lie in their __module__ attribute), so that this can be done *right*
(i.e. without having to choose which use cases to support and which
ones to break).

The basic problem is that __module__ currently tries to serve two masters:
1. use cases like inspect.getsource, where we want to know where the
object came from in the current interpreter
2. use cases like pickle, where we want the "official" portable
location, with any implementation details (like the _functools module)
hidden.

Currently, the default behaviour of the interpreter is to support use
case 1 and break use case 2 if any objects are defined in a different
module from where they claim to live (e.g. see the pickle
compatibility breakage with the 3.2 unittest implementation layout
changes). The only tool currently available to module authors is to
override __module__ (as functools.partial and the datetime
acceleration module do), which is correct for use case 2, but breaks
use case 1 (leading to misleading error messages in the C acceleration
module case, and breaking otherwise valid introspection in the
unittest case).

My proposed changes will:
a) make overriding __module__ significantly easier to do
b) allow the introspection use cases access to the information they
need so they can do the right thing when confronted with an overridden
__module__ attribute
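
On the consumer side, point b) might look something like the sketch
below. __impl_module__ is the proposed attribute and does not exist
yet, so it is set by hand here; the helper names and the FakePartial
class are likewise invented for the example:

```python
def implementation_module(obj):
    """Name to use for introspection (use case 1, e.g. inspect.getsource)."""
    # Prefer the proposed __impl_module__; fall back to __module__ for
    # objects that do not carry it.
    return getattr(obj, "__impl_module__", obj.__module__)


def portable_module(obj):
    """Name to use for serialisation (use case 2, e.g. pickle)."""
    return obj.__module__


class FakePartial:
    """Stand-in for a class implemented in a private acceleration module."""


# Simulate what the proposal would set up automatically.
FakePartial.__module__ = "functools"
FakePartial.__impl_module__ = "_functools"


class Plain:
    """An ordinary class with no override at all."""


print(portable_module(FakePartial), implementation_module(FakePartial))
print(portable_module(Plain) == implementation_module(Plain))
```

Libraries such as inspect and pydoc would use the first helper's
logic, and pickle the second, so neither use case has to be sacrificed
for the other.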

> Does this fit some of problems you are thinking of where the granularity may
> matter?
>
> It would take two functions to do this.  One to create the virtual module,
> and another to pre-load its initial objects.  For those objects, the loader
> would set obj.__module__ to the virtual module name, and also set
> obj.__original_module__ to the original module name.  These would only be
> seen on objects in virtual modules.  A lookup on obj.__module__ will tell
> you it's in a virtual module.  Then a lookup with obj.__original_module__
> would give you the actual location info it came from.

That adds a lot of complexity though - far simpler to define a new
__impl_module__ attribute on every object, retroactively fixing
introspection of existing code that adjusts __module__ to make
pickling work properly across different versions and implementations.

> By doing it that way, most people will never need to know how these things
> work or even see them.  ie... It's advance/expert Python foo. ;-)

Most people will never need to care or worry about the difference
between __module__ and __impl_module__ either - it will be hidden
inside libraries like inspect, pydoc and pickle.

> Any way, I hope this gives you some ideas, I know you can figure out the
> details much better than I can.

Yeah, the idea of reusing the __module__ attribute name at the top
level is an excellent one.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia