Python-dev summary for 2002-09-01 to 2002-09-15

17 Sep 2002 12:50:05 -0700

This is a summary of traffic on the `python-dev mailing list`_ between
September 01, 2002 and September 15, 2002 (exclusive).  It is intended
to inform the wider Python community of ongoing developments on the
list.  To comment on anything mentioned here, just post to
python-list@python.org or comp.lang.python in the usual way. Give your
posting a meaningful subject line, and if it's about a PEP, include
the PEP number (e.g. Subject: PEP 201 - Lockstep iteration).  All
python-dev members are interested in seeing ideas discussed by the
community, so don't hesitate to take a stance on a PEP (or anything
else for that matter) if you have an opinion.

This is the second summary written by Brett Cannon (hopefully my
sophomoric performance will be better than most sophomore music
albums).

Summaries by me (2002-09-15 to ... when I burn out) are archived at:
    http://www.ocf.berkeley.edu/~bac/python-dev/summaries/index.php
You can find summaries by Michael Hudson (2002-02-01 to 2002-07-05)
at:
    http://starship.python.net/crew/mwh/summaries/index.html
Summaries by A.M. Kuchling (2000-12-01 to 2001-01-31) are at:
    http://www.amk.ca/python/dev/

Please note that this summary is written using reStructuredText_ which
can be found at http://docutils.sourceforge.net/rst.html .  Any
unfamiliar punctuation is probably markup for reST; you can safely
ignore it (although I suggest learning reST; it's nice and is accepted
for PEP markup).  Also, because of the wonders of reformatting thanks
to whatever you are using to read this, I cannot guarantee you will be
able to run this text through DocUtils as-is.  If you want to do that,
get the original text from the archive.

I am considering keeping a list of the names that people are often
referred to by in emails.  This would serve a dual purpose: it would
give people who read emails from the list a reference for figuring
out who is who, and it would make the summaries easier for me because
I could then refer to people by the names I know them by.  =)  Any
comments on this idea are appreciated.

.. _python-dev mailing list:
   http://mail.python.org/mailman/listinfo/python-dev
.. _reST:
.. _reStructuredText: http://docutils.sf.net/

============================
`To commit or not commit`_
============================

Walter Dorwald asked if there were "any objections against committing
the patch" for implementing `PEP 293`_ (Codec Error Handling
Callbacks).  Guido asked what Martin V. Lowis and M.A. Lemburg had to
say about it.  MAL responded that he was +1 on the patch.  Martin was
"concerned about the massive amounts of C code, most of which could be
expressed way more compact in Python code", but "Walter convinced
[MvL] that this does have a real performance impact for real data" so
he would live with it.  In the end he gave it his vote.

Walter said he would check it in (and he has).  The PEP has now been
moved to the finished PEP list.

.. _To commit or not commit:
   http://mail.python.org/pipermail/python-dev/2002-September/028502.html
.. _PEP 293: http://www.python.org/peps/pep-0293.html

=======================================
`Proposed Mixins for Wide Interfaces`_
=======================================

Raymond Hettinger suggested adding mixin classes that automatically
implement magic methods when certain basic magic methods were already
implemented (e.g., "given an __eq__ method in a subclass, adds a
__ne__ method").  David Abrahams said that he thought "these are a
great idea, *in the context of* an understanding of what we want
interfaces to be, say, and do."  Guido brought up some points about
the initial suggestions Raymond made.  He then said that he thought
that there wasn't "enough here to warrant putting this into the
standard library"; the issue will be revisited when a standard type or
interface hierarchy is added to Python (not in 2.3).
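
As a rough illustration of the kind of mixin Raymond described, here is
a minimal sketch in present-day Python.  The names ``NotEqualMixin`` and
``Point`` are made up for this example and do not come from his
proposal; the mixin simply supplies ``__ne__`` on top of whatever
``__eq__`` the subclass defines.

```python
class NotEqualMixin:
    """Hypothetical mixin: derive __ne__ from a subclass's __eq__."""

    def __ne__(self, other):
        result = self.__eq__(other)
        # Pass NotImplemented through so Python can try the reflected operation.
        if result is NotImplemented:
            return NotImplemented
        return not result


class Point(NotEqualMixin):
    """Example subclass that only spells out equality."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


same = Point(1, 2) != Point(1, 2)        # False, via the mixin
different = Point(1, 2) != Point(1, 3)   # True, via the mixin
```

The subclass only has to get ``__eq__`` right; the mixin keeps the two
operators consistent with each other.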

.. _Proposed Mixins for Wide Interfaces:
   http://mail.python.org/pipermail/python-dev/2002-September/028543.html

===================================
`mysterious hangs in socket code`_
===================================

Jeremy Hylton wrote some threaded code to fetch some web pages that
hung when performing a slow DNS operation.  Apparently, in Python 2.1
"it produces a steady stream of output -- urls and the time it took to
load them".  In Python 2.2 and 2.3, though, "it produces little bursts
of output, then pauses for a long time, then repeats".  Jeremy guessed
that it *might* have something to do with Linux's getaddrinfo() being
thread-safe by allowing only a single lookup at a time.  Aahz said
that "gethostbyname() IIRC has frequently been non-reentrant".

.. _mysterious hangs in socket code:
   http://mail.python.org/pipermail/python-dev/2002-September/028555.html

========================================
`Two random and nearly unrelated ideas`_
========================================

Skip Montanaro had two ideas; one was to make the info in `Misc/NEWS`_
(which is a summary of what has been changed in Python for each
release) a web page and the other was "to get rid of the ticker
altogether in systems with proper signal support" (see the `2002-08-16
- 2002-09-01 summary`_ for an explanation of what the ticker is). 
That would get rid of the polling of the ticker and thus reduce the
overhead on threads.

For the first idea, Guido asked Skip to try seeing what it would look
like with reST_ markup and what the resulting page would look like.

In response to the second idea, Oren Tirosh said it couldn't be done
until "all Python I/O calls are converted to be EINTR-safe"
(EINTR-safe means being able to handle the EINTR error, which is what
is returned "When an I/O operation is interrupted by an unmasked
signal").  That "requires a lot of work in some of the hairiest places
in the Python codebase."  Fredrik Lundh said that this "sounds like a
good topic for a 'here's what I learned when trying to fix this
problem' PEP".  This is most likely in reference to Skip writing the
patch to make the ticker global instead of a per-thread issue.  Guido
said, in terms of signals, to "just say no"; "it is impossible to
write correct code in the presence of signals".  Guido, in a later
email, gave this whole idea a vote of -1,000,000; so it ain't ever
going to happen.  Some discussion on signals ensued, but Guido never
budged from his position.

Oren pointed out that if some C code used signals, an I/O call could
fail with EINTR even though there was no reason for it to have
stopped, and it would not restart properly unless people handled it in
their Python code by checking whether the IOError was caused by EINTR
(as shown below by Oren's code)::

    while 1:
        try:
            <code>
        except IOError, exc:
            if exc.errno == errno.EINTR:
                continue
            else:
                raise

Oren said that Python could add the loop in the C code of the core
where EINTR might be raised ("Only low-level functions like os.read_
and os.write_ that map directly to stdio functions should ever return
EINTR").  The proposed idea was to wrap the functions that might raise
EINTR and that can be re-entered safely.
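
The retry loop can also be packaged as a helper.  What follows is only
a sketch in present-day Python 3 syntax, not the C-level change Oren
proposed; ``retry_on_eintr`` and ``flaky_read`` are names invented for
this example, and ``flaky_read`` merely simulates a call that gets
interrupted twice.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func, restarting it whenever it fails with EINTR."""
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:  # IOError is an alias of OSError in Python 3
            if exc.errno != errno.EINTR:
                raise
            # Interrupted by a signal: nothing was wrong, so just try again.


# Stand-in for an interruptible call such as os.read (made up for the demo).
attempts = []


def flaky_read():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return b"data"


result = retry_on_eintr(flaky_read)
```

Note that Python 3.5 and later already retry most interrupted system
calls automatically (PEP 475), so a wrapper like this is mainly of
historical interest.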

.. _Two random and nearly unrelated ideas:
   http://mail.python.org/pipermail/python-dev/2002-September/028555.html
.. _Misc/NEWS: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Misc/NEWS
.. _2002-08-16 - 2002-09-01 summary:
   http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html
.. _os.read:
.. _os.write: http://www.python.org/dev/doc/devel/lib/os-fd-ops.html

================================================
`Should KeyError use repr() on its arguments?`_
================================================

Originally, when a KeyError was raised and you passed in an optional
object to act as a description of why the exception was raised (such
as ``KeyError("there is no spoon")`` where ``"there is no spoon"`` is
the optional argument stored in ``<exception>.args``), str() just
returned that argument as a string; ``str(<exception>) ==
str(<exception>.args[0])``.  Now it calls repr() on the argument;
``str(<exception>) == repr(<exception>.args[0])``.  Much better. =)
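
The behaviour can be seen in a present-day interpreter.  This is just a
small demonstration (the variable names are mine), comparing KeyError
with an exception that did not get the repr() treatment:

```python
key_error = KeyError("there is no spoon")
value_error = ValueError("there is no spoon")

# KeyError shows the repr() of its single argument (note the quotes) ...
key_text = str(key_error)
# ... while most other exceptions show the str() of it.
value_text = str(value_error)

# The difference is easiest to see with a key such as the empty string.
empty_text = str(KeyError(""))

print(key_text)
print(value_text)
print(empty_text)
```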

.. _Should KeyError use repr() on its arguments?:
   http://mail.python.org/pipermail/python-dev/2002-September/028545.html

==========================================
`New 'spambayes' project on SourceForge`_
==========================================

Thanks to great work done by Tim Peters and several other
contributors, Barry Warsaw started an SF project to host the spambayes
code.  It can be found at http://sf.net/projects/spambayes .  There
are two mailing lists:
http://mail.python.org/mailman-21/listinfo/spambayes and
http://mail.python.org/mailman-21/listinfo/spambayes-checkins (yes,
that is Mailman 2.1, and yes, you will "help be a guinea pig for
Mailman 2.1").

.. _New 'spambayes' project on SourceForge:
   http://mail.python.org/pipermail/python-dev/2002-September/028626.html

=========================
`Subsecond time stamps`_
=========================

Martin V. Lowis wanted to introduce subsecond timestamps on platforms
that supported it.  He suggested adding another field to stat, creating
a new type, or making st_mtime a floating point value.  The first option
is easy, the second has the usual problems of defining a new type, and
the third does not guarantee enough accuracy.

Paul Svensson and Guido said that the last option (turning st_mtime
into a float) was the most Pythonic.  MvL agreed, but worried about
breaking code that expected an int.  Guido then suggested that maybe
the new field is the way to go; define something like st_mtimef that
will contain the float if available or contain an int otherwise.  Tim
Peters also weighed in with his `IEEE 754`_ voodoo about how a float
can hold enough info to be accurate up to 100 nanoseconds if you only
span 33 years.  That causes an issue starting in 2003 since that is
33 years past the epoch (1970).
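
To get a feel for the numbers, the spacing between adjacent doubles can
be checked directly.  This is my own back-of-the-envelope sketch rather
than Tim's calculation; ``resolution_ns`` is a helper name made up here,
and ``math.ulp`` needs Python 3.9 or later.  It lands in the same
ballpark as the figures above: a bit over 100 nanoseconds until roughly
34 years past the epoch, and twice that afterwards.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def resolution_ns(timestamp):
    """Gap between timestamp and the next representable double, in ns."""
    return math.ulp(float(timestamp)) * 1e9


# A double loses one bit of sub-second precision at every power of two.
crossover = 2.0 ** 30  # seconds since the epoch
years_to_crossover = crossover / SECONDS_PER_YEAR

before = resolution_ns(crossover - 1)  # just below 2**30 seconds
after = resolution_ns(crossover)       # from 2**30 seconds onwards

print(round(years_to_crossover, 1), round(before), round(after))
```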

But then MvL discovered that st_mtime was already a float on the Mac;
had that caused issues?  Jack Jansen of course chimed in on this by
saying that it caused him a headache about once a year in the form of
a failing test (other issues caused by timestamps are the Classic Macs
having the epoch at 1904 and not using UTC time).  He said he would
prefer to see the timestamp as a cookie that was passed into a
function that spit out "something guaranteed to be of your liking".

To address the other issues that Jack mentioned, Guido suggested that
all timestamps be converted to UTC time with the epoch at 1970.

MvL's `SF patch 606592`_, which makes all the relevant changes to have
timestamps return floats, has already been closed.

.. _Subsecond time stamps:
   http://mail.python.org/pipermail/python-dev/2002-September/028648.html
.. _IEEE 754: http://grouper.ieee.org/groups/754/
.. _SF patch 606592: http://www.python.org/sf/606592

=================================
`64-bit process optimization 1`_
=================================

Bob Ledwith posted a simple patch for `Include/object.h`_ that changed
the order of certain parts of the PyObject_HEAD macros, affecting
PyObject and PyVarObject.  This was for a 64-bit platform performance
boost (40% for large data sets according to Bob).  The reordering
eliminates some padding in the struct and allows more Python objects
to fit in the L2 cache, or at least that is what Bob thinks is going
on.
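
The padding effect can be reproduced with ctypes.  The sketch below
assumes the 2002-era variable-size header layout (an ``int`` reference
count, a type pointer, then an ``int`` size); the exact order used in
Bob's patch is not given here, so ``ReorderedVarHead`` is only my guess
at the idea, and both class names are invented for the example.

```python
import ctypes


class OriginalVarHead(ctypes.Structure):
    # Assumed 2002-era layout: int refcount, type pointer, int size.
    _fields_ = [
        ("ob_refcnt", ctypes.c_int),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_int),
    ]


class ReorderedVarHead(ctypes.Structure):
    # Pointer first, so the two ints can sit side by side with no padding.
    _fields_ = [
        ("ob_type", ctypes.c_void_p),
        ("ob_refcnt", ctypes.c_int),
        ("ob_size", ctypes.c_int),
    ]


original_size = ctypes.sizeof(OriginalVarHead)
reordered_size = ctypes.sizeof(ReorderedVarHead)
saved = original_size - reordered_size

print(original_size, reordered_size, saved)
```

On a platform with 8-byte pointers and 4-byte ints the first layout
needs padding after each ``int``, while on a 32-bit platform the two
layouts come out the same size.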

Guido pointed out that this would save 8 bytes per object; he thought
all of this was "Interesting!".  But alas, using this patch would
break binary compatibility.  Guido was not sure, though, whether it
had been broken yet between Python 2.2 and 2.3 and thus he might be
"being too conservative here" in terms of saying that it should be
held back for now.

A problem Guido pointed out for 64-bit systems is that theoretically
the reference count for an object could go negative with enough
references as things stand now.  Guido then suggested that perhaps
refcnt (the struct item that holds the reference count) should be a
``long``.  And while dealing with that, Guido suggested that anything
that stores a length should store that number in a ``long``.

Chime in Tim Peters.  He pointed out that it was agreed upon years ago
to move refcnt to ``long`` but no one had bothered to do it.  Heck,
even Guido thought for a long time that it was a long when it wasn't;
it required Tim to "beat that out of [Guido] <wink>" to stop him from
saying that it was a ``long``.  He then pointed out that a ``long`` was
still only 4 bytes on Win64; what was really desired was for it
to be ``Py_intptr_t`` which is the Python way of spelling the C99
type that we wanted.  Apparently C99 has a way to specify that things
be a specific byte length (now if everyone just had a C99 compiler we
wouldn't need these macros; oh, to dream...).

Tim also pointed out that what we wanted was for the type that held a
length to be size_t since that is what strlen() and malloc()
are restricted by.  He said that he writes all of his "string-slinging
code as using size_t vars now".

Tim pointed out that the issue then became "Whether it's worth the
pain to change this stuff" which "depends on whether we think 64-bit
boxes are just another passing fad like the Internet <wink>".  =)

Martin V. Lowis agreed with the changing of refcnt to a long but had
reservations about using size_t for the length field (ob_size).  He
pointed out that some objects put negative values into that field.

Fredrik suggested that the proposed changes be the default on 64-bit
systems since the chance that people there are willing to recompile is
higher than for people on 32-bit systems.  He also suggested making it
a compiler option.  Guido thought it was a good idea.  But then Mats
Wichmann discovered that the switch to long killed the performance
boost.  So Guido re-iterated that he thinks it should be a compiler
option only on 64-bit systems; have "compat", "optimal", and "right"
compiler options.

As of yet nothing has been done about this.

.. _64-bit process optimization 1:
   http://mail.python.org/pipermail/python-dev/2002-September/028677.html
.. _Include/object.h:
   http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Include/object.h

==========================================
`Weeding out obsolete modules and Demos`_
==========================================

Jack Jansen noticed that there are demos for some of the SGI-specific
modules that use severely outdated systems and hardware (stuff
discontinued 8 to 12 years ago).  Guido gave the go-ahead to yank them
from CVS.

This has yet to be done.

.. _Weeding out obsolete modules and Demos:
   http://mail.python.org/pipermail/python-dev/2002-September/028718.html

==============
`utf8 issue`_
==============

(This thread actually started in August.)  There was a bug in Python
2.2 that raised a UnicodeError when trying to decode a lone surrogate
(explanation of surrogates to follow this summary).  This caused
issues in importing .pyc files that contained a lone surrogate because
marshal_ (which is what is used to create .pyc files) encodes Unicode_
literals in UTF-8.  This has all been fixed in Python 2.3, but Guido
was wondering how to backport this for Python 2.2.2.

The option of bumping the magic number for .pyc files was raised and
instantly thrown out by Guido; "Bumping MAGIC is a no-no between dot
releases".  So M.A. Lemburg suggested either fixing the Unicode
encoder or changing the Unicode decoder to handle the malformed
Unicode.  MAL wasn't sure, though, if some security issue would be
raised by the latter option.

Guido said go for the latter and didn't see any possible security
issue since "If someone you don't trust can write your .pyc files,
they can cause your interpreter to crash by inserting bogus bytecode".

Explanation of lone surrogates:

    In Unicode, a surrogate pair is when you create the representation
    of a character by using two values.  So, for instance, UTF-32 can
    cover the entire Unicode space (since Unicode is 20 bits, although
    MvL says it is really more like 21 bits), but UTF-16 can't.  To
    solve the issue for an encoding that cannot cover all possible
    characters in a single value, a character can be represented as a
    pair of UTF-16 values.  The high surrogate covers the high 10 bits
    while the low surrogate covers the lower 10 bits.  High and low
    surrogates can never be the same since they are defined by ranges
    of possible values and those ranges do not overlap.  So with the
    proper high and low surrogate paired together you can make any
    possible Unicode character.

    The problem in Python 2.2.1 is that when there is only a lone
    surrogate (instead of there being a pair of surrogates), the
    encoder for UTF-8 messes up and leaves off a UTF-8 value.  The
    following line is an example::

        >>> u'\ud800'.encode('utf-8')
        '\xa0\x80'  #In Python 2.2.1
        '\xed\xa0\x80'  #In Python 2.3a0

    Notice how in Python 2.3a0 the extra value is inserted so as to
    make the representation a complete UTF-8 sequence for the half of
    the surrogate pair that the encoder was given.

    You can read
    http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html
    for more info.  Thanks go to Fredrik for the link and Guido for
    some clarification.
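
For anyone who wants to see the arithmetic, here is a small sketch in
present-day Python 3 of how a code point is split into a surrogate
pair.  The function name ``to_surrogate_pair`` is made up for this
example; the constants are the standard UTF-16 ones.  The last line
uses the ``surrogatepass`` error handler, since Python 3 refuses to
encode a lone surrogate to UTF-8 by default.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# A lone high surrogate, encoded the way Python 2.3a0 did it.
lone = "\ud800".encode("utf-8", "surrogatepass")

print(hex(high), hex(low), lone)
```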

.. _utf8 issue: http://mail.python.org/pipermail/python-dev/2002-August/028254.html
.. _marshal: http://www.python.org/dev/doc/devel/lib/module-marshal.html
.. _Unicode: http://www.unicode.org/

=====================================
`Documentation inconsistency in re`_
=====================================

Christopher Craig noticed that the docs for the re_ module for the
``\b`` metacharacter were incorrect; they said that "the end of a word
is indicated by whitespace or a non-alphanumeric character".  That
would indicate that an underscore would be the end of a word, which
turns out to be false.  Fredrik said that "``\b`` is defined in terms
of ``\w`` and ``\W``" and thus treats underscore as an alphanumeric
character.  The documentation has been fixed.
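
A quick check of the behaviour in a present-day interpreter (the
variable names are mine):

```python
import re

word = re.compile(r"\bfoo\b")

# "_" counts as a word character (\w), so there is no boundary after "foo".
underscore_match = word.search("foo_bar")

# "-" is a non-word character (\W), so "foo" ends at a boundary here.
hyphen_match = word.search("foo-bar")

print(underscore_match, hyphen_match)
```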

.. _Documentation inconsistency in re:
   http://mail.python.org/pipermail/python-dev/2002-September/028644.html
.. _re: http://www.python.org/dev/doc/devel/lib/module-re.html

=======================
`Codecs lookup order`_
=======================

Francois Pinard discovered that for the codecs_ module "one should be
careful about **not** [altered emphasis] naming a module after the
encoding name, when closely following the documentation in the Library
Reference manual".  This is because the codecs module first searches
the registry of codecs, then searches for a module with the same name
and uses that module.  The issue comes up when the module does not
contain a function named getregentry(); "\`encodings.lookup()` expects
a \`getregentry` function in that module, does not find it, and raises
a CodecRegistryError, not leaving a chance to subsequent codec search
functions to be used".

M.A. Lemburg said that this has been fixed in Python 2.3 and will be
in 2.2.2 by having encodings.lookup() return None if getregentry() is
not found and thus allowing the search to continue.
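
The "return None so the search can continue" convention is the same one
that any registered search function follows.  Below is a minimal sketch
using the present-day codecs API; the codec name ``upper_demo`` and the
function ``search_demo`` are invented for this example and have nothing
to do with Francois's module.

```python
import codecs


def search_demo(name):
    """Hypothetical search function that knows exactly one codec."""
    if name != "upper_demo":
        return None  # not ours: let the remaining search functions try

    def encode(text, errors="strict"):
        return text.upper().encode("ascii", errors), len(text)

    def decode(data, errors="strict"):
        return bytes(data).decode("ascii", errors), len(data)

    return codecs.CodecInfo(encode, decode, name="upper_demo")


codecs.register(search_demo)

# The built-in search finds no module by this name and returns None,
# so the lookup falls through to search_demo.
encoded = codecs.encode("abc", "upper_demo")
print(encoded)
```

If every search function returns None, the lookup ends with a
LookupError instead of being cut short by the first function that could
not help.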

.. _Codecs lookup order:
   http://mail.python.org/pipermail/python-dev/2002-September/028676.html
.. _codecs: http://www.python.org/dev/doc/devel/lib/module-codecs.html

=================================
`raw headers in rfc822.Message`_
=================================

John Spurling provided a two-line hack to keep the raw headers in an
rfc822.Message_ .  Barry responded that email.Message.Message_ keeps
the raw headers around.

But the reason I am summarizing this is that the thread quickly
turned to how to properly generate a patch.  Patches should be
generated using UNIX diff, with either the -c or -u option, with
preference for -c (using cvs diff -c is even better; it puts the
version of the file you are diffing against in the output); Mac folk
can send MPW diffs, but UNIX diff is definitely the preference.
Always put the files in the order ``diff -c OLD_FILE NEW_FILE`` .  And
always post the patches_ to SourceForge_!  Getting random patches, no
matter how small, on the list is annoying (at least to me) because the
point of the list is to discuss the design and implementation of
Python, not to patch Python.  SF is used so that Python-dev does not
need to be bothered with mundane problems like applying patches (and
to annoy Aahz with SF's UI sucking in Lynx_ =).  So please, for my
sake and everyone else on Python-dev, use SF!
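
If you have never looked at a context diff, the standard library can
produce one.  This is only an illustration of the format and of the
old-then-new argument order; it uses ``difflib`` as a stand-in for the
command-line tool, and the two file bodies are made up.

```python
import difflib

old_lines = ["def greet():\n", "    print 'hello'\n"]
new_lines = ["def greet():\n", "    print 'hello, world'\n"]

# Old file first, new file second, mirroring "diff -c OLD_FILE NEW_FILE".
patch = "".join(
    difflib.context_diff(
        old_lines, new_lines, fromfile="OLD_FILE", tofile="NEW_FILE"
    )
)

print(patch)
```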

For a funny email from Raymond Hettinger about developing for Python
read http://mail.python.org/pipermail/python-dev/2002-September/028725.html
.

.. _raw headers in rfc822.Message:
   http://mail.python.org/pipermail/python-dev/2002-September/028682.html
.. _rfc822.Message: http://www.python.org/dev/doc/devel/lib/message-objects.html
.. _email.Message.Message:
   http://www.python.org/dev/doc/devel/lib/module-email.Message.html
.. _patches: http://sourceforge.net/patch/?group_id=5470
.. _SourceForge: http://www.sourceforge.net/
.. _Lynx: http://lynx.browser.org/

===================
`type categories`_
===================

Yes, the `same thread`_ from the `last summary`_ is back.  This thread
has become the bane of my summarizing existence.  =)

Aahz asked "why wouldn't we simply use attributes to hold" interfaces
that a class implemented (think of __slots__).  David Abrahams then
brought up the idea of just adding interfaces to the __class__
attribute.

Guido then chimed in on the attributes idea.  He pointed out that this
is how Zope does it, using the __inherits__ attribute.  The limitation
is that "it isn't automatically merged properly on multiple
inheritance, and adding one new interface to it means you have to copy
or reference the base class __inherits__ attribute".  And as for
David's idea of just adding to __class__, that doesn't work because
there is no way to limit the interface; you need "Something like
private inheritance" for when an interface is broken by some inherited
class.  David subsequently added the issue of being able to disinherit
when an interface is not valid but is inherited by default as another
problem for using inheritance for interfaces.
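
The merging limitation Guido mentioned is easy to reproduce.  The
sketch below uses a made-up attribute name, ``__interfaces__``, and
made-up classes; it is not how Zope spells things, just a small
demonstration of why a plain class attribute is not combined across
multiple base classes and what an explicit merge might look like.

```python
class Readable:
    __interfaces__ = ("IReadable",)   # hypothetical attribute name


class Writable:
    __interfaces__ = ("IWritable",)


class File(Readable, Writable):
    pass


# Plain attribute lookup stops at the first base class that defines it.
inherited = File.__interfaces__


def merged_interfaces(cls):
    """Collect the attribute from every class in the MRO, without duplicates."""
    found = []
    for klass in cls.__mro__:
        for name in vars(klass).get("__interfaces__", ()):
            if name not in found:
                found.append(name)
    return tuple(found)


merged = merged_interfaces(File)

print(inherited, merged)
```

Note that the merge helper has no way to drop an interface that a
subclass no longer honours, which is the disinheriting problem David
raised.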

David then brought up the issue of having Python being so dynamic that
you could inject an interface if you used __class__ like he suggested
through black magic code.  If the injected interface didn't work
because of the inheritance chain, then you have a problem.

Barry Warsaw brought in his objections.  He tried playing Devil's
Advocate by saying that Guido had said that inheritance would not be
the only way to handle interfaces, but that it would be the
predominant way.  But this duality would complicate any
conformsto()-like function since it would have to handle two different
ways for a class to get an interface.  Barry then brought up the
objection that he didn't like the idea of using straight inheritance
because he wanted a syntactic way to separate out interfaces.

As a side note, Guido pointed out that __slots__ is provisional; nicer
syntax will eventually surface when Guido gets over his "fear of
adding new keywords".

.. _type categories:
   http://mail.python.org/pipermail/python-dev/2002-September/028738.html
.. _same thread: http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html#type-categories
.. _last summary: http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html

=======================================
`flextype.c  -- extended type system`_
=======================================

Christian Tismer has come up with a replacement for the etype which is
"a hidden structure that extends types when they are allocated on the
heap" (you can find it in `Objects/typeobject.c`_ in the CVS_).  There
is a limitation with the etype where it could not be extended by
metatypes.  Well, Chris worked his magic and came up with a new
flextype that allows overriding of methods.  So with Christian's code
you would be able to override methods in a type without having to hack
something together to handle the overriding correctly; it would be
handled automatically.

Through some clarification from Christian and Guido, it was pointed
out to me (as of this moment I am the only one to make any noise on
this thread, and it was for this summary) that this simplifies an
esoteric issue; note the use of the word "metatype" above.  This is
type/metatype black magic hacking.  Spiffy, but something most of us
"normal" folk will not have to worry about.

.. _flextype.c  -- extended type system:
   http://mail.python.org/pipermail/python-dev/2002-September/028736.html
.. _Objects/typeobject.c:
   http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Objects/typeobject.c
.. _CVS: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/#dirlist