[Python-Dev] Python-dev summary for 2002-09-01 to 2002-09-15

Brett Cannon drifty@bigfoot.com
Sun, 15 Sep 2002 21:31:53 -0700 (PDT)


Since posting here first before posting to c.l.py worked out rather nicely
last time, I am going to do it again.  Basically everyone gets 24 hours to
reply with any corrections for this summary.

Now please note that the links that point to where I am going to keep the
summaries are not up yet (but they will be by the time I post to c.l.py).

Enjoy.

==============================

This is a summary of traffic on the `python-dev mailing list`_ between
September 01, 2002 and September 15, 2002 (exclusive).  It is intended to
inform the wider Python community of ongoing developments on the list;
everything from new features of the language to how to handle discovered
bugs that might affect the general Python programmer.  To comment on
anything mentioned here, just post to python-list@python.org or
comp.lang.python in the usual way. Give your posting a meaningful subject
line, and if it's about a PEP, include the PEP number (e.g. Subject: PEP
201 - Lockstep iteration).  All python-dev members are interested in seeing
ideas discussed by the community, so don't hesitate to take a stance on a
PEP (or anything else for that matter) if you have an opinion.

::

 This is the second summary written by Brett Cannon (hopefully my
sophomoric performance will be better than most sophomore music albums).
 Summaries by me (2002-09-15 to ... when I burn out) are archived at
http://www.ocf.berkeley.edu/~bac/python-dev/summaries/index.php .
 You can find summaries by Michael Hudson (2002-02-01 to 2002-07-05) at
http://starship.python.net/crew/mwh/summaries/index.html .
 Summaries by A.M. Kuchling (2000-12-01 to 2001-01-31) are at
http://www.amk.ca/python/dev/ .

Please note that this summary is written using reST_ which can be found at
http://docutils.sourceforge.net/rst.html .  If there is some markup in the
summary that seems odd, chances are it is because of reST.

Also, I am considering keeping a list of the names that people are often
referred to by in emails.  This would serve a dual purpose: it would give
people who read emails from the list a reference for figuring out who is
who, and it would make the summaries easier for me because I reference
these people in my head by their nicknames.  =)  Any comments on that idea
are appreciated.

.. _python-dev mailing list:
http://mail.python.org/mailman/listinfo/python-dev

============================
`To commit or not commit`_
============================

Walter Dorwald asked if there were "any objections against committing the
patch" for implementing `PEP 293`_ (Codec Error Handling Callbacks).
Guido asked what Martin V. Lowis and M.A. Lemburg had to say about it.
MAL responded that he was +1 on the patch.  Martin was "concerned about
the massive amounts of C code, most of which could be expressed way more
compact in Python code", but "Walter convinced [MvL] that this does have a
real performance impact for real data" so he would live with it.  In the
end he gave it his vote.

Walter said he would check it in (and he has).  The PEP has now been moved
to the finished PEP list.

.. _To commit or not commit:
http://mail.python.org/pipermail/python-dev/2002-September/028502.html
.. _PEP 293: http://www.python.org/peps/pep-0293.html

=======================================
`Proposed Mixins for Wide Interfaces`_
=======================================

Raymond Hettinger suggested adding mixin classes that automatically
implement magic methods when certain basic magic methods were already
implemented (e.g., "given an __eq__ method in a subclass, adds a __ne__
method").  David Abrahams said that he thought "these are a great idea,
*in the context of* an understanding of what we want interfaces to be,
say, and do."  Guido brought up some points about the initial suggestions
Raymond made.  He then said that he thought that there wasn't "enough here
to warrant putting this into the standard library"; the issue will be
revisited when a standard type or interface hierarchy is added to Python
(not in 2.3).
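
As a rough illustration of the kind of mixin Raymond was describing, here
is a minimal sketch in present-day Python.  The names ``OrderingMixin`` and
``Version`` are made up for this example and do not come from the thread.
Current Python already derives __ne__ from __eq__ on its own and ships
functools.total_ordering, so this only shows the idea of filling in the
"wide" interface from a couple of basic methods.

```python
class OrderingMixin:
    """Hypothetical mixin: derive >, <= and >= from a subclass's __lt__."""

    def __gt__(self, other):
        return other < self

    def __le__(self, other):
        return not other < self

    def __ge__(self, other):
        return not self < other


class Version(OrderingMixin):
    """Toy class that only implements the basic methods __eq__ and __lt__."""

    def __init__(self, major, minor):
        self.key = (major, minor)

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key


old, new = Version(2, 2), Version(2, 3)
print(new > old, old <= new, new >= old)
```

The sketch only handles comparisons between instances of the same class; a
real library version would also have to deal with NotImplemented.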

.. _Proposed Mixins for Wide Interfaces:
http://mail.python.org/pipermail/python-dev/2002-September/028543.html

===================================
`mysterious hangs in socket code`_
===================================

Jeremy Hylton wrote some threaded code to fetch some web pages that hung
when performing a slow DNS operation.  Apparently, in Python 2.1 "it
produces a steady stream of output -- urls and the time it took to load
them".  In Python 2.2 and 2.3, though, "it produces little bursts of
output, then pauses for a long time, then repeats".  Jeremy guessed that
it *might* have something to do with Linux's getaddrinfo() being
thread-safe by allowing only a single lookup at a time.  Aahz said that
"gethostbyname() IIRC has frequently been non-reentrant".  Skip ran the
code in question under strace and said that "it seems mostly to be sitting
in select() calls and rt_sigsuspend() which [Skip] guess[es] is a wrapper
around sigsuspend()."

.. _mysterious hangs in socket code:
http://mail.python.org/pipermail/python-dev/2002-September/028555.html

========================================
`Two random and nearly unrelated ideas`_
========================================

Skip Montanaro had two ideas: one concerned the presentation of the info in
`Misc/NEWS`_ (which is a summary of what has been changed in Python for each
release), and the other was "to get rid of the ticker altogether in systems
with proper signal support" (see the `2002-08-16 - 2002-09-01 summary`_ for
an explanation of what the ticker is).  The second idea would get rid of the
polling of the ticker and thus reduce the overhead on threads.

For the first idea, Guido asked Skip to try seeing what it would look like
with reST_ markup and what the resulting page would look like.

In response to the second idea, Oren Tirosh said it couldn't be done until
"all Python I/O calls are converted to be EINTR-safe" (EINTR-safe means
being able to handle the EINTR error, which is what is returned "When an
I/O operation is interrupted by an unmasked signal").  That "requires a lot
of work in some of the hairiest places in the Python codebase."  Fredrik
Lundh said that this "sounds like a good topic for a 'here's what I
learned when trying to fix this problem' PEP".  This is most likely in
reference to Skip writing the patch to make the ticker global instead of a
per-thread issue.  Guido said, in terms of signals, to "just say no"; "it
is impossible to write correct code in the presence of signals".  Guido,
in a later email, gave this whole idea a vote of -1,000,000; so it ain't
ever going to happen.  Some discussion on signals ensued, but Guido never
budged from his position.

Oren pointed out that if some C code used signals and people didn't handle
it in their Python code by checking if IOError was caused by EINTR (as
shown below by Oren's code)::

    while 1:
        try:
            <code>
        except IOError, exc:
            if exc.errno == errno.EINTR:
                continue
            else:
                raise

, it would not restart properly even though there was no reason for it to
have stopped.  Oren said that Python could add the loop in the C code of
the core where EINTR might be raised ("Only low-level functions like
os.read_ and os.write_ that map directly to stdio functions should ever
return EINTR").  The proposed idea was to wrap functions that might raise
this that can be re-entered safely.
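
For readers on a current Python, the retry loop Oren described can be
sketched as a small helper.  This is only an illustration: the names
``retry_on_eintr`` and ``flaky_read`` are hypothetical, and the fake read
function merely simulates an interrupted call.  Since Python 3.5 (PEP 475)
the standard library retries most interrupted system calls itself, so a
wrapper like this is rarely needed today.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func again whenever it fails with EINTR; re-raise anything else."""
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise


attempts = []


def flaky_read():
    """Hypothetical I/O call that is 'interrupted' on its first two tries."""
    attempts.append(1)
    if len(attempts) < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return b"data"


result = retry_on_eintr(flaky_read)
print(result, len(attempts))
```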

.. _Two random and nearly unrelated ideas:
http://mail.python.org/pipermail/python-dev/2002-September/028555.html
.. _Misc/NEWS:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Misc/NEWS
.. _2002-08-16 - 2002-09-01 summary:
http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html
.. _reST: http://docutils.sourceforge.net/rst.html
.. _os.read:
.. _os.write: http://www.python.org/dev/doc/devel/lib/os-fd-ops.html

================================================
`Should KeyError use repr() on its arguments?`_
================================================

Originally, when a KeyError was raised and you passed in an optional
object to act as a description of why the exception was raised (such as
``KeyError("there is no spoon")`` where ``"there is no spoon"`` is the
optional argument stored in ``<exception>.args``), str() just returned the
string form of that argument; ``str(<exception>) ==
str(<exception>.args[0])``.  Now it calls repr() on the argument;
``str(<exception>) == repr(<exception>.args[0])``.  Much better. =)
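
A quick way to see why this matters is to look up an empty-string key.
The snippet below is a small sketch for a current Python 3 interpreter; the
variable names are just for illustration.  ValueError is included only as a
contrast, since it still uses plain str() on its argument.

```python
missing_key = ""

key_error_text = str(KeyError(missing_key))
value_error_text = str(ValueError(missing_key))

# With repr(), an empty key is still visible in the message.
print("KeyError message:  ", key_error_text)
print("ValueError message:", value_error_text)
```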

.. _Should KeyError use repr() on its arguments?:
http://mail.python.org/pipermail/python-dev/2002-September/028545.html

==========================================
`New 'spambayes' project on SourceForge`_
==========================================

Thanks to great work done by Tim Peters and several other contributors,
Barry Warsaw started an SF project to host the spambayes code.  It can be
found at http://sf.net/projects/spambayes .  There are two mailing lists:
http://mail.python.org/mailman-21/listinfo/spambayes and
http://mail.python.org/mailman-21/listinfo/spambaye-checkins (yes, that is
Mailman 2.1, and yes, you will "help be a guinea pig for Mailman 2.1").

.. _New 'spambayes' project on SourceForge:
http://mail.python.org/pipermail/python-dev/2002-September/028626.html

=========================
`Subsecond time stamps`_
=========================

Martin V. Lowis wanted to introduce subsecond timestamps on platforms that
supported them.  He suggested adding another field to stat, creating a new
type, or making st_mtime a floating-point number.  The first option is easy,
the second has the usual problems of defining a new type, and the third
does not guarantee enough accuracy.

Paul Svensson and Guido said that the last option (turning st_mtime into a
float) was the most Pythonic.  MvL agreed, but worried about breaking code
that expected an int.  Guido then suggested that maybe the new field is
the way to go; define something like st_mtimef that will contain the float
if available or contain an int otherwise.  Tim Peters also weighed in with
his `IEEE 754`_ voodoo about how a float can hold enough info to be
accurate up to 100 nanoseconds if you only span a single year; issues
start to come up once you try to go past a year's worth of seconds.
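
Tim's point about precision can be checked on a current Python with
math.ulp() (available since Python 3.9), which reports the gap between a
float and the next representable one.  This is a sketch under my own
assumptions: the constant names are made up, and the 2002 timestamp is just
a sample value for roughly mid-September 2002.

```python
import math

HUNDRED_NANOSECONDS = 100e-9

# A timestamp that only has to span one year's worth of seconds.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
gap_within_a_year = math.ulp(float(SECONDS_PER_YEAR))

# A timestamp counted from the 1970 epoch, sampled in September 2002.
SEPTEMBER_2002 = 1032150713.0
gap_since_1970 = math.ulp(SEPTEMBER_2002)

print("resolution within one year:", gap_within_a_year)
print("resolution since 1970:     ", gap_since_1970)
print("one year fits in 100 ns:", gap_within_a_year < HUNDRED_NANOSECONDS)
print("since 1970 fits in 100 ns:", gap_since_1970 < HUNDRED_NANOSECONDS)
```

The first gap is a few nanoseconds, while the second is already coarser
than 100 nanoseconds, which matches the worry that a float st_mtime cannot
guarantee full accuracy.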

But then MvL discovered that st_mtime was already a float on the Mac; had
that caused issues?  Jack Jansen of course chimed in on this by saying
that it caused him a headache about once a year in the form of a failing
test (other timestamp issues are Classic Macs having the epoch at 1904 and
not using UTC time).  He said he would prefer to see the
timestamp as a cookie that was passed into a function that spit out
"something guaranteed to be of your liking".

To address the other issues that Jack mentioned, Guido suggested that all
timestamps be converted to UTC time with the epoch at 1970.

MvL's `SF patch 606592`_, which makes all the relevant changes to have
timestamps return floats, has already been closed.

.. _Subsecond time stamps:
http://mail.python.org/pipermail/python-dev/2002-September/028648.html
.. _IEEE 754: http://grouper.ieee.org/groups/754/
.. _SF patch 606592: http://www.python.org/sf/606592

=================================
`64-bit process optimization 1`_
=================================

Bob Ledwith posted a simple patch for `Include/object.h`_ that changed the
order of certain parts of the PyObject_HEAD macros, affecting PyObject and
PyVarObject.  This was for a 64-bit platform performance boost (40% for
large data sets according to Bob).  The reordering eliminated some padding
in the struct and allows more Python objects to fit in the L2 cache, or at
least that is what Bob thinks is going on.
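
The padding effect can be reproduced with ctypes, which lays out structures
using the platform's native alignment rules.  This is a sketch and not Bob's
actual patch: the two class names are hypothetical, the field names follow
CPython's header as I understand it (only ob_size is named in this
summary), and the reordered layout is simply one arrangement that removes
the padding.

```python
import ctypes


class OriginalLayout(ctypes.Structure):
    """int, pointer, int: the pointer forces padding after each int on LP64."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_int),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_int),
    ]


class ReorderedLayout(ctypes.Structure):
    """pointer first, then the two ints packed next to each other."""
    _fields_ = [
        ("ob_type", ctypes.c_void_p),
        ("ob_refcnt", ctypes.c_int),
        ("ob_size", ctypes.c_int),
    ]


original_size = ctypes.sizeof(OriginalLayout)
reordered_size = ctypes.sizeof(ReorderedLayout)
print("original: ", original_size, "bytes")
print("reordered:", reordered_size, "bytes")
```

On a typical 64-bit platform with 4-byte ints and 8-byte pointers the
difference is 8 bytes; on a 32-bit platform the two layouts are the same
size.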

Guido pointed out that this would save 8 bytes per object; he thought all
of this was "Interesting!".  But alas, using this patch would break binary
compatibility.  Guido was not sure, though, whether it had been broken yet
between Python 2.2 and 2.3 and thus he might be "being too conservative
here" in terms of saying that it should be held back for now.

A problem Guido pointed out for 64-bit systems is that theoretically the
reference count for an object could go negative with enough references as
things stand now.  Guido then suggested that perhaps refcnt (struct item
that holds the reference count) should be a ``long``.  And while dealing
with that, Guido suggested that anything that stores a length should store
that number in a ``long``.

Chime in Tim Peters.  He pointed out that it was agreed upon years ago to
move refcnt to ``long`` but no one had bothered to do it.  Heck, even
Guido thought for a long time that it was a long when it wasn't; it
required Tim to "beat that out of [Guido] <wink>" to stop him from saying
that it was a ``long``.  He then pointed out that Win64 was still only 4
bytes for a ``long``; what was really desired was for it to be
``Py_intptr_t`` which is the Python way for spelling the C99 type that we
wanted.  Apparently C99 has a way to specify that things be a specific
byte length (now if everyone just had a C99 compiler we wouldn't need
these macros; oh, to dream...).

Tim also pointed out that we wanted the type that held a length
argument to be size_t, since that is what strlen() and malloc() are
restricted by.  He said that he writes all of his "string-slinging code as
using size_t vars now".

Tim pointed out that the issue then became "Whether it's worth the pain to
change this stuff" which "depends on whether we think 64-bit boxes are
just another passing fad like the Internet <wink>".  =)

Martin V. Lowis agreed with the changing of refcnt to a long but had
reservations about using size_t for the length field (ob_size).  He
pointed out that some objects put negative values into that field.

Fredrik suggested that the proposed changes be the default on 64-bit
systems, since people there are more likely to be willing to recompile than
people on 32-bit systems.  He also suggested making it a compiler option.  Guido
thought it was a good idea.  But then Mats Wichmann discovered that the
switch to long killed the performance boost.  So Guido re-iterated that he
thinks it should be a compiler option only on 64-bit systems; have
"compat", "optimal", and "right" compiler options.

.. _64-bit process optimization 1:
http://mail.python.org/pipermail/python-dev/2002-September/028677.html
.. _Include/object.h:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Include/object.h

==========================================
`Weeding out obsolete modules and Demos`_
==========================================

Jack Jansen noticed that there were demos for some of the SGI-specific modules
that use severely outdated systems and hardware (stuff discontinued 8 to
12 years ago).  Guido gave the go-ahead to yank them from CVS.  So the
demos are now history.

.. _Weeding out obsolete modules and Demos:
http://mail.python.org/pipermail/python-dev/2002-September/028718.html

==============
`utf8 issue`_
==============

(This thread actually started in August.)  There was a bug in Python 2.2
that raised a UnicodeError when trying to decode a lone surrogate
(explanation of surrogates to follow this summary).  This caused issues in
importing .pyc files that contained a lone surrogate because marshal_
(which is what is used to create .pyc files) encodes Unicode_ literals in
UTF-8.  This has all been fixed in Python 2.3, but Guido was wondering how
to backport this for Python 2.2.2.

The option of bumping the magic number for .pyc files was raised and
instantly thrown out by Guido; "Bumping MAGIC is a no-no between dot
releases".  So M.A. Lemburg suggested to either fix the Unicode encoder or
change the Unicode decoder to handle the malformed Unicode.  MAL wasn't
sure, though, if some security issue would be raised by the latter option.

Guido said go for the latter and didn't see any possible security issue
since "If someone you don't trust can write your .pyc files, they can
cause your interpreter to crash by inserting bogus bytecode".

Explanation of lone surrogates:

    In Unicode, a surrogate pair is when you create the representation of
a character by using two values. So, for instance, UTF-32 can cover the
entire Unicode space (since Unicode is 20 bits, although MvL says it is
really more like 21 bits), but a single UTF-16 value can't.  To solve the
issue for an encoding that cannot cover all possible characters in a single
value, a character can be represented as a pair of UTF-16 values.  The high
surrogate covers the high 10 bits while the low surrogate covers the lower
10 bits.  High and low surrogates can never be the same since they are
defined by a range of possible values and those ranges do not overlap.  So
with the proper high and low surrogate paired together you can make any
possible Unicode character.

    The problem in Python 2.2.1 is that when there is only a lone
surrogate (instead of there being a pair of surrogates), the encoder for
UTF-8 messes up and leaves off a UTF-8 value.  The following line is an
example:

	>>> u'\ud800'.encode('utf-8')
	'\xa0\x80'  #In Python 2.2.1
	'\xed\xa0\x80'  #In Python 2.3a0

    Notice how in Python 2.3a0 the leading byte is included so that the
lone surrogate is encoded as a complete UTF-8 sequence instead of a
truncated one.

    You can read
http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html
for more info.  Thanks go to Fredrik for the link and Guido for some
clarification.
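
The arithmetic behind surrogate pairs can be sketched in a few lines of
present-day Python.  The function name ``to_surrogate_pair`` is made up for
this example.  Python 3 refuses to encode a lone surrogate to UTF-8 by
default, so the last line uses the ``surrogatepass`` error handler to show
the three-byte form discussed above.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# A lone high surrogate, encoded the way the fixed encoder does it.
lone_surrogate_bytes = "\ud800".encode("utf-8", "surrogatepass")
print(lone_surrogate_bytes)
```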

.. _utf8 issue:
http://mail.python.org/pipermail/python-dev/2002-August/028254.html
.. _marshal: http://www.python.org/dev/doc/devel/lib/module-marshal.html
.. _Unicode: http://www.unicode.org/

=====================================
`Documentation inconsistency in re`_
=====================================

Christopher Craig noticed that the docs for the re_ module were incorrect
about the \b metacharacter; they said that "the end of a word is indicated
by whitespace or a non-alphanumeric character".  That would indicate that
an underscore would be the end of a word, which turns out to be false.
Fredrik said that "\b is defined in terms of \w and \W" and thus treats
underscore as an alphanumeric character.  The documentation has been
fixed.
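
The behaviour is easy to confirm at the prompt.  A small sketch, with
sample strings of my own choosing:

```python
import re

word = re.compile(r"\bfoo\b")

# The underscore counts as a word character, so there is no boundary
# between "foo" and "_bar".
underscore_match = word.search("foo_bar")

# A hyphen is not a word character, so "foo" ends at a boundary here.
hyphen_match = word.search("foo-bar")

print(underscore_match)
print(hyphen_match)
```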

.. _Documentation inconsistency in re:
http://mail.python.org/pipermail/python-dev/2002-September/028644.html
.. _re: http://www.python.org/dev/doc/devel/lib/module-re.html

=======================
`Codecs lookup order`_
=======================

Francois Pinard discovered that for the codecs_ module "one should be
careful about **not** [altered emphasis] naming a module after the
encoding name, when closely following the documentation in the Library
Reference manual".  This is because the codecs module first searches the
registry of codecs, then searches for a module with the same name and uses
that module.  The issue comes up when the module does not contain a
function named getregentry(); "\`encodings.lookup()\` expects a
\`getregentry\` function in that module, does not find it, and raises a
CodecRegistryError, not leaving a chance to subsequent codec search
functions to be used".

M.A. Lemburg said that this has been fixed in Python 2.3 and will be in
2.2.2 by having encodings.lookup() return None if getregentry() is not
found and thus allowing the search to continue.
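
The "return None and let the search continue" behaviour is the same
contract that user-registered search functions follow.  Below is a minimal
sketch using codecs.register(); the function names and the encoding name
``my_utf8_alias`` are invented for this example.

```python
import codecs


def declining_search(encoding_name):
    """A search function that never finds anything; returning None lets
    the registry move on to the next search function."""
    return None


def alias_search(encoding_name):
    """Resolve one made-up alias by handing back the stock UTF-8 codec."""
    if encoding_name == "my_utf8_alias":
        return codecs.lookup("utf-8")
    return None


codecs.register(declining_search)
codecs.register(alias_search)

info = codecs.lookup("my_utf8_alias")
encoded = "spam".encode("my_utf8_alias")
print(info.name, encoded)
```

Had ``declining_search`` raised an exception instead of returning None,
``alias_search`` would never have been consulted, which is the situation
Francois ran into.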

.. _Codecs lookup order:
http://mail.python.org/pipermail/python-dev/2002-September/028676.html
.. _codecs: http://www.python.org/dev/doc/devel/lib/module-codecs.html

=================================
`raw headers in rfc822.Message`_
=================================

John Spurling provided a two-line hack to keep the raw headers in an
rfc822.Message_ .  Barry responded that email.Message.Message_ keeps the
raw headers around.

But the reason I am summarizing this is that the thread quickly turned
into a discussion of how to properly generate a patch.  Patches should be
generated using UNIX diff, with either the -c or -u option and a preference
for -c (using cvs diff -c is even better; it puts the version of the file
you are diffing against in the output); Mac folk can send MPW diffs, but
UNIX diff is definitely the preference.  Always put the files in the order
``diff -c OLD_FILE NEW_FILE``.  And always post the patches_ to SourceForge_!
Getting random patches, no matter how small, on the list is annoying (at
least to me) because the point of the list is to discuss the design and
implementation of Python, not to patch Python.  SF is used so that
Python-dev does not need to be bothered with mundane problems like
applying patches (and to annoy Aahz with SF's UI sucking in Lynx_ =).  So
please, for my sake and everyone else on Python-dev, use SF!
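
To see what a context diff with the files in the right order looks like,
here is a small sketch using Python's difflib.  It is not a replacement for
running diff or cvs diff on real files; the sample lines are made up, and
``OLD_FILE`` and ``NEW_FILE`` are just the placeholder names from above.

```python
import difflib

old_lines = ["def greet():\n", "    return 'helo'\n"]
new_lines = ["def greet():\n", "    return 'hello'\n"]

# Old version first, new version second, as with diff -c OLD_FILE NEW_FILE.
patch = list(
    difflib.context_diff(
        old_lines, new_lines, fromfile="OLD_FILE", tofile="NEW_FILE"
    )
)
print("".join(patch), end="")
```

In the output the header names the old file on the ``***`` line and the new
file on the ``---`` line, and changed lines are marked with ``!``.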

For a funny email from Raymond Hettinger about developing for Python read
http://mail.python.org/pipermail/python-dev/2002-September/028725.html .

.. _raw headers in rfc822.Message:
http://mail.python.org/pipermail/python-dev/2002-September/028682.html
.. _rfc822.Message:
http://www.python.org/dev/doc/devel/lib/message-objects.html
.. _email.Message.Message:
http://www.python.org/dev/doc/devel/lib/module-email.Message.html
.. _patches: http://sourceforge.net/patch/?group_id=5470
.. _SourceForge: http://www.sourceforge.net/
.. _Lynx: http://lynx.browser.org/

===================
`type categories`_
===================

Yes, the `same thread`_ from the `last summary`_ is back.  This thread has
become the bane of my summarizing existence.  =)

Aahz asked "why wouldn't we simply use attributes to hold" interfaces that
a class implemented (think of __slots__).  David Abrahams then brought up
the idea of just adding interfaces to the __class__ attribute.

Guido then chimed in on the attributes idea.  He pointed out that this is
how Zope does it, using the __inherits__ attribute.  The limitation is
that "it isn't automatically merged properly on multiple inheritance, and
adding one new interface to it means you have to copy or reference the
base class __inherits__ attribute".  And as for David's idea of just
adding to __class__, that doesn't work because there is no way to limit
the interface; you need "Something like private inheritance" for when an
interface is broken by some inherited class.  David subsequently added the
issue of being able to disinherit when an interface is not valid but is
inherited by default as another problem for using inheritance for
interfaces.
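
The merging limitation Guido described can be shown with a toy example.
This is a sketch under my own assumptions: the attribute name
``__interfaces__``, the class names, and the helper ``declared_interfaces``
are all hypothetical and are not how Zope or any proposal in the thread
spells things.

```python
class Readable:
    __interfaces__ = ("reader",)


class Writable:
    __interfaces__ = ("writer",)


class ReadWrite(Readable, Writable):
    # Nothing declared here, so plain attribute lookup finds only the
    # first base class's tuple; "writer" is silently lost.
    pass


def declared_interfaces(cls):
    """Merge the attribute by hand across the whole method resolution order."""
    merged = []
    for klass in cls.__mro__:
        for name in vars(klass).get("__interfaces__", ()):
            if name not in merged:
                merged.append(name)
    return merged


print(ReadWrite.__interfaces__)
print(declared_interfaces(ReadWrite))
```

A helper like this works around the merge problem, but every tool that
inspects interfaces would have to agree to use it, which is part of why a
plain attribute was seen as limited.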

David then brought up the issue of having Python being so dynamic that you
could inject an interface if you used __class__ like he suggested through
black magic code.  If the injected interface didn't work because of the
inheritance chain, then you have a problem.

Barry Warsaw brought in his objections.  He tried playing Devil's Advocate
by saying that Guido had said that inheritance would not be the only way
to handle interfaces, but that it would be the predominant way.  But this
duality would complicate any conformsto()-like function since it would
have to handle two different ways for a class to get an interface.  Barry
then brought up the objection that he didn't like the idea of using
straight inheritance because he wanted a syntactic way to separate out
interfaces.

As a side note, Guido pointed out that __slots__ is provisional; nicer
syntax will eventually surface when Guido gets over his "fear of adding
new keywords".

.. _type categories:
http://mail.python.org/pipermail/python-dev/2002-September/028738.html
.. _same thread:
http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html#type-categories
.. _last summary:
http://www.ocf.berkeley.edu/~bac/python-dev/summaries/2002-08-16--2002-09-01.html

=======================================
`flextype.c  -- extended type system`_
=======================================

Christian Tismer has come up with a replacement for the etype which is "a
hidden structure that extends types when they are allocated on the heap"
(you can find it in `Objects/typeobject.c`_ in the CVS_).  There is a
limitation with the etype where it could not be extended by metatypes.
Well, Chris worked his magic and came up with a new flextype that allows
overriding of methods.  So with Christian's code you would be able to
override methods in a type without having to hack something together to
handle the overriding correctly; it would be handled automatically.

Through some clarification from Christian and Guido, it was pointed out to
me (as of this moment I am the only one to make any noise on this thread,
and it was for this summary) that this simplifies an esoteric issue; note
the use of the word "metatype" above.  This is type/metatype black magic
hacking.  Spiffy, but something most of us "normal" folk will not have to
worry about.

.. _flextype.c  -- extended type system:
http://mail.python.org/pipermail/python-dev/2002-September/028736.html
.. _Objects/typeobject.c:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/dist/src/Objects/typeobject.c
.. _CVS: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/#dirlist