From  Thu Aug  1 00:48:21 2002
From: (Greg Ewing)
Date: Thu, 01 Aug 2002 11:48:21 +1200 (NZST)
Subject: [Python-Dev] pre-PEP: The Safe Buffer Interface
In-Reply-To: <>
Message-ID: <>

Scott Gilbert <>:

> getreadbufferproc bf_getreadbuffer;
> getwritebufferproc bf_getwritebuffer;
> acquirereadbufferproc bf_acquirereadbuffer;
> acquirewritebufferproc bf_acquirewritebuffer;

Is there really a need for both "get" and "acquire"
methods? Surely if an object requires locking, it
always requires locking, so why can't the "get"
functions simply include the locking operation
if they need it?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Thu Aug  1 00:52:53 2002
From: (Greg Ewing)
Date: Thu, 01 Aug 2002 11:52:53 +1200 (NZST)
Subject: [Python-Dev] pre-PEP: The Safe Buffer Interface
In-Reply-To: <0d2b01c238ab$0e892ff0$e000a8c0@thomasnotebook>
Message-ID: <>

Thomas Heller <>:

> The consequence: mmap objects need a 'buffer lock counter',
> and cannot be closed while the count is >0. Which exception
> is raised then?

Maybe instead of raising an exception at all, the
closing could simply be deferred until the lock
count reaches 0?
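
A minimal sketch of the deferred-close idea, written in Python for brevity. The class name LockCountedResource and its methods are hypothetical stand-ins, not the real mmap API: close() on a locked object only records the request, and the actual close happens when the last lock is released.

```python
class LockCountedResource:
    """Toy stand-in for an mmap-like object (hypothetical, not the mmap API)."""

    def __init__(self):
        self._locks = 0              # number of outstanding buffer locks
        self._close_pending = False  # close() was requested while locked
        self.closed = False

    def acquire(self):
        if self.closed:
            raise ValueError("resource is closed")
        self._locks += 1

    def release(self):
        if self._locks == 0:
            raise RuntimeError("release without matching acquire")
        self._locks -= 1
        # The deferred close happens when the last lock goes away.
        if self._locks == 0 and self._close_pending:
            self._do_close()

    def close(self):
        if self._locks > 0:
            self._close_pending = True   # defer instead of raising
        else:
            self._do_close()

    def _do_close(self):
        self._close_pending = False
        self.closed = True


res = LockCountedResource()
res.acquire()
res.close()                        # requested while one lock is held
closed_while_locked = res.closed   # still open at this point
res.release()
closed_after_release = res.closed  # the deferred close has now run
```

Whether acquire() should still succeed once a close is pending is a design question this sketch leaves open.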

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Thu Aug  1 00:51:36 2002
From: (Mark Hammond)
Date: Thu, 1 Aug 2002 09:51:36 +1000
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

> "Mark Hammond" <> writes:

> > IMO, the Python debugger "interface" should include function entry.
> There goes the time machine: it does.  I just think everyone ignores
> 'call' messages because they're a bit redundant today (because of the
> matter under discussion).

Yes, I should have said "continue to include function entry".

I understood that a patch under discussion may have *removed* this facility
from the debugger.  While I agree it is redundant and most debuggers will
choose to ignore it, I believe removing it from the low level debugger hooks
would be a mistake.


From  Thu Aug  1 01:14:19 2002
From: (Barry A. Warsaw)
Date: Wed, 31 Jul 2002 20:14:19 -0400
Subject: [Python-Dev] PEP 298 - the Fixed Buffer Interface
References: <04da01c237ef$c103ac30$e000a8c0@thomasnotebook>
Message-ID: <>

>>>>> "TH" == Thomas Heller <> writes:

    TH> I've changed PEP 298 to incorporate the latest changes.
    TH> Barry has not yet run pep2html (and I don't want to bother
    TH> him too much with this)

Not a bother.  I had to wait until I got home, but I just pushed it


From  Thu Aug  1 01:17:49 2002
From: (Scott Gilbert)
Date: Wed, 31 Jul 2002 17:17:49 -0700 (PDT)
Subject: [Python-Dev] pre-PEP: The Safe Buffer Interface
In-Reply-To: <>
Message-ID: <>

--- Greg Ewing <> wrote:
> Scott Gilbert <>:
> > getreadbufferproc bf_getreadbuffer;
> > getwritebufferproc bf_getwritebuffer;
> >
> > acquirereadbufferproc bf_acquirereadbuffer;
> > acquirewritebufferproc bf_acquirewritebuffer;
> Is there really a need for both "get" and "acquire"
> methods? Surely if an object requires locking, it
> always requires locking, so why can't the "get"
> functions simply include the locking operation
> if they need it?

That is the proposal.  The get methods are the legacy (non-fixed)



From  Thu Aug  1 02:17:22 2002
From: (David Goodger)
Date: Wed, 31 Jul 2002 21:17:22 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
Message-ID: <>


Pursuant to PEP 287, one of the deliverables of the just-released
Docutils 0.2 is a processing system for
reStructuredText-format PEPs as an alternative to the current PEP
processing.  Here are examples of new-style PEPs (processed to HTML,
with links to the source text as usual):

- (latest)
- (as a proof of concept
  because of its special processing)

Compare to the old-style PEPs:

- (update pending)

Existing old-style PEPs can coexist with reStructuredText PEPs
indefinitely.  What to do with new PEPs is a policy decision that
doesn't have to be made immediately.  PEP 287 puts forward a detailed
rationale for reStructuredText PEPs; especially see the "Questions &
Answers" section, items 4 through 7.

In earlier correspondence Guido critiqued some style issues (since
corrected) and said "I'm sure you can fix all these things with a
simple style sheet change, and then I'm all for allowing Docutils for
PEPs."  I'd appreciate more critiques/suggestions on PEP formatting
issues, no matter how small.  Especially, please point out any
HTML/stylesheet issues with the various browsers.

I hereby formally request permission to deploy Docutils for PEPs on  Here's a deployment plan for your consideration:

- Install the Docutils-modified version of Fredrik Lundh's
  nondist/peps/ script into CVS, along with ancillary
  files.  The modified auto-detects old-style and
  new-style PEPs and processes accordingly.

- Install Docutils 0.2 on the server that does the PEP processing.  I
  don't think it's necessary to put Docutils into Python's CVS.

- Make up a README for the "peps" directory with instructions for
  installing Docutils and running the modified

- Modify PEP 1 (PEP Purpose and Guidelines) and PEP 9 (Sample PEP
  Template) with the new formatting instructions.

- Make an announcement to the Python community.

- I will maintain the software, convert current meta-PEPs to the new
  format as desired, handle PEP conversion updates, and assist other
  PEP authors to convert their PEPs if they wish.

If this is acceptable, to begin I will need write access to CVS and
shell access to the server (however that works; please let
me know what I need to do).  Once I have the necessary access, I will
try to ensure a near-zero impact on the PythonLabs crew.

Feedback is most welcome.

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Thu Aug  1 03:28:53 2002
From: (Barry A. Warsaw)
Date: Wed, 31 Jul 2002 22:28:53 -0400
Subject: [Python-Dev] split('') revisited
References: <>
Message-ID: <>

>>>>> "AK" == Andrew Koenig <> writes:

    AK> Back in February, there was a thread in comp.lang.python (and,
    AK> I think, also on Python-Dev) that asked whether the following
    AK> behavior:

    >> 'abcde'.split('')
    |         Traceback (most recent call last):
    |           File "<stdin>", line 1, in ?
    |         ValueError: empty separator

    AK> was a bug or a feature.  The prevailing opinion at the time
    AK> seemed to be that there was not a sensible, unique way of
    AK> defining this operation, so rejecting it was a feature.

    AK> That answer didn't bother me particularly at the time, but
    AK> since then I have learned a new fact (or perhaps an old fact
    AK> that I didn't notice at the time) that has changed my mind:
    AK> Section 4.2.4 of the library reference says that the 'split'
    AK> method of a regular expression object is defined as

    AK>         Identical to the split() function, using the compiled
    AK> pattern.

    AK> This claim does not appear to be correct:

Actually, I believe what it's saying is that


is the same as

    re.split('', 'abcde')

not that re...split() has anything to do with the split() string


From  Thu Aug  1 05:16:19 2002
From: (Ka-Ping Yee)
Date: Wed, 31 Jul 2002 21:16:19 -0700 (PDT)
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0207312104140.7588-100000@ziggy>

On Wed, 31 Jul 2002, David Goodger wrote:
> I hereby formally request permission to deploy Docutils for PEPs on
>  Here's a deployment plan for your consideration:

I have just read the specification:

It took a long time.  Perhaps it seems not so big to others, but
my personal opinion would be to recommend against this proposal
until the specification fits in, say, 1000 lines and can be absorbed
in ten minutes.  For me, it violates the fits-in-my-brain principle:
the spec is 2500 lines long, and supports six different kinds of
references and five different kinds of lists (even lists with roman
numerals!).  It also violates the one-way-to-do-it principle:
for example, there are a huge variety of ways to do headings,
and two different syntaxes for drawing a table.

I am not against structured text processing systems in general.
I think that something of this flavour would be a great solution
for PEPs and docstrings, and that David has done an impressive
job on RST.  It's just that RST is much too big (for me).

-- ?!ng

"This code is better than any code that doesn't work has any right to be."
    -- Roger Gregory, on Xanadu

From  Thu Aug  1 05:31:07 2002
From: (Eric S. Raymond)
Date: Thu, 1 Aug 2002 00:31:07 -0400
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <Pine.LNX.4.44.0207312104140.7588-100000@ziggy>
References: <> <Pine.LNX.4.44.0207312104140.7588-100000@ziggy>
Message-ID: <>

Ka-Ping Yee <>:
> I am not against structured text processing systems in general.
> I think that something of this flavour would be a great solution
> for PEPs and docstrings, and that David has done an impressive
> job on RST.  It's just that RST is much too big (for me).

And if we're going to pay the transition costs to move to a
heavyweight markup, it ought to be DocBook, same direction GNOME and 
KDE and the Linux kernel and FreeBSD and PHP are going.
		Eric S. Raymond

From  Thu Aug  1 05:42:48 2002
From: (Aahz)
Date: Thu, 1 Aug 2002 00:42:48 -0400
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
References: <> <Pine.LNX.4.44.0207312104140.7588-100000@ziggy> <>
Message-ID: <>

On Thu, Aug 01, 2002, Eric S. Raymond wrote:
> Ka-Ping Yee <>:
>> I am not against structured text processing systems in general.
>> I think that something of this flavour would be a great solution
>> for PEPs and docstrings, and that David has done an impressive
>> job on RST.  It's just that RST is much too big (for me).
> And if we're going to pay the transition costs to move to a
> heavyweight markup, it ought to be DocBook, same direction GNOME and 
> KDE and the Linux kernel and FreeBSD and PHP are going.

Well, reST can generate DocBook easily enough.  The problem I see with
DocBook is the creation/editing side: XML is painful.  Having written
one presentation in pure XML/PythonPoint and another presentation in my
home-grown structured text system that then got converted to XML for
processing by PythonPoint, I'm a big believer in the *concept* of reST.

What remains to be seen is whether reST works well enough in the Real
World [tm].
Aahz (           <*>

Project Vote Smart:

From  Thu Aug  1 07:21:15 2002
From: (Martin v. Loewis)
Date: 01 Aug 2002 08:21:15 +0200
Subject: [Python-Dev] split('') revisited
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> It seems to me that there are four reasonable courses of action:
>    1) Do nothing -- the problem is too trivial to worry about.
>    2) Change string split (and its documentation) to match regexp split.
>    3) Change regexp split (and its documentation) to match string split.
>    4) Change both string split and regexp split to do something else :-)

There is another option:

     5) Change the documentation of re.split to match the implemented

Not that I could say what the implemented behaviour is, though :-(


From  Thu Aug  1 07:32:27 2002
From: (eric jones)
Date: Thu, 1 Aug 2002 01:32:27 -0500
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
Message-ID: <000001c23925$37357a60$777ba8c0@ericlaptop>

I would very much like to see reStructuredText, or some minor variation
on it, move forward as a "standard" for doc-strings very soon.  I have
long lamented not having a prescribed format *and* an associated
processing tool suite included in the standard library.  Even if the
format isn't perfect (I think it looks very good), it is time to pick a
reasonable candidate and go.  

SciPy does not yet have a standard doc-string format.  The .3 release of
SciPy (we're at .2alpha) will primarily be a documentation/testing
effort.  I'd like to use the chosen standard so that we can
auto-generate the reference manual without setting up some complex third
party tools.  The user documentation for SciPy may still end up in TeX
(which is very hard for me to swallow) or Word (I know, I know) because
of their power, but doc-strings need something simpler.  If XML or
something like that is chosen, we'd probably use it, but I'd be less
excited because it doesn't read as well in plain text form.  Also, it
will be much harder to get the scientists that contribute modules to
conform to this.

I watched the doc-sig for many months when SciPy was started, and
there was a lot of discussion on the multitude of different choices.  It
seemed like 50% wanted a dirt simple mark up like Ka-Ping suggests, and
50% wanted TeX or XML for maximum power as Eric R. suggests.  You can't
satisfy both camps, but David seems to have balanced most of the issues
very well.  Looking at David's marked-up PEPs, they read very nicely as
plain text. I'm fairly confident that, with these as an example and
without reading the specification, I can write my own marked up document
or doc-string with little effort.  Diving into the longer spec is only
needed if you want fancy stuff.

There are no doubt millions of choices for marking up doc-strings.
David's is quite reasonable, solves many problems with StructuredText,
*has a champion*, and looks to have a fairly good start on a tool suite.
If another choice is as well balanced, *has a champion*, and has a
prayer of having tools ready in the standard library soon, then let's
consider it.  Otherwise, the argument over the perfect markup choice has
been kicked around enough over the last several years.  Let's just tweak
this one.  I wish for less vertical white space and a simpler heading
markup too, but, not so much that I'm willing to think through all this
as thoroughly as David has. I no longer wish for a "perfect" markup,
just a standard one -- and soon.  

On a related note, distutils is (far) less than perfect (sorry Greg),
and I have cursed it on many occasions.  However, it works, solves a
huge problem, and (with modifications) made building the 130,000 or so
lines of Python/C/Fortran code that is SciPy tractable in a platform
independent way.  Standardizing on reStructuredText will have similar

my 0.02,


From  Thu Aug  1 07:39:24 2002
From: (Tim Peters)
Date: Thu, 01 Aug 2002 02:39:24 -0400
Subject: [Python-Dev] split('') revisited
In-Reply-To: <>
Message-ID: <>

[Andrew Koenig]
> ...
> Section 4.2.4 of the library reference says that the 'split' method of a
> regular expression object is defined as
>         Identical to the split() function, using the compiled pattern.

Supplying words intended to be clear from context, it's saying that the
split method of a regexp object is identical to the re.split() function,
which is true.  In much the same way, list.pop() isn't the same thing as
eyeball.pop() <wink>.

> This claim does not appear to be correct:
>         >>> import re
>         >>> re.compile('').split('abcde')
>         ['abcde']
> This result differs from the result of using the string split method.

True, but it's the same as

>>> import re
>>> re.split('', 'abcde')

which is all the docs are trying to say.
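
A small sketch of the equivalence described above: the split method of a compiled pattern gives the same result as the module-level re.split() function. The sample text and separator are assumptions chosen for illustration. A non-empty separator is used on purpose, because the handling of empty matches has changed in later Python versions.

```python
import re

text = "a,b,,c"
pattern = r","

# Module-level function: the pattern is passed as the first argument.
via_function = re.split(pattern, text)

# Method of a compiled pattern object: same result, pattern compiled once.
via_method = re.compile(pattern).split(text)
```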

> ...
> My first impulse was to argue that (4) is right, and that the behavior
> should be as follows
>         >>> 'abcde'.split('')
> 	['a', 'b', 'c', 'd', 'e']

If that's what you want, list('abcde') is a direct way to get it.

> ...
> I made the counterargument that one could disambiguate by adding the
> rule that no element of the result could be equal to the delimiter.
> Therefore, if s is a string, s.split('') cannot contain any empty
> strings.

Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
regexps in a reasonable way, though:

>>> re.split('.*', 'abcde')
['', '']

Both split pieces there match the pattern.

> However, looking at the behavior of regular expression splitting more
> closely, I become more confused.  Can someone explain the following
> behavior to me?
>         >>> re.compile('a|(x?)').split('abracadabra')
>         ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

From the docs:

    If capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting list.

It should also say that splits never occur at points where the only match is
against an empty string (indeed, that's exactly why re.split('', 'abcde')
doesn't split anywhere).  The logic is like:

    while True:
        find next non-empty match, else break
        emit the slice between this and the end of the last match
        emit all capturing groups
        advance position by length of match
    emit the slice from the end of the last match to the end of the string

It's the last line in the loop body that makes empty matches a wart if
allowed:  they wouldn't advance the position at all, and an infinite loop
would result.  In order to make them do what you think you want, we'd have
to add, at the end of the loop body

        ah, and if the match was empty, advance the position again, by,
        oh, i don't know, how about 1?  That's close to 0 <wink>.

So the pattern matches at the first 'a', and adds '' to the list (the slice
to the left of the first match) and None to the list (the capturing group
didn't participate in the match, but that doesn't excuse it from adding
something to the list).  There are no other non-empty matches until getting
to the second 'a', and then that adds 'br' to the list (the slice between
the current match and the last match), and None again for the
non-participating capturing group.  Etc.  The trailing empty string is the
slice from the end of the last match to the end of the string (which happens
to be empty in this case).
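
The loop described above can be sketched as runnable Python. This is an illustration under stated assumptions, not the re module's implementation: the function name split_skipping_empty is hypothetical, and re.finditer() is used to enumerate matches, with empty ones skipped explicitly. Later Python versions do split on empty matches, so the built-in re.split() may return different results for these patterns.

```python
import re


def split_skipping_empty(pattern, string):
    """Split `string` on non-empty matches of `pattern` only (hypothetical helper)."""
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, string):
        if match.start() == match.end():
            continue                                   # ignore empty matches
        pieces.append(string[last_end:match.start()])  # slice before this match
        pieces.extend(match.groups())                  # all capturing groups
        last_end = match.end()                         # advance past the match
    pieces.append(string[last_end:])                   # trailing slice
    return pieces


with_group = split_skipping_empty(r"a|(x?)", "abracadabra")
without_group = split_skipping_empty(r"a|(?:x?)", "abracadabra")
empty_pattern = split_skipping_empty(r"", "abcde")
```

A group that did not take part in a match contributes None, which is where the None entries in the 'abracadabra' result come from.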

It's unclear to me what you expected instead.  Perhaps this?

>>> re.split('a|(?:x?)', 'abracadabra')
['', 'br', 'c', 'd', 'br', '']

From  Thu Aug  1 09:27:10 2002
From: (Martin v. Löwis)
Date: Thu, 1 Aug 2002 10:27:10 +0200 (CEST)
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Misc,1.24,1.25
Message-ID: <>

Jack Jansen <> writes:

> > ! Force stdin, stdout and stderr to be totally unbuffered.  Note that
> > ! there is internal buffering in xreadlines(), readlines() and file-object
> > ! iterators ("for line in sys.stdin") which is not influenced by this
> > ! option.  To work around this, you will want to use "sys.stdin.readline()"
> > ! inside a "while 1:" loop.
> For readlines() I think this is the right thing to do, but
> xreadlines() and file iterators could actually "do the right thing"
> and revert to a slower scheme if the underlying stream is unbuffered?
> Or is this overkill?

I'm not sure. The patch describes the current state; if anybody
improves that, they should change the man page, too.
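
A minimal sketch of the workaround the man page text describes: call readline() in an explicit loop instead of iterating over the file object. The helper name lines_as_they_arrive is hypothetical, and an in-memory stream stands in for sys.stdin so the example is self-contained.

```python
import io


def lines_as_they_arrive(stream):
    """Yield lines using readline() only, one call per line."""
    while True:
        line = stream.readline()
        if not line:        # an empty string signals end of file
            return
        yield line


# In real use the argument would be sys.stdin; StringIO keeps the demo closed.
demo = io.StringIO("first\nsecond\n")
collected = list(lines_as_they_arrive(demo))
```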


From  Thu Aug  1 09:31:56 2002
From: (Thomas Heller)
Date: Thu, 1 Aug 2002 10:31:56 +0200
Subject: [Python-Dev] pre-PEP: The Safe Buffer Interface
References: <>
Message-ID: <002701c23935$e5831c20$e000a8c0@thomasnotebook>

From: "Greg Ewing" <>
> Scott Gilbert <>:
> > getreadbufferproc bf_getreadbuffer;
> > getwritebufferproc bf_getwritebuffer;
> >
> > acquirereadbufferproc bf_acquirereadbuffer;
> > acquirewritebufferproc bf_acquirewritebuffer;
> Is there really a need for both "get" and "acquire"
> methods? Surely if an object requires locking, it
> always requires locking, so why can't the "get"
> functions simply include the locking operation
> if they need it?

Backward compatibility.
If we change the array object to enter a locked state
when getreadbuffer() is called, it would be surprising.


From  Thu Aug  1 09:29:48 2002
From: (Thomas Wouters)
Date: Thu, 1 Aug 2002 10:29:48 +0200
Subject: [Python-Dev] Re: What to do about the Wiki?
In-Reply-To: <>
References: <> <15688.2985.118330.48738@localhost.localdomain> <> <> <> <>
Message-ID: <>

On Wed, Jul 31, 2002 at 07:56:49PM +0200, M.-A. Lemburg wrote:

> >A process running out of memory, AFAIK.

> In that case, wouldn't it be better to impose a memoryuse limit
> on the user which Apache uses for dealing with CGI
> scripts ? That wouldn't solve any specific Wiki related
> problem, but prevents the server from going offline because
> of memory problems.

There is a memory limit, and the problem is not that a single process
freezes the server. Instead, if a single process's memory limit is 1/4th of
the physical limit, 4 bloated wikis freeze the server. If it's 1/10th, it's
10, and so on.

Thomas Wouters <>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

From  Thu Aug  1 10:00:21 2002
From: (Michael Hudson)
Date: 01 Aug 2002 10:00:21 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: "Mark Hammond"'s message of "Thu, 1 Aug 2002 09:51:36 +1000"
References: <>
Message-ID: <>

"Mark Hammond" <> writes:

> > "Mark Hammond" <> writes:
> > > IMO, the Python debugger "interface" should include function entry.
> >
> > There goes the time machine: it does.  I just think everyone ignores
> > 'call' messages because they're a bit redundant today (because of the
> > matter under discussion).
> Yes, I should have said "continue to include function entry".
> I understood that a patch under discussion may have *removed* this facility
> from the debugger.

Nononononononono.  No.  No.

Currently a trace function can be called for four reasons: 'call',
'line', 'return' and 'raise'.

'call' is called high up in eval_frame, on entry to the code object (I
suspect it is also called on resumption of generators, but also
suspect that this is accidental).  'return' is called when the main
loop finished with why == WHY_RETURN or WHY_YIELD, 'raise' ditto but
why == WHY_EXCEPTION.  None of these are affected by my patch.

At the moment 'line' is called by the SET_LINENO opcode.  My patch
changes it to be called when the co_lnotab indicates execution has
moved onto a different line.

The reason this changes behaviour is that currently a SET_LINENO
opcode is emitted for the def line of every function (I guess this is
to cope with

def functions_like_this(): return 1

).  After my patch there are no SET_LINENO opcodes, so execution is
never on the def line[*], so no 'line' trace event is generated for
the def line, so a debugger that only listens to the 'line' events and
ignores the 'call' events will not stop on that line.

If my patch goes in, I'll probably change pdb to catch 'call' events,
and nag authors of other debuggers that they should do the same.

It is possible to generate an extra 'line' trace event to mimic the
old behaviour, but it's gross.
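
A small sketch of a trace function that records events, using sys.settrace() as found in current CPython, where the exception event is reported under the name 'exception'. The helper record_events and the sample source are assumptions made for illustration. A debugger that wants to stop on function entry would react to the 'call' event, which carries the line number of the def line.

```python
import sys

# Sample functions compiled from a string so their line numbers are known.
SOURCE = (
    "def one_liner(): return 1\n"   # line 1: body shares the def line
    "\n"
    "def typical():\n"              # line 3: def line with no code on it
    "    x = 1\n"                   # line 4
    "    return x\n"                # line 5
)
namespace = {}
exec(compile(SOURCE, "<trace-demo>", "exec"), namespace)


def record_events(func):
    """Run func() under a trace function and return (event, lineno) pairs."""
    events = []

    def tracer(frame, event, arg):
        if frame.f_code is func.__code__:
            events.append((event, frame.f_lineno))
        return tracer               # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)      # restore whatever was installed before
    return events


typical_events = record_events(namespace["typical"])
one_liner_events = record_events(namespace["one_liner"])
```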

> While I agree it is redundant and most debuggers will choose to
> ignore it, I believe removing it from the low level debugger hooks
> would be a mistake.

Now I've spent some minutes explaining myself, you can explain to me
where you got the idea that I was even considering doing so from!


[*] For a typical function which has no code on the def line.

34. The string is a stark data structure and everywhere it is
    passed there is much duplication of process.  It is a perfect
    vehicle for hiding information.
  -- Alan Perlis,

From  Thu Aug  1 10:07:42 2002
From: (Thomas Heller)
Date: Thu, 1 Aug 2002 11:07:42 +0200
Subject: [Python-Dev] PEP 298, final (?) version
Message-ID: <00b101c2393a$e4a01ce0$e000a8c0@thomasnotebook>

Here is PEP 298 in its near final version (not yet checked in).

It seems to me we have to end the discussion and I'm quite
happy with it. If accepted in this form, I can start the
implementation right after the end of my vacation, second half
of August.

The only thing I consider worth changing is to rename the
whole stuff from 'fixed buffer interface' to 'locked buffer
interface', which makes more sense at the current state.


PEP: 298
Title: The Fixed Buffer Interface
Version: $Revision: 1.4 $
Last-Modified: $Date: 2002/07/31 18:48:36 $
Author: Thomas Heller <>
Status: Draft
Type: Standards Track
Created: 26-Jul-2002
Python-Version: 2.3
Post-History: 30-Jul-2002, 1-Aug-2002


Abstract

    This PEP proposes an extension to the buffer interface called the
    'fixed buffer interface'.

    The fixed buffer interface fixes the flaws of the 'old' buffer
    interface [1] as defined in Python versions up to and including
    2.2, and has the following semantics:

        The lifetime of the retrieved pointer is clearly defined and
        controlled by the client.

        The buffer size is returned as a 'size_t' data type, which
        allows access to large buffers on platforms where sizeof(int)
        != sizeof(void *).

    (Guido comments: This second sounds like a change we could also
    make to the "old" buffer interface, if we introduce another flag
    bit that's *not* part of the default flags.)


Specification

    The fixed buffer interface exposes new functions which return the
    size and the pointer to the internal memory block of any python
    object which chooses to implement this interface.

    Retrieving a buffer from an object puts this object in a locked
    state during which the buffer may not be freed, resized, or
    reallocated.
    When the buffer is no longer used, the object must be unlocked
    again by releasing the buffer, which is done by calling another
    function in the fixed buffer interface.  If the object never
    resizes or reallocates the buffer during its lifetime, this
    function may be NULL.  Failure to call this function (if it is
    != NULL) is a programming error and may have unexpected results.

    The fixed buffer interface omits the memory segment model which is
    present in the old buffer interface - only a single memory block
    can be exposed.
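
As a point of comparison only: the buffer protocol in current Python (PEP 3118, a later design and not this PEP) shows the same kind of locking from Python code. In CPython a bytearray refuses to resize while a memoryview of it is alive, and accepts the resize again once the view is released. The variable names are chosen for illustration.

```python
data = bytearray(b"abc")
view = memoryview(data)      # acquiring the buffer locks the bytearray

try:
    data.extend(b"def")      # resizing while a view exists is refused
    resize_blocked = False
except BufferError:
    resize_blocked = True

view.release()               # releasing the buffer unlocks the object
data.extend(b"def")          # now the resize goes through
```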


Implementation

    Define a new flag in Include/object.h:

        /* PyBufferProcs contains bf_acquirefixedreadbuffer,
           bf_acquirefixedwritebuffer, and bf_releasefixedbuffer */
        #define Py_TPFLAGS_HAVE_FIXEDBUFFER (1L<<15)

    This flag would be included in Py_TPFLAGS_DEFAULT:

        #define Py_TPFLAGS_DEFAULT  ( \
                             Py_TPFLAGS_HAVE_FIXEDBUFFER | \

    Extend the PyBufferProcs structure by new fields in

        typedef size_t (*acquirefixedreadbufferproc)(PyObject *,
                                                     const void **);
        typedef size_t (*acquirefixedwritebufferproc)(PyObject *,
                                                      void **);
        typedef void (*releasefixedbufferproc)(PyObject *);

        typedef struct {
                getreadbufferproc bf_getreadbuffer;
                getwritebufferproc bf_getwritebuffer;
                getsegcountproc bf_getsegcount;
                getcharbufferproc bf_getcharbuffer;
                /* fixed buffer interface functions */
                acquirefixedreadbufferproc bf_acquirefixedreadbuffer;
                acquirefixedwritebufferproc bf_acquirefixedwritebuffer;
                releasefixedbufferproc bf_releasefixedbuffer;
        } PyBufferProcs;

    The new fields are present if the Py_TPFLAGS_HAVE_FIXEDBUFFER
    flag is set in the object's type.

    The Py_TPFLAGS_HAVE_FIXEDBUFFER flag implies the

    The acquirefixedreadbufferproc and acquirefixedwritebufferproc
    functions return the size in bytes of the memory block and fill in
    the passed void * pointer on success.  If these functions fail -
    either because an error occurs or no memory block is exposed - they
    must set the void * pointer to NULL and raise an exception.  The
    return value is undefined in these cases and should not be used.

    If calls to these functions succeed, eventually the buffer must be
    released by a call to the releasefixedbufferproc, supplying the
    original object as argument.  The releasefixedbufferproc cannot
    fail.
    Usually these functions aren't called directly; they are called
    through convenience functions declared in Include/abstract.h:

        int PyObject_AcquireFixedReadBuffer(PyObject *obj,
                                           const void **buffer,
                                           size_t *buffer_len);

        int PyObject_AcquireFixedWriteBuffer(PyObject *obj,
                                             void **buffer,
                                             size_t *buffer_len);

        void PyObject_ReleaseFixedBuffer(PyObject *obj);

    The former two functions return 0 on success, set buffer to the
    memory location and buffer_len to the length of the memory block
    in bytes. On failure, or if the fixed buffer interface is not
    implemented by obj, they return -1 and set an exception.

    The latter function doesn't return anything, and cannot fail.

Backward Compatibility

    The size of the PyBufferProcs structure changes if this proposal
    is implemented, but the type's tp_flags slot can be used to
    determine if the additional fields are present.

Reference Implementation

    Will be uploaded to the SourceForge patch manager by the author.

Additional Notes/Comments

    Python strings, unicode strings, mmap objects, and array objects
    would expose the fixed buffer interface.

    mmap and array objects would actually enter a locked state while
    the buffer is active; this is not needed for strings and unicode
    objects.  Resizing locked array objects is not allowed and will
    raise an exception. Whether closing a locked mmap object is an
    error or will only be deferred until the lock count reaches zero
    is an implementation detail.
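
    The locking behaviour described in these notes can be modelled in
    a few lines of Python.  This is only an illustrative sketch, not
    the proposed C implementation: the class LockableBuffer and all of
    its methods are hypothetical names, and the sketch arbitrarily
    picks the "deferred close" option for the mmap-like case.

```python
class LockableBuffer:
    """Toy model of an object exposing the fixed buffer interface.

    All names here are hypothetical; nothing in this class is part of
    the proposal itself.
    """

    def __init__(self, size):
        self._data = bytearray(size)
        self._lock_count = 0
        self._close_pending = False
        self.closed = False

    def acquire(self):
        # Entering the locked state: the memory block must stay put.
        if self.closed or self._close_pending:
            raise ValueError("buffer is closed")
        self._lock_count += 1
        return self._data

    def release(self):
        if self._lock_count == 0:
            raise RuntimeError("release without matching acquire")
        self._lock_count -= 1
        # A close requested while locked happens once the count is zero.
        if self._lock_count == 0 and self._close_pending:
            self._do_close()

    def resize(self, new_size):
        # Array-like behaviour: resizing a locked object is an error.
        if self._lock_count:
            raise BufferError("cannot resize while the buffer is locked")
        self._data = self._data[:new_size].ljust(new_size, b"\0")

    def close(self):
        # mmap-like behaviour: defer the close while locks are held.
        if self._lock_count:
            self._close_pending = True
        else:
            self._do_close()

    def _do_close(self):
        self._data = bytearray()
        self._close_pending = False
        self.closed = True
```

    Raising on close() instead of deferring it would only change the
    first branch of close(); the notes above leave that choice open.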

Community Feedback

    Greg Ewing doubts the fixed buffer interface is needed at all; he
    thinks the normal buffer interface could be used if the pointer is
    (re)fetched each time it's used.  This seems to be dangerous,
    because even innocent looking calls to the Python API like
    Py_DECREF() may trigger execution of arbitrary Python code.
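
    The hazard can be shown from pure Python.  The sketch below
    assumes CPython's reference counting, where dropping the last
    reference runs __del__ immediately; the class name Noisy is made
    up for the illustration.  A C extension holding a raw pointer into
    buf across the equivalent Py_DECREF() would be left with a pointer
    that may no longer be valid.

```python
class Noisy:
    """Hypothetical object whose finalizer mutates another object."""

    def __init__(self, target):
        self.target = target

    def __del__(self):
        # Arbitrary Python code runs here when the last reference to
        # this object goes away; growing the bytearray may well move
        # its underlying memory block.
        self.target.extend(b"x" * 1000)


def drop_reference_demo():
    buf = bytearray(b"abc")
    obj = Noisy(buf)
    size_before = len(buf)
    del obj  # in CPython this is the point where __del__ runs
    return size_before, len(buf)
```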

    The first version of this proposal didn't have the release
    function, but it turned out that this would have been too
    restrictive: mmap and array objects wouldn't have been able to
    implement it, because mmap objects can be closed anytime if not
    locked, and array objects could resize or reallocate the buffer.


    Scott Gilbert came up with the name 'fixed buffer interface'.


    [1] The buffer interface

    [2] The Buffer Problem


    This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70

From  Thu Aug  1 12:22:19 2002
From: (Ka-Ping Yee)
Date: Thu, 1 Aug 2002 04:22:19 -0700 (PDT)
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208010343380.7588-100000@ziggy>

On Thu, 1 Aug 2002, Eric S. Raymond wrote:
> Ka-Ping Yee <>:
> > I am not against structured text processing systems in general.
> > I think that something of this flavour would be a great solution
> > for PEPs and docstrings, and that David has done an impressive
> > job on RST.  It's just that RST is much too big (for me).
> And if we're going to pay the transition costs to move to a
> heavyweight markup, it ought to be DocBook, same direction GNOME and
> KDE and the Linux kernel and FreeBSD and PHP are going.

I would be very unhappy about having to enter and edit inline
documentation in an XML-based markup language.

RST is not what i would call heavyweight *markup*.  It's just a
heavy specification.  There are too many cases to know.  If you
simplified RST in the following ways, we might have something
i would consider reasonably-sized:

    - Choose one way to do headings.
    - Choose one way to do numbered and non-numbered lists.
    - Choose one way to do tables.
    - Drop bibliographic fields.
    - Drop RCS keyword processing.
    - Get rid of option lists (we already have definition lists).
    - Drop some fancy reference features (e.g. auto-numbered and
        auto-symbol footnotes, indirect references, substitutions).
    - Drop inline hyperlink references (we already have inline URLs).
    - Drop inline internal targets (we already have explicit targets).
    - Drop interpreted text (we already have inline literals).
    - Drop citations (we already have footnotes).
    - (Or, in summary -- instead of ten kinds of inline markup, we
      only need four: emphasis, literals, footnotes, and URLs.)
    - Simplify inline markup rules (way too many characters to know).
        Instead of 100 lines describing markup rules, two lines are
        sufficient: emphasis starts from " *" and stops at "*", literals
        go from " `" to "`", and footnotes go from " [" to "]".
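
To give a feel for how small such a rule set is, here is a rough sketch
of the three inline rules as a single regular expression.  It is only an
illustration under stated assumptions: the names INLINE and find_inline
are invented, footnotes are taken to end at "]", and markup at the very
start of the text is accepted as well as markup after whitespace.

```python
import re

# Hypothetical rendering of the simplified inline rules: each kind of
# markup must start at the beginning of the text or after whitespace.
INLINE = re.compile(
    r"(?:(?<=\s)|^)"
    r"(?:\*(?P<emphasis>[^*]+)\*"
    r"|`(?P<literal>[^`]+)`"
    r"|\[(?P<footnote>[^\]]+)\])"
)


def find_inline(text):
    """Return (kind, content) pairs for each piece of inline markup."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in INLINE.finditer(text)]
```

Because an opening "*" has to follow whitespace, an expression such as
"2*3*4" is left alone by this sketch.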

-- ?!ng

"This code is better than any code that doesn't work has any right to be."
    -- Roger Gregory, on Xanadu

From  Thu Aug  1 13:52:48 2002
From: (David Goodger)
Date: Thu, 01 Aug 2002 08:52:48 -0400
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <Pine.LNX.4.44.0207312104140.7588-100000@ziggy>
Message-ID: <>

Ka-Ping Yee wrote:
> I have just read the specification:
> It took a long time.  Perhaps it seems not so big to others, but
> my personal opinion would be to recommend against this proposal
> until the specification fits in, say, 1000 lines and can be absorbed
> in ten minutes.

The specification is, as its title says, a *specification*.  It's a detailed
description of the markup, intended to guide the *developer* who is writing
a parser or other tool.  It's not user documentation.

For that, see the quick reference at  It's only 1153 lines of HTML
(with lots of blank lines and linebreaks, hand-written before the
reStructuredText parser could handle everything).

Perhaps you started at the wrong end.  The best place to start is with "A
ReStructuredText Primer" by Richard Jones, at (which *is* generated from
text).  It's only 335 lines long :-).  It leads to the quick reference,
which leads to the spec itself.

And there was this item of the "deployment plan":

    - Modify PEP 1 (PEP Purpose and Guidelines) and PEP 9 (Sample PEP
      Template) with the new formatting instructions.

PEP 9 could contain or point to a short & to-the-point overview of the
markup.  I see no problem coming up with a user document that's more complete
than the "Primer" above but still weighs in at under 1000 lines.  But with
the docs mentioned above, is it necessary?

You could also begin by perusing an example.  Look at the markup in, marked up in reStructuredText in
the intended way.  With the exception of embedded references and targets
(for which there is no plaintext equivalent), none of the markup there looks
like markup, and should be very easy to follow.  Now look at the processed
result (; I think the return is
worth the investment.

> For me, it violates the fits-in-my-brain principle:
> the spec is 2500 lines long, and supports six different kinds of
> references and five different kinds of lists (even lists with roman
> numerals!).  It also violates the one-way-to-do-it principle:
> for example, there are a huge variety of ways to do headings,
> and two different syntaxes for drawing a table.

How many times have we heard this?  "All we need are paragraphs and bullet
lists."  That line of argument has been going on for at least six years, and
has hampered progress all along.

IMHO, variety in markup is good and necessary.  Artificially limiting the
markup makes for limited usefulness.

OTOH, I have no problem with mandating standard uses, like a standard set of
section title adornments.

> I am not against structured text processing systems in general.
> It's just that RST is much too big (for me).

Somehow I think that "size of the spec" is a specious argument.  Take the
Python spec, for example: many times the size of the reStructuredText spec
and yet it arguably fits in many different sized brains.  I know they're
different things and I'm not implying they're the same; the markup is a much
smaller thing.  reStructuredText is very practical in its scope.  Constructs
are there because they're useful and *used*.  If we removed all the items
you list, we'd end up with a crippled markup of little use to anyone.

> I think that something of this flavour would be a great solution
> for PEPs and docstrings, and that David has done an impressive
> job on RST.

Thank you.

Anyhow, off to work.  I'll follow up on further posts this evening.

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Thu Aug  1 14:14:04 2002
From: (Andrew Koenig)
Date: Thu, 1 Aug 2002 09:14:04 -0400 (EDT)
Subject: [Python-Dev] split('') revisited
In-Reply-To: <> (message from
 Tim Peters on Thu, 01 Aug 2002 02:39:24 -0400)
References: <>
Message-ID: <>

>> Section 4.2.4 of the library reference says that the 'split' method of a
>> regular expression object is defined as
>> Identical to the split() function, using the compiled pattern.

Tim> Supplying words intended to be clear from context, it's saying that the
Tim> split method of a regexp object is identical to the re.split() function,
Tim> which is true.  In much the same way, list.pop() isn't the same thing as
Tim> eyeball.pop() <wink>.

Right.  I missed the fact that there's another split.  Sorry about that.

>> My first impulse was to argue that (4) is right, and that the behavior
>> should be as follows
>> >>> 'abcde'.split('')
>> ['a', 'b', 'c', 'd', 'e']

Tim> If that's what you want, list('abcde') is a direct way to get it.

True, but that doesn't explain why it is useful to have
'abcde'.split('') and re.split('', 'abcde') behave differently.

>> I made the counterargument that one could disambiguate by adding the
>> rule that no element of the result could be equal to the delimiter.
>> Therefore, if s is a string, s.split('') cannot contain any empty
>> strings.

Tim> Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
Tim> regexps in a reasonable way, though:

Tim> >>> re.split('.*', 'abcde')
Tim> ['', '']

Tim> Both split pieces there match the pattern.

Yes, that's part of the source of my confusion.

>> However, looking at the behavior of regular expression splitting more
>> closely, I become more confused.  Can someone explain the following
>> behavior to me?

>> >>> re.compile('a|(x?)').split('abracadabra')
>> ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

>> From the docs:

Tim>     If capturing parentheses are used in pattern, then the text of all
Tim>     groups in the pattern are also returned as part of the resulting list.

OK -- as I said, I had assumed that split() was referring to the other
split function, probably because both of them were offscreen at the time.

Tim> It should also say that splits never occur at points where the only match is
Tim> against an empty string (indeed, that's exactly why re.split('', 'abcde')
Tim> doesn't split anywhere).  The logic is like:

Tim>     while True:
Tim>         find next non-empty match, else break
Tim>         emit the slice between this and the end of the last match
Tim>         emit all capturing groups
Tim>         advance position by length of match
Tim>     emit the slice from the end of the last match to the end of the string

Tim> It's the last line in the loop body that makes empty matches a wart if
Tim> allowed:  they wouldn't advance the position at all, and an infinite loop
Tim> would result.  In order to make them do what you think you want, we'd have
Tim> to add, at the end of the loop body

Tim>         ah, and if the match was empty, advance the position again, by,
Tim>         oh, i don't know, how about 1?  That's close to 0 <wink>.

Indeed, that's an arbitrary rule -- just about as arbitrary as the one
that you abbreviated above, which should really be

	    find the next match, but if the match is empty, disregard it;
	    instead, find the next match with a length of at least,
	    oh, I don't know, how about 1?  That's close to 0 <wink>.

What I'm trying to do is come up with a useful example to convince myself
that one is better than the other.
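
For comparison, the two rules can be written out side by side.  This is
a sketch reconstructed from the pseudocode quoted above, not the actual
re module source; the function names split_skip_empty and
split_allow_empty are invented for the illustration.

```python
import re


def split_skip_empty(pattern, string):
    """Rule 1: a match against the empty string never causes a split."""
    regex = re.compile(pattern)
    pieces, last, pos = [], 0, 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Empty match: disregard it and look again one step later.
            pos = m.start() + 1
            continue
        pieces.append(string[last:m.start()])  # slice since the last match
        pieces.extend(m.groups())              # all capturing groups
        last = pos = m.end()                   # advance by length of match
    pieces.append(string[last:])
    return pieces


def split_allow_empty(pattern, string):
    """Rule 2: empty matches split too; then step forward by one."""
    regex = re.compile(pattern)
    pieces, last, pos = [], 0, 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        pieces.append(string[last:m.start()])
        pieces.extend(m.groups())
        last = m.end()
        # Without the extra step an empty match would loop forever.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    pieces.append(string[last:])
    return pieces
```

Under the first rule, splitting 'abcde' on '' gives back ['abcde']
unsplit.  Under the second it gives the five characters, but with an
extra empty string at each end, because the pattern also matches before
the first character and after the last.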

From  Thu Aug  1 14:22:55 2002
From: (Mark Hammond)
Date: Thu, 1 Aug 2002 23:22:55 +1000
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <00b101c2393a$e4a01ce0$e000a8c0@thomasnotebook>
Message-ID: <>

> The only thing I consider worth changing is to rename the
> whole stuff from 'fixed buffer interface' to 'locked buffer
> interface', which makes more sense at the current state.

Agreed.


Sorry if I missed this before, but:
>     If the object never resizes or reallocates the buffer
>     during its lifetime, this function may be NULL. Failure to call
>     this function (if it is != NULL) is a programming error and may
>     have unexpected results.

Not sure I like this.  I would prefer to put the burden of "you must provide
a (possibly empty) release function" on the few buffer interface
implementers than the many (ie, potentially any extension author) buffer
interface consumers.

I believe there is a good chance of extension authors testing against, and
therefore assuming, non-NULL implementations of this function.  OTOH, if
every fixed buffer consumer assumes a non-NULL implementation, people
implementing this interface will quickly see their error well before it gets
into the wild.

No biggie, but worth considering...


From  Thu Aug  1 14:28:59 2002
From: (Mark Hammond)
Date: Thu, 1 Aug 2002 23:28:59 +1000
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>


> At the moment 'line' is called by the SET_LINENO opcode.  My patch
> changes it to be called when the co_lnotab indicates execution has
> moved onto a different line.
> The reason this changes behaviour is that currently a SET_LINENO
> opcode is emitted for the def line of every function (I guess this is
> to cope with
> def functions_like_this(): return 1

Right - sorry - my misunderstanding.

> If my patch goes in, I'll probably change pdb to catch 'call' events,
> and nag authors of other debuggers that they should do the same.

Yes, I agree this should not be necessary.  You may even find debugger
implementers already hack around this :)  And yes, I agree that if debugger
implementers really want to hook something on function entry, they should
use the facility explicitly designed for that purpose ;)
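
For anyone who has not looked at the trace hook, here is a small sketch
of a trace function that records both 'call' and 'line' events for one
function.  It uses sys.settrace as it exists in current Python; the
helper name record_events and the two sample functions are invented for
the illustration, and line numbers are reported relative to the def line.

```python
import sys


def record_events(func, *args):
    """Run func under a trace function and collect its trace events."""
    code = func.__code__
    events = []

    def tracer(frame, event, arg):
        # Only record events that belong to the function under test.
        if frame.f_code is code:
            events.append((event, frame.f_lineno - code.co_firstlineno))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return events


def one_liner(): return 1


def two_lines():
    x = 1
    return x
```

A debugger that wants to stop on function entry can act on the 'call'
event, which arrives before any 'line' event for that frame.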

> It is possible to generate an extra 'line' trace event to mimic the
> old behaviour, but it's gross.


> Now I've spent some minutes explaining myself, you can explain to me
> where you got the idea that I was even considering doing so from!

Sorry, I just didn't re-read the thread well enough.  Jumping to conclusions
seems to be one of my strong points ;)


From  Thu Aug  1 14:50:11 2002
From: (M.-A. Lemburg)
Date: Thu, 01 Aug 2002 15:50:11 +0200
Subject: [Python-Dev] Enabling Python cross-compilation
Message-ID: <>

Someone just posted this link to the German Python mailing
list:

The page contains instructions to cross compile Python for
the ARM processor and includes a patch which enables cross
compiling Python in a very generic way.

Wouldn't it make sense to add this kind of support to the
standard dist ?

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Thu Aug  1 15:19:37 2002
From: (Thomas Heller)
Date: Thu, 1 Aug 2002 16:19:37 +0200
Subject: [Python-Dev] PEP 298, final (?) version
References: <>
Message-ID: <024501c23966$77c63100$e000a8c0@thomasnotebook>

> >     If the object never resizes or reallocates the buffer
> >     during its lifetime, this function may be NULL. Failure to call
> >     this function (if it is != NULL) is a programming error and may
> >     have unexpected results.
> Not sure I like this.  I would prefer to put the burden of "you must provide
> a (possibly empty) release function" on the few buffer interface
> implementers than the many (ie, potentially any extension author) buffer
> interface consumers.
> I believe there is a good chance of extension authors testing against, and
> therefore assuming, non-NULL implementations of this function.  OTOH, if
> every fixed buffer consumer assumes a non-NULL implementation, people
> implementing this interface will quickly see their error well before it gets
> into the wild.
> No biggie, but worth considering...

I thought nobody would call these functions directly, but only through
the PyObject_AcquireBuffer/PyObject_ReleaseBuffer functions, but
maybe you're right.

So probably it should be required that the release function must be
implemented if any of the acquire functions is implemented.
We could even implement the lockcount in every fixed buffer object
even if it does no actual locking, and issue a warning or raise
an exception in the destructor if it is not zero.
(Or can we somehow prevent clients from calling these functions
without going through the PyObject_ funcs?)


From  Thu Aug  1 15:38:40 2002
From: (Scott Gilbert)
Date: Thu, 1 Aug 2002 07:38:40 -0700 (PDT)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <00b101c2393a$e4a01ce0$e000a8c0@thomasnotebook>
Message-ID: <>

--- Thomas Heller <> wrote:
>                                                       void **);
>         typedef void (*releasefixedbufferproc)(PyObject *);
>     If calls to these functions succeed, eventually the buffer must be
>     released by a call to the releasefixedbufferproc, supplying the
>     original object as argument.  The releasefixedbufferproc cannot
>     fail.
>         void PyObject_ReleaseFixedBuffer(PyObject *obj);

Would it be useful to allow bf_releasefixedbuffer to return an int
indicating an exception?  For instance, it could raise an exception if the
extension errantly releases more times than it has acquired (a negative
lock count).  Just a thought.

>     Python strings, unicode strings, mmap objects, and array objects
>     would expose the fixed buffer interface.
>     mmap and array objects would actually enter a locked state while
>     the buffer is active, this is not needed for strings and unicode
>     objects.  Resizing locked array objects is not allowed and will
>     raise an exception. Whether closing a locked mmap object is an
>     error or will only be deferred until the lock count reaches zero
>     is an implementation detail.

The mmap object is a good candidate for this, but I'm a little worried
about adding it to array.  I'm not saying it shouldn't be done, but I can
imagine a surprised user who:

   - has an existing application using the array module
   - starts making use of a new extension that uses the fixed/locked
     buffer interface
   - gets an exception in code that never raised that exception before

With the "deferred closed" strategy for the mmap object, this can't be a
problem there.  Just something to think about (or ignore :-).



From  Thu Aug  1 15:22:35 2002
From: (Jack Jansen)
Date: Thu, 1 Aug 2002 16:22:35 +0200
Subject: [Python-Dev] Enabling Python cross-compilation
In-Reply-To: <>
Message-ID: <>

On Thursday, August 1, 2002, at 03:50, M.-A. Lemburg wrote:

> Someone just posted this link to the German Python mailing
> list:
> The page contains instructions to cross compile Python for
> the ARM processor and includes a patch which enables cross
> compiling Python in a very generic way.

I like the idea, but I think it could be implemented slightly
more cleanly (without need for the make clean and all the environment
variables). I was thinking something along the lines of having 
two build subdirectories (as is already supported currently), 
let's say build-host and build-crosscompile. Then you would 
first configure and build normally in build-host, and then in 
build-crosscompile do something like "CC=xxxx ETC ETC 
../configure --hostbuilddir=../build-host". hostbuilddir would 
be used for finding python and pgen, and would default to ".".
And I think all the funnies like EXEEXT would work correctly too.

(Please note that I'm not volunteering to write the code, 
crosscompiling is not on my current wishlist)
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Thu Aug  1 16:18:22 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:18:22 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Thu, 01 Aug 2002 10:00:21 BST."
References: <>
Message-ID: <>

> After my patch there are no SET_LINENO opcodes, so execution is
> never on the def line[*], so no 'line' trace event is generated for
> the def line, so a debugger that only listens to the 'line' events and
> ignores the 'call' events will not stop on that line.

If the argument list contains embedded tuples, there's code executed
to unpack those before the first line of the function.  Example:

  >>> def f(a, (b, c), d):
  ...     print a, b, c, d
  >>> f(1, (2, 3), 4)
  1 2 3 4
  >>> f(1, 2, 3)
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    File "<stdin>", line 1, in f
  TypeError: unpack non-sequence

I hope the debugger will stop *before* this unpacking happens!  It
does now:

  >>> import pdb
  >>> pdb.run("f(1, 2, 3)")
  > <string>(0)?()
  (Pdb) s
  > <string>(1)?()
  (Pdb)
  > <stdin>(1)f()
  (Pdb)
  TypeError: 'unpack non-sequence'
  > <stdin>(1)f()
  (Pdb) q

--Guido van Rossum (home page:

From  Thu Aug  1 16:34:04 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:34:04 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: Your message of "Thu, 01 Aug 2002 23:22:55 +1000."
References: <>
Message-ID: <>

> > The only thing I consider worth changing is to rename the
> > whole stuff from 'fixed buffer interface' to 'locked buffer
> > interface', which makes more sense at the current state.
> Agreed.

Ditto.  Ready for implementation now.

> Sorry if I missed this before, but:
> >     If the object never resizes or reallocates the buffer
> >     during its lifetime, this function may be NULL. Failure to call
> >     this function (if it is != NULL) is a programming error and may
> >     have unexpected results.
> Not sure I like this.  I would prefer to put the burden of "you must provide
> a (possibly empty) release function" on the few buffer interface
> implementers than the many (ie, potentially any extension author) buffer
> interface consumers.
> I believe there is a good chance of extension authors testing against, and
> therefore assuming, non-NULL implementations of this function.  OTOH, if
> every fixed buffer consumer assumes a non-NULL implementation, people
> implementing this interface will quickly see their error well before it gets
> into the wild.
> No biggie, but worth considering...

Hm, *users* of the interface would always go through this API:

        int PyObject_AcquireFixedReadBuffer(PyObject *obj,
                                           const void **buffer,
                                           size_t *buffer_len);

        int PyObject_AcquireFixedWriteBuffer(PyObject *obj,
                                             void **buffer,
                                             size_t *buffer_len);

        void PyObject_ReleaseFixedBuffer(PyObject *obj);

But I'm still very concerned that if most built-in types
(e.g. strings, bytes) don't implement the release functionality, it's
too easy for an extension to seem to work while forgetting to release
the buffer.  I recommend that at least some built-in types implement
the acquire/release functionality with a counter, and assert that the
counter is zero when the object is deleted -- if the assert fails,
someone DECREF'ed their reference to the object without releasing it.
(The rule should be that you must own a reference to the object while
you've acquired the object.)

For strings that might be impractical because the string object would
have to grow 4 bytes to hold the counter; but the new bytes object
(PEP 296) could easily implement the counter, and the array object too
-- that way there will be plenty of opportunity to test proper use of
the protocol.
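
As a rough model of what such a counter buys, here is a Python sketch.
The names CountedBuffer, check_on_delete and locked are hypothetical,
and check_on_delete only stands in for the assertion a real type would
make in its deallocator.

```python
from contextlib import contextmanager


class CountedBuffer:
    """Hypothetical stand-in for a type with an acquire/release counter."""

    def __init__(self, data):
        self._data = bytearray(data)
        self.lock_count = 0

    def acquire(self):
        self.lock_count += 1
        return self._data

    def release(self):
        # Releasing more often than acquiring is the caller's bug.
        if self.lock_count == 0:
            raise RuntimeError("released more often than acquired")
        self.lock_count -= 1

    def check_on_delete(self):
        # Models the check made when the object is deleted: a non-zero
        # count means somebody forgot to release the buffer.
        if self.lock_count != 0:
            raise AssertionError(
                "%d lock(s) still held at deletion" % self.lock_count)


@contextmanager
def locked(buf):
    """Pair every acquire with a release, even if the body raises."""
    memory = buf.acquire()
    try:
        yield memory
    finally:
        buf.release()
```

A caller that writes "with locked(buf) as memory:" cannot forget the
release, which is the kind of mistake the counter is meant to expose.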

--Guido van Rossum (home page:

From  Thu Aug  1 16:22:06 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:22:06 -0400
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: Your message of "Thu, 01 Aug 2002 04:22:19 PDT."
References: <Pine.LNX.4.44.0208010343380.7588-100000@ziggy>
Message-ID: <>

> On Thu, 1 Aug 2002, Eric S. Raymond wrote:
> > Ka-Ping Yee <>:
> > > I am not against structured text processing systems in general.
> > > I think that something of this flavour would be a great solution
> > > for PEPs and docstrings, and that David has done an impressive
> > > job on RST.  It's just that RST is much too big (for me).
> >
> > And if we're going to pay the transition costs to move to a
> > heavyweight markup, it ought to be DocBook, same direction GNOME and
> > KDE and the Linux kernel and FreeBSD and PHP are going.
> I would be very unhappy about having to enter and edit inline
> documentation in an XML-based markup language.

Agreed 110%.  Perhaps Eric thought we were talking about the core
Python docs?  David was only talking about PEPs right now.

> RST is not what i would call heavyweight *markup*.  It's just a
> heavy specification.  There are too many cases to know.  If you
> simplified RST in the following ways, we might have something
> i would consider reasonably-sized:
>     - Choose one way to do headings.
>     - Choose one way to do numbered and non-numbered lists.
>     - Choose one way to do tables.
>     - Drop bibliographic fields.
>     - Drop RCS keyword processing.
>     - Get rid of option lists (we already have definition lists).
>     - Drop some fancy reference features (e.g. auto-numbered and
>         auto-symbol footnotes, indirect references, substitutions).
>     - Drop inline hyperlink references (we already have inline URLs).
>     - Drop inline internal targets (we already have explicit targets).
>     - Drop interpreted text (we already have inline literals).
>     - Drop citations (we already have footnotes).
>     - (Or, in summary -- instead of ten kinds of inline markup, we
>       only need four: emphasis, literals, footnotes, and URLs.)
>     - Simplify inline markup rules (way too many characters to know).
>         Instead of 100 lines describing markup rules, two lines are
>         sufficient: emphasis starts from " *" and stops at "*", literals
>         go from " `" to "`", and footnotes go from " [" to "]".

Perhaps this could be a preferred subset?

--Guido van Rossum (home page:

From  Thu Aug  1 16:47:25 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:47:25 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: Your message of "Thu, 01 Aug 2002 07:38:40 PDT."
References: <>
Message-ID: <>

> >         void PyObject_ReleaseFixedBuffer(PyObject *obj);
> > 
> Would it be useful to allow bf_releasefixedbuffer to return an int
> indicating an exception?  For instance, it could raise an exception if the
> extension errantly releases more times than it has acquired (a negative
> lock count).  Just a thought.

OTOH, it means that the caller would have to check for errors.  It may
make more sense to make this a fatal error, since it's purely the
(or at least *a*) caller's fault.

> >     Python strings, unicode strings, mmap objects, and array objects
> >     would expose the fixed buffer interface.
> > 
> >     mmap and array objects would actually enter a locked state while
> >     the buffer is active, this is not needed for strings and unicode
> >     objects.  Resizing locked array objects is not allowed and will
> >     raise an exception. Whether closing a locked mmap object is an
> >     error or will only be deferred until the lock count reaches zero
> >     is an implementation detail.
> The mmap object is a good candidate for this, but I'm a little worried
> about adding it to array.  I'm not saying it shouldn't be done, but I can
> imagine a surprised user who:
>    - has an existing application using the array module
>    - starts making use of a new extension that uses the fixed/locked
>      buffer interface
>    - gets an exception in code that never raised that exception before

Hm.  As long as it's not too hard to point out the cause (using the
new extension) I don't think this would be a problem.

--Guido van Rossum (home page:

From  Thu Aug  1 16:36:59 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:36:59 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: Your message of "Thu, 01 Aug 2002 16:19:37 +0200."
References: <>
Message-ID: <>

> (Or can we somehow prevent clients from calling these functions
> without going through the PyObject_ funcs?)

No, don't worry.  Once the PyObject_ API exists, nobody will bother
calling the slot functions directly.

--Guido van Rossum (home page:

From  Thu Aug  1 16:56:49 2002
From: (Michael Hudson)
Date: 01 Aug 2002 16:56:49 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Guido van Rossum's message of "Thu, 01 Aug 2002 11:18:22 -0400"
References: <> <> <>
Message-ID: <>

Guido van Rossum <> writes:

> > After my patch there are no SET_LINENO opcodes, so execution is
> > never on the def line[*], so no 'line' trace event is generated for
> > the def line, so a debugger that only listens to the 'line' events and
> > ignores the 'call' events will not stop on that line.
> If the argument list contains embedded tuples, there's code executed
> to unpack those before the first line of the function.

Well, if there's code there, then the debugger stops.  I know it's
confusing to have intuitive behaviour in this area...

>  Example:
>   >>> def f(a, (b, c), d):
>   ...     print a, b, c, d
>   ... 
>   >>> f(1, (2, 3), 4)
>   1 2 3 4
>   >>> f(1, 2, 3)
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>     File "<stdin>", line 1, in f
>   TypeError: unpack non-sequence
>   >>>
> I hope the debugger will stop *before* this unpacking happens!  It
> does now:
>   >>> import pdb
>   >>> pdb.run("f(1, 2, 3)")
>   > <string>(0)?()
>   (Pdb) s
>   > <string>(1)?()
>   (Pdb) 
>   > <stdin>(1)f()
>   (Pdb) 
>   TypeError: 'unpack non-sequence'
>   > <stdin>(1)f()
>   (Pdb) q
>   >>> 

Still does:

$ cat
def f(a, (b, c), d):
    print a, b, c, d
$ ./python 
Python 2.3a0 (#14, Aug  1 2002, 16:48:20) 
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-108.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdb, t
>>> pdb.run("t.f(1, 2, 3)")
> <string>(1)?()
(Pdb) s
> /home/mwh/src/sf/python/dist/src/build/
-> def f(a, (b, c), d):
TypeError: 'unpack non-sequence'
> /home/mwh/src/sf/python/dist/src/build/
-> def f(a, (b, c), d):
(Pdb) q

Anyway, I think I'm done now (as you may be able to tell from the pile
of patch notification emails that just landed in your inbox :).

These issues from my original mail in this thread still haven't be

4) The patch installs a descriptor for f_lineno so that there is no
   incompatibility for Python code.  The question is what to do with
   the f_lineno field in the C struct?  Remove it?  That would
   (probably) mean bumping PY_API_VERSION.  Leave it in?  Then its
   contents would usually be meaningless (keeping it up to date would
   rather defeat the point of this patch).

I think leaving f_lineno there but useless is the way to go.  If we
actually make incompatible changes for other reasons, then it can

8) I haven't measured the performance impact of the changes to code
   that is tracing or code that isn't.  There's a possible
   optimization mentioned in the patch for traced code.  For not
   traced code it MAY be worthwhile putting the tracing support code
   in a static function somewhere so there's less code to jump over in
   the main loop (for i-caches and such).

Still haven't done this.

9) This patch stops LLTRACE telling you when execution moves onto a
   different line.  This could be restored, but

   a) I expect I'm the only person to have used LLTRACE recently
      (debugging this patch).
   b) This will cause obfuscation, so I'd prefer to do it last.

No change here either.


  The gripping hand is really that there are morons everywhere, it's
  just that the Americon morons are funnier than average.
                              -- Pim van Riezen, alt.sysadmin.recovery

From  Thu Aug  1 16:35:47 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 11:35:47 -0400
Subject: [Python-Dev] Enabling Python cross-compilation
In-Reply-To: Your message of "Thu, 01 Aug 2002 15:50:11 +0200."
References: <>
Message-ID: <>

> Someone just posted this link to the German Python mailing
> list:
> The page contains instructions to cross compile Python for
> the ARM processor and includes a patch which enables cross
> compiling Python in a very generic way.
> Wouldn't it make sense to add this kind of support to the
> standard dist ?


--Guido van Rossum (home page:

From  Thu Aug  1 17:14:25 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 12:14:25 -0400
Subject: [Python-Dev] Speed of
In-Reply-To: Your message of "Wed, 31 Jul 2002 19:23:08 PDT."
References: <>
Message-ID: <>

[Tim, in python-checkins]
> Bizarre:  this takes 11x longer to run if and only if test_longexp is
> run before it, on my box.  The bigger REPS is in test_longexp, the
> slower this gets.  What happens on your box?  It's not gc on my box
> (which is good, because gc isn't a plausible candidate here).
> The slowdown is massive in the parts of test_sort that implicitly
> invoke a new-style class's __lt__ or __cmp__ methods.  If I boost
> REPS large enough in test_longexp, even the test_sort tests on an array
> of size 64 visibly c-r-a-w-l.  The relative slowdown is even worse in
> a debug build.  And if I reduce REPS in test_longexp, the slowdown in
> test_sort goes away.
> test_longexp does do horrid things to Win98's management of user
> address space, but I thought I had made that a whole lot better a month
> or so ago (by overallocating aggressively in the parser).

It's about the same on my Linux box (system time is CPU time spent in
the kernel):

test_longexp alone takes 1.92 user + 0.22 system seconds.
test_sort alone takes 1.71 user + 0.01 system seconds.
test_sort + test_longexp takes 3.62 user + 0.18 system seconds.
test_longexp + test_sort takes 38.05 user and 0.34 system seconds!!!

I'll see if I can get this to run under a profiler.
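
For reference, a split into user and system CPU time like the one above
can be taken from inside Python.  This is a rough sketch in present-day
Python 3 using os.times(); the workload function is only a stand-in and
not one of the tests discussed here.

```python
import os

def cpu_times(func, *args):
    """Run func(*args) and return (user, system) CPU seconds it consumed."""
    before = os.times()
    func(*args)
    after = os.times()
    return after.user - before.user, after.system - before.system

def workload(n):
    # Stand-in for a test run: sort with a Python-level key function.
    data = list(range(n, 0, -1))
    data.sort(key=lambda v: v)

user, system = cpu_times(workload, 100000)
print("%.2f user + %.2f system seconds" % (user, system))
```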

--Guido van Rossum (home page:

From  Thu Aug  1 17:54:47 2002
From: (Martin v. Loewis)
Date: 01 Aug 2002 18:54:47 +0200
Subject: [Python-Dev] Enabling Python cross-compilation
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> I like the idea, but I think it could be implemented slightly cleaner
> (without need for the make clean and all the environment variables). I
> was thinking something along the lines of having two build
> subdirectories (as is already supported currently), let's say
> build-host and build-crosscompile. 

I think requiring the host compilation is wrong in the first
place. Instead, when cross-compiling, Python should require that the
host python already exists - whether from a previous configure;make;
make install, or because a host Python had been there all along (it
doesn't even have to be the same Python version).

Likewise, building the host pgen in a cross-compilation should not be
necessary, since the pgen output is shipped with the source release.

configure already supports cross-compilation, so setting CC should not
be necessary (since it will automatically find arm-linux-gcc if you
have a GNU cross-compilation environment).

I don't volunteer to write patches, either, but I do volunteer to
review patches.


From  Thu Aug  1 17:19:46 2002
From: (Zack Weinberg)
Date: Thu, 1 Aug 2002 09:19:46 -0700
Subject: [Python-Dev] Weird error handling in os._execvpe
Message-ID: <>

While testing my rewrite I ran into this mess in

def _execvpe(file, args, env=None):
    # ...
    if not _notfound:
        if sys.platform[:4] == 'beos':
            #  Process handling (fork, wait) under BeOS (up to 5.0)
            #  doesn't interoperate reliably with the thread interlocking
            #  that happens during an import.  The actual error we need
            #  is the same on BeOS for et al., ENOENT.
            try: unlink('/_#.# ## #.#')
            except error, _notfound: pass
        else:
            import tempfile
            t = tempfile.mktemp()
            # Exec a file that is guaranteed not to exist
            try: execv(t, ('blah',))
            except error, _notfound: pass
    exc, arg = error, _notfound
    for dir in PATH:
        fullname = path.join(dir, file)
        try:
            apply(func, (fullname,) + argrest)
        except error, (errno, msg):
            if errno != arg[0]:
                exc, arg = error, (errno, msg)
    raise exc, arg

This appears to be an overcomplicated, unreliable way of writing

import errno

def _execvpe(file, args, env=None):
    # ...
    for dir in PATH:
        fullname = path.join(dir, file)
        try:
            apply(func, (fullname,) + argrest)
        except error, (err, msg):
            if err != errno.ENOENT: # and err != errno.ENOTDIR, maybe
                raise
    raise error, (err, msg)

Can anyone explain why it is done this way?


From  Thu Aug  1 17:26:09 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 12:26:09 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Thu, 01 Aug 2002 16:56:49 BST."
References: <> <> <>
Message-ID: <>

> Well, if there's code there, then the debugger stops.  I know it's
> confusing to have intuitive behaviour in this area...


> Anyway, I think I'm done now (as you may be able to tell from the pile
> of patch notification emails that just landed in your inbox :).
> These issues from my original mail in this thread still haven't been
> addressed:
> 4) The patch installs a descriptor for f_lineno so that there is no
>    incompatibility for Python code.  The question is what to do with
>    the f_lineno field in the C struct?  Remove it?  That would
>    (probably) mean bumping PY_API_VERSION.  Leave it in?  Then its
>    contents would usually be meaningless (keeping it up to date would
>    rather defeat the point of this patch).
> I think leaving f_lineno there but useless is the way to go.  If we
> actually make incompatible changes for other reasons, then it can
> disappear.

Agreed.


> 8) I haven't measured the performance impact of the changes to code
>    that is tracing or code that isn't.  There's a possible
>    optimization mentioned in the patch for traced code.  For not
>    traced code it MAY be worthwhile putting the tracing support code
>    in a static function somewhere so there's less code to jump over in
>    the main loop (for i-caches and such).
> Still haven't done this.

I don't care if it slows down tracing, but I'd like it not to slow
down regular operation.  Of course, since SET_LINENO is gone, it
should speed things up dramatically; but how does it do compared to
previous -O mode?  (I guess the only difference that -O makes now is
that asserts aren't compiled. :-)

> 9) This patch stops LLTRACE telling you when execution moves onto a
>    different line.  This could be restored, but
>    a) I expect I'm the only person to have used LLTRACE recently
>       (debugging this patch).
>    b) This will cause obfuscation, so I'd prefer to do it last.
> No change here either.

I'm not too attached to LLTRACE.  As long as it's usable for debugging
massive changes to the VM implementation I'm okay with it.

--Guido van Rossum (home page:

From  Thu Aug  1 18:23:03 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 13:23:03 -0400
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: Your message of "Thu, 01 Aug 2002 09:19:46 PDT."
References: <>
Message-ID: <>

> While testing my rewrite I ran into this mess in
> def _execvpe(file, args, env=None):
>     # ...
>     if not _notfound:
>         if sys.platform[:4] == 'beos':
>             #  Process handling (fork, wait) under BeOS (up to 5.0)
>             #  doesn't interoperate reliably with the thread interlocking
>             #  that happens during an import.  The actual error we need
>             #  is the same on BeOS for et al., ENOENT.
>             try: unlink('/_#.# ## #.#')
>             except error, _notfound: pass
>         else:
>             import tempfile
>             t = tempfile.mktemp()
>             # Exec a file that is guaranteed not to exist
>             try: execv(t, ('blah',))
>             except error, _notfound: pass
>     exc, arg = error, _notfound
>     for dir in PATH:
>         fullname = path.join(dir, file)
>         try:
>             apply(func, (fullname,) + argrest)
>         except error, (errno, msg):
>             if errno != arg[0]:
>                 exc, arg = error, (errno, msg)
>     raise exc, arg
> This appears to be an overcomplicated, unreliable way of writing
> import errno
> def _execvpe(file, args, env=None):
>     # ...
>     for dir in PATH:
> 	fullname = path.join(dir, file)
> 	try:
> 	    apply(func, (fullname,) + argrest)
> 	except error, (err, msg):
> 	    if err != errno.ENOENT: # and err != errno.ENOTDIR, maybe
> 	       raise
>     raise error, (err, msg)
> Can anyone explain why it is done this way?

Because not all systems report the same error for this error condition
(attempting to execute a file that doesn't exist).

--Guido van Rossum (home page:

From  Thu Aug  1 18:11:32 2002
From: (Michael Hudson)
Date: Thu, 1 Aug 2002 18:11:32 +0100 (BST)
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

On Thu, 1 Aug 2002, Guido van Rossum wrote:

> > I think leaving f_lineno there but useless is the way to go.  If we
> > actually make incompatible changes for other reasons, then it can
> > disappear.
> Agreed.


> > 8) I haven't measured the performance impact of the changes to code
> >    that is tracing or code that isn't.  There's a possible
> >    optimization mentioned in the patch for traced code.  For not
> >    traced code it MAY be worthwhile putting the tracing support code
> >    in a static function somewhere so there's less code to jump over in
> >    the main loop (for i-caches and such).
> > 
> > Still haven't done this.
> I don't care if it slows down tracing, but I'd like it not to slow
> down regular operation.  Of course, since SET_LINENO is gone, it
> should speed things up dramatically; but how does it do compared to
> previous -O mode?  

Currently compiling up two interpreters for pybench testing...

Here goes.  Everything is relative to 221-base, which is 2.2.1 from Sean's 
RPM.  This is the slowest, so all percentages are negative, and more 
negative is better.  I hope the names are obvious.

221-base             +0.00% (obviously)
221-O-base:          -9.69%
CVS-base:           -15.43%
CVS-O-base:         -23.56%
CVS-hacked:         -23.66%
CVS-O-hacked:       -23.70%

(Nearly 25% speed up since 221?  Boggle.  Some of this may be compilation 
options, I guess)

Anyway, it seems I haven't slowed -O down.  At some point I might try 
moving the trace code out of line and see if that has any effect.  Not 

If you want to look at where the improvements are in more detail, I've put 
the pybench files here:

> (I guess the only difference that -O makes now is that asserts aren't
> compiled. :-)

I think so, yes.

> > 9) This patch stops LLTRACE telling you when execution moves onto a
> >    different line.  This could be restored, but
> > 
> >    a) I expect I'm the only person to have used LLTRACE recently
> >       (debugging this patch).
> >    b) This will cause obfuscation, so I'd prefer to do it last.
> > 
> > No change here either.
> I'm not too attached to LLTRACE.  As long as it's usable for debugging
> massive changes to the VM implementation I'm okay with it.

Good.  I don't suppose you'd actually LLTRACE something without dis output 
in front of you anyway, so this isn't much of a loss. Something I just 
remembered: I turned off LLTRACE for trace functions.  I guess this isn't 
really worth caring about either.


From  Thu Aug  1 19:16:04 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 14:16:04 -0400
Subject: [Python-Dev] Re: Speed of
In-Reply-To: Your message of "Thu, 01 Aug 2002 12:14:25 EDT."
Message-ID: <>

> [Tim, in python-checkins]
> > Bizarre:  this takes 11x longer to run if and only if test_longexp is
> > run before it, on my box.  The bigger REPS is in test_longexp, the
> > slower this gets.  What happens on your box?  It's not gc on my box
> > (which is good, because gc isn't a plausible candidate here).
> > 
> > The slowdown is massive in the parts of test_sort that implicitly
> > invoke a new-style class's __lt__ or __cmp__ methods.  If I boost
> > REPS large enough in test_longexp, even the test_sort tests on an array
> > of size 64 visibly c-r-a-w-l.  The relative slowdown is even worse in
> > a debug build.  And if I reduce REPS in test_longexp, the slowdown in
> > test_sort goes away.
> > 
> > test_longexp does do horrid things to Win98's management of user
> > address space, but I thought I had made that a whole lot better a month
> > or so ago (by overallocating aggressively in the parser).
> It's about the same on my Linux box (system time is CPU time spent in
> the kernel):
> test_longexp alone takes 1.92 user + 0.22 system seconds.
> test_sort alone takes 1.71 user + 0.01 system seconds.
> test_sort + test_longexp takes 3.62 user + 0.18 system seconds.
> test_longexp + test_sort takes 38.05 user and 0.34 system seconds!!!
> I'll see if I can get this to run under a profiler.

The profiler shows that in the latter run, 86% of the time (39 seconds
-- the profiler slows things down :-) was spent in PyFrame_New, for
188923 calls.  I note that the longexp-only profile has only 593 calls
to that function, and the sort-only profile has 183075 calls to it,
but took only 0.39 seconds for those altogether!

The numbers don't quite add up to 188923 (it misses 5255), but it's
close enough, and the rest is probably because regrtest does extra
stuff when it runs two tests, or there's some randomness in the sort

So why would 180,000 calls to PyFrame_New take 38 seconds in one case
and 0.39 seconds in the other?  I checked the call tree in the
profiler output, and only very few of the calls to PyFrame_New call
something else (mostly PyType_IsSubtype and PyDict_GetItem), and
besides the two cases have an almost identical call profile.

Suggestion: doesn't test_longexp create some frames with a very large
number of local variables?  Then PyFrame_New could spend a lot of time
in this loop:

	while (--extras >= 0)
		f->f_localsplus[extras] = NULL;

There's a free list of frames, and PyFrame_New picks the first frame
on the free list.  It grows the space for locals if necessary, but it
never shrinks it.

Back to Tim -- does this make sense?  Should we attempt to fix it?

--Guido van Rossum (home page:

From  Thu Aug  1 19:19:11 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 14:19:11 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Thu, 01 Aug 2002 18:11:32 BST."
References: <>
Message-ID: <>

> Here goes.  Everything is relative to 221-base, which is 2.2.1 from Sean's 
> RPM.  This is the slowest, so all percentages are negative, and more 
> negative is better.  I hope the names are obvious.
> 221-base             +0.00% (obviously)
> 221-O-base:          -9.69%
> CVS-base:           -15.43%
> CVS-O-base:         -23.56%
> CVS-hacked:         -23.66%
> CVS-O-hacked:       -23.70%
> (Nearly 25% speed up since 221?  Boggle.  Some of this may be compilation 
> options, I guess)

No, pymalloc sped us up quite a bit.

> Anyway, it seems I haven't slowed -O down.  At some point I might try 
> moving the trace code out of line and see if that has any effect.  Not 
> today.


> If you want to look at where the improvements are in more detail, I've put 
> the pybench files here:
> > (I guess the only difference that -O makes now is that asserts aren't
> > compiled. :-)
> I think so, yes.

Ah well.  So much for -O. :-)

> > > 9) This patch stops LLTRACE telling you when execution moves onto a
> > >    different line.  This could be restored, but
> > > 
> > >    a) I expect I'm the only person to have used LLTRACE recently
> > >       (debugging this patch).
> > >    b) This will cause obfuscation, so I'd prefer to do it last.
> > > 
> > > No change here either.
> > 
> > I'm not too attached to LLTRACE.  As long as it's usable for debugging
> > massive changes to the VM implementation I'm okay with it.
> Good.  I don't suppose you'd actually LLTRACE something without dis output 
> in front of you anyway, so this isn't much of a loss. Something I just 
> remembered: I turned off LLTRACE for trace functions.  I guess this isn't 
> really worth caring about either.


What's the next step?  I haven't had time to review your code.  Do you
want to check it in without further review, or do you want to wait
until someone can give it a serious look?  (Tim's on vacation this
week so it might be a while.)

--Guido van Rossum (home page:

From  Thu Aug  1 19:23:40 2002
From: (Zack Weinberg)
Date: Thu, 1 Aug 2002 11:23:40 -0700
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Thu, Aug 01, 2002 at 01:23:03PM -0400, Guido van Rossum wrote:
> > Can anyone explain why it is done this way?
> Because not all systems report the same error for this error condition
> (attempting to execute a file that doesn't exist).

That's unfortunate.  The existing code is buggy on at least three
grounds:

First and most important, it's currently trivial to cause any program
that uses os.execvp[e] to invoke a program of the attacker's choice,
rather than the intended one, on any platform that supports symbolic
links and has predictable PIDs.  My tempfile rewrite will make this
much harder, but still not impossible.

Second, the BeOS code will silently delete the file '/_#.# ## #.#' if
it exists, which is unlikely, but not impossible.  A user who had
created such a file would certainly be surprised to discover it gone
after running an apparently-innocuous Python program.

Third, if an error other than the expected one comes back, the loop
clobbers the saved exception info and keeps going.  Consider the
situation where PATH=/bin:/usr/bin, /bin/foobar exists but is not
executable by the invoking user, and /usr/bin/foobar does not exist.
The exception thrown will be 'No such file or directory', not the
expected 'Permission denied'.
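
The behaviour one would want from the loop can be sketched as follows,
in present-day Python 3 and with the exec call replaced by a fake so the
example is safe to run.  The names search_path and fake_exec, and the
exact set of "not found" errno values, are assumptions made for
illustration.

```python
import errno

# errno values treated as "not here, keep looking"; the right set per
# platform is an assumption.
NOT_FOUND = {errno.ENOENT, errno.ENOTDIR}

def search_path(dirs, try_exec):
    """Call try_exec(dir) for each dir; raise the most informative error."""
    saved = None   # first error that is not a plain "not found"
    last = None    # most recent "not found" error
    for d in dirs:
        try:
            return try_exec(d)
        except OSError as e:
            if e.errno in NOT_FOUND:
                last = e
            elif saved is None:
                saved = e
    if saved is not None:
        raise saved
    if last is not None:
        raise last
    raise OSError(errno.ENOENT, "empty search path")

def fake_exec(d):
    # Simulates PATH=/bin:/usr/bin with /bin/foobar present but not executable.
    if d == "/bin":
        raise OSError(errno.EACCES, "Permission denied")
    raise OSError(errno.ENOENT, "No such file or directory")

try:
    search_path(["/bin", "/usr/bin"], fake_exec)
except OSError as e:
    result = e.errno
print(errno.errorcode[result])
```

Here the permission error from /bin survives the later "not found" from
/usr/bin, which is the outcome the scenario above expects.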

Also, I'm not certain what will happen if two threads go through the
if not _notfound: block at the same time, but it could be bad,
depending on how much implicit locking there is in the interpreter.

I see three possible fixes.  In order of personal preference:

1. Make os.execvp[e] just call the C library's execvp[e]; it has to
   get this stuff right anyway.  We are already counting on it for
   execv - I would be surprised to find a system that had execv and
   not execvp, as long as PATH was a meaningful concept (it isn't, for
   instance, on classic MacOS).

2. Enumerate all the platform-specific errno values for this failure
   mode, and check them all.  On Unix, ENOENT and arguably ENOTDIR.  I
   don't know about others.

3. If we must do the temporary file thing, create a temporary
   _directory_; we control the contents of that directory, so we can
   be sure that the file name we choose does not exist.  Cleanup is
   messier than the other two possibilities.
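
A rough sketch of option 3, written in present-day Python 3 and assuming
a POSIX-like platform: the probe path lives inside a freshly created
private directory, so no one else can have planted a file or symlink
there.  The function name probe_notfound_errno is hypothetical.

```python
import errno
import os
import shutil
import tempfile

def probe_notfound_errno():
    """Return the errno this platform reports for exec of a missing file."""
    d = tempfile.mkdtemp()          # private directory we fully control
    try:
        missing = os.path.join(d, "no-such-program")
        try:
            # Cannot succeed: the directory is empty, so exec must fail.
            os.execv(missing, ["blah"])
        except OSError as e:
            return e.errno
    finally:
        shutil.rmtree(d)            # the cleanup that makes this option messier

code = probe_notfound_errno()
print(errno.errorcode.get(code, code))
```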


From  Thu Aug  1 19:11:42 2002
From: (Tim Peters)
Date: Thu, 01 Aug 2002 14:11:42 -0400
Subject: [Python-Dev] Speed of
In-Reply-To: <>
Message-ID: <>

[Guido, mixing test_longexp w/ the new test_sort]
> It's about the same on my Linux box (system time is CPU time spent in
> the kernel):

Dang!  I was more than half hoping it was a Windows glitch.

> test_longexp alone takes 1.92 user + 0.22 system seconds.
> test_sort alone takes 1.71 user + 0.01 system seconds.
> test_sort + test_longexp takes 3.62 user + 0.18 system seconds.
> test_longexp + test_sort takes 38.05 user and 0.34 system seconds!!!
> I'll see if I can get this to run under a profiler.

It's intriguing, but I have to do other things today.  Here's a
self-contained test case that lets you vary REPS from the command line:

import sys
from time import clock as now

def do_shuffle(x):
    import random
    random.shuffle(x)

def do_longexp(REPS):
    l = eval("[" + "2," * REPS + "]")
    assert len(l) == REPS

def do_sort(x):
    x.sort(lambda x, y: cmp(x, y))
    # Doing x.sort(cmp) instead, there's no slowdown, so it's not just
    # that there's an explicit comparison function.

x = range(1000)
do_shuffle(x)

REPS = 65580
if len(sys.argv) > 1:
    REPS = int(sys.argv[1])

t1 = now()
do_longexp(REPS)
t2 = now()
do_sort(x)
t3 = now()

print "At REPS=%d, longexp %.2g sort %.2g" % (REPS, t2-t1, t3-t2)

On my box, the time it takes for the sort appears, after a certain point, to
grow quadratically(!) in the REPS value (these timings were hasty and not on
a quiet box, so only gross conclusions are justified):

C:\Code\python\PCbuild>python 1
At REPS=1, longexp 0.00021 sort 0.027

At REPS=1, longexp 0.00018 sort 0.036
At REPS=10, longexp 0.00035 sort 0.053
At REPS=100, longexp 0.002 sort 0.028
At REPS=1000, longexp 0.039 sort 0.073
At REPS=10000, longexp 0.47 sort 0.45
At REPS=20000, longexp 0.89 sort 0.44
At REPS=40000, longexp 1.5 sort 1.3
At REPS=80000, longexp 2.5 sort 5.9
At REPS=160000, longexp 5 sort 22
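
As a quick check of the "quadratic" reading, the growth exponent can be
estimated from the three largest runs above.  This is plain arithmetic
on the posted numbers, written in present-day Python 3; given how hasty
the timings were, only the rough size of the exponent means anything.

```python
import math

# (REPS, sort seconds) copied from the three largest runs listed above
timings = [(40000, 1.3), (80000, 5.9), (160000, 22.0)]

exponents = []
for (r0, t0), (r1, t1) in zip(timings, timings[1:]):
    # If time ~ REPS**k, then k = log(t1/t0) / log(r1/r0)
    k = math.log(t1 / t0) / math.log(r1 / r0)
    exponents.append(k)
    print("REPS %d -> %d: time x%.1f, exponent %.2f" % (r0, r1, t1 / t0, k))
```

Each doubling of REPS multiplies the sort time by roughly four, so both
exponents come out close to 2.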

From  Thu Aug  1 19:38:17 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 14:38:17 -0400
Subject: [Python-Dev] Re: Speed of
In-Reply-To: Your message of "Thu, 01 Aug 2002 14:16:04 EDT."
References: <>
Message-ID: <>

> Suggestion: doesn't test_longexp create some frames with a very large
> number of local variables?  Then PyFrame_New could spend a lot of time
> in this loop:
> 	while (--extras >= 0)
> 		f->f_localsplus[extras] = NULL;
> There's a free list of frames, and PyFrame_New picks the first frame
> on the free list.  It grows the space for locals if necessary, but it
> never shrinks it.

Jeremy made me think about it some more.  Deleting two lines from
PyFrame_New() made the timing behavior much more reasonable:

*** frameobject.c	20 Apr 2002 04:46:55 -0000	2.62
--- frameobject.c	1 Aug 2002 18:32:40 -0000
*** 265,272 ****
  			if (f == NULL)
  				return NULL;
- 		else
- 			extras = f->ob_size;
  		_Py_NewReference((PyObject *)f);
  	if (builtins == NULL) {
--- 265,270 ----

This means that the while loop only clears that part of the stack that
we plan to *use*, not all that's available.  I've run the whole test
suite in debug mode with this change and it showed no failures, so I'll
check this in now.
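
The effect can be modelled with a toy free list in present-day Python 3.
This is only a sketch of the cost behaviour described above, not the C
code; ToyFrame, new_frame and the slot counts are illustrative.

```python
class ToyFrame:
    def __init__(self, size):
        self.slots = [None] * size

free_list = []

def new_frame(needed, clear_all):
    """Return (frame, slots_cleared), reusing a free-listed frame if any."""
    if free_list:
        frame = free_list.pop()
        if len(frame.slots) < needed:
            # Grow if necessary, but never shrink.
            frame.slots.extend([None] * (needed - len(frame.slots)))
    else:
        frame = ToyFrame(needed)
    # Old behaviour: clear everything the reused frame has room for.
    # Patched behaviour: clear only what the current code object needs.
    extras = len(frame.slots) if clear_all else needed
    for i in range(extras):
        frame.slots[i] = None
    return frame, extras

def release(frame):
    free_list.append(frame)

def total_cleared(calls, needed, clear_all):
    total = 0
    for _ in range(calls):
        frame, cleared = new_frame(needed, clear_all)
        total += cleared
        release(frame)
    return total

# One huge frame is left on the free list, then many small calls reuse it.
big, _ = new_frame(120000, clear_all=True)
release(big)

old_cost = total_cleared(20, 10, clear_all=True)
new_cost = total_cleared(20, 10, clear_all=False)
print(old_cost, new_cost)
```

In the model every small call pays for the full size of the oversized
frame under the old rule, and only for its own ten slots under the new
one.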

Should we fix this in 2.2.2 too?

--Guido van Rossum (home page:

From  Thu Aug  1 20:05:16 2002
From: (Tim Peters)
Date: Thu, 01 Aug 2002 15:05:16 -0400
Subject: [Python-Dev] Re: Speed of
In-Reply-To: <>
Message-ID: <>

[Guido, pins the blame on PyFrame_New -- cool!]
> ...
> Suggestion: doesn't test_longexp create some frames with a very large
> number of local variables?  Then PyFrame_New could spend a lot of time
> in this loop:
> 	while (--extras >= 0)
> 		f->f_localsplus[extras] = NULL;

In my poor man's profiling <wink>, I ran the self-contained test case posted
earlier under the debugger with REPS=120000, and since the "sort" part takes
20 seconds then, there was lots of opportunity to break at random times (the
MSVC debugger lets you do that, i.e. click a button that means "I don't care
where you are, break *now*").  It was always in that loop when it broke, and
extras always started life at 120000 before that loop.  Yikes!

> There's a free list of frames, and PyFrame_New picks the first frame
> on the free list.  It grows the space for locals if necessary, but it
> never shrinks it.
> Back to Tim -- does this make sense?  Should we attempt to fix it?

I can't make sufficient time to think about this, but I suspect a principled
fix is simply to delete these two lines:

		else
			extras = f->ob_size;

The number of extras the code object actually needs was already computed
correctly earlier, via

	extras = code->co_stacksize + code->co_nlocals + ncells + nfrees;

and there's no point clearing any more than that original value.  IOW, I
don't think it hurts to have a big old frame left on the freelist, the pain
comes from clearing out more slots in it than the *current* code object

A quick test of this showed it cured the test_longexp + test_sort speed
problem, and the regression suite ran without problems.
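
The same sum can be reproduced from a code object's public attributes.
A small sketch in present-day Python 3 follows; it mirrors the C
expression quoted above, and the helper name extras_needed is made up.
Current CPython lays frames out differently, so treat the result as an
illustration of the formula rather than of today's frame size.

```python
def extras_needed(code):
    """Slots a frame needs for `code`: stack + locals + cell and free vars."""
    ncells = len(code.co_cellvars)
    nfrees = len(code.co_freevars)
    return code.co_stacksize + code.co_nlocals + ncells + nfrees

def outer(a, b):
    def inner():
        return a + b      # a and b are free variables of inner
    return inner

inner = outer(1, 2)
print(extras_needed(outer.__code__), extras_needed(inner.__code__))
```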

If someone understands this code well enough to finish thinking about
whether that's a correct thing to do, please do!

From  Thu Aug  1 20:12:29 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 15:12:29 -0400
Subject: [Python-Dev] Re: Speed of
In-Reply-To: Your message of "Thu, 01 Aug 2002 15:05:16 EDT."
References: <>
Message-ID: <>

> If someone understands this code well enough to finish thinking about
> whether that's a correct thing to do, please do!

Please do a cvs update. :-)

Jeremy & I independently came up with the same solution, so I consider
this resolved.

--Guido van Rossum (home page:

From  Thu Aug  1 20:27:43 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 15:27:43 -0400
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: Your message of "Thu, 01 Aug 2002 11:23:40 PDT."
References: <> <>
Message-ID: <>

> > > Can anyone explain why it is done this way?
> > 
> > Because not all systems report the same error for this error condition
> > (attempting to execute a file that doesn't exist).
> That's unfortunate.  The existing code is buggy on at least three
> grounds:
> First and most important, it's currently trivial to cause any program
> that uses os.execvp[e] to invoke a program of the attacker's choice,
> rather than the intended one, on any platform that supports symbolic
> links and has predictable PIDs.  My tempfile rewrite will make this
> much harder, but still not impossible.

That's important.

> Second, the BeOS code will silently delete the file '/_#.# ## #.#' if
> it exists, which is unlikely, but not impossible.  A user who had
> created such a file would certainly be surprised to discover it gone
> after running an apparently-innocuous Python program.

I really don't care about that. :-)

> Third, if an error other than the expected one comes back, the loop
> clobbers the saved exception info and keeps going.  Consider the
> situation where PATH=/bin:/usr/bin, /bin/foobar exists but is not
> executable by the invoking user, and /usr/bin/foobar does not exist.
> The exception thrown will be 'No such file or directory', not the
> expected 'Permission denied'.

Hm, you're right.  The code (which I believe I wrote, except for the
BeOS bit) was attempting to get the opposite effect, but seems to be
broken. :-(

> Also, I'm not certain what will happen if two threads go through the
> if not _notfound: block at the same time, but it could be bad,
> depending on how much implicit locking there is in the interpreter.
> I see three possible fixes.  In order of personal preference:
> 1. Make os.execvp[e] just call the C library's execvp[e]; it has to
>    get this stuff right anyway.  We are already counting on it for
>    execv - I would be surprised to find a system that had execv and
>    not execvp, as long as PATH was a meaningful concept (it isn't, for
>    instance, on classic MacOS).

Probably agreed for execvpe().  All the non-env versions must call the
env version because not all platforms have putenv, and there changes
to os.environ don't get reflected in the process's environment.

> 2. Enumerate all the platform-specific errno values for this failure
>    mode, and check them all.  On Unix, ENOENT and arguably ENOTDIR.  I
>    don't know about others.
> 3. If we must do the temporary file thing, create a temporary
>    _directory_; we control the contents of that directory, so we can
>    be sure that the file name we choose does not exist.  Cleanup is
>    messier than the other two possibilities.

I'd like to agree with this, but I don't recall exactly why we ended up
in this situation in the first place.  It's possible that it's an
unnecessary sacrifice of a dead chicken, but it's also possible that
there are platforms where this addressed a real need.  I'd like to
think that it was because I didn't want to add more cruft to
posixmodule.c (I've long given up on that :-).

Can you post a patch to SF?  Then we can ask for volunteers to test it
on various platforms.

--Guido van Rossum (home page:

From  Thu Aug  1 23:06:41 2002
From: (Zack Weinberg)
Date: Thu, 1 Aug 2002 15:06:41 -0700
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Thu, Aug 01, 2002 at 03:27:43PM -0400, Guido van Rossum wrote:
> > 1. Make os.execvp[e] just call the C library's execvp[e]; it has to
> >    get this stuff right anyway.  We are already counting on it for
> >    execv - I would be surprised to find a system that had execv and
> >    not execvp, as long as PATH was a meaningful concept (it isn't, for
> >    instance, on classic MacOS).
> Probably agreed for execvpe().  All the non-env versions must call the
> env version because not all platforms have putenv, and there changes
> to os.environ don't get reflected in the process's environment.

execvp could be just

def execvp(file, args):
   return execvpe(file, args, environ)


> > 2. Enumerate all the platform-specific errno values for this failure
> >    mode, and check them all.  On Unix, ENOENT and arguably ENOTDIR.  I
> >    don't know about others.
> > 
> > 3. If we must do the temporary file thing, create a temporary
> >    _directory_; we control the contents of that directory, so we can
> >    be sure that the file name we choose does not exist.  Cleanup is
> >    messier than the other two possibilities.
> I like to agree with this, but I don't recall exactly why we ended up
> in this situation in the first place.  It's possible that it's an
> unnecessary sacrifice of a dead chicken, but it's also possible that
> there are platforms where this addressed a real need.  I'd like to
> think that it was because I didn't want to add more cruft to
> posixmodule.c (I've long given up on that :-).
> Can you post a patch to SF?  Then we can ask for volunteers to test it
> on various platforms.

I will write such a patch, however, I keep getting lost in the Python
source tree.  In addition to Modules/posixmodule.c, I would need to
update the nt, dos, os2, mac, ce, and riscos modules also, yes?  Where
are their sources kept?  I don't see an ntmodule.c, etc anywhere.


From  Thu Aug  1 23:43:41 2002
From: (Greg Ewing)
Date: Fri, 02 Aug 2002 10:43:41 +1200 (NZST)
Subject: [Python-Dev] pre-PEP: The Safe Buffer Interface
In-Reply-To: <002701c23935$e5831c20$e000a8c0@thomasnotebook>
Message-ID: <>

> Backward compatibility.
> If we change the array object to enter a locked state
> when getreadbuffer() is called, it would be surprising.

Yes, I understand now. I hadn't realised that the list
included both the old and new routines.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug  1 23:51:16 2002
From: (Barry A. Warsaw)
Date: Thu, 1 Aug 2002 18:51:16 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
References: <>
Message-ID: <>

>>>>> "DG" == David Goodger <> writes:

    DG> I hereby formally request permission to deploy Docutils for
    DG> PEPs on  Here's a deployment plan for your
    DG> consideration:

I'm sympathetic to your aims, but I have reservations.  As lightweight
as reST is, it's still too heavy for me.  Ka-Ping described some of my
feelings quite well so I won't repeat what he said.

I like that PEPs are 70-odd column plain text, with just a few style
guidelines to aid in the html generation tool, and to promote
consistency.  I think of PEPs as our RFCs and I'm dinosaurically
attached to the RFC format, which has served standards bodies well for
so long.  I like that the plain text sources are readable and
consistent, with virtually no rules that are hard to remember.  More
importantly for me, I find it easy to do editing passes on submitted
PEPs in order to ensure consistency.

The noisy markup in reST bothers me, although you've done a good job
in minimizing the impact compared to other markup languages.  Magical
double colons, trailing underscores, etc. are jarring to me.  I wonder
how tools like ispell will handle some of it (I haven't tried it on
your reST source versions).

I made this suggestion privately to David, but I'll repeat it here.
I'd be willing to accept that PEPs /may/ be written in reST as an
alternative to plaintext, but not require it.  I'd like for PEP
authors to explicitly choose one or the other, preferably by file
extension (e.g. .txt for plain text, .rst or .rest for reST).  I'd also
like for there to be two tools for generating derivative forms from
the original source.

I would leave alone.  That's the tool that generates .html
from .txt.  I'd write a different tool that took a .rst file and
generated both a .html file and a .txt file.  The generated .txt file
would have no markup and would conform to .txt PEP style as closely as
possible.  reST generated html would then have a link both to the
original reST source, and to the plain text form.
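
A rough sketch of that extension-based split, in Python.  The names
pick_pep_format, derived_outputs and PEP_FORMATS are invented here for
illustration and do not belong to any existing PEP tool; the sketch only
decides which format a source file is in and which files would be
generated from it.

```python
import os

# Hypothetical mapping from source file extension to PEP source format.
PEP_FORMATS = {
    ".txt": "plaintext",
    ".rst": "rest",
    ".rest": "rest",
}

def pick_pep_format(filename):
    """Return 'plaintext' or 'rest' based on the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PEP_FORMATS[ext]
    except KeyError:
        raise ValueError("unrecognized PEP source extension: %r" % ext)

def derived_outputs(filename):
    """List the derivative files each kind of source would produce."""
    base = os.path.splitext(filename)[0]
    if pick_pep_format(filename) == "plaintext":
        # Plain text PEPs only need the HTML rendering.
        return [base + ".html"]
    # reST sources yield HTML plus a markup-free plain text rendering.
    return [base + ".html", base + ".txt"]
```

With this, derived_outputs("pep-0287.rst") lists both an .html and a .txt
file, while a .txt source lists only the .html file.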

A little competition never hurt anyone. :)  So I'd open it up and let
PEP authors decide, and we can do a side-by-side comparison of which
format folks prefer to use.


From  Thu Aug  1 23:42:11 2002
From: (Mark Hammond)
Date: Fri, 2 Aug 2002 08:42:11 +1000
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: <>
Message-ID: <>

> update the nt, dos, os2, mac, ce, and riscos modules also, yes?  Where
> are their sources kept?  I don't see an ntmodule.c, etc anywhere.

The nt module is built from the posixmodule.c sources.  It is not called
posix to prevent flame wars ;)


From  Fri Aug  2 00:19:13 2002
From: (Greg Ewing)
Date: Fri, 02 Aug 2002 11:19:13 +1200 (NZST)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
Message-ID: <>

> >     Failure to call
> >     this function (if it is != NULL) is a programming error
> Not sure I like this.  I would prefer to put the burden of "you must provide
> a (possibly empty) release function" on the few buffer interface
> implementers than the many (ie, potentially any extension author) buffer
> interface consumers.

The test for whether the release routine is NULL or not (if one is
needed at all) surely belongs inside PyObject_ReleaseFixedBuffer. 
Clients should be required to always call this routine.

I say "if one is needed at all" because PyType_Ready could fill
it in with a default one if required.
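
Modelled in Python rather than C, the arrangement might look roughly
like this.  It is only a sketch: BufferProcs, release_fixed_buffer and
type_ready are stand-ins invented here for PyBufferProcs,
PyObject_ReleaseFixedBuffer and PyType_Ready, and None plays the role
of NULL.

```python
class BufferProcs:
    """Stand-in for a type's PyBufferProcs table; None models NULL."""
    def __init__(self, releasefixedbuffer=None):
        self.bf_releasefixedbuffer = releasefixedbuffer

def release_fixed_buffer(obj):
    """Model of PyObject_ReleaseFixedBuffer: clients always call this."""
    release = obj.procs.bf_releasefixedbuffer
    if release is not None:
        # The NULL test lives here, so clients never need to repeat it.
        release(obj)

def type_ready(procs):
    """Model of PyType_Ready filling in a default no-op release slot."""
    if procs.bf_releasefixedbuffer is None:
        procs.bf_releasefixedbuffer = lambda obj: None
    return procs
```

Either approach keeps the test out of client code; with the default
slot filled in, the check inside release_fixed_buffer becomes redundant.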

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Fri Aug  2 00:36:50 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 19:36:50 -0400
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: Your message of "Thu, 01 Aug 2002 15:06:41 PDT."
References: <> <> <> <>
Message-ID: <>

> > > 1. Make os.execvp[e] just call the C library's execvp[e]; it has to
> > >    get this stuff right anyway.  We are already counting on it for
> > >    execv - I would be surprised to find a system that had execv and
> > >    not execvp, as long as PATH was a meaningful concept (it isn't, for
> > >    instance, on classic MacOS).
> > 
> > Probably agreed for execvpe().  All the non-env versions must call the
> > env version because not all platforms have putenv, and there changes
> > to os.environ don't get reflected in the process's environment.
> execvp could be just
> def execvp(file, args):
>    return execvpe(file, args, environ)
> yes?

It already is, sort of:

def execvp(file, args):
    """execvp(file, args)

    Execute the executable file (which is searched for along $PATH)
    with argument list args, replacing the current process.
    args may be a list or tuple of strings. """
    _execvpe(file, args)

> > > 2. Enumerate all the platform-specific errno values for this failure
> > >    mode, and check them all.  On Unix, ENOENT and arguably ENOTDIR.  I
> > >    don't know about others.
> > > 
> > > 3. If we must do the temporary file thing, create a temporary
> > >    _directory_; we control the contents of that directory, so we can
> > >    be sure that the file name we choose does not exist.  Cleanup is
> > >    messier than the other two possibilities.
> > 
> > I like to agree with this, but I don't recall exactly why we ended up
> > in this situation in the first place.  It's possible that it's an
> > unnecessary sacrifice of a dead chicken, but it's also possible that
> > there are platforms where this addressed a real need.  I'd like to
> > think that it was because I didn't want to add more cruft to
> > posixmodule.c (I've long given up on that :-).
> > 
> > Can you post a patch to SF?  Then we can ask for volunteers to test it
> > on various platforms.
> I will write such a patch, however, I keep getting lost in the Python
> source tree.  In addition to Modules/posixmodule.c, I would need to
> update the nt, dos, os2, mac, ce, and riscos modules also, yes?  Where
> are their sources kept?  I don't see an ntmodule.c, etc anywhere.

The nt module is built from the posixmodule.c source file.  AFAIK the
others don't support the exec* family at all, so don't worry about
them; if something is needed the respective maintainers will have to
provide it.

--Guido van Rossum (home page:

From  Fri Aug  2 00:58:40 2002
From: (Ka-Ping Yee)
Date: Thu, 1 Aug 2002 16:58:40 -0700 (PDT)
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208011605280.7588-100000@ziggy>

On Thu, 1 Aug 2002, David Goodger wrote:
> Ka-Ping Yee wrote:
> > It took a long time.  Perhaps it seems not so big to others, but
> > my personal opinion would be to recommend against this proposal
> > until the specification fits in, say, 1000 lines and can be absorbed
> > in ten minutes.
> The specification is, as its title says, a *specification*.  It's a detailed
> description of the markup, intended to guide the *developer* who is writing
> a parser or other tool.  It's not user documentation.

Okay, i understand that it's a spec and not a user manual.  I think
the fact that it takes that much text to describe all of the rules
does say something about its complexity, though.  Other people may
have different thresholds; it exceeds my threshold.

But again i want to stress that i think the structured-text approach
is good and i do not advocate abandoning the whole idea; i just want
a simpler set of rules.

> > For me, it violates the fits-in-my-brain principle:
> > the spec is 2500 lines long, and supports six different kinds of
> > references and five different kinds of lists (even lists with roman
> > numerals!).  It also violates the one-way-to-do-it principle:
> > for example, there are a huge variety of ways to do headings,
> > and two different syntaxes for drawing a table.
> How many times have we heard this?  "All we need are paragraphs and bullet
> lists."  That line of argument has been going on for at least six years, and
> has hampered progress all along.

Well, that depends what you mean by "progress"!  :)  There might be
something to that line of argument, if it has a habit of cropping up.

One can separate two issues here:

    1. too much functionality (YAGNI)
    2. too many ways of expressing the same functionality (TMTOWTDI)

As for the first, there's some room to argue here.  I happen to feel
there are quite a few YAGNI features in RST, like the Roman numerals
and the RCS keyword processing.  Auto-numbering in particular takes
RST in a direction that makes me uncomfortable -- it means that RST
now has the potential for a compile-debug cycle.

But as for the second, i just don't see any justification for it.
Reducing the multiple ways to do headers and lists and tables doesn't
cripple anything; it only makes RST simpler and easier to understand.

I acknowledge that there is some question of opinion as to what is the
"same" functionality, causing issues to slush over from #1 to #2.

To me, using "1.", "(1)", or "1)" to number a list makes no semantic
difference at all, and so it counts as redundancy.  If you already
have definition lists, why also have option lists and field lists?
If you already have literals, why have interpreted text?  If you
already have both footnotes *and* inline URLs, why also have anonymous
inline hyperlink references?
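
To illustrate the point about list numbering: a sketch in Python of a
parser that treats the three styles as one construct.  The pattern and
the name parse_item are made up for this example and handle arabic
numerals only; they are not taken from the Docutils parser.

```python
import re

# Matches the three arabic-numeral enumerator styles: "1.", "(1)", "1)".
ENUMERATOR = re.compile(r"^(?:\((\d+)\)|(\d+)[.)])\s+(.*)$")

def parse_item(line):
    """Return (number, text) for an enumerated list item, else None."""
    m = ENUMERATOR.match(line)
    if m is None:
        return None
    # Exactly one of the two number groups is set, depending on the style.
    number = int(m.group(1) or m.group(2))
    return number, m.group(3)
```

All three spellings of item 1 come back as the same (number, text)
pair, which is the sense in which they carry no semantic difference.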

> OTOH, I have no problem with mandating standard uses, like a standard set of
> section title adornments.

If you're going to recommend certain ways, why not just decide
what to use and be done with it?  When designing a new standard,
there's no point starting out with parts of it already deprecated.

-- ?!ng

From  Fri Aug  2 01:17:44 2002
From: (Guido van Rossum)
Date: Thu, 01 Aug 2002 20:17:44 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: Your message of "Thu, 01 Aug 2002 18:51:16 EDT."
References: <>
Message-ID: <>

> I made this suggestion privately to David, but I'll repeat it here.
> I'd be willing to accept that PEPs /may/ be written in reST as an
> alternative to plaintext, but not require it.  I'd like for PEP
> authors to explicitly choose one or the other, preferably by file
> extension (e.g. .txt for plain text, .rst or .rest for reST).  I'd also
> like for there to be two tools for generating derivative forms from
> the original source.

AFAICT that's all that David asked for.  It's the only thing that
makes sense; nobody's going to convert over 200 existing PEPs to reST.

> I would leave alone.  That's the tool that generates .html
> from .txt.  I'd write a different tool that took a .rst file and
> generated both a .html file and a .txt file.  The generated .txt file
> would have no markup and would conform to .txt PEP style as closely as
> possible.  reST generated html would then have a link both to the
> original reST source, and to the plain text form.

I don't see why reST needs to produce .txt output.  The reST source is
readable enough.

> A little competition never hurt anyone. :)  So I'd open it up and let
> PEP authors decide, and we can do a side-by-side comparison of which
> format folks prefer to use.

Exactly.  Let's do it.

--Guido van Rossum (home page:

From  Fri Aug  2 01:28:14 2002
From: (Delaney, Timothy)
Date: Fri, 2 Aug 2002 10:28:14 +1000
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
Message-ID: <>

> From: Ka-Ping Yee []
> One can separate two issues here:
>     1. too much functionality (YAGNI)
>     2. too many ways of expressing the same functionality (TMTOWTDI)

From my reading of the reST docs at various times, I've come to the
conclusion that YAGNI doesn't apply - that each of the features exists
because someone *did* need it (i.e. they had a real use case for it).

I do feel that the explanations of some of the constructs are somewhat
confusing. I think all constructs should include at least one (and
preferably more) use cases in their explanations.

Personally, I'm in favour of having the complete reST specification, but
have well-defined conventions for usage of reST within specific
applications. So a "docstring convention" document would specify what the
structure of a docstring should (or must) include, how it is parsed, what
interpreted text means, etc. Fairly comprehensive examples should be
included. Unless you had a very specific need you shouldn't go outside of the
convention, but it should be available if you needed it for something which
couldn't be expressed otherwise.

If all you wanted to do was write docstrings, you would refer to the
docstring convention document.

If you wanted to write a PEP, you would refer to the PEP convention
document.

Because they use the same underlying syntax, knowing how to do one will
help with learning how to do the other. One may normally use more (or
different) constructs than the other, but there will be a lot of crossover.

Tim Delaney

From  Fri Aug  2 01:36:52 2002
From: (Greg Ewing)
Date: Fri, 02 Aug 2002 12:36:52 +1200 (NZST)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
Message-ID: <>

> >         void PyObject_ReleaseFixedBuffer(PyObject *obj);
> > 
> Would it be useful to allow bf_releasefixedbuffer to return an int
> indicating an exception?  For instance, it could raise an exception if the
> extension errantly releases more times than it has acquired

The code making the call might not be in an easy position
to deal with an exception -- e.g. an asynchronous I/O
routine called from a signal handler, another thread,
etc.

Maybe use the warning mechanism to produce a message?
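
A small Python sketch of what that could look like.  FixedBufferOwner
and its lock_count attribute are hypothetical names used only for this
example; the point is that an unbalanced release is reported through
the warnings module instead of raising into the caller.

```python
import warnings

class FixedBufferOwner:
    """Toy object that counts outstanding fixed-buffer acquisitions."""
    def __init__(self):
        self.lock_count = 0

    def acquire(self):
        self.lock_count += 1

    def release(self):
        if self.lock_count == 0:
            # Over-release: report it, but do not raise into the caller.
            warnings.warn("fixed buffer released more times than acquired",
                          RuntimeWarning, stacklevel=2)
            return
        self.lock_count -= 1
```

A warning filter set to "error" would still turn this into an
exception, so the caller is only shielded under the default filters.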

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Fri Aug  2 01:38:52 2002
From: (Greg Ewing)
Date: Fri, 02 Aug 2002 12:38:52 +1200 (NZST)
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <Pine.LNX.4.44.0208011605280.7588-100000@ziggy>
Message-ID: <>

Ka-Ping Yee <>:

> To me, using "1.", "(1)", or "1)" to number a list makes no semantic
> difference at all, and so it counts as redundancy.

I imagine the variations are there so that RST documents
can be easily read in their own right. Having different
styles of headings, list numbers, etc. for different
levels aids readability.

Each of these features is no doubt useful for one
application or another, but we're talking about
a fairly restricted application here. Docstrings are
usually pretty short and not likely to require
multiple levels of headings, lists, etc.

So I'm in favour of choosing a subset to recommend,
perhaps mandate. Maybe a slightly larger subset
could be used for PEPs, since they're somewhat
longer.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Fri Aug  2 03:07:19 2002
From: (Aahz)
Date: Thu, 1 Aug 2002 22:07:19 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <00b101c2393a$e4a01ce0$e000a8c0@thomasnotebook>
References: <00b101c2393a$e4a01ce0$e000a8c0@thomasnotebook>
Message-ID: <>

<whew!>  I finally read all these threads today, cleaning out much of my
OSCON backlog.  Now, maybe I'm stupid, but I'm not understanding the
relationship between the new buffer protocol (PEP 298) and the new bytes
object (PEP 296).  Should this be something documented in one or both
PEPs?
Aahz (           <*>

Project Vote Smart:

From  Fri Aug  2 03:21:07 2002
From: (David Goodger)
Date: Thu, 01 Aug 2002 22:21:07 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <>
Message-ID: <>

Barry A. Warsaw wrote:
> I like that PEPs are 70-odd column plain text, with just a few style
> guidelines to aid in the html generation tool, and to promote
> consistency.  I think of PEPs as our RFCs and I'm dinosaurically
> attached to the RFC format, which has served standards bodies well for
> so long.  I like that the plain text sources are readable and
> consistent, with virtually no rules that are hard to remember.  More
> importantly for me, I find it easy to do editing passes on submitted
> PEPs in order to ensure consistency.

Why are PEPs converted to HTML at all then?  (Semi-seriously :-)

RFCs pre-date the Web, HTML, GUIs, and PCs.  There is a great advantage in
sticking to a text-based format, but the existing structure is very limited.
RFCs are so 20th century; don't you think it's time to move on? ;-)
Dinosaurs have a tendency to become extinct you know.

Given a small amount of use, I think you'll find the rules easy to remember.
There should be little effect on editing.  At most, Emacs may need to be
taught to recognize a bit more punctuation.

> The noisy markup in reST bothers me, although you've done a good job
> in minimizing the impact compared to other markup languages.

It's a trade-off: functionality for markup intrusion.  It's the
functionality of the processed form that's important: inline live links;
live links to & from footnotes; automatic tables of contents (with live
links!); images (don't you just *cringe* when you see ASCII graphics?);
pleasant, readable text.  The markup is minimal, quickly and easily ignored.

> I made this suggestion privately to David, but I'll repeat it here.
> I'd be willing to accept that PEPs /may/ be written in reST as an
> alternative to plaintext, but not require it.

Sure. I thought I'd emphasized that in my original post: it'd be an
alternative, the two styles can coexist.  If you want to keep PEP 0 as it
is, that's fine.  I converted it to show that its special processing was
also supported.

> I'd like for PEP authors to explicitly choose one or the other, preferably by
> file extension (e.g. .txt for plain text, .rst or .rest for reST).

I'm not keen on a new file extension (this issue has come up before).
There's so much in place on many platforms that says .txt means text files,
and reStructuredText files *are* text files, with just a bit of formal
structure sprinkled over.  Browsers know what to do with .txt files; they
wouldn't know what to do with .rest or .rtxt files.  Near-universal file
naming conventions are not the place to innovate IMHO.

> I'd also like for there to be two tools for generating derivative forms from
> the original source.
> I would leave alone.  That's the tool that generates .html
> from .txt.

See (based on revision 1.37 of
Python's nondist/peps/  Other than abstracting the file I/O and
some minor changes for consistency & legibility, the
reStructuredText-specific part is just two functions.  One checks for the
format of the PEP, and the other calls Docutils to do the work.  Even
without a new file extension, there's no need for a separate tool.

> I'd write a different tool that took a .rst file and
> generated both a .html file and a .txt file.  The generated .txt file
> would have no markup and would conform to .txt PEP style as closely as
> possible.  reST generated html would then have a link both to the
> original reST source, and to the plain text form.

Do we need a slightly less-structured text output?  I don't think so, but I
offered two alternative strategies in PEP 287:

    a) Keep the existing PEP section structure constructs (one-line
       section headers, indented body text).  Subsections can either
       be forbidden, or supported with reStructuredText-style
       underlined headers in the indented body text.
    b) Replace the PEP section structure constructs with the
       reStructuredText syntax.  Section headers will require
       underlines, subsections will be supported out of the box, and
       body text need not be indented (except for block quotes).

Strategy (b) has been implemented; that's what the edited PEP 287 uses.  I'd
recommend against it, but if you insist on existing PEP structure, strategy
(a) fits better although inconsistently (depending on the decision on
subsections).

> A little competition never hurt anyone. :)  So I'd open it up and let
> PEP authors decide, and we can do a side-by-side comparison of which
> format folks prefer to use.

Sure.  Once authors see what the new markup gives them, I'm sure there will
be some converts.

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug  2 03:28:42 2002
From: (David Goodger)
Date: Thu, 01 Aug 2002 22:28:42 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <Pine.LNX.4.44.0208011605280.7588-100000@ziggy>
Message-ID: <>

>>> It took a long time.  Perhaps it seems not so big to others, but
>>> my personal opinion would be to recommend against this proposal
>>> until the specification fits in, say, 1000 lines and can be absorbed
>>> in ten minutes.

>> The specification is, as its title says, a *specification*.  It's a detailed
>> description of the markup, intended to guide the *developer* who is writing a
>> parser or other tool.  It's not user documentation.

> Okay, i understand that it's a spec and not a user manual.  I think
> the fact that it takes that much text to describe all of the rules
> does say something about its complexity, though.

I prefer the term "rich" over "complex". ;-)

Seriously, any significant technology requires a significant spec.  Did you
look at the primer and quick reference?

- Primer:
- Quick reference:

I wouldn't recommend the Language or Library Reference to a Python newbie
either; they're references!  I'd point them to the Tutorial.  The primer
above is reStructuredText's tutorial: short & sweet.

> But again i want to stress that i think the structured-text approach
> is good and i do not advocate abandoning the whole idea; i just want
> a simpler set of rules.

Speaking from experience having hashed out all these issues over the last
two years, "a simpler set of rules" won't work.  Sure, a few conveniences
could be trimmed from reStructuredText, and all we'd lose would be
convenience.  Go past that and the markup would become less useful.  Cut
everything you listed and the markup would be next to useless.

>>> For me, it violates the fits-in-my-brain principle:
>>> the spec is 2500 lines long, and supports six different kinds of
>>> references and five different kinds of lists (even lists with roman
>>> numerals!).  It also violates the one-way-to-do-it principle:
>>> for example, there are a huge variety of ways to do headings,
>>> and two different syntaxes for drawing a table.
>> How many times have we heard this?  "All we need are paragraphs and bullet
>> lists."  That line of argument has been going on for at least six years, and
>> has hampered progress all along.
> Well, that depends what you mean by "progress"!  :)  There might be
> something to that line of argument, if it has a habit of cropping up.

"Progress" in auto-documentation tools for Python.  "Progress" in a usable,
successful structured plaintext markup.  OTOH, there's been just as much
pressure from the other direction: "The markup needs a construct for XYZ."
reStructuredText is the result of working toward a practical, usable, and
readable balance.

> One can separate two issues here:
> 1. too much functionality (YAGNI)
> 2. too many ways of expressing the same functionality (TMTOWTDI)
> As for the first, there's some room to argue here.  I happen to feel
> there are quite a few YAGNI features in RST, like the Roman numerals
> and the RCS keyword processing.

What's the big deal about Roman numerals?  Human beings have many ways to
count; our markup should allow us the freedom to choose the style we like.
Ask a lawyer if Roman numerals for lists are expendable.  If *you* don't
like them, don't use them.

RCS keyword processing is *not* a syntax feature; it's for readability, so
readers don't have that $RCS: cruft$ shoved in their faces.

> Auto-numbering in particular takes RST in a direction that makes me
> uncomfortable -- it means that RST now has the potential for a
> compile-debug cycle.

That's true with *any* markup processing system.  It's the price of the
increased functionality and readability of the processed result.  A small
price, IMHO.  The reStructuredText parser is very helpful with diagnostics,
and can only improve with user feedback.

I've volunteered to do the processing, so there should be no impact on

> But as for the second, i just don't see any justification for it.
> Reducing the multiple ways to do headers and lists and tables doesn't
> cripple anything; it only makes RST simpler and easier to understand.

Headers: by "multiple ways", are you referring to the author's choice of
underline style?  Or to the choice for overline & underline versus
underline-only?  Perhaps, when reading the spec, it's overwhelming; so don't
start with the spec!  But I don't see the big deal in having variety.  The
true test is this: when you look at a reStructuredText title, in whatever
style, does it scream out at you, "I am a title!"?  Without knowing anything
about the markup, most people would answer "yes, it does".  The same is true
for lists and tables too.  I base this on reports from people who are using
Docutils/reStructuredText in the real world, introducing it to non-technical
users, and reporting nothing but positive experiences.

Lists: see below.

Tables: the "simple table" syntax was added recently, because although it's
limited, it's much simpler to type and edit than the original "grid tables".
But grid tables don't have the limitations, so it's practical to keep both
constructs around.

> I acknowledge that there is some question of opinion as to what is the
> "same" functionality, causing issues to slush over from #1 to #2.
> To me, using "1.", "(1)", or "1)" to number a list makes no semantic
> difference at all, and so it counts as redundancy.

The variety of list styles is based on real-world usage.  See "The Chicago
Manual of Style", 14th edition, section 8.79 (page 315): every variation of
list enumeration is right there in a single nested list.  Any reasonable
person looking at any of those list styles will understand what they mean.

Different strokes for different folks.  Variety is the spice of life, and a
necessity for otherwise dry documentation.

> If you already have definition lists, why also have option lists and field
> lists?

They're semantically different.  Sure you could implement option & field
lists with definition lists, just as you could implement definition lists
with tables.  Option lists are explicitly for command-line option
descriptions.  Field lists are for name-value pairs where the details
matter, like database records or attributes of extension constructs

> If you already have literals, why have interpreted text?

They're very different things.  Literals are for monospaced,
*uninterpreted*, unprocessed, computer I/O text.  From PEP 287:

  Text enclosed in single backquotes is recognized as "interpreted
  text", whose interpretation is application-dependent.  In the
  context of a Python docstring, the default interpretation of
  interpreted text is as Python identifiers.  The text will be marked
  up with a hyperlink connected to the documentation for the
  identifier given.

In PEPs, there is no use for interpreted text currently (so they wouldn't be
mentioned in the new-style-PEP guide, except perhaps in a footnote saying
so).  In the future auto-documentation tool, interpreted text will do
explicitly what pydoc does auto-magically: link Python identifiers to their
definitions elsewhere.  But because it's explicit, interpreted text will not
be accidentally misinterpreted (as can happen in pydoc).
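
The difference between the two constructs can be sketched in a few
lines of Python.  This is a deliberate simplification and not how the
Docutils parser works: TOKEN and classify are names invented here, and
the real markup rules (roles, surrounding-character checks, trailing
underscores for references) are ignored.

```python
import re

# Double backquotes delimit literals; single backquotes delimit
# interpreted text.  The literal alternative is tried first.
TOKEN = re.compile(r"``(.+?)``|`([^`]+)`")

def classify(text):
    """Return a list of (kind, content) pairs found in text."""
    found = []
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            found.append(("literal", m.group(1)))
        else:
            found.append(("interpreted", m.group(2)))
    return found
```

An application is then free to decide what "interpreted" means, for
example looking the content up as a Python identifier, while literals
pass through untouched.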

> If you already have both footnotes *and* inline URLs, why also have anonymous
> inline hyperlink references?

Because inline live links are useful, but nobody wants to trip over a
three-line URL in the middle of a sentence.

>> OTOH, I have no problem with mandating standard uses, like a standard set of
>> section title adornments.
> If you're going to recommend certain ways, why not just decide
> what to use and be done with it?  When designing a new standard,
> there's no point starting out with parts of it already deprecated.

PEPs are just one application of Docutils/reStructuredText.  I see no
conflict here.  Groups often use a technology in conjunction with a
conventions guide limiting the local use of that technology, for the sake of
consistency or simplicity.  We have such guides for Python's C code and
stdlib code.  (Does the Python LaTeX documentation mandate a subset of
LaTeX?  I know it specifies *additional* macros to use.)

In the case of PEPs, I think a guide recommending certain practices would be
appropriate, rather than mandating that certain constructs *not* be used.
Constructs not used in PEPs are useful in other applications.  Nothing would
be deprecated, just "not used in PEPs".

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug  2 03:28:57 2002
From: (David Goodger)
Date: Thu, 01 Aug 2002 22:28:57 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <000001c23925$37357a60$777ba8c0@ericlaptop>
Message-ID: <>

Eric and Timothy, thank you for putting quite clearly what I am sometimes
unable to express myself.

-- David

From  Fri Aug  2 04:04:41 2002
From: (Aahz)
Date: Thu, 1 Aug 2002 23:04:41 -0400
Subject: [Python-Dev] Sorting
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Mon, Jul 22, 2002, Tim Peters wrote:
> In an effort to save time on email (ya, right ...), I wrote up a pretty
> detailed overview of the "timsort" algorithm.  It's attached.

It seems pretty clear by now that the new mergesort is going to replace
samplesort, but since nobody else has said this, I figured I'd add one
more comment:

I actually understood your description of the new mergesort algorithm.
Unless you can come up with similar docs for samplesort, the Beer Truck
scenario dictates that mergesort be the new gold standard.

Or to quote Tim Peters, "Complex is better than complicated."
Aahz (           <*>

Project Vote Smart:

From  Fri Aug  2 06:24:33 2002
From: (Zack Weinberg)
Date: Thu, 1 Aug 2002 22:24:33 -0700
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Thu, Aug 01, 2002 at 03:27:43PM -0400, Guido van Rossum wrote:

> ... I don't recall exactly why we ended up in this situation in the
> first place.  It's possible that it's an unnecessary sacrifice of a
> dead chicken, but it's also possible that there are platforms where
> this addressed a real need.  I'd like to think that it was because I
> didn't want to add more cruft to posixmodule.c (I've long given up
> on that :-).

I found out why it's done the way it is: There is no execvpe() in C,
not even in the extended-to-hell-and-back GNU libc.  I considered
dinking around with the C-level environ pointer so that execvp() would
do what we want, but this seems unreliable at best, given how many
different ways to access the environment there are.

So I think we're back to option 2 (enumerate the possible errors for
each platform).  ENOENT and ENOTDIR should cover it for Unix.  Would
other platform maintainers care to comment, please?
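
Option 2 could look roughly like the following in Python.  This is
only a sketch, not the patch under discussion: execvpe_sketch,
SKIPPABLE and the exec_func parameter are names invented here
(exec_func exists so the loop can be exercised without replacing the
process), and remembering the first non-skippable error to raise after
the loop is an assumption about the desired behaviour.

```python
import errno
import os

# Errors that just mean "not in this directory; try the next one".
# ENOENT and ENOTDIR are the Unix values named above; other platforms
# might need more entries.
SKIPPABLE = (errno.ENOENT, errno.ENOTDIR)

def execvpe_sketch(file, args, env, exec_func=os.execve):
    """Try each PATH directory in turn; raise the first real error."""
    if os.path.dirname(file):
        # An explicit path is never searched for along PATH.
        exec_func(file, args, env)
        return
    path = env.get("PATH", os.defpath).split(os.pathsep)
    saved = None
    for directory in path:
        fullname = os.path.join(directory, file)
        try:
            exec_func(fullname, args, env)
            return  # only reached when exec_func is a test double
        except OSError as exc:
            # Remember the first error that is not "file not found".
            if exc.errno not in SKIPPABLE and saved is None:
                saved = exc
    if saved is not None:
        raise saved
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), file)
```

No temporary file is needed to discover what "not found" looks like;
the cost is that the SKIPPABLE tuple has to be right for each platform.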


From  Fri Aug  2 06:31:35 2002
From: (Scott Gilbert)
Date: Thu, 1 Aug 2002 22:31:35 -0700 (PDT)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
Message-ID: <>

--- Aahz <> wrote:
> <whew!>  I finally read all these threads today, cleaning out much of my
> OSCON backlog.  Now, maybe I'm stupid, but I'm not understanding the
> relationship between the new buffer protocol (PEP 298) and the new bytes
> object (PEP 296).  Should this be something documented in one or both
> PEPs?

In the course of examining PEP 296 (the one I'm working on), Thomas Heller
thought it would be a good idea to make some additions to PyBufferProcs and
abstract.h so that he, and others, could treat a wider class of objects
with one API.  I was only proposing the bytes object, whereas he wanted to
be able to write code that works with bytes, string, mmap, array, and any
other buffer-like object uniformly (since they all make promises about the
lifetime of the pointer).

I liked his idea but was concerned that making additional changes to the
Python baseline might get received poorly.  In other words, I'm an
overconservative worrywort, and wanted to make sure I didn't sink PEP 296
with features of PEP 298.  As such, I encouraged him to submit a separate
PEP so that if the protocol part got sunk, the bytes object part could
remain.  He was probably sick of arguing with me at that point, so PEP 298
got created.

Guido apparently likes both PEPs, so it looks like both will get in if our
implementations are timely and don't suck.  If I could have channeled Guido
a week ago, there might be only one PEP.  However, with the way this played
out, it has the benefit (to me at least) that now Thomas Heller is on the
hook for part of the implementation.  :-)

As for documenting this, my next draft of PEP 296 (later tonight) will
refer to PEP 298 to indicate that the bytes objects will support the
"fixed/locked buffer protocol".



From  Fri Aug  2 06:54:12 2002
From: (Scott Gilbert)
Date: Thu, 1 Aug 2002 22:54:12 -0700 (PDT)
Subject: [Python-Dev] PEP 298, __buffer__
Message-ID: <>

Tonight, I remembered another thought that I've had for a while.

There isn't currently a way for a class object created from Python script
to indicate that it wishes to implement the buffer interface.  In the
Numeric source, I've seen them use self.__buffer__ for this purpose, but
this isn't actually an officially sanctioned magic name.

Now that classes can derive from builtin types, perhaps there is less of a
need for this, but I still think we would want it.  There are times when
you want inheritance, and others when you want containment.  With a slight
modification to the PyObject_*Buffer functions (in the failure branches),
an instance of a class could use containment of a PyBufferProcs supporting
object and publish the buffer interface as its own.

I'm thinking one of:

    class OneWay(object):
        def __init__(self):
            self.__buffer__ = bytes(1000)


    class SomeOther(object):
        def __init__(self):
            self._private = bytes(1000)
        def __buffer__(self):
            return self._private

I believe the first one is the way it's done in Numeric (Numarray too?).
(Maybe Todd Miller will comment on this and whether it's useful to him.)
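
A minimal Python-level sketch of the containment idea may help here.  It
assumes a hypothetical helper, as_buffer(), standing in for the proposed
failure-branch change to the PyObject_*Buffer functions, and it uses
bytearray in place of the proposed bytes object.  Neither the helper name
nor the lookup order is part of any PEP; this is only one way it could work.

```python
def as_buffer(obj):
    """Return a memoryview of obj, falling back to a contained buffer.

    Hypothetical helper: try the object's own buffer support first, and
    only in the failure branch look for a __buffer__ attribute or method.
    """
    try:
        return memoryview(obj)            # object supports buffers natively
    except TypeError:
        pass
    inner = getattr(obj, "__buffer__", None)
    if inner is None:
        raise TypeError("object does not expose a buffer")
    if callable(inner):                   # second style: __buffer__() method
        inner = inner()
    return memoryview(inner)              # first style: plain attribute


class OneWay(object):
    def __init__(self):
        self.__buffer__ = bytearray(1000)   # stand-in for bytes(1000)


class SomeOther(object):
    def __init__(self):
        self._private = bytearray(1000)     # stand-in for bytes(1000)

    def __buffer__(self):
        return self._private
```

Both classes then publish the contained object's memory as their own:
as_buffer(OneWay()) and as_buffer(SomeOther()) each give a 1000-byte view,
while an object with neither native support nor __buffer__ still fails
with TypeError.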

If this is worthwhile, it could be added to PEP 298 or as a new mini PEP. 
In either case, I'm willing to do the work.



From  Fri Aug  2 07:45:39 2002
From: (Zack Weinberg)
Date: Thu, 1 Aug 2002 23:45:39 -0700
Subject: [Python-Dev] rewrite, take two
Message-ID: <>

Now at Sourceforge:


From  Fri Aug  2 08:10:07 2002
From: (Ville Vainio)
Date: Fri, 02 Aug 2002 10:10:07 +0300
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
References: <>
Message-ID: <>

David G wrote:

>PEPs are just one application of Docutils/reStructuredText.  I see no

Exactly. I can't understand the motivation in crippling a useful markup
in order to serve some niche application (however important). As it 
stands, restx appears to be general enough to be accepted as a kind of 
"standard" markup, to be used for authoring all documentation, python 
related or not - perhaps even motivating people to write an emacs mode 
for it.

-- Ville

From  Fri Aug  2 09:27:17 2002
From: (Michael Hudson)
Date: 02 Aug 2002 09:27:17 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Guido van Rossum's message of "Thu, 01 Aug 2002 14:19:11 -0400"
References: <> <>
Message-ID: <>

Guido van Rossum <> writes:

> > Here goes.  Everything is relative to 221-base, which is 2.2.1 from Sean's 
> > RPM.  This is the slowest, so all percentages are negative, and more 
> > negative is better.  I hope the names are obvious.
> > 
> > 221-base             +0.00% (obviously)
> > 221-O-base:          -9.69%
> > CVS-base:           -15.43%
> > CVS-O-base:         -23.56%
> > CVS-hacked:         -23.66%
> > CVS-O-hacked:       -23.70%
> > 
> > (Nearly 25% speed up since 221?  Boggle.  Some of this may be compilation 
> > options, I guess)
> No, pymalloc sped us up quite a bit.

Yes, this occurred to me after I posted.

pystone is a mystery.  It's a fair bit slower but also much more
variable with my patch.  Moving trace code out of line helps quite a
bit but it's still ~1% slower.

> > Anyway, it seems I haven't slowed -O down.  At some point I might try 
> > moving the trace code out of line and see if that has any effect.  Not 
> > today.

Did do this yesterday, in fact.  As I said, it helped pystone a bit,
so I'll upload a separate patch to sf.

> What's the next step?  I haven't had time to review your code.  Do you
> want to check it in without further review, or do you want to wait
> until someone can give it a serious look?  (Tim's on vacation this
> week so it might be a while.)

I think I'd like to wait for serious review.  I'd be surprised if the
patch went out of date at all quickly.

Also, it seems Lib/compiler currently works by generating SET_LINENO
and then builds co_lnotab by scanning for them afterwards.  That's not
going to work in the new world, so I should probably think about how
to change it...


  Finding a needle in a haystack is a lot easier if you burn down
  the haystack and scan the ashes with a metal detector.
      -- the Silicon Valley Tarot (another one nicked from David Rush)

From  Fri Aug  2 09:45:06 2002
From: (Tim Peters)
Date: Fri, 02 Aug 2002 04:45:06 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

[Michael Hudson]
> ...
> Here goes.  Everything is relative to 221-base, which is 2.2.1
> from Sean's RPM.  This is the slowest, so all percentages are negative,
> and more negative is better.  I hope the names are obvious.
> 221-base             +0.00% (obviously)
> 221-O-base:          -9.69%
> CVS-base:           -15.43%
> CVS-O-base:         -23.56%
> CVS-hacked:         -23.66%
> CVS-O-hacked:       -23.70%


> (Nearly 25% speed up since 221?  Boggle.  Some of this may be compilation
> options, I guess)

No, it's the new sort implementation -- it's truly magical <wink>.

I've been telling people at Zope Corp that getting rid of SET_LINENO would
speed pystone (which is said to be a good predictor of Zope performance) by
at least 7%.  If you can fudge up a test showing that, your performance work
will be complete.

> ...
> What's the next step?  I haven't had time to review your code.  Do you
> want to check it in without further review, or do you want to wait
> until someone can give it a serious look?  (Tim's on vacation this
> week so it might be a while.)

I'm really not the best person for this, since, e.g., I never use the
debugger, so couldn't personally care less if it stopped working <0.9 wink>.

The patch set looks very complete, so I'd encourage a checkin if nobody
objects.

I have one objection, but it's kind of vague:  Michael, you're taking too
much delight in how obscure this is!  Two examples:

+ 	int instr_ub = -1, instr_lb = 0; /* for tracing */

It takes a lot of effort to reverse-engineer that the line number has
changed if and only if

    not instr_lb <= current_bytecode_offset < instr_ub

-- or at least to reverse-engineer that this is what you believe <wink>.
Paste the above in as a comment and save the next person the pain.  I got
hung up the first 5 minutes guessing that "lb" and "ub" referred to "lower
byte" and "upper byte".
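
To make the invariant concrete, here is a small Python sketch of the
bounds check.  It is not the patch's C code: find_bounds(), line_changed()
and the line_starts table are hypothetical names chosen for illustration,
and the table is assumed to list, in increasing order, the bytecode offset
at which each source line begins.

```python
import bisect


def find_bounds(line_starts, offset):
    """Return (lb, ub), the half-open offset range of the line holding offset."""
    i = bisect.bisect_right(line_starts, offset) - 1
    lb = line_starts[i]
    # The last line extends to the end of the code string.
    ub = line_starts[i + 1] if i + 1 < len(line_starts) else float("inf")
    return lb, ub


def line_changed(instr_lb, instr_ub, offset):
    """The line number has changed iff offset is outside [instr_lb, instr_ub)."""
    return not (instr_lb <= offset < instr_ub)


line_starts = [0, 7]          # assumed: one line starts at 0, the next at 7
instr_lb, instr_ub = 0, -1    # same initial values as the patch: an empty range

events = []                   # offsets at which a "new line" would be reported
for offset in (0, 3, 6, 7, 10):
    if line_changed(instr_lb, instr_ub, offset):
        instr_lb, instr_ub = find_bounds(line_starts, offset)
        events.append(offset)
```

Starting with instr_ub = -1 makes the range empty, so the very first
instruction always counts as a line change; after that the bounds are only
recomputed when execution leaves the cached range.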

The other example:

+ 			/* I (mwh) will gladly buy anyone a beer who
+ 			   can tell me off the top of their head why
+ 			   the exception for POP_TOP is needed... */

That's not going to be amusing two years from now when your unstated
reasoning is no longer true, and this code breaks.  Then someone will have
to guess what you thought you meant by this comment, whether your reasoning
was correct at the time, and what may have changed to invalidate it.  Rather
than tease, just explain why POP_TOP must be an exception.  If you don't
know why, I'll buy *you* a beer <wink>.

From  Fri Aug  2 10:29:57 2002
From: (Michael Hudson)
Date: 02 Aug 2002 10:29:57 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Tim Peters's message of "Fri, 02 Aug 2002 04:45:06 -0400"
References: <>
Message-ID: <>

Tim Peters <> writes:

> I've been telling people at Zope Corp that getting rid of SET_LINENO would
> speed pystone (which is said to be a good predictor of Zope performance) by
> at least 7%.  If you can fudge up a test showing that, your performance work
> will be complete.

It's about 5%:

$ ../../build/python 
Pystone(1.1) time for 100000 passes = 8.11
This machine benchmarks at 12330.5 pystones/second
$ ../../build/python 
Pystone(1.1) time for 100000 passes = 7.69
This machine benchmarks at 13003.9 pystones/second

I can run the vanilla pystone whilst compiling or something if you
like :)
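
For reference, the "about 5%" can be checked from the two runs above.
This is only arithmetic on the printed figures; the variable names are
mine, and it makes no claim about which binary produced which run.

```python
# Figures copied from the two pystone runs quoted above.
slow_time, fast_time = 8.11, 7.69          # seconds for 100000 passes
slow_rate, fast_rate = 12330.5, 13003.9    # pystones/second

# Fraction of run time saved, and fraction of throughput gained.
time_saved = (slow_time - fast_time) / slow_time
rate_gain = (fast_rate - slow_rate) / slow_rate

print("time saved: %.1f%%" % (time_saved * 100))
print("rate gain:  %.1f%%" % (rate_gain * 100))
```

Measured as time saved the difference is a little over 5%, and measured
as pystones/second gained it is a little under 5.5%.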

As I said, my patched Python is much more variable in pystone than
before.  I'm going to try invoking the Cache Effect Demon on this one,
unless someone can come up with a real explanation.

> [Guido]
> > ...
> > What's the next step?  I haven't had time to review your code.  Do you
> > want to check it in without further review, or do you want to wait
> > until someone can give it a serious look?  (Tim's on vacation this
> > week so it might be a while.)
> I'm really not the best person for this, since, e.g., I never use the
> debugger, so couldn't personally care less if it stopped working <0.9 wink>.
> The patch set looks very complete, so I'd encourage a checkin if nobody
> objects.
> I have one objection, but it's kind of vague:  Michael, you're taking too
> much delight in how obscure this is!

It's the old boys club effect: I worked damn hard to get to the point
of understanding this stuff, so everyone else should bloody well have
to too!

>  Two examples:
> + 	int instr_ub = -1, instr_lb = 0; /* for tracing */
> It takes a lot of effort to reverse-engineer that the line number has
> changed if and only if
>     not instr_lb <= current_bytecode_offset < instr_ub
> -- or at least to reverse-engineer that this is what you believe <wink>.
> Paste the above in as a comment and save the next person the pain.  I got
> hung up the first 5 minutes guessing that "lb" and "ub" referred to "lower
> byte" and "upper byte".

Ah, OK.  Actually, taking the tracing code out of line makes me feel
less uneasy about adding hundred+ line comments explaining what's going
on.

> The other example:
> + 			/* I (mwh) will gladly buy anyone a beer who
> + 			   can tell me off the top of their head why
> + 			   the exception for POP_TOP is needed... */
> That's not going to be amusing two years from now when your unstated
> reasoning is no longer true, and this code breaks.  Then someone will have
> to guess what you thought you meant by this comment, whether your reasoning
> was correct at the time, and what may have changed to invalidate it.  Rather
> than tease, just explain why POP_TOP must be an exception.  If you don't
> know why, I'll buy *you* a beer <wink>.

All I can say is that I'd been driven insane by co_lnotab and
Python/compile.c when I wrote that comment <wink>.


  I'm okay with intelligent buildings, I'm okay with non-sentient
  buildings. I have serious reservations about stupid buildings.
     -- Dan Sheppard, (from Owen Dunn's summary of the year)

From  Fri Aug  2 11:14:04 2002
From: (Ka-Ping Yee)
Date: Fri, 2 Aug 2002 03:14:04 -0700 (PDT)
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208020305000.7588-100000@ziggy>

On Fri, 2 Aug 2002, Ville Vainio wrote:
> Exactly. I can't understand the motivation in crippling a useful markup
> in order to serve some niche application (however important).

I just don't see how it is "crippling" to have one simple way to do
something instead of lots of different ways to do the same thing.
If anything, having more choices is worse, because it's not clear
which differences are meaningful or meaningless, and you may have to
think harder about which one to choose.

-- ?!ng

From  Fri Aug  2 11:41:05 2002
From: (Todd Miller)
Date: Fri, 02 Aug 2002 06:41:05 -0400
Subject: [Python-Dev] PEP 298, __buffer__
References: <>
Message-ID: <>

Scott Gilbert wrote:

>Tonight, I remember another thought that I've had for a while.
>There isn't currently a way for a class object created from Python script
>to indicate that it wishes to implement the buffer interface.  In the
>Numeric source, I've seen them use self.__buffer__ for this purpose, but
>this isn't actually an officially sanctioned magic name.
>I'm thinking one of:
>    class OneWay(object):
>        def __init__(self):
>            self.__buffer__ = bytes(1000)
>    class SomeOther(object):
>        def __init__(self):
>            self._private = bytes(1000)
>        def __buffer__(self):
>            return self._private
>I believe the first one is the way it's done in Numeric (Numarray too?).

The numarray C-API essentially supports both usages, although we only
use the __buffer__ name in the second case.

>(Maybe Todd Miller will comment on this and whether it's useful to him.)

Yes, it is useful for prototyping.  Numarray calls a __buffer__()
method to support python class wrappers around mmap.  We use our class
wrappers around mmap to add the ability to chop a file up into
non-overlapping resizeable slices.  Each slice can be used as the buffer
of an independent memory mapped array.


From  Fri Aug  2 11:34:55 2002
From: (Michael Hudson)
Date: 02 Aug 2002 11:34:55 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Guido van Rossum's message of "Thu, 01 Aug 2002 14:19:11 -0400"
References: <> <>
Message-ID: <>

Guido van Rossum <> writes:

> What's the next step?  I haven't had time to review your code.  Do you
> want to check it in without further review, or do you want to wait
> until someone can give it a serious look?  (Tim's on vacation this
> week so it might be a while.)

I've found another annoying problem.  I'm not really expecting someone
here to solve it for me, but writing it down might help me think

This is about the function epilogues that always get generated.  I.e:

>>> def f():
...     if a:
...         print 1
>>> import dis
>>> dis.dis(f)
  2           0 LOAD_GLOBAL              0 (a)
              3 JUMP_IF_FALSE            9 (to 15)
              6 POP_TOP             

  3           7 LOAD_CONST               1 (1)
             10 PRINT_ITEM          
             11 PRINT_NEWLINE       
             12 JUMP_FORWARD             1 (to 16)
        >>   15 POP_TOP             
        >>   16 LOAD_CONST               0 (None)
             19 RETURN_VALUE        

You can see here that the epilogue gets associated with line 3,
whereas it shouldn't really be associated with any line at all.

For why this is a problem:

$ cat
a = 0
def f():
    if a:
        print 1

>>> pdb.runcall(t.f)
> /home/mwh/src/sf/python/dist/src/build/
-> if a:
(Pdb) s
> /home/mwh/src/sf/python/dist/src/build/
-> print 1
> /home/mwh/src/sf/python/dist/src/build/>None
-> print 1

The debugger stopping on the "print 1" is confusing.

There's an "obvious" solution to this: check if we're less than 4
bytes from the end of the code string and don't do anything if we are.
This would be easy, except that for some bonkers reason, we support
arbitrary buffer objects for code strings!  (see _PyCode_GETCODEPTR in
Include/compile.h -- though at least you can't create a code object
with an array code string from python, the getreadbuffer failing will
cause the interpreter to unceremoniously crash and burn).

I guess I can store the length somewhere -- _PyCode_GETCODEPTR returns
this, more by accident than design I suspect -- or call
bf_getsegcount(frame->f_code->co_code, &length) or something.
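
As a rough Python sketch of that check (not the C code, and with
hypothetical names): it assumes the epilogue is always the 4 bytes of
LOAD_CONST plus RETURN_VALUE seen in the listing above, and that the
length of the code string is available.

```python
# LOAD_CONST takes 3 bytes and RETURN_VALUE 1 byte in the listing above.
EPILOGUE_SIZE = 4


def in_epilogue(offset, code_length):
    """True if offset falls inside the trailing LOAD_CONST/RETURN_VALUE pair."""
    return offset >= code_length - EPILOGUE_SIZE


# The listing above ends with RETURN_VALUE at offset 19, so the code
# string is 20 bytes long and the epilogue starts at offset 16.
code_length = 20
skipped = [off for off in (0, 7, 15, 16, 19) if in_epilogue(off, code_length)]
```

With those numbers, offsets 16 and 19 are skipped.  The POP_TOP at
offset 15 lies just outside the window, so a check like this does not
cover it by itself.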

Does anyone actually *use* this feature?  I see Guido checked it in
and the patch was written by Greg Stein.  Anyone remember motivations
from the time?


  In general, I'd recommend injecting LSD directly into your temples,
  Syd-Barret-style, before mucking with Motif's resource framework.
  The former has far lower odds of leading directly to terminal
  insanity.                                            -- Dan Martinez

From  Fri Aug  2 11:42:42 2002
From: (Ka-Ping Yee)
Date: Fri, 2 Aug 2002 03:42:42 -0700 (PDT)
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208020201510.7588-100000@ziggy>

On Thu, 1 Aug 2002, David Goodger wrote:
> Speaking from experience having hashed out all these issues over the last
> two years, "a simpler set of rules" won't work.  Sure, a few conveniences
> could be trimmed from reStructuredText, and all we'd lose would be
> convenience.  Go past that and the markup would become less useful.  Cut
> everything you listed and the markup would be next to useless.

If you took reST and removed the features i listed, you would have a
markup system with paragraphs, multi-level headings, nestable bullet
lists and numbered lists, definition lists, literal blocks, block quotes,
tables, inline emphasis, inline literals, footnotes, inline hyperlinks,
and internal and external hyperlink targets.

Sounds pretty powerful to me.  I find it strange that you would call
this "next to useless".

> > Auto-numbering in particular takes RST in a direction that makes me
> > uncomfortable -- it means that RST now has the potential for a
> > compile-debug cycle.
> That's true with *any* markup processing system.  It's the price of the
> increased functionality and readability of the processed result.

I don't want to have to debug my text.

If the markup is simple enough, determining the output requires very
little context (a line or two); this means i can be sure what i'm going
to get.  Auto-numbering expands the context so any part of the entire
document can affect the transformation of another part of the document.

> A small price, IMHO.

The difference between writing a document once and *knowing* that it's
correct, and having to compile-test-debug a few times, is a big cost.

> I've volunteered to do the processing, so there should be no impact on
> anyone.

Huh?  I don't know what you mean here.  The design of reST impacts
everyone who has to read or write documents in it.

> > But as for the second, i just don't see any justification for it.
> > Reducing the multiple ways to do headers and lists and tables doesn't
> > cripple anything; it only makes RST simpler and easier to understand.
> Headers: by "multiple ways", are you referring to the author's choice of
> underline style?  Or to the choice for overline & underline versus
> underline-only?

Both.  Why have an assortment of 32 random punctuation characters,
for a total of 64 different ways to do a heading?  Who's going to
remember what the characters are, anyway?  Pick one or two and stick
to them.  There are really only two obvious ones: '-' and '='.

You can differentiate heading levels by indenting the heading, one
space per level.  It would be vastly easier to tell what level a
heading was at by looking at its position, rather than running all
the way back to the beginning of the document and counting the number
of different heading styles that appear.
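
A small Python sketch of the contrast, using hypothetical helper names:
level_by_indent() implements the indentation rule suggested here, and
levels_by_style() mimics assigning levels in the order that underline
styles are first seen.

```python
def level_by_indent(title_line):
    """Heading level from position alone: one leading space per level."""
    return len(title_line) - len(title_line.lstrip(" ")) + 1


def levels_by_style(underline_chars):
    """Heading levels from the order in which underline styles first appear.

    The level of any one heading depends on every heading before it.
    """
    seen = []
    levels = []
    for ch in underline_chars:
        if ch not in seen:
            seen.append(ch)
        levels.append(seen.index(ch) + 1)
    return levels
```

level_by_indent("  Details") needs only the one line to answer 3, whereas
levels_by_style(["=", "-", "=", "~", "-"]) has to walk the whole sequence
to arrive at [1, 2, 1, 3, 2].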

> true test is this: when you look at a reStructuredText title, in whatever
> style, does it scream out at you, "I am a title!"?  Without knowing anything
> about the markup, most people would answer "yes, it does".

That's a backwards argument.  It's good that reST titles look
like titles.  But that doesn't mean reST has to recognize all
possible things that might look like titles, as titles.

It's a lot easier to say "just underline your title with a row
of hyphens" than "choose one of the following list of 32 random
punctuation marks to underline your title; and optionally overline it;
oh, but actually we think you should use only the following subset
of the 32 punctuation marks..."

> > If you already have definition lists, why also have option lists and field
> > lists?
> They're semantically different.  Sure you could implement option & field
> lists with definition lists, just as you could implement definition lists
> with tables.  Option lists are explicitly for command-line option
> descriptions.  Field lists are for name-value pairs where the details
> matter, like database records or attributes of extension constructs
> (directives).

All three are about associating a list of things with their
corresponding definitions.  Distinguishing whether the things
being defined are options or not is just as unnecessary as
distinguishing shopping lists, to-do lists, hit lists, etc.

-- ?!ng

From  Fri Aug  2 14:25:56 2002
From: (Aahz)
Date: Fri, 2 Aug 2002 09:25:56 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Thu, Aug 01, 2002, Scott Gilbert wrote:
> --- Aahz <> wrote:
>> <whew!>  I finally read all these threads today, cleaning out much of my
>> OSCON backlog.  Now, maybe I'm stupid, but I'm not understanding the
>> relationship between the new buffer protocol (PEP 298) and the new bytes
>> object (PEP 296).  Should this be something documented in one or both
>> PEPs?
> In the course of examining PEP 296 (the one I'm working on), Thomas Heller
> thought it would be a good idea to make some additions to PyBufferProcs and
> abstract.h so that he, and others, could treat a wider class of objects
> with one API.  I was only proposing the bytes object, whereas he wanted to
> be able to write code that works with bytes, string, mmap, array, and any
> other buffer-like object uniformly (since they all make promises about the
> lifetime of the pointer).

Seems to me that part of my confusion lies in the fact that PEP 296 says
that the bytes object is suitable for implementing arrays, whereas the
discussion surrounding PEP 298 coughed up the issue that pure fixed
buffers without locking were insufficient for arrays.  
Aahz (           <*>

Project Vote Smart:

From  Fri Aug  2 15:18:02 2002
From: (Barry A. Warsaw)
Date: Fri, 2 Aug 2002 10:18:02 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
References: <>
Message-ID: <>

>>>>> "DG" == David Goodger <> writes:

    DG> Why are PEPs converted to HTML at all then?  (Semi-seriously
    DG> :-)

To brand them with a Python banner and give them some hyperlinks. <wink>

    DG> RFCs pre-date the Web, HTML, GUIs, and PCs.  There is a great
    DG> advantage in sticking to a text-based format, but the existing
    DG> structure is very limited.  RFCs are so 20th century; don't
    DG> you think it's time to move on? ;-) Dinosaurs have a tendency
    DG> to become extinct you know.

They also become the oil that drives our engines of industry, to twist
an analogy.  :)

And of course RFCs are also converted to html:

    DG> Given a small amount of use, I think you'll find the rules
    DG> easy to remember.  There should be little effect on editing.
    DG> At most, Emacs may need to be taught to recognize a bit more
    DG> punctuation.

We'll see!

    >> The noisy markup in reST bothers me, although you've done a
    >> good job in minimizing the impact compared to other markup
    >> languages.

    DG> It's a trade-off: functionality for markup intrusion.  It's
    DG> the functionality of the processed form that's important:
    DG> inline live links; live links to & from footnotes; automatic
    DG> tables of contents (with live links!); images (don't you just
    DG> *cringe* when you see ASCII graphics?); pleasant, readable
    DG> text.  The markup is minimal, quickly and easily ignored.

Taken to the extreme, why do we even use a text based format at all?
We could, of course, get all that by authoring the PEPs directly in

    >> I made this suggestion privately to David, but I'll repeat it
    >> here.  I'd be willing to accept that PEPs /may/ be written in
    >> reST as an alternative to plaintext, but not require it.

    DG> Sure. I thought I'd emphasized that in my original post: it'd
    DG> be an alternative, the two styles can coexist.  If you want to
    DG> keep PEP 0 as it is, that's fine.  I converted it to show that
    DG> its special processing was also supported.


    >> I'd like for PEP authors to explicitly choose one or the other,
    >> preferably by file extension (e.g. .txt for plain text .rst or
    >> .rest for reST).

    DG> I'm not keen on a new file extension (this issue has come up
    DG> before).  There's so much in place on many platforms that says
    DG> .txt means text files, and reStructuredText files *are* text
    DG> files, with just a bit of formal structure sprinkled over.
    DG> Browsers know what to do with .txt files; they wouldn't know
    DG> what to do with .rest or .rtxt files.  Near-universal file
    DG> naming conventions are not the place to innovate IMHO.

Don't most servers default to text/plain for types they don't know?
I'm pretty sure Apache does.

If a file extension isn't acceptable, then I'd still want the
determination of plaintext vs. reST to be explicit.  The other
alternative is to add a PEP header to specify.  I'd propose calling it
Content-Type: and use text/x-rest as the value.
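
A sketch of how a tool might read such a header, assuming the PEP starts
with an RFC 822 style header block and that a missing header means plain
text.  The function name pep_format() and the return values are made up
for illustration.

```python
from email.parser import HeaderParser


def pep_format(text):
    """Return "rest" or "plaintext" based on an explicit Content-Type header."""
    headers = HeaderParser().parsestr(text)
    # Assumption: PEPs without the header are treated as plain text.
    ctype = headers.get("Content-Type", "text/plain")
    if ctype.strip().lower() == "text/x-rest":
        return "rest"
    return "plaintext"


sample = (
    "PEP: 9999\n"
    "Title: An Example\n"
    "Content-Type: text/x-rest\n"
    "\n"
    "Abstract\n"
    "========\n"
)
```

Here pep_format(sample) gives "rest", and the same text without the
Content-Type line gives "plaintext", so existing PEPs would need no change.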

    >> I'd also like for there to be two tools for generating
    >> derivative forms from the original source.  I would leave
    >> alone.  That's the tool that generates .html from
    >> .txt.

    DG> See (based on
    DG> revision 1.37 of Python's nondist/peps/  Other
    DG> than abstracting the file I/O and some minor changes for
    DG> consistency & legibility, the reStructuredText-specific part
    DG> is just two functions.  One checks for the format of the PEP,
    DG> and the other calls Docutils to do the work.  Even without a
    DG> new file extension, there's no need for a separate tool.

Fair enough.  Let's do this: send me a diff against v1.39 of  I just downloaded docutils-0.2, but I'm not sure of the
best way to integrate this in the nondist/peps directory.

- If we do the normal install, that's fine for my machine but
  it means that everyone who will be pushing out peps will have to do
  the same.

- If we hack to put ./docutils-0.2 on sys.path, then we
  can just check this stuff into the peps directory and it should Just
  Work.  We'd have to update it when new docutils releases are made.
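
The second option could look roughly like this.  It is a sketch only:
add_bundled_package() is a hypothetical helper, and the directory name is
the one mentioned above.

```python
import os
import sys


def add_bundled_package(base_dir, name="docutils-0.2"):
    """Put a package directory checked in under base_dir first on sys.path."""
    path = os.path.join(base_dir, name)
    if path not in sys.path:
        sys.path.insert(0, path)    # found before any installed copy
    return path
```

A script in the peps directory would call it with its own directory, e.g.
add_bundled_package(os.path.dirname(os.path.abspath(__file__))), before
importing docutils.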

Suggestions?  Mostly I'd like to hear from others who push out new PEP
versions.  Would you rather have to install a distutils package in the
normal way locally, or would you rather have everything you need in
the nondist/peps directory?

OTOH, if plaintext PEPs work without having access to the docutils
package, that would be fine too (another reason perhaps for an
explicit flag).

    >> I'd write a different tool that took a .rst file and generated
    >> both a .html file and a .txt file.  The generated .txt file
    >> would have no markup and would conform to .txt PEP style as
    >> closely as possible.  reST generated html would then have a
    >> link both to the original reST source, and to the plain text
    >> form.

    DG> Do we need a slightly less-structured text output?

Maybe not.  I'd prefer to have it, but if I'm alone there then I'll
give up that crusade (or at least call YAGNI for now).

    DG> I don't think so, but I offered two alternative strategies in
    DG> PEP 287:

    >> a) Keep the existing PEP section structure constructs (one-line
    >>    section headers, indented body text).  Subsections can either be
    >>    forbidden, or supported with reStructuredText-style underlined
    >>    headers in the indented body text.

    >> b) Replace the PEP section structure constructs with the
    >>    reStructuredText syntax.  Section headers will require underlines,
    >>    subsections will be supported out of the box, and body text need
    >>    not be indented (except for block quotes).

    DG> Strategy (b) has been implemented; that's what the edited PEP
    DG> 287 uses.  I'd recommend against it, but if you insist on
    DG> existing PEP structure, strategy (a) fits better although
    DG> inconsistently (depending on the decision on subsections).

a) might also mean you'd have to reflow paragraphs to fit in the
column width restrictions.  I'd prefer a) but it may be more
problematic.  Moot if YAGNI prevails.

    >> A little competition never hurt anyone. :) So I'd open it up
    >> and let PEP authors decide, and we can do a side-by-side
    >> comparison of which format folks prefer to use.

    DG> Sure.  Once authors see what the new markup gives them, I'm
    DG> sure there will be some converts.

Let's find out.

From  Fri Aug  2 15:19:53 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 10:19:53 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: Your message of "Fri, 02 Aug 2002 12:36:52 +1200."
References: <>
Message-ID: <>

> > >         void PyObject_ReleaseFixedBuffer(PyObject *obj);
> > 
> > Would it be useful to allow bf_releasefixedbuffer to return an int
> > indicating an exception?  For instance, it could raise an exception if the
> > extension errantly releases more times than it has acquired
> The code making the call might not be in an easy position
> to deal with an exception -- e.g. an asynchronous I/O
> routine called from a signal handler, another thread,
> etc.
> Maybe use the warning mechanism to produce a message?

In an asynch I/O situation, calling PyErr_Warn() is out of the
question (it invokes Python code!).

I propose to make it a fatal error -- after all the only reason why
bf_releasefixedbuffer could fail should be that the caller makes a
mistake.  Since that's a bug in C code, a fatal error is acceptable.
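
A Python model of the counting involved, for illustration only.  The
class FixedBufferLock is hypothetical, and SystemError stands in for the
fatal error, which in C would abort the process rather than raise.

```python
class FixedBufferLock(object):
    """Toy model of a per-object count of outstanding fixed-buffer users."""

    def __init__(self):
        self.count = 0

    def acquire(self):
        self.count += 1

    def release(self):
        if self.count <= 0:
            # Only a caller bug can get here: more releases than acquires.
            raise SystemError("buffer released more times than acquired")
        self.count -= 1
```

Balanced acquire()/release() pairs never hit the error path, so the
release side has nothing to report back to a correct caller.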

--Guido van Rossum (home page:

From  Fri Aug  2 15:23:42 2002
From: (Thomas Heller)
Date: Fri, 2 Aug 2002 16:23:42 +0200
Subject: [Python-Dev] PEP 298, __buffer__
Message-ID: <011901c23a30$3449d3d0$e000a8c0@thomasnotebook>

[Unfortunately my email-server seems to be on vacation even earlier
than myself.
It seems I have not received some posts/replies: I'm currently reading
the archives. Hopefully this one gets through]
[Not the first time.]

Scott writes:
> There isn't currently a way for a class object created from Python script
> to indicate that it wishes to implement the buffer interface.  In the
> Numeric source, I've seen them use self.__buffer__ for this purpose, but
> this isn't actually an officially sanctioned magic name.

This is an idea I also had for quite some time (very vague, maybe).
I like it, but I haven't thought about it very carefully.


From  Fri Aug  2 15:31:17 2002
From: (Thomas Heller)
Date: Fri, 2 Aug 2002 16:31:17 +0200
Subject: [Python-Dev] Email problems, PEP 298
Message-ID: <012d01c23a31$435493a0$e000a8c0@thomasnotebook>

I'm currently having severe email-problems here: right
in place for my vacation :-(. I cannot participate in
the discussion anyway for two weeks, but hopefully
I will be able to read it afterwards.

Maybe I should have posted PEP 298 in its current form
to python-dev.

Some of the suggestions mentioned here are already included.


From  Fri Aug  2 15:38:14 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 10:38:14 -0400
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: Your message of "Thu, 01 Aug 2002 22:24:33 PDT."
References: <> <> <> <>
Message-ID: <>

> On Thu, Aug 01, 2002 at 03:27:43PM -0400, Guido van Rossum wrote:
> > ... I don't recall exactly why we ended up in this situation in the
> > first place.  It's possible that it's an unnecessary sacrifice of a
> > dead chicken, but it's also possible that there are platforms where
> > this addressed a real need.  I'd like to think that it was because I
> > didn't want to add more cruft to posixmodule.c (I've long given up
> > on that :-).
> I found out why it's done the way it is: There is no execvpe() in C,
> not even in the extended-to-hell-and-back GNU libc.  I considered
> dinking around with the C-level environ pointer so that execvp() would
> do what we want, but this seems unreliable at best, given how many
> different ways to access the environment there are.


> So I think we're back to option 2 (enumerate the possible errors for
> each platform).  ENOENT and ENOTDIR should cover it for Unix.  Would
> other platform maintainers care to comment, please?

Don't wait for them.  Just submit a patch and assign it to me. :-)
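
Option 2 as quoted above (treat only an enumerated set of errno values
as "keep searching the path") might look roughly like the following.
This is a minimal sketch and not the patch under discussion:
search_path_and_exec and its injectable exec_func parameter are
hypothetical names, added so the loop can be exercised without
replacing the running process.

```python
import errno
import os

# Errors that only mean "not in this directory, try the next one".
KEEP_SEARCHING = (errno.ENOENT, errno.ENOTDIR)


def search_path_and_exec(name, argv, env, exec_func=os.execve):
    """Try `name` in each PATH directory (hypothetical helper)."""
    path = env.get("PATH", os.defpath)
    last_error = None
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        try:
            # A real exec never returns; a fake one used in tests may.
            return exec_func(candidate, argv, env)
        except OSError as exc:
            if exc.errno not in KEEP_SEARCHING:
                raise  # a "real" failure such as EACCES: report it now
            last_error = exc
    if last_error is None:
        last_error = OSError(errno.ENOENT, os.strerror(errno.ENOENT), name)
    raise last_error
```

Any errno outside the enumerated set stops the search immediately,
which is the behaviour the per-platform error list is meant to give.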

--Guido van Rossum (home page:

From  Fri Aug  2 15:48:30 2002
From: (Aahz)
Date: Fri, 2 Aug 2002 10:48:30 -0400
Subject: [Python-Dev] PEP 1, PEP Purpose and Guidelines
In-Reply-To: <>
References: <>
Message-ID: <>

All right, here are some suggested changes.  Actual suggestions are
indented; commentary and meta-information is not indented.

On Mon, Jul 29, 2002, Barry A. Warsaw wrote:
> Kinds of PEPs
>     There are two kinds of PEPs.  A standards track PEP describes a
>     new feature or implementation for Python.  An informational PEP
>     describes a Python design issue, or provides general guidelines or
>     information to the Python community, but does not propose a new
>     feature.  Informational PEPs do not necessarily represent a Python
>     community consensus or recommendation, so users and implementors
>     are free to ignore informational PEPs or follow their advice.


    Some informational PEPs become Meta-PEPs that describe the workflow
    of the Python project.  Project contributions that fail to follow the
    prescriptions of Meta-PEPs are likely to be rejected.

>     If the PEP editor approves, he will assign the PEP a number, label
>     it as standards track or informational, give it status 'draft',
>     and create and check-in the initial draft of the PEP.  The PEP
>     editor will not unreasonably deny a PEP.  Reasons for denying PEP
>     status include duplication of effort, being technically unsound,
>     not providing proper motivation or addressing backwards
>     compatibility, or not in keeping with the Python philosophy.  The
>     BDFL (Benevolent Dictator for Life, Guido van Rossum) can be
>     consulted during the approval phase, and is the final arbitrator
>     of the draft's PEP-ability.


    If the PEP editor approves, he will assign the pre-PEP a number,
    label it as standards track or informational, give it status 'draft',
    and create and check-in the initial draft of the PEP.  The PEP editor
    will not unreasonably deny a pre-PEP.  Reasons for denying PEP status
    include duplication of effort, being technically unsound, not
    providing proper motivation or addressing backwards compatibility, or
    not in keeping with the Python philosophy.  The BDFL (Benevolent
    Dictator for Life, Guido van Rossum) can be consulted during the
    approval phase, and is the final arbitrator of the draft's
    PEP-ability.  Generally speaking, if a pre-PEP meets technical
    standards, it will be accepted as a PEP to provide a historical
    record even if likely to be rejected (see the later section on
    rejecting a PEP).

(This is to clarify the distinction between denying a pre-PEP and
rejecting a PEP later in the process.)

>     A PEP can also be `Rejected'.  Perhaps after all is said and done
>     it was not a good idea.  It is still important to have a record of
>     this fact.

Add (not sure whether it should be a separate paragraph):

    The PEP author is responsible for recording summaries of all
    arguments in favor and opposition.  This is particularly important
    for rejected PEPs to reduce the likelihood of rehashing the same

>     6. Rationale -- The rationale fleshes out the specification by
>        describing what motivated the design and why particular design
>        decisions were made.  It should describe alternate designs that
>        were considered and related work, e.g. how the feature is
>        supported in other languages.
>        The rationale should provide evidence of consensus within the
>        community and discuss important objections or concerns raised
>        during discussion.

I'm thinking we should add a section 9) titled "Discussion summary" to
make it clearer that the PEP author is required to include this
Aahz (           <*>

Project Vote Smart:

From  Fri Aug  2 16:32:20 2002
From: (Gordon McMillan)
Date: Fri, 2 Aug 2002 11:32:20 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process	PEPs
In-Reply-To: <>
References: <Pine.LNX.4.44.0208011605280.7588-100000@ziggy>
Message-ID: <3D4A6DC4.26978.2F61817D@localhost>

On 1 Aug 2002 at 22:28, David Goodger wrote:

> Seriously, any significant technology requires a
> significant spec.  

The wheel?

> Speaking from experience having hashed out all these
> issues over the last two years, "a simpler set of
> rules" won't work.  

Ah, but you've been hashing it out with a group of
people who *care* about things like this. Welcome
to the larger world (where dinosaurs still roam).

-- Gordon

From  Fri Aug  2 16:34:33 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 11:34:33 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Fri, 02 Aug 2002 11:34:55 BST."
References: <> <>
Message-ID: <>

> I've found another annoying problem.  I'm not really expecting someone
> here to solve it for me, but writing it down might help me think
> clearly.
> This is about the function epilogues that always get generated.  I.e:
> The debugger stopping on the "print 1" is confusing.
> There's an "obvious" solution to this: check if we're less than 4
> bytes from the end of the code string and don't do anything if we are.

Um, I think that's less than reliable.  I believe we just discussed
this when Oren's patch for yield in try/finally did a similar thing
(and weren't you the one who mentioned that your bytecodehacks can
cause this assumption to fail? :-).

I'm not actually sure that this needs fixing.  Surely the --Return--
should be a sufficient hint.  I note that without your patch it also
stops at a confusing place, albeit a different one (on the "if a:"
line).

> This would be easy, except that for some bonkers reason, we support
> arbitrary buffer objects for code strings!  (see _PyCode_GETCODEPTR in
> Include/compile.h -- though at least you can't create a code object
> with an array code string from python, the getreadbuffer failing will
> cause the interpreter to unceremoniously crash and burn).

That went a little too fast.  Can you explain that parenthetical
remark more clearly?

> I guess I can store the length somewhere -- _PyCode_GETCODEPTR returns
> this, more by accident than design I suspect -- or call
> bf_getsegcount(frame->f_code->co_code, &length) or something.
> Does anyone actually *use* this feature?  I see Guido checked it in
> and the patch was written by Greg Stein.  Anyone remember motivations
> from the time?

Yes, Greg insisted that he might want to store Python bytecode in
Flash ROM, and that this way the bytecode would not have to be copied
to RAM.  But I don't think this ever happened (well, maybe the
now-dead Pippy port to PalmOS used it???).

I'd be happy to kill it as a YAGNI.

But that still doesn't mean I approve checking for "4 bytes from the end".

--Guido van Rossum (home page:

From  Fri Aug  2 16:35:55 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 11:35:55 -0400
Subject: [Python-Dev] PEP 298, __buffer__
In-Reply-To: Your message of "Fri, 02 Aug 2002 06:41:05 EDT."
References: <>
Message-ID: <>

> Scott Gilbert wrote:
> >Tonight, I remember another thought that I've had for a while.
> >
> >There isn't currently a way for a class object created from Python script
> >to indicate that it wishes to implement the buffer interface.  In the
> >Numeric source, I've seen them use self.__buffer__ for this purpose, but
> >this isn't actually an officially sanctioned magic name.
> >
> >
> >I'm thinking one of:
> >
> >    class OneWay(object):
> >        def __init__(self):
> >            self.__buffer__ = bytes(1000)
> >
> >Or:
> >
> >    class SomeOther(object):
> >        def __init__(self):
> >            self._private = bytes(1000)
> >        def __buffer__(self):
> >            return self._private
> > 
> >I believe the first one is the way it's done in Numeric (Numarray too?). 

[Todd Miller]
> The numarray C-API essentially supports both usages, although we only 
> use the __buffer__ name in the second case.  
> >
> >(Maybe Todd Miller will comment on this and whether it's useful to him.)
> >
> Yes, it is useful for prototyping.   Numarray calls a  __buffer__() 
> method to support python class wrappers around mmap.   We use our class 
> wrappers around mmap to add the ability to chop a file up into 
> non-overlapping resizeable slices.  Each slice can be used as the buffer 
> of an independent memory mapped array.

This would be easy enough to add, I suppose, but (a) I don't think
it's got much to do with PEP 298, and (b) let's wait until we have a
real use case, so perhaps we can decide which form it should take.
Until then, I call YAGNI.
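
For readers trying to picture the second form (a __buffer__() method
on a class wrapping mmap, as Todd describes), here is a minimal
sketch. It assumes the zero-argument convention from the quoted
message, which was not a sanctioned protocol at the time; MappedSlice
is a hypothetical name, and an anonymous mapping stands in for a real
file.

```python
import mmap


class MappedSlice:
    """Hypothetical wrapper exposing one slice of a shared mapping."""

    def __init__(self, mapping, start, stop):
        # memoryview slicing shares memory with the mapping; no copy is made.
        self._view = memoryview(mapping)[start:stop]

    def __buffer__(self):
        # Zero-argument form, as proposed in the thread (an assumption here).
        return self._view


# Anonymous 16-byte mapping chopped into two non-overlapping slices.
mapping = mmap.mmap(-1, 16)
first = MappedSlice(mapping, 0, 8)
second = MappedSlice(mapping, 8, 16)

first.__buffer__()[0:2] = b"hi"    # writes through to the mapping
second.__buffer__()[0:2] = b"yo"   # lands at offset 8 of the mapping
```

Each slice can then back an independent array-like object while the
underlying mapping stays shared.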

--Guido van Rossum (home page:

From  Fri Aug  2 16:43:06 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 11:43:06 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Fri, 02 Aug 2002 09:27:17 BST."
References: <> <>
Message-ID: <>

> pystone is a mystery.  It's a fair bit slower but also much more
> variable with my patch.  Moving trace code out of line helps quite a
> bit but it's still ~1% slower.

Hm.  For me (with your latest patch which moves the trace code out of
line) pystone is actually *less* variable with your patch than
without, and it's also faster with -O than before.

So I wouldn't lose any sleep over pystone (leave that to Tim :-).

Maybe you should increase LOOPS in; I usually set it to 40K
or even 100K.

> I think I'd like to wait for serious review.  I'd be surprised if the
> patch went out of date at all quickly.

Fair enough.

> Also, it seems Lib/compiler currently works by generating SET_LINENO
> and then builds co_lnotab by scanning for them afterwards.  That's not
> going to work in the new world, so I should probably think about how
> to change it...

Or wait for Jeremy.  (I suppose you still *support* the SET_LINENO
opcode?)

BTW, you should change the .pyc magic number in your patch.
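
As background for the co_lnotab remark quoted above: the table stores
line information as a string of byte pairs instead of SET_LINENO
opcodes in the instruction stream. The sketch below decodes such a
table, assuming the classic layout of unsigned (bytecode increment,
line increment) pairs; later CPython releases changed the encoding.
decode_lnotab and the sample table are hypothetical.

```python
def decode_lnotab(first_lineno, lnotab):
    """Return [(bytecode_offset, line_number), ...] for each line start."""
    starts = []
    offset = 0
    lineno = first_lineno
    last_lineno = None
    # Even positions hold offset increments, odd positions line increments.
    for byte_incr, line_incr in zip(lnotab[0::2], lnotab[1::2]):
        if byte_incr:
            if lineno != last_lineno:
                starts.append((offset, lineno))
                last_lineno = lineno
            offset += byte_incr
        lineno += line_incr
    if lineno != last_lineno:
        starts.append((offset, lineno))
    return starts


# Hypothetical table: line 1 at offset 0, line 2 at offset 6, line 4 at 14.
sample = bytes([6, 1, 8, 2])
line_starts = decode_lnotab(1, sample)
```

A compiler that no longer emits SET_LINENO has to build this table
directly while generating code, which is the change Lib/compiler
would need.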

--Guido van Rossum (home page:

From  Fri Aug  2 16:46:51 2002
From: (Michael Hudson)
Date: 02 Aug 2002 16:46:51 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Guido van Rossum's message of "Fri, 02 Aug 2002 11:43:06 -0400"
References: <> <> <> <>
Message-ID: <>

Guido van Rossum <> writes:

> > pystone is a mystery.  It's a fair bit slower but also much more
> > variable with my patch.  Moving trace code out of line helps quite a
> > bit but it's still ~1% slower.
> Hm.  For me (with your latest patch which moves the trace code out of
> line) pystone is actually *less* variable with your patch than
> without, and it's also faster with -O than before.
> So I wouldn't lose any sleep over pystone (leave that to Tim :-).

I wasn't going to.

> Maybe you should increase LOOPS in; I usually set it to 40K
> or even 100K.

Did that.  I thought measuring 0.8 secs was a bit on the thin side.

> > Also, it seems Lib/compiler currently works by generating SET_LINENO
> > and then builds co_lnotab by scanning for them afterwards.  That's not
> > going to work in the new world, so I should probably think about how
> > to change it...
> Or wait for Jeremy.


> (I suppose you still *support* the SET_LINENO opcode?)

No.  Do you think I should?

> BTW, you should change the .pyc magic number in your patch.

Really?  It's already changed since the last released Python.  Easy
enough to change again, though, and it makes testing easier.


  You have run into the classic Dmachine problem: your machine has
  become occupied by a malevolent spirit.  Replacing hardware or
  software will not fix this - you need an exorcist. 
                                       -- Tim Bradshaw, comp.lang.lisp

From  Fri Aug  2 16:47:14 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 11:47:14 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: Your message of "Fri, 02 Aug 2002 11:32:20 EDT."
References: <Pine.LNX.4.44.0208011605280.7588-100000@ziggy>
Message-ID: <>

> > Speaking from experience having hashed out all these
> > issues over the last two years, "a simpler set of
> > rules" won't work.  
> Ah, but you've been hashing it out with a group of
> people who *care* about things like this. Welcome
> to the larger world (where dinosaurs still roam).

Funny, Ping doesn't strike me as a dinosaur.  More as someone who
enjoys a good argument. :-)

--Guido van Rossum (home page:

From  Fri Aug  2 16:53:55 2002
From: (Michael Hudson)
Date: 02 Aug 2002 16:53:55 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Guido van Rossum's message of "Fri, 02 Aug 2002 11:34:33 -0400"
References: <> <> <> <>
Message-ID: <>

Guido van Rossum <> writes:

> > I've found another annoying problem.  I'm not really expecting someone
> > here to solve it for me, but writing it down might help me think
> > clearly.
> > 
> > This is about the function epilogues that always get generated.  I.e:
> [...]
> > The debugger stopping on the "print 1" is confusing.
> > 
> > There's an "obvious" solution to this: check if we're less than 4
> > bytes from the end of the code string and don't do anything if we are.
> Um, I think that's less than reliable.  I believe we just discussed
> this when Oren's patch for yield in try/finally did a similar thing
> (and weren't you the one who mentioned that your bytecodehacks can
> cause this assumption to fail? :-).

Good point.

> I'm not actually sure that this needs fixing.  Surely the --Return--
> should be a sufficient hint.  I note that without your patch it also
> stops at a confusing place, albeit a different one (on the "if a:"
> line).

The problem is that when we jump into the epilogue, a 'line' trace
event gets generated before the 'return' one.  So there is no
--Return-- hint.

> > This would be easy, except that for some bonkers reason, we support
> > arbitrary buffer objects for code strings!  (see _PyCode_GETCODEPTR in
> > Include/compile.h -- though at least you can't create a code object
> > with an array code string from python, the getreadbuffer failing will
> > cause the interpreter to unceremoniously crash and burn).
> That went a little too fast.  Can you explain that parenthetical
> remark more clearly?

1) Don't you find the idea of type(co.co_code) == types.ArrayType at
   least a little scary?  Mainly due to resizes -- having mutable code
   might be nice for development environments and such.

2) I thought it was possible for bf_getreadbuffer to fail (maybe I'm
   wrong here).  _PyCode_GETCODEPTR does no error checking.

> > I guess I can store the length somewhere -- _PyCode_GETCODEPTR returns
> > this, more by accident than design I suspect -- or call
> > bf_getsegcount(frame->f_code->co_code, &length) or something.
> > 
> > Does anyone actually *use* this feature?  I see Guido checked it in
> > and the patch was written by Greg Stein.  Anyone remember motivations
> > from the time?
> Yes, Greg insisted that he might want to store Python bytecode in
> Flash ROM, and that this way the bytecode would not have to be copied
> to RAM.

I see.

> But I don't think this ever happened


> (well, maybe the now-dead Pippy port to PalmOS used it???).

Maybe.  Somehow doubt it, though.

> I'd be happy to kill it as a YAGNI.

That's nice, but if...

> But that still doesn't mean I approve checking for "4 bytes from the
> end".

doesn't actually help.

Does anyone have any better ideas for not generating 'line' trace
events in the epilogue?


  I also feel it essential to note, [...], that Description Logics,
  non-Monotonic Logics, Default Logics and Circumscription Logics 
  can all collectively go suck a cow. Thank you.

From  Fri Aug  2 16:51:24 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 11:51:24 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Fri, 02 Aug 2002 16:46:51 BST."
References: <> <> <> <>
Message-ID: <>

> > (I suppose you still *support* the SET_LINENO opcode?)
> No.  Do you think I should?

I like to be on the conservative side.  Also, it would make life
easier for the compiler package until Jeremy has time to fix it. :-)

> > BTW, you should change the .pyc magic number in your patch.
> Really?  It's already changed since the last released Python.  Easy
> enough to change again, though, and it makes testing easier.

Given that each time I try your patches I get unknown opcode errors,
please change it.
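
The point of bumping the magic number is that stale .pyc files
compiled for a different opcode set get recompiled instead of being
loaded. A minimal sketch of that check, using the magic number that
current Python exposes as importlib.util.MAGIC_NUMBER
(bytecode_header_matches is a hypothetical helper, not the
interpreter's own code):

```python
import importlib.util


def bytecode_header_matches(header):
    """True if a .pyc header starts with this interpreter's magic number."""
    magic = importlib.util.MAGIC_NUMBER
    return header[:len(magic)] == magic
```

A file whose header fails this test was compiled by an interpreter
with a different bytecode format and should not be executed as is.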

--Guido van Rossum (home page:

From  Fri Aug  2 16:55:53 2002
From: (Scott Gilbert)
Date: Fri, 2 Aug 2002 08:55:53 -0700 (PDT)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
Message-ID: <>

--- Aahz <> wrote:
> Seems to me that part of my confusion lies in the fact that PEP 296 says
> that the bytes object is suitable for implementing arrays, whereas the
> discussion surrounding PEP 298 coughed up the issue that pure fixed
> buffers without locking were insufficient for arrays.  

Theoretically, you could use the bytes object and the struct module to
implement something that is functionally equivalent to arrays from
arraymodule.c (at least from the Python scripting point of view).  Let's
call that hypothetical reimplementation "".  However, since arrays
from arraymodule.c can be resized in place, the pointer is not necessarily
constant for the lifetime of the array object.
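
A rough sketch of that hypothetical reimplementation, assuming a
fixed-size bytearray stands in for the proposed bytes object and the
struct module handles packing. StructArray is an invented name; unlike
arrays from arraymodule.c it cannot be resized, so its storage never
moves.

```python
import struct


class StructArray:
    """Fixed-length typed array over a byte buffer (illustrative only)."""

    def __init__(self, typecode, length):
        self._format = "=" + typecode            # standard size, no padding
        self._itemsize = struct.calcsize(self._format)
        self._buffer = bytearray(self._itemsize * length)

    def __len__(self):
        return len(self._buffer) // self._itemsize

    def _offset(self, index):
        if not 0 <= index < len(self):
            raise IndexError("StructArray index out of range")
        return index * self._itemsize

    def __getitem__(self, index):
        (value,) = struct.unpack_from(self._format, self._buffer,
                                      self._offset(index))
        return value

    def __setitem__(self, index, value):
        struct.pack_into(self._format, self._buffer,
                         self._offset(index), value)


numbers = StructArray("i", 4)
numbers[2] = 42
```

Because the buffer length is fixed at construction, a pointer to it
would stay valid for the object's lifetime, which is the property the
resizable array type lacks.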



From  Fri Aug  2 17:07:44 2002
From: (Skip Montanaro)
Date: Fri, 2 Aug 2002 11:07:44 -0500
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
References: <>
Message-ID: <15690.44624.228485.503769@localhost.localdomain>

    Michael> Does anyone have any better ideas for not generating 'line'
    Michael> trace events in the epilogue?

How about adding a field to the code object which holds the byte code offset
of the epilogue?  The code which emits line events (where is that, btw?)
would not emit if the current instruction offset is >= the epilogue offset.
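
In Python terms the suggestion amounts to something like the sketch
below. FakeCode, its epilogue_offset field and line_events are all
hypothetical stand-ins; the real check would live in the C code that
emits 'line' trace events.

```python
from dataclasses import dataclass


@dataclass
class FakeCode:
    """Stand-in for a code object with the proposed extra field."""
    line_starts: dict      # bytecode offset -> source line number
    epilogue_offset: int   # first offset belonging to the epilogue


def line_events(code, executed_offsets):
    """Return the line numbers for which a 'line' event would fire."""
    events = []
    for offset in executed_offsets:
        if offset >= code.epilogue_offset:
            continue  # inside the generated epilogue: stay silent
        if offset in code.line_starts:
            events.append(code.line_starts[offset])
    return events


code = FakeCode(line_starts={0: 1, 6: 2, 14: 2}, epilogue_offset=14)
events = line_events(code, [0, 6, 14])
```

The jump into the epilogue at offset 14 produces no 'line' event, so
the 'return' event would be the next thing the debugger sees.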


From  Fri Aug  2 17:13:56 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 12:13:56 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Your message of "Fri, 02 Aug 2002 16:53:55 BST."
References: <> <> <> <>
Message-ID: <>

> The problem is that when we jump into the epilogue, a 'line' trace
> event gets generated before the 'return' one.  So there is no
> --Return-- hint.

Ah, I missed that detail in your transcript.

> 1) Don't you find the idea of type(co.co_code) == types.ArrayType at
>    least a little scary?  Mainly due to resizes -- having mutable code
>    might be nice for development environments and such.

Yes, it's scary.  But nobody does this, and as you say, it can't be
done from Python.

> 2) I thought it was possible for bf_getreadbuffer to fail (maybe I'm
>    wrong here).  _PyCode_GETCODEPTR does no error checking.

So one should only use objects whose bf_getreadbuffer won't fail (when
invoked with segment index 0).

> > I'd be happy to kill it as a YAGNI.
> That's nice, but if...
> > But that still doesn't mean I approve checking for "4 bytes from the
> > end".
> doesn't actually help.

Well, it kills off a potentially unsafe feature.

> Does anyone have any better ideas for not generating 'line' trace
> events in the epilogue?

Use a separate opcode for which you could check?

--Guido van Rossum (home page:

From  Fri Aug  2 16:56:57 2002
From: (Michael Hudson)
Date: Fri, 2 Aug 2002 16:56:57 +0100 (BST)
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

On Fri, 2 Aug 2002, Guido van Rossum wrote:

> > > (I suppose you still *support* the SET_LINENO opcode?)
> > 
> > No.  Do you think I should?
> I like to be on the conservative side.  Also, it would make life
> easier for the compiler package until Jeremy has time to fix it. :-)

OK.  I guess it can go down the bottom of the eval loop now (if that makes 
any difference).

> > > BTW, you should change the .pyc magic number in your patch.
> > 
> > Really?  It's already changed since the last released Python.  Easy
> > enough to change again, though, and it makes testing easier.
> Given that each time I try your patches I get unknown opcode errors,
> please change it.

OK, but it's 5pm on a Friday here :)

Have a good weekend everyone.


From  Fri Aug  2 18:19:52 2002
From: (Skip Montanaro)
Date: Fri, 2 Aug 2002 12:19:52 -0500
Subject: [Python-Dev] dbm module, whichdb, test_whichdb
Message-ID: <15690.48952.51439.449198@localhost.localdomain>


I just checked in a modified dbmmodule.c, and a new regression
test file,  The change to dbmmodule.c accommodates linkage
with Berkeley DB.  The change to whichdb catches this case (opening "foo"
actually creates "foo.db").  The new file simply adds
regression tests for the whole mess.

Please have a look, try it out, and let me know if it gives your system
heartburn.  Having messed around with this stuff off-and-on for awhile, I
have no illusions about this going in without tickling some platform
dependency.  Jack, you're especially on alert, because I know you had
problems with some earlier bsddb-related changes to  My iMac w/
MacOS X is still in Michigan with Ellen for the summer.


From  Fri Aug  2 19:15:27 2002
From: (Neil Schemenauer)
Date: Fri, 2 Aug 2002 11:15:27 -0700
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib,NONE,1.1
In-Reply-To: <>; from on Fri, Aug 02, 2002 at 09:44:34AM -0700
References: <>
Message-ID: <> wrote:
> Adding the heap queue algorithm, per discussion in python-dev last
> week.


> __about__ = """Heap queues [...]

Is this going to become a "blessed" special name or do you consider it
harmless abuse of the namespace?


From  Fri Aug  2 19:23:34 2002
From: (Zack Weinberg)
Date: Fri, 2 Aug 2002 11:23:34 -0700
Subject: [Python-Dev] Weird error handling in os._execvpe
In-Reply-To: <>
References: <> <> <> <> <> <>
Message-ID: <>

On Fri, Aug 02, 2002 at 10:38:14AM -0400, Guido van Rossum wrote:
> > So I think we're back to option 2 (enumerate the possible errors for
> > each platform).  ENOENT and ENOTDIR should cover it for Unix.  Would
> > other platform maintainers care to comment, please?
> Don't wait for them.  Just submit a patch and assign it to me. :-)

Done: id 590294.


From  Fri Aug  2 19:31:44 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 14:31:44 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib,NONE,1.1
In-Reply-To: Your message of "Fri, 02 Aug 2002 11:15:27 PDT."
References: <>
Message-ID: <>

> > __about__ = """Heap queues [...]
> Is this going to become a "blessed" special name or do you consider it
> harmless abuse of the namespace?

The latter.  I figured François' treatise was too long for the
docstring.  I was originally going to make it an unnamed string
literal -- maybe that's better?

--Guido van Rossum (home page:

From  Fri Aug  2 21:47:52 2002
From: (David Goodger)
Date: Fri, 02 Aug 2002 16:47:52 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <>
Message-ID: <>

Barry A. Warsaw wrote:
> And of course RFCs are also converted to html:

So they are.  Pretty picture at the top, navigation bar at top &
bottom, and a huge <PRE> in-between (with live RFC links at least).
Impressive.  ;-)

> Taken to the extreme, why do we even use a text based format at all?
> We could, of course, get all that by authoring the PEPs directly in

To answer your hypothetical (I assume), it's because raw HTML/XML/SGML
is unreadable to most people.  Plaintext is a common denominator,
useful because it's universally readable.  But texts like RFCs and
PEPs do have some structure; by formalizing that structure we can use
it.  Current PEPs are one up on RFCs, recognizing section titles for
HTML.  ReStructuredText just takes that further.

>     >> I'd like for PEP authors to explicitly choose one or the other,
>     >> preferably by file extension (e.g. .txt for plain text .rst or
>     >> .rest for reST).
>     DG> I'm not keen on a new file extension (this issue has come up
>     DG> before).  There's so much in place on many platforms that says
>     DG> .txt means text files, and reStructuredText files *are* text
>     DG> files, with just a bit of formal structure sprinkled over.
>     DG> Browsers know what to do with .txt files; they wouldn't know
>     DG> what to do with .rest or .rtxt files.  Near-universal file
>     DG> naming conventions are not the place to innovate IMHO.
> Don't most servers default to text/plain for types they don't know?
> I'm pretty sure Apache does.

I don't know the answer to that.  There's still what the browser does
with it at the client end, and what apps like Windows Explorer and
Mac's File Exchange do with file extensions.  I think those
side-effects make keeping .txt worth it.

> If a file extension isn't acceptable, then I'd still want the
> determination of plaintext vs. reST to be explicit.  The other
> alternative is to add a PEP header to specify.  I'd propose calling
> it Content-Type: and use text/x-rest as the value.

[Already replied to in private email; repeated here to elicit

Good idea.  But the header "Content-Type: text/x-rest" seems to imply
much more than is intended.  PEP 258 proposes a __docformat__ variable
to contain the name of the format being used for docstrings; perhaps a
"Format:" header for PEPs?  For example:

    Format: reStructuredText


    Format: RST

I prefer "RST" to "rest", which is already used as an acronym for the
"Representational State Transfer" protocol (see Paul Prescod's article

The existing format could be called "Plaintext" (or "PEP 1.0" ;-).
Without the "Format:" header, "Plaintext" would be the default.

[In his reply to the aforementioned private email,]

Barry pointed out:

    Since the PEP headers are modeled on RFC 2822, I say we stick with
    established standards rather than invent our own.  So
    "Content-Type: text/x-rest" seems natural, and for most related
    standards, if there is no Content-Type: header, text/plain is
    already the documented default.

Looking at the relevant standards (RFC 2616 etc.) I see his point.
Using "Content-type:" may seem like overkill now, but it's flexible
and future-proof (!).  The "charset:" part could also come in handy;
already, there are some PEPs (including PEP 0) which implicitly use

But "text/x-rst" would be better. :-)
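
Since PEP headers follow RFC 2822, the standard email package can
already read such a header and supply the text/plain default. A
minimal sketch; pep_content_type and the two sample headers (with a
made-up PEP number) are illustrative and not part of any PEP tool.

```python
from email import message_from_string

# Hypothetical PEP sources; only the header block matters here.
REST_PEP = "PEP: 9999\nTitle: Example\nContent-Type: text/x-rst\n\nBody.\n"
PLAIN_PEP = "PEP: 9999\nTitle: Example\n\nBody.\n"


def pep_content_type(pep_source):
    """Return the declared content type, or text/plain if none is given."""
    headers = message_from_string(pep_source)
    return headers.get_content_type()
```

Existing PEPs without the header would keep being treated as
plaintext, so no existing file needs to change.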

> Fair enough.  Let's do this: send me a diff against v1.39 of

Will do.

> I just downloaded docutils-0.2, but I'm not sure of the
> best way to integrate this in the nondist/peps directory.
> - If we do the normal install, that's fine for my machine but
>   it means that everyone who will be pushing out peps will have to do
>   the same.
> - If we hack to put ./docutils-0.2 on sys.path, then we
>   can just check this stuff into the peps directory and it should Just
>   Work.  We'd have to update it when new docutils releases are made.

The "docutils" package could be a subdirectory of nondist/peps under
CVS.  When is run, the current working directory is
already on the path so "import docutils" should just work and no
sys.path manipulation would be necessary.  But Docutils is
substantial and evolving.  I don't mind keeping Python's repository in
sync but would others object to the added files and CVS traffic?
Eventually I hope for Docutils to go into the stdlib, but it's not
ready for consideration yet.

I agree with the direct email consensus that "python install"
is best.

> OTOH, if plaintext PEPs work without having access to the docutils
> package, that would be fine too (another reason perhaps for an
> explicit flag).

Your wish is my command.  If Docutils isn't installed and
is asked to process a reStructuredText PEP, it will report the problem
and move on gracefully (no traceback).
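
The graceful fallback could be structured along these lines. This is
only a sketch of the idea, assuming a "Content-Type"-style value has
already been read from the PEP; convert_pep is hypothetical and
returns placeholder strings instead of doing any conversion.

```python
import importlib.util


def convert_pep(content_type, required_module="docutils"):
    """Describe what would happen to a PEP of the given content type."""
    if content_type == "text/plain":
        return "plaintext: converted"
    # Look for the package without importing it, so a missing install
    # is reported as a message and never as a traceback.
    if importlib.util.find_spec(required_module) is None:
        return "skipped: %s is not installed" % required_module
    return "reStructuredText: converted"
```

Plaintext PEPs never touch the optional dependency, so people who
only push out plaintext PEPs need not install anything.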

>     DG> Sure.  Once authors see what the new markup gives them, I'm
>     DG> sure there will be some converts.
> Let's find out.

Great.  I'll work on, a README, and a new "Template"
Meta-PEP including a recommended reStructuredText subset.

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug  2 21:48:05 2002
From: (David Goodger)
Date: Fri, 02 Aug 2002 16:48:05 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <3D4A6DC4.26978.2F61817D@localhost>
Message-ID: <>

>> Seriously, any significant technology requires a
>> significant spec.

> The wheel?

If the ISO were around at the time, yes!

> Ah, but you've been hashing it out with a group of
> people who *care* about things like this. Welcome
> to the larger world (where dinosaurs still roam).

> Funny, Ping doesn't strike me as a dinosaur.  More as someone who
> enjoys a good argument. :-)

So that wasn't Abuse?  What a relief!

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug  2 21:52:19 2002
From: (Finn Bock)
Date: Fri, 02 Aug 2002 22:52:19 +0200
Subject: [Python-Dev] timsort for jython
Message-ID: <>


Here are some numbers for a Java port of the timsort code.

The old sorting code in jython was the 1.5 code from CPython with a
quicksort implementation also inspired by Tim Peters.

Switching to timsort is obviously a no-brainer for us. You also don't
need to hold back on giving stability guarantees in the documentation for
jython's sake.

All numbers using JDK1.4.1 on Win2K & 1300 MHz AMD. I gave up waiting for
the i=20 line.


 i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
15    0.66    0.50    0.30    0.29    0.40    0.42    0.32    0.28    2.53
16    1.36    1.10    0.67    0.67    0.94    1.05    0.75    0.60    6.12
17    3.12    2.38    1.47    1.47    1.88    2.12    1.62    1.28   15.43
18    6.52    5.14    3.22    3.22    4.52    5.56    3.35    2.73   36.04
19   14.32   11.07    6.99    6.99    8.71   11.72    7.33    5.86   87.80


 i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
15    0.44    0.05    0.03    0.03    0.04    0.06    0.17    0.02    0.06
16    0.82    0.08    0.06    0.07    0.07    0.11    0.32    0.05    0.11
17    1.76    0.18    0.13    0.13    0.13    0.23    0.64    0.11    0.22
18    3.87    0.34    0.26    0.29    0.27    0.49    1.29    0.21    0.45
19    8.91    0.70    0.53    0.54    0.54    1.07    2.62    0.43    0.90


From  Fri Aug  2 22:07:09 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 17:07:09 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: Your message of "Fri, 02 Aug 2002 22:52:19 +0200."
References: <>
Message-ID: <>

> Here are some numbers for a Java port of the timsort code.
> The old sorting code in jython was the 1.5 code from CPython with a
> quicksort implementation also inspired by Tim Peters.
> Switching to timsort is obviously a no-brainer for us. You also don't
> need to hold back on giving stability guarantees in the documentation for
> jython's sake.

Woo hoo!  Way to go, Finn.  Sounds like you'll be able to make the
stability guarantee in Jython 2.2, whereas we can only make it for
CPython 2.3. :-)
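
For anyone unsure what the stability guarantee buys in practice:
records that compare equal under the sort key keep their original
relative order. A small illustration with made-up data:

```python
# Hypothetical records, already in alphabetical order by name.
records = [("alice", "ops"), ("bob", "dev"), ("carol", "ops"), ("dave", "dev")]

# Sort by department only; a stable sort keeps names alphabetical
# within each department without mentioning them in the key.
by_department = sorted(records, key=lambda record: record[1])
```

With an unstable sort the order of names inside each department would
be unspecified, and a second key would be needed to pin it down.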

--Guido van Rossum (home page:

From  Fri Aug  2 22:56:55 2002
From: (David Goodger)
Date: Fri, 02 Aug 2002 17:56:55 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib,NONE,1.1
Message-ID: <>

Guido wrote:
> I was originally going to make it an unnamed string
> literal -- maybe that's better?

In PEP 258 I call those "Additional Docstrings":

    Many programmers would like to make extensive use of docstrings
    for API documentation.  However, docstrings do take up space in
    the running program, so some of these programmers are reluctant to
    "bloat up" their code.  Also, not all API documentation is
    applicable to interactive environments, where __doc__ would be

    The docstring processing system's extraction tools will
    concatenate all string literal expressions which appear at the
    beginning of a definition or after a simple assignment.  Only the
    first strings in definitions will be available as __doc__, and can
    be used for brief usage text suitable for interactive sessions;
    subsequent string literals and all attribute docstrings are
    ignored by the Python bytecode compiler and may contain more
    extensive API information.


    Example::

        def function(arg):
            """This is __doc__, function's docstring."""
            """
            This is an additional docstring, ignored by the bytecode
            compiler, but extracted by the Docutils.
            """
            pass

(Original idea from Moshe Zadka.)

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug  2 21:35:24 2002
From: (Jack Jansen)
Date: Fri, 2 Aug 2002 22:35:24 +0200
Subject: [Python-Dev] dbm module, whichdb, test_whichdb
In-Reply-To: <15690.48952.51439.449198@localhost.localdomain>
Message-ID: <>

On vrijdag, augustus 2, 2002, at 07:19 , Skip Montanaro wrote:
>   Jack, you're especially on alert, because I know you had
> problems with some earlier bsddb-related changes to

Both test_whichdb and test_anydbm pass without problems.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Fri Aug  2 23:51:25 2002
From: (Tim Peters)
Date: Fri, 02 Aug 2002 18:51:25 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Finn Bock]
> ...
> The old sorting code in jython was the 1.5 code from CPython with a
> quicksort implementation also inspired by Tim Peters.

The sad thing is, that was a very good quicksort -- I thought I was done
when I wrote that <wink>.

> Switching to timsort is obviously a nobrainer for us.

Thanks for sharing this!  Made my day <smile>.  I noted in the Jython patch
that you should see a nice speedup by nuking the assert() calls once you're
confident in the port; Java is checking out-of-bound array indices for you,
and that's largely what the asserts are guarding against in the C

> You also don't need to hold back on giving stability guarantees in the
> documentation for jython's sake.

I didn't <wink>.  Stability doesn't come free, and for all I know, in
another 3 years a method will be discovered that's 3x faster but not stable.
For example, Splaysort is (as an email correspondent reminded me) provably
adaptive to all known measures of presortedness, but when I looked at the
code it "was obvious" that it wouldn't be competitive on random data; it
also requires two pointers per list element.  In coming years, researchers
may well dream up quicker ways to get the same goodness, but Splaysort isn't
stable, and very few fast algorithms are.  So I don't want to hobble future
implementations by holding Python to promises I don't care much about.
OTOH, I do expect that once code relies on stability, we'll have about as
much chance of taking that away as getting rid of list.append().
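
For anyone unsure what relying on stability looks like in practice, here is
a minimal sketch.  The record data is invented for illustration; the only
assumption is a sort that is stable, as the new mergesort is.

```python
# Hypothetical (name, count) records; the ties on count are the interesting part.
records = [("pear", 2), ("apple", 1), ("fig", 2), ("kiwi", 1)]

# Sort on the count only.  A stable sort keeps records with equal counts
# in their original relative order.
by_count = sorted(records, key=lambda record: record[1])

print(by_count)
```

Code that sorts by a secondary key first and a primary key second quietly
depends on exactly this behavior, which is why it would be hard to take away
once people start using it.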

From  Sat Aug  3 00:41:45 2002
From: (Aahz)
Date: Fri, 2 Aug 2002 19:41:45 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Fri, Aug 02, 2002, Tim Peters wrote:
> Stability doesn't come free, and for all I know, in another 3 years a
> method will be discovered that's 3x faster but not stable.

You're pulling our legs, right?  I thought you said this version of
mergesort was converging on the theoretical lower bound.
Aahz (           <*>

Project Vote Smart:

From  Sat Aug  3 01:28:43 2002
From: (Guido van Rossum)
Date: Fri, 02 Aug 2002 20:28:43 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib,NONE,1.1
In-Reply-To: Your message of "Fri, 02 Aug 2002 17:56:55 EDT."
References: <>
Message-ID: <>

> > I was originally going to make it an unnamed string
> > literal -- maybe that's better?
> In PEP 258 I call those "Additional Docstrings":
>     Many programmers would like to make extensive use of docstrings
>     for API documentation.  However, docstrings do take up space in
>     the running program, so some of these programmers are reluctant to
>     "bloat up" their code.  Also, not all API documentation is
>     applicable to interactive environments, where __doc__ would be
>     displayed.
>     The docstring processing system's extraction tools will
>     concatenate all string literal expressions which appear at the
>     beginning of a definition or after a simple assignment.  Only the
>     first strings in definitions will be available as __doc__, and can
>     be used for brief usage text suitable for interactive sessions;
>     subsequent string literals and all attribute docstrings are
>     ignored by the Python bytecode compiler and may contain more
>     extensive API information.
>     Example::
>         def function(arg):
>             """This is __doc__, function's docstring."""
>             """
>             This is an additional docstring, ignored by the bytecode
>             compiler, but extracted by the Docutils.
>             """
>             pass
> (Original idea from Moshe Zadka.)

Ah, I thought there had to be something like that. :-)

Do you also recognize this if there are comments between?  Or blank
lines?  E.g.

   def f():
       """
       foo
       """

       # blah

       """
       bar
       """

--Guido van Rossum (home page:

From  Sat Aug  3 01:51:22 2002
From: (Tim Peters)
Date: Fri, 02 Aug 2002 20:51:22 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

> Stability doesn't come free, and for all I know, in another 3 years a
> method will be discovered that's 3x faster but not stable.

> You're pulling our legs, right?  I thought you said this version of
> mergesort was converging on the theoretical lower bound.

For # of comparisons done on randomly ordered data, yes, there's a hard
lower bound of lg(n!) comparisons, but the samplesort hybrid was close
enough to that too that there wouldn't have been much point to timsort.  For
various kinds of partially ordered data, the only catch-all hard lower bound
is n-1 comparisons (read timsort.txt attached to the patch on SF, or
Objects/listsort.txt in current CVS -- there's much more info in those than
in the text file I posted to Python-Dev).

Comparisons aren't the whole story, either, as ~sort showed dramatically in
the x-platform timings (see the patch).  I believe timsort is sometimes more
cache-friendly than the samplesort hybrid (&, e.g., I see no other way to
explain the wild ~sort x-platform behavior), but it's not doing anything
heroic for cache-friendliness.  The pending-run stack invariants
automatically implement what's called "tiling" in the literature, but that's
not the only cache trick it *could* play.
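
The two bounds can be checked with a rough sketch like the following.  The
Counted wrapper and count_comparisons helper are made up for this
illustration, and the exact count for random input will vary with the
implementation; only the lg(n!) arithmetic is exact.

```python
import math
import random

class Counted:
    """Wrap a value and count every < comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def count_comparisons(values):
    """Sort a copy of values and return how many comparisons were made."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons

n = 1000
# lg(n!) via the log-gamma function: ln(n!) = lgamma(n + 1).
lower_bound = math.lgamma(n + 1) / math.log(2)

rng = random.Random(2002)          # fixed seed so runs are repeatable
shuffled = list(range(n))
rng.shuffle(shuffled)

random_count = count_comparisons(shuffled)
sorted_count = count_comparisons(range(n))

print("lg(n!)          ~ %.0f" % lower_bound)
print("random input    : %d comparisons" % random_count)
print("already sorted  : %d comparisons" % sorted_count)
```

On already-sorted input the count should come out at n-1, the catch-all
bound mentioned above, while random input lands near lg(n!).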

i'll-be-dead-before-the-sorting-story-ly y'rs  - tim

From  Sat Aug  3 03:01:34 2002
From: (Skip Montanaro)
Date: Fri, 2 Aug 2002 21:01:34 -0500
Subject: [Python-Dev] dbm module, whichdb, test_whichdb
In-Reply-To: <>
References: <15690.48952.51439.449198@localhost.localdomain>
Message-ID: <15691.14718.263409.776329@localhost.localdomain>

    Jack> Both test_whichdb and test_anydbm pass without problems.



From  Sat Aug  3 07:21:20 2002
From: (Ka-Ping Yee)
Date: Sat, 3 Aug 2002 01:21:20 -0500 (CDT)
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
Message-ID: <>

On Fri, 2 Aug 2002, David Goodger wrote:
> [Guido]
> > Funny, Ping doesn't strike me as a dinosaur.  More as someone who
> > enjoys a good argument. :-)
> So that wasn't Abuse?  What a relief!

Just to make sure you know: i don't argue only for the sake of arguing.
I argue when i think it will make Python better.

-- ?!ng

From  Sat Aug  3 07:27:14 2002
From: (Aahz)
Date: Sat, 3 Aug 2002 02:27:14 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sat, Aug 03, 2002, Ka-Ping Yee wrote:
> On Fri, 2 Aug 2002, David Goodger wrote:
>> [Guido]
>>> Funny, Ping doesn't strike me as a dinosaur.  More as someone who
>>> enjoys a good argument. :-)
>> So that wasn't Abuse?  What a relief!
> Just to make sure you know: i don't argue only for the sake of arguing.
> I argue when i think it will make Python better.

"That's what they all say."  ;-)
Aahz (           <*>

Project Vote Smart:

From  Sat Aug  3 15:34:43 2002
From: (David Goodger)
Date: Sat, 03 Aug 2002 10:34:43 -0400
Subject: [Python-Dev] Docutils/reStructuredText is ready to process
In-Reply-To: <>
Message-ID: <>

Ka-Ping Yee wrote:
> On Fri, 2 Aug 2002, David Goodger wrote:
>> [Guido]
>>> Funny, Ping doesn't strike me as a dinosaur.  More as someone who
>>> enjoys a good argument. :-)
>> So that wasn't Abuse?  What a relief!
> Just to make sure you know: i don't argue only for the sake of arguing.
> I argue when i think it will make Python better.

Understood and appreciated.  My comment was just a lame attempt at humor.
Apologies for omission of ";-)".

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Sat Aug  3 16:01:05 2002
From: (Rasjid Wilcox)
Date: Sun, 4 Aug 2002 01:01:05 +1000
Subject: [Python-Dev] Adding popen2 like functionality to
Message-ID: <>

Dear Python Developers,

I have submitted a patch to add a popen2 like function to the library.

It is just a first draft, and I'm happy to develop it further if there is 
interest.  If so, I will do some docs and have a look at the test library for 
it.  I would also be looking for some guidance on the best way to resolve 
some issues.

I'm new to Python and its development process, so I'm hoping I have not broken 
any rules by not waiting a response via the patch manager before posting to 

I would like to contribute to Python, as I think it is a truly delightful 
language.  I don't have a computer science background as such, more pure 
mathematics (set theory, group theory, logic etc). I don't know C (yet), so I 
would be looking to work on the pure Python libraries, or help create new 
ones.  I'm also willing to help with documentation.



From  Sat Aug  3 21:44:17 2002
From: (Scott Gilbert)
Date: Sat, 3 Aug 2002 13:44:17 -0700 (PDT)
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: <>
Message-ID: <>

--- Guido van Rossum <> wrote:
> > > >         void PyObject_ReleaseFixedBuffer(PyObject *obj);
> > > 
> > > Would it be useful to allow bf_releasefixedbuffer to return an int
> > > indicating an exception?  For instance, it could raise an exception
> > > if the extension errantly releases more times than it has acquired
> > 
> > The code making the call might not be in an easy position
> > to deal with an exception -- e.g. an asynchronous I/O
> > routine called from a signal handler, another thread,
> > etc.
> > 
> > Maybe use the warning mechanism to produce a message?
> In an asynch I/O situation, calling PyErr_Warn() is out of the
> question (it invokes Python code!).
> I propose to make it a fatal error -- after all the only reason why
> bf_releasefixedbuffer could fail should be that the caller makes a
> mistake.  Since that's a bug in C code, a fatal error is acceptable.

I don't know if you guys are hinting at the possibility of the
PyObject_ReleaseFixedBuffer function being called without holding
the GIL or not, but I think the GIL should be necessary during
this call.  (As such, the code making the call *could* deal with
the exception...  even if we don't want it to have to.)

So while a fatal error is still a reasonable response, the 
asynchronous I/O routine or signal handler or whatever really
should acquire the GIL before doing the release.  For one thing
this protects the lock_count variable from race conditions, and
another, it allows the implementation of bf_releasefixedbuffer
to use other Python APIs.
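
To illustrate the race-condition point, here is a toy model in Python rather
than C.  FixedBufferModel and its methods are hypothetical names, not the
PEP 298 API; a threading.Lock stands in for the GIL, and an exception stands
in for the fatal error so the sketch can be run.

```python
import threading

class FixedBufferModel:
    """Toy stand-in for an object exporting a fixed buffer (not the PEP 298 C API)."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._lock_count = 0
        self._guard = threading.Lock()   # plays the role of the GIL here

    def acquire(self):
        # The count is only ever changed while the guard is held.
        with self._guard:
            self._lock_count += 1
            return self._data

    def release(self):
        with self._guard:
            if self._lock_count == 0:
                # The thread suggests a fatal error; an exception keeps this testable.
                raise RuntimeError("buffer released more times than acquired")
            self._lock_count -= 1

    @property
    def lock_count(self):
        with self._guard:
            return self._lock_count

buf = FixedBufferModel(16)
buf.acquire()
buf.acquire()
buf.release()
buf.release()

try:
    buf.release()                        # one release too many
    over_release_detected = False
except RuntimeError:
    over_release_detected = True
```

Without the guard, two threads releasing at once could both read the same
count and leave it off by one, which is the failure the GIL requirement
rules out.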



From  Sun Aug  4 01:22:41 2002
From: (Guido van Rossum)
Date: Sat, 03 Aug 2002 20:22:41 -0400
Subject: [Python-Dev] PEP 298, final (?) version
In-Reply-To: Your message of "Sat, 03 Aug 2002 13:44:17 PDT."
References: <>
Message-ID: <>

> > > > >         void PyObject_ReleaseFixedBuffer(PyObject *obj);
> > > > 
> > > > Would it be useful to allow bf_releasefixedbuffer to return an int
> > > > indicating an exception?  For instance, it could raise an exception
> > > > if the extension errantly releases more times than it has acquired
> > > 
> > > The code making the call might not be in an easy position
> > > to deal with an exception -- e.g. an asynchronous I/O
> > > routine called from a signal handler, another thread,
> > > etc.
> > > 
> > > Maybe use the warning mechanism to produce a message?
> > 
> > In an asynch I/O situation, calling PyErr_Warn() is out of the
> > question (it invokes Python code!).
> > 
> > I propose to make it a fatal error -- after all the only reason why
> > bf_releasefixedbuffer could fail should be that the caller makes a
> > mistake.  Since that's a bug in C code, a fatal error is acceptable.
> I don't know if you guys are hinting at the possibility of the
> PyObject_ReleaseFixedBuffer function being called without holding
> the GIL or not, but I think the GIL should be necessary during
> this call.  (As such, the code making the call *could* deal with
> the exception...  even if we don't want it to have to.)

Good point.

> So while a fatal error is still a reasonable response, the 
> asynchronous I/O routine or signal handler or whatever really
> should acquire the GIL before doing the release.  For one thing
> this protects the lock_count variable from race conditions, and
> another, it allows the implementation of bf_releasefixedbuffer
> to use other Python APIs.


Is the PEP clear that you have to hold the GIL for these calls?
(Can't hurt to be explicit, given the fact that one intention is to
*use* the buffer while the GIL is released...)

--Guido van Rossum (home page:

From  Sun Aug  4 01:43:05 2002
From: (Tim Peters)
Date: Sat, 03 Aug 2002 20:43:05 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

[Michael Hudson]
> It's about 5%:
> $ ../../build/python
> Pystone(1.1) time for 100000 passes = 8.11
> This machine benchmarks at 12330.5 pystones/second
> $ ../../build/python
> Pystone(1.1) time for 100000 passes = 7.69
> This machine benchmarks at 13003.9 pystones/second

If I didn't know better, I'd think you ran the same python twice there.

> I can run the vanilla pystone whilst compiling or something if you
> like :)

That's OK, the speedup will be larger on Windows.  I can guarantee that,
since I'll be doing the Windows timings <wink>.

> ...
> It's the old boys club effect: I worked damn hard to get to the point
> of understanding this stuff, so everyone else should bloody well have
> to too!
> ...
> All I can say is that I'd been driven insane by co_lnotab and
> Python/compile.c when I wrote that comment <wink>.

I understand.  It was insanity that drove me to write the co_lnotab comments
that tempted you into believing it was possible to do something rational
with it, and I apologize for that <wink>.  I like the new comments!  Thank

From  Sun Aug  4 01:57:45 2002
From: (Tim Peters)
Date: Sat, 03 Aug 2002 20:57:45 -0400
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: <>
Message-ID: <>

[Michael Hudson]
> I've found another annoying problem.  I'm not really expecting someone
> here to solve it for me, but writing it down might help me think
> clearly.
> This is about the function epilogues that always get generated.  I.e:
> >>> def f():
> ...     if a:
> ...         print 1
> ...
> >>> import dis
> >>> dis.dis(f)
>   2           0 LOAD_GLOBAL              0 (a)
>               3 JUMP_IF_FALSE            9 (to 15)
>               6 POP_TOP
>   3           7 LOAD_CONST               1 (1)
>              10 PRINT_ITEM
>              11 PRINT_NEWLINE
>              12 JUMP_FORWARD             1 (to 16)
>         >>   15 POP_TOP
>         >>   16 LOAD_CONST               0 (None)
>              19 RETURN_VALUE
> You can see here that the epilogue gets associated with line 3,
> whereas it shouldn't really be associated with any line at all.

It has to be associated with some line >= 3, as c_lnotab isn't capable of
expressing anything other than that.  It *could* associate it with "line 4",
though, if the compiler were changed to pump out another c_lnotab entry at
the epilogue.  That would be better than saying the time is charged to line
3, since it isn't on line 3 then.  I'd be happy to trade away total insanity
for partial insanity <wink>.

> For why this is a problem:
> $ cat
> a = 0
> def f():
>     if a:
>         print 1
> >>> pdb.runcall(t.f)
> > /home/mwh/src/sf/python/dist/src/build/
> -> if a:
> (Pdb) s
> > /home/mwh/src/sf/python/dist/src/build/
> -> print 1
> (Pdb)
> --Return--
> > /home/mwh/src/sf/python/dist/src/build/>None
> -> print 1
> (Pdb)
> The debugger stopping on the "print 1" is confusing.

It stops on the "if a:" for me twice today, and I doubt that's any less
confusing.  If it were set to line 4 instead, an unaltered pdb would
presumably show a blank line (whatever) after the function body, and an
altered pdb could be taught that "the last line" c_lnotab claims exists is
really devoted to exit code not associated with any source-file line.
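
For readers who want to look at the offset-to-line mapping themselves, a
small sketch follows.  It uses dis.findlinestarts from present-day Python
rather than reading the table by hand, so the offsets and the line charged
with the exit code will differ from the 2002 listing quoted above.

```python
import dis

a = 0

def f():
    if a:
        print(1)

code = f.__code__
first = code.co_firstlineno

# (bytecode offset, source line) pairs; some Python versions can report
# None for the line, so those entries are skipped.
line_starts = [(offset, line) for offset, line in dis.findlinestarts(code)
               if line is not None]

for offset, line in line_starts:
    print("offset %3d -> line %d of f" % (offset, line - first + 1))
```

Every offset that carries a line maps to a line inside the function's own
source range, so the implicit "return None" has to be charged to one of
them.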

From  Sun Aug  4 02:06:47 2002
From: (Tim Peters)
Date: Sat, 03 Aug 2002 21:06:47 -0400
Subject: [Python-Dev] Sorting
In-Reply-To: <>
Message-ID: <>

> ...
> It seems pretty clear by now that the new mergesort is going to replace
> samplesort, but since nobody else has said this, I figured I'd add one
> more comment:
> I actually understood your description of the new mergesort algorithm.
> Unless you can come up with similar docs for samplesort, the Beer Truck
> scenario dictates that mergesort be the new gold standard.
> Or to quote Tim Peters, "Complex is better than complicated."

Good observation!  I wish I'd thought of that <wink>.  The mergesort is more
complex, but it doesn't have so many fiddly little complications obscuring
it.  There was an extensive description of the samplesort hybrid in
listobject.c's comments, but you know you're in Complication Heaven when
you've got to document half a dozen distinct "tuning macros" in hand-wavy
terms.  The only tuning parameter in the mergesort is MIN_GALLOP, and the
tradeoff it makes is explainable.

From  Sun Aug  4 02:27:06 2002
From: (Christian Tismer)
Date: Sun, 04 Aug 2002 03:27:06 +0200
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <>

Tim Peters wrote:
> [Finn Bock]
>>The old sorting code in jython was the 1.5 code from CPython with a
>>quicksort implementation also inspired by Tim Peters.
> The sad thing is, that was a very good quicksort -- I thought I was done
> when I wrote that <wink>.

I'd like to pet you for your new version, and your split personality
which manages to create so much creativeness out of being
you best own enemy.

- chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Sun Aug  4 02:44:21 2002
From: (Tim Peters)
Date: Sat, 03 Aug 2002 21:44:21 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Christian Tismer]
> I'd like to pet you for your new version,

LOL -- that comes off a bit, umm, endearing to American English ears.  But I
understand the sentiment, and thank you for it.

> and your split personality which manages to create so much creativeness
> out of being you best own enemy.

Yes, it takes one to know one indeed <wink>.

let's-everyone-get-together-for-a-big-Group-Pet!-ly y'rs  - tim

From  Sun Aug  4 04:32:30 2002
From: (Tim Peters)
Date: Sat, 03 Aug 2002 23:32:30 -0400
Subject: [Python-Dev] split('') revisited
In-Reply-To: <>
Message-ID: <>


> It's the last line in the loop body that makes empty matches
> a wart if allowed:  they wouldn't advance the position at all, and an
> infinite loop would result.  In order to make them do what you think you
> want, we'd have to add, at the end of the loop body
>    ah, and if the match was empty, advance the position again, by, oh,
>    i don't know, how about 1?  That's close to 0 <wink>.

[Andrew Koenig]
> Indeed, that's an arbitrary rule -- just about as arbitrary as the one
> that you abbreviated above, which should really be
> 	    find the next match, but if the match is empty, disregard it;
> 	    instead, find the next match with a length of at least,
> 	    oh, I don't know, how about 1?  That's close to 0 <wink>.

You really think so?  I expect almost all programmers would understand what
"find next non-empty match" means at first glance -- and especially
regexp-slingers, who are often burned in their matching lives by the
consequences of having large pieces of their patterns unexpectedly match an
empty string.  That makes "non-empty match" seem a natural concept to me.

> What I'm trying to do is come up with a useful example to convince
> myself that one is better than the other.

Have you found one yet?  I confess that re.findall() implements a "if the
match was empty, advance the position by 1" rule, as in

>>> re.findall("x?", "abc")
['', '', '', '']

But I don't think we're doing anyone a favor with stuff like that.  I think
it's a dubious idea that

>>> "abc".find('')
0

"works" too.  If a program does s1.find(s2) and s2 is an empty string, I
expect the chances are good it's a logic error in the program.  Analogies
to, e.g., i+j when j happens to be 0 leave me cold, since I can think of a
thousand reasons for why j might naturally be 0.  But I've had a hard time
thinking of a reasonable algorithm where the expression s1.find(s2) could be
expected to have s2=="" in normal operation (and am sure it would have been
a logic error elsewhere in any uses of string.find() I've made; ditto
searching for, or splitting on, empty strings via regexps).
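
The behaviors in question can be reproduced with a short sketch.  The
filtering at the end is just one possible way to get a "non-empty match"
policy on top of the existing functions, not a proposal for how split()
should be implemented.

```python
import re

# An optional-x pattern can match the empty string at every position.
empty_matches = re.findall("x?", "abc")

# Searching for the empty string "succeeds" at offset 0.
empty_find = "abc".find("")

# A "non-empty match" policy can be layered on top by filtering.
non_empty = [m for m in re.findall("x?", "axbxxc") if m]

print(empty_matches)
print(empty_find)
print(non_empty)
```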

From  Sun Aug  4 07:53:12 2002
From: (Aahz)
Date: Sun, 4 Aug 2002 02:53:12 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sat, Aug 03, 2002, Tim Peters wrote:
> let's-everyone-get-together-for-a-big-Group-Pet!-ly y'rs  - tim

Anything you say, Commodore.
Aahz (           <*>

Project Vote Smart:

From  Sun Aug  4 13:00:22 2002
From: (Skip Montanaro)
Date: Sun, 4 Aug 2002 07:00:22 -0500
Subject: [Python-Dev] Weekly Python Bug/Patch Summary
Message-ID: <>

Bug/Patch Summary

272 open / 2715 total bugs (+8)
131 open / 1632 total patches (-12)

New Bugs

Invalid mmap crashes Python interpreter (2002-07-24)
re.finditer (2002-07-24)
macfs.FSSpec fails for "new" files (2002-07-24)
Two corrects for weakref docs (2002-07-25)
-S hides standard dynamic modules (2002-07-25)
site-packages & build-dir python (2002-07-25)
email package does not work with mailbox (2002-07-26)
Empty genindex.html pages (2002-07-26)
pydoc -w fails with path specified (2002-07-26)
references to email package (2002-07-26)
OSA Python integration (2002-07-26)
IBCarbon module (2002-07-26)
ur'\u' not handled properly (2002-07-26)
python-mode and nested indents (2002-07-26)
imaplib: prefix-quoted strings (2002-07-30)
python should obey the FHS (2002-07-30)
add main to py_pycompile (2002-07-30), better error message (2002-07-30)
Memory leakage in SAX with exception (2002-07-31)
wrapper needs a class (2002-07-31)
shared libpython & dependant libraries (2002-07-31)
standard include paths on command line (2002-07-31)
"".split() ignores maxsplit arg (2002-08-01)
PyMapping_Keys unexported in dll (2002-08-02)
preconvert AppleSingle resource files (2002-08-02)

New Patches

PEP 282 Implementation (2002-07-07)
Adds Galeon support to (2002-07-24)
galeon support in webbrowser (2002-07-25)
Better token-related error messages (2002-07-25)
alternative SET_LINENO killer (2002-07-29)
Cygwin _hotshot patch (2002-07-30)
_locale library patch (2002-07-30)
LDFLAGS support for (2002-07-30)
Mindless editing, DL_EXPORT/IMPORT (2002-07-31)
rewrite (2002-08-01)
types.BoolType (2002-08-02)
os._execvpe security fix (2002-08-02)
py2texi.el update (2002-08-02)
db4 include not found (2002-08-02)
Add popen2 like functionality to (2002-08-03)
New codecs: html, asciihtml (2002-08-03)

Closed Bugs

Need user-centered info for Windows users. (2000-11-27)
profiling with xml parsing asserts (2002-03-25)
Compile error _sre.c on Cray T3E (2002-05-19)
DL_EXPORT on VC7 broken (2002-05-20)
crash on gethostbyaddr (2002-06-07)
asynchat module undocumented (2002-06-12)
socket module htonl/ntohl bug (2002-06-12)
.PYO files not imported unless -OO used (2002-06-18)
CGIHTTPServer flushes read-only file. (2002-06-18)
Negative __len__ provokes SystemError (2002-06-30)
Sig11 in cPickle (stack overflow) (2002-07-01)
Infinite recursion in Pickle (2002-07-02)
GC Changes not mentioned in What's New (2002-07-12)
pty.spawn - wrong error caught (2002-07-15)
os.getlogin() fails (2002-07-21)

Closed Patches

timestamp function for time module (2001-08-17)
Unambiguous import for encodings (2001-09-06)
no '_d' ending for mingw32 (2001-09-18)
HTML version of the Idle "documentation" (2001-10-12)
whichdb unittest (2002-04-09)
s/Copyright/License/ in (2002-04-13)
merging sorted sequences (2002-04-15)
Read/Write buffers from buffer() (2002-04-30)
Better description in "python -h" for -u (2002-05-06)
Cygwin make install patch (2002-05-08)
__va_copy patches (2002-05-10)
Ebcdic compliancy in stringobject source (2002-05-19)
README additions for Cray T3E (2002-05-28)
Fix bug in encodings.search_function (2002-06-20)
Executable .pyc-files with hashbang (2002-06-23)
Changing owner of symlinks (2002-06-25)
Make python-mode.el use jython (2002-06-27)
list.extend docstring fix (2002-06-27)
SSL release GIL (2002-06-30)
Extend PyErr_SetFromWindowsErr (2002-07-02)
Remove PyArg_Parse() and METH_OLDARGS (2002-07-03)
Merge xrange() into slice() (2002-07-05)
fix for problems with test_longexp (2002-07-06)
less restrictive HTML comments (2002-07-12)
Canvas "select_item" always returns None (2002-07-14)
info reader bug (2002-07-14)
fix to pty.spawn error on Linux (2002-07-15)
get python to link on OSF1 (Dec Unix) (2002-07-20)

From  Sun Aug  4 13:01:47 2002
From: (Christian Tismer)
Date: Sun, 04 Aug 2002 14:01:47 +0200
Subject: [Python-Dev] On C inheritance
Message-ID: <>

Hi Guido,

as you know, I love your new type/class implementation very
much as it is right now, probably not completely ready, but
performing great.
Yesterday, at the Berlin Python Community Meeting, we were
discussing several aspects of this.
A special issue was overloading of methods from C.
With the current design, it appears to be "correct" to call my
own methods always via the dictionary interface of the type,
since users might have derived from it and want their versions
to be called.

Now, this is a performance issue, and there are of course special
cased things like "getattr" already, which make use of an extra slot
in the type structure to speed it up.

Now, all my new stackless objects are made inheritable from, and
I'd like to support it from C code as well, but I hesitate to
spend the extra dictionary lookup for a probably seldom case.
Therefore, I intended to extend my types in a way, that they
provide some extra type slots for overridden builtin methods.

Unfortunately, this is not supported at the moment, due to some
extension class compatibility issues. I'd like to patch this,
and allow metatypes to be extended with extra function fields.

Would you support this? Or is something already on your boiler plate?

thanks - chris
Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?

From  Sun Aug  4 15:24:32 2002
From: (Christian Tismer)
Date: Sun, 04 Aug 2002 16:24:32 +0200
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <>

Tim Peters wrote:
> [Christian Tismer]
>>I'd like to pet you for your new version,
> LOL -- that comes off a bit, umm, endearing to American English ears.  But I
> understand the sentiment, and thank you for it.

I meant "to tap s.o. on the shoulder". Does this have the
meaning of encouraging and honoring?

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Sun Aug  4 11:52:18 2002
From: (Jeremy Hylton)
Date: Sun, 4 Aug 2002 06:52:18 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "CT" == Christian Tismer <> writes:

  CT> Tim Peters wrote:
  >> [Christian Tismer]
  >>> I'd like to pet you for your new version,
  >> LOL -- that comes off a bit, umm, endearing to American English
  >> ears.  But I understand the sentiment, and thank you for it.

  CT> I meant "to tap s.o. on the shoulder". Does this have the
  CT> meaning of encouraging and honoring?

The verb pet is most often used to mean stroking or caressing an
animal -- a pet dog or cat.

There is also a slang usage of pet that is straight out of the
Hungarian phrasebook.  Not "My hovercraft is full of eels," but "Drop
your panties, Sir William, I cannot wait 'til lunchtime."

Your meaning was clear, but it was impossible to suppress a wry grin.


From  Sun Aug  4 16:41:14 2002
From: (Gordon McMillan)
Date: Sun, 4 Aug 2002 11:41:14 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <3D4D12DA.24979.39B66004@localhost>

On 4 Aug 2002 at 16:24, Christian Tismer wrote:

> >>I'd like to pet you for your new version,
> > 
> > 
> > LOL -- that comes off a bit, umm, endearing to
> > American English ears.  But I understand the
> > sentiment, and thank you for it.
> I meant "to tap s.o. on the shoulder". Does this
> have the meaning of encouraging and honoring? 

For that you'd use "pat", as in "pat on the back".

"Pet" means (idiomatically) "stroke affectionately",
which is what you do to household animals & sexual

And, incidentally, "tap" is "light blow" as with
hammer or finger, where "blow" used in any other
context will likely be taken to mean "oral sex"
unless you're obviously discussing movement
of a gaseous media or the act of setting off a

And that just covers those words as verbs (and
worse, I've probably missed a few meanings).

Don't you wish German were so, er, expressive <wink>?

-- Gordon

From  Sun Aug  4 16:43:15 2002
From: (David Goodger)
Date: Sun, 04 Aug 2002 11:43:15 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib
In-Reply-To: <>
Message-ID: <>

>>> I was originally going to make it an unnamed string
>>> literal -- maybe that's better?

>> In PEP 258 I call those "Additional Docstrings":

> Ah, I thought there had to be something like that. :-)
> Do you also recognize this if there are comments between?  Or blank
> lines?  E.g.
>    def f():
>        """
>        foo
>        """
>        # blah
>        """
>        bar
>        """

We haven't gotten that far yet.  I see no problems with blank lines, but
comments may block recognition, unless we choose to ignore them.  On the
other hand, comments themselves may be used in some circumstances; HappyDoc
recognizes comments *before* a def/class statement if there's no docstring.
There's still much to be thought out.
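
As a rough sketch of one way the question could go: an extractor that works
from the parse tree never sees comments or blank lines at all.  This is only
an illustration using the standard ast module; it is not how Docutils does
or will do it, and leading_strings is a made-up helper.

```python
import ast

SOURCE = '''
def f():
    """
    foo
    """

    # blah

    """
    bar
    """
'''

def leading_strings(func_node):
    """Return the string literals that open a function body, in order."""
    found = []
    for stmt in func_node.body:
        if (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)):
            found.append(stmt.value.value.strip())
        else:
            break
    return found

tree = ast.parse(SOURCE)
strings = leading_strings(tree.body[0])

# Compile the same tree to see what the bytecode compiler keeps as __doc__.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
doc = namespace["f"].__doc__

print(strings)
print(repr(doc))
```

Only the first literal survives as __doc__; the second is visible to the
extractor but not to the running program.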

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Sun Aug  4 19:09:02 2002
From: (Tim Peters)
Date: Sun, 04 Aug 2002 14:09:02 -0400
Subject: [Python-Dev] New encoding error in debug build
Message-ID: <>

This assert near the end of get_coding_spec() in tokenizer.c triggers when
running test_heapq in a debug build:

					assert(strlen(r) >= strlen(q));

It's a very new failure.  Note that begins with the line

# -*- coding: Latin-1 -*-

I assume that's relevant, but Latin-1 is way beyond my personal experience
with strange character sets <wink>.

From  Sun Aug  4 19:30:46 2002
From: (Oren Tirosh)
Date: Sun, 4 Aug 2002 21:30:46 +0300
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml
In-Reply-To: <>; from on Sun, Aug 04, 2002 at 08:54:05AM -0700
References: <>
Message-ID: <>

(I'm moving this to python-dev)

On Sun, Aug 04, 2002 at 08:54:05AM -0700, wrote:
> >Comment By: Martin v. Löwis (loewis)
> Date: 2002-08-04 17:54
> I'm in favour of exposing this via a search function, for
> generated codec names, on top of PEP 293 (I would not like
> your codec to compete with the alternative mechanism). My
> dislike for the current patch also comes from the fact that
> it singles-out ASCII, which the search function would not.

I find PEP 293 too complex while my solution is, admittedly, too 

Some of my reservations about PEP 293:

It overloads the meaning of the error handling argument in an unintuitive
way.  It gets to the point where it's much more than just error handling - 
it's actually extending the functionality of the codec. 

Why implement yet another name-based registry?  There must be a simpler way 
to do it.

Generating an exception for each character that isn't handled by simple 
lookup probably adds quite a lot of overhead.

What are the use cases?  Maybe a simple extension to charmap would be enough 
for all the practical cases?

> In any case, I'd encourage you to contribute to the progress
> of PEP 293 first - this has been an issue for several years
> now, and I would be sorry if it would fail.

Me too.  But if you really don't want it to be rejected you should try to
find a way to make it simpler.

> While you are waiting for PEP 293 to complete, please do
> consider cleaning up htmlentitydefs to provide mappings from
> and to Unicode characters.

No problem.  The question is whether anyone depends on its current form.  
My proposed changes:

1. Use all lowercase entity names as keys.
2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")

In its current form I find pretty useless. Names in the
input in arbitrary case will not match the MixedCase keys in the entitydefs 
dictionary and the decimal character reference isn't really more useful than 
the named entity reference. 


From  Sun Aug  4 20:30:06 2002
From: (Martin v. Loewis)
Date: 04 Aug 2002 21:30:06 +0200
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> It overloads the meaning of the error handling argument in an
> unintuitive way.  It gets to the point where it's much more than
> just error handling - it's actually extending the functionality of
> the codec.

Isn't that precisely the meaning of "to handle"?

3 : to act on or perform a required function with regard to 
   <handle the day's mail>

It produces a replacement text, just in the same way as "ignore" or
"replace" produce replacement texts.

> Why implement yet another name-based registry?  

Namespaces are one honking great idea -- let's do more of those!

> There must be a simpler way to do it.

Propose one.

> What are the use cases?  Maybe a simple extension to charmap would
> be enough for all the practical cases?

The primary use case is XML: how do you efficiently use xml charrefs.
Notice that you can *not* use the charmap codec, since the underlying
encoding may not be based on the charmap codec.

In addition, it allows a more detailed analysis of an encoding
error, as it exposes the string position where the error occurs. This
makes it possible to determine a "best" encoding (i.e. one that needs
the fewest exceptions, or the one that has the longest sequences of
same encodings).
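
For readers following along today: the mechanism discussed here is available
in current Python as codecs.register_error() plus the built-in
"xmlcharrefreplace" handler. The sketch below assumes Python 3; the handler
name "sketch.log_charref", the function log_and_charref and the positions
list are illustrative names, not part of the PEP. It shows both points made
above: the handler produces replacement text, and it sees the position of
each error.

```python
import codecs

positions = []  # (start, end) of every unencodable run the handler sees


def log_and_charref(exc):
    """Error handler: record where the error is, then emit XML charrefs."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    positions.append((exc.start, exc.end))
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    return replacement, exc.end  # resume encoding after the bad run


codecs.register_error("sketch.log_charref", log_and_charref)

encoded = "na\u00efve caf\u00e9".encode("ascii", "sketch.log_charref")
print(encoded, positions)

# The built-in handler gives the same bytes without the logging.
builtin = "na\u00efve caf\u00e9".encode("ascii", "xmlcharrefreplace")
```

The same handler name works with any encoder that follows the error
callback protocol, which is the point about not being tied to the charmap
codec.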

> Me too.  But if you really don't want it to be rejected you should
> try to find a way to make it simpler.

Can you please elaborate why you think this is difficult? Is this a
concern about 
- the implementation of the PEP, or
- the implementation of error handlers, or
- the usage of error handlers?

I couldn't really believe that you find usage of this feature
difficult: just pass an error handling string to your codec just as
you currently do.

> > While you are waiting for PEP 293 to complete, please do
> > consider cleaning up htmlentitydefs to provide mappings from
> > and to Unicode characters.
> No problem.  The question is whether anyone depends on its current form.  
> My proposed changes:
> 1. Use all lowercase entity names as keys.

That is probably a bad idea. At least for XHTML, the case of entity
references is normative. Even for HTML 4, it would be good if this
precisely matches the DTD.

You could provide a case-insensitive lookup function in addition.
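
A possible shape for such a lookup function, sketched against today's
Python 3 html.entities module (the successor of htmlentitydefs) and its
name2codepoint table. The name lookup_entity and the policy of rejecting
ambiguous names are assumptions of this sketch, not something proposed in
the thread:

```python
from html.entities import name2codepoint


def lookup_entity(name):
    """Return the character for an entity name.

    Exact (case-sensitive) matches win, as the DTD requires.  Otherwise a
    case-insensitive match is accepted only if it is unambiguous.
    """
    if name in name2codepoint:
        return chr(name2codepoint[name])
    folded = name.lower()
    matches = {
        codepoint
        for key, codepoint in name2codepoint.items()
        if key.lower() == folded
    }
    if len(matches) == 1:
        return chr(matches.pop())
    raise KeyError(name)  # unknown, or ambiguous such as "OUML"


print(lookup_entity("Ouml"), lookup_entity("ouml"), lookup_entity("AMP"))
```

The table itself stays case-sensitive, so "Ouml" and "ouml" keep their
distinct meanings; only names with no exact match fall back to the folded
comparison.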

> 2. Map "entityname" to u"\uXXXX" (currently it's mapped to "&#nnnn;")

I think htmlentitydefs.entitydefs must stay as-is, for
compatibility. Instead, I'd suggest to add additional
objects/functions. Of course, the data should be present only once -
all other functions/dictionaries could be derived.

> In its current form I find pretty useless. Names in the
> input in arbitrary case will not match the MixedCase keys in the entitydefs 
> dictionary and the decimal character reference isn't really more useful than 
> the named entity reference. 

Indeed. However, people probably rely on its specific contents, so any
more useful access to the data must preserve entitydefs in its current


From  Sun Aug  4 21:12:17 2002
From: (Martin v. Loewis)
Date: 04 Aug 2002 22:12:17 +0200
Subject: [Python-Dev] New encoding error in debug build
In-Reply-To: <>
References: <>
Message-ID: <>

Tim Peters <> writes:

> # -*- coding: Latin-1 -*-
> I assume that's relevant, but Latin-1 is way beyond my personal experience
> with strange character sets <wink>.

It was normalizing that to "iso-8859-1", and was then surprised that
it got longer.
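
To illustrate why the assert fires, here is a rough sketch of that kind of
normalization. It is a hypothetical stand-in written in Python, not the C
code in tokenizer.c, and the alias list is an assumption:

```python
def normalize_coding_name(name):
    """Hypothetical stand-in for the tokenizer's normalization step."""
    folded = name.lower().replace("_", "-")
    if folded in ("latin-1", "iso-latin-1", "iso-8859-1"):
        return "iso-8859-1"
    if folded == "utf-8":
        return "utf-8"
    return name


declared = "Latin-1"
normalized = normalize_coding_name(declared)
# The canonical name is longer than the alias that was written in the
# file, so an assumption of the form "the result is never longer than the
# input" does not hold for this case.
print(declared, len(declared), normalized, len(normalized))
```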


From  Sun Aug  4 22:07:08 2002
From: (Patrick K. O'Brien)
Date: Sun, 4 Aug 2002 16:07:08 -0500
Subject: [Python-Dev] Single- vs. Multi-pass iterability
In-Reply-To: <>
Message-ID: <>

[Guido van Rossum]
> - There really isn't anything "broken" about the current situation;
>   it's just that "next" is the only method name mapped to a slot in
>   the type object that doesn't have leading and trailing double
>   underscores.

I'm way behind on the email for this list, but I wanted to chime in with an
idea related to this old thread. I know we want to limit the rate of
language/feature changes for the business community. At the same time, this
situation with iterators is proof that even the best thought out new
features can still have a few blemishes that get discovered after they've
been incorporated into Python proper. It's just terribly difficult to get
anything "right" the very first time, and it would be nice to fix these
blemishes sooner, rather than later.

So perhaps we need some sort of concept of a "grace period" on brand-new
features during which blemishes can be polished off, even if the polishing
breaks backward compatibility. After the grace period, breaking backward
compatibility becomes a higher priority. Since we are talking about backward
compatibility only as it relates to the brand-new features themselves,
Python-In-A-Tie folks can avoid the issue altogether by not using the new
features during the grace period.

Would something like this be an acceptable compromise?

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Mon Aug  5 02:34:26 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 04 Aug 2002 21:34:26 -0400
Subject: [Python-Dev] Re: Single- vs. Multi-pass iterability
In-Reply-To: <>
References: <>
Message-ID: <>

[Patrick K. O'Brien]

> So perhaps we need some sort of concept of a "grace period" on brand-new
> features during which blemishes can be polished off, even if the polishing
> breaks backward compatibility.  [...]  Would something like this be an
> acceptable compromise?

I know it was not the original intent of importing from __future__,
but maybe this could be linked with __future__ as well.  People wanting
guaranteed stability should just never import from __future__ at all,
It's just an idea, I'm not pushing for it.  I do not even like it...

For one, I'm quite ready to adjust the things I'm responsible for, whenever
the need arises, not being part of the bullet tied to Python development.
On the other hand, I know administrators not far from me that get very
upset when they learn about specification changes for any software they
rely on for production, and I do understand their need for a peaceful life.
Surely, it's not easy to please everybody.  In French, there is this nice
proverb, which is in fact the last verses of one of LaFontaine's fables:

   "On ne peut, dit le meunier,
    plaire à tout le monde et à son père:
    bien faire et laisser braire!".

In word, that means "do well and let mumble!". :-)

François Pinard

From  Mon Aug  5 04:39:34 2002
From: (Tim Peters)
Date: Sun, 04 Aug 2002 23:39:34 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Jeremy Hylton]
> The verb pet is most often used to mean stroking or caressing an
> animal -- a pet dog or cat.

So *that's* what it means!  Boy, is my face red.  Christian can think of me
like that all he likes.  I was afraid he meant it in the other sense, and I
never drop my panties before lunchtime, stackless be damned <wink>.

From  Mon Aug  5 04:59:16 2002
From: (Greg Ewing)
Date: Mon, 05 Aug 2002 15:59:16 +1200 (NZST)
Subject: [Python-Dev] timsort for jython
In-Reply-To: <3D4D12DA.24979.39B66004@localhost>
Message-ID: <>

Gordon McMillan <>:

> where "blow" used in any other
> context will likely be taken to mean "oral sex"

Which is a very odd usage, when you think about it --
I mean, more of a sucking action is involved than

And I am *not* going to be the first person to
mention the song "Sit On My Face" in this thread.
Oops... dash it...

From  Mon Aug  5 06:51:51 2002
From: (Oren Tirosh)
Date: Mon, 5 Aug 2002 01:51:51 -0400
Subject: [Python-Dev] Single- vs. Multi-pass iterability
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sun, Aug 04, 2002 at 04:07:08PM -0500, Patrick K. O'Brien wrote:
> [Guido van Rossum]
> >
> > - There really isn't anything "broken" about the current situation;
> >   it's just that "next" is the only method name mapped to a slot in
> >   the type object that doesn't have leading and trailing double
> >   underscores.
> I'm way behind on the email for this list, but I wanted to chime in with an
> idea related to this old thread. I know we want to limit the rate of
> language/feature changes for the business community. At the same time, this
> situation with iterators is proof that even the best thought out new
> features can still have a few blemishes that get discovered after they've
> been incorporated into Python proper. 

I think I have a reasonable solution for the re-iteration blemish in the
iteration protocol without breaking backward compatibility:

> So perhaps we need some sort of concept of a "grace period" on brand-new
> features during which blemishes can be polished off, even if the polishing
> breaks backward compatibility. After the grace period, breaking backward
> compatibility becomes a higher priority.

Giving more people a chance to play with new features before they are 
finalized is a very good idea. When a significant new feature is checked in 
to the CVS a preview version can be released in source and precompiled form
to encourage more people to test it. Most CVS snapshots seem stable enough 
for a programmer's daily use.

A good example of such a significant new feature is the source encoding 
just checked in.


From  Mon Aug  5 09:03:16 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 10:03:16 +0200
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html,
References: <> <>
Message-ID: <>

Oren Tirosh wrote:
> (I'm moving this to python-dev)

I've already answered on the SF tracker. Won't repeat things

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 09:12:30 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 10:12:30 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>
Message-ID: <>

I'd like to put the following PEP up for pronouncement. Walter
is currently on vacation, but he asked me to already go ahead
with the process.

I like the patch a lot and the implementation strategy is very
interesting as well (just wish that classes were new types --
then things could run a tad faster and the patch would be
simpler).

The basic idea of the patch is to provide a way to elegantly
handle error situations in codecs which go beyond the standard
cases 'ignore', 'replace' and 'strict', e.g. to automagically
escape problem case, to log errors for later review or to
fetch additional information for the proper handling at coding
time (for example, fetching entity definitions from a URL).

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 09:54:22 2002
From: (Michael Hudson)
Date: 05 Aug 2002 09:54:22 +0100
Subject: [Python-Dev] seeing off SET_LINENO
In-Reply-To: Tim Peters's message of "Sat, 03 Aug 2002 20:57:45 -0400"
References: <>
Message-ID: <>

Tim Peters <> writes:

> [Michael Hudson]
> > I've found another annoying problem.  I'm not really expecting someone
> > here to solve it for me, but writing it down might help me think
> > clearly.
> >
> > This is about the function epilogues that always get generated.  I.e:
[snip example]
> > You can see here that the epilogue gets associated with line 3,
> > whereas it shouldn't really be associated with any line at all.
> It has to be associated with some line >= 3, as c_lnotab isn't capable of
> expressing anything other than that.


> It *could* associate it with "line 4", though, if the compiler were
> changed to pump out another c_lnotab entry at the epilogue.  That
> would be better than saying the time is charged to line 3, since it
> isn't on line 3 then.  I'd be happy to trade away total insanity for
> partial insanity <wink>.

This would be bad if you had

def f():
    print 1
def g():
    print 2

Anyway, I think I've found a way to get around this (see the patch).

> It stops on the "if a:" for me twice today, and I doubt that's any less
> confusing.  If it were set to line 4 instead, an unaltered pdb would
> presumably show a blank line (whatever) after the function body, and an
> altered pdb could be taught that "the last line" c_lnotab claims exists is
> really devoted to exit code not associated with any source-file line.

Yes.  I didn't really like the idea of heavily hacking pdb, as I don't
understand it.


39. Re graphics:  A picture is worth 10K words - but only those
    to describe the picture.  Hardly any sets of 10K words can be
    adequately described with pictures.
  -- Alan Perlis,

From  Mon Aug  5 11:43:37 2002
From: (Christian Tismer)
Date: Mon, 05 Aug 2002 12:43:37 +0200
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <>
Message-ID: <>

Hi Guido:

here a simpler formulation of my question:

I would like to create types with overridable methods.
This is supported by the new type system.

But I'd also like to make this as fast as possible and
therefore to avoid extra dictionary lookups for methods,
especially if they are most likely not overridden.

This would mean to create an extra meta type which creates
types with a couple of extra slots, for caching overridden
methods.

My problem is now that type objects are already variable
sized and cannot support slots in the metatype.
Is there a workaround on the boilerplate, or is there
interest in a solution?
Any suggestion how to implement it?

thanks - chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?

From  Mon Aug  5 11:45:08 2002
From: (Christian Tismer)
Date: Mon, 05 Aug 2002 12:45:08 +0200
Subject: [Python-Dev] timsort for jython
References: <3D4D12DA.24979.39B66004@localhost>
Message-ID: <>

Gordon McMillan wrote:

 > For that you'd use "pat", as in "pat on the back".

Should have looked into, before sending :)

 > "Pet" means (idiomatically) "stroke affectionately",
 > which is what you do to household animals & sexual
 > partners.
 > And, incidentally, "tap" is "light blow" as with
 > hammer or finger, where "blow" used in any other
 > context will likely be taken to mean "oral sex"
 > unless you're obviously discussing movement
 > of a gaseous media or the act of setting off a
 > bomb.
 > And that just covers those words as verbs (and
 > worse, I've probably missed a few meanings).

Any chance to learn all about that?

 > Don't you wish German were so, er, expressive <wink>?

In fact, it is. But I guess you have to be born here,
to know about all+1 of the nuances.

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?

From  Mon Aug  5 12:25:49 2002
From: (Steve Holden)
Date: Mon, 5 Aug 2002 07:25:49 -0400
Subject: [Python-Dev] Single- vs. Multi-pass iterability
References: <> <> <>
Message-ID: <01c801c23c72$da6e3870$>

[Oren Tirosh]
> On Sun, Aug 04, 2002 at 04:07:08PM -0500, Patrick K. O'Brien wrote:
> > [Guido van Rossum]
> > >
> > > - There really isn't anything "broken" about the current situation;
> > >   it's just that "next" is the only method name mapped to a slot in
> > >   the type object that doesn't have leading and trailing double
> > >   underscores.
> >
But would you define it as __next__() if you had to do it again? A
__next__()/next() relationship does seem to fit more neatly.
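
For reference, this is the spelling Python 3 later adopted (PEP 3114): the
slot-mapped method is __next__() and the built-in next() calls it. A minimal
sketch, with Countdown as a made-up example class:

```python
class Countdown:
    """Single-pass iterator counting down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # an iterator is its own iterable

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


counter = Countdown(3)
first_pass = list(counter)   # consumes the iterator
second_pass = list(counter)  # nothing left: single-pass behaviour
print(first_pass, second_pass, next(Countdown(2)))
```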

> > I'm way behind on the email for this list, but I wanted to chime in with
> > an idea related to this old thread. I know we want to limit the rate of
> > language/feature changes for the business community. At the same time,
> > this situation with iterators is proof that even the best thought out new
> > features can still have a few blemishes that get discovered after they've
> > been incorporated into Python proper.
> I think I have a reasonable solution for the re-iteration blemish in the
> iteration protocol without breaking backward compatibility:

> > So perhaps we need some sort of concept of a "grace period" on brand-new
> > features during which blemishes can be polished off, even if the polishing
> > breaks backward compatibility. After the grace period, breaking backward
> > compatibility becomes a higher priority.
> Giving more people a chance to play with new features before they are
> finalized is a very good idea. When a significant new feature is checked in
> to the CVS a preview version can be released in source and precompiled form
> to encourage more people to test it. Most CVS snapshots seem stable enough
> for a programmer's daily use.
Given the general lack of alpha- and beta-testing there'd be very little
feedback. I seem to remember that the CVS snapshots went missing in action
recently without anyone noticing, which shows that they aren't much used,
and I guess the same would be true of preview versions. Tracking the CVS
repository will test such features, but getting more testing than that would
be difficult.

I *do* agree that such feature testing would be inestimably useful.

> A good example of such a significant new feature is the source encoding
> just checked in.

Steve Holden                       
Python Web Programming      

From  Mon Aug  5 12:46:36 2002
From: (Oren Tirosh)
Date: Mon, 5 Aug 2002 14:46:36 +0300
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
Message-ID: <>

On Mon, Aug 05, 2002 at 10:12:30AM +0200, M.-A. Lemburg wrote:
> I'd like to put the following PEP up for pronouncement. Walter
> is currently on vacation, but he asked me to already go ahead
> with the process.
> I like the patch a lot and the implementation strategy is very
> interesting as well (just wish that classes were new types --
> then things could run a tad faster and the patch would be
> simpler).

Here's another implementation strategy:

Charmap entries can currently be None, an integer or a unicode string. I
suggest adding another option: a function or other callable. The function
will be called with the input string and current position as arguments and
return a 2-tuple of the replacement string and number of characters
consumed.  This will make it very easy to take the decoding charmap of an 
existing codec and patch it with a special-case for one character like '&'
to generate character references, for example. 

The function may raise an exception.  The error strategy argument will 
not be overloaded with new functionality - it will just determine whether 
this exception will be ignored or passed on.

An existing encoding charmap (usually a dictionary) can also be patched for 
special characters like <,>,&.  A special entry with a None key will be
the default entry used on a KeyError and will usually be mapped to a 
function.  If no None key is present the charmap will behave exactly the way 
it does now.  

Tying it all together:

A codec that does both charmap and entity reference translations may be 
dynamically generated.  A function will be registered that intercepts 
any codec name that looks like 'xmlcharref.CODECNAME', import that codec, 
create patched charmaps and return the (enc, dec, reader, writer) tuple.
The dynamically created entry will be cached for later use. 
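
The charmap part of this strategy can be sketched in pure Python. This is
only an illustration of the idea, assuming Python 3 and a str-to-str map for
simplicity; encode_with_charmap, charref and ascii_map are hypothetical
names, and the error-strategy argument and the codec registration step are
left out:

```python
def encode_with_charmap(text, charmap):
    """Translate text through charmap.

    Entries may be replacement strings or callables taking (text, pos) and
    returning (replacement, characters_consumed).  An entry under the None
    key, if present, is the default for unmapped characters.
    """
    default = charmap.get(None)
    out = []
    pos = 0
    while pos < len(text):
        entry = charmap.get(text[pos], default)
        if entry is None:
            raise UnicodeError("no mapping for %r at position %d" % (text[pos], pos))
        if callable(entry):
            replacement, consumed = entry(text, pos)
            if consumed < 1:
                raise ValueError("callable entry must consume at least one character")
        else:
            replacement, consumed = entry, 1
        out.append(replacement)
        pos += consumed
    return "".join(out)


def charref(text, pos):
    """Default entry: turn one unmapped character into a numeric charref."""
    return "&#%d;" % ord(text[pos]), 1


# Start from a plain ASCII identity map and patch the special cases.
ascii_map = {chr(i): chr(i) for i in range(128)}
ascii_map["&"] = lambda text, pos: ("&amp;", 1)
ascii_map["<"] = "&lt;"
ascii_map[">"] = "&gt;"
ascii_map[None] = charref

result = encode_with_charmap("a<b & caf\u00e9", ascii_map)
print(result)
```

Without the None entry the function raises on the first unmapped character,
which matches the "behaves exactly the way it does now" case.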


From  Mon Aug  5 14:13:08 2002
From: (Fredrik Lundh)
Date: Mon, 5 Aug 2002 15:13:08 +0200
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
References: <Pine.LNX.4.44.0208010343380.7588-100000@ziggy>
Message-ID: <024401c23c84$1f8b07b0$05d141d5@hagrid>

Ka-Ping Yee wrote:

> I would be very unhappy about having to enter and edit inline
> documentation in an XML-based markup language.

have you tried it?

I suggest taking a look at 2.3's module.

do the comments that start with a single ## line look scary
to you?

it's javadoc-style markup, which is based on HTML.  if you've
ever written a webpage, you can learn the rest in a couple of
minutes.


From  Mon Aug  5 14:29:19 2002
From: (Fredrik Lundh)
Date: Mon, 5 Aug 2002 15:29:19 +0200
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <024701c23c84$2037c270$05d141d5@hagrid>

tim wrote:

> > You also don't need to hold back on giving stability guarantees in the
> > documentation for jython's sake.
> I didn't <wink>.  Stability doesn't come free, and for all I know, in
> another 3 years a method will be discovered that's 3x faster but not
> stable.

sounds like yet another reason to add two methods; one that
guarantees stability, and one that doesn't.

the only counter-argument I've seen from you is code bloat, but
I cannot see what stops us from mapping *both* methods to a
single implementation in CPython 2.3.

an alternative would be to add a sortlib module:

    $ more Lib/

    def stablesort(list):
        list.sort() # 2.3's timsort is stable!

and a regression test script that makes sure that it really is stable
(can a test program ever be sure?)
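
Such a test script might look like the sketch below. It assumes Python 3
(where list.sort takes a key argument); looks_stable and the stablesort
wrapper with a key parameter are illustrative names. As the question above
suggests, a passing run can only fail to find a counterexample, it cannot
prove stability:

```python
import random


def looks_stable(sort_in_place, trials=200, size=60, seed=2002):
    """Return False if sort_in_place is caught reordering equal keys."""
    rng = random.Random(seed)
    for _ in range(trials):
        # (key, original_index) pairs with many duplicate keys
        items = [(rng.randrange(4), index) for index in range(size)]
        data = list(items)
        sort_in_place(data, key=lambda pair: pair[0])
        # A stable sort by key leaves equal keys in original index order,
        # which is the same as sorting by the full (key, index) tuple.
        if data != sorted(items):
            return False
    return True


def stablesort(seq, key=None):
    seq.sort(key=key)  # relies on list.sort being stable


print(looks_stable(stablesort))
```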


From David Abrahams" <  Mon Aug  5 14:32:51 2002
From: David Abrahams" < (David Abrahams)
Date: Mon, 5 Aug 2002 09:32:51 -0400
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <> <>
Message-ID: <01d601c23c84$d2783c80$>

From: "Christian Tismer" <>

> Hi Guido:
> here a simpler formulation of my question:
> I would like to create types with overridable methods.
> This is supported by the new type system.
> But I'd also like to make this as fast as possible and
> therefore to avoid extra dictionary lookups for methods,
> especially if they are most likely not overridden.
> This would mean to create an extra meta type which creates
> types with a couple of extra slots, for caching overridden
> methods.
> My problem is now that type objects are already variable
> sized and cannot support slots in the metatype.
> Is there a workaround on the boilerplate, or is there
> interest in a solution?
> Any suggestion how to implement it?

I believe this is roughly the same thing I was bugging Guido about just
before Python-dev. I wanted types which acted like new-style classes, but
with room for an 'C' int to store some extra information -- namely, whether
there were multiple 'C' extension classes being used as bases. IIRC the
verdict was, "you can't do that today, but there should be a way to do it".
Also if I remember anything about my hasty analysis at the time, the
biggest challenge would be getting code which accesses types to rely on
their tp_basicsize in order to find the beginning of the variable stuff.

FWIW, I'm still interested in seeing this addressed.


           David Abrahams * Boost Consulting *

From  Mon Aug  5 14:48:24 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 09:48:24 -0400
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: Your message of "Mon, 05 Aug 2002 10:12:30 +0200."
References: <>
Message-ID: <>

> I'd like to put the following PEP up for pronouncement. Walter
> is currently on vacation, but he asked me to already go ahead
> with the process.
> I like the patch a lot and the implementation strategy is very
> interesting as well (just wish that classes were new types --
> then things could run a tad faster and the patch would be
> simpler).
> The basic idea of the patch is to provide a way to elegantly
> handle error situations in codecs which go beyond the standard
> cases 'ignore', 'replace' and 'strict', e.g. to automagically
> escape problem case, to log errors for later review or to
> fetch additional information for the proper handling at coding
> time (for example, fetching entity definitions from a URL).

I know you want me to pronounce on this, but I'd like to abstain.

I have no experience in using codecs to have any kind of sense about
whether this is good or not.  If you feel confident that it's good,
you can make the decision on your own.  If you're not yet confident, I
suggest getting more review.  I do note that the patch is humungous
(isn't everything related to Unicode? :-) so might need more review
before it goes in.

--Guido van Rossum (home page:

From  Mon Aug  5 14:57:10 2002
From: (Fredrik Lundh)
Date: Mon, 5 Aug 2002 15:57:10 +0200
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml
References: <> <>
Message-ID: <029301c23c88$02a287a0$05d141d5@hagrid>

Oren Tirosh wrote:

> In its current form I find pretty useless.

I use it a lot, and find it reasonably useful.  sure beats typing in
the HTML character tables myself, or writing a DTD parser.

> Names in the input in arbitrary case will not match the MixedCase
> keys in the entitydefs dictionary

people who use oddball characters may prefer to keep uppercase
letters separate from lowercase letters.  if I type "Linköping" using
a named entity, I don't want it to come out as "LinkÖping".

if you don't care, nothing stops you from using  the "lower" string

> and the decimal character reference isn't really more useful than
> the named entity reference.

really?  converting a decimal character reference to a unicode character
is trivial, but how do you convert a named entity reference to a unicode
character?  (look it up in the htmlentitydefs?)

here's a trivial piece of code that converts the entitydefs dictionary to
a entity->unicode mapping:

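    # Python 2, as in the original post
    from htmlentitydefs import entitydefs

    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            # decimal character reference, e.g. "&#8364;"
            char = unichr(int(char[2:-1]))
        else:
            # a single Latin-1 byte
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char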


From  Mon Aug  5 15:08:04 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 10:08:04 -0400
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: Your message of "Mon, 05 Aug 2002 09:32:51 EDT."
References: <> <>
Message-ID: <>

[Christian Tismer]
> > I would like to create types with overridable methods.
> > This is supported by the new type system.
> >
> > But I'd also like to make this as fast as possible and
> > therefore to avoid extra dictionary lookups for methods,
> > especially if they are most likely not overridden.
> >
> > This would mean to create an extra meta type which creates
> > types with a couple of extra slots, for caching overridden
> > methods.
> >
> > My problem is now that type objects are already variable
> > sized and cannot support slots in the metatype.
> > Is there a workaround on the boilerplate, or is there
> > interest in a solution?
> > Any suggestion how to implement it?

[David Abrahams]
> I believe this is roughly the same thing I was bugging Guido about
> just before Python-dev. I wanted types which acted like new-style
> classes, but with room for an 'C' int to store some extra
> information -- namely, whether there were multiple 'C' extension
> classes being used as bases. IIRC the verdict was, "you can't do
> that today, but there should be a way to do it".  Also if I remember
> anything about my hasty analysis at the time, the biggest challenge
> would be getting code which accesses types to rely on their
> tp_basicsize in order to find the beginning of the variable stuff.

Yes, we need a solution for this, but I still haven't figured out how
to do it.  Help (best in the form of a suggested strategy) would be

From Christian's post I can't tell if he wants his types to be dynamic
or static (i.e. if he's creating an arbitrary number of them at
run-time or only a fixed number that's known at compile-time).

Here's a hack.

For static extensions, you could extend one of the extension structs,
e.g. PyMappingMethods (which is the smallest and also least likely to
grow new methods), with additional fields.  Then you'd have to know
whether you can access those extra fields; I suggest checking for the
metatype.  A few casts and you're done.

For dynamic extensions, you might be able to do the same: after
type_new() has given you an object, allocate memory for an extended
PyMappingMethods struct, copy the existing PyMappingMethods struct
into it (if it exists), and replace the pointer.  Then in your
deallocation function, make sure to free the pointer.

Hope this helps in the short run.

--Guido van Rossum (home page:

From  Mon Aug  5 15:15:24 2002
From: (Aahz)
Date: Mon, 5 Aug 2002 10:15:24 -0400
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
In-Reply-To: <024401c23c84$1f8b07b0$05d141d5@hagrid>
References: <Pine.LNX.4.44.0208010343380.7588-100000@ziggy> <024401c23c84$1f8b07b0$05d141d5@hagrid>
Message-ID: <>

On Mon, Aug 05, 2002, Fredrik Lundh wrote:
> Ka-Ping Yee wrote:
>> I would be very unhappy about having to enter and edit inline
>> documentation in an XML-based markup language.
> have you tried it?


> I suggest taking a look at 2.3's module.
> does the comments that start with a single ## line look scary
> to you?
> it's javadoc-style markup, which is based on HTML.  if you've
> ever written a webpage, you can learn the rest in a couple of
> minutes.

That's not XML, and I wouldn't even call it XML-based.  It's yet another
structured text markup that includes bits of XML (or HTML or whatever)
and can be converted to XML.  I don't know what exactly you're using in, but I took a look at the javadoc docs when the discussion
of reST came up because I wanted to know what reST had that javadoc
didn't (and vice-versa) -- it's clear to me that javadoc is at least
somewhat limited compared to reST, and that using javadoc for any kind of
heavily marked-up docs looks far uglier than reST.

The part of reST that's as limited as what you're using in
can also be learned in a couple of minutes.
Aahz (           <*>

Project Vote Smart:

From  Mon Aug  5 15:24:54 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 10:24:54 -0400
Subject: [Python-Dev] Single- vs. Multi-pass iterability
In-Reply-To: Your message of "Sun, 04 Aug 2002 16:07:08 CDT."
References: <>
Message-ID: <>

> > - There really isn't anything "broken" about the current situation;
> >   it's just that "next" is the only method name mapped to a slot in
> >   the type object that doesn't have leading and trailing double
> >   underscores.

> I'm way behind on the email for this list, but I wanted to chime in
> with an idea related to this old thread. I know we want to limit the
> rate of language/feature changes for the business community. At the
> same time, this situation with iterators is proof that even the best
> thought out new features can still have a few blemishes that get
> discovered after they've been incorporated into Python proper. It's
> just terribly difficult to get anything "right" the very first time,
> and it would be nice to fix these blemishes sooner, rather than
> later.
> So perhaps we need some sort of concept of a "grace period" on
> brand-new features during which blemishes can be polished off, even
> if the polishing breaks backward compatibility. After the grace
> period, breaking backward compatibility becomes a higher
> priority. Since we are talking about backward compatibility only as
> it relates to the brand-new features themselves, Python-In-A-Tie
> folks can avoid the issue altogether by not using the new features
> during the grace period.
> Would something like this be an acceptable compromise?

I guess we could explicitly label certain features as experimental in
2.3.  I don't think we can interpret 2.2 like this retroactively --
while the new type stuff was labeled experimental at some point, the
iterators and generators were not, and the new type stuff was pretty
much fixed by releasing 2.2.1 (and by declaring 2.2 the tie-wearing

--Guido van Rossum (home page:

From  Mon Aug  5 15:47:03 2002
From: (Oren Tirosh)
Date: Mon, 5 Aug 2002 17:47:03 +0300
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html, asciihtml
In-Reply-To: <029301c23c88$02a287a0$05d141d5@hagrid>; from on Mon, Aug 05, 2002 at 03:57:10PM +0200
References: <> <> <029301c23c88$02a287a0$05d141d5@hagrid>
Message-ID: <>

On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
> > and the decimal character reference isn't really more useful than
> > the named entity reference.
> really?  converting a decimal character reference to a unicode character
> is trivial, but how do you convert a named entity reference to a unicode
> character?  (look it up in the htmlentitydefs?)
> here's a trivial piece of code that converts the entitydefs dictionary to
> a entity->unicode mapping:
>     entitydefs_unicode = {}
>     for entity, char in entitydefs.items():
>         if char[:2] == "&#":
>             char = unichr(int(char[2:-1]))
>         else:
>             char = unicode(char, "iso-8859-1")
>         entitydefs_unicode[entity] = char

Sure it's trivial but why should I be forced to do this conversion? I'm
sorry if I didn't explain myself so well. What I meant is not that the
entitydefs dictionary is useless but that decimal character references are
not useful by themselves - they are just another intermediate form.  Why
does the dictionary convert from "&alpha;" to "&#945;" and not to the
fully decoded form which is the single unicode character u'\u03b1'?

I can't think of a case where numeric references are really useful by
themselves and not as some intermediate form.  Browsers understand
"&alpha;" and "&#945;" equally well. Humans find the named references
easier to understand. Processing programs can't understand "&#945;"
without first isolating the digits and converting them to a number. 

About case sensitivity you're right - smashing case does lose some
information. If a parser needs to understand sloppy manually-generated
HTML with tags like "&GT;" it should be a little smarter than that.


From  Mon Aug  5 16:01:34 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 17:01:34 +0200
Subject: [Python-Dev] Re: [ python-Patches-590682 ] New codecs: html,
References: <> <> <029301c23c88$02a287a0$05d141d5@hagrid> <>
Message-ID: <>

Oren Tirosh wrote:
> On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
>>>and the decimal character reference isn't really more useful than
>>>the named entity reference.
>>really?  converting a decimal character reference to a unicode character
>>is trivial, but how do you convert a named entity reference to a unicode
>>character?  (look it up in the htmlentitydefs?)
>>here's a trivial piece of code that converts the entitydefs dictionary to
>>a entity->unicode mapping:
>>    entitydefs_unicode = {}
>>    for entity, char in entitydefs.items():
>>        if char[:2] == "&#":
>>            char = unichr(int(char[2:-1]))
>>        else:
>>            char = unicode(char, "iso-8859-1")
>>        entitydefs_unicode[entity] = char
> Sure it's trivial but why should I be forced to do this conversion? 

Maybe because users of htmlentitydefs don't want to pay for
the extra table even though they don't use it ?

> I'm
> sorry if I didn't explain myself so well. What I meant is not that the
> entitydefs dictionary is useless but that decimal character references are
> not useful by themselves - they are just another intermediate form.  Why
> does the dictionary convert from "&alpha;" to "&#945;" and not to the
> fully decoded form which is the single unicode character u'\u03b1'?

Because that only works for Unicode and not all applications
are written to work with Unicode. The table maps entities to
Latin-1 which is HTML's default encoding.

> I can't think of a case where numeric references are really useful by
> themselves and not as some intermediate form.  Browsers understand
> "&alpha;" and "&#945;" equally well. Humans find the named references
> easier to understand. Processing programs can't understand "&#945;"
> without first isolating the digits and converting them to a number. 
> About case sensitivity you're right - smashing case does lose some
> information. If a parser needs to understand sloppy manually-generated
> HTML with tags like "&GT;" it should be a little smarter than that.

That is application specific. The htmlentitydefs were generated
from the HTML spec files themselves, so they provide the basics
needed to work from. It's easy enough for you to write a function
which translates the basic table into anything you need.
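
As a rough illustration of such a translation function, here is a
minimal sketch.  It assumes a present-day Python 3, where the table
lives in html.entities and exposes name2codepoint; the helper name
entity_to_unicode is made up for this example and was not proposed
in the thread.

```python
from html.entities import name2codepoint


def entity_to_unicode():
    """Return a dict mapping entity names to fully decoded characters."""
    # name2codepoint maps e.g. "alpha" -> 945; chr() gives the character.
    return {name: chr(codepoint) for name, codepoint in name2codepoint.items()}


table = entity_to_unicode()
print(repr(table["alpha"]))
```

An application that does not work with Unicode can skip building
the table, which is the cost argument made above.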

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 16:06:50 2002
From: (Fredrik Lundh)
Date: Mon, 5 Aug 2002 17:06:50 +0200
Subject: [Python-Dev] Re: Docutils/reStructuredText is ready to process PEPs
References: <Pine.LNX.4.44.0208010343380.7588-100000@ziggy> <024401c23c84$1f8b07b0$05d141d5@hagrid> <>
Message-ID: <035801c23c91$bcaf6150$05d141d5@hagrid>

Aahz wrote:

> > have you tried it?
> Yes.

Details, please.

We've recently used javadoc/pythondoc in a relatively large Python
project (currently 30ksloc python, about 350 pages extracted docs)
with good results.  Most people involved had some exposure to html,
but not javadoc.  I don't think we've seen any markup errors at all.

> > it's javadoc-style markup, which is based on HTML.  if you've
> > ever written a webpage, you can learn the rest in a couple of
> > minutes.
> That's not XML, and I wouldn't even call it XML-based.

It all ends up in an XML infoset, and the mapping can be
described in a single sentence.  Close enough for me.

> using javadoc for any kind of heavily marked-up docs looks far
> uglier than reST.

Why would anyone put heavily marked-up documentation in
docstrings?  Are you doing that?  Any reason you cannot use
a word processor (interactive or batch) for those parts?

> The part of reST that's as limited as what you're using in
> can also be learned in a couple of minutes.

Perhaps, but I already know HTML and JavaDoc; why waste brain
cells on learning yet another homebrewed markup language?  


From  Mon Aug  5 16:20:07 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 17:20:07 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>I'd like to put the following PEP up for pronouncement. Walter
>>is currently on vacation, but he asked me to already go ahead
>>with the process.
>>I like the patch a lot and the implementation strategy is very
>>interesting as well (just wish that classes were new types --
>>then things could run a tad faster and the patch would be
>>The basic idea of the patch is to provide a way to elegantly
>>handle error situations in codecs which go beyond the standard
>>cases 'ignore', 'replace' and 'strict', e.g. to automagically
>>escape problem case, to log errors for later review or to
>>fetch additional information for the proper handling at coding
>>time (for example, fetching entity definitions from a URL).
> I know you want me to pronounce on this, but I'd like to abstain.


> I have no experience in using codecs to have any kind of sense about
> whether this is good or not.  If you feel confident that it's good,
> you can make the decision on your own.  If you're not yet confident, I
> suggest getting more review.  I do note that the patch is humungous
> (isn't everything related to Unicode? :-) so might need more review
> before it goes in.

Walter has written a pretty good test suite for the patch
and I have a good feeling about it. I'd like Walter to check
it into CVS and then see whether the alpha tests bring up any
quirks. The patch only touches the codecs and adds some new
exceptions. There are no other changes involved.

I think that together with PEP 263 (source code encoding) this
is a great step forward in Python's i18n capabilities.

BTW, the test script contains some examples of how to put the
error callbacks to use:

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 16:27:00 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 11:27:00 -0400
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: Your message of "Mon, 05 Aug 2002 17:20:07 +0200."
References: <> <> <>
Message-ID: <>

> Walter has written a pretty good test suite for the patch
> and I have a good feeling about it. I'd like Walter to check
> it into CVS and then see whether the alpha tests bring up any
> quirks. The patch only touches the codecs and adds some new
> exceptions. There are no other changes involved.
> I think that together with PEP 263 (source code encoding) this
> is a great step forward in Python's i18n capabilities.
> BTW, the test script contains some examples of how to put the
> error callbacks to use:

Sounds like a plan then.

--Guido van Rossum (home page:

From  Mon Aug  5 16:27:05 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 17:27:05 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>
Message-ID: <>

Oren Tirosh wrote:
> Here's another implementation strategy:
> [hacking charmap codec]
> Tying it all together:
> A codec that does both charmap and entity reference translations may be 
> dynamically generated.  A function will be registered that intercepts 
> any codec name that looks like 'xmlcharref.CODECNAME', import that codec, 
> create patched charmaps and return the (enc, dec, reader, writer) tuple.
> The dynamically created entry will be cached for later use. 

Even though that's possible, why add more magic to the codec registry ?
u.encode('latin-1', 'xmlcharrefreplace') looks much clearer to me.

You are of course free to write a codec which implements this
directly. No change to the core is needed for that.

However, PEP 293 addresses a much wider application space than
just escaping unmappable characters.
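
For comparison, a small sketch of the error-handler spelling,
assuming a Python in which the PEP 293 handlers are available (they
are in current Python 3).  The sample string is only an
illustration.

```python
text = "caf\u00e9 \u03b1"  # e-acute followed by a Greek alpha

# Latin-1 can hold the e-acute but not the alpha, so only the alpha
# is escaped as a decimal character reference.
latin1 = text.encode("latin-1", "xmlcharrefreplace")

# ASCII can hold neither, so both characters are escaped.
ascii_only = text.encode("ascii", "xmlcharrefreplace")

print(latin1)
print(ascii_only)
```

The same handler name works with any target encoding, so no
per-encoding 'xmlcharref.CODECNAME' entry is needed for this case.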

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 16:31:45 2002
From: (Jack Jansen)
Date: Mon, 5 Aug 2002 17:31:45 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
Message-ID: <>

Having to register the error handler first and then finding it by name 
smells like a very big hack to me. I understand the reasoning (that you 
don't want to modify the API of a gazillion C routines to add an error 
object argument) but it still seems like a hack....
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Mon Aug  5 16:47:21 2002
From: (M.-A. Lemburg)
Date: Mon, 05 Aug 2002 17:47:21 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>
Message-ID: <>

Jack Jansen wrote:
> Having to register the error handler first and then finding it by name 
> smells like a very big hack to me. I understand the reasoning (that you 
> don't want to modify the API of a gazillion C routines to add an error 
> object argument) but it still seems like a hack....

Well, in that case, you would have to call the whole codec registry
a hack ;-)

I find having the callback available by an alias name very user
friendly, but YMMV. The main reason behind this way of doing it
is to maintain C API compatibility without adding a complete
new b/w compatibility layer (Walter started out this way; see the
SF patch page).
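
A minimal sketch of the register-then-look-up-by-name approach, as
it ended up in the codecs module of current Python 3.  The handler
name "example.logreplace" and the function log_and_replace are
invented for this illustration; they are not taken from the patch.

```python
import codecs

logged = []  # (position, offending text) pairs, kept for later review


def log_and_replace(exc):
    """Record each unencodable run and substitute question marks."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    logged.append((exc.start, bad))
    # Return the replacement text and the position to resume encoding at.
    return ("?" * len(bad), exc.end)


# Register under an alias; codecs later find the callback by this name.
codecs.register_error("example.logreplace", log_and_replace)

result = "caf\u00e9 \u03b1".encode("ascii", "example.logreplace")
print(result, logged)
```

Because the callback is found by its string alias, existing APIs
that already take an `errors` string need no extra argument.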

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug  5 17:06:28 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 12:06:28 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <024701c23c84$2037c270$05d141d5@hagrid>
Message-ID: <>

> sounds like yet another reason to add two methods; one that
> guarantees stability, and one that doesn't.

I haven't heard the first reason, only people latching on to that a
distinction *can* be drawn, "so therefore it must be" (or something like
that ...).  The only portable way you can get stability is to do the DSU
business anyway.

> the only counter-argument I've seen from you is code bloat, but
> I cannot see what stops us from mapping *both* methods to a
> single implementation in CPython 2.3.

I passed that suggestion on in the patch report, when I asked Guido to
Pronounce, and he didn't want that.  Perl 5.8 "has a sort pragma for limited
control of the sort ... [which] may not persist into future perls",
according to <>.  Maybe
we should do that too <wink>.

> an alternative would be to add a sortlib module:
>     $ more Lib/
>     def stablesort(list):
>         list.sort() # 2.3's timsort is stable!
> and a regression test script that makes sure that it really is stable
> (can a test program ever be sure?)

I've suggested before that you may very well want to use DSU indices even if
you *know* the underlying sort is stable, in order to prevent massive
increase in sort time due to equal keys falling back to comparing records
(some sorts from Kevin Altis's database showed that dramatically).  So the
use cases for relying on stability *in Python* aren't all that clear:
passing an explicit comparison function is way slower, but sorting (key,
record) tuples instead is also prone to major slowdown surprises.  Sorting
(key, index, record) tuples remains your safest bet (unless you don't care
about speed).
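
The (key, index, record) idiom can be sketched as follows.  This is
an illustration only: dsu_sort and the sample records are
hypothetical names, not code from the patch.

```python
def dsu_sort(records, key):
    """Decorate-sort-undecorate with an index as tie-breaker."""
    # The unique index settles every tie between equal keys, so the
    # tuple comparison never falls through to the records themselves.
    decorated = [(key(rec), i, rec) for i, rec in enumerate(records)]
    decorated.sort()
    return [rec for _, _, rec in decorated]


# Dicts cannot be ordered with "<", which shows that the records are
# never compared even though two of the keys are equal.
records = [
    {"name": "b", "n": 1},
    {"name": "a", "n": 1},
    {"name": "c", "n": 0},
]
ordered = dsu_sort(records, key=lambda rec: rec["n"])
print([rec["name"] for rec in ordered])
```

The index also makes the result stable whether or not the
underlying sort is.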

So I'd like to see some real use cases.  An appropriate design for a sortlib
module may (or may not) suggest itself then.

BTW, list.sort() is stable in CPython iff

    [].sort.__doc__.find('stable')

is true.  Short of that, the stability test in Lib/test/ will
almost certainly determine whether it's stable (not 100% certain, but
99.999999% easy <wink>).
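
An empirical check along those lines might look like the sketch
below.  It is not the test from the standard test suite; the
function name looks_stable and its parameters are assumptions made
for this example.

```python
import random


def looks_stable(sort, n=1000, keys=10, seed=0):
    """Return True if sort(data, key=...) kept equal keys in input order."""
    rng = random.Random(seed)
    # Few distinct keys and many records, so ties are plentiful.
    data = [(rng.randrange(keys), i) for i in range(n)]
    result = sort(data, key=lambda pair: pair[0])
    # A stable sort on the key alone must agree with a full sort on
    # (key, original index).
    return result == sorted(data)


def reversing_sort(data, key):
    """A deliberately unstable stand-in: ties come out in reverse order."""
    return sorted(list(reversed(data)), key=key)


print(looks_stable(sorted), looks_stable(reversing_sort))
```

Passing such a check only shows stability on the data tried, which
is the "not 100% certain" caveat above.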

From  Mon Aug  5 17:16:49 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 12:16:49 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: Your message of "Mon, 05 Aug 2002 12:06:28 EDT."
References: <>
Message-ID: <>

> BTW, list.sort() is stable in CPython iff
>     [].sort.__doc__.find('stable')
> is true.

Um, you meant "is >= 0".  The find() method doesn't return a bool, it
returns the first index where the string is found, and -1 if it is not
found.

--Guido van Rossum (home page:

From  Mon Aug  5 17:28:25 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 12:28:25 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

>> BTW, list.sort() is stable in CPython iff
>>     [].sort.__doc__.find('stable')
>> is true^H^H^H^H> 0.

> Um, you meant "is >= 0".  The find() method doesn't return a bool, it
> returns the first index where the string is found, and -1 if it is not
> found.

What, you mean you haven't retroactively redefined -1 to be False yet?  For
shame <wink>.

From  Mon Aug  5 17:30:36 2002
From: (Aahz)
Date: Mon, 5 Aug 2002 12:30:36 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Mon, Aug 05, 2002, Guido van Rossum wrote:
>Tim Peters:
>> BTW, list.sort() is stable in CPython iff
>>     [].sort.__doc__.find('stable')
>> is true.
> Um, you meant "is >= 0".  The find() method doesn't return a bool, it
> returns the first index where the string is found, and -1 if it is not
> found.

Which only goes to prove that the people who've been whining about that
characteristic of find() were right all along.  ;-)
Aahz (           <*>

Project Vote Smart:

From  Mon Aug  5 17:35:43 2002
From: (Barry A. Warsaw)
Date: Mon, 5 Aug 2002 12:35:43 -0400
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

    TP> What, you mean you haven't retroactively redefined -1 to be
    TP> False yet?  For shame <wink>.

Have you checked current cvs?

Python 2.3a0 (#1, Aug  5 2002, 12:06:31) 
[GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> int(True)
>>> int(False)
>>> int(Maybe)


From  Mon Aug  5 17:38:35 2002
From: (Aahz)
Date: Mon, 5 Aug 2002 12:38:35 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Mon, Aug 05, 2002, Guido van Rossum wrote:
>>> Um, you meant "is >= 0".  The find() method doesn't return a bool, it
>>> returns the first index where the string is found, and -1 if it is not
>>> found.
>> Which only goes to prove that the people who've been whining about that
>> characteristic of find() were right all along.  ;-)
> So what would you like it to return?  True/False, with no possibility
> of finding where the substring starts?  That defeats a common use
> case.

Well, of course it can't be changed, but if Tim of all people made that
mistake, I think it's a good indicator that something's wrong.  I believe
the suggestion has been made to add an exists() method or something
similar; it's probably better to have that in the core under some
standard name instead of each person who needs it implementing the
one-liner under different names.
Aahz (           <*>

Project Vote Smart:

From  Mon Aug  5 17:41:35 2002
From: (Barry A. Warsaw)
Date: Mon, 5 Aug 2002 12:41:35 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <>
Message-ID: <>

>>>>> "A" == Aahz  <> writes:

    A> Well, of course it can't be changed, but if Tim of all people
    A> made that mistake, I think it's a good indicator that
    A> something's wrong.  I believe the suggestion has been made to
    A> add an exists() method or something similar; it's probably
    A> better to have that in the core under some standard name
    A> instead of each person who needs it implementing the one-liner
    A> under different names.

What about extending `in' to allow strings longer than a single
character?  E.g.

>>> 'lo' in 'hello world'

?  That seems like the most natural way to want to spell it, and is an
extension of what you can already do.

From  Mon Aug  5 17:43:20 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 12:43:20 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 12:38:35 EDT."
References: <> <> <> <>
Message-ID: <>

> Well, of course it can't be changed, but if Tim of all people made that
> mistake, I think it's a good indicator that something's wrong.

I'm not arguing with that, but I'm not sure how to fix it.  We've
already got two substring test methods (index() and find()).  Do we
really need a third?

> I believe
> the suggestion has been made to add an exists() method or something
> similar; it's probably better to have that in the core under some
> standard name instead of each person who needs it implementing the
> one-liner under different names.

Nobody writes the one-liner, everybody tries to remember to use

I don't like exists().  Maybe we should finally implement "s1 in s2"
as "s2.find(s1) >= 0", i.e. add a __contains__ method to strings?

--Guido van Rossum (home page:

From  Mon Aug  5 17:47:37 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 12:47:37 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> ...
> I don't like exists().  Maybe we should finally implement "s1 in s2"
> as "s2.find(s1) >= 0", i.e. add a __contains__ method to strings?

I asked you about that a few weeks ago, and you were agreeable.  I posted
that info to, saying that if anyone cared enough to submit a patch,
the idea was pre-approved.  AFAIK, nobody bit (but I didn't pay much
attention to patches last week -- hope springs eternal <wink>).

From  Mon Aug  5 17:48:53 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 12:48:53 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> Well, of course it can't be changed, but if Tim of all people made that
> mistake, I think it's a good indicator that something's wrong.

Na, I make a lot of mistakes at these ungodly early hours.  "str1 in str2"
is the right solution now.

From  Mon Aug  5 17:52:56 2002
From: (Skip Montanaro)
Date: Mon, 5 Aug 2002 11:52:56 -0500
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <15694.44392.580198.319793@localhost.localdomain>

    aahz> Well, of course it can't be changed, but if Tim of all people made
    aahz> that mistake, I think it's a good indicator that something's
    aahz> wrong.

I don't think that means any such thing.  First, bot or not, Tim is allowed
to make the occasional mistake.  Everybody does.  Making a mistake doesn't
mean the language is flawed in this case.  "Find" seems like the perfect
name ("tell me where this is") and its return value is absolutely correct
for further operation on the found substring (where it was found).  I don't
believe strings need to grow an .exists() method which in effect does

    def exists(self, sub, start=None, end=None):
        return self.find(sub, start, end) >= 0

which would probably be used a lot less than .find() anyway.


From  Mon Aug  5 17:33:12 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 12:33:12 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: Your message of "Mon, 05 Aug 2002 12:30:36 EDT."
References: <> <>
Message-ID: <>

> > Um, you meant "is >= 0".  The find() method doesn't return a bool, it
> > returns the first index where the string is found, and -1 if it is not
> > found.
> Which only goes to prove that the people who've been whining about that
> characteristic of find() were right all along.  ;-)

So what would you like it to return?  True/False, with no possibility
of finding where the substring starts?  That defeats a common use
case.

--Guido van Rossum (home page:

From  Mon Aug  5 18:22:54 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 13:22:54 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> > > Um, you meant "is >= 0".  The find() method doesn't return a bool, it
> > > returns the first index where the string is found, and -1 if it is not
> > > found.
> > 
> > Which only goes to prove that the people who've been whining about that
> > characteristic of find() were right all along.  ;-)
> So what would you like it to return?  True/False, with no possibility
> of finding where the substring starts?  That defeats a common use
> case.

True.  On the other hand, this is a very common gotcha.  I've been bitten by 
it three times in the last week, and I should know better.  Fact is that
missing > -1 is hard to spot.
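
A short sketch of the gotcha, using made-up sample strings:

```python
doc = "stable sort"

hit = doc.find("stable")   # 0: found at the very start
miss = doc.find("quick")   # -1: not found

# Used directly as a condition, both answers come out backwards.
wrong_hit = bool(hit)      # False, although the substring is present
wrong_miss = bool(miss)    # True, although the substring is absent

# The explicit comparison that is easy to leave out.
right_hit = hit >= 0
right_miss = miss >= 0

print(wrong_hit, wrong_miss, right_hit, right_miss)
```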

I think the right answer is to leave find() as it is and have a different
notation that returns bool.  How about `a in b' whenever a and b are
both string-valued?  Seems the most natural candidate.
		<a href="">Eric S. Raymond</a>

From  Mon Aug  5 18:23:06 2002
From: (Damien Morton)
Date: Mon, 5 Aug 2002 13:23:06 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
Message-ID: <000a01c23ca4$c32a5630$6a906c42@damien>

There was a thread on this a while back on

From  Mon Aug  5 18:39:06 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 13:39:06 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Eric S Raymond, breaking a too-long silence]
> ...
> I think the right answer is to leave find() as it is and have a different
> notation that returns bool.  How about `a in b' whenever a and b are
> both string-valued?  Seems the most natural candidate.

I want to raise one other issue here:  should

    '' in 'xyz'

return True or raise an exception?  I've been burned, e.g., by

    >>> 'xyz'.startswith('')
    True
    >>>

when '' was computed by an expression that didn't "expect to" reduce to
nothingness, and I expect *everyone* here has been saved more than once by
that

    '' in 'xyz'

currently raises an exception.  If we make __contains__ act like

    'xyz'.find('') >= 0

that (very probable) error will pass silently in the future:

    >>> 'xyz'.find('')
    0
    >>>

IOW, do we follow find() rigidly, or retain "str1 in str2"'s current
behavior when str1 is empty?

From  Mon Aug  5 18:44:35 2002
From: (Barry A. Warsaw)
Date: Mon, 5 Aug 2002 13:44:35 -0400
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

    TP> IOW, do we follow find() rigidly, or retain "str1 in str2"'s
    TP> current behavior when str1 is empty?

Is the nothing part of the everything?

I'm not sure what the natural interpretation should be, but why would
you ever want to know if '' is in somestring?  Usually I think you'd
only want to know if '' == somestring, so perhaps we should break the
symmetry here.

yin-yang-ly y'rs,

From  Mon Aug  5 18:49:42 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 13:49:42 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> [Eric S Raymond, breaking a too-long silence]

Thank you, Tim!

> > I think the right answer is to leave find() as it is and have a different
> > notation that returns bool.  How about `a in b' whenever a and b are
> > both string-valued?  Seems the most natural candidate.
> I want to raise one other issue here:  should
>     '' in 'xyz'
> return True or raise an exception?

> IOW, do we follow find() rigidly, or retain "str1 in str2"'s current
> behavior when str1 is empty?

Raise an exception.  Definitely.  There is no reason to follow find() 
rigidly when the whole point is to have semantics different from find().  
Besides, you're right to point out that changing this behavior could 
break existing code, and that is a big no-no.
		<a href="">Eric S. Raymond</a>

From  Mon Aug  5 18:53:42 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 13:53:42 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Barry, on  '' in 'xyz']
> Is the nothing part of the everything?

That's right, and if Python is a programming language for mystics that's
clearly the best answer <wink>.

> I'm not sure what the natural interpretation should be,

    s1 in s2

if and only if there exists an int i such that

    s2[i : i+len(s1)] == s1

is the acade^H^H^H^H^Hmystic meaning, and that's true of every i
in -sys.maxint .. sys.maxint when s1 is empty.
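
That definition can be written out directly.  The sketch below is
only an illustration; contains_by_definition is a made-up name, and
it tries only the offsets 0..len(s2), which is enough because any
other i yields the same slices.

```python
def contains_by_definition(s1, s2):
    """True if some offset i gives s2[i : i+len(s1)] == s1."""
    return any(s2[i:i + len(s1)] == s1 for i in range(len(s2) + 1))


checks = {
    ("lo", "hello world"): contains_by_definition("lo", "hello world"),
    ("xyz", "xy"): contains_by_definition("xyz", "xy"),
    ("", "xyz"): contains_by_definition("", "xyz"),
}
print(checks)
```

Under this reading the empty string is a substring of every string.
In present-day Python 3, `'' in 'xyz'` evaluates to True, which
matches this definition.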

> but why would you ever want to know if '' is in somestring?

That's the practical rub indeed.  You never want to know that, so if you end
up asking it it's almost certainly a logic error in preceding code.

> Usually I think you'd only want to know if '' == somestring, so perhaps
> we should break the symmetry here.

That is the question.

From  Mon Aug  5 18:56:42 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 13:56:42 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 13:39:06 EDT."
References: <>
Message-ID: <>

> I want to raise one other issue here:  should
>     '' in 'xyz'
> return True or raise an exception?  I've been burned, e.g., by
>     >>> 'xyz'.startswith('')
>     True
>     >>>
> when '' was computed by an expression that didn't "expect to" reduce to
> nothingness, and I expect *everyone* here has been saved more than once by
> that
>     '' in 'xyz'
> currently raises an exception.

I dunno.  The exception has annoyed me too.

> If we make __contains__ act like
>     'xyz'.find('') >= 0
> that (very probable) error will pass silently in the future:
>     >>> 'xyz'.find('')
>     0
>     >>>
> IOW, do we follow find() rigidly, or retain "str1 in str2"'s current
> behavior when str1 is empty?

I expect that Andrew Koenig would delight in this question. :-)

I personally see no way to defend ('' in 'x') returning false; it's so
clearly a substring that any definition of substring-ness that
excludes this seems mathematically wrong, despite your good
intentions.

I guess we'll have to cope in the same way as we cope with the
behavior of find() and startswith() in similar cases.

--Guido van Rossum (home page:

From  Mon Aug  5 19:03:58 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 14:03:58 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 13:56:42 EDT."
References: <>
Message-ID: <>

I wrote:
> I personally see no way to defend ('' in 'x') returning false; it's so
> clearly a substring that any definition of substring-ness that
> excludes this seems mathematically wrong, despite your good
> intentions.

However, the backwards compatibility argument makes sense.  It used to
raise an exception and it would probably break code if it stopped
doing so; longer strings are much less likely to be passed by accident
so the need for the exception there is less strong.  I'm of two minds
on this now...

--Guido van Rossum (home page:

From  Mon Aug  5 19:13:02 2002
From: (Neil Schemenauer)
Date: Mon, 5 Aug 2002 11:13:02 -0700
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>; from on Mon, Aug 05, 2002 at 01:39:06PM -0400
References: <> <>
Message-ID: <>

Tim Peters wrote:
> IOW, do we follow find() rigidly, or retain "str1 in str2"'s current
> behavior when str1 is empty?

I vote for the former.


From  Mon Aug  5 19:17:11 2002
From: (Neil Schemenauer)
Date: Mon, 5 Aug 2002 11:17:11 -0700
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>; from on Mon, Aug 05, 2002 at 11:13:02AM -0700
References: <> <> <>
Message-ID: <>

Neil Schemenauer wrote:
> Tim Peters wrote:
> > IOW, do we follow find() rigidly, or retain "str1 in str2"'s current
> > behavior when str1 is empty?
> I vote for the former.

D'oh.  I meant the LATTER (e.g. raise an error for an empty LHS).


From  Mon Aug  5 19:51:45 2002
From: (Andrew Koenig)
Date: 05 Aug 2002 14:51:45 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Eric> Raise an exception.  Definitely.  There is no reason to follow
Eric> find() rigidly when the whole point is to have semantics
Eric> different from find().  Besides, you're right to point out that
Eric> changing this behavior could break existing code, and that is a
Eric> big no-no.

Changing the meaning of ('ab' in 'abc') might also break existing code.

Andrew Koenig,,

From  Mon Aug  5 20:00:27 2002
From: (Bjorn Pettersen)
Date: Mon, 5 Aug 2002 13:00:27 -0600
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
Message-ID: <>

> From: Tim Peters []
> [Guido]
> > ...
> > I don't like exists().  Maybe we should finally implement "s1 in s2"
> > as "s2.find(s1) >= 0", i.e. add a __contains__ method to strings?
> I asked you about that a few weeks ago, and you were
> agreeable.  I posted that info to, saying that if
> anyone cared enough to submit a patch, the idea was
> pre-approved.  AFAIK, nobody bit (but I didn't pay much
> attention to patches last week -- hope springs eternal <wink>).

Well, there was

I'll see if I can find time next weekend to figure out the compilation
warnings, adding back special casing for single char containment, and
adding test and doc patches...

-- bjorn

From  Mon Aug  5 19:44:50 2002
From: (Steve Holden)
Date: Mon, 5 Aug 2002 14:44:50 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <>              <>  <>
Message-ID: <042501c23cb0$302a62b0$>

> I wrote:
> > I personally see no way to defend ('' in 'x') returning false; it's so
> > clearly a substring that any definition of substring-ness that
> > excludes this seems mathematically wrong, despite your good
> > intentions.
If you are serious about this proposal then clearly it would be as well to
have "in" agree with find(), and currently anystring.find('') returns zero,
suggesting the null string first appears at the beginning.

> However, the backwards compatibility argument makes sense.  It used to
> raise an exception and it would probably break code if it stopped
> doing so; longer strings are much less likely to be passed by accident
> so the need for the exception there is less strong.  I'm of two minds
> on this now...

However, I'm somewhat horrified to see this being discussed seriously. You
can take pragmatism too far, you know ;-)

Are you also proposing to allow

    if [2, 3] in [1, 2, 3, 4]

which is effectively the meaning you seem to be proposing for strings? Where
else in the language does the keyword "in" refer to anything other than
membership? Why do we need another way to do what find() and index() already
do?

Should we also ensure that

    for s in "abc":
        print s

prints

    a
    ab
    abc
    b
    bc
    c

Should it also print a blank line because "'' in anystring" is true? I can
see why users might want to be able to use a "string in string" construct,
but it would seem to confuse the "for" semantics. Is there some other
construct for which

    for v in object_or_instance:

does not assign to v all x such that "x in object_or_instance" is true? I
can see a few teaching problems here.

my-god-*am*-i-really-a-bigot-ly y'rs  - steve
Steve Holden                       
Python Web Programming      

From  Mon Aug  5 20:03:55 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:03:55 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 13:00:27 MDT."
References: <>
Message-ID: <>

> Well, there was
> &hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=mailman.1024100114.13008.python-list%4
> I'll see if I can find time next weekend to figure out the compilation
> warnings, adding back special casing for single char containment, and
> adding test and doc patches...

Cool.  Please use the SourceForge patch manager!

--Guido van Rossum (home page:

From  Mon Aug  5 20:05:43 2002
From: (Steve Holden)
Date: Mon, 5 Aug 2002 15:05:43 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
References: <><><> <>
Message-ID: <043901c23cb3$1b680690$>

[Andrew Koenig]
> Eric> Raise an exception.  Definitely.  There is no reason to follow
> Eric> find() rigidly when the whole point is to have semantics
> Eric> different from find().  Besides, you're right to point out that
> Eric> changing this behavior could break existing code, and that is a
> Eric> big no-no.
> Changing the meaning of ('ab' in 'abc') might also break existing code.

True, but it does seem unlikely (though not impossible) that many are
relying on "ab" in "abc" raising an exception.

Steve Holden                       
Python Web Programming      

From  Mon Aug  5 20:11:15 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:11:15 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 14:44:50 EDT."
References: <> <> <>
Message-ID: <>

> [GvR]
> > > I personally see no way to defend ('' in 'x') returning false;
> > > it's so clearly a substring that any definition of
> > > substring-ness that excludes this seems mathematically wrong,
> > > despite your good intentions.

> If you are serious about this proposal then clearly it would be as
> well to have "in" agree with find(), and currently
> anystring.find('') returns zero, suggesting the null string first
> appears at the beginning.

Yes, consistency strongly suggests that.

> > However, the backwards compatibility argument makes sense.  It used to
> > raise an exception and it would probably break code if it stopped
> > doing so; longer strings are much less likely to be passed by accident
> > so the need for the exception there is less strong.  I'm of two minds
> > on this now...
> However, I'm somewhat horrified to see this being discussed
> seriously. You can take pragmatism too far, you know ;-)
> Are you also proposing to allow
>     if [2, 3] in [1, 2, 3, 4]
> which is effectively the meaning you seem to be proposing for
> strings?

No, since it's not a common thing to need.

> Where else in the language does the keyword "in" refer to anything
> other than membership?

Dictionary keys?  That's certainly something very different from
sequence membership!

> Why do we need another way to do what find() and index() already do?

You must've missed the earlier thread -- it's because a substring test
is a common operation and the way to spell it with find() requires you
to tack on ">= 0" which many people accidentally leave out when in a
hurry.

> Should we also ensure that
>     for s in "abc":
>         print s
> prints
>     a
>     ab
>     abc
>     b
>     bc
>     c
> Should it also print a blank line because "'' in anystring" is true? I can
> see why users might want to be able to use a "string in string" construct,
> but it would seem to confuse the "for" semantics. Is there some other
> construct for which
>     for v in object_or_instance:
> does not assign to v all x such that "x in object_or_instance" is true? I
> can see a few teaching problems here.

To this latter example I can only say, "A foolish consistency is the
hobgoblin of little minds."

At least this still holds (unless x is an iterator or otherwise
mutated by access :-):

  for v in x:
     assert v in x
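
A small sketch of that invariant, assuming substring semantics for "in" on strings and a Python 3 interpreter. The helper name iteration_implies_membership is hypothetical. It also shows the asymmetry under discussion: iteration produces single characters only, while "in" accepts more than iteration ever yields.

```python
def iteration_implies_membership(x):
    """Check that every item produced by iterating x also satisfies `in`."""
    return all(v in x for v in x)

text = "abc"
items = list(text)   # iteration yields single characters only: ['a', 'b', 'c']

# Strings that `in` accepts for text but that iteration never produces.
# This assumes substring semantics for str, which is what is being proposed.
extra = [s for s in ("ab", "bc", "abc", "") if s in text and s not in items]

print(iteration_implies_membership(text), extra)
```

So the one-way implication (iterated items are members) holds, while the converse asked about earlier in the thread does not.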

--Guido van Rossum (home page:

From  Mon Aug  5 20:12:03 2002
From: (Neal Norwitz)
Date: Mon, 05 Aug 2002 15:12:03 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <>
 <> <>
Message-ID: <>

Guido van Rossum wrote:
> I wrote:
> > I personally see no way to defend ('' in 'x') returning false; it's so
> > clearly a substring that any definition of substring-ness that
> > excludes this seems mathematically wrong, despite your good
> > intentions.
> However, the backwards compatibility argument makes sense.  It used to
> raise an exception and it would probably break code if it stopped
> doing so; longer strings are much less likely to be passed by accident
> so the need for the exception there is less strong.  I'm of two minds
> on this now...

Here's a patch:

In testing this patch, I ran across this:

	>>> 's' in 's'
	True
	>>> 's' in 's' == True
	False
	>>> 's' in 's' is True
	False
	>>> id('s' in 's')
	135246792
	>>> id(True)
	135246792

What's up with that?  Am I missing something?  
Note: this occurs before the patch too.


From  Mon Aug  5 20:14:30 2002
From: (Andrew Koenig)
Date: Mon, 5 Aug 2002 15:14:30 -0400 (EDT)
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <043901c23cb3$1b680690$>
References: <><><> <> <043901c23cb3$1b680690$>
Message-ID: <>

>> Changing the meaning of ('ab' in 'abc') might also break existing code.

Steve> True, but it does seem unlikely (though not impossible) that many are
Steve> relying on "ab" in "abc" raising an exception.

How many are relying on '' in 'abc' raising an exception?

From  Mon Aug  5 20:14:37 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:14:37 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 15:11:15 EDT."
References: <> <> <> <042501c23cb0$302a62b0$>
Message-ID: <>

> > Are you also proposing to allow
> > 
> >     if [2, 3] in [1, 2, 3, 4]
> > 
> > which is effectively the meaning you seem to be proposing for
> > strings?
> No, since it's not a common thing to need.

Of course, there's another reason why that can't be done even if it
*was* a common need: [2, 3] could be a list item, e.g. [1, [2, 3], 4].
This kind of thing can't happen for strings.

--Guido van Rossum (home page:

From  Mon Aug  5 20:15:28 2002
From: (Jeff Epler)
Date: Mon, 5 Aug 2002 14:15:28 -0500
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Mon, Aug 05, 2002 at 03:12:03PM -0400, Neal Norwitz wrote:
> 	>>> 's' in 's' == True
> 	False

>>> ('s' in 's') == True
True
>>> ('s' in 's') and ('s' == True)
False

short-circuit comparisons include == and is.

From  Mon Aug  5 20:16:18 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:16:18 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 15:12:03 EDT."
References: <> <> <>
Message-ID: <>

> In testing this patch, I ran across this:
> 	>>> 's' in 's'
> 	True
> 	>>> 's' in 's' == True
> 	False
> 	>>> 's' in 's' is True
> 	False
> 	>>> id('s' in 's')
> 	135246792
> 	>>> id(True)
> 	135246792
> What's up with that?  Am I missing something?  

Yes, 'is' and 'in' and '==' are all comparison operators, and the
chaining syntax makes this interpreted as (roughly)

    ('s' in 's') and ('s' == True)
    ('s' in 's') and ('s' is True)
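
The chaining rule can be checked directly. This is a minimal Python 3 sketch; the names needle, haystack, chained, expanded and grouped are made up for illustration, and variables are used instead of literals only to keep newer interpreters from warning about comparing literals.

```python
needle, haystack = "s", "s"

# Chained form: `a in b == c` means `(a in b) and (b == c)`,
# with the middle operand evaluated only once.
chained = needle in haystack == True

# What the chain expands to (roughly).
expanded = (needle in haystack) and (haystack == True)

# What was probably intended: group the membership test first.
grouped = (needle in haystack) == True

print(chained, expanded, grouped)
```

Parenthesizing the membership test is what turns the expression back into a comparison of its result against True.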

--Guido van Rossum (home page:

From  Mon Aug  5 20:19:37 2002
From: (Neal Norwitz)
Date: Mon, 05 Aug 2002 15:19:37 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <> <> <>
 <> <>
Message-ID: <>

Guido van Rossum wrote:
> > In testing this patch, I ran across this:
> >
> >       >>> 's' in 's' is True
> >       False
> >
> > What's up with that?  Am I missing something?
> Yes, 'is' and 'in' and '==' are all comparison operators, and the
> chaining syntax makes this interpreted as (roughly)

Thanks (to Jeff too).  I knew I had to be missing something.
Well, there's still the patch with a working test.


From  Mon Aug  5 20:25:51 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:25:51 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: Your message of "Mon, 05 Aug 2002 15:14:30 EDT."
References: <> <> <> <> <043901c23cb3$1b680690$>
Message-ID: <>

> How many are relying on '' in 'abc' raising an exception?

That's impossible to know.

The case that I am familiar with is roughly as follows.  Suppose you
want to check whether a string begins with a certain character, and
you write something like this:

  c = s[0]
  ...stuff with c...
  if c in string.letters:
     ...parse it further...

The first time this is called with s being empty, the assignment to c
fails because the empty string doesn't have a first item.

So you "fix" that by changing it to this:

  c = s[:1]

But the code is still broken.  Currently, the "if c in string.letters"
will raise an exception, and you'll figure out that s=="" should be
special-cased earlier on.  With the proposed "in" semantics, this
failure is only detected when the "parse it further" code does the
wrong thing -- either it raises another exception, or it produces the
wrong result without raising an exception.  I expect that that will be
harder to debug because the source of the error is farther away from
the detection.
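
The scenario can be sketched as follows under the proposed semantics, where the empty string counts as contained in any string. This assumes Python 3, so string.ascii_letters stands in for string.letters; both function names are hypothetical.

```python
import string

def first_char_is_letter_unguarded(s):
    """The "fixed" version from the text: no exception, but wrong for ""."""
    c = s[:1]                      # "" when s is empty; slicing never raises
    return c in string.ascii_letters

def first_char_is_letter(s):
    """Same check with the empty case handled explicitly."""
    c = s[:1]
    if not c:                      # guard needed because "" in <any str> is True
        return False
    return c in string.ascii_letters

print(first_char_is_letter_unguarded(""), first_char_is_letter(""))
```

With substring semantics the unguarded version reports that an empty string starts with a letter, which is the silent failure described above.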

Note that we make similar exceptions for the empty string in other
places:

  >>> "xxx".islower()
  True
  >>> "xx".islower()
  True
  >>> "x".islower()
  True
  >>> "".islower()
  False

Somehow this reminds me of the 0**0 debate recently in edu-sig...

--Guido van Rossum (home page:

From  Mon Aug  5 20:27:46 2002
From: (Zack Weinberg)
Date: Mon, 5 Aug 2002 12:27:46 -0700
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Mon, Aug 05, 2002 at 03:12:03PM -0400, Neal Norwitz wrote:
>	>>> 's' in 's'
>	True
>	>>> 's' in 's' == True
>	False

The operator precedence is not what you expect.

	>>> ('s' in 's') == True
	True


From  Mon Aug  5 20:27:11 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 15:27:11 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

>> and I expect *everyone* here has been saved more than once by that
>>     '' in 'xyz'
>> currently raises an exception.

> I dunno.  The exception has annoyed me too.

Annoyed because it pointed out an error in your code, or because True would
have been a useful result?  It's annoyed me too, but it was always for the
former reason.

> I expect that Andrew Koenig would delight in this question. :-)

Believe me, he already did <wink>.

> I personally see no way to defend ('' in 'x') returning false;

The suggestion is not that it return False, but that it raise an exception,
as in "errors should never pass silently".

> it's so clearly a substring that any definition of substring-ness that
> excludes this seems mathematically wrong, despite your good intentions.

I'd like to see a plausible use case for

    '' in str

returning True, then.  Do keep in mind that nobody can be more anal about
mathematical consistency than me <0.9 wink>, but the real world isn't much
impressed with our abstractions.

From  Mon Aug  5 20:30:52 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 15:30:52 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

[Bjorn Pettersen]
> Well, there was
> &hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=mailman.1024100114.13008.python-list%4

Nice URL <wink>.  Please put patches on SourceForge -- anywhere else and
they may as well not exist:

> I'll see if I can find time next weekend to figure out the compilation
> warnings, adding back special casing for single char containment, and
> adding test and doc patches...


From  Mon Aug  5 20:43:18 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 15:43:18 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

> Somehow this reminds me of the 0**0 debate recently in edu-sig...

Not quite yet:  there are good domain-specific reasons for wanting 0**0 to
do one of {return 0, return 1, raise an exception}.  In the

    '' in str

case we know that returning True can cause problems, but haven't seen an
example where returning True is useful.

From  Mon Aug  5 20:43:49 2002
From: (Steve Holden)
Date: Mon, 5 Aug 2002 15:43:49 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <> <> <>              <042501c23cb0$302a62b0$>  <>
Message-ID: <048101c23cb8$6e0e86d0$>

> > [GvR]
> > > > I personally see no way to defend ('' in 'x') returning false;
> > > > it's so clearly a substring that any definition of
> > > > substring-ness that excludes this seems mathematically wrong,
> > > > despite your good intentions.
> [SteveH]
> > If you are serious about this proposal then clearly it would be as
> > well to have "in" agree with find(), and currently
> > anystring.find('') returns zero, suggesting the null string first
> > appears at the beginning.
> Yes, consistency strongly suggests that.
And of course, that wouldn't be a foolish consistency :-)

> > > However, the backwards compatibility argument makes sense.  It used to
> > > raise an exception and it would probably break code if it stopped
> > > doing so; longer strings are much less likely to be passed by accident
> > > so the need for the exception there is less strong.  I'm of two minds
> > > on this now...
> >
[ ... ]
> > Why do we need another way to do what find() and index() already do?
> You must've missed the earlier thread -- it's because a substring test
> is a common operation and the way to spell it with find() requires you
> to tack on ">= 0" which many people accidentally leave out when in a
> hurry.
Nope, I didn't miss it. As I said, I just found it hard to believe this was
a serious discussion.

> > Should we also ensure that
> >
> >     for s in "abc":
> >         print s
> >
> > prints
> >
> >     a
> >     ab
> >     abc
> >     b
> >     bc
> >     c
> >
> > Should it also print a blank line because "'' in anystring" is true? I can
> > see why users might want to be able to use a "string in string" construct,
> > but it would seem to confuse the "for" semantics. Is there some other
> > construct for which
> >
> >     for v in object_or_instance:
> >
> > does not assign to v all x such that "x in object_or_instance" is true? I
> > can see a few teaching problems here.
> To this latter example I can only say, "A foolish consistency is the
> hobgoblin of little minds."
Of course the string has always been an anomalous sequence anyway, but it
seems to be becoming less of a sequence.

> At least this still holds (unless x is an iterator or otherwise
> mutated by access :-):
>   for v in x:
>      assert v in x
Indeed. A rather weaker assertion, though. Anyhoo, no other arguments
against s1 in s2, so I'll make one parting comment. While I understand
perfectly well the pragmatic case for this change, it appears to blur the
borders between set membership and subsetting; if it's so desirable, why
didn't the need arise earlier?

Steve Holden                       
Python Web Programming      

From  Mon Aug  5 20:54:57 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 15:54:57 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 15:43:49 EDT."
References: <> <> <> <042501c23cb0$302a62b0$> <>
Message-ID: <>

> While I understand perfectly well the pragmatic case for this
> change, it appears to blur the borders between set membership and
> subsetting; if it's so desirable, why didn't the need arise
> earlier?

Couldn't be done before __contains__ was a separately overloadable
operator.  That's relatively recent (Python 2.0).  And playing with it
in innovative ways is even more recent (Python 2.2, for "has_key").
But the satisfaction that spelling "has_key" as "in" gives me suggests
that there's more potential to it.

--Guido van Rossum (home page:

From  Mon Aug  5 21:03:09 2002
From: (Gordon McMillan)
Date: Mon, 5 Aug 2002 16:03:09 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <3D4EA1BD.11180.3FCC867A@localhost>

On 5 Aug 2002 at 15:27, Tim Peters wrote:

> >> and I expect *everyone* here has been saved more
> >> than once by that
> >>
> >>     '' in 'xyz'
> >> currently raises an exception.

Not that I can recall.

The exception, however, is a TypeError saying the left
operand isn't a character. It's not a
TrueButYourProbablyMakingAMistakeException <wink>.
> I'd like to see a plausible use case for
>     '' in str
> returning True, then.  

Any code that currently does 
 str.find(x) >= 0

I tend to use:

 pos = str.find(x)
 if pos > -1:

because I'm normally interested in where.
If it's a pure membership test, I tend not
to use a string but a tuple:

 if c in ('a', 'b', 'c'):

This is, at least partially, because "character"
is not an official Python type so I always expect
 str1 in str2 
to work when it only sometimes does.
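
The find()-based spellings can be compared side by side. A minimal Python 3 sketch; the function name find_truthiness and the dictionary keys are made up for illustration.

```python
def find_truthiness(haystack, needle):
    """Compare several ways of turning a find() result into a yes/no answer."""
    pos = haystack.find(needle)
    return {
        "pos": pos,
        "bare_truth_test": bool(pos),   # misleading: -1 is truthy, 0 is falsy
        "explicit_compare": pos > -1,   # the `pos > -1` spelling used above
        "in_operator": needle in haystack,
    }

print(find_truthiness("abc", "a"))
print(find_truthiness("abc", "z"))
```

A match at index 0 and a miss are exactly the two cases where testing the bare find() result gives the opposite of the intended answer.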

-- Gordon

From  Mon Aug  5 21:03:39 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 16:03:39 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <048101c23cb8$6e0e86d0$>
Message-ID: <>

[Steve Holden]
> ...
> While I understand perfectly well the pragmatic case for this change, it
> appears to blur the borders between set membership and subsetting; if
> it's so desirable, why didn't the need arise earlier?

Mostly because the possibility for a type to define a __contains__
implementation didn't used to exist.  Now that any type can define "x in y"
to do what makes most sense for its instances, the rationale for strings
retaining strained (for strings) "I'm just a sequence, you see, exactly like
any other sequence" __contains__ semantics has grown much weaker.
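
As an illustration of a type choosing its own "x in y" behavior, here is a sketch of a hypothetical wrapper class, StrictText, that does a substring test but raises for an empty left operand. It is one of the options debated in this thread, written for Python 3, and makes no claim about what str itself does.

```python
class StrictText:
    """Hypothetical str wrapper with its own `in` semantics."""

    def __init__(self, value):
        self.value = value

    def __contains__(self, piece):
        # Refuse the empty string instead of silently answering True.
        if piece == "":
            raise ValueError("empty string on the left of 'in'")
        # Otherwise behave like a substring test spelled with find().
        return self.value.find(piece) >= 0

doc = StrictText("abc")
print("ab" in doc, "zz" in doc)
```

The same hook would let another type answer True for the empty string; the language only requires that __contains__ return something usable as a truth value.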

From  Mon Aug  5 21:08:43 2002
From: (Martin v. Loewis)
Date: 05 Aug 2002 22:08:43 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> Charmap entries can currently be None, an integer or a unicode string. I
> suggest adding another option: a function or other callable.

That helps only for a subset of all codecs (the charmap based ones),
and thus is unacceptable. I want it to work for, say, big5 also.


From  Mon Aug  5 21:15:41 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 16:15:41 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <3D4EA1BD.11180.3FCC867A@localhost>
Message-ID: <>

> I'd like to see a plausible use case for
>     '' in str
> returning True, then.

[Gordon McMillan]
> Any code that currently does
>  str.find(x) >= 0

You're saying that you actually do that in cases where x may be an empty
string, and that it's useful to get a True result in at least one such case?
If you are saying that, it needs more details; but if you're not saying
that, it's not a relevant use case.

> I tend to use:
>  pos = str.find(x)
>  if pos > -1:
> because I'm normally interested in where.

Sure -- that's what .find() is for, after all.  But you're also saying that
your algorithms expect to search for empty strings?  Like in:

    index = option_letter_string.find(letter)
    if index >= 0:
        list_of_option_functions[index]()
    else:
        raise UnknownOptionLetter(letter)

you make sure that list_of_option_functions[0] is suitable for processing
both the first option in option_letter_string and an empty "option letter"?
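
To make the concern concrete, here is a runnable Python 3 sketch of that dispatch. The names option_letter_string, list_of_option_functions and UnknownOptionLetter come from the example above; the option letters, the handler bodies and the dispatch_checked helper are assumptions added for illustration.

```python
class UnknownOptionLetter(Exception):
    pass

# Hypothetical option table: one handler per letter, in the same order.
option_letter_string = "vqh"
list_of_option_functions = [
    lambda: "verbose",
    lambda: "quiet",
    lambda: "help",
]

def dispatch(letter):
    index = option_letter_string.find(letter)
    if index >= 0:
        return list_of_option_functions[index]()
    raise UnknownOptionLetter(letter)

def dispatch_checked(letter):
    # Reject anything that is not exactly one character before searching.
    if len(letter) != 1:
        raise UnknownOptionLetter(letter)
    return dispatch(letter)

print(dispatch("q"), dispatch(""))
```

Because find("") returns 0, an empty "option letter" quietly runs the first handler unless the caller checks the length first.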

From  Mon Aug  5 21:18:37 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 16:18:37 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Neil Schemenauer]
> I vote for the former.

[Another Neil Schemenauer]
> D'oh.  I meant the LATTER (e.g. raise an error for an empty LHS).

Damn -- too bad your votes cancelled out.  Next time just give me your proxy

From  Mon Aug  5 21:44:04 2002
From: (Oren Tirosh)
Date: Mon, 5 Aug 2002 23:44:04 +0300
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>; from on Mon, Aug 05, 2002 at 10:08:43PM +0200
References: <> <>
Message-ID: <>

On Mon, Aug 05, 2002 at 10:08:43PM +0200, Martin v. Loewis wrote:
> Oren Tirosh <> writes:
> > Charmap entries can currently be None, an integer or a unicode string. I
> > suggest adding another option: a function or other callable.
> That helps only for a subset of all codecs (the charmap based ones),
> and thus is unacceptable. I want it to work for, say, big5 also.

With the ability to embed functions inside a charmap big5 and other encodings
could be converted to be charmap based, too :-)

I just feel that there must be *some* simpler way. A patch with 87k of code 
scares the hell out of me. 

"There are no complex things. Only things that I haven't yet understood 
why they are really simple."


From  Mon Aug  5 21:50:20 2002
From: (Gordon McMillan)
Date: Mon, 5 Aug 2002 16:50:20 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <3D4EA1BD.11180.3FCC867A@localhost>
Message-ID: <3D4EACCC.31934.3FF7B782@localhost>

On 5 Aug 2002 at 16:15, Tim Peters wrote:

> [Tim]
> > I'd like to see a plausible use case for
> >
> >     '' in str
> >
> > returning True, then.
> [Gordon McMillan]
> > Any code that currently does
> >  str.find(x) >= 0
> You're saying that you actually do that in cases
> where x may be an empty string, and that it's useful
> to get a True result in at least one such case? 

What I'm really saying is that I almost never use
 x in str
because its semantics have always been peculiar.
Thus, I don't *really* care whether '' in str raises
an exception, because if it does, I won't train myself
to use it <wink>.


> Sure -- that's what .find() is for, after all.  But
> you're also saying that your algorithms expect to
> search for empty strings?  Like in:
>     index = option_letter_string.find(letter)
>     if index >= 0:
>         list_of_option_functions[index]()
>     else:
>         raise UnknownOptionLetter(letter)
> you make sure that list_of_option_functions[0] is
> suitable for processing both the first option in
> option_letter_string and an empty "option letter"?

Say we have a sequence of objects where obj.options
uses a string to hold (orthogonal) option codes. We're
selecting a subset based on the user's criteria, and
empty means "don't care".

 for obj in seq:
   if obj.options.find(criteria):

makes perfect sense.

I rather doubt I have code in that exact
form, because I'd probably special case
it if it were that obvious.

if not criteria:
  return seq
for obj in seq:

OTOH, I use find() a lot, and since I can't
recall having been bit by find('') returning 0, I
have to conclude that the mystically / mathematically
correct answer is, in my case at least, also
the pragmatically correct one.

But you solved a similar problem once
already, by noting that a large quantity had
to have at least 537 objects in it.

-- Gordon

From  Mon Aug  5 21:59:54 2002
From: (Barry A. Warsaw)
Date: Mon, 5 Aug 2002 16:59:54 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <>
Message-ID: <>

>>>>> "NN" == Neal Norwitz <> writes:

    NN> Here's a patch:

Updated with a few fixes and nits, and some additional tests.

All that's left is the documentation.

From  Mon Aug  5 22:03:33 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 17:03:33 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <> <> <> <042501c23cb0$302a62b0$> <> <048101c23cb8$6e0e86d0$> <>
Message-ID: <>

Guido van Rossum <>:
> But the satisfaction that spelling "has_key" as "in" gives me suggests
> that there's more potential to it.

Not a trivial datum.  Tools that feel good in the hand are not mere
self-indulgence; they promote relaxation and creativity in the user. 
		Eric S. Raymond

From  Mon Aug  5 22:06:25 2002
From: (Martin v. Loewis)
Date: 05 Aug 2002 23:06:25 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> With the ability to embed functions inside a charmap big5 and other
> encodings could be converted to be charmap based, too :-)

This is precisely what PEP 293 does: it allows embedding functions in any
codec.

> I just feel that there must be *some* simpler way. 

Why do you think so? It is not difficult.

> A patch with 87k of code scares the hell out of me.

Ah, so it is the size of the patch? Some of it could be moved to
Python perhaps, thus reducing the size of the patch (e.g. the registry
comes to mind)

If you look at the patch, you see that it precisely does what you
propose to do: add a callback to the charmap codec:

- it deletes charmap_decoding_error
- it adds state to feed the callback function
- it replaces the old call to charmap_decoding_error by

! 	    outpos = p-PyUnicode_AS_UNICODE(v);
! 	    startinpos = s-starts;
! 	    endinpos = startinpos+1;
! 	    if (unicode_decode_call_errorhandler(
! 		 errors, &errorHandler,
! 		 "charmap", "character maps to <undefined>",
! 		 starts, size, &startinpos, &endinpos, &exc, &s,
! 		 (PyObject **)&v, &outpos, &p)) {

  (original code was)

! 	    if (charmap_decoding_error(&s, &p, errors, 
! 				       "character maps to <undefined>")) {

- likewise for encoding.

Now, apply the same change to all other codecs (as you propose to do
for big5), and you obtain the patch for PEP 293.

In doing so, you find that the modifications needed for each codec are
so similar that you add some supporting infrastructure, and correct
errors in the existing codecs that you spot, and so on. 

The diffstat is

 Include/codecs.h        |   37
 Include/pyerrors.h      |   67 +
 Lib/           |    5
 Modules/_codecsmodule.c |   61 +
 Objects/stringobject.c  |    7
 Objects/unicodeobject.c | 1794 +++++++++++++-------!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 Python/codecs.c         |  399 ++++++++++
 Python/exceptions.c     |  603 ++++++++++++++++
 8 files changed, 1678 insertions(+), 236 deletions(-), 1059 modifications(!)

If you look at the large blocks of new code, you find that it is in

- charmap_encoding_error, which insists on implementing known error
  handling algorithms inline,

- the default error handlers, of which at least
  PyCodec_XMLCharRefReplaceErrors should be pure-Python

- PyCodec_BackslashReplaceErrors, likewise,

- the UnicodeError exception methods (which could be omitted, IMO).

So, if you look at the patch, it isn't really that large.
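
From the Python side, an error-handling callback of the kind discussed here can be sketched as follows. This assumes the interface available in current Python 3 (codecs.register_error, plus the built-in "xmlcharrefreplace" and "backslashreplace" handlers), not the exact state of the patch under review. The handler name "bracket_demo" and the function bracket_replace are made up for illustration.

```python
import codecs

def bracket_replace(exc):
    """Replace each unencodable character with a [U+XXXX] marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc                         # this sketch only handles encoding
    bad = exc.object[exc.start:exc.end]   # the offending slice of the input
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    return replacement, exc.end           # resume encoding after the bad slice

# Register the callback under a name usable as the `errors` argument.
codecs.register_error("bracket_demo", bracket_replace)

encoded = "caf\u00e9 \u20ac".encode("ascii", errors="bracket_demo")

# Built-in handlers that follow the same callback protocol.
xml_style = "caf\u00e9".encode("ascii", errors="xmlcharrefreplace")
backslash_style = "caf\u00e9".encode("ascii", errors="backslashreplace")

print(encoded, xml_style, backslash_style)
```

The same handler name works with any codec that reports errors through this mechanism, which is the argument made above against a charmap-only hook.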


From  Mon Aug  5 22:04:53 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 17:04:53 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

Andrew Koenig <>:
> Eric> Raise an exception.  Definitely.  There is no reason to follow
> Eric> find() rigidly when the whole point is to have semantics
> Eric> different from find().  Besides, you're right to point out that
> Eric> changing this behavior could break existing code, and that is a
> Eric> big no-no.
> Changing the meaning of ('ab' in 'abc') might also break existing code.

I could construct a try/except case that would change, yes.  Are you
being pedantic, or is this intended as a serious objection?
		Eric S. Raymond

From  Mon Aug  5 22:07:12 2002
From: (Andrew Koenig)
Date: Mon, 5 Aug 2002 17:07:12 -0400 (EDT)
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <> (message from Eric S Raymond
 on Mon, 5 Aug 2002 17:04:53 -0400)
References: <> <> <> <> <>
Message-ID: <>

>> Changing the meaning of ('ab' in 'abc') might also break existing code.

Eric> I could construct a try/except case that would change, yes.  Are you
Eric> being pedantic, or is this intended as a serious objection?

I think it's very nearly as serious as the objection that changing
the meaning of ('' in 'abc') might break code.

The reason for the "very nearly" is that it is easier to obtain empty
strings by accident than it is to obtain nonempty ones.

From  Mon Aug  5 22:23:27 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 17:23:27 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: Your message of "Mon, 05 Aug 2002 17:04:53 EDT."
References: <> <> <> <>
Message-ID: <>

> > Eric> Raise an exception.  Definitely.  There is no reason to follow
> > Eric> find() rigidly when the whole point is to have semantics
> > Eric> different from find().  Besides, you're right to point out that
> > Eric> changing this behavior could break existing code, and that is a
> > Eric> big no-no.

> Andrew Koenig <>:
> > Changing the meaning of ('ab' in 'abc') might also break existing code.

> I could construct a try/except case that would change, yes.  Are you
> being pedantic, or is this intended as a serious objection?

Andrew appears to say that if you object against '' in 'abc' not
raising an exception, you should also object against the other one;
but his real point is the corollary: since you don't object against
giving 'ab' in 'abc' new meaning, you shouldn't object against a new
meaning for '' in 'abc' either -- at least not based on the argument
of breaking code.  Whenever we say that a change doesn't break code,
we almost always imply "except code that depends on a particular thing
raising an exception".

That '' in 'abc' or 'ab' in 'abc' raises TypeError tells me that it is
okay to change this behavior into doing something useful, *if* we have
a useful thing to substitute for the exception.

Tim is arguing that '' in 'abc' is not a useful question to ask.  The
usefulness of the exception is not that it's a feature on which
correct programs depend, but that it's an early warning that your
program is broken.  Losing that early warning sign would mean more
time wasted debugging.

OTOH I'm worried that some code doing some mathematical proof using
substring relationships would find it irritating to have to work
around the irregularity.  But I admit that this is a purely
theoretical fear for now.

--Guido van Rossum (home page:

From  Mon Aug  5 22:24:39 2002
From: (Jack Jansen)
Date: Mon, 5 Aug 2002 23:24:39 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
Message-ID: <>

On maandag, augustus 5, 2002, at 05:47 , M.-A. Lemburg wrote:

> Jack Jansen wrote:
>> Having to register the error handler first and then finding it 
>> by name smells like a very big hack to me. I understand the 
>> reasoning (that you don't want to modify the API of a 
>> gazillion C routines to add an error object argument) but it 
>> still seems like a hack....
> Well, in that case, you would have to call the whole codec registry
> a hack ;-)

No, not really. For codecs I think that there needn't be much of 
a connection between the codec-supplier and the codec-user. 
Conceivably the encoding-identifying string being passed to 
encode() could even have been read from a data file or something.

For error handling this is silly: the code calling encode() or 
decode() will know how it wants errors handled. And if you argue 
that it isn't really error handling but an extension to the 
encoding name then maybe it should be treated as such (by 
appending it to the codec name in the string, as in 
"ascii;xmlentitydefs" or so?).
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -
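
For readers following the register-then-look-up-by-name design debated
above, here is a minimal sketch of how it looks from the caller's side,
written against the codecs.register_error API that PEP 293 introduced.
The handler name "demo-xmlcharref" and the function xml_charref are made
up for illustration; the handler is a simplified pure-Python analogue of
character-reference replacement, not the code from the patch.

```python
import codecs


def xml_charref(exc):
    """Hypothetical handler: replace unencodable characters with &#N; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch handles encoding errors only
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which encoding resumes.
    return replacement, exc.end


# The caller and the handler are connected only through this name.
codecs.register_error("demo-xmlcharref", xml_charref)

encoded = "caf\u00e9".encode("ascii", "demo-xmlcharref")
print(encoded)
```

The string passed as the errors argument is the only link between the
code that calls encode() and the handler, which is the indirection Jack
is objecting to.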

From  Mon Aug  5 22:29:35 2002
From: (Andrew Koenig)
Date: 05 Aug 2002 17:29:35 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Guido> Andrew appears to say that if you object against '' in 'abc' not
Guido> raising an exception, you should also object against the other one;
Guido> but his real point is the corollary: since you don't object against
Guido> giving 'ab' in 'abc' new meaning, you shouldn't object against a new
Guido> meaning for '' in 'abc' either -- at least not based on the argument
Guido> of breaking code.  Whenever we say that a change doesn't break code,
Guido> we almost always imply "except code that depends on a particular thing
Guido> raising an exception".


Guido> Tim is arguing that '' in 'abc' is not a useful question to ask.  The
Guido> usefulness of the exception is not that it's a feature on which
Guido> correct programs depend, but that it's an early warning that your
Guido> program is broken.  Losing that early warning sign would mean more
Guido> time wasted debugging.


Guido> OTOH I'm worried that some code doing some mathematical proof using
Guido> substring relationships would find it irritating to have to work
Guido> around the irregularity.  But I admit that this is a purely
Guido> theoretical fear for now.

Also yes.

On the other hand, I have a practical fear: There are lots of
different ways of asking whether a string s contains a substring s1.
If those ways behave in diverse manners when s1 is empty, I am going
to have to remember which way to obtain which behavior.  I would
really like to avoid having to do that.

Andrew Koenig,,
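
To make the "lots of different ways" concrete, the sketch below asks the
same containment question several ways with an empty s1. It reflects how
current Python 3 behaves, which is an assumption relative to the
interpreter being discussed in this thread; the variable names are
illustrative only.

```python
s, s1 = "abc", ""

# Several spellings of "does s contain s1?" (or "where?"), with s1 empty.
results = {
    "in": s1 in s,
    "find": s.find(s1),
    "index": s.index(s1),
    "count": s.count(s1),
    "startswith": s.startswith(s1),
    "endswith": s.endswith(s1),
}

for name, value in results.items():
    print(name, value)
```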

From  Mon Aug  5 22:38:09 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 17:38:09 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> Andrew appears to say that if you object against '' in 'abc' not
> raising an exception, you should also object against the other one;
> but his real point is the corollary: since you don't object against
> giving 'ab' in 'abc' new meaning, you shouldn't object against a new
> meaning for '' in 'abc' either -- at least not based on the argument
> of breaking code.

I understand.  But there is a difference between changes that seem likely to
silently break a lot of things and changes for which one almost has to
contrive an example that would break.  I think this one is in the latter
category.

>                Whenever we say that a change doesn't break code,
> we almost always imply "except code that depends on a particular thing
> raising an exception".


> That '' in 'abc' or 'ab' in 'abc' raises TypeError tells me that it is
> okay to change this behavior into doing something useful, *if* we have
> a useful thing to substitute for the exception.

Also agreed; I parallel your reasoning as well as your conclusion, and
in fact thought this issue through before I raised the possibility

> Tim is arguing that '' in 'abc' is not a useful question to ask.  The
> usefulness of the exception is not that it's a feature on which
> correct programs depend, but that it's an early warning that your
> program is broken.  Losing that early warning sign would mean more
> time wasted debugging.

Yes.  Best for things to fail noisily if they're going to fail.

> OTOH I'm worried that some code doing some mathematical proof using
> substring relationships would find it irritating to have to work
> around the irregularity.  But I admit that this is a purely
> theoretical fear for now.

This doesn't concern me, and I used to be a mathematical logician
myself.  Don't worry about my ex-colleagues -- you're designing a tool
for programming, not a formalism for doing proof theory.
		<a href="">Eric S. Raymond</a>

From  Mon Aug  5 22:49:14 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 17:49:14 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Andrew Koenig]
> On the other hand, I have a practical fear: There are lots of
> different ways of asking whether a string s contains a substring s1.
> If those ways behave in diverse manners when s1 is empty, I am going
> to have to remember which way to obtain which behavior.  I would
> really like to avoid having to do that.

I don't count that as "a practical fear" unless you actually search for
empty strings, and I don't believe that you do (or at least not on
purpose -- you can change my mind in a hurry by posting your Python code
that does do so, though!).  If searching for empty strings isn't something
you do, then all methods of asking about substrings yield the same outcome.

This isn't, e.g., SNOBOL4, where matching against a pattern variable that
sometimes contains a null pattern can be useful for its control-flow side
effects.  These are "just strings" in Python, and searching is just
searching.

From  Mon Aug  5 23:10:27 2002
From: (Eric S Raymond)
Date: Mon, 5 Aug 2002 18:10:27 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> This isn't, e.g., SNOBOL4, where matching against a pattern variable that
> sometimes contains a null pattern can be useful for its control-flow side
> effects.  These are "just strings" in Python, and searching is just
> searching.

<voice accent="Viennese"> And sometimes, a cigar is just a cigar. </voice>
		<a href="">Eric S. Raymond</a>

From  Mon Aug  5 23:30:56 2002
From: (Andrew Koenig)
Date: Mon, 5 Aug 2002 18:30:56 -0400 (EDT)
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <> (message from
 Tim Peters on Mon, 05 Aug 2002 17:49:14 -0400)
References: <>
Message-ID: <>

Tim> I don't count that as "a practical fear" unless you actually
Tim> search for empty strings, and I don't believe that you do (or at
Tim> least not on purpose -- you can change my mind in a hurry by
Tim> posting your Python code that does do so, though!).  If searching
Tim> for empty strings isn't something you do, then all methods of
Tim> asking about substrings yield the same outcome.

Unless you're trying to teach the language to someone else, in which
case you have to explain the behavior regardless of whether you've
written programs that depend on it.

I doubt you've ever written a program that searches for the string
'asoufnyqcynreqywrycq98746qwh', yet I imagine that you would still
object to a search function that throws an exception when presented
with that particular string.

I'm not trying to be flip here -- I'm trying to make the point that in
my opinion, having a uniform rule is preferable to catching particular
cases that are sometimes mistakes.

From  Mon Aug  5 23:49:03 2002
From: (Aahz)
Date: Mon, 5 Aug 2002 18:49:03 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Mon, Aug 05, 2002, Andrew Koenig wrote:
> I'm not trying to be flip here -- I'm trying to make the point that in
> my opinion, having a uniform rule is preferable to catching particular
> cases that are sometimes mistakes.

It's not so much that '' in 'abc' is a mistake as that there's no
sensible answer to be given.  When Python can't figure out how to
deliver a sensible answer, it raises an exception: "In the face of
ambiguity, refuse the temptation to guess."
Aahz (           <*>

Project Vote Smart:

From  Mon Aug  5 19:14:07 2002
From: (Jeremy Hylton)
Date: Mon, 5 Aug 2002 14:14:07 -0400
Subject: [Python-Dev] framer tool
In-Reply-To: <15694.63696.655874.808626@localhost.localdomain>
References: <>
Message-ID: <>

>>>>> "SM" == Skip Montanaro <> writes:

  [From a checkin that I made recently of Tools/framer]
  >>> framer is a tool to generate boilerplate code for C extension
  >>> types.

  Jack> how does framer relate to modulator? Is it a replacement?
  Jack> Should modulator be adapted to framer? (And, if so, who's
  Jack> going to do it? :-)

Framer could be a replacement for modulator.  The original impetus for
framer came from Jim Fulton, who suggested that modulator be updated
so that it could be used for C extension types.

I thought that Zope-style interfaces would be a nice way to specify
the signatures of the extension module and types.  Since modulator
didn't handle the specifications or the new 2.2/2.3 features, I didn't
really look at it.

Should I try to make framer a modulator replacement?  I've got some
time to work on it, but checked in the current progress in hopes of
finding more help.

  SM> How does framer relate to Pyrex?

Pyrex is a tool to generate a complete C module from a variant of
Python source.  Framer is a tool to generate just the boilerplate --
the frame.  Framer is intended to support people who are going to
maintain a C extension by hand.  The code it generates is easy to read
and edit.  I wouldn't want to read the Pyrex-generated C code.

Pyrex is intended for converting existing Python code to C, for
performance.  (I think.)  Framer is intended for C programmers who
don't want to type all the boilerplate for an extension.  In some
ways, it's closer to SWIG than to Pyrex.

I think there is a common subset of functionality to Pyrex, SWIG, and
Framer -- namely generating the basic wrapper code to make C code
callable from Python.  It might be worthwhile to share that code among
the projects; Greg certainly seems to have covered a lot of ground
handling __methods__ with Pyrex.


From  Tue Aug  6 00:17:04 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 19:17:04 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Andrew Koenig]
> Unless you're trying to teach the language to someone else, in which
> case you have to explain the behavior regardless of whether you've
> written programs that depend on it.

I would tell them that searching for an empty string is silly, that they're
never going to need to do it, but if they do then they should consider
whatever happens an accident.

> I doubt you've ever written a program that searches for the string
> 'asoufnyqcynreqywrycq98746qwh', yet I imagine that you would still
> object to a search function that throws an exception when presented
> with that particular string.

Of course, but I can easily conceive of *wanting* to search for
'asoufnyqcynreqywrycq98746qwh'.  Indeed, I just did a grep over my Python
code to be sure that I never had searched for it before <wink>.  But I can't
conceive of wanting to search for an empty string, despite effort after
suspension of disbelief.

> I'm not trying to be flip here -- I'm trying to make the point that in
> my opinion, having a uniform rule is preferable to catching particular
> cases that are sometimes mistakes.

The distinction between empty and non-empty is the only one being made here,
and (unlike picking on 'asouf'etc) is a natural distinction in its domain.

From  Tue Aug  6 00:27:50 2002
From: (Patrick K. O'Brien)
Date: Mon, 5 Aug 2002 18:27:50 -0500
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

> >
> > I'm not trying to be flip here -- I'm trying to make the point that in
> > my opinion, having a uniform rule is preferable to catching particular
> > cases that are sometimes mistakes.
> It's not so much that '' in 'abc' is a mistake as that there's no
> sensible answer to be given.  When Python can't figure out how to
> deliver a sensible answer, it raises an exception: "In the face of
> ambiguity, refuse the temptation to guess."

In what way does find('') return a sensible answer?

>>> 'help'.find('')
0
>>> 'help'.find('h')
0
>>> 'help'.find('e')
1
>>> 'help'.find('l')
2
>>> 'help'[0]
'h'
>>> 'help'[1]
'e'
>>> 'help'[2]
'l'
>>> s = 'help'
>>> s[s.find('')]
'h'
>>> s[s.find('h')]
'h'

I don't see the logic in this and I couldn't find anything in the docs to
explain this behavior. I'm guessing this is old hat for most of you, but I
find this a bit surprising myself.

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Tue Aug  6 00:27:30 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 19:27:30 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

> It's not so much that '' in 'abc' is a mistake as that there's no
> sensible answer to be given.


    s2.find(s1)

returns the smallest non-negative int i such that

    s2[i : i+len(s1] == s1

provided such an i exists.  That's as sensible as any answer, and more
sensible than most <wink>, if you have to give a meaning when s1 is empty.
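
The definition above can be written out directly. This is a sketch of
the stated rule only, not of how find() is actually implemented; the
function name find_by_definition is made up for illustration.

```python
def find_by_definition(s2, s1):
    """Return the smallest i >= 0 with s2[i:i+len(s1)] == s1, else -1."""
    for i in range(len(s2) + 1):
        if s2[i:i + len(s1)] == s1:
            return i
    return -1


# Compare the sketch against the built-in for a few needles,
# including the empty string.
for needle in ("", "h", "el", "p", "x"):
    print(repr(needle), find_by_definition("help", needle), "help".find(needle))
```

With s1 empty the slice s2[0:0] is already equal to s1, so the loop
stops at i == 0, which is the sense in which 0 "falls out of the
formula".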

> When Python can't figure out how to deliver a sensible answer,

Well, i==0 isn't a *compelling* answer when s1=="".  "It falls out of the
formula" is about the best that can be said for it.

> it raises an exception: "In the face of ambiguity, refuse the temptation
> to guess."

That's pretty much my view.  The user has just given us reason to doubt they
know what their program is doing, and I'd rather be *helpful* then than push
on in the interest of purity.

The most plausible use case I've been able to dream up is representing small
finite sets as sorted strings of characters.  Then having

    s1 in s2

raise an exception when s1 is "" doesn't do the right thing for "the empty
set".  OTOH, it doesn't do the right thing in most other cases either, like

    "ac" in "abc" -> False

so it's hard to get too upset about the empty set failing <wink>.
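
A sketch of why the substring test is the wrong tool for the
sets-as-sorted-strings use case: a subset test has to look at the
characters individually. The helper is_subset is hypothetical, and it
uses the built-in set type, which is an assumption relative to the
Python version under discussion.

```python
def is_subset(s1, s2):
    """Treat both strings as sets of characters and test s1 <= s2."""
    return set(s1) <= set(s2)


pairs = [("ac", "abc"), ("", "abc"), ("ad", "abc")]
for small, big in pairs:
    # Substring test on the left, character-set subset test on the right.
    print(repr(small), repr(big), small in big, is_subset(small, big))
```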

From  Tue Aug  6 00:36:48 2002
From: (Patrick K. O'Brien)
Date: Mon, 5 Aug 2002 18:36:48 -0500
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Patrick K. O'Brien]
> I don't see the logic in this and I couldn't find anything in the docs to
> explain this behavior. I'm guessing this is old hat for most of you, but I
> find this a bit surprising myself.

This one is even more fun. (Apologies in advance if I'm pouring salt on old
wounds.)

>>> s = 'help'
>>> s.rfind('')
4
>>> s[s.rfind('')]
Traceback (most recent call last):
  File "<input>", line 1, in ?
IndexError: string index out of range
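
The session above can be read as a slice-versus-index distinction. A
minimal sketch, assuming current Python 3 behaviour: the position that
rfind('') reports is valid as a slice boundary but not as an index.

```python
s = "help"
i = s.rfind("")

# Slicing at the reported position recovers the (empty) needle.
sliced = s[i:i + len("")]

# Indexing at the same position fails, because i == len(s).
try:
    s[i]
    indexed_ok = True
except IndexError:
    indexed_ok = False

print(i, repr(sliced), indexed_ok)
```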

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Tue Aug  6 00:33:37 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 19:33:37 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Patrick K. O'Brien]
> In what way does find('') return a sensible answer?
> >>> 'help'.find('')
> 0
> >>> 'help'.find('h')
> 0
> >>> 'help'.find('e')
> 1
> >>> 'help'.find('l')
> 2
> >>> 'help'[0]
> 'h'
> >>> 'help'[1]
> 'e'
> >>> 'help'[2]
> 'l'
> >>> s = 'help'
> >>> s[s.find('')]
> 'h'
> >>> s[s.find('h')]
> 'h'
> I don't see the logic in this

In what?  The meaning of "this" isn't clear.  Do you mean that not a single
one of those results makes sense to you, or that some particular ones don't
make sense to you?  If the latter case, which particular ones?

Note that searching for any prefix of 'help' returns 0:

>>> 'help'.find('help')
0
>>> 'help'.find('hel')
0
>>> 'help'.find('he')
0
>>> 'help'.find('h')
0
>>> 'help'.find('')
0

Of course '' is a prefix of any string whatsoever, so it's not like the
final result is of much use (it's more like "no information in, no
information out").

From  Tue Aug  6 00:35:01 2002
From: (Andrew Koenig)
Date: Mon, 5 Aug 2002 19:35:01 -0400 (EDT)
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <> (message from
 Tim Peters on Mon, 05 Aug 2002 19:17:04 -0400)
References: <>
Message-ID: <>

Tim> The distinction between empty and non-empty is the only one being made here,
Tim> and (unlike picking on 'asouf'etc) is a natural distinction in its domain.

Fair enough.

Nevertheless, you have not convinced me that this distinction
is useful in this context.

I will agree with you that (a) Many times, people search for literals
in strings, and (b) it is hard to imagine why an empty literal would
be useful.

However, that says nothing to me about why an expression of the form
(s in t) should be considered an error when s has no characters.
And it says even less about why it would be a good idea to have
the result of such a search yield different results in different contexts.

From  Tue Aug  6 00:49:45 2002
From: (Patrick K. O'Brien)
Date: Mon, 5 Aug 2002 18:49:45 -0500
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Tim Peters]
> Of course '' is a prefix of any string whatsoever, so it's not like the
> final result is of much use (it's more like "no information in, no
> information out").

I just never thought of a Python string as beginning and ending with a null.
So the fact that find('') and rfind('') both return something other than -1
was surprising to me. My "plain English" interpretation of the docstring for
find gave me the impression that if I searched for a single character and
got back an index other than -1 that I could then retrieve that character
from the string and it would equal the character used in the original

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Tue Aug  6 00:58:27 2002
From: (Greg Ewing)
Date: Tue, 06 Aug 2002 11:58:27 +1200 (NZST)
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: <>
Message-ID: <>

> Here's a hack.
> For static extensions, you could extend one of the extension structs,
> e.g. PyMappingMethods

This perhaps suggests a way of handling this in a more
general way in the future:

Add a slot to the typeobject which points to a variable-sized
array of pointers. There is one entry in the array for each
level of inheritance, and it points to a struct containing
whatever extra stuff you want to add at that level.

This would only handle single inheritance, but I think
that's all you can have at the C level anyway, isn't it?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Tue Aug  6 01:05:53 2002
From: (Patrick K. O'Brien)
Date: Mon, 5 Aug 2002 19:05:53 -0500
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Tim Peters]
>     s2.find(s1)
> returns the smallest non-negative int i such that
>     s2[i : i+len(s1)] == s1  # Fixed typo in original.
> provided such an i exists.  That's as sensible as any answer, and more
> sensible than most <wink>, if you have to give a meaning when s1 is empty.

That clarified things for me, thanks.

(But if you squint while thinking about finding single characters and using
the result to access the original string via index notation, rather than
slice, and you ignore the fact that '' isn't even a single character and you
do this late in the day... you might see why I wasn't seeing the logic of
'whatever'.find('') returning 0.)

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Tue Aug  6 01:06:04 2002
From: (Guido van Rossum)
Date: Mon, 05 Aug 2002 20:06:04 -0400
Subject: [Python-Dev] framer tool
In-Reply-To: Your message of "Mon, 05 Aug 2002 14:14:07 EDT."
References: <> <> <15694.63696.655874.808626@localhost.localdomain>
Message-ID: <>

>   Jack> how does framer relate to modulator? Is it a replacement?
>   Jack> Should modulator be adapted to framer? (And, if so, who's
>   Jack> going to do it? :-)
> Framer could be a replacement for modulator.  The original impetus for
> framer came from Jim Fulton, who suggested that modulator be updated
> so that it could be used for C extension types.
> I thought that Zope-style interfaces would be a nice way to specify
> the signatures of the extension module and types.  Since modulator
> didn't handle the specifications or the new 2.2/2.3 features, I didn't
> really look at it.

Jeremy points out that framer's *output* is different.  I'd like to
mention that framer's *input* is also different; Modulator is a GUI
tool, framer reads .py files.

--Guido van Rossum (home page:

From  Tue Aug  6 01:11:21 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 20:11:21 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Patrick K. O'Brien]
> I just never thought of a Python string as beginning and ending
> with a null.

Oh, it's worse than just *that*.  There's a null string at s[i:i] for every
value of i, although the *implementation* of find() seems flawed in this
respect; e.g.,

>>> 'abc'.find('', 3)
3

violates the doc's promise that the result returned (when not -1) is an
"index in s" (but 3 is not an index in 'abc'), while

>>> 'abc'.find('', 4)
-1

is anybody's guess ('' is certainly a substring of 'abc'[4:]).

However, when you're in the business of returning results that don't have
concrete meaning, things like this happen <wink>.
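
A small sketch of the two observations above, assuming current Python 3
behaviour: an empty string sits at s[i:i] for every position, and
find('', start) reports it for each start up to len(s) but not beyond.

```python
s = "abc"

# An empty string is found by slicing at every position, including len(s).
empty_at_every_position = all(s[i:i] == "" for i in range(len(s) + 1))

# Result of searching for '' from each start offset, one past len(s) included.
positions = [s.find("", start) for start in range(len(s) + 2)]

print(empty_at_every_position, positions)
```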

> So the fact that find('') and rfind('') both return something
> other than -1 was surprising to me.

Then you'll be glad to hear that we're going to make

    '' in 'abc'

return True too to help you build on your now-clear understanding <wink>.

From  Tue Aug  6 01:31:33 2002
From: (Greg Ewing)
Date: Tue, 06 Aug 2002 12:31:33 +1200 (NZST)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> But the satisfaction that spelling "has_key" as "in" gives me suggests
> that there's more potential to it.

I thought you'd always argued against this before, on
the grounds that the convenience wasn't worth the
inconsistency. Are you starting to change your mind?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Tue Aug  6 02:36:48 2002
From: (Gordon McMillan)
Date: Mon, 5 Aug 2002 21:36:48 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <3D4EACCC.31934.3FF7B782@localhost>
References: <>
Message-ID: <3D4EEFF0.27145.40FDFEA8@localhost>

On 5 Aug 2002 at 16:50, Gordon McMillan wrote:

> What I'm really saying is that I almost never use x
> in str because its semantics have always been
> peculiar. Thus, I don't *really* care whether '' in
> str raises an exception, because if it does, I
> won't train myself to use it <wink>. 

Turns out that's not true. When I want set membership,
I first write "char in ('a', 'b', 'c')", then
sometimes change it because "char in 'abc'" is more
efficient.

So whether '' in 'abc' will work or not is a red
herring. The real issue is that membership gets
conflated with subsetting.

-- Gordon

From  Tue Aug  6 03:18:31 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 22:18:31 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
Message-ID: <>

[Andrew Koenig]
> ...
> Nevertheless, you have not convinced me that this distinction
> is useful in this context.

That will have to wait until it burns you in practice.

> I will agree with you that (a) Many times, people search for literals
> in strings, and (b) it is hard to imagine why an empty literal would
> be useful.
> However, that says nothing to me about why an expression of the form
> (s in t) should be considered an error when s has no characters.

Nor should it -- literals are the simplest form of expression, used here
just for concreteness.  The question for you is, *however* the value of s
was obtained, if you end up doing "s in t" when s happens to be an empty
string, is it more likely that your program has strayed from your intent, or
that a result of True *was* your intent?

    if s[j+k1:j+k2] in t:

Assuming type correctness, if I know that raises an exception whenever k1 >=
k2, then I have confidence I know what the code is trying to do, and rest
easy knowing it won't do something nuts if the index expressions go crazy.
If instead it never(!) raises an exception, no matter what the values of j,
k1 and k2, this code scares me.
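
A sketch of the scenario described above, assuming current Python 3
behaviour, where an empty string is accepted on the left of "in". The
helper contains_nonempty is hypothetical; it restores the early warning
by rejecting an empty needle explicitly.

```python
def contains_nonempty(needle, haystack):
    """Hypothetical helper: substring test that refuses an empty needle."""
    if not needle:
        raise ValueError("empty search string")
    return needle in haystack


s, t = "abcdef", "xyz"
j, k1, k2 = 1, 3, 1           # index expressions gone wrong: k1 >= k2

piece = s[j + k1:j + k2]      # slices to the empty string
silent_result = piece in t    # accepted without complaint

try:
    contains_nonempty(piece, t)
    guarded_raised = False
except ValueError:
    guarded_raised = True

print(repr(piece), silent_result, guarded_raised)
```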

When Python switched to allowing negative indices as sequence subscripts (it
didn't always -- they used to raise exceptions), it introduced a nasty class
of bug caused by conceptually non-negative indices going negative by
mistake, but no longer complaining.  Overall I think negative indices added
enough expressiveness to outweigh that drawback, but it was far from a pure
win.  This is a case where we're also keen to make a formerly exceptional
operation "mean something", but there's one particular case of it where I
know doing so will create similar new problems -- and it's a case that's of
no *real* use to allow.
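
The class of bug described above can be sketched as follows. The names
names, keys, lookup and lookup_checked are made up for illustration: a
"not found" result of -1 from find() is a legal index, so the mistake
goes unnoticed unless it is checked explicitly.

```python
names = ["alpha", "beta", "gamma"]
keys = "abg"                  # one key character per entry in names


def lookup(key):
    pos = keys.find(key)      # -1 when key is absent
    return names[pos]         # -1 silently selects the last element


def lookup_checked(key):
    pos = keys.find(key)
    if pos < 0:
        raise KeyError(key)
    return names[pos]


wrong = lookup("z")           # no exception, but not a real match
try:
    lookup_checked("z")
    checked_raised = False
except KeyError:
    checked_raised = True

print(wrong, checked_raised)
```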

> And it says even less about why it would be a good idea to have
> the result of such a search yield different results in different
> contexts.

I agree that's not a good thing at all, and it may well win Guido in the
end.  I just hope he feels rotten about it, because the children will suffer
as a result <wink>.

From  Tue Aug  6 03:24:35 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 22:24:35 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> But the satisfaction that spelling "has_key" as "in" gives me suggests
> that there's more potential to it.

[Greg Ewing]
> I thought you'd always argued against this before, on
> the grounds that the convenience wasn't worth the
> inconsistency. Are you starting to change your mind?

This one's a done deal; it was released in 2.2:

>>> 2 in {2: 3}
1

Similarly, "for k in dict" is like "for k in dict.iterkeys()" in 2.2.  Guido
never changes his mind <wink>.
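
For reference, a minimal sketch of the dict behaviour mentioned above,
written for current Python 3, where the membership test yields a bool
and iterkeys() no longer exists. "in" tests keys, not values, and
iterating a dict yields its keys.

```python
d = {2: 3}

key_present = 2 in d          # membership is tested against the keys
value_present = 3 in d        # values are not consulted
keys_seen = [k for k in d]    # iteration walks the keys

print(key_present, value_present, keys_seen)
```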

From  Tue Aug  6 03:33:04 2002
From: (Tim Peters)
Date: Mon, 05 Aug 2002 22:33:04 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <3D4EEFF0.27145.40FDFEA8@localhost>
Message-ID: <>

[Gordon McMillan]
> Turns out that's not true. When I want set membership,
> I first write "char in ('a', 'b', 'c')", then
> sometimes change it because "char in 'abc'" is more
> efficient.

"char in dict_acting_as_a_set" is faster still, if you're really keen to
speed it.

> So whether '' in 'abc' will work or not is a red
> herring.

For your particular use, possibly.  If "char" is computed and may become
empty by mistake, then it's not a red herring (it's the difference between
getting True and getting an exception).

> The real issue is that membership gets conflated with subsetting.

For strings, yes, if you change it to "the membership meaning goes away
entirely in general, and a substring meaning replaces it".  If "char" is
computed and may become longer than one character by mistake, then in your
use something that used to raise an exception would instead return True or
False, depending on the data values.
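
A sketch of the difference being described, assuming current Python 3
semantics in which "in" on strings is a substring test. The names
VOWELS_STR and VOWELS_SET are illustrative; frozenset stands in for the
"dict acting as a set" mentioned above.

```python
VOWELS_STR = "aeiou"
VOWELS_SET = frozenset(VOWELS_STR)

# Candidate values for a computed "char", including two mistaken ones:
# the empty string and a two-character string.
candidates = ["a", "z", "", "ae"]

# Each row pairs the substring answer with the set-membership answer.
table = [(c, c in VOWELS_STR, c in VOWELS_SET) for c in candidates]

for row in table:
    print(row)
```

Against the set, the mistaken values are simply not members; against the
string, they are answered as substring queries.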

From  Tue Aug  6 03:33:45 2002
From: (Greg Ewing)
Date: Tue, 06 Aug 2002 14:33:45 +1200 (NZST)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> [Greg Ewing]
> > I thought you'd always argued against this before, on
> > the grounds that the convenience wasn't worth the
> > inconsistency. Are you starting to change your mind?
> This one's a done deal; it was released in 2.2:
> >>> 2 in {2: 3}
> 1

I'm talking about making "for x in string" do a substring
test. This is different from "for x in dict", because at
least the latter is still a kind of membership test.

I thought Guido was against having "in" do anything
other than membership tests, but his last message sounded
like he was changing his mind.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Tue Aug  6 04:25:18 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 05 Aug 2002 23:25:18 -0400
Subject: [Python-Dev] Pyrex praise [was: Re: framer tool]
In-Reply-To: <>
References: <>
Message-ID: <>

[Jeremy Hylton]

> Pyrex is intended for converting existing Python code to C, for
> performance.  (I think.)

Here is how I see Pyrex, and why I have much interest in it.

I do not see Pyrex as a tool whose main goal is converting existing Python
code to C, yet to some extent, it could be used with this goal in mind.
It is a tool for the programmer to express various interfaces between C and
Python, using a variant of the Python language augmented with C declarations,
_instead_ of the more usual C language augmented with macros and an API.
The fact that Pyrex produces C code along the road is only part of the
internal mechanics, but is not much of practical interest to the programmer.

Pyrex can be used to extend Python with C code, written in C for the
circumstance, or to wrap pre-existing C libraries.  Pyrex can also be used
to embed Python functions, written and interpreted by Python, within what
would otherwise be a pure C application.  I would guess that Pyrex could
also be used with Python only or (with proper care) with C only, and not
to build a Python-C interface, but these cases are probably not going to
be usual for me.

Just as it is generally easier and more comfortable to develop and
debug an algorithm or program in Python than in C, if only because
C forces you into many details of memory management, it is
much easier and more comfortable to develop and debug a Python-C
interface using Pyrex than using more traditional ways: you concentrate
on the interface without having to cautiously swim among reference counts
and the various and numerous API functions or macros.

A neat advantage of using Python instead of C to write your interface is that
you are much less likely to have bugs.  Pyrex knows how to break apart Python
structures and how to rebuild them, it takes care of properly maintaining
reference count invariants, etc. so as long as Pyrex is not itself buggy,
your interface is really on the safe side, bug-wise.  As Python-C interface
bugs might be painful to track down, this is a big incentive towards Pyrex.
Being allowed to forget (or avoid learning) all the details of the C API
is yet another good selling point for Pyrex: it would be a waste of
resources to have many members of a development team learn the C API for
Python, while I can expect everybody on a programming team to know Pyrex,
because the learning curve is so gentle.  Pyrex is more democratic! :-)

A final point, which looks important to me, is that any good wrapping of a
pre-existing C library is best done while giving a Python flavour to the
interface, if only for a nicer and more natural object orientation.
If the wrapping is done using C to express the interface, the effort of
programming the interface in C while adding more Python-typical paradigms
is complicated by the language distance between C and Python.  But as Pyrex
is very close to Python, Pyrex allows natural and speedy building of more
proper interfaces.  The Pyrex code itself, which holds the glue between
C and Python, is exactly the right place for implementing that necessary
layer meant to reshape the C facilities into Python ways.

François Pinard

From David Abrahams" <  Tue Aug  6 04:29:43 2002
From: David Abrahams" < (David Abrahams)
Date: Mon, 5 Aug 2002 23:29:43 -0400
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <>
Message-ID: <01fa01c23cfa$373cd7a0$>

From: "Greg Ewing" <>

> > Here's a hack.
> >
> > For static extensions, you could extend one of the extension structs,
> > e.g. PyMappingMethods
> This perhaps suggests a way of handling this in a more
> general way in the future:
> Add a slot to the typeobject which points to a variable-sized
> array of pointers. There is one entry in the array for each
> level of inheritance, and it points to a struct containing
> whatever extra stuff you want to add at that level.
> This would only handle single inheritance, but I think
> that's all you can have at the C level anyway, isn't
> it?

1. I'm pretty sure the answer to the above question is no
2. The scheme you propose is more costly in memory and cycles than I'd like


           David Abrahams * Boost Consulting *

From  Tue Aug  6 05:12:07 2002
From: (Andrew Koenig)
Date: 06 Aug 2002 00:12:07 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Tim>     s2.find(s1)

Tim> returns the smallest non-negative int i such that

Tim>     s2[i : i+len(s1] == s1

Tim> provided such an i exists.  That's as sensible as any answer, and more
Tim> sensible than most <wink>, if you have to give a meaning when s1 is empty.

Well, no -- you have to put the missing parenthesis in first.

Here it is.....   :-)
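
With the parenthesis restored, Tim's definition can be written out
directly.  This is only a sketch: the helper name find_by_definition is
made up for illustration, and returning -1 when no such i exists is
borrowed from the behaviour of str.find.

```python
def find_by_definition(s2, s1):
    """Return the smallest non-negative i with s2[i:i+len(s1)] == s1, else -1."""
    # i may equal len(s2) so that an empty s1 is found at the very end too.
    for i in range(len(s2) + 1):
        if s2[i:i + len(s1)] == s1:
            return i
    return -1


# An empty s1 matches at position 0, which is also what str.find reports.
print(find_by_definition("python", "thon"), "python".find("thon"))
print(find_by_definition("python", ""), "python".find(""))
```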

Andrew Koenig,,

From  Tue Aug  6 05:19:53 2002
From: (Andrew Koenig)
Date: 06 Aug 2002 00:19:53 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Tim> The question for you is, *however* the value of s was obtained,
Tim> if you end up doing "s in t" when s happens to be an empty
Tim> string, is it more likely that your program has strayed from your
Tim> intent, or that a result of True *was* your intent?

It isn't *the* question; it's *a* question.

Another question is whether adding additional complexity to the rules
helps or hurts in general.

Tim>     if s[j+k1:j+k2] in t:

Tim> Assuming type correctness, if I know that raises an exception whenever k1 >=
Tim> k2, then I have confidence I know what the code is trying to do, and rest
Tim> easy knowing it won't do something nuts if the index expressions go crazy.
Tim> If instead it never(!) raises an exception, no matter what the values of j,
Tim> k1 and k2, this code scares me.

Then why not remove your fear by executing

        assert k1 < k2
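
A small sketch of the case under discussion, assuming the semantics in
which the empty string counts as a substring of every string (which is
what current Python does).  The names s, t, j, k1 and k2 come from Tim's
example; the values and the helper checked_piece are invented here.

```python
s = "abcdefgh"
t = "xyz"
j, k1, k2 = 2, 3, 1            # index expressions gone wrong: k1 >= k2

piece = s[j + k1:j + k2]       # slicing never raises; the result is just ""
print(repr(piece))
print(piece in t)              # True, although t shares nothing with s


def checked_piece(s, j, k1, k2):
    # State the assumption explicitly instead of relying on "in" to complain.
    assert k1 < k2, "expected a non-empty slice"
    return s[j + k1:j + k2]


print(checked_piece(s, 2, 1, 3))
```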


Tim> When Python switched to allowing negative indices as sequence
Tim> subscripts (it didn't always -- they used to raise exceptions),
Tim> it introduced a nasty class of bug caused by conceptually
Tim> non-negative indices going negative by mistake, but no longer
Tim> complaining.  Overall I think negative indices added enough
Tim> expressiveness to outweigh that drawback, but it was far from a
Tim> pure win.  This is a case where we're also keen to make a
Tim> formerly exceptional operation "mean something", but there's one
Tim> particular case of it where I know doing so will create similar
Tim> new problems -- and it's a case that's of no *real* use to allow.

Well, we don't know that yet.  We just know that you haven't seen one.
And I must say that I don't expect (s1 in s2) to be all that common
an operation anyway when s1 and s2 are strings.

>> And it says even less about why it would be a good idea to have
>> the result of such a search yield different results in different
>> contexts.

Tim> I agree that's not a good thing at all, and it may well win Guido
Tim> in the end.  I just hope he feels rotten about it, because the
Tim> children will suffer as a result <wink>.

This whole issue feels to me like the way APL behaves when you ask it
for the number of elements in a scalar:  Instead of giving the obvious
answer (a scalar has 1 element), it gives a much deeper answer (the
number of elements in a scalar is an empty vector, because a scalar
has no dimensions).  That behavior bites novices all the time, but
I have encountered programs that become much simpler as a result.

Andrew Koenig,,

From  Tue Aug  6 05:24:52 2002
From: (Greg Ewing)
Date: Tue, 06 Aug 2002 16:24:52 +1200 (NZST)
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: <01fa01c23cfa$373cd7a0$>
Message-ID: <>

David Abrahams <>:

> > This would only handle single inheritance, but I think
> > that's all you can have at the C level anyway, isn't
> > it?
> 1. I'm pretty sure the answer to the above question is no

Er, you mean it *is* possible to inherit from multiple
extension types? How?

> 2. The scheme you propose is more costly in memory and cycles than I'd
> like

It's only one memory cycle more than it takes to access
the existing sub-structures. And it's a lot better than
the alternative, which is doing a Python dict lookup!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Tue Aug  6 05:34:22 2002
From: (Oren Tirosh)
Date: Tue, 6 Aug 2002 07:34:22 +0300
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>; from on Mon, Aug 05, 2002 at 11:06:25PM +0200
References: <> <> <> <>
Message-ID: <>

On Mon, Aug 05, 2002 at 11:06:25PM +0200, Martin v. Loewis wrote:
> If you look at the patch, you see that it precisely does what you
> propose to do: add a callback to the charmap codec:
> - it deletes charmap_decoding_error
> - it adds state to feed the callback function
> - it replaces the old call to charmap_decoding_error by

But it's NOT an error. It's new encoding functionality.  What if the new 
functionality you've added this way has an error of its own? Perhaps you
would like to have a flag to tell it whether to ignore error or raise an
exception?  Sorry, that argument has been taken over for another purpose.  

The real problem was some missing functionality in codecs. Here are two 
approaches to solve the problem:

1. Add the missing functionality.

2. Keep the old, limited functionality, let it fail, catch the error,
re-use an argument originally intended for an error handling strategy to 
shoehorn a callback that can implement the missing functionality, add a new 
name-based registry to overcome the fact that the argument must be a string.
Since this approach is conceptually stuck on treating it as an error it 
actually creates and discards a new exception object for each character 
converted via this path.

Ummm... <scratches head>, tough choice.


From David Abrahams" <  Tue Aug  6 06:22:52 2002
From: David Abrahams" < (David Abrahams)
Date: Tue, 6 Aug 2002 01:22:52 -0400
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <>
Message-ID: <022a01c23d09$65999840$>

----- Original Message -----
From: "Greg Ewing" <>

> David Abrahams <>:
> > > This would only handle single inheritance, but I think
> > > that's all you can have at the C level anyway, isn't
> > > it?
> >
> > 1. I'm pretty sure the answer to the above question is no
> Er, you mean it *is* possible to inherit from multiple
> extension types? How?

One way is by invoking the metatype with a bases tuple which includes the
extension types.
I think you can also fill in tp_bases explicitly in a new extension type,
but it's been a long time since I crawled through that code and discussed
it with Guido.
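
A rough illustration of the first route, seen from the Python side.
Built-in exception types stand in for extension types here (an
assumption made only so the sketch is self-contained), and the class
names Both and Clash are invented.  Whether a given pair of C types can
be combined depends on their instance layouts being compatible.

```python
# Call the metatype directly with a bases tuple of two C-implemented types.
Both = type("Both", (KeyError, ValueError), {})
print(Both.__bases__)
print(issubclass(Both, KeyError), issubclass(Both, ValueError))

# Types whose C-level instance layouts are incompatible are rejected.
clash_rejected = False
try:
    type("Clash", (int, str), {})
except TypeError as exc:
    clash_rejected = True
    print("rejected:", exc)
```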

> > 2. The scheme you propose is more costly in memory and cycles than I'd
> > like
> It's only one memory cycle more than it takes to access
> the existing sub-structures. And it's a lot better than
> the alternative, which is doing a Python dict lookup!

When I spoke of memory, I was talking about the extra pointer per level of
inheritance.
When I spoke of cycles, I was talking about the cycles to manage that
memory (probably moot).

It's not too terrible, but I'd like it a lot better if types would just use
tp_basicsize to find the beginning of the variable stuff so we could embed
the memory in the type itself. 'Course, I've forgotten more than I knew
about that code, so I might be barking up the wrong banyan.


           David Abrahams * Boost Consulting *

From  Tue Aug  6 08:36:40 2002
From: (M.-A. Lemburg)
Date: Tue, 06 Aug 2002 09:36:40 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>
Message-ID: <>

Jack Jansen wrote:
> On maandag, augustus 5, 2002, at 05:47 , M.-A. Lemburg wrote:
>> Jack Jansen wrote:
>>> Having to register the error handler first and then finding it by 
>>> name smells like a very big hack to me. I understand the reasoning 
>>> (that you don't want to modify the API of a gazillion C routines to 
>>> add an error object argument) but it still seems like a hack....
>> Well, in that case, you would have to call the whole codec registry
>> a hack ;-)
> No, not really. For codecs I think that there needn't be much of a 
> connection between the codec-supplier and the codec-user. Conceivably 
> the encoding-identifying string being passed to encode() could even have 
> been read from a data file or something.
> For error handling this is silly: the code calling encode() or decode() 
> will know how it wants errors handled. And if you argue that it isn't 
> really error handling but an extension to the encoding name then maybe 
> it should be treated as such (by appending it to the codec name in the 
> string, as in "ascii;xmlentitydefs" or so?).

You are omitting the fact, though, that different codecs may need
different implementations of a specific error handler. Now the
error handler will always implement the same logic, so to the users
it's all the same thing. And by using the string alias he needn't
worry about where to get the error handler from (it typically
lives with the codec itself).

Note that error handling is not really an extension to the encoding
itself. It just happens that it can be put to use that way for
e.g. escaping non-representable characters. Other applications
like fetching extra information from external sources or logging
the positions of coding problems do not fall into this category.
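
A sketch of a handler that does both jobs at once: it logs the position
of each coding problem and also escapes the offending characters.  It
uses codecs.register_error as found in later Python releases; the
handler name "log-and-escape" and the list problem_positions are
invented for illustration.

```python
import codecs

problem_positions = []


def log_and_escape(exc):
    """Record where encoding failed, then substitute character references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    problem_positions.append((exc.start, exc.end))
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # The second item tells the codec where to resume in the input.
    return replacement, exc.end


codecs.register_error("log-and-escape", log_and_escape)

encoded = "a\u20acb".encode("ascii", "log-and-escape")
print(encoded)              # b'a&#8364;b'
print(problem_positions)    # [(1, 2)]
```

The caller still selects the handler by its string alias, so the codec
API itself does not grow a new argument.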

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug  6 09:06:13 2002
From: (M.-A. Lemburg)
Date: Tue, 06 Aug 2002 10:06:13 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <> <> <> <>
Message-ID: <>

Oren Tirosh wrote:
> On Mon, Aug 05, 2002 at 11:06:25PM +0200, Martin v. Loewis wrote:
>>If you look at the patch, you see that it precisely does what you
>>propose to do: add a callback to the charmap codec:
> But it's NOT an error. It's new encoding functionality.  What if the new 
> functionality you've added this way has an error of its own? Perhaps you
> would like to have a flag to tell it whether to ignore error or raise an
> exception?  Sorry, that argument has been taken over for another purpose.  
> The real problem was some missing functionality in codecs. Here are two 
> approaches to solve the problem:
> 1. Add the missing functionality.
> 2. Keep the old, limited functionality, let it fail, catch the error,
> re-use an argument originally intended for an error handling strategy to 
> shoehorn a callback that can implement the missing functionality, add a new 
> name-based registry to overcome the fact that the argument must be a string.
> Since this approach is conceptually stuck on treating it as an error it 
> actually creates and discards a new exception object for each character 
> converted via this path.
> Ummm... <scratches head>, tough choice.

Oren, if you just want a codec which encodes and decodes
HTML entities, then this can be done easily by writing a codec
which works on Unicode only and is stacked on top of the other
existing codecs, e.g. if you first encode all non-printable
and non-ASCII code points using entity escapes and then pass
this Unicode string to one of the other codecs, you have
a solution to your problem.

Note that this is different from trying to
provide a work-around for encoding code points from Unicode
for which there are no corresponding mappings in a given
encoding. These situations would normally result in an
exception. Now HTML and XML offer you the possibility to
use special escapes for these, so that you can still encode
the complete Unicode set into e.g. ASCII, but only on
the premise that the encoded data is HTML or XML text.
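
The difference between the two approaches can be sketched as follows.
The function names are made up, and for brevity the first step escapes
only non-ASCII code points (not the non-printable ones).  The second
half uses the "xmlcharrefreplace" error handler that exists in later
Python releases.

```python
def escape_non_ascii(text):
    """Unicode -> Unicode: replace every non-ASCII code point with &#N;."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


def encode_with_entities(text, encoding):
    # Step 1 works on Unicode only; step 2 is any existing codec.
    return escape_non_ascii(text).encode(encoding)


text = "caf\u00e9 \u20ac"

# Stacking: everything outside ASCII is escaped, whatever the target codec.
stacked = encode_with_entities(text, "latin-1")
print(stacked)       # b'caf&#233; &#8364;'

# Error callback: only what latin-1 cannot represent is escaped.
fallback = text.encode("latin-1", "xmlcharrefreplace")
print(fallback)      # b'caf\xe9 &#8364;'
```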

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug  6 09:25:34 2002
From: (Martin v. Loewis)
Date: 06 Aug 2002 10:25:34 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> > If you look at the patch, you see that it precisely does what you
> > propose to do: add a callback to the charmap codec:
> > 
> > - it deletes charmap_decoding_error
> > - it adds state to feed the callback function
> > - it replaces the old call to charmap_decoding_error by
> But it's NOT an error. It's new encoding functionality.  

What is not an error? The handling? Certainly: the error and the error
handler are different things; error handlers are not errors. "ignore"
and "replace" are not errors, either, they are also new encoding
functionality. That is the very nature of handlers: they add functionality.

> The real problem was some missing functionality in codecs. Here are two 
> approaches to solve the problem:
> 1. Add the missing functionality.

That is not feasible, since you want that functionality also for
codecs you haven't heard of.

> 2. Keep the old, limited functionality, let it fail, catch the error,
> re-use an argument originally intended for an error handling strategy to 
> shoehorn a callback that can implement the missing functionality, add a new 
> name-based registry to overcome the fact that the argument must be a string.

That is possible, but inefficient. It is also the approach that people
use today, and the reason for this PEP to exist. The current
UnicodeError does not report any detail on the state that the codec
was in.

> Since this approach is conceptually stuck on treating it as an error it 
> actually creates and discards a new exception object for each character 
> converted via this path.

It's worse: If you find that the entire string cannot be encoded, you
have typically two choices:
- you perform a binary search. That may cause log n exceptions.
- you encode every character on its own. That reduces the number of
  exceptions to the number of unencodable characters, but it will also
  mean that the encoding is wrong for some encodings: You will always
  get the shift-in/shift-out sequences that your encoding may specify.

On decoding, this is worse: feeding a byte at a time may fail
altogether if you happen to break a multibyte character - when feeding
the entire string happily consumes long sequences of characters, and
only runs into a single problem byte.
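
The decoding problem is easy to reproduce.  A minimal sketch, using
UTF-8 as the multibyte encoding and a made-up sample string:

```python
data = "na\u00efve".encode("utf-8")     # the i-diaeresis occupies two bytes

# Decoding the entire byte string works.
print(data.decode("utf-8"))

# Feeding one byte at a time breaks the multibyte character apart.
failures = []
for i in range(len(data)):
    try:
        data[i:i + 1].decode("utf-8")
    except UnicodeDecodeError:
        failures.append(i)
print(failures)                          # [2, 3]
```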


From  Tue Aug  6 10:20:12 2002
From: (Oren Tirosh)
Date: Tue, 6 Aug 2002 12:20:12 +0300
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>; from on Tue, Aug 06, 2002 at 10:25:34AM +0200
References: <> <> <> <> <> <>
Message-ID: <>

On Tue, Aug 06, 2002 at 10:25:34AM +0200, Martin v. Loewis wrote:
> > 2. Keep the old, limited functionality, let it fail, catch the error,
> > re-use an argument originally intended for an error handling strategy to 
> > shoehorn a callback that can implement the missing functionality, add a new 
> > name-based registry to overcome the fact that the argument must be a string.
> That is possible, but inefficient. 

I'm confused.

I have just described what PEP 293 is proposing and you say that it's 
inefficient :-? I find it hard to believe that this is what you really meant 
since you are presumably in favor of this PEP in its current form. 

I can't tell if we actually disagree because apparently we don't 
understand each other.

> > Since this approach is conceptually stuck on treating it as an error it 
> > actually creates and discards a new exception object for each character 
> > converted via this path.
> It's worse: If you find that the entire string cannot be encoded, you
> have typically two choices:

Instead of treating it as a problem ("the string cannot be encoded") and 
getting trapped in the mindset of error handling I suggest approaching it 
from a positive point of view: "how can I make the encoding work the
way I want it to work?".  Let's leave the error handling for real errors.

Treating this as an error-handling issue was so counter-intuitive to me 
that until recently I never bothered to read PEP 293. The title made me 
think that it's completely irrelevant to my needs. After all, what I 
wanted was to translate HTML to/from Unicode, not find a better way to 
handle errors.


From  Tue Aug  6 10:28:42 2002
From: (Samuele Pedroni)
Date: Tue, 6 Aug 2002 11:28:42 +0200
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
Message-ID: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>

[Greg Ewing]
>I thought Guido was against having "in" do anything
>other than membership tests, but his last message sounded
>like he was changing his mind.


"thon" in "python"

then why not

[1,2] in [0,1,2,3]

(it's a purely rhetorical question)

in general I don't think it is a good idea
to have "in" be a membership vs subset/subseq
operator depending on non ambiguity, convenience
or simply implementer taste,
because truly there are data types (e.g. sets)
that would need both, disambiguated.

Either python grows a new subset/subseq operator
but probably this is overkill (keyword issue, new
__magic__ method, not meaningful or convenient
for a lot of types)

or strings (etc) should simply grow a new
method with an appropriate name.
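
To make the distinction concrete, here is a sketch assuming the
behaviour of current Python, where "in" on strings is a substring test
while "in" on lists remains a membership test.  The helper contains_run
is hypothetical; it plays the role of the separately named method.

```python
print("thon" in "python")          # True: substring test
print([1, 2] in [0, 1, 2, 3])      # False: [1, 2] is not an element
print([1, 2] in [0, [1, 2], 3])    # True: membership


def contains_run(seq, sub):
    """True if sub occurs as a contiguous run of items in seq."""
    n = len(sub)
    return any(list(seq[i:i + n]) == list(sub)
               for i in range(len(seq) - n + 1))


print(contains_run([0, 1, 2, 3], [1, 2]))   # True

# Sets need both questions, and spell them differently.
s = {1, 2, 3}
print(2 in s)                      # membership
print({1, 2} <= s)                 # subset
```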

"py"-in-"python"-is-dark-side-sexy-ly y'rs - Samuele Pedroni.

From  Tue Aug  6 10:46:22 2002
From: (Jack Jansen)
Date: Tue, 6 Aug 2002 11:46:22 +0200
Subject: [Python-Dev] Re: framer tool
In-Reply-To: <>
Message-ID: <>

On Monday, August 5, 2002, at 08:14 , Jeremy Hylton wrote:
> Should I try to make framer a modulator replacement?  I've got some
> time to work on it, but checked in the current progress in hopes of
> finding more help.

I think that would be a good idea. Modulator was something I quickly 
threw together years ago, I think that it may even have been the first 
Tkinter program I did (that may even have been the main reason for 
writing it:-). The code quality shows this, and it hasn't been 
maintained in aeons. Still, because it's such a quick and dirty tool it 
has its place. It would be good if framer could grow similar 
functionality (a GUI where you tap a couple of buttons to create objects 
and methods, plus a couple of switches to select the protocols the 
objects should adhere to) so we can lay modulator to rest.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Tue Aug  6 10:51:38 2002
From: (Christian Tismer)
Date: Tue, 06 Aug 2002 11:51:38 +0200
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <> <022a01c23d09$65999840$>
Message-ID: <>

David Abrahams wrote:

[Greg Ewing, adding a level of indirection]

[David Abrahams]
>>>2. The scheme you propose is more costly in memory and cycles than I'd
>>>like

>>It's only one memory cycle more than it takes to access
>>the existing sub-structures. And it's a lot better than
>>the alternative, which is doing a Python dict lookup!

Of course it is better. But since it is possible to
do better still, a sub-optimal solution will not
make me forget about it.

> When I spoke of memory, I was talking about the extra pointer per level of
> inheritance.
> When I spoke of cycles, I was talking about the cycles to manage that
> memory (probably moot).

Since we are talking of types and meta-types, I believe
memory issues are of minor interest.
There will not be more than a few hundred classes,
and they will be created just once.
The reason why I want to have extra data and function
caches in the types is that this is *very* memory
efficient, in comparison to stuffing things into the
instances (which would be easy to implement).

> It's not too terrible, but I'd like it a lot better if types would just use
> tp_basicsize to find the beginning of the variable stuff so we could embed
> the memory in the type itself. 'Course, I've forgotten more than I knew
> about that code, so I might be barking up the wrong banyan.

That's exactly what I want to do, but I have to find
out how the variable part of types is used at the moment,
and I admit I didn't understand it, yet.

The place where user stuff should go is where instances
have their slots. With meta-types, it now happens that
types become instances, but types refuse to have slots.
This needs to be changed, everything else is a workaround.

regards - chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Tue Aug  6 10:54:22 2002
From: (Jack Jansen)
Date: Tue, 6 Aug 2002 11:54:22 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
Message-ID: <>

On Tuesday, August 6, 2002, at 11:20 , Oren Tirosh wrote:
> Treating this as an error-handling issue was so counter-intuitive to me
> that until recently I never bothered to read PEP 293. The title made me
> think that it's completely irrelevant to my needs. After all, what I
> wanted was to translate HTML to/from Unicode, not find a better way to
> handle errors.

I think that this is really also the gist of my misgiving about the 
design: enhancing a codec/adding extra filtering is a different thing 
than error handling. The PEP uses "error handling" in the prose, but the 
API is geared towards adding extra filtering.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Tue Aug  6 11:02:14 2002
From: (Christian Tismer)
Date: Tue, 06 Aug 2002 12:02:14 +0200
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <> <>              <01d601c23c84$d2783c80$> <>
Message-ID: <>

Guido van Rossum wrote:

> From Christian's post I can't tell if he wants his types to be dynamic
> or static (i.e. if he's creating an arbitrary number of them at
> run-time or only a fixed number that's known at compile-time).

I'm not absolutely sure what I meant. Actually, I wanted to
cache existing methods, which are known at compile-time.
At run-time, they would be replaced for derived types.
But a run-time solution might make sense, to generate
very fast class variables, maybe.

> Here's a hack.
> For static extensions, you could extend one of the extension structs,
> e.g. PyMappingMethods (which is the smallest and also least likely to
> grow new methods), with additional fields.  Then you'd have to know
> whether you can access those extra fields; I suggest checking for the
> metatype.  A few casts and you're done.
> For dynamic extensions, you might be able to do the same: after
> type_new() has given you an object, allocate memory for an extended
> PyMappingMethods struct, copy the existing PyMappingMethods struct
> into it (if it exists), and replace the pointer.  Then in your
> deallocation function, make sure to free the pointer.
> Hope this helps in the short run.

Thanks a lot. Yes, it helps in the short run, but stays
a hack. I'm trying to find a way that allows meta-types
to support slots for its type instances without introducing
too much special-casing.
What I do not understand yet is who uses the variable type
part and in which way. I'd like to collaborate with it.

ciao - chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Tue Aug  6 11:12:54 2002
From: (Martin v. Loewis)
Date: 06 Aug 2002 12:12:54 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> > > 2. Keep the old, limited functionality, let it fail, catch the
> > > error, re-use an argument originally intended for an error
> > > handling strategy to shoehorn a callback that can implement the
> > > missing functionality, add a new name-based registry to overcome
> > > the fact that the argument must be a string.

> > That is possible, but inefficient. 
> I'm confused.
> I have just described what PEP 293 is proposing and you say that it's 
> inefficient :-? 

Perhaps I have misunderstood your description. I was assuming an
algorithm like

dispatch = {}   # maps an errors name to an encoding strategy

def new_encode(str, encoding, errors):
  return dispatch[errors](str, encoding)

def xml_encode(str, encoding):
  try:
    return str.encode(encoding, "strict")
  except UnicodeError:
    if len(str) == 1:
      return "&#%d;" % ord(str)
    # split in half and retry each part (binary search for bad characters)
    return xml_encode(str[:len(str)//2], encoding) + \
           xml_encode(str[len(str)//2:], encoding)

dispatch['xmlcharref'] = xml_encode

This seems to match the description "keep the old, limited
functionality, let it fail, catch the error", and it has all the
deficiencies I mentioned. 

It also is not the meaning of PEP 293. The whole idea is that the
handler is invoked *before* something has failed.
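
For readers who want to measure the cost of the catch-and-retry
algorithm above, here is a rendering for a current Python, where
encode() returns bytes.  The counter "attempts" is an addition made for
illustration; the result is compared against the built-in
"xmlcharrefreplace" handler that later releases provide.

```python
attempts = 0   # number of encode() calls, successful or not


def xml_encode(text, encoding):
    """Encode text, halving and retrying whenever strict encoding fails."""
    global attempts
    attempts += 1
    try:
        return text.encode(encoding, "strict")
    except UnicodeError:
        if len(text) == 1:
            return ("&#%d;" % ord(text)).encode("ascii")
        mid = len(text) // 2
        return xml_encode(text[:mid], encoding) + xml_encode(text[mid:], encoding)


result = xml_encode("a\u20acb", "ascii")
print(result)
print(attempts, "encode calls for a 3-character string")
print(result == "a\u20acb".encode("ascii", "xmlcharrefreplace"))
```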

> Instead of treating it as a problem ("the string cannot be encoded") and 
> getting trapped in the mindset of error handling I suggest approaching it 
> from a positive point of view: "how can I make the encoding work the
> way I want it to work?".  Let's leave the error handling for real errors.

Sounds good, but how does this help in finding a solution?

> Treating this as an error-handling issue was so counter-intuitive to me 
> that until recently I never bothered to read PEP 293. The title made me 
> think that it's completely irrelevant to my needs. After all, what I 
> wanted was to translate HTML to/from Unicode, not find a better way to 
> handle errors.

If you think this is a documentation issue - I'm fine with documenting
the feature differently.


From  Tue Aug  6 11:33:30 2002
From: (M.-A. Lemburg)
Date: Tue, 06 Aug 2002 12:33:30 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <>
Message-ID: <>

Jack Jansen wrote:
> On Tuesday, August 6, 2002, at 11:20 , Oren Tirosh wrote:
>> Treating this as an error-handling issue was so counter-intuitive to me
>> that until recently I never bothered to read PEP 293. The title made me
>> think that it's completely irrelevant to my needs. After all, what I
>> wanted was to translate HTML to/from Unicode, not find a better way to
>> handle errors.
> I think that this is really also the gist of my misgiving about the 
> design: enhancing a codec/adding extra filtering is a different thing 
> than error handling. The PEP uses "error handling" in the prose, but the 
> API is geared towards adding extra filtering.

That's a wrong impression. The new error handling API allows
you to do many different things based on the current position
of the codec in the input stream.

The fact that this can be used to apply escaping to otherwise
illegal mappings stems from the basics behind this new API. It's
an application, not its main purpose. Filtering can be had using
different techniques such as by stacking codecs as well.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug  6 12:06:45 2002
From: (Michael Hudson)
Date: 06 Aug 2002 12:06:45 +0100
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: Christian Tismer's message of "Mon, 05 Aug 2002 12:43:37 +0200"
References: <> <>
Message-ID: <>

Christian Tismer <> writes:

> Hi Guido:
> here a simpler formulation of my question:
> I would like to create types with overridable methods.
> This is supported by the new type system.
> But I'd also like to make this as fast as possible and
> therefore to avoid extra dictionary lookups for methods,
> especially if they are most likely not overridden.

I would wonder how much this saves.

How many more instructions does

  PyDict_GetItem(ob->ob_type->tp_dict, interned_string)

take than

  ob->ob_type->tp_my_field->mf_my_method

?  Sure, *some* but not all that many esp. if the called function is
actually doing significant work.

Of course, the first gets you a PyCFunctionObject* (or similar) not a
function pointer and that adds a layer of overhead.  In fact, this is
probably the greater source of overhead (you might have to box up the
arguments, allocate & deallocate the argument tuple, etc).

I doubt my opinion counts here, but I think I'd prefer to see *less*,
not more, methods in type object in future.  Particularly if there's
some way to call functions with known signatures efficiently.
Unfortunately, that seems pretty hard after five minutes thinking.


  I wouldn't trust the Anglo-Saxons for much anything else.  Given
  they way English is spelled, who could trust them on _anything_ that
  had to do with writing things down, anyway?
                                        -- Erik Naggum, comp.lang.lisp

From  Tue Aug  6 12:39:49 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 07:39:49 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 12:31:33 +1200."
References: <>
Message-ID: <>

> > But the satisfaction that spelling "has_key" as "in" gives me suggests
> > that there's more potential to it.

> I thought you'd always argued against this before, on
> the grounds that the convenience wasn't worth the
> inconsistency. Are you starting to change your mind?


--Guido van Rossum (home page:

From  Tue Aug  6 12:44:17 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 07:44:17 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Mon, 05 Aug 2002 21:36:48 EDT."
References: <>
Message-ID: <>

> On 5 Aug 2002 at 16:50, Gordon McMillan wrote:
> > What I'm really saying is that I almost never use x
> > in str because its semantics have always been
> > peculiar. Thus, I don't *really* care whether '' in
> > str raises an exception, because if it does, I
> > won't train myself to use it <wink>. 
> Turns out that's not true. When I want set membership,
> I first write "char in ('a', 'b', 'c')", then
> sometimes change it because "char in 'abc'" is more
> efficient.
> So whether '' in 'abc' will work or not is a red
> herring. The real issue is that membership gets
> conflated with subsetting.

Well, in current Python you can only safely make that transformation
when you're damn sure that char is a string of length one, otherwise
you'd risk a TypeError.  So this code (if correct) will continue to
work, assuming you're not catching TypeError (which is often an
assumption when we say that a new feature "won't break old code").
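
A small sketch of that transformation, run under modern Python 3 where
substring "in" is already in place. The function names are hypothetical.
The two spellings agree for one-character strings and diverge otherwise:

```python
def in_tuple(char):
    # Set-membership spelling: char must equal one of the elements.
    return char in ('a', 'b', 'c')


def in_string(char):
    # String spelling: in modern Python this is a substring test.
    return char in 'abc'


# The two spellings agree whenever char is a string of length one.
agree_on_single_chars = all(in_tuple(c) == in_string(c) for c in 'abcxyz')

# They differ for longer (or empty) strings.
tuple_says = (in_tuple('ab'), in_tuple(''))
string_says = (in_string('ab'), in_string(''))
```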

--Guido van Rossum (home page:

From  Tue Aug  6 12:56:21 2002
From: (Steve Holden)
Date: Tue, 6 Aug 2002 07:56:21 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <05d501c23d40$49043910$>

[Samuele Pedroni]
> [Greg Ewing]
> >I thought Guido was against having "in" do anything
> >other than membership tests, but his last message sounded
> >like he was changing his mind.
> If
> "thon" in "python"
> then why not
> [1,2] in [0,1,2,3]
> (it's a purely rhetorical question)
Which I also asked. But Guido pointed out that [1, 2] may well be a member
of a list such as [0, [1, 2], [3, 4], 5].
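
To make the distinction concrete, a short sketch in modern Python 3 (the
variable names are only for illustration). For lists, "in" stays a
membership test, so a list is found only where it is itself an element:

```python
nested = [0, [1, 2], [3, 4], 5]
flat = [0, 1, 2, 3]

# [1, 2] is an element of the nested list...
member_of_nested = [1, 2] in nested

# ...but not an element of the flat list, even though 1 and 2 are adjacent.
member_of_flat = [1, 2] in flat

# Strings are the odd one out: "in" tests for a substring.
substring = "thon" in "python"
```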

> in general I don't think it is a good idea
> to have "in" be a membership vs subset/subseq
> operator depending on non ambiguity, convenience
> or simply implementer taste,
> because truly there are data types (ex. sets)
> that would need both and disambiguated.
Well, it looks like you lose!

> Either python grows a new subset/subseq operator
> but probably this is overkill (keyword issue, new
> __magic__ method, not meaningful, convenient
> for a lot of types)
> or strings (etc) should simply grow a new
> method with an appropriate name.
> "py"-in-"python"-is-dark-side-sexy-ly y'rs - Samuele Pedroni.
Consistency apparently loses out to pragmatism in this case.

Steve Holden                       
Python Web Programming      

From  Tue Aug  6 13:12:27 2002
From: (Samuele Pedroni)
Date: Tue, 6 Aug 2002 14:12:27 +0200
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
Message-ID: <006901c23d42$87b64340$6d94fea9@newmexico>

[Steve Holden]
>> (it's a purely rhetorical question)
>Which I also asked. But Guido pointed out that [1, 2] may well be a member
>of a list such as [0, [1, 2], [3, 4], 5].

which just reinforces my point below;
anyway, I knew that before, even
without Guido. You were not supposed
to answer a rhetorical question anyway <wink>.
I have not read the entire
unbearably long thread.

>> in general I don't think it is a good idea
>> to have "in" be a membership vs subset/subseq
>> operator depending on non ambiguity, convenience
>> or simply implementer taste,
>> because truly there are data types (ex. sets)
>> that would need both and disambiguated.
>Well, it looks like you lose!

I'm not taking this personally;
the problem of one operator with two potential
semantics remains.

>Consistency apparently loses out to pragmatism in this case.

What do you want "in" to do for you today? <wink><wink>.

That's my last input on the matter.


PS: these days I read python-dev through the archives,
it seems that this time I have added to the redundancy
department myself, oh well...

From  Tue Aug  6 13:30:35 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 08:30:35 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:28:42 +0200."
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <>

> If
> "thon" in "python"
> then why not
> [1,2] in [0,1,2,3]
> (it's a purely rhetorical question)
> in general I don't think it is a good idea
> to have "in" be a membership vs subset/subseq
> operator depending on non ambiguity, convenience
> or simply implementer taste,
> because truly there are data types (ex. sets)
> that would need both and disambiguated.
> Either python grows a new subset/subseq operator
> but probably this is overkill (keyword issue, new
> __magic__ method, not meaningful, convenient
> for a lot of types)
> or strings (etc) should simply grow a new
> method with an appropriate name.

I recognize this as related to the argument that Ping was (still is?)
making against "for x in <iterator>"; but not because the same
operator "in" is involved.

It has to do with polymorphism (functions that accept different types
of arguments; it's somewhat different from operator overloading).

Suppose we have an operator @.  (Take operator in a wide enough sense,
including other bits of grammar, like "for".)  If there's only one
type (or one narrow set of related types) for which @ makes sense,
human readers of a program will use @ as a clue about the type of the
arguments, and (if correct) that will help reasoning about the
expression in which it occurs.

ABC uses this property of operators to do type inference: if an ABC
expression contains "a+b", a and b must be numbers; and so on.

Python chose to allow operators to be overloaded by different types
with different meanings, and the language gives a+b a very different
meaning for numbers than for sequences, for example.  (And an
important invariant is lost in this example: for numbers, a+b == b+a,
but not so for sequences!)
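
The lost invariant is easy to check directly; a tiny sketch (variable
names are illustrative only):

```python
# For numbers, + is commutative.
numbers_commute = (2 + 3 == 3 + 2)

# For sequences, + is concatenation, and order matters.
lists_commute = ([1, 2] + [3] == [3] + [1, 2])
strings_commute = ("py" + "thon" == "thon" + "py")
```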

Is this a problem?

The ease with which we get used to "key in dict" makes me think it is
not.  While Python doesn't require you to declare the types of your
arguments, the type (or set of allowed types) for arguments is usually
strongly known in the mind of the programmer, and most often strong
hints are given either by the choice of argument name or by

While it's possible in theory, in practice nobody writes polymorphic
code that uses + and * on its arguments and yet accepts both numbers
and strings.

The reality is that some types are more related than others, and the
substitutability property only makes sense for types that are
sufficiently related.  We *do* write code that accepts any kind of
sequence, including strings.  We do *not* write code that accepts any
kind of container (sequence or mapping), even though some operations
apply to both kinds of container (len, a[b], and since 2.2, x in a).

In code that applies to all (or even just some) kinds of sequences,
the 'in' operator will continue to stand for membership.  This won't
cause a problem with strings: correct code using 'in' for membership
will never use seq1 in seq2, it will use item in seq, where the type
of item is "whatever the type of seq[0] is, if it exists."  When the
seq is a string, item will be a one-char string -- not a "type" in
Python's type system, but certainly a useful concept.

But there's also lots of code that deals only with strings.  This is
normally completely clear to the casual reader: either because
string literals are used, compared, etc., or because values are
obtained from functions known to return strings (such as
file.readline()), or because methods unique to strings (e.g. s.lower())
are used, and so on.  Strings are very important in lots of programs,
and we want our notations for string operations to be readable and
expressive.  (Regular expressions are extreme in expressiveness, but
lack readability, which is why they're relegated to an imported module
in Python.)  Substring containment testing is a common operation on
strings, so being able to write it as 's1 in s2' rather than
's2.find(s1) >= 0' is a big win, IMO.
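
A short comparison of the two spellings, plus the find() pitfall the
related thread is named after, as it behaves in modern Python 3 (the
variable names are illustrative):

```python
s2 = "python"

# The two correct spellings of substring containment.
with_in = "thon" in s2
with_find = s2.find("thon") >= 0

# The gotcha: find() returns an index, or -1 when nothing is found,
# so using its result directly as a truth value gets both cases wrong.
truthy_when_at_start = bool(s2.find("py"))    # index 0 is falsy
truthy_when_missing = bool(s2.find("xyz"))    # -1 is truthy
```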

PS. Sets are a different case again.  They are containers but neither
sequences nor mappings (though depending on what you want to do they
can resemble either).  We will have to think about which operators
make sense for them.  I'd say that 'elem in set' is an appropriate way
to spell set membership; how to spell subset is a matter of discussion
(maybe 'set1 <= set2' is a good idea; maybe not).
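
For reference, a sketch using the built-in set type of modern Python 3,
which did not exist when this was written. There both spellings ended up
available side by side, so membership and subset stay distinct:

```python
vowels = {"a", "e", "i", "o", "u"}
front_vowels = {"e", "i"}

# Membership: is this element in the set?
is_member = "e" in vowels

# Subset: spelled with <= (or the issubset() method), not with "in".
is_subset = front_vowels <= vowels
is_subset_method = front_vowels.issubset(vowels)

# A set of vowels is not a subset of a smaller one.
not_subset = vowels <= front_vowels
```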

--Guido van Rossum (home page:

From  Tue Aug  6 13:31:40 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 08:31:40 -0400
Subject: [Python-Dev] Re: framer tool
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:46:22 +0200."
References: <>
Message-ID: <>

> It would be good if framer could grow similar 
> functionality (a GUI where you tap a couple of buttons to create objects 
> and methods, plus a couple of switches to select the protocols the 
> objects should adhere to) so we can lay modulator to rest.

But modulator is such a cool name!  Maybe that part of framer could be
called modulator 2, in honor of the original.

--Guido van Rossum (home page:

From  Tue Aug  6 13:37:32 2002
From: (Gordon McMillan)
Date: Tue, 6 Aug 2002 08:37:32 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: Your message of "Mon, 05 Aug 2002 21:36:48 EDT." <3D4EEFF0.27145.40FDFEA8@localhost>
Message-ID: <3D4F8ACC.18886.435AE7E6@localhost>

On 6 Aug 2002 at 7:44, Guido van Rossum wrote:

> > So whether '' in 'abc' will work or not is a red
> > herring. The real issue is that membership gets
> > conflated with subsetting.
> Well, in current Python you can only safely make
> that transformation when you're damn sure that char
> is a string of length one, otherwise you'd risk a
> TypeError. So this code (if correct) will continue
> to work, assuming you're not catching TypeError
> (which is often an assumption when we say that a new
> feature "won't break old code"). 

I agree that x in str meaning "subset of" is more
intuitive. I believe you are correct (at least most
old code will still work), but this one makes me
uneasy (I admit possibly because x in str has
always made me uneasy).

And finally, I vote that testing for subset should
work in the mathematically correct way (when
testing for the empty subset). This does not
affect your argument. (In fact, Tim is arguing to
have half[1] the code that catches TypeErrors
continue to work, while the other half doesn't.)

'Nuff said.

-- Gordon

[1]No, probably not by lines of code.

From  Tue Aug  6 13:54:50 2002
From: (Samuele Pedroni)
Date: Tue, 6 Aug 2002 14:54:50 +0200
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>  <>
Message-ID: <008901c23d48$73b86840$6d94fea9@newmexico>

Thanks for the detailed argument.

> In code that applies to all (or even just some) kinds of sequences,
> the 'in' operator will continue to stand for membership.  This won't
> cause a problem with strings: correct code using 'in' for membership
> will never use seq1 in seq2, it will use item in seq, where the type
> of item is "whatever the type of seq[0] is, if it exists."  When the
> seq is a string, item will be a one-char string -- not a "type" in
> Python's type system, but certainly a useful concept.
> But there's also lots of code that deals only with strings.  This is
> normally completely clear to the casual reader: either because
> string literals are used, compared, etc., or because values are
> obtained from functions known to return strings (such as
> file.readline()), or because methods unique to strings (e.g. s.lower())
> are used, and so on.  Strings are very important in lots of programs,
> and we want our notations for string operations to be readable and
> expressive.  (Regular expressions are extreme in expressiveness, but
> lack readability, which is why they're relegated to an imported module
> in Python.)  Substring containment testing is a common operation on
> strings, so being able to write it as 's1 in s2' rather than
> 's2.find(s1) >= 0' is a big win, IMO.

My only remark is that this opens the temptation for someone
to subclass say UserList and define "in" as subseq
because it is convenient for the application, for some
value of convenient. And write "seq1 in seq2".
One can generalize saying that it is OK for sequences
that are not full-fledged containers and in particular
do not accept (per contract) subseqs as elements.
All the subtle explanation shows that this is indeed a subtle
point.
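
A sketch of the temptation described above, in modern Python 3. RunList
is a hypothetical class, not anything from the standard library; it
redefines "in" to look for a contiguous run, at which point ordinary
membership tests stop working:

```python
from collections import UserList


class RunList(UserList):
    """Hypothetical list whose 'in' tests for a contiguous run of items."""

    def __contains__(self, candidate):
        run = list(candidate)  # raises TypeError for a non-iterable item
        n = len(run)
        return any(self.data[i:i + n] == run
                   for i in range(len(self.data) - n + 1))


seq = RunList([0, 1, 2, 3])

run_found = [1, 2] in seq        # adjacent items, in order
run_not_found = [2, 1] in seq    # wrong order
empty_run_found = [] in seq      # the empty run is always found
```

Note that "1 in seq" no longer means membership with this class: the
item is not iterable, so the test raises TypeError.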

Thanks again.

PS: is pure substring testing such a common idiom?
I have not found so many
matches for   find\(.*\)\s*>  in the std lib,
but maybe the re is not general enough or
the std lib is not typical in this respect. Or some
op error.

From  Tue Aug  6 15:01:56 2002
From: (Andrew Koenig)
Date: 06 Aug 2002 10:01:56 -0400
Subject: [Python-Dev] Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Tim> I don't count that as "a practical fear" unless you actually
Tim> search for empty strings, and I don't believe that you do (or at
Tim> least not on purpose -- you can change my mind in a hurry by
Tim> posting your Python code that does do so, though!).

A hypothetical example for you.

Imagine an interactive program that rummages through a pile of
files to find files with particular properties.  Such a program
might allow one to request a search by presenting a form to fill
out.  Suppose that form has a fragment that looks like this:

     Search only files containing this string:  |                   |

If the user doesn't type anything into this part of the form, we would
like the search to cover all files.

If (s in t) yields true whenever s is null, this example just works.
Otherwise, the code needs a special case.
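
A minimal sketch of that form handler in modern Python 3, where
(s in t) is true for an empty s. The function name and the in-memory
"files" dictionary are assumptions made for the example; a real program
would read the files from disk:

```python
def files_containing(needle, files):
    """Return the names of files whose text contains needle."""
    return sorted(name for name, text in files.items() if needle in text)


# Hypothetical file contents standing in for a pile of files.
files = {"a.txt": "spam and eggs", "b.txt": "ham", "c.txt": ""}

some_files = files_containing("am", files)
all_files = files_containing("", files)   # blank field: no special case
```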

Andrew Koenig,,

From  Tue Aug  6 15:16:07 2002
From: (Barry A. Warsaw)
Date: Tue, 6 Aug 2002 10:16:07 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <>

Great analysis Guido, thanks.

    GvR> Strings are very important in lots of programs, and we want
    GvR> our notations for string operations to be readable and
    GvR> expressive.  (Regular expressions are extreme in
    GvR> expressiveness, but lack readability, which is why they're
    GvR> relegated to an imported module in Python.)  Substring
    GvR> containment testing is a common operation on strings, so
    GvR> being able to write it as 's1 in s2' rather than 's2.find(s1)
    GvR> >= 0' is a big win, IMO.

I agree completely.  The other thing about strings is that they are of
a dual nature, being both a sequence of characters, and an atomic
object.  At least, /I/ usually think about strings as whole units,
except when I want to slice and dice them.  And "substr in str" is
just such a natural extension of "char in str" because when I do the
former, I'm still thinking about looking for a substring, just one of
a single character in length.


From  Tue Aug  6 15:34:34 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 10:34:34 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 14:54:50 +0200."
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico> <>
Message-ID: <>

> My only remark is that this opens the temptation for someone
> to subclass say UserList and define "in" as subseq
> because it is convenient for the application, for some
> value of convenient. And write "seq1 in seq2".

Yeah, once you allow overloading, you can't prevent abuse.  I've heard
of bad C++ programmers who write A+B meaning an assignment to A.

> One can generalize saying that it is OK for sequences
> that are not full-fledged containers and in particular
> do not accept (per contract) subseqs as elements.

In the context of a particular application it can be very useful and
completely unambiguous.

> All the subtle explanation shows that this is indeed a subtle
> point.


> Thanks again.

You're welcome.  And thanks for your question -- it made me see this
issue in a different light (the correct one :-).

> PS: is pure substring testing such a common idiom?
> I have not found so many
> matches for   find\(.*\)\s*>  in the std lib,
> but maybe the re is not general enough or
> the std lib is not typical in this respect. Or some
> op error.

The std lib is probably low on string processing ops compared to many
real apps.

--Guido van Rossum (home page:

From  Tue Aug  6 15:47:57 2002
From: (Skip Montanaro)
Date: Tue, 6 Aug 2002 09:47:57 -0500
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <15695.57757.655927.698893@localhost.localdomain>

    Samuele> If

    Samuele> "thon" in "python"

    Samuele> then why not

    Samuele> [1,2] in [0,1,2,3]

    Samuele> (it's a purely rhetorical question)

    Samuele> in general I don't think it is a good idea to have "in" be a
    Samuele> membership vs subset/subseq operator depending on non
    Samuele> ambiguity, convenience or simply implementer taste, because
    Samuele> truly there are data types (ex. sets) that would need both and
    Samuele> disambiguated.

Perhaps it makes sense to allow "'thon' in 'python'" to return True, but
still have "[1,2] in [0,1,2,3]" return False if we loosen the steadfast
requirement that strings and lists be as much alike as possible.  That is,
while both are sequences, we take advantage of the distinction between their
basic structures (sequence of characters vs. sequence of arbitrary


From  Tue Aug  6 15:46:52 2002
From: (Michael Chermside)
Date: Tue, 06 Aug 2002 10:46:52 -0400
Subject: [Python-Dev] Re: Dafanging the find() gotcha
Message-ID: <>

Tim> I don't count that as "a practical fear" unless you actually
Tim> search for empty strings, and I don't believe that you do (or at
Tim> least not on purpose -- you can change my mind in a hurry by
Tim> posting your Python code that does do so, though!).

Andrew> A hypothetical example for you.
Andrew> Imagine an interactive program that [...] present[s] a form to 
Andrew> fill out [which] looks like this:
Andrew>     Search only files containing this string:
Andrew> If the user doesn't type anything into this part of the form, we 
Andrew> would like the search to cover all files.

I think this is an extremely unconvincing example. You have pushed the 
API up to the user of a program and supposed that they expect the 
behavior which you are trying to defend. In practice, what users expect 
in cases where a field is left blank is for that field to be IGNORED, 
not for it to be processed, but its contents treated as containing an 
empty string.

If you had an algorithm which worked on strings generally but only if 
the null string behavior was as desired, that would be convincing. But 
saying that the user might expect this behavior seems a poor argument... 
users' expectations are usually Do-What-I-Mean, not Do-It-Right. 
Programming languages, though, work better when designed to Do-It-Right.

Perl-being-the-exception-that-proves-the-rule -lly yours,

-- Michael Chermside

From  Tue Aug  6 15:59:28 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 10:59:28 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 09:47:57 CDT."
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <>

> Perhaps it makes sense to allow "'thon' in 'python'" to return True,
> but still have "[1,2] in [0,1,2,3]" return False if we loosen the
> steadfast requirement that strings and lists be as much alike as
> possible.

That was never a requirement.  Strings and lists are merely similar
insofar as they have very similar needs for a slicing and subscripting
notation, and to a lesser extent for concatenation, repetition and

Note that the sets of methods supported are almost entirely distinct
(only count and index are shared).

--Guido van Rossum (home page:

From  Tue Aug  6 16:19:18 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 11:19:18 -0400
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:51:38 +0200."
References: <> <022a01c23d09$65999840$>
Message-ID: <>

[David A]
> > It's not too terrible, but I'd like it a lot better if types would
> > just use tp_basicsize to find the beginning of the variable stuff
> > so we could embed the memory in the type itself. 'Course, I've
> > forgotten more than I knew about that code, so I might be barking
> > up the wrong banyan.

[Chris T]
> That's exactly what I want to do, but I have to find
> out how the variable part of types is used at the moment,
> and I admit I didn't understand it, yet.
> The place where user stuff should go is where instances
> have their slots. With meta-types, it now happens that
> types become instances, but types refuse to have slots.
> This needs to be changed, everything else is a workaround.

You're right.  And David has the right idea.  The problem is that for
convenience I defined the variable part of a type object as a private
structure (etype).  It's a lot of work to change that -- not very deep
perhaps, but a lot of refactoring code that does deep things.

To remind myself of this task, I've added a new SF bug:

--Guido van Rossum (home page:

From  Tue Aug  6 16:26:31 2002
From: (Andrew Koenig)
Date: 06 Aug 2002 11:26:31 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico>
Message-ID: <>

Guido> Yeah, once you allow overloading, you can't prevent abuse.  I've heard
Guido> of bad C++ programmers who write A+B meaning an assignment to A.

That kind of thing is uncommon, partly because it can't be done for
built-in types.  Such practices are widely derided in the C++
community, too.

Andrew Koenig,,

From  Tue Aug  6 16:28:57 2002
From: (Andrew Koenig)
Date: 06 Aug 2002 11:28:57 -0400
Subject: [Python-Dev] Re: Dafanging the find() gotcha
In-Reply-To: <>
References: <>
Message-ID: <>

Michael> I think this is an extremely unconvincing example. You have
Michael> pushed the API up to the user of a program and supposed that
Michael> they expect the behavior which you are trying to defend. In
Michael> practice, what users expect in cases where a field is left
Michael> blank is for that field to be IGNORED, not for it to be
Michael> processed, but its contents treated as containing an empty
Michael> string.

I understand.  My point is that in this particular example, what the
user perceives as ignoring the request is obtained by the
implementation technique of treating it as an empty string.  The user
doesn't have to know about this implementation technique, of course.

Andrew Koenig,,

From  Tue Aug  6 16:31:45 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 11:31:45 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:26:31 EDT."
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico> <> <008901c23d48$73b86840$6d94fea9@newmexico> <>
Message-ID: <>

> Guido> Yeah, once you allow overloading, you can't prevent abuse.  I've heard
> Guido> of bad C++ programmers who write A+B meaning an assignment to A.
> That kind of thing is uncommon, partly because it can't be done for
> built-in types.  Such practices are widely derided in the C++
> community, too.

And that's exactly my answer in the Python case, too.  You can't
prevent people from writing bad code, but you can make them look
foolish. :-)

--Guido van Rossum (home page:

From  Tue Aug  6 16:36:13 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 11:36:13 -0400
Subject: [Python-Dev] Re: Dafanging the find() gotcha
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:28:57 EDT."
References: <>
Message-ID: <>

> Michael> I think this is an extremely unconvincing example. You have
> Michael> pushed the API up to the user of a program and supposed that
> Michael> they expect the behavior which you are trying to defend. In
> Michael> practice, what users expect in cases where a field is left
> Michael> blank is for that field to be IGNORED, not for it to be
> Michael> processed, but its contents treated as containing an empty
> Michael> string.
> I understand.  My point is that in this particular example, what the
> user perceives as ignoring the request is obtained by the
> implementation technique of treating it as an empty string.  The user
> doesn't have to know about this implementation technique, of course.

I think it's a poor implementation technique. :-)  Opening the file to
search for an empty string is very inefficient.

My own potential example was some kind of graph traversal algorithm,
representing paths by sequences of letters (the letters labeling
edges), and involving paths that are subpaths of other paths.
Certainly the empty path should be considered a valid subpath of other

BTW, a more fool-proof (though unfortunately slower) way of testing
for substring containment in existing Python would be s2.count(s1) --
this returns the number of occurrences.  And of course,
'abc'.count('') returns 4.
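
A quick sketch of the count() spelling; the helper name contains() is
hypothetical:

```python
def contains(s2, s1):
    """Substring test spelled with count(); scans all of s2, so it is slower."""
    return s2.count(s1) > 0


# The empty string occurs before each character and once at the end.
empty_occurrences = 'abc'.count('')

# count() reports non-overlapping occurrences.
bc_occurrences = 'abcabc'.count('bc')

found = contains('python', 'thon')
found_empty = contains('abc', '')
missing = contains('python', 'xyz')
```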

--Guido van Rossum (home page:

From  Tue Aug  6 16:38:21 2002
From: (Andrew Koenig)
Date: Tue, 6 Aug 2002 11:38:21 -0400 (EDT)
Subject: [Python-Dev] Re: Dafanging the find() gotcha
In-Reply-To: <> (message from Guido
 van Rossum on Tue, 06 Aug 2002 11:36:13 -0400)
References: <>
 <> <>
Message-ID: <>

>> I understand.  My point is that in this particular example, what the
>> user perceives as ignoring the request is obtained by the
>> implementation technique of treating it as an empty string.  The user
>> doesn't have to know about this implementation technique, of course.

Guido> I think it's a poor implementation technique. :-)  Opening the file to
Guido> search for an empty string is very inefficient.

I'm assuming that the file is going to be opened anyway, possibly
to check for other search criteria.

Guido> My own potential example was some kind of graph traversal algorithm,
Guido> representing paths by sequences of letters (the letters labeling
Guido> edges), and involving paths that are subpaths of other paths.
Guido> Certainly the empty path should be considered a valid subpath of other
Guido> paths.

I can imagine similar applications that deal with file names

Guido> BTW, a more fool-proof (though unfortunately slower) way of testing
Guido> for substring containment in existing Python would be s2.count(s1) --
Guido> this returns the number of occurrences.  And of course,
Guido> 'abc'.count('') returns 4.

That could be much slower, of course.

Incidentally, one other argument that might be relevant is that in
every other programming language I've ever seen that supports string
searching, the null string is accepted as a search argument and is
always found.

From  Tue Aug  6 16:42:37 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 11:42:37 -0400
Subject: [Python-Dev] Re: Dafanging the find() gotcha
In-Reply-To: Your message of "Tue, 06 Aug 2002 11:38:21 EDT."
References: <> <> <>
Message-ID: <>

> Incidentally, one other argument that might be relevant is that in
> every other programming language I've ever seen that supports string
> searching, the null string is accepted as a search argument and is
> always found.

Same for Python, until now.

--Guido van Rossum (home page:

From  Tue Aug  6 16:58:43 2002
From: (Christian Tismer)
Date: Tue, 06 Aug 2002 17:58:43 +0200
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
References: <> <> <>
Message-ID: <>

Michael Hudson wrote:

[slots in types]

> I would wonder how much this saves.
> How many more instructions does
>   PyDict_GetItem(ob->ob_type->tp_dict, interned_string)
> take than
>   ob->ob_type->tp_my_field->mf_my_method
> ?  Sure, *some* but not all that many esp. if the called function is
> actually doing significant work.

The comparison doesn't hit the nail (as you explain
as well), since what I do right now is to call
a highly optimized C function, directly, and the
speed concerns are mainly for my C API, which is
supposed to be much faster than the Python interface.

Having to call anything but my builtin stuff hurts.
So I want at least to 'know' that my function is
not overridden, and be able to call the builtin stuff.
Doing the call all the time via


would be nice, but I'd even be pleased with some flag.
But there is no space for anything in a type.
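
The real question here is about the C API, but the "know that my
function is not overridden" idea can be sketched at the Python level.
Everything below (Tasklet, switch, is_overridden, fast_switch) is a
hypothetical stand-in, not the actual Stackless interface:

```python
class Tasklet:
    def switch(self):
        return "builtin switch"


class PlainTasklet(Tasklet):
    pass  # inherits switch unchanged


class CustomTasklet(Tasklet):
    def switch(self):
        return "overridden switch"


def is_overridden(obj, base, name):
    """True if obj's type supplies its own version of base's method."""
    return getattr(type(obj), name) is not getattr(base, name)


def fast_switch(obj):
    # Take the direct path unless a subclass has replaced the method.
    if is_overridden(obj, Tasklet, "switch"):
        return obj.switch()
    return Tasklet.switch(obj)
```

In C the check would have to be equally cheap, which is why a flag or a
slot in the type is being asked for.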

Second, this is most time critical code, since my
tasklet switching is now very fast (half the time
of a function call from Python) for my CFrames.
And now people ask for overriding there, which hurts
me most possible. I will either find the solution,
or leave it as it is and ask C programmers to
"grab the thing if you want the overridden method".


> I doubt my opinion counts here, but I think I'd prefer to see *less*,
> not more, methods in type object in future.  Particularly if there's
> some way to call functions with known signatures efficiently.
> Unfortunately, that seems pretty hard after five minutes thinking.

I'm not going to introduce masses of new methods for
type objects, but a generic way to introduce private

not-easy-to-stop-me-anyway-at-all-ly y'rs -- chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Tue Aug  6 16:57:08 2002
From: (David Abrahams)
Date: Tue, 6 Aug 2002 11:57:08 -0400
Subject: [Python-Dev] Re: Dafanging the find() gotcha
References: <>              <>  <>
Message-ID: <03a701c23d62$41eaa020$>

A relevant discussion came up recently on the boost list. In our regex
library there's a function which tells you whether a string was a partial
match to the beginning of a pattern. The example used in the docs was a
credit-card number validator which watches what you type and beeps at you
if there's a mistake. Unfortunately, the implementation would return false
if the input string was empty. Of course that required special-casing for
the empty string. Eventually complaints from users caused the library
maintainer to change his mind about the response to the empty string.

FWIW-ly yr's,

           David Abrahams * Boost Consulting *

----- Original Message -----
From: "Guido van Rossum" <>
To: "Andrew Koenig" <>
Cc: "Michael Chermside" <>; "python-dev"
Sent: Tuesday, August 06, 2002 11:36 AM
Subject: Re: [Python-Dev] Re: Dafanging the find() gotcha

> > Michael> I think this is an extremely unconvincing example. You have
> > Michael> pushed the API up to the user of a program and supposed that
> > Michael> they expect the behavior which you are trying to defend. In
> > Michael> practice, what users expect in cases where a field is left
> > Michael> blank is for that field to be IGNORED, not for it to be
> > Michael> processed, but its contents treated as containing an empty
> > Michael> string.
> >
> > I understand.  My point is that in this particular example, what the
> > user perceives as ignoring the request is obtained by the
> > implementation technique of treating it as an empty string.  The user
> > doesn't have to know about this implementation technique, of course.
> I think it's a poor implementation technique. :-)  Opening the file to
> search for an empty string is very inefficient.
> My own potential example was some kind of graph traversal algorithm,
> representing paths by sequences of letters (the letters labeling
> edges), and involving paths that are subpaths of other paths.
> Certainly the empty path should be considered a valid subpath of other
> paths.
> BTW, a more fool-proof (though unfortunately slower) way of testing
> for substring containment in existing Python would be s2.count(s1) --
> this returns the number of occurrences.  And of course,
> 'abc'.count('') returns 4.
> --Guido van Rossum (home page:
> _______________________________________________
> Python-Dev mailing list
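
The empty-string cases mentioned above can be checked directly. This is a
minimal sketch of how present-day Python answers the three questions
(count, find, and the "in" operator); the variable names are only for
illustration.

```python
# Empty-string behaviour of the string operations discussed in this thread.
text = "abc"

# The empty string is found at every boundary: before 'a', between the
# letters, and after 'c', which gives len(text) + 1 occurrences.
empty_count = text.count("")

# find() reports the first boundary, index 0, rather than -1.
empty_index = text.find("")

# The "in" operator for substrings agrees with find() != -1.
empty_contained = "" in text

print(empty_count, empty_index, empty_contained)
```

All three answers are consistent with treating the empty string as a
substring of every string, which is the same convention as the empty
path being a subpath of every path.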

From  Tue Aug  6 17:29:21 2002
From: (Eric S Raymond)
Date: Tue, 6 Aug 2002 12:29:21 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico> <> <008901c23d48$73b86840$6d94fea9@newmexico> <>
Message-ID: <>

Guido van Rossum <>:
> The std lib is probably low on string processing ops compared to many
> real apps.

Yes, it is.  I've noticed this myself.
		Eric S. Raymond

From  Tue Aug  6 18:09:15 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 13:09:15 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Tue, 06 Aug 2002 10:16:07 EDT."
References: <001f01c23d2b$a81a9e40$6d94fea9@newmexico> <>
Message-ID: <>

I think we've argued about '' in 'abc' long enough.  Tim has failed to
convince me, so '' in 'abc' returns True.  Barry has checked it all

(In other news, I've checked in Oren's latest patch for making a file
its own iterator.  In the process, the xreadlines module has become

--Guido van Rossum (home page:

From  Tue Aug  6 20:35:50 2002
From: (Martin v. Loewis)
Date: 06 Aug 2002 21:35:50 +0200
Subject: [Python-Dev] The memo of pickle
Message-ID: <>

pickle currently puts tuples into the memo on pickling, but only ever
uses the position field ([0]), never the object itself ([1]).

I understand that the reference to the object is needed to keep it
alive while pickling.

Unfortunately, this means one needs to allocate 36 bytes for the

I think this memory consumption could be reduced by saving the objects
in a list, and only saving the position in the memo dictionary. That
would save roughly 32 bytes per memoized object, assuming there is no
malloc overhead.
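
A rough sketch of the two layouts may help to compare them. The function
names below are hypothetical and are not taken from pickle.py; the first
mimics the tuple-per-object memo, the second the proposed
list-plus-position memo.

```python
# Sketch of the two memo layouts under discussion (names are illustrative,
# not taken from pickle.py).

def memo_with_tuples(objects):
    """Current layout: id(obj) -> (position, obj)."""
    memo = {}
    for obj in objects:
        if id(obj) not in memo:
            memo[id(obj)] = (len(memo), obj)
    return memo

def memo_with_list(objects):
    """Proposed layout: id(obj) -> position, objects kept alive in a list."""
    memo = {}
    keep_alive = []
    for obj in objects:
        if id(obj) not in memo:
            memo[id(obj)] = len(keep_alive)
            keep_alive.append(obj)
    return memo, keep_alive

shared = [1, 2, 3]
data = [shared, {"k": "v"}, shared]   # unhashable objects, one of them repeated

tuple_memo = memo_with_tuples(data)
int_memo, keep_alive = memo_with_list(data)
```

In both layouts the object stays referenced for the duration of the
pickling run; the second one pays for one list slot per object instead
of one tuple per object.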

What do you think?


P.S. It would be even more efficient if there was an identity

From  Tue Aug  6 20:49:03 2002
From: (Guido van Rossum)
Date: Tue, 06 Aug 2002 15:49:03 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Tue, 06 Aug 2002 21:35:50 +0200."
References: <>
Message-ID: <>

> pickle currently puts tuples into the memo on pickling, but only ever
> uses the position field ([0]), never the object itself ([1]).
> I understand that the reference to the object is needed to keep it
> alive while pickling.
> Unfortunately, this means one needs to allocate 36 bytes for the
> tuple.
> I think this memory consumption could be reduced by saving the objects
> in a list, and only saving the position in the memo dictionary. That
> would save roughly 32 bytes per memoized object, assuming there is no
> malloc overhead.
> What do you think?

Is it worth it?  Have you made a patch?  What use case are you
thinking of?

> Regards,
> Martin
> P.S. It would be even more efficient if there was an identity
> dictionary.

Sorry, what's an identity dict?

--Guido van Rossum (home page:

From  Tue Aug  6 21:31:41 2002
From: (M.-A. Lemburg)
Date: Tue, 06 Aug 2002 22:31:41 +0200
Subject: [Python-Dev] The memo of pickle
References: <>
Message-ID: <>

Martin v. Loewis wrote:
> pickle currently puts tuples into the memo on pickling, but only ever
> uses the position field ([0]), never the object itself ([1]).
> I understand that the reference to the object is needed to keep it
> alive while pickling.
> Unfortunately, this means one needs to allocate 36 bytes for the
> tuple.
> I think this memory consumption could be reduced by saving the objects
> in a list, and only saving the position in the memo dictionary. That
> would save roughly 32 bytes per memoized object, assuming there is no
> malloc overhead.

While that may save you some bytes, wouldn't it break pickle
subclasses using the memo as well ?

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug  6 22:21:37 2002
From: (Martin v. Loewis)
Date: 06 Aug 2002 23:21:37 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> While that may save you some bytes, wouldn't it break pickle
> subclasses using the memo as well ?

Yes. Are there such things?


From  Tue Aug  6 22:35:13 2002
From: (M.-A. Lemburg)
Date: Tue, 06 Aug 2002 23:35:13 +0200
Subject: [Python-Dev] The memo of pickle
References: <>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>>While that may save you some bytes, wouldn't it break pickle
>>subclasses using the memo as well ?
> Yes. Are there such things?

Sure. I use pickle subclasses with hooks for various special
object types a lot in my applications... would be nice if
I could start subclassing cPickles sometime in the future :-)

Marc-Andre Lemburg
CEO Software GmbH

From  Tue Aug  6 22:49:12 2002
From: (Jack Jansen)
Date: Tue, 6 Aug 2002 23:49:12 +0200
Subject: [Python-Dev] cvs crash while updating Doc
Message-ID: <>

I've stared at this for a while now, and I'm out of good ideas, 
so if anyone has any idea how to debug this please let me know.

As of recently I can't do a full checkout of the Python sources 
anymore *with MacCVS Pro on Mac OS 9*. (distant rumbling and 
cursing of MacCVS Pro is heard)

What happens is that the cvs *server* aborts with a signal 11 
while trying to check out Doc/pyexpat.tex. Of course, if I try 
with a different CVS client the server happily checks the file 
out, otherwise I wouldn't be bothering you. And I inspected the 
last few revisions of pyexpat.tex and there's no obvious changes 
that I can imagine would blow up a cvs server.

I can get rid of MacCVS Pro and switch back to the 
much-more-pro-in-my-mind MacCVS (as it supports ssh nowadays, 
finally) but that'll be a hassle, so if anyone has any bright 
ideas please fire away!
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Tue Aug  6 22:52:22 2002
From: (Martin v. Loewis)
Date: 06 Aug 2002 23:52:22 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> Sure. I use pickle subclasses with hooks for various special
> object types a lot in my applications... 

Can you provide the source of one such subclass?


From  Tue Aug  6 23:07:01 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 00:07:01 +0200
Subject: [Python-Dev] cvs crash while updating Doc
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> What happens is that the cvs *server* aborts with a signal 11 while
> trying to check out Doc/pyexpat.tex. Of course, if I try with a
> different CVS client the server happily checks the file out, otherwise
> I wouldn't be bothering you. And I inspected the last few revisions of
> pyexpat.tex and there's no obvious changes that I can imagine would
> blow up a cvs server.

If you want to investigate this in detail, you can download the CVS
archive from SF, and try to replicate the problem locally.


From  Tue Aug  6 23:20:20 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 00:20:20 +0200
Subject: [Python-Dev] The memo of pickle
References: <>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>>Sure. I use pickle subclasses with hooks for various special
>>object types a lot in my applications... 
> Can you provide the source of one such subclass?

No, they are closed-source. But the idea should be obvious:
I want to pickle the various mx types faster than by
relying on the reduce mechanism.

Marc-Andre Lemburg
CEO Software GmbH

From  Tue Aug  6 23:30:18 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 00:30:18 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> >>Sure. I use pickle subclasses with hooks for various special
> >> object types a lot in my applications...
> > Can you provide the source of one such subclass?
> No, they are closed-source. But the idea should be obvious:
> I want to pickle the various mx types faster than by
> relying on the reduce mechanism.

Ok. I think I could make this change without breaking your code:
Subclasses won't read the memo; they will only write to it - is the only place that ever reads the memo.

So subclasses could safely put tuples into the dictionary; the base
class would then look for either tuples or numbers.
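
A sketch of what such a tolerant lookup might look like, assuming the
reader only needs the position. The helper name memo_position is made up
for this illustration and does not exist in pickle.py.

```python
def memo_position(entry):
    """Return the position stored in a memo entry.

    Accepts both the old (position, obj) tuple and a bare position.
    """
    if isinstance(entry, tuple):
        return entry[0]
    return entry

memo = {}
anchor = ["kept alive by the tuple"]
other = {}
memo[id(anchor)] = (0, anchor)   # old style, as a subclass might write it
memo[id(other)] = 1              # new style, position only

positions = sorted(memo_position(e) for e in memo.values())
```

With a helper like this, entries written by existing subclasses and
entries written by a changed base class could coexist in one memo.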


From  Tue Aug  6 23:36:52 2002
From: (Steve Holden)
Date: Tue, 6 Aug 2002 18:36:52 -0400
Subject: [Python-Dev] cvs crash while updating Doc
References: <>
Message-ID: <012f01c23d99$c37d3ad0$>

----- Original Message -----
From: "Jack Jansen" <>
To: <>
Sent: Tuesday, August 06, 2002 5:49 PM
Subject: [Python-Dev] cvs crash while updating Doc

> Folks,
> I've stared at this for a while now, and I'm out of good ideas,
> so if anyone has any idea how to debug this please let me know.
> As of recently I can't do a full checkout of the Python sources
> anymore *with MacCVS Pro on Mac OS 9*. (distant rumbling and
> cursing of MacCVS Pro is heard)
> What happens is that the cvs *server* aborts with a signal 11
> while trying to check out Doc/pyexpat.tex. Of course, if I try
> with a different CVS client the server happily checks the file
> out, otherwise I wouldn't be bothering you. And I inspected the
> last few revisions of pyexpat.tex and there's no obvious changes
> that I can imagine would blow up a cvs server.
> I can get rid of MacCVS Pro and switch back to the
> much-more-pro-in-my-mind MacCVS (as it supports ssh nowadays,
> finally) but that'll be a hassle, so if anyone has any bright
> ideas please fire away!

I wonder if this could be the reason I'm currently seeing

cvs server: [15:36:32] waiting for jackjansen's lock in
cvs server: [15:37:02] waiting for jackjansen's lock in
cvs server: [15:37:32] waiting for jackjansen's lock in

as I try and check a small change in to the library documentation? Looks
like I'll have to try later.

Steve Holden                       
Python Web Programming      

From  Wed Aug  7 03:03:25 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 14:03:25 +1200 (NZST)
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: <>
Message-ID: <>

Christian Tismer <>:

> The reason why I want to have extra data and function
> caches in the types is that this is *very* memory
> efficient, in comparison to stuffing things into the
> instances (which would be easy to implement).

Maybe you misunderstood -- the stuff I was talking
about *would* go in the type, not in the instances.
I was suggesting a generalisation of the way the
type object keeps some of its slots in extra
structures, and allowing you to add more such

> With meta-types, it now happens that
> types become instances, but types refuse to have slots.
> This needs to be changed, everything else is a workaround.

Yes, that would be more elegant, if it could be done.
I haven't looked closely enough at exactly why types
can't have slots to know how difficult it would be.
Maybe it's not difficult, in which case my suggestion
is unnecessary.


From  Wed Aug  7 03:08:43 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 14:08:43 +1200 (NZST)
Subject: [Python-Dev] Simpler reformulation of C inheritance Q.
In-Reply-To: <>
Message-ID: <>

Christian Tismer <>:

> I'm not absolutely sure what I meant.

It sounds to me like Christian wants to be able to extend
the typeobject with new built-in method slots. Ideally
these would behave just like the existing ones, to the
extent of PyType_Ready generating Python wrappers for
them automatically.


From  Wed Aug  7 03:27:04 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 14:27:04 +1200 (NZST)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <008901c23d48$73b86840$6d94fea9@newmexico>
Message-ID: <>

> PS: is pure substring testing such a common idiom?
> I have not found so many
> matches for   find\(.*\)\s*>  in the std lib

For more generality, maybe

  re in string

should be made to work too, where re is a regular
expression object?

Or would that be starting on a slippery slope
towards Perl...?-)


From  Wed Aug  7 03:57:05 2002
From: (
Date: Tue, 6 Aug 2002 21:57:05 -0500
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <008901c23d48$73b86840$6d94fea9@newmexico> <>
Message-ID: <>

On Wed, Aug 07, 2002 at 02:27:04PM +1200, Greg Ewing wrote:
> > PS: is pure substring testing such a common idiom?
> > I have not found so many
> > matches for   find\(.*\)\s*>  in the std lib
> For more generality, maybe
>   re in string
> should be made to work too, where re is a regular
> expression object?

Surely the re is the thing that expresses a set of strings ...
    string in re
would be the same as
    re.match(string)

Oh, so
    re in string
would be

Clear to me!

> Or would that be starting on a slippery slope
> towards Perl...?-)



From  Wed Aug  7 03:56:47 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 14:56:47 +1200 (NZST)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> Surely the re is the thing that expresses a set of strings ...
>     string in re
> would be the same as
>     re.match(string)
> Oh, so
>     re in string
> would be
> right?

That distinction might just be a tad too subtle...


From  Wed Aug  7 03:58:10 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 14:58:10 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> I think this memory consumption could be reduced by saving the objects
> in a list, and only saving the position in the memo dictionary.

Do you need the list at all? Won't the object be kept
alive by the fact that it's a key in the dictionary?


From  Wed Aug  7 04:31:04 2002
From: (Tim Peters)
Date: Tue, 6 Aug 2002 23:31:04 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

    re1 in re2

should be True iff the language accepted by re1 is a subset of the language
accepted by re2.  In this case, it's OK to consider the empty language a
subset of all others, since nobody will be able to make head or tail out of
the code anyway.

flexible-to-a-fault-ly y'rs  - tim

From  Wed Aug  7 04:32:26 2002
From: (Tim Peters)
Date: Tue, 6 Aug 2002 23:32:26 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

[Greg Ewing]
> Do you need the list at all? Won't the object be kept
> alive by the fact that it's a key in the dictionary?

The object's id() (address) is the key.  Else only hashable objects could be

From  Wed Aug  7 04:51:47 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 15:51:47 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> The object's id() (address) is the key.  Else only hashable objects
> could be pickled.

Hmmm, I see.

It occurs to me that what you really want here is a special
kind of dictionary that uses "is" instead of "==" to compare
key values.

Or is that what was meant by an "identity dictionary"?


From  Wed Aug  7 05:04:51 2002
From: (Skip Montanaro)
Date: Tue, 6 Aug 2002 23:04:51 -0500
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed to work?
Message-ID: <15696.40035.180993.654851@localhost.localdomain>

The following simple session suggests I misunderstood how the
codecs.EncodedFile function should work:

    >>> import codecs
    >>> f = codecs.EncodedFile(open("unicode-test.txt", "w"), "utf-8")
    >>> s = 'Caffe\x92 Lena'
    >>> u = unicode(s, "cp1252")
    >>> u
    u'Caffe\u2019 Lena'
    >>> f.write(u.encode("utf-8"))
    >>> f.write(u)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.3/", line 453, in write
        data, bytesdecoded = self.decode(data, self.errors)
    UnicodeError: ASCII encoding error: ordinal not in range(128)

I thought the whole purpose of the EncodedFile class was to provide
transparent encoding.  Shouldn't it support transparent encoding of Unicode
objects?  That is, I told the system I want writes to be in utf-8 when I
instantiated the class.  I don't think I should have to call .encode()
directly.  I realize I can wrap the function in a class that adds the
transparency I desire, but it seems the whole point should be to make it
easy to write Unicode objects to files.


From  Wed Aug  7 05:09:01 2002
From: (Tim Peters)
Date: Wed, 7 Aug 2002 00:09:01 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> The object's id() (address) is the key.  Else only hashable objects
> could be pickled.

[Greg Ewing]
> Hmmm, I see.
> It occurs to me that what you really want here is a special
> kind of dictionary that uses "is" instead of "==" to compare
> key values.

Possibly.  The *effect* of that could be gotten via a wrapper object, like

class KeyViaId:
    def __init__(self, obj):
        self.obj = obj
    def __hash__(self):
        return hash(id(self.obj))
    def __eq__(self, other):
        return self.obj is other.obj

but if Martin is worried about two-tuple sizes, he's not going to fall in
love with that.
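
For illustration, here is how the wrapper might be used as a dictionary
key. The class is repeated so the snippet runs on its own; the memo
dictionary and the two lists are only an example, not pickle's actual
memo.

```python
class KeyViaId:
    """Wrap an object so dictionaries compare it by identity."""
    def __init__(self, obj):
        self.obj = obj
    def __hash__(self):
        return hash(id(self.obj))
    def __eq__(self, other):
        return self.obj is other.obj

a = [1, 2, 3]
b = [1, 2, 3]          # equal to a, but a different object

memo = {}
memo[KeyViaId(a)] = 0  # lists are unhashable, the wrapper is not
memo[KeyViaId(b)] = 1  # a separate entry, because b is not a

position_of_a = memo[KeyViaId(a)]
position_of_b = memo[KeyViaId(b)]
```

The wrapper also holds a reference to the wrapped object, so the object
stays alive as long as the key does, at the price of one extra instance
per entry.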

> Or is that what was meant by an "identity dictionary"?

Guido asked, but if Martin answered that question I haven't seen it yet.

From  Wed Aug  7 05:29:08 2002
From: (Greg Ewing)
Date: Wed, 07 Aug 2002 16:29:08 +1200 (NZST)
Subject: Is-dict? (RE: [Python-Dev] The memo of pickle)
In-Reply-To: <>
Message-ID: <>

> The *effect* of that could be gotten via a wrapper object
> ...
> but if Martin is worried about two-tuple sizes, he's not going to fall in
> love with that.

Indeed, which is why I suggested it.

I was wondering whether it would be worth putting one of
these in the standard library. I could have used one in
Plex, when I wanted to map a dictionary to something else
by identity, and I wanted to do it as fast as possible.


From  Wed Aug  7 07:15:56 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 08:15:56 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

Greg Ewing <> writes:

> It occurs to me that what you really want here is a special
> kind of dictionary that uses "is" instead of "==" to compare
> key values.
> Or is that what was meant by an "identity dictionary"?

Yes; that would be a dictionary that uses identity instead of


From  Wed Aug  7 07:35:08 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 08:35:08 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Is it worth it?  

If you believe that the problem is real, yes.

> Have you made a patch?  

Not yet, no. With Marc's objection, it is more difficult than I

> What use case are you thinking of?

People repeatedly complain that pickle consumes too much memory. The
most recent instance was

Earlier reports are

> Sorry, what's an identity dict?

IdentityDictionary is the name of a Smalltalk class that uses identity
instead of equality when comparing keys:

In Python, it would allow arbitrary objects as keys, and allow equal
duplicates as different keys. For pickle, this would mean that we
could save both the creation of the id() object (since the object
itself is used as a key), and the creation of the tuple (since the
value is only the position).
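
The semantics can be sketched in pure Python. This is only an emulation
under the assumption that a mapping keyed on id() is acceptable: the
class name IdentityDict is borrowed from the Smalltalk name, and this
version still allocates the id() integer and a tuple per entry, so the
savings described above would need an implementation in C.

```python
class IdentityDict:
    """Mapping that compares keys with "is" instead of "==" (a sketch)."""

    def __init__(self):
        # id(key) -> (key, value); holding the key keeps it alive, so its
        # id() cannot be reused by another object while it is stored here.
        self._items = {}

    def __setitem__(self, key, value):
        self._items[id(key)] = (key, value)

    def __getitem__(self, key):
        try:
            return self._items[id(key)][1]
        except KeyError:
            raise KeyError(key) from None

    def __contains__(self, key):
        return id(key) in self._items

    def __len__(self):
        return len(self._items)

x = [1]
y = [1]                 # equal to x, but a distinct object
d = IdentityDict()
d[x] = "first"
d[y] = "second"
```

Here x and y are unhashable and compare equal, yet they occupy two
separate entries, which is the behaviour a memo keyed by identity needs.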


From  Wed Aug  7 07:46:59 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 08:46:59 +0200
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed to work?
In-Reply-To: <15696.40035.180993.654851@localhost.localdomain>
References: <15696.40035.180993.654851@localhost.localdomain>
Message-ID: <>

Skip Montanaro <> writes:

> I thought the whole purpose of the EncodedFile class was to provide
> transparent encoding.  

    """ Return a wrapped version of file which provides transparent
        encoding translation.

        Strings written to the wrapped file are interpreted according
        to the given data_encoding and then written to the original
        file as string using file_encoding. The intermediate encoding
        will usually be Unicode but depends on the specified codecs.

        Strings are read from the file using file_encoding and then
        passed back to the caller as string using data_encoding.

        If file_encoding is not given, it defaults to data_encoding.
    """

So, no. It provides transparent recoding: with a file encoding, and a
data encoding.

I never found this class useful.

What you want is a StreamWriter:

f = codecs.getwriter('utf-8')(open('unicode-test', 'w'))

Of course, *this* specific case can be written much easier as

f = codecs.open('unicode-test', 'w', encoding = 'utf-8')

The getwriter case is useful if you already got a file-like object
from somewhere.
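
A small self-contained sketch of the StreamWriter approach, written for
current Python 3, where text is str and the underlying stream is binary.
An io.BytesIO object stands in for an already-open file; the sample text
is the one from the original question.

```python
import codecs
import io

raw = io.BytesIO()                       # stands in for an open binary file
writer = codecs.getwriter("utf-8")(raw)  # StreamWriter that encodes on write()

writer.write("Caffe\u2019 Lena")         # text in, UTF-8 bytes out
encoded = raw.getvalue()
```

The caller never calls .encode() itself; the writer does it on every
write(), which is the transparency the original question asked for.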

> Shouldn't it support transparent encoding of Unicode
> objects?  That is, I told the system I want writes to be in utf-8 when I
> instantiated the class.  

You told it also that input data are in utf-8, as you have omitted the

> I don't think I should have to call .encode() directly.  I realize I
> can wrap the function in a class that adds the transparency I
> desire, but it seems the whole point should be to make it easy to
> write Unicode objects to files.

Not this class, no. 

Now, you may ask what else is the purpose of this class. I really
don't know - it is against everything I'm advocating, as it assumes
that you have byte strings in a certain encoding in your memory that
you want to save in a different encoding. That should never happen -
all your text data should be Unicode strings.


From  Wed Aug  7 07:47:38 2002
From: (Brett Cannon)
Date: Tue, 6 Aug 2002 23:47:38 -0700 (PDT)
Subject: [Python-Dev] python-dev summaries?
Message-ID: <Pine.SOL.4.44.0208062346170.16439-100000@death.OCF.Berkeley.EDU>

I pretty much know the answer to this vague question is, "no one is doing
them at the moment/anymore", but I thought I would ask in case someone is
and I am completely oblivious to them being sent out to the list and
Google can't find any of them.

-Brett C.

From  Wed Aug  7 08:38:59 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 09:38:59 +0200
Subject: [Python-Dev] The memo of pickle
References: <>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> Guido van Rossum <> writes:
>>Is it worth it?  
> If you believe that the problem is real, yes.

I think that the tuple is not the problem here, it's the
fact that so many objects are recorded in the memo to
later rebuild recursive structures.

Now, I believe that recursive structures in pickles are
not very common, so the memo is mostly useless in these

Perhaps pickle could grow an option to assume that a
data structure is non-recursive ?! In that case, no
data would be written to the memo (or only the id()
mapped to 1 to double-check).

Marc-Andre Lemburg
CEO Software GmbH

From  Wed Aug  7 08:46:55 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 09:46:55 +0200
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed
 to work?
References: <15696.40035.180993.654851@localhost.localdomain> <>
Message-ID: <>

Martin v. Loewis wrote:
> Skip Montanaro <> writes:
>>I thought the whole purpose of the EncodedFile class was to provide
>>transparent encoding.  
>     """ Return a wrapped version of file which provides transparent
>         encoding translation.
>         Strings written to the wrapped file are interpreted according
>         to the given data_encoding and then written to the original
>         file as string using file_encoding. The intermediate encoding
>         will usually be Unicode but depends on the specified codecs.
>         Strings are read from the file using file_encoding and then
>         passed back to the caller as string using data_encoding.
>         If file_encoding is not given, it defaults to data_encoding.
>     """
> So, no. It provides transparent recoding: with a file encoding, and a
> data encoding.
> I never found this class useful.

It's not a class, just a helper for StreamRecoder. Its purpose
is to provide an easy way of saying "the inside world is encoding
X while the outside world uses Y":

     # Make stdout translate Latin-1 output into UTF-8 output
     sys.stdout = EncodedFile(sys.stdout, 'latin-1', 'utf-8')

     # Have stdin translate UTF-8 input into Latin-1 input
     sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')

Here the inside world uses Latin-1 while the outside world
uses UTF-8.
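
The same recoding can be tried without touching sys.stdout. This is a
sketch for current Python 3, with io.BytesIO objects standing in for the
outside file; the inside world hands over Latin-1 byte strings and the
file ends up holding UTF-8.

```python
import codecs
import io

outside = io.BytesIO()   # the "outside world", stored as UTF-8
recoder = codecs.EncodedFile(outside, data_encoding="latin-1",
                             file_encoding="utf-8")

recoder.write(b"caf\xe9")        # Latin-1 bytes from the "inside world"
stored = outside.getvalue()      # the same text, now as UTF-8 bytes

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
reader = codecs.EncodedFile(io.BytesIO(stored), data_encoding="latin-1",
                            file_encoding="utf-8")
read_back = reader.read()
```

Note that both sides of the wrapper deal in encoded byte strings; text
objects never cross the interface, which is why writing a Unicode object
to it fails.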

You could also use it to talk to a gzipped file or, provided
you have such a codec, to an encrypted file.

Marc-Andre Lemburg
CEO Software GmbH

From  Wed Aug  7 12:20:48 2002
From: (Steve Holden)
Date: Wed, 7 Aug 2002 07:20:48 -0400
Subject: [Python-Dev] CVS fails to commit
Message-ID: <023201c23e04$7b94b610$>

I find that this morning I am still prevented from committing changes to


Is this a problem that's only affecting a small portion of the repository,
or is it more general? To repeat yesterday's notification, the error message
I'm seeing is:

    cvs server: [04:14:04] waiting for jackjansen's lock in

locked-out-ly y'rs  - steve
Steve Holden                       
Python Web Programming      

From  Wed Aug  7 12:35:19 2002
From: (Sjoerd Mullender)
Date: Wed, 07 Aug 2002 13:35:19 +0200
Subject: [Python-Dev] CVS fails to commit
In-Reply-To: <023201c23e04$7b94b610$>
References: <023201c23e04$7b94b610$>
Message-ID: <>

It looks like Jack's problems caused a lock file to be stuck there.  I
expect this affects a small part of the repository, and also that it
needs manual intervention to correct the problem.  So please submit a
service request to SourceForge to remove the lock file(s) in
/cvsroot/python/python/dist/src/Doc/lib (and all subdirectories).

On Wed, Aug 7 2002 "Steve Holden" wrote:

> I find that this morning I am still prevented from committing changes to
>     ~/pythoncvs/python/dist/src/Doc/lib/libposixpath.tex
> Is this a problem that's only affecting a small portion of the repository,
> or is it more general? To repeat yesterday's notification, the error message
> I'm seeing is:
>     cvs server: [04:14:04] waiting for jackjansen's lock in
> /cvsroot/python/python/dist/src/Doc/lib
> locked-out-ly y'rs  - steve
> -----------------------------------------------------------------------
> Steve Holden                       
> Python Web Programming      
> -----------------------------------------------------------------------

-- Sjoerd Mullender <>

From  Wed Aug  7 13:02:10 2002
From: (Steve Holden)
Date: Wed, 7 Aug 2002 08:02:10 -0400
Subject: [Python-Dev] CVS fails to commit
References: <023201c23e04$7b94b610$>  <>
Message-ID: <027b01c23e0a$4317e630$>

----- Original Message -----
From: "Sjoerd Mullender" <>
> It looks like Jack's problems caused a lock file to be stuck there.  I
> expect this affects a small part of the repository, and also that it
> needs manual intervention to correct the problem.  So please submit a
> service request to SourceForge to remove the lock file(s) in
> /cvsroot/python/python/dist/src/Doc/lib (and all subdirectories).
> On Wed, Aug 7 2002 "Steve Holden" wrote:
> > I find that this morning I am still prevented from committing changes to
> >
> >     ~/pythoncvs/python/dist/src/Doc/lib/libposixpath.tex
> >
> > Is this a problem that's only affecting a small portion of the repository,
> > or is it more general? To repeat yesterday's notification, the error message
> > I'm seeing is:
> >
> >     cvs server: [04:14:04] waiting for jackjansen's lock in
> > /cvsroot/python/python/dist/src/Doc/lib
> >

Whether by accident or in response to my posting I don't know, but I could
commit within five minutes of making the support request. Kudos to
SourceForge on this one?

Steve Holden                       
Python Web Programming      

From  Wed Aug  7 13:11:03 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 08:11:03 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Wed, 07 Aug 2002 09:38:59 +0200."
References: <> <> <>
Message-ID: <>

> I think that the tuple is not the problem here, it's the
> fact that so many objects are recorded in the memo to
> later rebuild recursive structures.
> Now, I believe that recursive structures in pickles are
> not very common, so the memo is mostly useless in these
> cases.

Use cPickle, it's much more frugal with the memo, and also has some
options to control the memo (read the docs, I forget the details and
am in a hurry).

> Perhaps pickle could grow an option to assume that a
> data structure is non-recursive ?! In that case, no
> data would be written to the memo (or only the id()
> mapped to 1 to double-check).

The memo is also for sharing.  There's no recursion in this example,
but the sharing may be important:

a = [1,2,3]
b = [a,a,a]
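
The sharing point can be checked directly. This is a minimal sketch, assuming a current Python 3 pickle module rather than the 2.2-era one under discussion; the names `restored` and `sharing_preserved` are only for illustration.

```python
import pickle

a = [1, 2, 3]
b = [a, a, a]  # three references to one list, no recursion

# The memo lets the pickler emit `a` once and refer back to it afterwards.
restored = pickle.loads(pickle.dumps(b))

# After the round trip all three slots still hold one and the same object.
sharing_preserved = restored[0] is restored[1] is restored[2]
print(sharing_preserved)
```

Without the memo, the three slots would come back as three equal but separate lists.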

--Guido van Rossum (home page:

From  Wed Aug  7 14:48:22 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 15:48:22 +0200
Subject: [Python-Dev] The memo of pickle
References: <> <> <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>I think that the tuple is not the problem here, it's the
>>fact that so many objects are recorded in the memo to
>>later rebuild recursive structures.
>>Now, I believe that recursive structures in pickles are
>>not very common, so the memo is mostly useless in these
>>cases.
> Use cPickle, it's much more frugal with the memo, and also has some
> options to control the memo (read the docs, I forget the details and
> am in a hurry).

Just to clarify: I don't have a problem with the memo
in pickle at all :-) Martin brought up this issue.

>>Perhaps pickle could grow an option to assume that a
>>data structure is non-recursive ?! In that case, no
>>data would be written to the memo (or only the id()
>>mapped to 1 to double-check).
> The memo is also for sharing.  There's no recursion in this example,
> but the sharing may be important:
> a = [1,2,3]
> b = [a,a,a]

Right. I don't think these references are too common in pickles.
Zope Corp should know much more about this, I guess, since ZODB
is all about pickling.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Wed Aug  7 15:17:39 2002
From: (Andrew Koenig)
Date: 07 Aug 2002 10:17:39 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <>

Tim>     re1 in re2

Tim> should be True iff the language accepted by re1 is a subset of
Tim> the language accepted by re2.  In this case, it's OK to consider
Tim> the empty language a subset of all others, since nobody will be
Tim> able to make head or tail out of the code anyway.

Note the distinction between the empty language and the empty string.
As a language is a set of strings, the empty language is one that
contains no strings, not even the empty string.  Therefore, a regular
expression that accepts the empty language is one that rejects every
string, even the empty string.
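
The distinction can be made concrete with Python's own re module. This is only a sketch: the two patterns and the sample strings are chosen here for illustration and do not come from the thread.

```python
import re

# Accepts exactly the language {""}: only the empty string matches in full.
empty_string_only = re.compile(r"")

# Accepts the empty language {}: the negative lookahead (?!) can never
# succeed, so no string matches, not even "".
empty_language = re.compile(r"(?!)")

samples = ["", "a", "ab"]
accepted_by_empty_string_only = [s for s in samples if empty_string_only.fullmatch(s)]
accepted_by_empty_language = [s for s in samples if empty_language.fullmatch(s)]

print(accepted_by_empty_string_only, accepted_by_empty_language)
```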

Pedantically y'rs    --ark

From  Wed Aug  7 15:50:45 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 10:50:45 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Wed, 07 Aug 2002 15:48:22 +0200."
References: <> <> <> <> <>
Message-ID: <>

> >>Perhaps pickle could grow an option to assume that a
> >>data structure is non-recursive ?! In that case, no
> >>data would be written to the memo (or only the id()
> >>mapped to 1 to double-check).
> > 
> > The memo is also for sharing.  There's no recursion in this example,
> > but the sharing may be important:
> > 
> > a = [1,2,3]
> > b = [a,a,a]
> Right. I don't think these references are too common in pickles.

I think they are.

> Zope Corp should know much more about this, I guess, since ZODB
> is all about pickling.

Sharing object references is essential in Zope.  But only to certain
objects; sharing strings and numbers is not important, and I believe
cPickle doesn't put those in the memo, while puts
essentially everything in the memo...

--Guido van Rossum (home page:

From  Wed Aug  7 15:58:12 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 16:58:12 +0200
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Python compile.c,2.250,2.251
References: <>	<> <>
Message-ID: <>

Martin v. Löwis wrote:
> "M.-A. Lemburg" <> writes:
>>>+ #ifndef Py_USING_UNICODE
>>>+ 	abort();
>>>+ #else
>>Shouldn't this be a call to Py_FatalError() with a proper
>>error message ?
> What is the guideline for when to use abort, and when to use
> Py_FatalError?

Looking at the code for Py_FatalError(), I'd say always use
this instead of calling abort directly, except maybe for
situations where you don't want anything printed.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Wed Aug  7 15:51:22 2002
From: (David Abrahams)
Date: Wed, 7 Aug 2002 10:51:22 -0400
Subject: [Python-Dev] docstrings, help(), and __name__
Message-ID: <086901c23e21$e7bea8b0$>

I've recently been implementing docstring support for Boost.Python
extension classes (and in particular, their methods). I have a callable
type which wraps all C++ functions and member functions -- it basically
looks like a minimal subset of Python's function type, with a tp_descr_get
slot which does the same thing that funcobject.c's func_descr_get() does:

    static PyObject *
    function_descr_get(PyObject *func, PyObject *obj, PyObject *type_)
    {
        if (obj == Py_None)
            obj = NULL;
        return PyMethod_New(func, obj, type_);
    }

So I just recently added a descriptor for the "__doc__" string attribute,
and I thought I'd try help() on one of these methods:

Failure in example: help(X)
from line #2 of __main__
Exception raised:
Traceback (most recent call last):
  File "", line 430, in _run_examples_inner
    compileflags, 1) in globs
  File "<string>", line 1, in ?
  File "c:\tools\python-2.2.1\lib\", line 279, in __call__
    return*args, **kwds)
  File "c:\tools\python-2.2.1\lib\", line 1510, in __call__
  File "c:\tools\python-2.2.1\lib\", line 1546, in help
    else: doc(request, 'Help on %s:')
  File "c:\tools\python-2.2.1\lib\", line 1341, in doc
    pager(title % (desc + suffix) + '\n\n' + text.document(thing, name))
  File "c:\tools\python-2.2.1\lib\", line 268, in document
    if inspect.isclass(object): return apply(self.docclass, args)
  File "c:\tools\python-2.2.1\lib\", line 1093, in docclass
    lambda t: t[1] == 'method')
  File "c:\tools\python-2.2.1\lib\", line 1035, in spill
    name, mod, object))
  File "c:\tools\python-2.2.1\lib\", line 269, in document
    if inspect.isroutine(object): return apply(self.docroutine, args)
  File "c:\tools\python-2.2.1\lib\", line 1116, in docroutine
    realname = object.__name__
AttributeError: 'Boost.Python.function' object has no attribute '__name__'

It seems I'm breaking some protocol. It's easy enough to add a '__name__'
attribute to my function objects, but I'd like to be sure that I'm adding
everything I really /should/ add. Just how much like a regular Python
function does my function have to be in order to make the help system (and
other standard systems with such expectations) happy?


           David Abrahams * Boost Consulting *

From  Wed Aug  7 16:36:21 2002
From: (Michael Hudson)
Date: 07 Aug 2002 16:36:21 +0100
Subject: [Python-Dev] docstrings, help(), and __name__
In-Reply-To: "David Abrahams"'s message of "Wed, 7 Aug 2002 10:51:22 -0400"
References: <086901c23e21$e7bea8b0$>
Message-ID: <>

"David Abrahams" <> writes:

> [function-like object with no __name__ breaks pydoc]
> It seems I'm breaking some protocol. It's easy enough to add a '__name__'
> attribute to my function objects, but I'd like to be sure that I'm adding
> everything I really /should/ add.

I am fairly certain the protocols inspect uses are not written down
anywhere.  I think they're defined entirely by the implementation.

> Just how much like a regular Python function does my function have
> to be in order to make the help system (and other standard systems
> with such expectations) happy?

"Use the source, Luke."  Not a good answer, but probably the only one.

I guess inspect thinks your object looks like a method descriptor?  It
certainly seems to think it's a "routine" whatever that means...


  Famous remarks are very seldom quoted correctly.
                                                    -- Simeon Strunsky

From  Wed Aug  7 16:50:24 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 11:50:24 -0400
Subject: [Python-Dev] docstrings, help(), and __name__
In-Reply-To: Your message of "Wed, 07 Aug 2002 10:51:22 EDT."
References: <086901c23e21$e7bea8b0$>
Message-ID: <>

> I've recently been implementing docstring support for Boost.Python
> extension classes (and in particular, their methods). I have a callable
> type which wraps all C++ functions and member functions -- it basically
> looks like a minimal subset of Python's function type, with a tp_descr_get
> slot which does the same thing that funcobject.c's func_descr_get() does:
>     static PyObject *
>     function_descr_get(PyObject *func, PyObject *obj, PyObject *type_)
>     {
>         if (obj == Py_None)
>             obj = NULL;
>         return PyMethod_New(func, obj, type_);
>     }
> So I just recently added a descriptor for the "__doc__" string attribute,
> and I thought I'd try help() on one of these methods:
> *****************************************************************
> Failure in example: help(X)
> from line #2 of __main__
> Exception raised:
> Traceback (most recent call last):
>   File "", line 430, in _run_examples_inner
>     compileflags, 1) in globs
>   File "<string>", line 1, in ?
>   File "c:\tools\python-2.2.1\lib\", line 279, in __call__
>     return*args, **kwds)
>   File "c:\tools\python-2.2.1\lib\", line 1510, in __call__
>   File "c:\tools\python-2.2.1\lib\", line 1546, in help
>     else: doc(request, 'Help on %s:')
>   File "c:\tools\python-2.2.1\lib\", line 1341, in doc
>     pager(title % (desc + suffix) + '\n\n' + text.document(thing, name))
>   File "c:\tools\python-2.2.1\lib\", line 268, in document
>     if inspect.isclass(object): return apply(self.docclass, args)
>   File "c:\tools\python-2.2.1\lib\", line 1093, in docclass
>     lambda t: t[1] == 'method')
>   File "c:\tools\python-2.2.1\lib\", line 1035, in spill
>     name, mod, object))
>   File "c:\tools\python-2.2.1\lib\", line 269, in document
>     if inspect.isroutine(object): return apply(self.docroutine, args)
>   File "c:\tools\python-2.2.1\lib\", line 1116, in docroutine
>     realname = object.__name__
> AttributeError: 'Boost.Python.function' object has no attribute '__name__'
> *****************************************************************
> It seems I'm breaking some protocol. It's easy enough to add a '__name__'
> attribute to my function objects, but I'd like to be sure that I'm adding
> everything I really /should/ add. Just how much like a regular Python
> function does my function have to be in order to make the help system (and
> other standard systems with such expectations) happy?

It's hard to say.  The pydoc code makes up protocols as it goes.  I
think __name__ is probably the only one you're missing in practice.
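
A rough Python-level analogue of such a callable, for experimenting with what inspect expects. Everything here is hypothetical: `WrappedFunction`, `X` and `greet` are illustration names, this is not Boost.Python's implementation, and it assumes a current Python 3 where unbound methods no longer exist.

```python
import inspect
import types


class WrappedFunction:
    """Hypothetical stand-in for an extension-defined callable type."""

    def __init__(self, func, name, doc):
        self._func = func
        self.__name__ = name  # the attribute pydoc's docroutine reads
        self.__doc__ = doc

    def __call__(self, *args, **kwargs):
        return self._func(*args, **kwargs)

    def __get__(self, obj, objtype=None):
        # Mirror func_descr_get: bind to the instance when there is one.
        if obj is None:
            return self
        return types.MethodType(self, obj)


class X:
    greet = WrappedFunction(lambda self: "hello", "greet", "Say hello.")


wrapped = X.__dict__["greet"]
# inspect treats an object whose type has __get__ but no __set__ as a routine.
looks_like_routine = inspect.isroutine(wrapped)
result = X().greet()
```

Because inspect classifies the object as a routine, pydoc goes on to read `__name__` from it, which is where the traceback above ends.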

--Guido van Rossum (home page:

From  Wed Aug  7 17:11:06 2002
From: (Aahz)
Date: Wed, 7 Aug 2002 12:11:06 -0400
Subject: [Python-Dev] jython-dev failure
Message-ID: <>

I'm assuming the Jython developers are monitoring this list...

(I'm leaving for a week, so I don't have time to hunt down individual

----- Forwarded message from Duke <> -----
> Date: Tue, 06 Aug 2002 21:08:35 -0600
> To:
> From: Duke <>
> Subject: Please Forward: "A naive question about the applicability of
>   Jython..."
> I tried to send the text below the line to 
> <>
> as suggested under "Email Us" on the page .
> Unfortunately, it gave "599 DSMTP mail server registration lapsed.  Try 
> resending."
> Resending failed.
> Can you please help???
> Thanks!!!!
> ------------------------------------------------------------------------------------------------
> Nature of question: Can Jython do this?
> Nature of "this": read input coming to Internet Explorer, write output 
> through it.
> I have been fishing around for a way to monitor data coming into IE, and 
> when
> appropriate, generate output -- basically, automatically processing the 
> input
> and responding to it.  Imagine surfing a database, and when desired, sending
> updates to it.
> Can Jython do this?  Specifically, can it monitor input directed to the IE 
> window,
> and send responses?  I don't know enough about Java to understand whether it
> has that capability.  I have about 2 years experience writing Python and 
> Tkinter.
> If it can great!!!  If not, do you know of any plugins that can monitor and 
> generate
> traffic inside an IE session, particularly an SSL session?
> Thanks very much for your time!
> Duke Winsor
----- End forwarded message -----

From  Wed Aug  7 19:08:35 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 20:08:35 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> I think that the tuple is not the problem here, it's the
> fact that so many objects are recorded in the memo to
> later rebuild recursive structures.

It's not a matter of beliefs: each dictionary entry contributes 12
bytes. Each integer key contributes 12 bytes, each integer position
contributes 12 bytes. Each tuple contributes 36 bytes.

Assuming pymalloc and the integer allocator, this makes a total of 76
bytes per recorded object. The tuples contribute over 50% to that.
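
The total can be reproduced as arithmetic. The four sizes are the figures given above; rounding the 36-byte tuple up to 40 bytes (an 8-byte allocation granularity in pymalloc) is an assumption made here to arrive at 76, since the raw figures sum to 72.

```python
dict_entry = 12    # bytes, from the message above
int_key = 12
int_position = 12
tuple_raw = 36

ALIGNMENT = 8  # assumed allocator granularity, not stated in the thread
tuple_allocated = -(-tuple_raw // ALIGNMENT) * ALIGNMENT  # round up to a multiple of 8

total = dict_entry + int_key + int_position + tuple_allocated
tuple_share = tuple_allocated / total

print(total, round(tuple_share, 3))
```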

> Perhaps pickle could grow an option to assume that a
> data structure is non-recursive ?! In that case, no
> data would be written to the memo (or only the id()
> mapped to 1 to double-check).

That is already possible: You can pass a fake dictionary that records


From  Wed Aug  7 19:09:54 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 20:09:54 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> > Use cPickle, it's much more frugal with the memo, and also has some
> > options to control the memo (read the docs, I forget the details and
> > am in a hurry).
> Just to clarify: I don't have a problem with the memo
> in pickle at all :-) Martin brought up this issue.

I don't have a problem with the memo, either. I have a problem with
the tuples in the memo.


From  Wed Aug  7 19:11:46 2002
From: (Martin v. Loewis)
Date: 07 Aug 2002 20:11:46 +0200
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed to work?
In-Reply-To: <>
References: <15696.40035.180993.654851@localhost.localdomain>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> It's not a class, just a helper for StreamRecoder. Its purpose
> is to provide an easy way of saying "the inside world is encoding
> X while the outside world uses Y":

In a well-designed application, you should not need to say
this. The inside world should use Unicode objects.


From  Wed Aug  7 20:51:59 2002
From: (M.-A. Lemburg)
Date: Wed, 07 Aug 2002 21:51:59 +0200
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed
 to work?
References: <15696.40035.180993.654851@localhost.localdomain>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>>It's not a class, just a helper for StreamRecoder. Its purpose
>>is to provide an easy way of saying "the inside world is encoding
>>X while the outside world uses Y":
> In a well-designed application, you should not need to say
> this. The inside world should use Unicode objects.

Agreed, but if you want to port an existing application to
the Unicode world, it sometimes helps.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Wed Aug  7 21:02:53 2002
From: (Tim Peters)
Date: Wed, 7 Aug 2002 16:02:53 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> Tim>     re1 in re2
> Tim> should be True iff the language accepted by re1 is a subset of
> Tim> the language accepted by re2.  In this case, it's OK to consider
> Tim> the empty language a subset of all others, since nobody will be
> Tim> able to make head or tail out of the code anyway.

> Note the distinction between the empty language and the empty string.
> As a language is a set of strings, the empty language is one that
> contains no strings, not even the empty string.  Therefore, a regular
> expression that accepts the empty language is one that rejects every
> string, even the empty string.

Sure, that's why I said "empty language" and not "empty string".  It
wouldn't make *any* sense for "re1 in re2" to consider a regexp that
accepted the language {""} to be "in" all other regexps.  But a regexp that
accepts the language {} (i.e., the empty language) clearly accepts a subset
of the language accepted by any regexp.
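
Modelling languages as finite sets of strings makes the subset claim easy to check. This is a sketch only: real regular languages are usually infinite, and the sample languages below are made up.

```python
empty_language = set()       # {}   : accepts nothing
empty_string_only = {""}     # {""} : accepts only the empty string

other_languages = [{"a"}, {"", "ab"}, {"x", "xy", "xyz"}, set()]

# {} is a subset of every language, including itself.
empty_language_in_all = all(empty_language <= lang for lang in other_languages)

# {""} is a subset only of languages that contain the empty string.
empty_string_in_all = all(empty_string_only <= lang for lang in other_languages)

print(empty_language_in_all, empty_string_in_all)
```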

> Pedantically y'rs    --ark

Not enough to matter in this case <wink>.

From  Wed Aug  7 21:14:22 2002
From: (Andrew Koenig)
Date: 07 Aug 2002 16:14:22 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <>

>> Note the distinction between the empty language and the empty
>> string.  As a language is a set of strings, the empty language is
>> one that contains no strings, not even the empty string.
>> Therefore, a regular expression that accepts the empty language is
>> one that rejects every string, even the empty string.

Tim> Sure, that's why I said "empty language" and not "empty string".
Tim> It wouldn't make *any* sense for "re1 in re2" to consider a
Tim> regexp that accepted the language {""} to be "in" all other
Tim> regexps.  But a regexp that accepts the language {} (i.e., the
Tim> empty language) clearly accepts a subset of the language accepted
Tim> by any regexp.

Right.  (I wasn't disagreeing with you, merely pointing out a
plausible miscomprehension on the part of the reader (because
I made just that mistake the first time I read it))

>> Pedantically y'rs    --ark

Tim> Not enough to matter in this case <wink>.

Whether it matters depends on whether the reader made the same
mistake I did on first reading.

Andrew Koenig,,

From  Wed Aug  7 21:58:05 2002
From: (Tim Peters)
Date: Wed, 7 Aug 2002 16:58:05 -0400
Subject: [Python-Dev] CVS fails to commit
In-Reply-To: <>
Message-ID: <>

Sjoerd, is Jack having some systematic problem with CVS?  A stale lock of
his prevented checkins under Doc/ Saturday through Sunday afternoon too,
which also required SourceForge intervention to clear out.

> -----Original Message-----
> From: []On
> Behalf Of Sjoerd Mullender
> Sent: Wednesday, August 07, 2002 7:35 AM
> To: Steve Holden
> Cc: Python-Dev
> Subject: Re: [Python-Dev] CVS fails to commit
> It looks like Jack's problems caused a lock file to be stuck there.  I
> expect this affects a small part of the repository, and also that it
> needs manual intervention to correct the problem.  So please submit a
> service request to SourceForge to remove the lock file(s) in
> /cvsroot/python/python/dist/src/Doc/lib (and all subdirectories).
> On Wed, Aug 7 2002 "Steve Holden" wrote:
> > I find that this morning I am still prevented from committing changes to
> >
> >     ~/pythoncvs/python/dist/src/Doc/lib/libposixpath.tex
> >
> > Is this a problem that's only affecting a small portion of the
> repository,
> > or is it more general? To repeat yesterday's notification, the
> error message
> > I'm seeing is:
> >
> >     cvs server: [04:14:04] waiting for jackjansen's lock in
> > /cvsroot/python/python/dist/src/Doc/lib
> >
> > locked-out-ly y'rs  - steve
> > -----------------------------------------------------------------------
> > Steve Holden                       
> > Python Web Programming      
> > -----------------------------------------------------------------------
> -- Sjoerd Mullender <>

From  Wed Aug  7 22:56:07 2002
From: (Greg Ewing)
Date: Thu, 08 Aug 2002 09:56:07 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

"M.-A. Lemburg" <>:

> Perhaps pickle could grow an option to assume that a
> data structure is non-recursive ?

Then you'd probably want some means of detecting cycles, or you'd get
infinite recursion when you got it wrong. That would mean keeping a
stack of objects, I think -- probably less memory than keeping all of
them at once.

But I think the idea of keeping the object references in a list is
well worth trying first. 4 bytes per object instead of 36 sounds like a
good improvement to me!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Wed Aug  7 23:00:50 2002
From: (David Abrahams)
Date: Wed, 7 Aug 2002 18:00:50 -0400
Subject: [Python-Dev] docstrings, help(), and __name__
References: <086901c23e21$e7bea8b0$>  <>
Message-ID: <0a3001c23e60$3927d710$>

From: "Guido van Rossum" <>

> > It seems I'm breaking some protocol. It's easy enough to add a '__name__'
> > attribute to my function objects, but I'd like to be sure that I'm adding
> > everything I really /should/ add. Just how much like a regular Python
> > function does my function have to be in order to make the help system (and
> > other standard systems with such expectations) happy?
> It's hard to say.  The pydoc code makes up protocols as it goes.  I
> think __name__ is probably the only one you're missing in practice.

That appears to be correct. Interestingly, these methods seem to be treated
differently from ordinary ones. My methods get shown like this:

   |  __init__ = __init__(...)
   |      this is the __init__ function
   |      its documentation has two lines.

Where the 2nd instance of __init__ is given by the value of the __name__
attribute, while built-in methods get shown as follows:

  >>> class X(object):
  ...     def __init__(self): pass
  >>> help(X)
  Help on class X in module __main__:

  class X(__builtin__.object)
   |  Methods defined here:
   |  __init__(self)

Does anyone know why the difference? Is it perhaps the missing 'im_class'
attribute in my case? These are the sorts of things I want to know about...


           David Abrahams * Boost Consulting *

From  Wed Aug  7 23:43:46 2002
From: (Raymond Hettinger)
Date: Wed, 7 Aug 2002 18:43:46 -0400
Subject: [Python-Dev] Pickling in XML format
Message-ID: <003701c23e63$e4698440$5066accf@othello>

Do you guys have any thoughts on the merits of adding dumpXML and loadXML methods to the pickle module?

The only disadvantage that comes to mind is that the file sizes are larger (though they may compress more efficiently).

The advantages center around portability and the use of existing tools:
-- The pickles would be validatable against a DTD or schema
-- Pickles would be more human readable than the current format
-- XSLT could make translations to HTML, JavaPickle formats, more compact formats, etc.
-- XPATH could be used as a recursive search tool
-- Pickles would be editable and viewable with XML editors
-- No need for stack machine instructions to be included
-- Python object trees could potentially be loaded in other languages
-- The DTD can be used by non-Python sources to create data that is directly loadable into Python objects
-- Pickle security can be improved by using tight DTDs instead of copyreg.

I would appreciate your thoughts.

Raymond Hettinger

P.S.  Here's an example of what it would look like:

import math

class Circle:
    def __init__(self, rad):
        self.rad = rad

class Square:
    def __init__(self, side):
        self.side = side
    def __getinitargs__(self):
        return (self.side,)

class Triangle:
    def __init__(self, side1, side2, side3):
        # math.radians is the degrees-to-radians helper; math has no toRadians
        self.sides = map(math.radians, (side1, side2, side3))
    def __getstate__(self):
        return self.sides
    def __setstate__(self, state):
        self.sides = state

>>> d = {"one":"uno", "two":"dos"}
>>> obj = [d, 42, u"abc", [1.0,2+5j], Circle(5), Square(4), Triangle(3,4,5), d, None, True, False, Circle, len]
>>> pickle.dumpsXML(obj)

  <dict id="0">
    <item> <str>one</str> <str>uno</str> </item>
    <item> <str>two</str> <str>dos</str> </item>
  <instance module="__main__" name="Circle">
      <item> <str>rad</str> <int>5</int> </item>
  <instance module="__main__" name="Square">
  <instance module="__main__" name="Triangle">
  <memo idref="0"/>
  <global module="__main__" name="Circle"/>
  <global module="__builtin__" name="len"/>

From  Thu Aug  8 00:36:36 2002
From: (Tim Peters)
Date: Wed, 07 Aug 2002 19:36:36 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

[Martin v. Loewis]
> It's not a matter of beliefs: each dictionary entry contributes 12
> bytes. Each integer key contributes 12 bytes, each integer position
> contributes 12 bytes. Each tuple contributes 36 bytes.

I'm not in love with giant pickle memos myself, but to reduce expectations
closer to reality, note that each dict entry consumes at least 18 bytes (we
keep the load factor under 2/3, so there's at least one unused entry for
every two real entries; it's an indirect overhead, but a real one).
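
The 18-byte figure follows from the load factor. A small worked check, using only the 12-byte entry size and the 2/3 bound quoted above:

```python
from fractions import Fraction

entry_size = 12              # bytes per dict slot, from the thread
max_load = Fraction(2, 3)    # at most two of every three slots are in use

# Each used entry also pays for its share of the unused slots.
effective_bytes_per_entry = entry_size / max_load

print(effective_bytes_per_entry)
```

Since the load factor stays under 2/3, this is a lower bound on the real cost per entry.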

From  Thu Aug  8 00:43:50 2002
From: (Skip Montanaro)
Date: Wed, 7 Aug 2002 18:43:50 -0500
Subject: [Python-Dev] Do I misunderstand how codecs.EncodedFile is supposed to work?
In-Reply-To: <>
References: <15696.40035.180993.654851@localhost.localdomain>
Message-ID: <15697.45238.952815.600057@localhost.localdomain>

>>>>> "Martin" == Martin v Loewis <> writes:

    Martin> "M.-A. Lemburg" <> writes:
    >> It's not a class, just a helper for StreamRecoder. Its purpose
    >> is to provide an easy way of saying "the inside world is encoding
    >> X while the outside world uses Y":

    Martin> In a well-designed application, you should not need to
    Martin> say this. The inside world should use Unicode objects.

Which is precisely what I'm trying to do. ;-)  I think I have enough clues
to make things work now.  Thanks for the pointers.


From  Thu Aug  8 00:51:06 2002
From: (Greg Ewing)
Date: Thu, 08 Aug 2002 11:51:06 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> I'm not in love with giant pickle memos myself, but to reduce expectations
> closer to reality, note that each dict entry consumes at least 18 bytes (we
> keep the load factor under 2/3, so there's at least one unused entry for
> every two real entries; it's an indirect overhead, but a real one).

Is there perhaps a more memory-efficient data structure that
could be used here instead of a dict? A b-tree, perhaps,
which with a suitable bucket size could use no more than
about 8 bytes per entry -- 4 for the object reference and
4 for the integer index that it maps to.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug  8 01:17:50 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 20:17:50 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Thu, 08 Aug 2002 09:56:07 +1200."
References: <>
Message-ID: <>

> > Perhaps pickle could grow an option to assume that a
> > data structure is non-recursive ?
> Then you'd probably want some means of detecting cycles, or you'd get
> infinite recursion when you got it wrong. That would mean keeping a
> stack of objects, I think -- probably less memory than keeping all of
> them at once.

cPickle has an obscure option for this.  You create a pickler object
and set the attribute "fast" to True, I believe.  It detects cycles by
using a nesting counter, I believe (read the source to learn more).
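
A sketch of what that looks like, assuming Python 3's pickle.Pickler, whose `fast` attribute is documented as deprecated and as disabling the memo; this is not the 2.2-era cPickle itself, and the variable names are for illustration only.

```python
import io
import pickle

a = [1, 2, 3]
b = [a, a, a]

buf = io.BytesIO()
pickler = pickle.Pickler(buf)
pickler.fast = True  # deprecated switch: nothing is recorded in the memo
pickler.dump(b)

restored = pickle.loads(buf.getvalue())

values_equal = restored == b                    # contents survive
sharing_lost = restored[0] is not restored[1]   # but each slot is a separate copy
print(values_equal, sharing_lost)
```

The trade-off is the one discussed in this thread: no memo overhead, but shared references are duplicated and self-referential data must not be pickled this way.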

> But I think the idea of keeping the object references in a list is
> well worth trying first. 4 bytes per object instead of 36 sounds like a
> good improvement to me!

So maybe we need to create an identitydict...

--Guido van Rossum (home page:

From  Thu Aug  8 01:18:49 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 20:18:49 -0400
Subject: [Python-Dev] docstrings, help(), and __name__
In-Reply-To: Your message of "Wed, 07 Aug 2002 18:00:50 EDT."
References: <086901c23e21$e7bea8b0$> <>
Message-ID: <>

> That appears to be correct. Interestingly, these methods seem to be treated
> differently from ordinary ones. My methods get shown like this:
>    |  __init__ = __init__(...)
>    |      this is the __init__ function
>    |      its documentation has two lines.
> Where the 2nd instance of __init__ is given by the value of the __name__
> attribute, while built-in methods get shown as follows:
>   >>> class X(object):
>   ...     def __init__(self): pass
>   ...
>   >>> help(X)
>   Help on class X in module __main__:
>   class X(__builtin__.object)
>    |  Methods defined here:
>    |
>    |  __init__(self)
> Does anyone know why the difference? Is it perhaps the missing 'im_class'
> attribute in my case? These are the sorts of things I want to know about...

Who knows.  As I said, pydoc is a mess of underdocumented

--Guido van Rossum (home page:

From  Thu Aug  8 01:21:35 2002
From: (Guido van Rossum)
Date: Wed, 07 Aug 2002 20:21:35 -0400
Subject: [Python-Dev] Pickling in XML format
In-Reply-To: Your message of "Wed, 07 Aug 2002 18:43:46 EDT."
References: <003701c23e63$e4698440$5066accf@othello>
Message-ID: <>

> Do you guys have any thoughts on the merits of adding dumpXML and
> loadXML methods to the pickle module?

That doesn't belong in the pickle module.  An XML format to store
Python-specific data structures doesn't make sense.  Storing data
in XML makes total sense, but should probably be guided by some XML
standard and not by the set of data types that happen to be available
in Python.  Put it in the xml module.

Note that already has a way to do this, for the data
types supported by XMLRPC.
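
For illustration, a round trip through the XML-RPC marshaller. The module name is missing from the sentence above; this sketch assumes the XML-RPC client library is meant (xmlrpclib at the time, xmlrpc.client in Python 3), and the sample data is made up.

```python
import xmlrpc.client

data = {"one": "uno", "answer": 42, "ratio": 1.5, "items": [1, 2, 3]}

# dumps() expects a tuple of parameters; methodresponse=True wraps a single
# value in a <methodResponse> document.
xml_text = xmlrpc.client.dumps((data,), methodresponse=True)

# loads() returns (params, methodname); methodname is None for a response.
params, method_name = xmlrpc.client.loads(xml_text)
round_tripped = params[0]

print(round_tripped == data)
```

Only the XML-RPC data types survive this: instances of arbitrary classes, such as the Circle and Square objects in the proposal, are outside its scope.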

--Guido van Rossum (home page:

From  Thu Aug  8 01:47:16 2002
From: (Tim Peters)
Date: Wed, 07 Aug 2002 20:47:16 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

[Greg Ewing]
> Is there perhaps a more memory-efficient data structure that
> could be used here instead of a dict? A b-tree, perhaps,
> which with a suitable bucket size could use no more than
> about 8 bytes per entry -- 4 for the object reference and
> 4 for the integer index that it maps to.

The code to support BTrees would be quite a burden.  Zope already has that,
so it wouldn't be a new burden there, except to get away from comparing keys
via __cmp__ we'd have to use IIBTrees, and those map 4-byte ints to 4-byte
ints (i.e., they wouldn't work right for this purpose on a 64-bit box --
although Yet Another Flavor of BTree could be compiled that would).

Judy tries look perfect for "this kind of thing" (fast, memory-efficient,
and would likely get significant compression benefit from the fact that the
high bits of user-space addresses tend to be the same):

From  Thu Aug  8 01:57:33 2002
From: (Greg Ewing)
Date: Thu, 08 Aug 2002 12:57:33 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

Tim Peters <>:

> The code to support BTrees would be quite a burden.

It wouldn't be all that complicated, surely? And you'd
only need about half of it, because you only need to
be able to add keys, never delete them.


Is there any information available about this
other than the 3 lines I managed to find amongst 
all the sourceforge crud?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug  8 03:17:58 2002
From: (Tim Peters)
Date: Wed, 07 Aug 2002 22:17:58 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

>> The code to support BTrees would be quite a burden.

[Greg Ewing]
> It wouldn't be all that complicated, surely? And you'd
> only need about half of it, because you only need to
> be able to add keys, never delete them.

Have you done a production-quality, portable B-Tree implementation?  Could
you prevail in arguing that memory reduction is more important than speed
here?  Etc.  B-Trees entail a messy set of tradeoffs.


> Is there any information available about this
> other than the 3 lines I managed to find amongst
> all the sourceforge crud?

It was a short-lived topic on Python-Dev about two weeks ago.  Try, mmm,

for lots of info.

From  Thu Aug  8 06:08:36 2002
From: (Tim Peters)
Date: Thu, 8 Aug 2002 01:08:36 -0400
Subject: [Python-Dev] A different kind of heap
Message-ID: <>

Just for fun, although someone may find it useful <shudder>.

The new heapq module implements a classic min-heap, a binary tree where the
value at each node is <= the values of its children.

I mentioned weak heaps before (in the context of sorting), where that
condition is loosened to cover just the right child (and the root node of
the whole weak heap isn't allowed to have a left child).  This is more
efficient, but the code is substantially trickier.

A different kind of weakening is the so-called "pairing heap" (PH).  This is
more like the classic (strong) heap, except that it's a general tree, with
no constraint on how many children each node can have (0, 1, 2, ...,
thousands).  The parent value simply has to be <= the values of all its
children (if any).  Leaving aside storage efficiency, the code for this is
substantially simpler than even for classic heaps:  two PHs can be merged in
constant time, with a single compare (whichever PH has the larger root value
simply becomes another child of the other PH's root node).  The code below
uses a funky representation where a PH is just a list L, where L[0] is the
value associated with the PH's root node (which always has the smallest
value in the tree), and the rest of the list consists of 0 or more child PHs
(which are again lists of the same form).

All the usual heap operations build on this simple "pairing" operation,
called _link below.  Pushing an element x on the heap consists of viewing x
as the 0-child PH [x], and one link step completes merging it with the
existing PH.  Any collection of N values can thus be turned into a PH using
exactly N-1 compares.

A pop seems scary at first, since we may have one root node with N-1
children, and then it will take at least N-2 pairing steps to turn the
remaining forest of PHs back into a single PH.  Indeed, this happens if you
feed the numbers 1..N into an empty PH in order (each of 2 thru N becomes a
direct child of 1).  There are many ways the forest-merge step can be done; the
code below implements a common way, with the remarkable property that,
despite the possibility for an O(N) pop step, the amortized cost for N pops
is worst-case O(log N).  In the "bad example" of inserting 1 thru N in
order, it actually turns out to be amortized constant time (it doesn't
matter how big N is, there's an independent (and small) constant c such that
the N pops take no more than c*N compares).  You have to see that to believe
it, though <wink>.

PHs are an active area of current research.  They appear to have many
remarkable "adaptive" properties, but it seems difficult to prove or
disprove interesting general conjectures.  Playing around with the code and
a class that counts __cmp__ invocations, it's not hard to find cases of
partially ordered data where "pairing heap sort" does fewer compares than
our new mergesort.  OTOH, PHs do substantially worse than classic heaps on #
of compares when data is fed in randomly, and classic heaps in turn do
substantially worse on random data than our mergesort.

Still, if there are bursts of order in your data, and you can afford the
space, using a PH priority queue can be much faster than using a classic
heap.  Indeed, if you feed the numbers from 1 through N in *reverse* order,
then pop them off one at a time, it turns out that the PH queue doesn't need
to compare after any of the pops -- the N-1 compares at the start are the
whole banana.

Have fun!

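def _link(x, y):
    # Merge two non-empty PHs with a single compare:  the PH with the
    # larger root value becomes another child of the other PH's root.
    if x[0] <= y[0]:
        x.append(y)
        return x
    else:
        y.append(x)
        return y

def _merge(x):
    # x is a PH whose root value is being discarded.  Pair up the child
    # PHs left to right, then link the resulting pairs right to left.
    n = len(x)
    if n == 1:
        return []
    pairs = [_link(x[i], x[i+1]) for i in xrange(1, n-1, 2)]
    if n & 1 == 0:
        # Odd number of children:  the last one has no partner.
        pairs.append(x[-1])
    x = pairs[-1]
    for i in xrange(len(pairs)-2, -1, -1):
        x = _link(pairs[i], x)
    return x

class Heap(object):
    __slots__ = 'x'

    def __init__(self):
        self.x = []

    def __nonzero__(self):
        return bool(self.x)

    def push(self, value):
        if self.x:
            self.x = _link(self.x, [value])
        else:
            # Empty heap:  value becomes the root of a 0-child PH.
            self.x.append(value)

    def pop(self):
        result = self.x[0]  # raises IndexError if empty
        self.x = _merge(self.x)
        return result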
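
The compare counts claimed above are easy to check.  What follows is a
minimal sketch, not the original posting:  it restates the pairing heap in
Python 3, and the Counted wrapper and count_compares helper are hypothetical
names introduced only to tally "<=" calls.

```python
def _link(x, y):
    # One compare: the PH with the larger root becomes a child of the other.
    if x[0] <= y[0]:
        x.append(y)
        return x
    y.append(x)
    return y


def _merge(x):
    # Drop the root value, pair children left to right, link right to left.
    children = x[1:]
    if not children:
        return []
    pairs = [_link(children[i], children[i + 1])
             for i in range(0, len(children) - 1, 2)]
    if len(children) % 2:
        pairs.append(children[-1])  # odd child count: last one is unpaired
    result = pairs[-1]
    for i in range(len(pairs) - 2, -1, -1):
        result = _link(pairs[i], result)
    return result


class Heap:
    def __init__(self):
        self.x = []

    def __bool__(self):
        return bool(self.x)

    def push(self, value):
        self.x = _link(self.x, [value]) if self.x else [value]

    def pop(self):
        result = self.x[0]  # raises IndexError if empty
        self.x = _merge(self.x)
        return result


class Counted:
    """Hypothetical wrapper that counts every <= comparison."""
    compares = 0

    def __init__(self, v):
        self.v = v

    def __le__(self, other):
        Counted.compares += 1
        return self.v <= other.v


def count_compares(values):
    """Push all values, pop them all; return (sorted, push cmps, pop cmps)."""
    Counted.compares = 0
    heap = Heap()
    for v in values:
        heap.push(Counted(v))
    after_push = Counted.compares
    out = []
    while heap:
        out.append(heap.pop().v)
    return out, after_push, Counted.compares - after_push
```

Calling count_compares(range(10, 0, -1)) should report 9 compares for the
pushes and none for the pops, matching the reverse-order claim; feeding
range(1, 11) instead moves the work into the pops.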

From  Thu Aug  8 07:48:20 2002
From: (Martin v. Loewis)
Date: 08 Aug 2002 08:48:20 +0200
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> > But I think the idea of keeping the object references in a list is
> > well worth trying first. 4 bytes per object instead of 36 sounds like a
> > good improvement to me!
> So maybe we need to create an identitydict...

In that case, the backwards compatibility problems are more serious.

Of course, we could choose the type of dictionary based on whether this
is a Pickler subclass or not (with some protocol to let the subclass
make us aware that we should use the identitydict, anyway).

This is, of course, a bigger change than I originally had in mind.
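
To make the "identitydict" idea concrete, here is a minimal Python 3 sketch.
The class name and its methods are assumptions for illustration only; no such
type is proposed in this form in the thread.  It shows the lookup semantics
(keys compared by identity, never by __eq__ or __hash__) and makes no attempt
at the memory savings being discussed.

```python
class IdentityDict:
    """Map objects to values by identity (id()), not by equality."""

    def __init__(self):
        # id(obj) -> (obj, value).  Holding obj keeps it alive, so its
        # id cannot be reused while the entry exists.
        self._entries = {}

    def __setitem__(self, obj, value):
        self._entries[id(obj)] = (obj, value)

    def __getitem__(self, obj):
        return self._entries[id(obj)][1]

    def __contains__(self, obj):
        return id(obj) in self._entries

    def __len__(self):
        return len(self._entries)


a = [1, 2]
b = [1, 2]          # equal to a, but a different object
memo = IdentityDict()
memo[a] = 0         # works even though lists are unhashable
```

Here "a in memo" is true while "b in memo" is false, which is the behaviour a
pickling memo needs:  two equal but distinct objects must not share an entry.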


From  Thu Aug  8 08:36:20 2002
From: (M.-A. Lemburg)
Date: Thu, 08 Aug 2002 09:36:20 +0200
Subject: [Python-Dev] Pickling in XML format
References: <003701c23e63$e4698440$5066accf@othello>
Message-ID: <>

Raymond Hettinger wrote:
> Do you guys have any thoughts on the merits of adding dumpXML and loadXML methods to the pickle module?
> The only disadvantage that comes to mind is that the file sizes are larger (though they may compress more efficiently).

FYI, there is such a module in PyXML.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Thu Aug  8 08:44:58 2002
From: (Sjoerd Mullender)
Date: Thu, 08 Aug 2002 09:44:58 +0200
Subject: [Python-Dev] CVS fails to commit
In-Reply-To: <>
References: <>
Message-ID: <>

Jack mentioned something on python-dev to this effect.  He had
problems with MacCVS that whenever he used it to update a file in the
Doc/lib directory the CVS *server* would crash.  You'll find it in the

On Wed, Aug 7 2002 "Tim Peters" wrote:

> Sjoerd, is Jack having some systematic problem with CVS?  A stale lock of
> his prevented checkins under Doc/ Saturday through Sunday afternoon too,
> which also required SourceForge intervention to clear out.
> > -----Original Message-----
> > From: []On
> > Behalf Of Sjoerd Mullender
> > Sent: Wednesday, August 07, 2002 7:35 AM
> > To: Steve Holden
> > Cc: Python-Dev
> > Subject: Re: [Python-Dev] CVS fails to commit
> >
> >
> > It looks like Jack's problems caused a lock file to be stuck there.  I
> > expect this affects a small part of the repository, and also that it
> > needs manual intervention to correct the problem.  So please submit a
> > service request to SourceForge to remove the lock file(s) in
> > /cvsroot/python/python/dist/src/Doc/lib (and all subdirectories).
> >
> > On Wed, Aug 7 2002 "Steve Holden" wrote:
> >
> > > I find that this morning I am still prevented from committing changes to
> > >
> > >     ~/pythoncvs/python/dist/src/Doc/lib/libposixpath.tex
> > >
> > > Is this a problem that's only affecting a small portion of the
> > repository,
> > > or is it more general? To repeat yesterday's notification, the
> > error message
> > > I'm seeing is:
> > >
> > >     cvs server: [04:14:04] waiting for jackjansen's lock in
> > > /cvsroot/python/python/dist/src/Doc/lib
> > >
> > > locked-out-ly y'rs  - steve
> > > -----------------------------------------------------------------------
> > > Steve Holden                       
> > > Python Web Programming      
> > > -----------------------------------------------------------------------
> >
> > -- Sjoerd Mullender <>

-- Sjoerd Mullender <>

From  Thu Aug  8 08:05:10 2002
From: (Martin v. Loewis)
Date: 08 Aug 2002 09:05:10 +0200
Subject: [Python-Dev] Pickling in XML format
In-Reply-To: <>
References: <003701c23e63$e4698440$5066accf@othello>
Message-ID: <>

Guido van Rossum <> writes:

> That doesn't belong in the pickle module.  

Also, it doesn't belong in the core (right now). PyXML has the
xml.marshal package, which has a "generic" XML marshaller, and one
that generates WDDX. There are a few users of WDDX, but nobody has
ever asked to provide marshalling for arbitrary Python objects.

Contributions to this package are welcome (; if
such a module has existed for a couple of PyXML releases, we can tell
whether there is enough demand for it to be in the standard library
(which I doubt).


From  Thu Aug  8 09:28:01 2002
From: (Martin v. Löwis)
Date: Thu, 8 Aug 2002 10:28:01 +0200
Subject: [Python-Dev] _sre as part of python.dll
Message-ID: <>

What is the reason for _sre.pyd being a separate DLL? On Unix, it is
incorporated into the executable by default; regular expressions are
central for Python and cannot be omitted.

Would anybody object if I change the Windows build process so that it
stops having _sre as a separate target?


From  Thu Aug  8 14:46:47 2002
From: (David Beazley)
Date: Thu, 8 Aug 2002 08:46:47 -0500 (CDT)
Subject: [Python-Dev] Operator overloading inconsistency (bug or feature?)
Message-ID: <>

Suppose that a new-style class wants to overload "*" and it
defines two methods like this:

class Foo(object):
    def __mul__(self,other):
        print "__mul__"
    def __rmul__(self,other):
        print "__rmul__"

In Python 2.2.1, if you try this, you get the following behavior:
>>> f = Foo()
>>> f*1.0
__mul__
>>> 1.0*f
__rmul__
>>> f*1
__mul__
>>> 1*f
__mul__

So here is the question: Why does the last statement in this example
not invoke __rmul__?  In other words, why do "1.0*f" and "1*f" produce
different behavior?  Is this intentional?  Is this documented someplace?
Is there a workaround?  Or are we just missing something obvious?
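
For comparison, a small sketch of the dispatch one would expect from such a
class.  It assumes current Python 3 rather than 2.2.1, and the methods return
their own names instead of printing them so the results can be collected.

```python
class Foo:
    def __mul__(self, other):
        return "__mul__"

    def __rmul__(self, other):
        return "__rmul__"


f = Foo()
# Foo on the left uses __mul__; Foo on the right falls back to __rmul__
# once the float or int on the left declines to handle a Foo operand.
results = [f * 1.0, 1.0 * f, f * 1, 1 * f]
```

Under that assumption, results is
["__mul__", "__rmul__", "__mul__", "__rmul__"]:  the int and float cases are
treated alike.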




From  Thu Aug  8 15:27:38 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 10:27:38 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Thu, 08 Aug 2002 10:28:01 +0200."
References: <>
Message-ID: <>

> What is the reason for _sre.pyd being a separate DLL? On Unix, it is
> incorporated into the executable by default; regular expressions are
> central for Python and cannot be omitted.
> Would anybody object if I change the Windows build process so that it
> stops having _sre as a separate target?

Let me turn this around.  What advantage do you see to linking it
statically?
--Guido van Rossum (home page:

From  Thu Aug  8 15:54:55 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 10:54:55 -0400
Subject: [Python-Dev] Operator overloading inconsistency (bug or feature?)
In-Reply-To: Your message of "Thu, 08 Aug 2002 08:46:47 CDT."
References: <>
Message-ID: <>

> Suppose that a new-style class wants to overload "*" and it
> defines two methods like this:
> class Foo(object):
>     def __mul__(self,other):
>         print "__mul__"
>     def __rmul__(self,other):
>         print "__rmul__"
> Python-2.2.1, if you try this, you get the following behavior:
> >>> f = Foo()
> >>> f*1.0
> __mul__
> >>> 1.0*f
> __rmul__
> >>> f*1
> __mul__
> >>> 1*f
> __mul__
> So here is the question: Why does the last statement in this example
> not invoke __rmul__?  In other words, why do "1.0*f" and "1*f" produce 
> different behavior.  Is this intentional?  Is this documented someplace?
> Is there a workaround?  Or are we just missing something obvious?

Aargh.  I *think* this may have to do with the hacks for sequence
repetition.  But I'm not sure.  A debug session tracing carefully
through the code is in order.

--Guido van Rossum (home page:

From  Thu Aug  8 16:17:24 2002
From: (Marcelo Matus)
Date: Thu, 08 Aug 2002 08:17:24 -0700
Subject: [Swig-dev] Re: [Python-Dev] Operator overloading inconsistency
 (bug or feature?)
References: <> <>
Message-ID: <>

Guido van Rossum wrote:

>>Suppose that a new-style class wants to overload "*" and it
>>defines two methods like this:
>>class Foo(object):
>>    def __mul__(self,other):
>>        print "__mul__"
>>    def __rmul__(self,other):
>>        print "__rmul__"
>>Python-2.2.1, if you try this, you get the following behavior:
>>>>>f = Foo()
>>>>>f*1.0
>>__mul__
>>>>>1.0*f
>>__rmul__
>>>>>f*1
>>__mul__
>>>>>1*f
>>__mul__
>>So here is the question: Why does the last statement in this example
>>not invoke __rmul__?  In other words, why do "1.0*f" and "1*f" produce 
>>different behavior.  Is this intentional?  Is this documented someplace?
>>Is there a workaround?  Or are we just missing something obvious?
>Aargh.  I *think* this may have to do with the hacks for sequence
>repetition.  But I'm not sure.  A debug session tracing carefully
>through the code is in order.
>--Guido van Rossum (home page:

I guess the problem arises from here:

static PyObject *
int_mul(PyObject *v, PyObject *w)
{
    long a, b;
    long longprod;            /* a*b in native long arithmetic */
    double doubled_longprod;    /* (double)longprod */
    double doubleprod;        /* (double)a * (double)b */

    if (!PyInt_Check(v) &&
        v->ob_type->tp_as_sequence &&
        v->ob_type->tp_as_sequence->sq_repeat) {
        /* sequence * int */
        a = PyInt_AsLong(w);
        return (*v->ob_type->tp_as_sequence->sq_repeat)(v, a);
    }
    if (!PyInt_Check(w) &&
        w->ob_type->tp_as_sequence &&
        w->ob_type->tp_as_sequence->sq_repeat) {
        /* int * sequence */
        a = PyInt_AsLong(v);
        return (*w->ob_type->tp_as_sequence->sq_repeat)(w, a);
    }
    ...


and the facts that:

1.- there is only one 'sq_repeat' method, and no additional
'sq_rrepeat' one, so n*x and x*n call the same method, sq_repeat.

2.- in typeobject.c, sq_repeat is associated with __mul__

line 2775:
      SLOT1(slot_sq_repeat, "__mul__", int, "i")

line 3497:
    SQSLOT("__mul__", sq_repeat, slot_sq_repeat, wrap_intargfunc,
           "x.__mul__(n) <==> x*n"),

3.- the 'object' class by default enables the "tp_as_sequence" attribute,
triggering the call of sq_repeat in the case

    1*f

but not in

    1.0*f

From  Thu Aug  8 16:46:32 2002
From: (Neil Schemenauer)
Date: Thu, 8 Aug 2002 08:46:32 -0700
Subject: [Python-Dev] Operator overloading inconsistency (bug or feature?)
In-Reply-To: <>; from on Thu, Aug 08, 2002 at 10:54:55AM -0400
References: <> <>
Message-ID: <>

See for a possible fix.


From  Thu Aug  8 18:16:59 2002
From: (Martin v. Löwis)
Date: 08 Aug 2002 19:16:59 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Let me turn this around.  What advantage do you see to linking it
> statically?

The trigger was that it would have simplified the build for me: When
converting VC++6 projects to VC.NET, VC.NET forgets to convert the
/export: linker options, which means that you had to add them all
manually. Mark has fixed this problem differently, by removing the
need for /export:.

Integrating _sre (and _socket, select, winreg, mmap, perhaps others)
into python.dll still simplifies the build process: you don't have to
right-click that many subprojects to build them.

In addition, it should decrease startup time: Python won't need to
locate that many files anymore.

It also decreases the total size of the binary distribution slightly.


From  Thu Aug  8 18:26:23 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 13:26:23 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Thu, 08 Aug 2002 19:16:59 +0200."
References: <> <>
Message-ID: <>

> > Let me turn this around.  What advantage do you see to linking it
> > statically?
> The trigger was that it would have simplified the build for me: When
> converting VC++6 projects to VC.NET, VC.NET forgets to convert the
> /export: linker options, which means that you had to add them all
> manually. Mark has fixed this problem differently, by removing the
> need for /export:.
> Integrating _sre (and _socket, select, winreg, mmap, perhaps others)
> into python.dll still simplifies the build process: you don't have to
> right-click that many subprojects to build them.

I never have to do that; the dependencies in the project file make
sure that the extensions are all built when you build the 'python'
project.
> In addition, it should decrease startup time: Python won't need to
> locate that many files anymore.
> It also decreases the total size of the binary distribution slightly.

Maybe _sre is used by most apps (though I doubt even that).  But
_socket, select, winreg, mmap and the others are definitely not.  On
Unix, all extensions are built as shared libraries, except the ones
that are needed by to be able to build extensions; it looks
like only posix, errno, _sre and symtable are built statically.

I'd say that making more extensions static on Windows would increase
start time of modules that don't use those extensions.

I'm -0 on doing this for _sre (I think it's a YAGNI); I'm -1 on doing
this for other extensions.

--Guido van Rossum (home page:

From  Thu Aug  8 18:33:15 2002
From: (Tim Peters)
Date: Thu, 08 Aug 2002 13:33:15 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
Message-ID: <>

[Martin v. Lowis]
> ...
> Integrating _sre (and _socket, select, winreg, mmap, perhaps others)
> into python.dll still simplifies the build process: you don't have to
> right-click that many subprojects to build them.

If you're building via right-clicking, you're making life much harder than
necessary.  You can build from the command line, or do Build -> Batch
Build -> Build in the GUI.  The latter builds all projects in one gulp,
including both Release and Debug versions (well, it actually displays a list
of all possible project targets, and lets you select which to build in batch
mode; this selection is persistent, so you only need to do it once).

From  Thu Aug  8 18:40:41 2002
From: (Martin v. Löwis)
Date: 08 Aug 2002 19:40:41 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> I never have to do that; the dependencies in the project file make
> sure that the extensions are all built when you build the 'python'
> project.

Are you sure? If the python target is up-to-date (i.e. nothing has to
be done for python_d.exe), and I delete all generated _sre files
(i.e. sre_d.pyd, and the object files), and then ask VC++ 6 to build
the python target, nothing is done.

Indeed, I cannot find any place where it says that the python target
is related to _sre. I can only see dependencies with pythoncore.

Can you (or any other regular pcbuild.dsp user) please guess what I'm
doing wrong?

> Maybe _sre is used by most apps (though I doubt even that).  But
> _socket, select, winreg, mmap and the others are definitely not.  On
> Unix, all extensions are built as shared libraries, except the ones
> that are needed by to be able to build extensions; it looks
> like only posix, errno, _sre and symtable are built statically.

I do believe that is a mistake, as it will increase startup time of
applications that need them; applications that don't need them would
not be hurt if they were in the python binary.

> I'd say that making more extensions static on Windows would increase
> start time of modules that don't use those extensions.

I guess I have to measure these things.


From  Thu Aug  8 18:49:55 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 13:49:55 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Thu, 08 Aug 2002 19:40:41 +0200."
References: <> <> <> <>
Message-ID: <>

> > I never have to do that; the dependencies in the project file make
> > sure that the extensions are all built when you build the 'python'
> > project.
> Are you sure? If the python target is up-to-date (i.e. nothing has to
> be done for python_d.exe), and I delete all generated _sre files
> (i.e. sre_d.pyd, and the object files), and then ask VC++ 6 to build
> the python target, nothing is done.
> Indeed, I cannot find any place where it says that the python target
> is related to _sre. I can only see dependencies with pythoncore.
> Can you (or any other regular pcbuild.dsp user) please guess what I'm
> doing wrong?

I have no idea.  It's all magic for me.  But I never delete targets

> > Maybe _sre is used by most apps (though I doubt even that).  But
> > _socket, select, winreg, mmap and the others are definitely not.  On
> > Unix, all extensions are built as shared libraries, except the ones
> > that are needed by to be able to build extensions; it looks
> > like only posix, errno, _sre and symtable are built statically.
> I do believe that is a mistake, as it will increase startup time of
> applications that need them; applications that don't need them would
> not be hurt if they were in the python binary.

But is the startup time of apps that use a lot of stuff the most
important thing?  I'd say that the startup time of apps that *don't*
use a lot of stuff is more important.  I'm not sure that making the
binary bigger doesn't slow it down.

> > I'd say that making more extensions static on Windows would increase
> > start time of modules that don't use those extensions.
> I guess I have to measure these things.

Yes, please.  We switched to building almost all extensions as shared
libs when we switched away from Modules/Setup to

--Guido van Rossum (home page:

From  Thu Aug  8 19:24:57 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 14:24:57 -0400
Subject: [Swig-dev] Re: [Python-Dev] Operator overloading inconsistency (bug or feature?)
In-Reply-To: Your message of "Thu, 08 Aug 2002 08:17:24 PDT."
References: <> <>
Message-ID: <>

> >Aargh.  I *think* this may have to do with the hacks for sequence
> >repetition.  But I'm not sure.  A debug session tracing carefully
> >through the code is in order.

[Marcelo Matus]
> I guess the problem arises from here:

Good sleuthing, Marcelo!  Neil S. came up with a fix that I believe is
correct, and I recommend that we check that in for 2.3 as well as
on the 2.2 maintenance branch.  Thanks again, Neil!

--Guido van Rossum (home page:

From  Thu Aug  8 20:56:55 2002
From: (Tim Peters)
Date: Thu, 8 Aug 2002 15:56:55 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
Message-ID: <>

> I never have to do that; the dependencies in the project file make
> sure that the extensions are all built when you build the 'python'
> project.

> Are you sure? If the python target is up-to-date (i.e. nothing has to
> be done for python_d.exe), and I delete all generated _sre files
> (i.e. sre_d.pyd, and the object files), and then ask VC++ 6 to build
> the python target, nothing is done.

Right, every project other than pythoncore and w9xpopen depends on the
pythoncore project, but that's all.  Guido doesn't normally change any code
in any other subprojects, so he doesn't notice this viscerally.  If you want
to be completely safe at all times, do Build -> Batch Build.  One step and
easy.  It won't recompile more than needed, although if the Python DLL
changes, it will pee away a little time relinking things against the new
core .lib file.

From  Thu Aug  8 21:15:51 2002
From: (Ka-Ping Yee)
Date: Thu, 8 Aug 2002 13:15:51 -0700 (PDT)
Subject: [Python-Dev] Re: docstrings, help(), and __name__
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208081257250.2277-100000@ziggy>

On Wed, 7 Aug 2002, Guido van Rossum wrote:
> > It seems I'm breaking some protocol. It's easy enough to add a '__name__'
> > attribute to my function objects, but I'd like to be sure that I'm adding
> > everything I really /should/ add. Just how much like a regular Python
> > function does my function have to be in order to make the help system (and
> > other standard systems with such expectations) happy?
> It's hard to say.  The pydoc code makes up protocols as it goes.  I
> think __name__ is probably the only one you're missing in practice.

pydoc does not "make up protocols as it goes".  It does its best to
utilize the protocols exposed by the Python core.  The attribute
protocols on Python built-in objects vary from type to type, and
pydoc tries to accommodate them.  Part of the purpose of pydoc and
inspect was to document and provide a more uniform interface to some
of these protocols.

All the built-in objects that are declared with a name have a __name__
attribute, so you'll want to provide that.  Beyond that, it depends
on the type of object you want to emulate; the various protocols are
documented in the 'inspect' module.  For example, see

    pydoc inspect.isfunction

for details on function objects.

-- ?!ng

From  Thu Aug  8 21:24:56 2002
From: (Ka-Ping Yee)
Date: Thu, 8 Aug 2002 13:24:56 -0700 (PDT)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208081316240.2277-100000@ziggy>

On Mon, 5 Aug 2002, Guido van Rossum wrote:
> At least this still holds (unless x is an iterator or otherwise
> mutated by access :-):
>   for v in x:
>      assert v in x

And -- wham! -- the dangers of an invisibly destructive "in" operator
land in front of us once again like an enormous stomping foot falling
out of the sky.
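
A short sketch of the hazard, written for current Python 3 (an assumption;
the helper name is hypothetical).  It runs Guido's loop over a list and over
an iterator on the same data:

```python
def every_item_is_member(x):
    """Return True if `v in x` holds for every v produced by iterating x."""
    for v in x:
        if v not in x:
            return False
    return True


# A list can be searched any number of times without changing.
as_list = every_item_is_member([1, 2, 3])

# An iterator is consumed by the search: after yielding 1, testing
# `1 in it` reads 2 and 3, finds nothing, and leaves the iterator empty.
as_iterator = every_item_is_member(iter([1, 2, 3]))
```

The list case gives True and the iterator case gives False, because the
membership test itself uses up the remaining items.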

Still not convinced?

Oh, well.  There exists a solution, in case you're curious:

-- ?!ng

From  Thu Aug  8 21:36:52 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 16:36:52 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Thu, 08 Aug 2002 13:24:56 PDT."
References: <Pine.LNX.4.44.0208081316240.2277-100000@ziggy>
Message-ID: <>

> Still not convinced?

No.  Your goal, making for-in "safe to use" is not important to me.


--Guido van Rossum (home page:

From  Thu Aug  8 21:37:54 2002
From: (Ka-Ping Yee)
Date: Thu, 8 Aug 2002 13:37:54 -0700 (PDT)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208081325340.2277-100000@ziggy>

On Tue, 6 Aug 2002, Guido van Rossum wrote:
> I think we've argued about '' in 'abc' long enough.  Tim has failed to
> convince me, so '' in 'abc' returns True.  Barry has checked it all
> in.

I would like to urge putting the brakes on this one and proceeding more
cautiously.  (I've been away for the past couple of days and missed the
discussion on this issue.)

My personal opinion sides with Tim -- I think an exception is definitely
the right choice.  (I still haven't seen convincing examples where True
is a more useful result than an exception, and the fact that there is
doubt suggests that it is an exceptional case.)

But regardless of that opinion, we should recognize that causing
'' in 'abc' to stop raising an exception is a big change -- a more
gentle introduction, with at least some sort of warning, would be better.
Silent errors are bad.
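
To make the two positions concrete, here is a minimal sketch assuming a
Python in which substring "in" behaves as checked in (current Python 3 does).
strict_contains is a hypothetical helper showing the exception-raising
alternative; it is not an API anyone has proposed under that name.

```python
def strict_contains(needle, haystack):
    """Like `needle in haystack`, but treat an empty needle as an error."""
    if needle == "":
        raise ValueError("empty substring")
    return needle in haystack


# The behaviour as checked in: the empty string is found everywhere.
empty_is_found = "" in "abc"

# The alternative argued for here: an empty needle is reported loudly.
try:
    strict_contains("", "abc")
    raised = False
except ValueError:
    raised = True
```

With the checked-in behaviour empty_is_found is True; with the strict helper
the same test raises instead of silently succeeding.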

-- ?!ng

From  Thu Aug  8 21:42:31 2002
From: (Martin v. Loewis)
Date: 08 Aug 2002 22:42:31 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> But is the startup time of apps that use a lot of stuff the most
> important thing?  I'd say that the startup time of apps that *don't*
> use a lot of stuff is more important.  I'm not sure that making the
> binary bigger doesn't slow it down.

I'm pretty sure that it doesn't. On Unix, the system performs a
copy-on-write mmap of the executable. No disk access is done until
page faults trigger a disk read. I believe Windows uses a similar
mechanism. The size of the executable is irrelevant (if you have no
relocations); only the part of the executable that is used matters.

On the other hand, on my Linux installation, importing a module costs
35 system calls if the module is not found, and no PYTHONPATH is set;
every directory in PYTHONPATH adds four additional system calls.

> Yes, please.  We switched to building almost all extensions as shared
> libs when we switched away from Modules/Setup to

For modules that require configuration, this was a good thing - now
will autoconfigure them. For modules that require no
additional libraries, I hope that this decision will be reverted some
day.

From  Thu Aug  8 21:49:12 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 16:49:12 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Thu, 08 Aug 2002 22:42:31 +0200."
References: <> <> <> <> <> <>
Message-ID: <>

> > But is the startup time of apps that use a lot of stuff the most
> > important thing?  I'd say that the startup time of apps that *don't*
> > use a lot of stuff is more important.  I'm not sure that making the
> > binary bigger doesn't slow it down.
> I'm pretty sure that it doesn't. On Unix, the system performs a
> copy-on-write mmap of the executable. No disk access is done until
> page faults trigger a disk read. I believe Windows uses a similar
> mechanism. The size of the executable is irrelevant (if you have no
> relocations); only the part of the executable that is used matters.
> On the other hand, on my Linux installation, importing a module costs
> 35 system calls if the module is not found, and no PYTHONPATH is set;
> every directory in PYTHONPATH adds four additional system calls.
> > Yes, please.  We switched to building almost all extensions as shared
> > libs when we switched away from Modules/Setup to
> For modules that require configuration, this was a good thing - now
> will autoconfigure them. For modules that require no
> additional libraries, I hope that this decision will be reverted some
> day.

If other people feel the same way, I won't stop progress here.  But I
find startup time a rather uninteresting detail, and everything else
being the same I would personally prefer to keep the status quo: not
because it's better, but because it's the status quo.  Why churn?

--Guido van Rossum (home page:

From  Thu Aug  8 21:56:54 2002
From: (Jack Jansen)
Date: Thu, 8 Aug 2002 22:56:54 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
Message-ID: <>

On Thursday, August 8, 2002, at 07:16, Martin v. Löwis wrote:

> Guido van Rossum <> writes:
>> Let me turn this around.  What advantage do you see to linking it
>> statically?
> The trigger was that it would have simplified the build for me: When
> converting VC++6 projects to VC.NET, VC.NET forgets to convert the
> /export: linker options, which means that you had to add them all
> manually. Mark has fixed this problem differently, by removing the
> need for /export:.
> Integrating _sre (and _socket, select, winreg, mmap, perhaps others)
> into python.dll still simplifies the build process: you don't have to
> right-click that many subprojects to build them.
> In addition, it should decrease startup time: Python won't need to
> locate that many files anymore.
> It also decreases the total size of the binary distribution slightly.

Note that I went exactly the other way for MacPython over the
last year. It used to be so that all "common" modules were
included in PythonCore.slb, and I used separate project build
files only for Mac-only modules and one or two special cases
(Tk, expat).

I bit the bullet half a year ago and made PythonCore.slb lean
and mean, but still used my own private project build file
generator for all extension projects.

I bit the bullet again (actually, I bit one of the two remaining
half-bullets, I've kept the Mac-specific modules as they are)
last month, and MacPython now uses the main for a large
collection of the cross-platform extension modules. This turned
out to be only one or two evenings of work.

This has immediately resulted in a decrease in my workload:
whereas previously whenever someone decided to add the kaboozle
module I had to add project files for this, etc etc etc, all
that is now often taken care of by distutils and

- Jack Jansen        <>        -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -

From  Thu Aug  8 22:04:51 2002
From: (Ka-Ping Yee)
Date: Thu, 8 Aug 2002 14:04:51 -0700 (PDT)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208081400420.2277-100000@ziggy>

On Tue, 6 Aug 2002, Guido van Rossum wrote:
> > Perhaps it makes sense to allow "'thon' in 'python'" to return True,
> > but still have "[1,2] in [0,1,2,3]" return False if we loosen the
> > steadfast requirement that strings and lists be as much alike as
> > possible.
> That was never a requirement.  Strings and lists are merely similar
> insofar as they have very similar needs for a slicing and subscripting
> notation, and to a lesser extent for concatenation, repetition and
> comparison.

Perhaps what Skip meant was that strings and lists are both like
sequences.  At the moment, the meaning of "in" has two general
definitions: one for sequence-like objects and one for mapping-like
objects.  The former is something along the lines of "e is in s if
there exists an i such that s[i] == e".
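
A minimal sketch (not from the original message) of the sequence-style
definition quoted above, next to substring matching. The helper name
sequence_contains is hypothetical, and the last line assumes a Python
version in which substring "in" is already implemented.

```python
def sequence_contains(s, e):
    """Sequence-style "in": true if some index i has s[i] == e."""
    return any(s[i] == e for i in range(len(s)))

# For lists, the element-wise rule matches the built-in operator.
print(sequence_contains([0, 1, 2, 3], 2))        # True
print(sequence_contains([0, 1, 2, 3], [1, 2]))   # False: no single element equals [1, 2]

# For strings, the element-wise rule only ever finds single characters,
# whereas substring matching also accepts longer runs.
print(sequence_contains("python", "t"))          # True
print(sequence_contains("python", "thon"))       # False under the element-wise rule
print("thon" in "python")                        # True under substring semantics
```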

The question from a teaching perspective is: "Are strings a kind of

-- ?!ng

From  Thu Aug  8 22:05:57 2002
From: (M.-A. Lemburg)
Date: Thu, 08 Aug 2002 23:05:57 +0200
Subject: [Python-Dev] _sre as part of python.dll
References: <>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> On the other hand, on my Linux installation, importing a module costs
> 35 system calls if the module is not found, and no PYTHONPATH is set;
> every directory in PYTHONPATH adds four additional system calls.

Why not address this problem instead ?

Note that mxCGIPython can help a lot here: it freezes most of the
Python std lib into the executable making imports go really fast
(and that's needed if you're doing a lot of CGI scripting).

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Thu Aug  8 22:51:29 2002
From: (Neil Schemenauer)
Date: Thu, 8 Aug 2002 14:51:29 -0700
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>; from on Thu, Aug 08, 2002 at 04:49:12PM -0400
References: <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum wrote:
> If other people feel the same way, I won't stop progress here.  But I
> find startup time a rather uninteresting detail,

A lot of people care about startup time.  I would like to see a few more
modules included statically.


From  Thu Aug  8 22:51:05 2002
From: (Tim Peters)
Date: Thu, 8 Aug 2002 17:51:05 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
Message-ID: <>

[Neil Schemenauer]
> A lot of people care about startup time.  I would like to see a few more
> modules included statically.

If the real goal is to reduce startup time, we should analyze where startup
time is being spent; random thrashing "in that general direction" won't
satisfy in the end.
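
One rough way to start such an analysis is simply to time interpreter
startup. The sketch below is an illustration, not something posted in the
thread: it assumes a current Python 3 with subprocess.run and
time.perf_counter, and the helper name time_startup is hypothetical.

```python
import subprocess
import sys
import time

def time_startup(extra_args=(), runs=5):
    """Return the best wall-clock time (seconds) to start the interpreter and exit."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Launch a fresh interpreter that does nothing but start up and exit.
        subprocess.run([sys.executable, *extra_args, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    print("default startup:   %.4f s" % time_startup())
    # -S skips the implicit "import site", which isolates part of the cost.
    print("without site (-S): %.4f s" % time_startup(["-S"]))
```

Taking the minimum over several runs reduces noise from disk caching; the
numbers are only meaningful when compared on the same machine and build.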

From  Thu Aug  8 23:10:25 2002
From: (Martin v. Loewis)
Date: 09 Aug 2002 00:10:25 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> > On the other hand, on my Linux installation, importing a module costs
> > 35 system calls if the module is not found, and no PYTHONPATH is set;
> > every directory in PYTHONPATH adds four additional system calls.
> Why not address this problem instead ?

I'm trying to: every module incorporated in config.c won't be searched

> Note that mxCGIPython can help a lot here: it freezes most of the
> Python std lib into the executable making imports go really fast
> (and that's needed if you're doing a lot of CGI scripting).

Indeed, freezing also helps - but is probably only suitable for
special-purpose applications. I think people would be surprised if
they are told that editing the source of a library module won't have
any effect.


From  Thu Aug  8 23:16:42 2002
From: (Martin v. Loewis)
Date: 09 Aug 2002 00:16:42 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> This has immediately resulted in a decrease in my workload: whereas
> previously whenever someone decided to add the kaboozle module I had
> to add project files for this, etc etc etc, all that is now often
> taken care of by distutils and

Reducing the workload is a good thing, and so is sharing of build
processes across many systems; I'm not proposing to give that up.

At the moment, I'm really asking about Windows only; I'll ask about
adding things back into Setup.dist when I can show what advantages
that has. That does not mean that those things would be removed from
- that is smart enough to build only things that haven't
already been built.


From  Thu Aug  8 23:58:36 2002
From: (David Abrahams)
Date: Thu, 8 Aug 2002 18:58:36 -0400
Subject: [Python-Dev] Re: docstrings, help(), and __name__
References: <Pine.LNX.4.44.0208081257250.2277-100000@ziggy>
Message-ID: <120201c23f2f$23258730$>

From: "Ka-Ping Yee" <>

> The attribute
> protocols on Python built-in objects vary from type to type, and
> pydoc tries to accommodate them.  Part of the purpose of pydoc and
> inspect was to document and provide a more uniform interface to some
> of these protocols.
> All the built-in objects that are declared with a name have a __name__
> attribute, so you'll want to provide that.  Beyond that, it depends
> on the type of object you want to emulate; the various protocols are
> documented in the 'inspect' module.  For example, see
>     pydoc inspect.isfunction

Do you mean help(inspect.isfunction), or am I clueless about the
environment which accepts the above command?

> for details on function objects.

It appears that ismethod is the one that's relevant to me, since the doc
system gets my functions through my descriptor, which is wrapping them with

So far I'm getting away with not adding an im_class attribute to my
function objects, but it does result in that odd "__init__ = __init__"
output (unless I've misdiagnosed). My function objects will certainly never
have func_code, as help(inspect.isfunction) implies they should, and I'm a
little reluctant to load up functions with a lot more attributes just so
they can be like Python's functions... though I'm certain the penalty would
be lost in the noise.

The main question is this: which attributes do I absolutely /need/ in order
to avoid raising an exception or giving nonsensical output from help()?

Thanks again,

From  Fri Aug  9 00:29:56 2002
From: (Raymond Hettinger)
Date: Thu, 8 Aug 2002 19:29:56 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
References: <Pine.LNX.4.44.0208081325340.2277-100000@ziggy>
Message-ID: <006e01c23f33$8245f520$b6b53bd0@othello>

> > I think we've argued about '' in 'abc' long enough.  Tim has failed to
> > convince me, so '' in 'abc' returns True.  Barry has checked it all
> > in.

> My personal opinion sides with Tim -- i think an exception is definitely
> the right choice.  (I still haven't seen convincing examples where True
> is a more useful result than an exception, and the fact that there is
> doubt suggests that it is an exceptional case.)

I think Barry and GvR are on the right track.

My gut feeling is that it is best to stay with the mathematical view that
the null set is a subset of every other set.  It doesn't seem to have hurt the
world of regular expressions where re.match('', 'abc') returns a match
object.  Likewise, the truth of "abc" ~ "" is not on the wart list for AWK.  
Excel and Lotus both return non-zero for FIND("","abc").

Though errors should not pass silently, we are talking about an error
that is possibly very far upstream from the membership check:

   potentialsub = complicatedfunction(*manyvars) #semantic error here
   <much other computation here ...>
   if potentialsub in astring:  # why raise an exception way down here

'in' should not be responsible for suggesting that complicatedfunction()
doesn't know what it is doing.  If there is an error, it isn't the membership
check; rather, it is a semantic problem with the function.  Accordingly, 
the postcondition for the function belongs at the tail of the function and 
not as a precondition for the use of the result.  Otherwise, the exception 
and its cause are too far apart (as in the example above).
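
A minimal sketch of that argument, assuming a Python in which '' in 'abc'
is True. The function complicated_function and its body are hypothetical
stand-ins; the point is only that the postcondition is checked where the
value is produced.

```python
def complicated_function(words):
    """Hypothetical stand-in for the upstream computation."""
    result = "".join(w[:1] for w in words)
    # Postcondition checked at the tail of the function, not at the use site.
    if not result:
        raise ValueError("complicated_function produced an empty string")
    return result

potential_sub = complicated_function(["alpha", "beta"])   # "ab"
print(potential_sub in "abc")                              # True

try:
    complicated_function([])   # would otherwise silently yield ''
except ValueError as exc:
    print("caught early:", exc)
```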

Raymond Hettinger

From  Fri Aug  9 00:36:13 2002
From: (Barry A. Warsaw)
Date: Thu, 8 Aug 2002 19:36:13 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
References: <Pine.LNX.4.44.0208081325340.2277-100000@ziggy>
Message-ID: <>

>>>>> "RH" == Raymond Hettinger <> writes:

    RH> I think Barry and GvR are on the right track.

Heh, I actually agree with Tim that it should raise an exception, but
I can also see the value in the other point of view.  This is one of
those things that we'd just have to learn to live with whichever
behavior is chosen, and Guido's made up his mind.


From  Fri Aug  9 01:22:57 2002
From: (Ka-Ping Yee)
Date: Thu, 8 Aug 2002 17:22:57 -0700 (PDT)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208081720230.2277-100000@ziggy>

On Thu, 8 Aug 2002, Barry A. Warsaw wrote:
>     RH> I think Barry and GvR are on the right track.
> Heh, I actually agree with Tim that it should raise an exception, but
> I can also see the value in the other point of view.  This is one of
> those things that we'd just have to learn to live with whichever
> behavior is chosen, and Guido's made up his mind.

That's fine.  But what i'm trying to say is there's a migration issue:
this decision is a significant change from current behaviour, and it
worries me that we would let this change pass silently without any
grace period.

-- ?!ng

From  Fri Aug  9 01:25:42 2002
From: (Tim Peters)
Date: Thu, 08 Aug 2002 20:25:42 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> Heh, I actually agree with Tim that it should raise an exception, but
> I can also see the value in the other point of view.

Hah!  As I've long suspected, your mind is easily clouded.  I see no value
in any point of view, and that's why the universe will be all mine.

> This is one of those things that we'd just have to learn to live with
> whichever behavior is chosen, and Guido's made up his mind.

That too.  I didn't figure the world would actually end if Guido decided not
to promulgate conflicting definitions of what substring meant, and, so far,
I don't think it has.  And it's important to me that the world not end,
since, as demonstrated earlier, someday it will be all mine.

bill-gates-only-thinks-it's-his-ly y'rs  - tim

From  Fri Aug  9 01:40:53 2002
From: (Marcelo Matus)
Date: Thu, 08 Aug 2002 17:40:53 -0700
Subject: [Swig-dev] Re: [Python-Dev] Operator overloading inconsistency
 (bug or feature?)
References: <> <> <>
Message-ID: <>

Yes, it works here, thanks very much for your prompt answer


Neil Schemenauer wrote:

>See for a possible fix.
>  Neil

From  Fri Aug  9 01:50:24 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 20:50:24 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Thu, 08 Aug 2002 17:22:57 PDT."
References: <Pine.LNX.4.44.0208081720230.2277-100000@ziggy>
Message-ID: <>

> That's fine.  But what i'm trying to say is there's a migration issue:
> this decision is a significant change from current behaviour, and it
> worries me that we would let this change pass silently without any
> grace period.

And others have said the exact same thing already.

But there is no backwards compatibility issue.  Correct programs
currently never ask for '' in 'abc' because that's guaranteed to raise
a TypeError.  Backwards compatibility guarantees have always had to
use the qualification "except for programs that rely on XYZ raising an
exception" so you can't argue that reasonable code could expect the
TypeError either.

The only issue is whether certain programming mistakes come to light a
little later now that '' in 'abc' no longer raises TypeError.  I'm
willing to accept that in order to make teaching the feature easier:
s1 in s2 means the same thing as s2.find(s1)>=0.
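
The teaching rule can be checked directly in a Python that implements
substring "in". This is a small illustration, not code from the thread;
the helper name contains is hypothetical.

```python
def contains(s2, s1):
    """Substring test spelled with find(), as in the teaching rule."""
    return s2.find(s1) >= 0

cases = [("abc", ""), ("abc", "a"), ("abc", "bc"),
         ("abc", "abcd"), ("", ""), ("", "x")]

for s2, s1 in cases:
    # Both spellings agree, including for the empty string.
    assert (s1 in s2) == contains(s2, s1)
    print(repr(s1), "in", repr(s2), "->", s1 in s2)
```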

--Guido van Rossum (home page:

From  Fri Aug  9 01:54:51 2002
From: (Barry A. Warsaw)
Date: Thu, 8 Aug 2002 20:54:51 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
References: <>
Message-ID: <>

>>>>> "KY" == Ka-Ping Yee <> writes:

    KY> That's fine.  But what i'm trying to say is there's a
    KY> migration issue: this decision is a significant change from
    KY> current behaviour, and it worries me that we would let this
    KY> change pass silently without any grace period.

from __future__ import str_in_str


From  Fri Aug  9 01:58:40 2002
From: (Barry A. Warsaw)
Date: Thu, 8 Aug 2002 20:58:40 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

    >> Heh, I actually agree with Tim that it should raise an
    >> exception, but I can also see the value in the other point of
    >> view.

    TP> Hah!  As I've long suspected, your mind is easily clouded.  I
    TP> see no value in any point of view, and that's why the universe
    TP> will be all mine.

Yes, but nothing will be all mine, and since nothing is in everything,
all your strings are belong to us.

    >> This is one of those things that we'd just have to learn to
    >> live with whichever behavior is chosen, and Guido's made up his
    >> mind.

    TP> That too.  I didn't figure the world would actually end if
    TP> Guido decided not to promulgate conflicting definitions of
    TP> what substring meant, and, so far, I don't think it has.  And
    TP> it's important to me that the world not end, since, as
    TP> demonstrated earlier, someday it will be all mine.

    TP> bill-gates-only-thinks-it's-his-ly y'rs - tim

If you had visited the alternative universe where Guido decided to
raise the exception, you would have noticed that the universe did
indeed end.  But it's too late now.  Of course, /they/ think our
universe ended too, so it all comes out in the wash.

go-eat-ly y'rs,

From  Fri Aug  9 02:42:10 2002
From: (Greg Ewing)
Date: Fri, 09 Aug 2002 13:42:10 +1200 (NZST)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <Pine.LNX.4.44.0208081400420.2277-100000@ziggy>
Message-ID: <>

Ka-Ping Yee <>:

> The question from a teaching perspective is: "Are strings a kind of
> sequence?"

Seems to me they're a kind of... er... um... string.
They're like nothing else on Earth, really.

They do seem to me more like sequences than mappings,
however, if we really have to pick one.

The question is, do we have to pick one, or should
we just regard them as a third kind with its own
special rules?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug  9 04:55:30 2002
From: (Guido van Rossum)
Date: Thu, 08 Aug 2002 23:55:30 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Thu, 08 Aug 2002 08:48:20 +0200."
References: <> <>
Message-ID: <>

Martin quoted a complaint about cPickle performance:

But if you read the full thread, it's clear that this complaint came
about because the author wasn't using binary pickle mode.  In binary
mode his times became acceptable.  I've run the test and I haven't
seen abnormal memory behavior -- the process grows to 26 Mb just to
create the test data, and then adds about 1 Mb during pickling.  The
loading almost doubles the process size, because another copy of the
test data is read (the test data isn't thrown away).

The slowdown of text-mode pickle is due to the extremely expensive way
of unpickling pickled strings in text-mode: it invokes eval() (well,
PyRun_String()) to parse the string literal!  (After checking that
there's really only a string literal there to prevent trojan horses.)

So I'm not sure that the memo size is worth pursuing.  I didn't look
at the other complaints referenced by Martin, but I bet they're more
of the same.

What might be worth looking into:

(1) Make binary pickling the default (in cPickle as well as pickle).
    This would definitely give the most bang for the buck.

(2) Replace the PyRun_String() call in cPickle with something faster.
    Maybe the algorithm from parsestr() from compile.c can be exposed;
    although the error reporting must be done differently then.
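
For comparison, a minimal sketch of choosing the text or a binary protocol
explicitly. It assumes the modern pickle module and its protocol argument
rather than the 2002 cPickle call, and the test data only loosely imitates
the dicts described in this thread.

```python
import pickle

# 100 dicts sharing the same 25 keys and values.
data = [{"key%d" % i: "value%d" % i for i in range(25)} for _ in range(100)]

text_blob = pickle.dumps(data, protocol=0)    # the original ASCII protocol
binary_blob = pickle.dumps(data, protocol=1)  # the older binary protocol

# Both protocols round-trip the same data; they differ in size and speed.
assert pickle.loads(text_blob) == data
assert pickle.loads(binary_blob) == data
print("text bytes:  ", len(text_blob))
print("binary bytes:", len(binary_blob))
```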

--Guido van Rossum (home page:

From  Fri Aug  9 05:04:45 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 00:04:45 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Thu, 08 Aug 2002 23:55:30 EDT."
Message-ID: <>

> The slowdown of text-mode pickle is due to the extremely expensive way
> of unpickling pickled strings in text-mode: it invokes eval() (well,
> PyRun_String()) to parse the string literal!  (After checking that
> there's really only a string literal there to prevent trojan horses.)

After re-reading the quoted thread, there was another phenomenon
remarked upon there: the slow text-mode pickle used less memory.  I
noticed this too when I ran the test program.  The explanation is that
the strings in the test program were "key0", "key1", ... "key24" and
"value0" ... "value24", over and over (each test dict has the same
keys and values).  Because these literals look like identifiers, they
are interned, so the unpickled data structure shares the string
references -- while the original test data has 10,000 copies of each
string!

If we really want this as a feature, a call to
PyString_InternFromString() could be made under certain conditions in
load_short_binstring() (e.g. when the length is at most 10 and
all_name_chars() from compile.c returns true).

I'm not sure that this is a desirable feature though.
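
The sharing effect can be illustrated at the Python level. This sketch
assumes Python 3's sys.intern rather than the C-level call named above,
and make_key is a hypothetical helper.

```python
import sys

def make_key(i):
    # Built at run time; in CPython each call typically yields a new object.
    return "".join(["key", str(i)])

plain = [make_key(0) for _ in range(3)]
interned = [sys.intern(make_key(0)) for _ in range(3)]

# The plain strings are equal by value.
print(all(s == plain[0] for s in plain))
# After interning, every entry refers to one shared string object.
print(all(s is interned[0] for s in interned))
```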

--Guido van Rossum (home page:

From  Fri Aug  9 05:08:25 2002
From: (Tim Peters)
Date: Fri, 09 Aug 2002 00:08:25 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> ...
> (2) Replace the PyRun_String() call in cPickle with something faster.
>     Maybe the algorithm from parsestr() from compile.c can be exposed;
>     although the error reporting must be done differently then.

Note that Martin has had a patch pending for this since, umm, January:

Maybe he should review it himself <wink>.

From  Fri Aug  9 05:13:15 2002
From: (Tim Peters)
Date: Fri, 09 Aug 2002 00:13:15 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> ...
> Because these literals look like identifiers, they are interned, so the
> unpickled data structure shares the string references -- while the
> original test data has 10,000 copies of each string!
> If we really want this as a feature, a call to
> PyString_InternFromString() could be made under certain conditions in
> load_short_binstring() (e.g. when the length is at most 10 and
> all_name_chars() from compile.c returns true).
> I'm not sure that this is a desirable feature though.

I hope Oren resumes his crusade to make interned strings follow the same
refcount rules as everything else, and then we wouldn't have this fear of
interning.  BTW, nobody yet has reported any code where "indirect interning"
pays -- or even triggers once in a non-eating-its-own-tail way.

From  Fri Aug  9 07:08:35 2002
From: (Inyeol Lee)
Date: Thu, 8 Aug 2002 23:08:35 -0700
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
Message-ID: <>

Guido> I think we've argued about '' in 'abc' long enough.  Tim has failed to
Guido> convince me, so '' in 'abc' returns True.  Barry has checked it all
Guido> in.

I'm Inyeol Lee, happy python user. I checked other string methods and
re functions.

1. most of them assume null character between normal characters and
   at the start/end of string;

   'abc'.count('') -> 4
   'abc'.endswith('') -> 1
   'abc'.find('') -> 0
   'abc'.index('') -> 0
   'abc'.rfind('') -> 3
   'abc'.rindex('') -> 3
   'abc'.startswith('') -> 1
   re.search('', 'abc').span() -> (0, 0)
   re.match('', 'abc').span() -> (0, 0)
   re.findall('', 'abc') -> ['', '', '', '']
   re.sub('', '_', 'abc') -> '_a_b_c_'
   re.subn('', '_', 'abc') -> ('_a_b_c_', 4)

2. some of them generate exception;

   '' in 'abc'
   'abc'.replace('', '_')

3. one of them ignores empty match;

   re.split('', 'abc') -> ['abc']

(couldn't test re.finditer but it seems to be the same as re.findall.)

Since '' in 'abc' now returns True, how about changing 'abc'.replace('', '_')
to generate '_a_b_c_', too? It is consistent with re.sub()/subn() and the
cost for change is similar to '' in 'abc' case.
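
For reference, a short check of the same cases against a current CPython 3,
which is much later than the version under discussion. It is a sketch for
comparison, not a record of 2002 behaviour. re.split is left out because
its handling of empty matches changed in later releases.

```python
import re

# String methods treat the empty string as matching at every position.
assert "abc".count("") == 4
assert "abc".find("") == 0 and "abc".rfind("") == 3
assert "abc".startswith("") and "abc".endswith("")

# The re functions agree.
assert re.search("", "abc").span() == (0, 0)
assert re.match("", "abc").span() == (0, 0)
assert re.findall("", "abc") == ["", "", "", ""]
assert re.sub("", "_", "abc") == "_a_b_c_"
assert re.subn("", "_", "abc") == ("_a_b_c_", 4)

# In current Python 3, neither of the two "exception" cases raises.
assert "" in "abc"
assert "abc".replace("", "_") == "_a_b_c_"
print("all empty-string checks passed")
```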

Inyeol Lee

From  Fri Aug  9 09:08:19 2002
From: (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 09 Aug 2002 10:08:19 +0200
Subject: [Python-Dev] Patch 592529: Split-out ntmodule.c
Message-ID: <>

I'm collecting opinions on whether the module nt should live in its
own source code file; it currently lives in posixmodule.c. has a patch that implements that feature.
Tim is -0, Guido is +0.5, more votes are needed.

If you are familiar with the code, it would be good if you could
comment on the following questions:

- should os2module.c get its own source code file as well?

- are the #ifdefs in the resulting ntmodule.c still needed?
  I believe they are, as the various compilers appear to support
  different sets of functions in their C libraries. Of course,
  most of these could be eliminated if the C library is avoided in favour
  of the Win32 API. Alternatively, can anybody with access to any
  of these compilers (BorlandC, Watcom, IBM) please comment on
  which functions provided by MSVC are missing in those compilers?


From  Fri Aug  9 09:52:08 2002
From: (Duncan Booth)
Date: Fri, 9 Aug 2002 09:52:08 +0100
Subject: [Python-Dev] _sre as part of python.dll
References: <> <> <> <>
Message-ID: <>

On 08 Aug 2002, Guido van Rossum <> wrote:

>> In addition, it should decrease startup time: Python won't need to
>> locate that many files anymore.
>> It also decreases the total size of the binary distribution slightly.
> Maybe _sre is used by most apps (though I doubt even that).  But
> _socket, select, winreg, mmap and the others are definitely not.  On
> Unix, all extensions are built as shared libraries, except the ones
> that are needed by to be able to build extensions; it looks
> like only posix, errno, _sre and symtable are built statically.
> I'd say that making more extensions static on Windows would increase
> start time of modules that don't use those extensions.

_sre is used by any application that imports 'os'. That (IMHO) is almost 
every non-trivial Python program.

Of course, we shouldn't be guessing about startup times. Someone should 
actually try building two versions and comparing them.

Duncan Booth                                   
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

From  Fri Aug  9 10:35:19 2002
From: (M.-A. Lemburg)
Date: Fri, 09 Aug 2002 11:35:19 +0200
Subject: [Python-Dev] _sre as part of python.dll
References: <>	<>	<>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>>>On the other hand, on my Linux installation, importing a module costs
>>>35 system calls if the module is not found, and no PYTHONPATH is set;
>>>every directory in PYTHONPATH adds four additional system calls.
>>Why not address this problem instead ?
> I'm trying to: every module incorporated in config.c won't be searched
>>Note that mxCGIPython can help a lot here: it freezes most of the
>>Python std lib into the executable making imports go really fast
>>(and that's needed if you're doing a lot of CGI scripting).
> Indeed, freezing also helps - but is probably only suitable for
> special-purpose applications. I think people would be surprised if
> they are told that editing the source of a library module won't have
> any effect.

They shouldn't edit those anyway :-) What ever happened to the
ZIP archive import that James C. Ahlstrom was working on (I think it
was him) ?

If startup time for the std lib is considered a problem, then
people could be directed to a ZIP archive incorporating the
complete pure Python std lib.

Marc-Andre Lemburg

From  Fri Aug  9 12:16:51 2002
From: (Sjoerd Mullender)
Date: Fri, 09 Aug 2002 13:16:51 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Fri, Aug 9 2002 Duncan Booth wrote:

> On 08 Aug 2002, Guido van Rossum <> wrote:
> >> In addition, it should decrease startup time: Python won't need to
> >> locate that many files anymore.
> >> 
> >> It also decreases the total size of the binary distribution slightly.
> > 
> > Maybe _sre is used by most apps (though I doubt even that).  But
> > _socket, select, winreg, mmap and the others are definitely not.  On
> > Unix, all extensions are built as shared libraries, except the ones
> > that are needed by to be able to build extensions; it looks
> > like only posix, errno, _sre and symtable are built statically.
> > 
> > I'd say that making more extensions static on Windows would increase
> > start time of modules that don't use those extensions.
> _sre is used by any application that imports 'os'. That (IMHO) is almost
> every non-trivial Python program.

Not on my system it isn't!

It's true that _sre does get imported whenever I start Python, but
that is not because it gets imported by os.  There is an import of re
in posixpath (imported by os), but that is inside the function
expandvars which is not called during import.

In my case imports distutils.util because Python decides it is
called from the build directory.
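
The pattern described above, an import placed inside a function so that it
costs nothing until the function is called, can be sketched as follows.
This is an illustration only: expand_vars is a hypothetical analogue, not
the posixpath code, and the string module stands in for re.

```python
import sys

def expand_vars(path):
    """Hypothetical analogue of a function with a deferred import."""
    import string  # loaded on first call, not when this module is imported
    return string.Template(path).safe_substitute(HOME="/home/user")

# May already be True if something else imported the module earlier.
print("string" in sys.modules)
print(expand_vars("$HOME/bin"))
# Certainly True once the function has run.
print("string" in sys.modules)
```

Inspecting sys.modules right after startup is also a quick way to see
which modules a given installation really loads.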

-- Sjoerd Mullender <>

From  Fri Aug  9 12:51:30 2002
From: (Gordon McMillan)
Date: Fri, 9 Aug 2002 07:51:30 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <Pine.LNX.4.44.0208081720230.2277-100000@ziggy>
References: <>
Message-ID: <3D537482.571.52A3D6DF@localhost>

On 8 Aug 2002 at 17:22, Ka-Ping Yee wrote:

> That's fine.  But what i'm trying to say is there's
> a migration issue: this decision is a significant
> change from current behaviour, and it worries me
> that we would let this change pass silently without
> any grace period. 

But it's no more significant than 'ab' in 'abc' not
raising an exception. If you're relying on "x in str"
to validate the "char"ness of x, you're screwed
either way.

-- Gordon

From  Fri Aug  9 12:51:30 2002
From: (Gordon McMillan)
Date: Fri, 9 Aug 2002 07:51:30 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <3D537482.23186.52A3D72F@localhost>

On 8 Aug 2002 at 17:51, Tim Peters wrote:

> If the real goal is to reduce startup time, we
> should analyze where startup time is being spent;
> random thrashing "in that general direction" won't
> satisfy in the end. 

In the 1.5.2 timeframe, most *startup* time was
spent figuring out where to root sys.path (looking
for the sentinel, deciding if this is a developer
build, etc.). In crude experiments on my Linux
box, I got rid of a few hundred system calls
just by removing most of the intelligence from
the getpath stuff. 

Then there are the things you can do with import
(archives, careful crafting of sys.path), but that's
harder to do, especially in a way that will satisfy
most people / most scripts.

So the lowest hanging fruit, I think, is to find some
way of telling Python "don't be clever - just start
here", and have it fallback to current behavior in
the absence of that info.

-- Gordon

From  Fri Aug  9 13:28:01 2002
From: (Jeremy Hylton)
Date: Fri, 9 Aug 2002 08:28:01 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <>
Message-ID: <>

One of the things I mentioned on the thread is the
Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS calls around every
fwrite() call.  I looked into it, using Penrose's test case, and found
that the locking alone added 25% overhead.  I expect the layer or two
of C function calls above fwrite() add overhead.  I also expect that
calling fwrite() repeatedly for very small strings is inefficient.

If I were to suggest a cPickle project, it would be an efficient
internal buffering scheme.
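
A minimal Python-level sketch of that buffering idea: collect many small
chunks and pass them to the file in a few large writes. The class
BufferedSink and its limit parameter are hypothetical, and a real cPickle
change would be written in C.

```python
import io

class BufferedSink:
    """Collect many small chunks and hand them to the file in large writes."""

    def __init__(self, fileobj, limit=64 * 1024):
        self._file = fileobj
        self._limit = limit
        self._chunks = []
        self._size = 0

    def write(self, data):
        # Remember the chunk; only touch the underlying file once enough piled up.
        self._chunks.append(data)
        self._size += len(data)
        if self._size >= self._limit:
            self.flush()

    def flush(self):
        if self._chunks:
            self._file.write(b"".join(self._chunks))
            self._chunks = []
            self._size = 0

out = io.BytesIO()
sink = BufferedSink(out, limit=16)
for i in range(10):
    sink.write(b"k%d" % i)   # ten tiny writes
sink.flush()                 # push out whatever is still buffered
print(out.getvalue())
```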


From  Fri Aug  9 15:03:40 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 10:03:40 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Fri, 09 Aug 2002 08:28:01 EDT."
References: <> <> <> <>
Message-ID: <>

> One of the things I mentioned on the thread is the
> Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS calls around every
> fwrite() call.  I looked into it, using Penrose's test case, and found
> that the locking alone added 25% overhead.  I expect the layer or two
> of C function calls above fwrite() add overhead.  I also expect that
> calling fwrite() repeatedly for very small strings is inefficient.
> If I were to suggest a cPickle project, it would be an efficient
> internal buffering scheme.

Who's got time?  It's fast enough for Zope. :-)

--Guido van Rossum (home page:

From  Fri Aug  9 15:08:40 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 10:08:40 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Fri, 09 Aug 2002 11:35:19 +0200."
References: <> <> <> <> <> <> <> <> <>
Message-ID: <>

> What ever happened to the
> ZIP archive import that James C. Ahlstrom was working on (I think it
> was him) ?

is open for review.

--Guido van Rossum (home page:

From  Fri Aug  9 15:13:12 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 10:13:12 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Thu, 08 Aug 2002 23:08:35 PDT."
References: <>
Message-ID: <>

> Since '' in 'abc' now returns True, How about changing
> 'abc'.replace('') to generate '_a_b_c_', too? It is consistent with
> re.sub()/subn() and the cost for change is similar to '' in 'abc'
> case.

Do you have a use case?  Or are you just striving for consistency?  It
would be more consistent but I'm not sure what the point is.  I can
think of situations where '' in 'abc' would be needed, but not so for
'abc'.replace('', '_').

--Guido van Rossum (home page:

From  Fri Aug  9 15:14:06 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 10:14:06 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Fri, 09 Aug 2002 00:13:15 EDT."
References: <>
Message-ID: <>

> I hope Oren resumes his crusade to make interned strings follow the
> same refcount rules as everything else, and then we wouldn't have
> this fear of interning.  BTW, nobody yet has reported any code where
> "indirect interning" pays -- or even triggers once in a
> non-eating-its-own-tail way.

Maybe we should just drop indirect interning then.  It can save 31
bits per string object, right?  How to collect those savings?

--Guido van Rossum (home page:

From  Fri Aug  9 15:44:36 2002
From: (Barry A. Warsaw)
Date: Fri, 9 Aug 2002 10:44:36 -0400
Subject: [Python-Dev] _sre as part of python.dll
References: <>
Message-ID: <>

>>>>> "Gordo" == Gordon McMillan <> writes:

    Gordo> In the 1.5.2 timeframe, most *startup* time was
    Gordo> spent figuring out where to root sys.path (looking
    Gordo> for the sentinel, deciding if this is a developer
    Gordo> build, etc.). In crude experiments on my Linux
    Gordo> box, I got rid of a few hundred system calls
    Gordo> just by removing most of the intelligence from
    Gordo> the getpath stuff. 

I remember doing some similar testing probably around the Python 2.0
timeframe and found a huge speed up by avoiding the import of site.py
for largely the same reasons (avoiding tons of stat calls).  It's not
always practical to avoid loading site.py, but if you can, you can get
a big startup win.

    Gordo> So the lowest hanging fruit, I think, is to find some
    Gordo> way of telling Python "don't be clever - just start
    Gordo> here", and have it fallback to current behavior in
    Gordo> the absence of that info.

That's what $PYTHONHOME is supposed to do.  It's been a while since I
dug around in getpath.c, but setting $PYTHONHOME should set prefix and
exec_prefix unconditionally, even in the build directory.

(The comments in the file are a bit misleading.  Step 1 could be
read as implying that $PYTHONHOME isn't consulted when looking for
build directory landmarks, but that's not the case: even for a build
dir search, $PYTHONHOME is trusted unconditionally.)


From  Fri Aug  9 15:46:57 2002
From: (Todd Miller)
Date: Fri, 09 Aug 2002 10:46:57 -0400
Subject: [Python-Dev] C basetype mapping protocol difference between 2.2.1 and 2.3
Message-ID: <>

I am trying to accelerate Numarray by "dropping the bottom out" and 
re-writing the simplest, most used portions in a  C basetype.  Looking 
at a new NumArray instance in Python-2.2.1 under GDB, I see:

(gdb) p *self->ob_type->tp_as_mapping
$2 = {mp_length = 0x4006eb7c <_ndarray_length>, mp_subscript = 0x80669b8 
<slot_mp_subscript>,
mp_ass_subscript = 0x80669e0 <slot_mp_ass_subscript>}

Looking at the same code compiled for Python-2.3, _ndarray "owns" all of 
the mapping protocol slots, which is what I really want to happen:    

(gdb) p *o->ob_type->tp_as_mapping
$1 = {mp_length = 0x400c1a68 <_ndarray_length>, mp_subscript = 
0x400c1a80 <_ndarray_subscript>, mp_ass_subscript = 0x400c1188 
<_ndarray_ass_subscript>}

Did anything change between Python-2.2.1 and Python-2.3 that would 
account for this?


Todd Miller

From  Fri Aug  9 15:52:48 2002
From: (Andrew Koenig)
Date: 09 Aug 2002 10:52:48 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <>

Guido> Do you have a use case?  Or are you just striving for consistency?  It
Guido> would be more consistent but I'm not sure what the point is.  I can
Guido> think of situations where '' in 'abc' would be needed, but not so for
Guido> 'abc'.replace('', '_').

It's the first way that comes to mind of  s p r e a d i n g   o u t   the
characters in a string for use in, say, the title of a report.

Andrew Koenig,,

From  Fri Aug  9 16:01:30 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 11:01:30 -0400
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: Your message of "Fri, 09 Aug 2002 10:44:36 EDT."
References: <> <3D537482.23186.52A3D72F@localhost>
Message-ID: <>

> I remember doing some similar testing probably around the Python 2.0
> timeframe and found a huge speed up by avoiding the import of site.py
> for largely the same reasons (avoiding tons of stat calls).  It's not
> always practical to avoid loading site.py, but if you can, you can get
> a big startup win.

It's also easy: "python -S" avoids loading site.py.

--Guido van Rossum (home page:

From  Fri Aug  9 16:12:08 2002
From: (Oren Tirosh)
Date: Fri, 9 Aug 2002 11:12:08 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Fri, Aug 09, 2002 at 10:14:06AM -0400, Guido van Rossum wrote:
> > I hope Oren resumes his crusade to make interned strings follow the
> > same refcount rules as everything else, and then we wouldn't have
> > this fear of interning.  BTW, nobody yet has reported any code where
> > "indirect interning" pays -- or even triggers once in a
> > non-eating-its-own-tail way.
> Maybe we should just drop indirect interning then.  It can save 31
> bits per string object, right?  How to collect those savings?

I was just going back to that patch.  The current savings are 24 bits (so 
now you see why I considered making 'interned' a type - to get that bit 
in without paying for a whole byte :-).

Before the nitpickers point it out: yes, the average savings are likely to 
be less than 24 bits because of allocator overhead and nonuniform 
distribution of string lengths.


From  Fri Aug  9 16:26:19 2002
From: (Duncan Booth)
Date: Fri, 9 Aug 2002 16:26:19 +0100
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
References: <> <> <>
Message-ID: <>

On 09 Aug 2002, Andrew Koenig <> wrote:

>> Do you have a use case?  Or are you just striving for consistency? 
>> It would be more consistent but I'm not sure what the point is.  I
>> can think of situations where '' in 'abc' would be needed, but not so
>> for 'abc'.replace('', '_'). 
> It's the first way that comes to mind of  s p r e a d i n g   o u t  
> the characters in a string for use in, say, the title of a report.

The first way that comes to my mind is:

>>> ' '.join("spreading out")
's p r e a d i n g   o u t'

Duncan Booth                                   
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

From  Fri Aug  9 16:11:50 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 11:11:50 -0400
Subject: [Python-Dev] C basetype mapping protocol difference between 2.2.1 and 2.3
In-Reply-To: Your message of "Fri, 09 Aug 2002 10:46:57 EDT."
References: <>
Message-ID: <>

> (gdb) p *self->ob_type->tp_as_mapping
> $2 = {mp_length = 0x4006eb7c <_ndarray_length>, mp_subscript = 0x80669b8 
> <slot_mp_subscript>,
> mp_ass_subscript = 0x80669e0 <slot_mp_ass_subscript>}
> Looking at the same code compiled for Python-2.3, _ndarray "owns" all of 
> the mapping protocol slots, which is what I really want to happen:    
> (gdb) p *o->ob_type->tp_as_mapping
> $1 = {mp_length = 0x400c1a68 <_ndarray_length>, mp_subscript = 
> 0x400c1a80 <_ndarray_subscript>, mp_ass_subscript = 0x400c1188 
> <_ndarray_ass_subscript>}
> Did anything change between Python-2.2.1 and Python-2.3 that would 
> account for this?

Yes, I did several massive refactorings of a lot of very subtle code
in typeobject.c.  Note that this is only a performance improvement,
not a semantic change: slot_mp_subscript will look for and call the
__getitem__ descriptor in the type dict, which will be a Python
wrapper around _ndarray_subscript.  The new code notices that this is
so and leaves _ndarray_subscript in the slot.

I wish it was easy to backport this to 2.2.2, but it's not. :-(

--Guido van Rossum (home page:

From  Fri Aug  9 16:37:11 2002
From: (Barry A. Warsaw)
Date: Fri, 9 Aug 2002 11:37:11 -0400
Subject: [Python-Dev] _sre as part of python.dll
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    >> I remember doing some similar testing probably around the
    >> Python 2.0 timeframe and found a huge speed up by avoiding the
    >> import of site.py for largely the same reasons (avoiding tons
    >> of stat calls).  It's not always practical to avoid loading
    >> site.py, but if you can, you can get a big startup win.

    GvR> It's also easy: "python -S" avoids loading site.py.

Yes.  The one gotcha is that site-packages is put on sys.path via
site.py, so using -S means you lose that directory.  You can, of
course, reinstall it explicitly by something like:

import os, sys
sitedir = os.path.join(sys.prefix, 'lib', 'python'+sys.version[:3],
                       'site-packages')
sys.path.append(sitedir)


From  Fri Aug  9 16:39:30 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 11:39:30 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Fri, 09 Aug 2002 16:26:19 BST."
References: <> <> <>
Message-ID: <>

If someone really wants 'abc'.replace('', '-') to return '-a-b-c-',
please submit patches for both 8-bit and Unicode strings to
SourceForge and assign to me.  I looked into this and it's
non-trivial: the implementation used for 8-bit strings goes into an
infinite loop when the pattern is empty, and the Unicode
implementation tacks '----' onto the end.  Please supply doc and
unittest patches too.  At least re does the right thing already:

  >>> import re
  >>> re.sub('', '-', 'abc')
  '-a-b-c-'

--Guido van Rossum (home page:

From  Fri Aug  9 16:45:03 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 11:45:03 -0400
Subject: [Python-Dev] C basetype mapping protocol difference between 2.2.1 and 2.3
In-Reply-To: Your message of "Fri, 09 Aug 2002 11:36:23 EDT."
References: <> <>
Message-ID: <>

> Doesn't the current wrapper narrow the acceptable definitions for 
> _ndarray_subscript?  The reason I noticed this is that my 2.2.1 code 
> raises an exception:
>  >>> import numarray
>  >>> a=numarray.arange(10)
>  >>> a
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
> line 622, in __repr__
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
> line 156, in array2string
> separator, array_output)
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
> line 112, in _array2string
> max_str_len = max(len(str(max_reduce(data))),
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
> 759, in reduce
> r = self.areduce(inarr, dim, outarr)
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
> 745, in areduce
> _outarr1 = self._cumulative("reduce", _inarr, _outarr0)
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
> 653, in _cumulative
> toutarr = self._reduce_out(inarr, outarr, outtype)
> File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
> 591, in _reduce_out
> toutarr = inarr[...,0].copy().astype(outtype)
> TypeError: an integer is required

I guess that means it's going through the *sequence* getitem, not the
*mapping* getitem.  Have you tried leaving the sequence getitem slot
NULL, and doing everything through your mapping getitem slot?  That
should work in 2.2.
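
A small illustration of why the failing expression cannot go through an integer-indexed sequence slot: subscripting with `...,0` passes a tuple as the key, and the C-level sequence item slot takes only an integer index. The Probe class is hypothetical and exists only to show the key object that Python-level subscription delivers.

```python
class Probe:
    """Report the key object that subscription passes to __getitem__."""

    def __getitem__(self, key):
        return key


p = Probe()
print(p[3])          # a plain integer index
print(p[..., 0])     # a tuple: (Ellipsis, 0)
print(p[1:5])        # a slice object
```

Only the first form is something an integer-only item slot could accept; the other two need the mapping-style subscript.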

--Guido van Rossum (home page:

From  Fri Aug  9 16:56:45 2002
From: (David Abrahams)
Date: Fri, 9 Aug 2002 11:56:45 -0400
Subject: [Python-Dev] Exception-handling model
Message-ID: <027301c23fbd$b3c44b30$>

I have always been confused about Python's exception-handling model. I hope
someone can clear up a few questions: says:

    "When an exception is raised, an object (maybe None) is passed as the
exception's value; this object does not affect the selection of an
exception handler, but is passed to the selected exception handler as
additional information. For class exceptions, this object must be an
instance of the exception class being raised."

But unless I misunderstand the source, Luke, Python itself raises
exceptions all over the place with PyErr_SetString(), which uses a class as
the exception type and a string as the exception object. Other uses of
PyErr_SetObject() that I've found /never/ seem to use an instance of the
exception class as the exception object.

If I got that right, what's the meaning of the documentation I quoted?
What rules must one actually follow when raising an exception?


           David Abrahams * Boost Consulting *

From  Fri Aug  9 17:32:39 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 12:32:39 -0400
Subject: [Python-Dev] Exception-handling model
In-Reply-To: Your message of "Fri, 09 Aug 2002 11:56:45 EDT."
References: <027301c23fbd$b3c44b30$>
Message-ID: <>

> I have always been confused about Python's exception-handling model. I hope
> someone can clear up a few questions:
> says:
>     "When an exception is raised, an object (maybe None) is passed as the
> exception's value; this object does not affect the selection of an
> exception handler, but is passed to the selected exception handler as
> additional information. For class exceptions, this object must be an
> instance of the exception class being raised."
> But unless I misunderstand the source, Luke, Python itself raises
> exceptions all over the place with PyErr_SetString(), which uses a class as
> the exception type and a string as the exception object. Other uses of
> PyErr_SetObject() that I've found /never/ seem to use an instance of the
> exception class as the exception object.
> If I got that right, what's the meaning of the documentation I quoted?
> What rules must one actually follow when raising an exception?

This may be a case where reading the source is actually confusing. :-)

When the exception type is a class and the exception value is not an
instance of that class, eventually the class is instantiated with the
value as argument (if the value is a tuple, it is used as an argument
list).

But there's an efficiency hack that tries to put off the class
instantiation as long as possible.  It is possible for C code to
"catch" the exception and clear it without the instantiation
happening, and then the instantiation costs are saved.  Because C code
rather frequently checks and clears exceptions, this can be a big win.
Thus, in C, if you get the exception value using PyErr_Fetch(), you may
see a value that's not an instance of the class.  But if you catch it
in Python with an except clause, it will be instantiated before your
except clause is entered.  This is done by PyErr_NormalizeException();
its API docs provide a summary of what I just explained.
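
The Python-visible half of this can be checked with a short sketch. It uses present-day `except ... as` syntax rather than the 2002 spelling, and relies on int() raising its ValueError from C code; the deferred instantiation itself happens inside the interpreter and is not observable from Python.

```python
caught = None
try:
    int("not a number")      # the ValueError is raised from C code
except ValueError as exc:
    caught = exc

# By the time the except clause runs, the value is a real instance
# of the exception class, with the message stored in its args.
print(type(caught).__name__)
print(isinstance(caught, ValueError))
print(caught.args)
```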

--Guido van Rossum (home page:

From  Fri Aug  9 17:57:12 2002
From: (David Abrahams)
Date: Fri, 9 Aug 2002 12:57:12 -0400
Subject: [Python-Dev] Exception-handling model
References: <027301c23fbd$b3c44b30$>  <>
Message-ID: <030801c23fc6$4a9d1c00$>

Thanks, Guido and Jeremy! I thought something like that might be the case,
but then couldn't find the code which did the work... I never expected the
instantiation to be deferred in that way so I never looked further than the
code which actually does the raising.

It would probably be useful to put pointers to the PyErr_NormalizeException
behavior right at the top of the API docs for exception-handling, since
making sense out of basic facilities like PyErr_SetString() depends on it.

           David Abrahams * Boost Consulting *

From: "Guido van Rossum" <>
> > I have always been confused about Python's exception-handling model. I hope
> > someone can clear up a few questions:
> >
> > says:
> >
> >     "When an exception is raised, an object (maybe None) is passed as the
> > exception's value; this object does not affect the selection of an
> > exception handler, but is passed to the selected exception handler as
> > additional information. For class exceptions, this object must be an
> > instance of the exception class being raised."
> >
> > But unless I misunderstand the source, Luke, Python itself raises
> > exceptions all over the place with PyErr_SetString(), which uses a class as
> > the exception type and a string as the exception object. Other uses of
> > PyErr_SetObject() that I've found /never/ seem to use an instance of the
> > exception class as the exception object.
> >
> > If I got that right, what's the meaning of the documentation I quoted?
> > What rules must one actually follow when raising an exception?
> This may be a case where reading the source is actually confusing. :-)
> When the exception type is a class and the exception value is not an
> instance of that class, eventually the class is instantiated with the
> value as argument (if the value is a tuple, it is used as an argument
> list).
> But there's an efficiency hack that tries to put off the class
> instantiation as long as possible.  It is possible for C code to
> "catch" the exception and clear it without the instantiation
> happening, and then the instantiation costs are saved.  Because C code
> rather frequently checks and clears exceptions, this can be a big win.
> Thus, in C, if you get the exception value using PyErr_Fetch(), you may
> see a value that's not an instance of the class.  But if you catch it
> in Python with an except clause, it will be instantiated before your
> except clause is entered.  This is done by PyErr_NormalizeException();
> its API docs provide a summary of what I just explained.
> --Guido van Rossum (home page:

From  Fri Aug  9 18:50:40 2002
From: (Tim Peters)
Date: Fri, 09 Aug 2002 13:50:40 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> Maybe we should just drop indirect interning then.  It can save 31
> bits per string object, right?  How to collect those savings?

Make the flag a byte instead of a pointer and it will save 3 or 7 bytes
(depending on native pointer size) "on average".  Note, assuming a 32-bit
box:  since pymalloc 8-byte aligns, the smallest footprint a string object
can have now is 24 bytes, 20 of which are consumed by bookkeeping overheads
(type pointer, refcount, ob_size, ob_shash, ob_sinterned).  Strings through
length 3 fit in this size (one byte is needed for the trailing \0 we always
put in ob_sval[]).  Saving 3 bytes wouldn't actually change the memory
burden of the smallest string object, but would allow all strings of lengths
4, 5 and 6 to consume 8 fewer bytes than at present (assuming compilers are
happy not to pad between a char member and char[] member).  That's probably
a significant savings for many string-slinging apps (count the number of
words of lengths 4, 5 and 6 in this msg (even <wink> benefits <wink>)).
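
The arithmetic above can be replayed in a few lines. This is a sketch under the message's stated assumptions: a 32-bit box, five 4-byte bookkeeping fields (20 bytes), 8-byte alignment, and no padding between a one-byte flag and ob_sval[]. The function and constant names are made up for the illustration.

```python
def footprint(length, header, align=8):
    """Bytes for a string of `length` chars: header + chars + trailing NUL,
    rounded up to the allocator's alignment."""
    raw = header + length + 1
    return (raw + align - 1) // align * align


POINTER_FLAG_HEADER = 20   # four 4-byte fields + 4-byte ob_sinterned pointer
BYTE_FLAG_HEADER = 17      # same four fields, with a 1-byte flag instead

# Columns: string length, bytes now, bytes with a byte flag, bytes saved.
for n in range(8):
    now = footprint(n, POINTER_FLAG_HEADER)
    then = footprint(n, BYTE_FLAG_HEADER)
    print(n, now, then, now - then)
```

Lengths 0 through 3 stay at 24 bytes either way, lengths 4, 5 and 6 drop from 32 to 24, and length 7 needs 32 in both layouts.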

From  Fri Aug  9 19:03:10 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 14:03:10 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Fri, 09 Aug 2002 13:50:40 EDT."
References: <>
Message-ID: <>

> [Guido]
> > Maybe we should just drop indirect interning then.  It can save 31
> > bits per string object, right?  How to collect those savings?

> Make the flag a byte instead of a pointer and it will save 3 or 7
> bytes (depending on native pointer size) "on average".  Note,
> assuming a 32-bit box: since pymalloc 8-byte aligns, the smallest
> footprint a string object can have now is 24 bytes, 20 of which are
> consumed by bookkeeping overheads (type pointer, refcount, ob_size,
> ob_shash, ob_sinterned).  Strings through length 3 fit in this size
> (one byte is needed for the trailing \0 we always put in ob_sval[]).
> Saving 3 bytes wouldn't actually change the memory burden of the
> smallest string object, but would allow all strings of lengths 4, 5
> and 6 to consume 8 fewer bytes than at present (assuming compilers
> are happy not to pad between a char member and char[] member).
> That's probably a significant savings for many string-slinging apps
> (count the number of words of lengths 4, 5 and 6 in this msg (even
> <wink> benefits <wink>)).

This means a change in the string object lay-out, which breaks binary
compatibility (the PyString_AS_STRING macro depends on this).

I don't mind biting this bullet, but it means we have to increment the
API version, and perhaps the warning about API version mismatches
should become an error if an extension with an API version from before
this change is detected.

Oren, how's that patch coming along? :-)

--Guido van Rossum (home page:

From  Fri Aug  9 17:13:34 2002
From: (Todd Miller)
Date: Fri, 09 Aug 2002 12:13:34 -0400
Subject: [Python-Dev] C basetype mapping protocol difference between 2.2.1
 and 2.3
References: <> <>              <> <>
Message-ID: <>

Guido van Rossum wrote:

>>Doesn't the current wrapper narrow the acceptable definitions for 
>>_ndarray_subscript?  The reason I noticed this is that my 2.2.1 code 
>>raises an exception:
>> >>> import numarray
>> >>> a=numarray.arange(10)
>> >>> a
>>Traceback (most recent call last):
>>File "<stdin>", line 1, in ?
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
>>line 622, in __repr__
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
>>line 156, in array2string
>>separator, array_output)
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", 
>>line 112, in _array2string
>>max_str_len = max(len(str(max_reduce(data))),
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
>>759, in reduce
>>r = self.areduce(inarr, dim, outarr)
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
>>745, in areduce
>>_outarr1 = self._cumulative("reduce", _inarr, _outarr0)
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
>>653, in _cumulative
>>toutarr = self._reduce_out(inarr, outarr, outtype)
>>File "/home/jmiller/lib/python2.2/site-packages/numarray/", line 
>>591, in _reduce_out
>>toutarr = inarr[...,0].copy().astype(outtype)
>>TypeError: an integer is required
>I guess that means it's going through the *sequence* getitem, not the
>*mapping* getitem.  Have you tried leaving the sequence getitem slot
>NULL, and doing everything through your mapping getitem slot?  That
>should work in 2.2.

It does now.

>--Guido van Rossum (home page:

Todd Miller

From  Fri Aug  9 21:20:47 2002
From: (Patrick K. O'Brien)
Date: Fri, 9 Aug 2002 15:20:47 -0500
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Tim Peters]
> OTOH, I do expect that once code relies on stability, we'll have about as
> much chance of taking that away as getting rid of list.append().

There you go again! Your flip comment has got me thinking about the "one
best idiom" for list appending. So I'll ask the question. Is there a reason
to want to get rid of list.append()? How does one decide between
list.append() and augmented assignment (+=), such as:

>>> l = []
>>> l.append('something')
>>> l.append('else')
>>> l += ['to']
>>> l += ['consider']
>>> l
['something', 'else', 'to', 'consider']

And wasn't someone documenting current idioms in light of recent Python
features? Did that ever get posted anywhere?

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Fri Aug  9 21:33:55 2002
From: (Tim Peters)
Date: Fri, 09 Aug 2002 16:33:55 -0400
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

> OTOH, I do expect that once code relies on stability, we'll
> have about as much chance of taking that away as getting rid of
> list.append().

[Patrick K. O'Brien]
> There you go again! Your flip comment has got me thinking about the "one
> best idiom" for list appending.

It was a serious enough comment, judged against the universe of all comments
I make <wink>.  I expect that if a hypothetical 3x-faster non-stable sort
algorithm got discovered for 2.6, we wouldn't be able to call it list.sort()
then.  It's very hard to take any perceived goodness away, ever.

> So I'll ask the question. Is there a reason to want to get rid of
> list.append()?

I believe-- and sincerely hope --that I'm the only one who ever suggested
that.

> How does one decide between list.append() and augmented assignment (+=),
> such as:
> >>> l = []
> >>> l.append('something')
> >>> l.append('else')
> >>> l += ['to']
> >>> l += ['consider']
> >>> l
> ['something', 'else', 'to', 'consider']
> >>>

Clarity:  l.append() is obvious; I'd never append a single item via +=.
More, I'd probably do

    push = l.append

outside a loop and call push('something') inside the loop.  "+=" as a
synonym for list.extend() is only interesting if you're writing polymorphic
code that wants to exploit the ability to define __iadd__.  For sane people,
that's approximately never <wink>.
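
Both points can be seen side by side in a short sketch. The Tally class is hypothetical; it stands in for the kind of polymorphic code that defines __iadd__ so that += works on something other than a list.

```python
items = []
push = items.append          # look the bound method up once, outside the loop
for word in ("something", "else"):
    push(word)

items += ["to", "consider"]  # for lists, += extends in place like list.extend()


class Tally:
    """Hypothetical class defining __iadd__, so += also works on it."""

    def __init__(self):
        self.seen = []

    def __iadd__(self, other):
        self.seen.extend(other)
        return self


tally = Tally()
tally += ["to", "consider"]
print(items)
print(tally.seen)
```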

> And wasn't someone documenting current idioms in light of recent Python
> features? Did that ever get posted anywhere?

Rings a bell, but beats me.

From  Fri Aug  9 21:40:02 2002
From: (Guido van Rossum)
Date: Fri, 09 Aug 2002 16:40:02 -0400
Subject: [Python-Dev] PEP 282 Implementation
In-Reply-To: Your message of "Mon, 08 Jul 2002 02:16:32 BST."
References: <00e001c2261d$19bfc320$652b6992@alpha>
Message-ID: <>

A month (!) ago, Vinay Sajip wrote:

> I've uploaded my logging module, the proposed implementation for PEP 282,
> for committer review, to the SourceForge patch manager:
> I've assigned it to Mark Hammond as (a) he had posted some comments
> to Trent Mick's original PEP posting, and (b) Barry Warsaw advised
> not assigning to PythonLabs people on account of their current
> workload.

Well, Mark was apparently too busy too.  I've assigned this to myself
and am making progress with the review.

> The file is (apart from some test scripts) all that's
> supposed to go into Python 2.3. The file logging-0.4.6.tar.gz
> contains the module, an updated version of the PEP (which I mailed
> to Barry Warsaw on 26th June), numerous test/example scripts, TeX
> documentation etc. You can also refer to
> Here's hoping for a speedy review :-)

Here's some feedback.

In general the code looks good.  Only one style nit: I prefer
docstrings that have a one-line summary, then a blank line, and then a
longer description.

There's a lot of code there!  Should it perhaps be broken up into
different modules?  Perhaps it should become a logging *package* with
submodules that define the various filters and handlers.

Some detailed questions:

- Why does the FileHandler open the file with mode "a+" (and later
  with "w+")?  The "+" makes the file readable, but I see no reason to
  read it.  Am I missing something?

- setRollover(): the explanation isn't 100% clear.  I *think* that you
  always write to "app.log", and when that's full, you rename it to
  app.log.1, and app.log.1 gets renamed to app.log.2, and so on, and
  then you start writing to a new app.log, right?

- class SocketHandler: why set yourself up for buffer overflow by
  using only 2 bytes for the packet size?  You can use the struct
  module to encode/decode this, BTW.  I also wonder what the
  application for this is, BTW.

  - method send(): in Python 2.2 and later, you can use the sendall()
    socket method which takes care of this loop for you.

- class DatagramHandler, method send(): I don't think UDP handles
  fragmented packets very well -- if you have to break the packet up,
  there's no guarantee that the receiver will see the parts in order
  (or even all of them).

- fileConfig(): Is there documentation for the configuration file?
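
On the SocketHandler point, a minimal sketch of length-prefixed framing with the struct module. The helper names frame() and unframe() and the choice of a 4-byte big-endian prefix are assumptions made for the illustration, not the logging module's actual wire format.

```python
import struct

SHORT_PREFIX = struct.Struct(">H")   # 2 bytes: lengths up to 65535
LONG_PREFIX = struct.Struct(">L")    # 4 bytes: lengths up to 4294967295


def frame(payload, prefix=LONG_PREFIX):
    """Return payload preceded by its length in network byte order."""
    return prefix.pack(len(payload)) + payload


def unframe(data, prefix=LONG_PREFIX):
    """Split one framed message off the front of data."""
    (length,) = prefix.unpack_from(data)
    start = prefix.size
    return data[start:start + length], data[start + length:]


big = b"x" * 70000                   # too long for a 2-byte length
message, rest = unframe(frame(big) + b"next")
print(len(message), rest)

try:
    frame(big, SHORT_PREFIX)
except struct.error as err:
    print("2-byte prefix cannot hold this length:", err)
```

On a connected stream socket, `sock.sendall(frame(payload))` would then deliver the whole frame without a hand-written send loop.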

That's it for now.

--Guido van Rossum (home page:

From  Fri Aug  9 21:45:27 2002
From: (Patrick K. O'Brien)
Date: Fri, 9 Aug 2002 15:45:27 -0500
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

[Tim Peters]
> > So I'll ask the question. Is there a reason to want to get rid of
> > list.append()?
> I believe-- and sincerely hope --that I'm the only one who ever suggested
> that.

Okay, good. I just needed a reality check. Thanks.

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Fri Aug  9 21:51:54 2002
From: (Inyeol Lee)
Date: Fri, 9 Aug 2002 13:51:54 -0700
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <> <>
Message-ID: <>

To underline strings for viewers like less.

>>> underlined = normal.replace('', '_\b')

This also can be done with re.sub(), but I think it is natural to use
string methods to handle non-RE strings.

This cannot be done with '_\b'.join(), since it doesn't prepend '_\b'.

- Inyeol Lee

On Fri, Aug 09, 2002 at 10:13:12AM -0400, Guido van Rossum wrote:
> > Since '' in 'abc' now returns True, How about changing
> > 'abc'.replace('') to generate '_a_b_c_', too? It is consistent with
> > re.sub()/subn() and the cost for change is similar to '' in 'abc'
> > case.
> Do you have a use case?  Or are you just striving for consistency?  It
> would be more consistent but I'm not sure what the point is.  I can
> think of situations where '' in 'abc' would be needed, but not so for
> 'abc'.replace('', '_').
> --Guido van Rossum (home page:

From  Fri Aug  9 22:31:30 2002
From: (Ka-Ping Yee)
Date: Fri, 9 Aug 2002 14:31:30 -0700 (PDT)
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208091428580.2277-100000@ziggy>

On Fri, 9 Aug 2002, Inyeol Lee wrote:
> To underline strings for viewers like less.
> >>> underlined = normal.replace('', '_\b')

That doesn't quite work, since it puts an extra underbar at the end.
But it can be done fairly easily without using replace():

    underlined = ''.join(['_\b' + c for c in normal])

-- ?!ng

From  Fri Aug  9 22:47:20 2002
From: (Andrew Koenig)
Date: 09 Aug 2002 17:47:20 -0400
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <Pine.LNX.4.44.0208091428580.2277-100000@ziggy>
References: <Pine.LNX.4.44.0208091428580.2277-100000@ziggy>
Message-ID: <>

Ping> On Fri, 9 Aug 2002, Inyeol Lee wrote:

>> To underline strings for viewers like less.

>> >>> underlined = normal.replace('', '_\b')

Ping> That doesn't quite work, since it puts an extra underbar at the end.
Ping> But it can be done fairly easily without using replace():

Ping>     underlined = ''.join(['_\b' + c for c in normal])

With a sufficiently rich family of functions, you can avoid any one of
them if you want to do so badly enough.  Even so, that doesn't make
proposed uses of that function illegitimate.

Andrew Koenig,,

From  Fri Aug  9 23:20:57 2002
From: (Inyeol Lee)
Date: Fri, 9 Aug 2002 15:20:57 -0700
Subject: [Python-Dev] Re: string.find() again (was Re: timsort for jython)
In-Reply-To: <Pine.LNX.4.44.0208091428580.2277-100000@ziggy>
References: <> <Pine.LNX.4.44.0208091428580.2277-100000@ziggy>
Message-ID: <>

On Fri, Aug 09, 2002 at 02:31:30PM -0700, Ka-Ping Yee wrote:
> On Fri, 9 Aug 2002, Inyeol Lee wrote:
> > To underline strings for viewers like less.
> >
> > >>> underlined = normal.replace('', '_\b')
> That doesn't quite work, since it puts an extra underbar at the end.

underlined = normal.replace('', '_\b', len(normal))

Hmm... my position is getting weaker...
When I first posted this, I just thought about consistency, not about
use cases. These underline samples were created in a hurry :-)
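
For comparison, the three spellings from this subthread can be run against a current CPython, where str.replace() accepts an empty pattern. The variable names are illustrative, and the behavior shown is that of present-day Python, not of the 2.2 implementation under discussion.

```python
normal = "abc"

# Empty pattern: the replacement goes before every character and at the end.
with_tail = normal.replace("", "_\b")

# Limiting the count to len(normal) drops the trailing copy.
limited = normal.replace("", "_\b", len(normal))

# The join-based spelling from earlier in the thread.
joined = "".join(["_\b" + c for c in normal])

print(repr(with_tail))
print(limited == joined)
```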

-- Inyeol Lee

From  Sat Aug 10 00:28:48 2002
From: (Raymond Hettinger)
Date: Fri, 9 Aug 2002 19:28:48 -0400
Subject: [Python-Dev] timsort for jython
References: <>
Message-ID: <004001c23ffc$845a15c0$b9e97ad1@othello>

From: "Patrick K. O'Brien" <>
> And wasn't someone documenting current idioms in light of recent Python
> features? Did that ever get posted anywhere?

Are you referring to the modernization and migration guide, ?  It documents
transition procedures for new features but doesn't make
current idioms a central focus.


From  Sat Aug 10 00:49:41 2002
From: (Patrick K. O'Brien)
Date: Fri, 9 Aug 2002 18:49:41 -0500
Subject: [Python-Dev] timsort for jython
In-Reply-To: <004001c23ffc$845a15c0$b9e97ad1@othello>
Message-ID: <>

[Raymond Hettinger]
> From: "Patrick K. O'Brien" <>
> > And wasn't someone documenting current idioms in light of recent Python
> > features? Did that ever get posted anywhere?
> Are you referring to the modernization and migration guide,
> ?  It documents
> transition procedures for new features but doesn't make
> current idioms a central focus.

Yep. That was it. I forgot that it became a PEP. Thanks for the link.

Patrick K. O'Brien
"Your source for Python programming expertise."

From  Sat Aug 10 07:56:10 2002
From: (Tim Peters)
Date: Sat, 10 Aug 2002 02:56:10 -0400
Subject: [Python-Dev] RE: companies data for sorting comparisons
In-Reply-To: <>
Message-ID: <>

Update:  With the last batch of checkins, all sorts on Kevin's company
database are faster (a little to a killer lot) under 2.3a0 than under 2.2.1.

A reminder of what this looks like:

> A record looks like this after running his script to turn them
> into Python dicts:
>   {'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
>    'Company': 'Agilent Technologies Inc.',
>    'Exchange': 'NYSE',
>    'NumberOfEmployees': '41,000',
>    'Phone': '(650) 752-5000',
>    'Profile': '',
>    'Symbol': 'A',
>    'Web': ''}
> It appears to me that the XML file is maintained by hand, in order
> of ticker symbol.  But people make mistakes when alphabetizing
> by hand, and there are 37 indices i such that
>     data[i]['Symbol'] > data[i+1]['Symbol']
> So it's "almost sorted" by that measure ...
> The proper order of Yahoo profile URLs is also strongly correlated
> with ticker symbol, while both the company name and web address
> look weakly correlated
> [and Address, NumberOfEmployees, and Phone are essentially
>  randomly ordered]

Here are the latest (and I expect the last) timings, in milliseconds per
sort, on the list of (key, index, record) tuples

    values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]

[I wrote a little generator to simulate 2.3's enumerate() in 2.2.1]
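
A stand-in of that kind might look like the sketch below. The name
enumerate_sim is hypothetical, not the one actually used for the
timings, and under 2.2 the module would presumably also need
"from __future__ import generators" at the top.

```python
def enumerate_sim(sequence):
    """Yield (index, item) pairs, like the enumerate() builtin."""
    i = 0
    for item in sequence:
        yield i, item
        i += 1
```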

There are 6635 companies in the database, but not all fields are present in
all records; .get() plugs in a key of None for those cases, and the index is
to prevent equal-key cases from falling into breaking the tie via expensive
dict comparison (each record x is a dict!):

Sorting on field 'Address'
    2.2.1:  41.57
    2.3a0:  40.96

Sorting on field 'Company'
    2.2.1:  40.14
    2.3a0:  29.79

Sorting on field 'Exchange'
    2.2.1:  53.83
    2.3a0:  24.79

Sorting on field 'NumberOfEmployees'
    2.2.1:  47.89
    2.3a0:  45.74

Sorting on field 'Phone'
    2.2.1:  48.09
    2.3a0:  47.15

Sorting on field 'Profile'
    2.2.1:  58.41
    2.3a0:   8.77

Sorting on field 'Symbol'
    2.2.1:  40.78
    2.3a0:   6.30

Sorting on field 'Web'
    2.2.1:  46.79
    2.3a0:  35.64

This may have been sorted more times by now than any other database on Earth.
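
The role of the index in those (key, index, record) tuples can be seen
on a toy list. This is a sketch under a current Python 3 (an
assumption); the three records are made up, and every record has the
field, because Python 3 will not order None against strings the way
2.x did. Python 3 also refuses to order dicts at all, where 2.x merely
made it expensive, so the fall-through shows up as a TypeError.

```python
data = [
    {"Symbol": "B", "Exchange": "NYSE"},
    {"Symbol": "A", "Exchange": "NYSE"},
    {"Symbol": "C", "Exchange": "NASDAQ"},
]
fieldname = "Exchange"

# Without the index, equal keys make tuple comparison fall through
# to comparing the record dicts themselves.
undecorated = [(x.get(fieldname), x) for x in data]
try:
    undecorated.sort()
    fell_through = False
except TypeError:
    fell_through = True

# With the index in the middle, ties are settled by original position
# and the dicts are never compared.
values = [(x.get(fieldname), i, x) for i, x in enumerate(data)]
values.sort()
symbols = [record["Symbol"] for key, i, record in values]
```

As a side effect the index also keeps records with equal keys in their
original relative order.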

From  Sat Aug 10 11:15:25 2002
From: (Oren Tirosh)
Date: Sat, 10 Aug 2002 06:15:25 -0400
Subject: [Python-Dev] interning
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Fri, Aug 09, 2002 at 02:03:10PM -0400, Guido van Rossum wrote:
> This means a change in the string object lay-out, which breaks binary
> compatibility (the PyString_AS_STRING macro depends on this).

I think that making interned strings mortal is important enough by
itself even without the size reduction.  If binary compatibility is
important enough it's possible to maintain it.

> I don't mind biting this bullet, but it means we have to increment the
> API version, and perhaps the warning about API version mismatches
> should become an error if an extension with too old an API version before
> this change is detected.
> Oren, how's that patch coming along? :-)

I've just submitted a new patch for

It passes regrtest but causes test_gc to leak 20 objects. 13 from 
test_finalizer_newclass and 7 from test_del_newclass. These leaks
go away if test_saveall is skipped. I've tried earlier versions of 
this patch (which were ok at the time) and they now create this 
leak too.

Some change since the last time I worked on interning must have 
caused this. Either this change reveals a bug in my patch or my patch 
reveals a subtle bug in the GC.

I don't know why it interacts with GC logic because strings are 
non-gc objects. I've tried to untrack the interned dictionary because
it plays dirty tricks with refcounts but it doesn't change the
symptom.

From  Sat Aug 10 14:57:53 2002
From: (Guido van Rossum)
Date: Sat, 10 Aug 2002 09:57:53 -0400
Subject: [Python-Dev] interning
In-Reply-To: Your message of "Sat, 10 Aug 2002 06:15:25 EDT."
References: <> <>
Message-ID: <>

> I've just submitted a new patch for

I'll review it when I've got time.

> It passes regrtest but causes test_gc to leak 20 objects. 13 from 
> test_finalizer_newclass and 7 from test_del_newclass. These leaks
> go away if test_saveall is skipped. I've tried earlier versions of 
> this patch (which were ok at the time) and they now create this 
> leak too.
> Some change since the last time I worked on interning must have 
> caused this. Either this change reveals a bug in my patch or my patch 
> reveals a subtle bug in the GC.
> I don't know why it interacts with GC logic because strings are 
> non-gc objects. I've tried to untrack the interned dictionary because
> it plays dirty tricks with refcounts but it doesn't change the 
> symptom.

I've seen this too!  But only when I run the full test suite, not when
I run test_gc in isolation.  I made a number of small changes to the GC
code, I'll have to roll them back one at a time to see which one
caused this -- and then look for a solution. :-(

--Guido van Rossum (home page:

From  Sat Aug 10 17:17:17 2002
From: (Guido van Rossum)
Date: Sat, 10 Aug 2002 12:17:17 -0400
Subject: [Python-Dev] interning
In-Reply-To: Your message of "Sat, 10 Aug 2002 09:57:53 EDT."
References: <> <> <>
Message-ID: <>

> > It passes regrtest but causes test_gc to leak 20 objects. 13 from 
> > test_finalizer_newclass and 7 from test_del_newclass. These leaks
> > go away if test_saveall is skipped. I've tried earlier versions of 
> > this patch (which were ok at the time) and they now create this 
> > leak too.
> > 
> > Some change since the last time I worked on interning must have 
> > caused this. Either this change reveals a bug in my patch or my patch 
> > reveals a subtle bug in the GC.
> > 
> > I don't know why it interacts with GC logic because strings are 
> > non-gc objects. I've tried to untrack the interned dictionary because
> > it plays dirty tricks with refcounts but it doesn't change the 
> > symptom.
> I've seen this too!  But only when I run the full test suite, not when
> I run test_gc in isolation.  I made a number of small changes to the GC
> code, I'll have to roll them back one at a time to see which one
> caused this -- and then look for a solution. :-(

Duh.  This warning is only printed when is given the -l
option.  Oren's first paragraph quoted above is exactly right.

But none of the changes to C files made in the last month made any
difference...  The difference is itself!  With a checkout
from a month ago, if I change the classes in test_finalizer() and
test_del() to be new-style classes, I get the same warnings.

Maybe Tim understands the problem now?  (Summary: why do I get the
Warning below.)

$ ./python ../Lib/test/ -l test_gc
Warning: test created 20 uncollectable object(s).
1 test OK.

--Guido van Rossum (home page:

From  Sat Aug 10 19:38:51 2002
From: (Christian Tismer)
Date: Sat, 10 Aug 2002 20:38:51 +0200
Subject: [Python-Dev] _sre as part of python.dll
References: <> <> <> <> <>
Message-ID: <>

Duncan Booth wrote:

> _sre is used by any application that imports 'os'. That (IMHO) is almost 
> every non-trivial Python program.

Sure? Then try this in a Windows shell:

hey this is sitepython
Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
 >>> import sys
 >>> for i in sys.modules: print i

As you can see, os is imported by the startup code, already.
(Which I didn't know!)
Furthermore, os didn't cause an import of _sre.
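
The same check can be scripted instead of eyeballing the list. A small
sketch in Python 3 syntax (an assumption); is_loaded is a hypothetical
helper, and what it reports for os, re and _sre depends on the Python
version and on how the interpreter was started.

```python
import sys

def is_loaded(name):
    """Report whether a module has already been imported in this process."""
    return name in sys.modules

for name in ("os", "re", "_sre"):
    print(name, is_loaded(name))
```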

ciao - chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Sat Aug 10 19:47:42 2002
From: (Neil Schemenauer)
Date: Sat, 10 Aug 2002 11:47:42 -0700
Subject: [Python-Dev] interning
In-Reply-To: <>; from on Sat, Aug 10, 2002 at 12:17:17PM -0400
References: <> <> <> <> <>
Message-ID: <>

Guido van Rossum wrote:
> $ ./python ../Lib/test/ -l test_gc
> test_gc
> Warning: test created 20 uncollectable object(s).
> 1 test OK.

Something weird is going on.  This patch fixes test_finalizer_newclass: 

--- Lib/test/ 9 Aug 2002 17:38:16 -0000       1.19
+++ Lib/test/ 10 Aug 2002 18:33:47 -0000
@@ -147,6 +147,8 @@
         raise TestFailed, "didn't find obj in garbage (finalizer)"
+    del A, B, obj
+    gc.collect() # finds 13 objects!

I guess there is a reference cycle there that wasn't there before.
Could it have something to do with tp_del?


From  Sat Aug 10 22:22:36 2002
From: (Tim Peters)
Date: Sat, 10 Aug 2002 17:22:36 -0400
Subject: [Python-Dev] interning
In-Reply-To: <>
Message-ID: <>

> ...
> But none of the changes to C files made in the last month made any
> difference...  The difference is itself!  With a checkout
> from a month ago, if I change the classes in test_finalizer() and
> test_del() to be new-style classes, I get the same warnings.
> Maybe Tim understands the problem now?  (Summary: why do I get the
> Warning below.)
> $ ./python ../Lib/test/ -l test_gc
> test_gc
> Warning: test created 20 uncollectable object(s).
> 1 test OK.
> $

I think this was shallow, and checked in a change (to test_saveall()) that I
believe fixes it.  Update and try again?

From  Sat Aug 10 22:26:56 2002
From: (Martin v. Loewis)
Date: 10 Aug 2002 23:26:56 +0200
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
References: <>
Message-ID: <>

Tim Peters <> writes:

> If you're building via right-clicking, you're making life much
> harder than necessary.  You can build from the command line, or do
> Build -> Batch Build -> Build in the GUI.

Thanks, that is very useful to know.


From  Sun Aug 11 00:58:21 2002
From: (Guido van Rossum)
Date: Sat, 10 Aug 2002 19:58:21 -0400
Subject: [Python-Dev] interning
In-Reply-To: Your message of "Sat, 10 Aug 2002 11:47:42 PDT."
References: <> <> <> <> <>
Message-ID: <>

> > $ ./python ../Lib/test/ -l test_gc
> > test_gc
> > Warning: test created 20 uncollectable object(s).
> > 1 test OK.

[Neil S]
> Something weird is going on.  This patch fixes test_finalizer_newclass: 
> --- Lib/test/ 9 Aug 2002 17:38:16 -0000       1.19
> +++ Lib/test/ 10 Aug 2002 18:33:47 -0000
> @@ -147,6 +147,8 @@
>      else:
>          raise TestFailed, "didn't find obj in garbage (finalizer)"
>      gc.garbage.remove(obj)
> +    del A, B, obj
> +    gc.collect() # finds 13 objects!
> I guess there is a reference cycle there that wasn't there before.
> Could it have something to do with tp_del?

I don't think so -- a Python of a month old had the same warnings when
I added these tests that use new-style classes.

It's much simpler than this: new-style classes have cyclical
references to themselves that must be collected.  It so happened that
the saveall test was fooled by these.  Tim checked in a fix that
prevents this.
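
The self-referencing nature of classes is easy to observe. The sketch
below assumes a current CPython 3 (where every class is new-style) and
is not taken from test_gc; make_class is a hypothetical helper. The
class goes out of scope when the function returns, yet it survives
until the cycle collector runs.

```python
import gc
import weakref

def make_class():
    """Create a throwaway class and return only a weak reference to it."""
    class A(object):
        pass
    return weakref.ref(A)

gc.disable()                 # keep automatic collection out of the picture
try:
    ref = make_class()
    # No names refer to A any more, but its internal cycles keep it alive.
    alive_before = ref() is not None
    gc.collect()
    # On CPython the cycle collector has now reclaimed the class.
    alive_after = ref() is not None
finally:
    gc.enable()
```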

--Guido van Rossum (home page:

From  Sun Aug 11 03:49:26 2002
From: (Andrew P. Lentvorski)
Date: Sat, 10 Aug 2002 19:49:26 -0700 (PDT)
Subject: [Python-Dev] Python rounding and/or rint
Message-ID: <>

Now that C9X is an official standard, can we either:

1) add rint() back into the math module (removed since 1.6.1?) or
2) update round() so that it complies with the default rounding mode

I bumped into a bug today because round() doesn't obey the same rounding
semantics as the FP operations do.

While there are lots of arguments about whether or not to add other C9X
functions, I'd like to try and avoid that tarpit.

The primary argument against rint was the lack of being able to write
portable code.  In this instance, the *lack* of rint (or its use in
round()) prevents writing portable code as I have no means to match the
rounding semantics of my FP ops from within Python.


From  Sun Aug 11 04:23:51 2002
From: (Tim Peters)
Date: Sat, 10 Aug 2002 23:23:51 -0400
Subject: [Python-Dev] Python rounding and/or rint
In-Reply-To: <>
Message-ID: <>

[Andrew P. Lentvorski]
> Now that C9X is an official standard,

I'm afraid that's irrelevant in practice before "almost all" platform C
packages conform to the new standard.

> can we either:
> 1) add rint() back into the math module (removed since 1.6.1?) or

This sounds confused.  According to the CVS logs, rint() was briefly in the
codebase, first released in 1.6 beta 1 (rev 2.43 & 2.44 of mathmodule.c),
but retracted before 1.6 final was released (revs and 2.53).

> 2) update round() so that it complies with the default rounding mode

round() has always forced "add a half and chop" (which can be done portably,
relying on C89's rounding-mode-independent floor() and ceil()); changing
that would be incompatible.
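
As a rough illustration of "add a half and chop", here is a sketch in
Python 3 (an assumption; this is not the C code in the interpreter, and
round_half_away is a hypothetical name). It uses only floor() and
ceil(), so the result does not depend on the FP rounding mode. Note
that the builtin round() in Python 3 rounds halves to even instead, so
the two disagree on exact halves.

```python
import math

def round_half_away(x):
    """Round to the nearest integer, with halves going away from zero."""
    if x >= 0:
        return math.floor(x + 0.5)
    return math.ceil(x - 0.5)
```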

> I bumped into a bug today because round() doesn't obey the same
> rounding semantics as the FP operations do.

round() wasn't intended to.

> While there are lots of arguments about whether or not to add other C9X
> functions, I'd like to try and avoid that tarpit.

It's a non-starter before C9X triumphs, if ever.  If it does, there won't be
debate -- we'll gladly expose all the spiffy new numeric functions then.  It
would be great to have them!

> The primary argument against rint was the lack of being able to write
> portable code.  In this instance, the *lack* of rint (or its use in
> round()) prevents writing portable code as I have no means to match the
> rounding semantics of my FP ops from within Python.

Sorry, but neither does Python.  I suggest you write an extension module
with your favorite C9X gimmicks (this isn't hard), and offer it for use on
C9X platforms.  Eventually there may be enough of those that we could fold
the new functions into the core.  Before then, you may find more people in
the NumPy community willing and able to wrestle with reams of platform
#ifdefs for numeric gimmicks.

From  Sun Aug 11 12:21:25 2002
From: (M.-A. Lemburg)
Date: Sun, 11 Aug 2002 13:21:25 +0200
Subject: [Python-Dev] Python rounding and/or rint
References: <>
Message-ID: <>

Tim Peters wrote:
> [Andrew P. Lentvorski]
>>The primary argument against rint was the lack of being able to write
>>portable code.  In this instance, the *lack* of rint (or its use in
>>round()) prevents writing portable code as I have no means to match the
>>rounding semantics of my FP ops from within Python.
> Sorry, but neither does Python.  I suggest you write an extension module
> with your favorite C9X gimmicks (this isn't hard), and offer it for use on
> C9X platforms.  Eventually there may be enough of those that we could fold
> the new functions into the core.  Before then, you may find more people in
> the NumPy community willing and able to wrestle with reams of platform
> #ifdefs for numeric gimmicks.

Another approach would be to use GNU MP's cousin MPFR which has
a few well-defined rounding modes built in apart from various
other goodies which make dealing with floating point numbers
platform independent. Another interesting extension to GNU MP is
MPFI which implements interval arithmetic -- also nice to have
if you're deep into dealing with rounding errors.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Sun Aug 11 13:00:17 2002
From: (Skip Montanaro)
Date: Sun, 11 Aug 2002 07:00:17 -0500
Subject: [Python-Dev] Weekly Python Bug/Patch Summary
Message-ID: <>

Bug/Patch Summary

266 open / 2739 total bugs (-8)
118 open / 1644 total patches (-13)

New Bugs

list(xrange(1e9))  -->  seg fault (2002-05-14)
Sig11 in cPickle (stack overflow) (2002-07-01)
os.tmpfile() can fail on win32 (2002-08-05)
Mixin broken for new-style classes (2002-08-05)
makesetup fails: long Setup.local lines (2002-08-05)
httplib throws a TypeError when the target host disconnects (2002-08-05)
Get rid of etype struct (2002-08-06)
Hint for speeding up cPickle (2002-08-07)
installation errors (2002-08-07)
Webchecker error on (2002-08-07)
comments taken as values in ConfigParser (2002-08-08)
Bug with deepcopy and new style objects (2002-08-08)
HTTPS does not handle pipelined requests (2002-08-08)
os.chmod is underdocumented :-) (2002-08-08)
Can't assign to __name__ or __bases__ of new class (2002-08-09)
u'%c' % large value: broken result (2002-08-10)

New Patches

Fix "file:" URL to have right no. of /'s (2002-08-06)
Split-out ntmodule.c (2002-08-08)
socketmodule.[ch] downgrade (2002-08-09)
bugfixes and cleanup for (2002-08-10)
Static names (2002-08-11)

Closed Bugs

Problems with Tcl/Tk and non-ASCII text entry (2000-10-31)
Tutorial does not describe nested scope (2002-01-07)
Get rid of make frameworkinstall (2002-01-18)
Bgen should generate 7-bit-clean code (2002-06-08)
tarball to untar into a single dir (2002-06-11)
"python -u" not binary on cygwin (2002-06-17)
Chained __slots__ dealloc segfault (2002-06-26)
Tex Macro Error (2002-06-27)
Parts of 2.2.1 core use old gc API (2002-06-30)
os.path.walk behavior on symlinks (2002-07-03)
LibRef 2.2.1, replace zero with False (2002-07-11)
mimetools module privacy leak (2002-07-12)
MacOSX build problems (2002-07-12)
''.split() docstring clarification (2002-07-15)
no doc for os.fsync and os.fdatasync (2002-07-21)
Two corrects for weakref docs (2002-07-25)
references to email package (2002-07-26)
ur'\u' not handled properly (2002-07-26)
wrapper needs a class (2002-07-31)
shared libpython & dependant libraries (2002-07-31)
"".split() ignores maxsplit arg (2002-08-01)
preconvert AppleSingle resource files (2002-08-02)

Closed Patches

Removal of SET_LINENO (experimental) (2000-07-30)
let mailbox.Maildir tag messages as read (2001-09-29)
GETCONST/GETNAME/GETNAMEV speedup (2002-01-21)
PEP 263 Implementation (2002-03-07)
PEP 263 Implementation (2002-03-24)
ae* modules: handle type inheritance (2002-04-02)
Deprecate bsddb (2002-05-06)
timeout socket implementation (2002-05-12)
GetFInfo update (2002-06-11)
Add param to email.Utils.decode() (2002-06-12)
PyTRASHCAN slots deallocation (2002-06-28)
Build MachoPython with 2level namespace (2002-07-10)
xreadlines caching, file iterator (2002-07-11)
Alternative PyTRASHCAN subtype_dealloc (2002-07-15)
make file object an iterator (2002-07-17)
yield allowed in try/finally (2002-07-21)
Cygwin _hotshot patch (2002-07-30)
os._execvpe security fix (2002-08-02)

From  Sun Aug 11 14:44:46 2002
From: (Magnus Lie Hetland)
Date: Sun, 11 Aug 2002 15:44:46 +0200
Subject: [Python-Dev] Priority queue (binary heap) python code
In-Reply-To: <20020624213318.A5740@arizona.localdomain>; from on Mon, Jun 24, 2002 at 09:33:18PM -0400
References: <20020624213318.A5740@arizona.localdomain>
Message-ID: <>

Kevin O'Connor <>:
> I often find myself needing priority queues in python, and I've finally
> broken down and written a simple implementation.

I see that heapq is now in the libraries -- great!

Just one thought: If I want to use this library in an algorithm such
as, say, Dijkstra's single-source shortest path algorithm, I would
need an additional operation, the decrease-key operation (as far as I
can see, heapreplace only works with index 0 -- why is that?)


E.g.:

def heapdecrease(heap, index, item):
    """
    Replace an item at a given index with a smaller one.

    May be used to implement the standard priority queue method
    heap-decrease-key, useful, for instance, in several graph
    algorithms.
    """
    assert item <= heap[index]
    heap[index] = item
    _siftdown(heap, 0, index)

Something like this might perhaps be useful to include in the library... Or,
perhaps, the _siftup and _siftdown methods don't have to be private?

In addition, I guess one would have to implement a sequence class that
maintained a map from values to heap indices to be able to use
heapdecrease in any useful way (otherwise, how would you know which
index to use?). That, however, I guess is not something that would be
'at home' in the heapq module. (Perhaps that is argument enough to
avoid including heapdecrease as well? Oh, well...)

Magnus Lie Hetland                                  The Anygui Project                        

From  Sun Aug 11 20:07:43 2002
From: (Tim Peters)
Date: Sun, 11 Aug 2002 15:07:43 -0400
Subject: [Python-Dev] Priority queue (binary heap) python code
In-Reply-To: <>
Message-ID: <>

[Magnus Lie Hetland]
> I see that heapq is now in the libraries -- great!
> Just one thought: If I want to use this library in an algorithm such
> as, say, Dijkstra's single-source shortest path algorithm, I would
> need an additional operation, the decrease-key operation

You'd need more than just that <wink>.

> (as far as I can see, heapreplace only works with index 0 -- why is
> that?)

heapreplace() is emphatically not a decrease-key operation.  It's equivalent
to a pop-min *followed by* a push, which combination is often called a
"hold" operation.  The value added may very well be larger than the value
popped, and, e.g., the example of an efficient N-Best queue in the test file
relies on this.  Hold is an extremely common operation in some kinds of
event simulations too, where it's also most common to push a value larger
than the one popped (e.g., when the queue is ordered by scheduled time, and
the item pushed is a follow-up event to the one getting popped).
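
For illustration, an N-best queue in the spirit of the one mentioned
above might be sketched like this. It is written for a current Python 3
(an assumption) and is not the code from the test file; n_best is a
hypothetical name. The heap holds the n largest values seen so far with
the smallest of them at index 0, and each better candidate replaces
that root in a single pop-then-push step.

```python
import heapq
from itertools import islice

def n_best(iterable, n):
    """Return the n largest items, largest first."""
    it = iter(iterable)
    heap = list(islice(it, n))      # first n items seed the heap
    heapq.heapify(heap)
    for x in it:
        # heap[0] is the weakest of the current best; the value pushed
        # here is always larger than the value popped.
        if heap and x > heap[0]:
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```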

> E.g.:
> def heapdecrease(heap, index, item):
>     """
>     Replace an item at a given index with a smaller one.
>     May be used to implement the standard priority queue method
>     heap-decrease-key, useful, for instance, in several graph
>     algorithms.
>     """
>     assert item <= heap[index]

That's the opposite of what heapreplace() usually sees.

>     heap[index] = item
>     _siftdown(heap, 0, index)
> Something might perhaps be useful to include in the library...

I really don't see how -- you generally have no way to know the correct
index short of O(N) search.  This representation of priority queue is
well-known to be sub-optimal for applications requiring frequent
decrease-key (Fibonacci heaps were designed for it, though, and pairing
heaps are reported to run faster than Fibonacci heaps in practice despite
that one of the PH inventors eventually proved that decrease-key can destroy
PH's otherwise good worst-case behavior).

> Or, perhaps, the _siftup and _siftdown methods don't have to be
> private?

You really have to know what you're doing to use them correctly, and it's
dubious that _siftup calls _siftdown now (it's most convenient *given* all
the uses made of them right now, but, e.g., if a "delete at arbitrary index"
function were to be added, _siftdown and _siftup could stand to be
refactored -- exposing them would inhibit future improvements).

> In addition, I guess one would have to implement a sequence class that
> maintained a map from values to heap indices to be able to use
> heapdecrease in any useful way (otherwise, how would you know which
> index to use?).

Bingo.  All the internal heap manipulations would have to know about this
mapping too, in order to keep the indices up to date as it moved items up
and down in the queue.  If what you want is frequent decrease-key, you don't
want this implementation of priority queues at all.
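
To make the bookkeeping cost concrete, here is a sketch of a binary
heap that carries such a mapping. It is written for Python 3 (an
assumption) and is independent of heapq; IndexedHeap and its method
names are hypothetical. Every swap has to update the key-to-index map,
which is exactly the coupling described above.

```python
class IndexedHeap:
    """Binary min-heap of [priority, key] entries that tracks each key's index."""

    def __init__(self):
        self._heap = []   # list of [priority, key]
        self._pos = {}    # key -> current index in self._heap

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i      # keep the map in step with every move
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] >= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            child = 2 * i + 1
            if child >= n:
                break
            if child + 1 < n and self._heap[child + 1][0] < self._heap[child][0]:
                child += 1          # pick the smaller child
            if self._heap[child][0] >= self._heap[i][0]:
                break
            self._swap(i, child)
            i = child

    def push(self, key, priority):
        self._heap.append([priority, key])
        self._pos[key] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def decrease_key(self, key, priority):
        i = self._pos[key]          # O(1) lookup instead of an O(N) search
        assert priority <= self._heap[i][0]
        self._heap[i][0] = priority
        self._sift_up(i)

    def pop(self):
        self._swap(0, len(self._heap) - 1)
        priority, key = self._heap.pop()
        del self._pos[key]
        if self._heap:
            self._sift_down(0)
        return key, priority
```

Keys must be hashable and unique here, which is one more restriction
the plain list-based heap does not impose.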

From  Sun Aug 11 22:33:59 2002
From: (Jack Jansen)
Date: Sun, 11 Aug 2002 23:33:59 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
Message-ID: <>

As of recently I'm getting deprecation warnings on lots of 
constructs of the form "0xff << 24", telling me that in Python 
2.4 this will return a long.

As these things are bitpatterns (they're all generated from .h 
files for system call interfaces and such) that the user will 
pass to methods that wrap underlying API calls I don't want them 
to be longs. How do I force them to remain ints?
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Sun Aug 11 23:19:06 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 00:19:06 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>
Message-ID: <>

Jack Jansen wrote:
> As of recently I'm getting deprecation warnings on lots of constructs of 
> the form "0xff << 24", telling me that in Python 2.4 this will return a 
> long.

Interesting. I wonder why the implementation warns about 0xff << 24...
0xff000000 fits nicely into a 32-bit integer. I don't see why the
"changing sign" is relevant here or even why it is mentioned in the
warning since the PEP doesn't say anything about it.

Changing these semantics would cause compatibility problems for
applications doing low-level bit manipulations or ones which use
the Python integer type to store unsigned integer values, e.g.
for use as bitmapped flags.

> As these things are bitpatterns (they're all generated from .h files for 
> system call interfaces and such) that the user will pass to methods that 
> wrap underlying API calls I don't want them to be longs. How do I force 
> them to remain ints?

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug 12 00:30:24 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 11:30:24 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>


> (1) Make binary pickling the default (in cPickle as well as pickle).

That would break a lot of programs that use pickle
without opening the file in binary mode.

> (2) Replace the PyRun_String() call in cPickle with something faster.
>     Maybe the algorithm from parsestr() from compile.c can be
> exposed;

I like that idea -- it could be useful for other things,
too. I could use something like that in Pyrex, for example.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Mon Aug 12 01:18:27 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 12:18:27 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>


> Maybe we should just drop indirect interning then.  It can save 31
> bits per string object, right?  How to collect those savings?

Store the value of strings <= 3 chars long in there. :-)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Mon Aug 12 01:25:27 2002
From: (Guido van Rossum)
Date: Sun, 11 Aug 2002 20:25:27 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 00:19:06 +0200."
References: <>
Message-ID: <>

> Jack Jansen wrote:
> > As of recently I'm getting deprecation warnings on lots of constructs of 
> > the form "0xff << 24", telling me that in Python 2.4 this will return a 
> > long.
> Interesting. I wonder why the implementation warns about 0xff << 24...
> 0xff000000 fits nicely into a 32-bit integer. I don't see why the
> "changing sign" is relevant here or even why it is mentioned in the
> warning since the PEP doesn't say anything about it.

PEP 237 *tries* to mention it:

    - Currently, x<<n can lose bits for short ints.  This will be
      changed to return a long int containing all the shifted-out
      bits, if returning a short int would lose bits.

Maybe you don't see changing sign as "losing bits"?  I do!  Maybe I
have to clarify this in the PEP.

PEP 237 is about erasing all differences between int and long.  When
seen as a long, 0xff000000 has the value 4278190080.  But currently it
is an int and has the value -16777216.  As a bit pattern that doesn't
make much of a difference, but as a numeric value it makes a huge
difference (2**32 to be exact :-).  So in Python 2.4, 0xff<<24, as
well as the constant 0xff000000, will have the value 4278190080.

Note that larger constants are already longs in 2.2: e.g. 0x100000000
equals 4294967296 (which happens to be representable only as a long).
It's the oct and hex constants in range(2**31, 2**32) that currently
behave anomalously, returning negative numbers despite looking like
positive numbers (to everyone except people whose minds have been
exposed to 32-bit bit-fiddling too long :-).

> Changing these semantics would cause compatibility problems for
> applications doing low-level bit manipulations or ones which use
> the Python integer type to store unsigned integer values, e.g.
> for use as bitmapped flags.

That's why I'm adding the warnings to 2.3.  Note that the bit pattern
in the lower 32 bits will remain the same; it's just the
interpretation of the sign that will change.

> > As these things are bitpatterns (they're all generated from .h
> > files for system call interfaces and such) that the user will pass
> > to methods that wrap underlying API calls I don't want them to be
> > longs. How do I force them to remain ints?

Why do you want them to remain ints?  Does a long whose lower 32 bits
have the right bit pattern not work?

If you really want the int value, you have to do a little arithmetic.
Here's something that's independent of the Python version and won't
issue a warning:

def toint32(x):
    x = x & 0xffffffffL # Force it to be a long in range(0, 2**32)
    if x & 0x80000000L: # If sign bit set
        x -= 0x100000000L # flip sign
    return int(x)

You can also write it as a single expression:

def toint32(x):
    return int((x & 0xffffffffL) - ((x & 0x80000000L) << 1))
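
For readers on a Python where int and long are already unified, the
same idea can be sketched without the L suffixes. This version assumes
Python 3 syntax and is an adaptation, not part of the original message;
the name toint32 is kept from above.

```python
def toint32(x):
    """Reinterpret the low 32 bits of x as a signed 32-bit value."""
    x &= 0xFFFFFFFF          # keep only the low 32 bits: range(0, 2**32)
    if x & 0x80000000:       # sign bit set
        x -= 0x100000000     # shift down by 2**32 to get the negative value
    return x

unsigned_value = 0xff << 24             # 4278190080 once ints are unified
signed_value = toint32(unsigned_value)  # the old 32-bit int reading
```

The two readings of the same bit pattern differ by exactly 2**32, as
noted earlier in this message.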

In the long run you'll thank me for this.

--Guido van Rossum (home page:

From  Mon Aug 12 01:58:59 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 12:58:59 +1200 (NZST)
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
Message-ID: <>

> Is there a reason to want to get rid of list.append()?

No, because...

> How does one decide between list.append() and augmented assignment
> (+=)

That's easy -- if I'm only appending one item, I use
append(), because it avoids creating a one-element list
and then throwing it away.
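
A tiny sketch of the two spellings, in Python 3 syntax (an assumption);
the variable names are arbitrary. Both modify the list in place; the
difference is only the temporary one-element list on the right-hand
side of +=.

```python
items = [1, 2]
alias = items       # second name for the same list object

items.append(3)     # adds the object directly
items += [4]        # builds the temporary list [4], then extends in place
```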


Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Mon Aug 12 02:51:56 2002
From: (Guido van Rossum)
Date: Sun, 11 Aug 2002 21:51:56 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Mon, 12 Aug 2002 11:30:24 +1200."
References: <>
Message-ID: <>

> > (1) Make binary pickling the default (in cPickle as well as pickle).
> That would break a lot of programs that use pickle
> without opening the file in binary mode.

Really?  That's unfortunate.  The example thread on Google shows that
binary pickling isn't as widely known as it should be.

--Guido van Rossum (home page:

From  Mon Aug 12 03:21:51 2002
From: (Tim Peters)
Date: Sun, 11 Aug 2002 22:21:51 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

> ...
> PEP 237 is about erasing all differences between int and long.  When
> seen as a long, 0xff000000 has the value 4278190080.  But currently it
> is an int and has the value -16777216.

Note that while it's an int under all current Python installations, the
*value* differs:  on 64-bit boxes other than Win64, 0xff000000 already has
value 4278190080.  This creates porting problems too.
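
A minimal sketch of the two readings of that bit pattern, assuming an interpreter where the literal is already positive (Python 2.4 or later, or any 64-bit Unix build). The `struct` round trip only illustrates signed versus unsigned 32-bit interpretation; it is not how the interpreter itself decides:

```python
import struct

BITS = 0xff000000  # the constant under discussion

# Pack the pattern as an unsigned 32-bit value, then reread the same
# four bytes as a signed 32-bit value.
raw = struct.pack("<I", BITS)
as_unsigned = struct.unpack("<I", raw)[0]
as_signed = struct.unpack("<i", raw)[0]

print(as_unsigned)              # 4278190080
print(as_signed)                # -16777216
print(as_unsigned - as_signed)  # 4294967296, i.e. 2**32
```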

> As a bit pattern that doesn't make much of a difference, but as a
> numeric value it makes a huge difference (2**32 to be exact :-).  So
> in Python 2.4, 0xff<<24, as well as the constant 0xff000000, will have
> the value 4278190080.

Some users won't even notice <wink>.  They may notice that
0xff00000000000000 "changes value", though.

> ...
> In the long run you'll thank me for this.

I'll start today:  thank you.

From  Mon Aug 12 03:45:59 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 14:45:59 +1200 (NZST)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

Jack Jansen <>:

> As these things are bitpatterns (they're all generated from .h files
> for system call interfaces and such) that the user will pass to
> methods that wrap underlying API calls I don't want them to be
> longs.

Presumably, by the time these actually become longs, the
relevant Python/C API calls for converting Python ints to
C ints will accept longs that are within range, so it
shouldn't be an issue.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 12 03:50:09 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 14:50:09 +1200 (NZST)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

"M.-A. Lemburg" <>:

> I wonder why the implementation warns about 0xff << 24...  0xff000000
> fits nicely into a 32-bit integer. I don't see why the "changing sign"
> is relevant here

When the change happens, the result will be a positive number instead
of a negative one. While this isn't relevant for what you're doing, it
might be relevant in some other applications, so I suppose it was
thought prudent to warn about it.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 12 03:58:17 2002
From: (Guido van Rossum)
Date: Sun, 11 Aug 2002 22:58:17 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 14:45:59 +1200."
References: <>
Message-ID: <>

> Presumably, by the time these actually become longs, the
> relevant Python/C API calls for converting Python ints to
> C ints will accept longs that are within range, so it
> shouldn't be an issue.

PyInt_AsLong() and the 'i' and 'l' format chars for PyArg_Parse*()
already do so -- and have done so for a long time.

--Guido van Rossum (home page:

From  Mon Aug 12 04:05:42 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 15:05:42 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> Me:
> > That would break a lot of programs that use pickle
> > without opening the file in binary mode.
> Really?  That's unfortunate.

Unfortunate, yes, and true, as far as I can see. It bit me recently --
I decided to change something to use binary pickling, and forgot to
change the way I was opening the file.
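
A minimal sketch of the pairing that is easy to forget, written for a current Python 3 interpreter. Protocol 1 is used here as a stand-in for the "binary" format of the time, and the temporary file is only for illustration:

```python
import os
import pickle
import tempfile

data = {"spam": [1, 2, 3], "eggs": "ham"}

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # A binary pickle needs a file opened in binary mode: "wb" / "rb".
    # Text mode may translate line endings on some platforms and
    # corrupt the stream.
    with open(path, "wb") as f:
        pickle.dump(data, f, 1)
    with open(path, "rb") as f:
        restored = pickle.load(f)
finally:
    os.remove(path)

print(restored == data)
```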

If you must do this, I suppose you could start issuing warnings
if pickling is done without specifying a mode, and then change
the default later.

If there's a way of making non-binary unpickling dramatically
faster, though -- even if only with cPickle -- that would be a 
*big* win, and shouldn't cause any compatibility problems.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 12 04:12:16 2002
From: (Guido van Rossum)
Date: Sun, 11 Aug 2002 23:12:16 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: Your message of "Mon, 12 Aug 2002 15:05:42 +1200."
References: <>
Message-ID: <>

[Greg E]
> > > That would break a lot of programs that use pickle
> > > without opening the file in binary mode.

> > Really?  That's unfortunate.

[Greg E]
> Unfortunate, yes, and true, as far as I can see. It bit me recently --
> I decided to change something to use binary pickling, and forgot to
> change the way I was opening the file.
> If you must do this, I suppose you could start issuing warnings
> if pickling is done without specifying a mode, and then change
> the default later.

I thought of that.  But probably not worth the upheaval.

> If there's a way of making non-binary unpickling dramatically
> faster, though -- even if only with cPickle -- that would be a 
> *big* win, and shouldn't cause any compatibility problems.

python/sf/505705 is close to acceptance, and reduced one particularly
slow unpickling example 6-fold in speed.

--Guido van Rossum (home page:

From  Mon Aug 12 04:16:47 2002
From: (Tim Peters)
Date: Sun, 11 Aug 2002 23:16:47 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

[Greg Ewing]
> That would break a lot of programs that use pickle
> without opening the file in binary mode.

> Really?  That's unfortunate.

> Unfortunate, yes, and true, as far as I can see. It bit me recently --
> I decided to change something to use binary pickling, and forgot to
> change the way I was opening the file.

Greg, do you use Windows?  If not, I suspect you're mis-remembering what you
did -- "binary mode" versus "text mode" doesn't make any difference on
Linux.

From  Mon Aug 12 04:17:55 2002
From: (Greg Ewing)
Date: Mon, 12 Aug 2002 15:17:55 +1200 (NZST)
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

> Greg, do you use Windows?  If not, I suspect you're mis-remembering
> what you did -- "binary mode" versus "text mode" doesn't make any
> difference on Linux.

No, but I use a Mac (with Classic OS) where it certainly
does make a difference!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 12 04:24:21 2002
From: (Tim Peters)
Date: Sun, 11 Aug 2002 23:24:21 -0400
Subject: [Python-Dev] The memo of pickle
In-Reply-To: <>
Message-ID: <>

[Greg Ewing, on text mode vs binary mode]
> No, but I use a Mac (with Classic OS) where it certainly
> does make a difference!

Cool!  I was just wondering the other day whether there are any Mac users
left apart from Jack and Guido's brother.  It's a landslide <wink>.

From  Sun Aug 11 00:28:37 2002
From: (Andrew MacIntyre)
Date: Sun, 11 Aug 2002 10:28:37 +1100 (edt)
Subject: [Python-Dev] Patch 592529: Split-out ntmodule.c
In-Reply-To: <>
Message-ID: <>

On 9 Aug 2002, Martin v. Löwis wrote:

> If you are familiar with the code, it would be good if you could
> comment on the following questions:
> - should os2module.c get its own source code file as well?
> - are the #ifdefs in the resulting ntmodule.c still needed?
>   I believe they are, as the various compilers appear to support
>   different sets of functions in their C libraries. Of course,
>   most of these could be eliminated if the C is avoided in favour
>   of the Win32 API. Alternatively, can anybody with access to any
>   of these compilers (BorlandC, Watcom, IBM) please comment on
>   which functions provided by MSVC are missing in those compilers?

I don't have a problem with the OS/2 stuff being split out as well.
However I think there is some merit to Tim's point (added as a comment to
the patch) about trying to contain the natural divergence of API support
if the code is completely split out.

I haven't looked at your patch (just the SF patch manager
entry, sorry), but if a split is pursued, I would find an approach
similar to the thread_* and dynload_* bits (in Python/) somewhat more in
keeping with the above reservation.

This approach would have a master module file (eg platformmodule.c) which
contains the init function and the PyMethodDef array (with methods
controlled by HAVE_method #ifdefs as appropriate), and includes
the platform specific implementation files.

I have had thoughts about doing this before, but the scale of the task and
the fact that I don't have a Windows dev box for testing put me off.

Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail:  | Snail: PO Box 370            |        Belconnen  ACT  2616
Web:        |        Australia

From  Mon Aug 12 06:41:04 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 07:41:04 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> Changing these semantics would cause compatibility problems for
> applications doing low-level bit manipulations or ones which use
> the Python integer type to store unsigned integer values, e.g.
> for use as bitmapped flags.

In case this isn't clear yet: There likely will be no compatibility
problems. The bit manipulations will likely see the same results.

Of course, it would be good if users bring forward examples of how the
change affects their code. I.e. whenever you see such a warning,
please study the code and report whether the upcoming change will or
will not break your code. Such reports would allow to improve the


From  Mon Aug 12 07:37:32 2002
From: (Oren Tirosh)
Date: Mon, 12 Aug 2002 02:37:32 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Mon, Aug 12, 2002 at 12:19:06AM +0200, M.-A. Lemburg wrote:
> Changing these semantics would cause compatibility problems for
> applications doing low-level bit manipulations or ones which use
> the Python integer type to store unsigned integer values, e.g.
> for use as bitmapped flags.

I'm very much in favor of this change but a deprecation warning is not 
enough - some suitable replacement should be provided to cryptographers 
and other bit fiddlers.


Proposal:
A standard module implementing the types [u]int[8|16|32|64]. These types
would behave just like C integers - wrap around on overflow, etc and have 
a guaranteed size regardless of platform. They can even have methods for
bit rotation.


From  Mon Aug 12 09:09:35 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 10:09:35 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>Jack Jansen wrote:
>>>As of recently I'm getting deprecation warnings on lots of constructs of 
>>>the form "0xff << 24", telling me that in Python 2.4 this will return a 
>>Interesting. I wonder why the implementation warns about 0xff << 24...
>>0xff000000 fits nicely into a 32-bit integer. I don't see why the
>>"changing sign" is relevant here or even why it is mentioned in the
>>warning since the PEP doesn't say anything about it.
> PEP 237 *tries* to mention it:
>     - Currently, x<<n can lose bits for short ints.  This will be
>       changed to return a long int containing all the shifted-out
>       bits, if returning a short int would lose bits.
> Maybe you don't see changing sign as "losing bits"?  I do!  Maybe I
> have to clarify this in the PEP.

I was talking about the sign bit which lies within the 32 bits for
32-bit integers, so no bits are lost. I am not talking about things
like 0xff << 28 where bits are actually moved beyond the 32 bits to the
left and lost that way (or preserved if you move to a 64-bit platform,
but that's another story ;-).

> PEP 237 is about erasing all differences between int and long.  When
> seen as a long, 0xff000000 has the value 4278190080.  But currently it
> is an int and has the value -16777216.  As a bit pattern that doesn't
> make much of a difference, but as a numeric value it makes a huge
> difference (2**32 to be exact :-).  So in Python 2.4, 0xff<<24, as
> well as the constant 0xff000000, will have the value 4278190080.
> Note that larger constants are already longs in 2.2: e.g. 0x100000000
> equals 4294967296 (which happens to be representable only as a long).
> It's the oct and hex constants in range(2**31, 2**32) that currently
> behave anomalously, returning negative numbers despite looking like
> positive numbers (to everyone except people whose minds have been
> exposed to 32-bit bit-fiddling too long :-).
>>Changing these semantics would cause compatibility problems for
>>applications doing low-level bit manipulations or ones which use
>>the Python integer type to store unsigned integer values, e.g.
>>for use as bitmapped flags.
> That's why I'm adding the warnings to 2.3.  Note that the bit pattern
> in the lower 32 bits will remain the same; it's just the
> interpretation of the sign that will change.

That's exactly what I'd like too :-) With the only difference
that you seem to see the sign bit as not included in the 32 bits.

>>>As these things are bitpatterns (they're all generated from .h
>>>files for system call interfaces and such) that the user will pass
>>>to methods that wrap underlying API calls I don't want them to be
>>>longs. How do I force them to remain ints?
> Why do you want them to remain ints?  Does a long whose lower 32 bits
> have the right bit pattern not work?

No, because you usually pass these objects directly to some
Python C function (directly as parameter or indirectly as item
in a list or tuple) which often enough insists on getting a true
integer object.

> If you really want the int value, you have to do a little arithmetic.
> Here's something that's independent of the Python version and won't
> issue a warning:
> def toint32(x):
>     x = x & 0xffffffffL # Force it to be a long in range(0, 2**32)
>     if x & 0x80000000L: # If sign bit set
>         x -= 0x100000000L # flip sign
>     return int(x)
> You can also write it as a single expression:
> def toint32(x):
>     return int((x & 0xffffffffL) - ((x & 0x80000000L) << 1))
> In the long run you'll thank me for this.

No argument about this. It's just that I see a lot of programs
breaking because of the 0x1 << 31 returning a long. That needn't
be the case. People using this will know what they are doing and
use a long when possible anyway. However, tweaking C extensions to
also accept longs instead of integers requires hacking those
extensions which I'd like to avoid if possible. I already had
one of these instances with file.tell() returning a long and
that caused a lot of trouble then.

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug 12 11:02:11 2002
From: (Duncan Booth)
Date: Mon, 12 Aug 2002 11:02:11 +0100
Subject: [Python-Dev] _sre as part of python.dll
In-Reply-To: <>
Message-ID: <3D5795B3.6178.22DC5F47@localhost>

On 10 Aug 2002 at 20:38, Christian Tismer wrote:

> Duncan Booth wrote:
> ...
> > _sre is used by any application that imports 'os'. That (IMHO) is almost 
> > every non-trivial Python program.
> Sure? Then try this in a Windows shell:

Sjoerd Mullender already pointed out I got this wrong. Unfortunately, for reasons 
that currently escape me, my response disappeared into a black hole and didn't 
appear on the mailing list.

I jumped to the wrong conclusion because running py2exe on a program that 
imports os always includes _sre.dll in the files for distribution. This is because the 
os module does indeed import _sre, but only when the function that uses it is 
actually called. So any program that imports os includes _sre in the automatically 
generated list of dependencies, but it may or may not actually import it.
Duncan Booth                                   
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

From  Mon Aug 12 11:47:33 2002
From: (Walter Dörwald)
Date: Mon, 12 Aug 2002 12:47:33 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <> <>              <> <>
Message-ID: <>

I'm back from vacation. Comments on the thread and a list
of open issues are below.

Guido van Rossum wrote:
 > M.-A. Lemburg wrote:
 > > Walter has written a pretty good test suite for the patch
 > > and I have a good feeling about it. I'd like Walter to check
 > > it into CVS and then see whether the alpha tests bring up any
 > > quirks. The patch only touches the codecs and adds some new
 > > exceptions. There are no other changes involved.
 > >
 > > I think that together with PEP 263 (source code encoding) this
 > > is a great step forward in Python's i18n capabilities.
 > >
 > > BTW, the test script contains some examples of how to put the
 > > error callbacks to use:
 > >
 > >
 > Sounds like a plan then.

Does this mean we can check in the patch?

Documentation is still missing and encoding specific
decoding tests should be added to the test script.

Has anybody except me and Marc-André tried the patch?
On anything other than Linux/Intel? With UCS2 and UCS4?

Martin v. Loewis wrote:
 > If you look at the large blocks of new code, you find that it is in
 > - charmap_encoding_error, which insists on implementing known error
 >   handling algorithms inline,

This is done for performance reasons.

 > - the default error handlers, of which at least
 >   PyCodec_XMLCharRefReplaceErrors should be pure-Python

The PyCodec_XMLCharRefReplaceErrors functionality is
independent of the rest, so moving this to Python
won't reduce complexity that much. And it will
slow down "xmlcharrefreplace" handling for those
codecs that don't implement it inline.

 > - PyCodec_BackslashReplaceErrors, likewise,
 > - the UnicodeError exception methods (which could be omitted, IMO).

Those methods were implemented so that we can easily
move to new style exceptions. The exception attributes
can then be members of the C struct and the accessor functions
can be simple macros.

I guess some of the methods could be removed by moving
duplicate ones to the base class UnicodeError, but
this would break backwards compatibility.

Oren Tirosh wrote:
 > Some of my reservations about PEP 293:
 > It overloads the meaning of the error handling argument in an unintuitive
 > way.  It gets to the point where it's much more than just error handling -
 > it's actually extending the functionality of the codec.
 > Why implement yet another name-based registry?  There must be a simpler way
 > to do it.

The registry is name-based because this is required by the current C API.
Passing the error handler directly as a function object would be
simpler, but this can't be done, as it would require vast changes
to the C API (an old version of the patch did that.) And this way
we gain the benefit of implementing well-known error handling
names inline.

It is "yet another" registry exactly because encoding and error handling
are completely orthogonal (at least for encoding). If you add a
new error handler all codecs can use it (as long as they are aware
of the new error handling way) and if you define a new codec it will
work with all existing error handlers.
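
A minimal sketch of that orthogonality, using the `codecs.register_error()` and `codecs.lookup_error()` calls as they exist in released Python. The handler name "codepoint" and the function `codepoint_errors` are made up for this illustration:

```python
import codecs

def codepoint_errors(exc):
    """Replace each unencodable character with a <U+XXXX> marker."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start:exc.end may cover a whole run of characters, not just one.
    bad = exc.object[exc.start:exc.end]
    replacement = u"".join([u"<U+%04X>" % ord(ch) for ch in bad])
    # Return the replacement text and the position where encoding resumes.
    return (replacement, exc.end)

# One registration makes the handler available to every codec.
codecs.register_error("codepoint", codepoint_errors)

text = u"gr\u00fc\u00dfe \u20ac5"
ascii_result = text.encode("ascii", "codepoint")
latin1_result = text.encode("latin-1", "codepoint")
print(ascii_result)
print(latin1_result)
```

The same handler serves both codecs: ASCII has to replace three characters, while Latin-1 only has to replace the euro sign.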

 > Generating an exception for each character that isn't handled by simple
 > lookup probably adds quite a lot of overhead.

1. All encoders try to collect runs of unencodable characters to
minimize the number of calls to the callback.

2. The PEP explicitly states that the codec is allowed to
reuse the exception object. All codecs do this, so the
exception object will only be created once (at most;
when no error occurs, no exception object will be created).
The exception object is just a quick way to pass information
between the codec and the error handler and it could become
even faster as soon as we get new style exceptions.

 > What are the use cases?  Maybe a simple extension to charmap would be 
 > for all the practical cases?

Not all codecs are charmap based.

Open issues:

1. For each error handler two Python function objects are created:
One in the registry and a different one in the codecs module. This
means that e.g.
codecs.lookup_error("replace") != codecs.replace_errors

We can fix that by making the name of the Python function object
globally visible or by changing the codecs init function to do a lookup
and use the result or simply by removing codecs.replace_errors.

2. Currently charmap encoding uses a safe way for reallocating
string storage, which tests available space on each output. This
slows charmap encoding down a bit. This should probably be changed
back to the old way: Test available space only for output strings
longer than one character.

3. Error reporting logic in the exception attribute setters/getters
may be non-standard. What is the standard way to report errors for
C functions that don't return object pointers?
==0 for error and !=0 for success
==0 for success and !=0 for error
PyArg_ParseTuple returns true on success, PyObject_SetAttr returns true
on failure, which one is the exception and which one the rule?

4. Assigning to an attribute of an exception object does not
change the appropriate entry in the args attribute. Is this
worth changing?

5. UTF-7 decoding does not yet take full advantage of the machinery:
When an unterminated shift sequence is encountered (e.g. "+xxx")
the faulty byte sequence has already been emitted.

    Walter Dörwald

From  Mon Aug 12 12:48:06 2002
From: (Martin v. Löwis)
Date: Mon, 12 Aug 2002 13:48:06 +0200 (CEST)
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
Message-ID: <>

The PEP describes a Windows-only change to Unicode in file names: On
Windows NT/2k/XP, Python would allow arbitrary Unicode strings as file
names and pass them to the OS, instead of converting them to CP_ACP
first. This applies to open() and all os functions that accept
filenames.

In addition, os.listdir() would return Unicode filenames if the argument
is Unicode.
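
A minimal sketch of the "result type follows argument type" rule, shown with `os.listdir()` in a current Python 3 interpreter, where the distinction is `str` versus `bytes` rather than Unicode versus 8-bit strings. This is an analogy to the proposed behaviour, not the patch under review; the temporary directory and file name are made up:

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "example.txt"), "w").close()
try:
    text_names = os.listdir(d)               # text argument, text results
    byte_names = os.listdir(os.fsencode(d))  # bytes argument, bytes results
finally:
    os.remove(os.path.join(d, "example.txt"))
    os.rmdir(d)

print(text_names)
print(byte_names)
```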

Please comment on the PEP. There is an updated patch on; please comment on the patch as well.


From  Mon Aug 12 13:29:01 2002
From: (Skip Montanaro)
Date: Mon, 12 Aug 2002 07:29:01 -0500
Subject: [Python-Dev] timsort for jython
In-Reply-To: <>
References: <>
Message-ID: <15703.43533.581619.884543@localhost.localdomain>

    Patrick> Your flip comment has got me thinking about the "one best
    Patrick> idiom" for list appending. So I'll ask the question. Is there a
    Patrick> reason to want to get rid of list.append()? 

Certainly not for performance.  append is substantially faster than +=, at
least in part because of the list creation, especially if you cache the
method lookup. 


import time

def timefunc(s, args, *sargs, **kwds):
    t = time.time()
    apply(s, args+sargs, kwds)
    return time.time()-t

def appendit(l, o, n):
    append = l.append
    for i in xrange(n):
        append(o)

def extendit(l, o, n):
    extend = l.extend
    for i in xrange(n):
        extend([o])

def augassignit(l, o, n):
    for i in xrange(n):
        l += [o]

print "append small int:",
x = 0.0
for i in 1,2,3:
    x += timefunc(appendit, ([], 1, 100000))
print "%.3f" % (x/3)
print "aug assign small int:",
x = 0.0
for i in 1,2,3:
    x += timefunc(augassignit, ([], 1, 100000))
print "%.3f" % (x/3)
print "extend small int:",
x = 0.0
for i in 1,2,3:
    x += timefunc(extendit, ([], 1, 100000))
print "%.3f" % (x/3)
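
A sketch of the same comparison using the standard `timeit` module, written for a current Python 3 interpreter. The three functions are taken from the script above, with a `return` added so the results can be compared, and with the two missing loop bodies filled in on the assumption that they call the cached method once per item. No timings are quoted because they depend on the machine:

```python
import timeit

def appendit(l, o, n):
    append = l.append          # cache the bound method lookup
    for i in range(n):
        append(o)
    return l

def extendit(l, o, n):
    extend = l.extend
    for i in range(n):
        extend([o])            # builds a throwaway one-element list
    return l

def augassignit(l, o, n):
    for i in range(n):
        l += [o]               # also builds a throwaway one-element list
    return l

for func in (appendit, augassignit, extendit):
    # Best of three runs, each doing 100000 insertions into a fresh list.
    best = min(timeit.repeat(lambda: func([], 1, 100000), number=1, repeat=3))
    print("%-12s %.3f" % (func.__name__, best))
```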

From  Mon Aug 12 14:10:38 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 15:10:38 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <> <>              <> <> <>
Message-ID: <>

Walter Dörwald wrote:
> I'm back from vacation. Comments on the thread and a list
> of open issues are below.

I'm going on vacation for two weeks, so you'll have to take
it along from here.

Have fun,
Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug 12 14:18:52 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 09:18:52 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 02:37:32 EDT."
References: <> <>
Message-ID: <>

> I'm very much in favor of this change but a deprecation warning is not 
> enough - some suitable replacement should be provided to cryptographers 
> and other bit fiddlers.

You can do all the bit fiddling you want using longs already.  If you
want the result truncated to n bits, simply apply a mask after each
operation, e.g. (for 32-bit results) x = (x << 14) & 0xffffffff.
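
A minimal sketch of the mask-after-each-operation idea, written for a current Python 3 interpreter. The helper names `shl32` and `rotl32` are made up for this illustration:

```python
MASK32 = 0xffffffff

def shl32(x, n):
    """Left shift that discards everything above bit 31."""
    return (x << n) & MASK32

def rotl32(x, n):
    """Rotate the low 32 bits of x left by n positions."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32

print(hex(shl32(0xff, 28)))        # 0xf0000000
print(hex(rotl32(0x80000001, 4)))  # 0x18
```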

> Proposal:
> A standard module implementing the types [u]int[8|16|32|64]. These types
> would behave just like C integers - wrap around on overflow, etc and have 
> a guaranteed size regardless of platform. They can even have methods for
> bit rotation.

If you propose this as a Python module, I'm +/- 0; I don't have the
need, and I feel you can do all of this already, but I can see that
there may be one or two things that beginners at bit-fiddling might
find useful (like how to do sign extension or sign folding without an
if statement).
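
A minimal sketch of one branch-free way to do this for 32-bit values. The function names are made up, and reading "sign folding" as the inverse mapping (signed value to unsigned bit pattern) is an assumption:

```python
def sign_extend32(x):
    """Read the low 32 bits of x as a signed two's-complement value."""
    # XOR flips bit 31; subtracting 2**31 then yields the signed value.
    return ((x & 0xffffffff) ^ 0x80000000) - 0x80000000

def sign_fold32(x):
    """Map a signed 32-bit value onto its unsigned bit pattern."""
    return x & 0xffffffff

print(sign_extend32(0xff000000))  # -16777216
print(sign_fold32(-16777216))     # 4278190080
```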

If you were proposing a C module, an emphatic YAGNI accompanies a -1.

--Guido van Rossum (home page:

From  Mon Aug 12 14:23:58 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 09:23:58 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 10:09:35 +0200."
References: <> <> <>
Message-ID: <>

> > That's why I'm adding the warnings to 2.3.  Note that the bit pattern
> > in the lower 32 bits will remain the same; it's just the
> > interpretation of the sign that will change.
> That's exactly what I'd like too :-) With the only difference
> that you seem to see the sign bit as not included in the 32 bits.

I was using sloppy language by lumping "sign change" under "lost bits".
What I really meant was "returning a value that's different from what
the same operation on a long would return".  I've added something
about sign changes to the PEP.

> > Why do you want them to remain ints?  Does a long whose lower 32 bits
> > have the right bit pattern not work?
> No, because you usually pass these objects directly to some
> Python C function (directly as parameter or indirectly as item
> in a list or tuple) which often enough insists on getting a true
> integer object.

There's no excuse for that any more.  The 'i' and 'l' format chars of
PyArg_Parse* and PyInt_AsLong() both work for longs as well as for

> No argument about this. It's just that I see a lot of programs
> breaking because of the 0x1 << 31 returning a long.

I think you're overly pessimistic.  But that's why I'm putting the
warning in for 2.3 -- the semantics are the same as for 2.2, they
won't change until 2.4 (or later if this turns out to be a bigger

> That needn't
> be the case. People using this will know what they are doing and
> use a long when possible anyway. However, tweaking C extensions to
> also accept longs instead of integers requires hacking those
> extensions which I'd like to avoid if possible. I already had
> one of these instances with file.tell() returning a long and
> that caused a lot of trouble then.

Sorry, no go.  There's no way I can defend returning a different value
for x<<y depending on the type of x.

--Guido van Rossum (home page:

From  Mon Aug 12 14:38:02 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 09:38:02 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Mon, 12 Aug 2002 13:48:06 +0200."
References: <>
Message-ID: <>

> The PEP describes a Windows-only change to Unicode in file names: On
> Windows NT/2k/XP, Python would allow arbitrary Unicode strings as file
> names and pass them to the OS, instead of converting them to CP_ACP
> first. This applies to open() and all os functions that accept
> filenames.
> In addition, os.listdir() would return Unicode filenames if the argument
> is Unicode.
> Please comment on the PEP. There is an updated patch on
>; please comment on the patch as well.

I've added some comments to the patch.

I'm +0 on the PEP; I'd like to defer to people who actually use
Windows like Mark Hammond and Tim Peters.

--Guido van Rossum (home page:

From  Mon Aug 12 14:53:12 2002
From: (Barry A. Warsaw)
Date: Mon, 12 Aug 2002 09:53:12 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    GvR> If you propose this as a Python module, I'm +/- 0; I don't
    GvR> have the need, and I feel you can do all of this already, but
    GvR> I can see that there may be one or two things that beginners
    GvR> at bit-fiddling might find useful (like how to do sign
    GvR> extension or sign folding without an if statement).

A HOWTO might also suffice.


From  Mon Aug 12 16:21:00 2002
From: (Jack Jansen)
Date: Mon, 12 Aug 2002 17:21:00 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On Monday, August 12, 2002, at 08:37 , Oren Tirosh wrote:
> Proposal:
> A standard module implementing the types [u]int[8|16|32|64]. These types
> would behave just like C integers - wrap around on overflow, etc and have
> a guaranteed size regardless of platform. They can even have methods for
> bit rotation.

This, plus some syntactic sugar so I could easily specify values of 
these types in my source code, would do the trick.
Preferably in such a way that I can use the C code verbatim: at the 
moment I don't have to understand what the C code does, whether it uses 
other constants, strings, expressions, etc.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Mon Aug 12 16:29:14 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 17:29:14 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> There's no excuse for that any more.  The 'i' and 'l' format chars of
> PyArg_Parse* and PyInt_AsLong() both work for longs as well as for
> ints.

There is a change, of course: Passing 0xff<<24 to a function that uses
the "i" converter will produce an OverflowError, whereas it previously
would pass in the negative numbers.

For cases of "I want 32 bits in an int", you'll have to accept both
signed and unsigned 32 bits - something that is currently not
supported in ParseTuple.


From  Mon Aug 12 16:37:03 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 11:37:03 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 17:29:14 +0200."
References: <> <> <> <> <>
Message-ID: <>

> Guido van Rossum <> writes:
> > There's no excuse for that any more.  The 'i' and 'l' format chars of
> > PyArg_Parse* and PyInt_AsLong() both work for longs as well as for
> > ints.

> There is a change, of course: Passing 0xff<<24 to a function that uses
> the "i" converter will produce an OverflowError, whereas it previously
> would pass in the negative numbers.

And unfortunately the same will happen for the "l" converter
(PyInt_AsLong(<long>) does a signed range check).

> For cases of "I want 32 bits in an int", you'll have to accept both
> signed and unsigned 32 bits - something that is currently not
> supported in ParseTuple.

Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
new format code (or more) to accept signed 32-bit ints but also longs
in range(2**32).  This should be added in 2.3 so extensions can start
using it now, and user code can start passing longs in range(2**32)
now.  I propose 'k' (for masK).  We should backport this to 2.2.2 as
well.  Plus a variant on PyInt_AsLong() with the same semantics, maybe
named PyInt_AsMask().
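
In Python terms, the semantics proposed here might look roughly like the sketch below. The function name as_mask is hypothetical, and the exact accepted range is an assumption read off the paragraph above (signed 32-bit ints plus longs in range(2**32)); it is written for a current Python 3:

```python
def as_mask(value):
    """Accept a signed or unsigned 32-bit value and return its bit pattern."""
    # Assumed range: anything a signed 32-bit int can hold, plus the
    # unsigned values up to 2**32 - 1.
    if not -(1 << 31) <= value < (1 << 32):
        raise OverflowError("value does not fit in 32 bits")
    return value & 0xFFFFFFFF


print(hex(as_mask(-1)))          # 0xffffffff
print(hex(as_mask(0xff << 24)))  # 0xff000000
```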

Any takers?

--Guido van Rossum (home page:

From  Mon Aug 12 16:41:46 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 17:41:46 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <> <>
Message-ID: <>

Walter Dörwald <> writes:

>  > - charmap_encoding_error, which insists on implementing known error
>  >   handling algorithms inline,
> This is done for performance reasons.

Is that really worth it? Such errors are rare, and when they occur,
they usually cause an exception as the result of the "strict" error
handling.

I'd strongly encourage you to avoid duplication of code, and use
Python wherever possible.

> The PyCodec_XMLCharRefReplaceErrors functionality is
> independent of the rest, so moving this to Python
> won't reduce complexity that much. And it will
> slow down "xmlcharrefreplace" handling for those
> codecs that don't implement it inline.

Sure it will. But how much does that matter in the overall context of
generating HTML/XML?

>  > - the UnicodeError exception methods (which could be omitted, IMO).
> Those methods were implemented so that we can easily
> move to new style exceptions.

What are new-style exceptions?

> The exception attributes can then be members of the C struct and the
> accessor functions can be simple macros.

Again, I sense premature optimization.

> 1. For each error handler two Python function objects are created:
> One in the registry and a different one in the codecs module. This
> means that e.g.
> codecs.lookup_error("replace") != codecs.replace_errors

Why would this be a problem?

> We can fix that by making the name of the Python function object
> globally visible or by changing the codecs init function to do a
> lookup and use the result or simply by removing codecs.replace_errors

I recommend to fix this by implementing the registry in Python.

> 4. Assigning to an attribute of an exception object does not
> change the appropriate entry in the args attribute. Is this
> worth changing?

No. Exception objects should be treated as immutable (even if they
aren't). If somebody complains, we can fix it; until then, it suffices
if this is documented.

> 5. UTF-7 decoding does not yet take full advantage of the machinery:
> When an unterminated shift sequence is encountered (e.g. "+xxx")
> the faulty byte sequence has already been emitted.

It would be ok if it works as good as it did in 2.2. UTF-7 is rarely
used; if it is used, it is machine-generated, so there shouldn't be
any errors.


From  Mon Aug 12 16:49:06 2002
From: (Tim Peters)
Date: Mon, 12 Aug 2002 11:49:06 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: <>
Message-ID: <>

> Modified Files:
> Log Message:
> Portable way of producing unsigned 32-bit hex output to print the
> CRCs.
> Index:
> ===================================================================
> RCS file: /cvsroot/python/python/dist/src/Lib/test/,v
> retrieving revision 1.18
> retrieving revision 1.19
> diff -C2 -d -r1.18 -r1.19
> ***	23 Jul 2002 19:04:09 -0000	1.18
> ---	12 Aug 2002 15:26:05 -0000	1.19
> ***************
> *** 13,18 ****
>   # test the checksums (hex so the test doesn't break on 64-bit machines)
> ! print hex(zlib.crc32('penguin')), hex(zlib.crc32('penguin', 1))
> ! print hex(zlib.adler32('penguin')), hex(zlib.adler32('penguin', 1))
>   # make sure we generate some expected errors
> --- 13,20 ----
>   # test the checksums (hex so the test doesn't break on 64-bit machines)
> ! def fix(x):
> !     return "0x%x" % (x & 0xffffffffL)
> ! print fix(zlib.crc32('penguin')), fix(zlib.crc32('penguin', 1))
> ! print fix(zlib.adler32('penguin')), fix(zlib.adler32('penguin', 1))
>   # make sure we generate some expected errors

This raises a question:  what should crc32 and adler32 return?  They return
32-bit values, and that's part of external definitions so we can't change
that, but how we *view* "the sign bit" is up to us.  binascii.crc32()
always-- even on 64-bit boxes --returns a value in range(-2**31, 2**31).  I
know that because I forced it to not long ago.  I don't know what the other
guys return (zlib.crc32(), zlib.adler32(), ...?).

It would sure be nice if they returned values in range(0, 2**32) instead.  A
difficulty with changing this stuff is that checksums seem frequently to be
read and written via the struct module, with format code "l", and e.g.

>>> struct.pack("!l", 1L << 31)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to int
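
For comparison, a short sketch of the unsigned route on a current Python 3 (an assumption; the session above is from the 2.2/2.3 era). Masking with 0xFFFFFFFF gives a value in range(0, 2**32) regardless of what the checksum function returns, and format code "L" packs it without complaint, while "l" still refuses anything at or above 2**31. Note that modern Python reports this as struct.error rather than OverflowError. The helper name crc32_unsigned is made up for this example:

```python
import struct
import zlib


def crc32_unsigned(data):
    """Return the CRC-32 of data as a value in range(0, 2**32)."""
    return zlib.crc32(data) & 0xFFFFFFFF


checksum = crc32_unsigned(b"penguin")
packed = struct.pack("!L", checksum)        # unsigned format always fits
roundtrip = struct.unpack("!L", packed)[0]
print(roundtrip == checksum)

try:
    struct.pack("!l", 1 << 31)              # signed format rejects 2**31
except struct.error as exc:
    print("rejected:", exc)
```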

From  Mon Aug 12 16:50:57 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 17:50:57 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <> <> <> <> <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>Guido van Rossum <> writes:
>>>There's no excuse for that any more.  The 'i' and 'l' format chars of
>>>PyArg_Parse* and PyInt_AsLong() both work for longs as well as for
>>>ints.


> Martin:
>>There is a change, of course: Passing 0xff<<24 to a function that uses
>>the "i" converter will produce an OverflowError, whereas it previously
>>would pass in the negative numbers.
> And unfortunately the same will happen for the "l" converter
> (PyInt_AsLong(<long>) does a signed range check).
>>For cases of "I want 32 bits in an int", you'll have to accept both
>>signed and unsigned 32 bits - something that is currently not
>>supported in ParseTuple.
> Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> new format code (or more) to accept signed 32-bit ints but also longs
> in range(2**32).

Rather than inventing something new to be compatible to the existing
old status quo, I'd rather like to see new format codes for unsigned
integers and/or longs and have the existing ones support the new
status quo.

> This should be added in 2.3 so extensions can start
> using it now, and user code can start passing longs in range(2**32)
> now.  I propose 'k' (for masK).  We should backport this to 2.2.2 as
> well.  Plus a variant on PyInt_AsLong() with the same semantics, maybe
> named PyInt_AsMask().
> Any takers?

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Mon Aug 12 16:54:09 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 11:54:09 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: Your message of "Mon, 12 Aug 2002 11:49:06 EDT."
References: <>
Message-ID: <>

> This raises a question:  what should crc32 and adler32 return?  They return
> 32-bit values, and that's part of external definitions so we can't change
> that, but how we *view* "the sign bit" is up to us.  binascii.crc32()
> always-- even on 64-bit boxes --returns a value in range(-2**31, 2**31).  I
> know that because I forced it to not long ago.  I don't know what the other
> guys return (zlib.crc32(), zlib.adler32(), ...?).
> It would sure be nice if they returned values in range(0, 2**32) instead.  A
> difficulty with changing this stuff is that checksums seem frequently to be
> read and written via the struct module, with format code "l", and e.g.
> >>> struct.pack("!l", 1L << 31)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too large to convert to int
> >>>

Such programs will have to be changed to use format code "L" instead.

Or perhaps "l" should be allowed to accept longs in
range(-2**31, 2**32) ?

--Guido van Rossum (home page:

From  Mon Aug 12 16:57:09 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 11:57:09 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 17:50:57 +0200."
References: <> <> <> <> <> <> <>
Message-ID: <>

> > Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> > new format code (or more) to accept signed 32-bit ints but also longs
> > in range(2**32).
> Rather than inventing something new to be compatible to the existing
> old status quo, I'd rather like to see new format codes for unsigned
> integers and/or longs and have the existing ones support the new
> status quo.

That's okay too.  The function could be PyInt_AsUnsignedLong().  It
could convert negative 32-bit ints to unsigned as a backward
compatibility measure (with warning?) that will eventually disappear.

The format code could be 'I' for unsigned int, but I don't know what
to use for unsigned long.  Or perhaps still use 'k'/'K' for masK?

--Guido van Rossum (home page:

From  Mon Aug 12 17:03:14 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 18:03:14 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> Walter Dörwald <> writes:
>> > - charmap_encoding_error, which insists on implementing known error
>> >   handling algorithms inline,
>>This is done for performance reasons.
> Is that really worth it? Such errors are rare, and when they occur,
> they usually cause an exception as the result of the "strict" error
> handling.
> I'd strongly encourage you to avoid duplication of code, and use
> Python wherever possible.

See below: this is not always possible; much for the same reason
that exceptions are implemented in C as well.

>>The PyCodec_XMLCharRefReplaceErrors functionality is
>>independent of the rest, so moving this to Python
>>won't reduce complexity that much. And it will
>>slow down "xmlcharrefreplace" handling for those
>>codecs that don't implement it inline.
> Sure it will. But how much does that matter in the overall context of
> generating HTML/XML?
>> > - the UnicodeError exception methods (which could be omitted, IMO).
>>Those methods were implemented so that we can easily
>>move to new style exceptions.
> What are new-style exceptions?

Exceptions that are built as subclassable types.

>>The exception attributes can then be members of the C struct and the
>>accessor functions can be simple macros.
> Again, I sense premature optimization.

There's nothing premature here. By moving exception handling to
C level, you get *much* better performance than at Python level.
Remember that applications like e.g. escaping chars in an XML
document can cause lots of these exceptions to be generated.

>>1. For each error handler two Python function objects are created:
>>One in the registry and a different one in the codecs module. This
>>means that e.g.
>>codecs.lookup_error("replace") != codecs.replace_errors
> Why would this be a problem?
>>We can fix that by making the name of the Python function object
>>globally visible or by changing the codecs init function to do a
>>lookup and use the result or simply by removing codecs.replace_errors
> I recommend to fix this by implementing the registry in Python.

This doesn't work as I've already explained before. The predefined
error handling modes of builtin codecs must work with relying on
the Python import mechanism.

>>4. Assigning to an attribute of an exception object does not
>>change the appropriate entry in the args attribute. Is this
>>worth changing?
> No. Exception objects should be treated as immutable (even if they
> aren't). If somebody complains, we can fix it; until then, it suffices
> if this is documented.

What ? That exceptions are immutable ? I think it's a big win that
exceptions are in fact mutable -- they are great for transporting
extra information up the chain...

except Exception, obj:
     obj.been_there = 1
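
Spelled out a bit more, and in the syntax of a current Python 3 (an assumption; the fragment above is 2.x syntax), the idea could look like the sketch below. The function names and the line_number attribute are invented for illustration:

```python
def parse_record(line):
    """Parse one input line; raises ValueError on bad input."""
    return int(line)


def load(lines):
    """Yield parsed records, tagging any error with the offending line."""
    for number, line in enumerate(lines, 1):
        try:
            yield parse_record(line)
        except ValueError as exc:
            # Attach extra information and let the exception travel on.
            exc.line_number = number
            raise


try:
    list(load(["1", "2", "oops"]))
except ValueError as exc:
    print("bad input on line", exc.line_number)
```

The handler further up the chain sees the original exception object, including the attribute added on the way.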

>>5. UTF-7 decoding does not yet take full advantage of the machinery:
>>When an unterminated shift sequence is encountered (e.g. "+xxx")
>>the faulty byte sequence has already been emitted.
> It would be ok if it works as good as it did in 2.2. UTF-7 is rarely
> used; if it is used, it is machine-generated, so there shouldn't be
> any errors.


Marc-Andre Lemburg

From  Mon Aug 12 17:10:19 2002
From: (Tim Peters)
Date: Mon, 12 Aug 2002 12:10:19 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

> This raises a question:  what should crc32 and adler32 return?
> ...
> binascii.crc32() always-- even on 64-bit boxes --returns a value in
> range(-2**31, 2**31).
> ...
> I don't know what the other guys return (zlib.crc32(),
> zlib.adler32(), ...?).
> It would sure be nice if they returned values in range(0,
> 2**32) instead.  A difficulty with changing this stuff is that
> checksums seem frequently to be read and written via the struct
> module, with format code "l", and e.g.
> >>> struct.pack("!l", 1L << 31)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too large to convert to int

> Such programs will have to be changed to use format code "L" instead.

I'm not following this.  At least binascii.crc32() always produces a 32-bit
signed int now, so there's no *need* to use "L" now.  Are you saying that
binascii.crc32() should be changed to return a non-negative value always?
Also the other xyz.abc32() functions?

> Or perhaps "l" should be allowed to accept longs in
> range(-2**31, 2**32) ?

Well, unpacking a packed value wouldn't always return the value you started
with then (pack 2**31 via "l", then unpack it via "l" and you get
back -2**31), so it's not very attractive.  If you dump a checksum via pack,
then unpack it later, you really want to get back the same value, not just
"the same bits after some fiddling".
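
The round-trip problem can be reproduced with the struct module as it exists today. Since "l" does not accept 2**31, the sketch below (for a current Python 3; the helper name is made up) stands in for the proposed leniency by packing with "L" and unpacking the same bytes with "l":

```python
import struct


def reinterpret_as_signed(value):
    """Pack value as unsigned 32-bit, then read the same bytes as signed."""
    return struct.unpack("!l", struct.pack("!L", value))[0]


print(reinterpret_as_signed(2 ** 31))      # -2147483648, not what went in
print(reinterpret_as_signed(2 ** 31 - 1))  # 2147483647, unchanged
```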

From  Mon Aug 12 17:24:47 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 18:24:47 +0200
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Or perhaps "l" should be allowed to accept longs in range(-2**31,
> 2**32) ?

For the struct and array modules, that sounds reasonable.


From  Mon Aug 12 17:26:37 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 12:26:37 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: Your message of "Mon, 12 Aug 2002 12:10:19 EDT."
References: <>
Message-ID: <>

> [Tim]
> > This raises a question:  what should crc32 and adler32 return?
> > ...
> > binascii.crc32() always-- even on 64-bit boxes --returns a value in
> > range(-2**31, 2**31).
> > ...
> > I don't know what the other guys return (zlib.crc32(),
> > zlib.adler32(), ...?).
> >
> > It would sure be nice if they returned values in range(0,
> > 2**32) instead.  A difficulty with changing this stuff is that
> > checksums seem frequently to be read and written via the struct
> > module, with format code "l", and e.g.
> >
> > >>> struct.pack("!l", 1L << 31)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > OverflowError: long int too large to convert to int

> [Guido]
> > Such programs will have to be changed to use format code "L" instead.

> I'm not following this.  At least binascii.crc32() always produces a 32-bit
> signed int now, so there's no *need* to use "L" now.  Are you saying that
> binascii.crc32() should be changed to return a non-negative value always?
> Also the other xyz.abc32() functions?

Um, I thought *you* were proposing that!  What else did you mean by
"It would sure be nice if they returned values in range(0, 2**32)
instead" ?

> > Or perhaps "l" should be allowed to accept longs in
> > range(-2**31, 2**32) ?
> Well, unpacking a packed value wouldn't always return the value you
> started with then (pack 2**31 via "l", then unpack it via "l" and
> you get back -2**31), so it's not very attractive.  If you dump a
> checksum via pack, then unpack it later, you really want to get back
> the same value, not just "the same bits after some fiddling".

Yeah, you can't win. :-(

--Guido van Rossum (home page:

From  Mon Aug 12 17:27:32 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 12:27:32 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: Your message of "Mon, 12 Aug 2002 18:24:47 +0200."
References: <> <>
Message-ID: <>

> Guido van Rossum <> writes:
> > Or perhaps "l" should be allowed to accept longs in range(-2**31,
> > 2**32) ?
> For the struct and array modules, that sounds reasonable.

Though Tim brought up that then you won't always get back what you put
in (if you put in a value > sys.maxint, it comes back negative).

Is that a problem or not?  I tend to think that's not how this is most
often used.

--Guido van Rossum (home page:

From  Mon Aug 12 17:45:24 2002
From: (Tim Peters)
Date: Mon, 12 Aug 2002 12:45:24 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

> Such programs will have to be changed to use format code "L" instead.

> I'm not following this.  At least binascii.crc32() always
> produces a 32-bit signed int now, so there's no *need* to use "L" now.
> Are you saying that binascii.crc32() should be changed to return a
> non-negative value always?  Also the other xyz.abc32() functions?

> Um, I thought *you* were proposing that!  What else did you mean by
> "It would sure be nice if they returned values in range(0, 2**32)
> instead" ?

I did suggest it, yes.  Had you said "Such programs *would* have to be
changed ...", my response would have been different.  But you said "will",
which reads like you already decided such a change will be made.  Now it
sounds like it's undecided (OK by me either way, I'm just trying to locate
our current position on the map <wink>).

From  Mon Aug 12 17:51:20 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 12:51:20 -0400
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: Your message of "Mon, 12 Aug 2002 12:45:24 EDT."
References: <>
Message-ID: <>

> Subject: RE: [Python-Dev] 32-bit values (was RE: [Python-checkins]
>     python/dist/src/Lib/test,1.18,1.19)
> From: Tim Peters <>
> To: Guido van Rossum <>
> Cc: PythonDev <>
> Date: Mon, 12 Aug 2002 12:45:24 -0400
> X-Spam-Level: 
> [Guido]
> > Such programs will have to be changed to use format code "L" instead.
> [Tim]
> > I'm not following this.  At least binascii.crc32() always
> > produces a 32-bit signed int now, so there's no *need* to use "L" now.
> > Are you saying that binascii.crc32() should be changed to return a
> > non-negative value always?  Also the other xyz.abc32() functions?
> [Guido]
> > Um, I thought *you* were proposing that!  What else did you mean by
> > "It would sure be nice if they returned values in range(0, 2**32)
> > instead" ?
> I did suggest it, yes.  Had you said "Such programs *would* have to be
> changed ...", my response would have been different.  But you said "will",
> which reads like you already decided such a change will be made.  Now it
> sounds like it's undecided (OK by me either way, I'm just trying to locate
> our current position on the map <wink>).

I responded to your specific example, which probably wasn't how you
intended to use it.

I really don't know what's the best return range for 32-bit checksums
given all the constraints.  I'd leave this alone until we have decided
what to do with the other issues (like what to do with extensions that
use signed 32-bit values to represent masks now).

--Guido van Rossum (home page:

From  Mon Aug 12 17:23:00 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 18:23:00 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <> <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> > What are new-style exceptions?
> Exceptions that are built as subclassable types.

Exceptions first of all inherit from Exception. When/if Exception
stops being a class, we'll have to deal with more issues than the PEP
293 exceptions.

> There's nothing premature here. By moving exception handling to
> C level, you get *much* better performance than at Python level.

Can you give a specific example: What Python code, how much better
performance?

> This doesn't work as I've already explained before. The predefined
> error handling modes of builtin codecs must work with relying on
> the Python import mechanism.

You mean "without"? Where did you explain this before? And why is
that? Guido argues that more of the central interpreter machinery must
be moved to Python - I can't see why codecs should be an exception
here.

> What ? That exceptions are immutable ? I think it's a big win that
> exceptions are in fact mutable -- they are great for transporting
> extra information up the chain...

I see. So this is an open issue.


From  Mon Aug 12 18:37:29 2002
From: (Trent Mick)
Date: Mon, 12 Aug 2002 10:37:29 -0700
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>; from on Mon, Aug 12, 2002 at 09:38:02AM -0400
References: <> <>
Message-ID: <>

[Guido van Rossum wrote]
> I've added some comments to the patch.
> I'm +0 on the PEP; I'd like to defer to people who actually use
> Windows like Mark Hammond and Tim Peters.

We (ActiveState) are committed to getting the functionality in -- which
means that David and myself can help with coding and testing.  Depending
on the schedule, I can help with e.g. filling out the test suite.  We
can also setup some test machines to test things that are probably too
odd to fit in the test suite.

Martin, if there is anything we can do to help with for this patch,
please let us know.


From  Mon Aug 12 18:43:20 2002
From: (M.-A. Lemburg)
Date: Mon, 12 Aug 2002 19:43:20 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>>>What are new-style exceptions?
>>Exceptions that are built as subclassable types.
> Exceptions first of all inherit from Exception. When/if Exception
> stops being a class, we'll have to deal with more issues than the PEP
> 293 exceptions.

Right. It would be nice to have classes or at least exceptions
turn into new-style types as well. Then you'd have access to
slots and all the other goodies which make a great difference
in accessing performance at C level.

>>There's nothing premature here. By moving exception handling to
>>C level, you get *much* better performance than at Python level.
> Can you give a specific example: What Python code, how much better
> performance?

Walter has the details here.

>>This doesn't work as I've already explained before. The predefined
>>error handling modes of builtin codecs must work with relying on
>>the Python import mechanism.
> You mean "without"? 

Right. s/with/without/.

> Where did you explain this before? 

Hmm, I remember having posted the reasoning I gave here
in another response on this thread, but I can't find it
at the moment.

> And why is
> that? Guido argues that more of the central interpreter machinery must
> be moved to Python - I can't see why codecs should be an exception
> here.

The problem is the same as what we had with the
module early on in the 1.6 alphas: if this module isn't found
all kinds of things start failing. The same would happen when
you start to use builtin codecs which have external error handler
implementation as .py files, e.g. unicode('utf-8', 'replace')
could then fail because of an ImportError.

For the charmap codec it's mostly about performance. I don't
have objections for other codecs which rely on external

>>What ? That exceptions are immutable ? I think it's a big win that
>>exceptions are in fact mutable -- they are great for transporting
>>extra information up the chain...
> I see. So this is an open issue.

I wouldn't call it an issue. It's a feature :-) (and one that makes
Python's exception mechanism very powerful)

Marc-Andre Lemburg

From  Mon Aug 12 18:57:18 2002
From: (Martijn Faassen)
Date: Mon, 12 Aug 2002 19:57:18 +0200
Subject: [Python-Dev] Re: [PythonLabs] PEP 2
In-Reply-To: <>
References: <> <>
Message-ID: <>


 I was going to point David at PEP 2 as the guidelines for getting modules 
 added to the standard library, but I don't think PEP 2 really describes 
 current practice.

 PEP 2 says:

    When developers wish to have a contribution accepted into the
    standard library, they will first form a group of maintainers
    (normally initially consisting of themselves).

    Then, this group shall produce a PEP called a library PEP. A
    library PEP is a special form of standards track PEP.  The library
    PEP gives an overview of the proposed contribution, along with the
    proposed contribution as the reference implementation.  This PEP
    should also contain a motivation on why this contribution should
    be part of the standard library.

 I think only in rare situations do we need a PEP for a library
 module.  If you agree then I think we should rewrite PEP 2 to describe
 current practice.

 [Barry's description of current practice later in this post]


 Sounds like a good idea.

 It really depends a lot on circumstances.  PEP 282 was written to
 propose a logging module, and then a logging module was proposed that
 will soon go into the std lib (I hope).  But Optik will end up being
 adapted without a PEP.

[back to me]

PEP 2 indeed does not describe current practice; it tries to introduce new
procedures for the development of the standard library.

PEP 2 was written as informed by this post by Tim Peters:

which in turn is referring to this post by Guido:

Tim said the following:

> All the core
> developers have done major work on the libraries, so that's not a hangup.
> What is a hangup is that people also want stuff the current group of core
> developers has no competence (and/or sometimes interest) to write.  Like SSL
> support on Windows, or IPv6 support, etc.  Expert-level work in a field
> requires experts in that field to contribute.  We also need a plan to keep
> their stuff running after they go away again, the lack of which is one
> strong reason Guido resists adding stuff to the library.

And he suggested I look at the then empty PEP 2.

I took the point of these postings to be that many core library modules
cannot be developed and maintained by the core developers, as they lack
the knowledge and expertise to do so. Therefore the community needs
to develop and importantly also maintain those. In particular,
having explicit and active maintainers was held to be a particularly 
important precondition for library inclusion by the core developers.

Since I thought development of the standard library was important I
tried to make sure that these requirements were to be fulfilled by
people wanting to add a new module.

Barry describes these steps as the way new modules get added now:

- develop the library as an independent project, outside the Python project

- make the library available to the Python community, usually in the
  form of a distutils package

- get feedback and experience from the Python community at large

- if the module becomes popular, is widely backed, and/or fills a   
  niche in the standard library, propose it for inclusion in the core
  distro via discussion and consensus building on python-dev

And quoting from PEP 2:

    The library PEP differs from a normal standard track PEP in that
    the reference implementation should in this case always already
    have been written before the PEP is to be reviewed for inclusion
    by the integrators and to be commented upon by the community; the
    reference implementation _is_ the proposed contribution.

and Barry quotes PEP 2 himself:

    When developers wish to have a contribution accepted into the
    standard library, they will first form a group of maintainers
    (normally initially consisting of themselves).

This is not incompatible at all with the above procedure, if you 
read it carefully. :) The idea of PEP 2 is that there is *already* 
a (potential) contribution, and people want it accepted into the
standard library. The PEP does not describe where this contribution
is coming from; it may indeed be a popular module or it may fill
a niche or whatnot, or it may simply have a stunning design and
implementation extremely useful to everybody. I can make this more
explicit in the PEP.

What PEP 2 tries to supply is a procedure to follow if people
have already decided they would like to try to get a module or set of
modules accepted into the standard library. They can decide this before
or after they write the module; the PEP doesn't care -- as long as the
module is there when they submit the library PEP. At least they know
there'll be Integrators that will review things, and they know they had
better come up with some maintainers before submitting the PEP.

I can add a phrase to make initial community participation more important,
though the PEP process itself already indicates to ask for community input
when submitting a PEP.

Right now the Integrators are described to be PythonLabs, but this could be a
separate group dedicated to maintaining the library later on if desired.

Anyway, what I was aiming for with the PEP was not to codify existing
procedure but to improve on it, informed by what I thought were
the requirements. I thought it was a problem that the standard library is 
apparently not the central focus of the core developers, while the standard
library *is* a very important part of what makes Python appealing. If
the community is to develop it further, I thought the procedures to do
so needed a bit more framework than there is now.

Perhaps I misunderstood something and perhaps PEP 2 won't help anyway,
but that's the background behind it.

I cc-ed my reply to Python-dev as this may need a bit more input. There was
only little feedback (from Aahz, I recall, though shamefully I think I forgot
to integrate one of his suggestions on backporting bugfixes) when I first
posted it both on comp.lang.python and on python-dev, but perhaps the
reasoning behind it was unclear.

PEP 2 is here:



From  Mon Aug 12 19:39:25 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Mon, 12 Aug 2002 20:39:25 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<> <>
Message-ID: <>


Martin v. Loewis wrote:

> Walter Dörwald <> writes:
>> > - charmap_encoding_error, which insists on implementing known error
>> >   handling algorithms inline,
>>This is done for performance reasons.
> Is that really worth it? Such errors are rare, and when they occur,
> they usually cause an exception as the result of the "strict" error
> handling.

Of course it's irrelevant how fast the exception is raised, but it could
be important for the handlers that really do a replacement.

> I'd strongly encourage you to avoid duplication of code, and use
> Python wherever possible.
>>The PyCodec_XMLCharRefReplaceErrors functionality is
>>independent of the rest, so moving this to Python
>>won't reduce complexity that much. And it will
>>slow down "xmlcharrefreplace" handling for those
>>codecs that don't implement it inline.
> Sure it will. But how much does that matter in the overall context of
> generating HTML/XML?

See the attached test script. It encodes 100 versions of the german
text on

Output is as follows:
1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.044 (factor=1.962)
xml2: 0.267 (factor=12.003)
xml3: 0.723 (factor=32.506)
workaround: 5.151 (factor=231.702)
i.e. a 1.7MB string with 2.3% unencodable characters was
encoded.

Using the inline xmlcharrefreplace instead of ignore is
half as fast. Using a callback instead of the inline
implementation is a factor of 12 slower than ignore.
Using the Python implementation of the callback is a
factor of 32 slower and using the pre-PEP workaround
is a factor of 231 slower.
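
For readers who want to repeat this kind of comparison on a current
interpreter, here is a minimal sketch in modern Python 3 (it is not the
attached Python 2 script). The handler name "xml_python_sketch" and the
sample text are assumptions made up for illustration, and the absolute
timings will differ from the figures quoted above.

```python
import codecs
import timeit


def xml_python(exc):
    # Pure-Python counterpart of the built-in "xmlcharrefreplace" handler.
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        return ("".join("&#%d;" % ord(c) for c in bad), exc.end)
    raise TypeError("don't know how to handle %r" % exc)


# "xml_python_sketch" is a hypothetical name chosen for this example.
codecs.register_error("xml_python_sketch", xml_python)

# Sample text with a few non-ASCII characters (an assumption, not the
# German text used in the original measurement).
text = "Gr\u00fc\u00dfe aus K\u00f6ln, " * 1000

inline = text.encode("ascii", "xmlcharrefreplace")
callback = text.encode("ascii", "xml_python_sketch")

for name in ("ignore", "xmlcharrefreplace", "xml_python_sketch"):
    seconds = timeit.timeit(lambda: text.encode("ascii", name), number=20)
    print("%s: %.4f s" % (name, seconds))
```

Both handlers should produce the same bytes; only the time spent differs.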

Replacing every unencodable character with u"\u4242" and
using "iso-8859-15" gives:
ignore: 0.351 (factor=1.000)
xmlcharrefreplace: 0.390 (factor=1.113)
xml2: 0.653 (factor=1.862)
xml3: 1.137 (factor=3.244)
workaround: 12.310 (factor=35.117)

 > [...]
>>The exception attributes can then be members of the C struct and the
>>accessor functions can be simple macros.
> Again, I sense premature optimization.

No, it's more like anticipating change.

>>1. For each error handler two Python function objects are created:
>>One in the registry and a different one in the codecs module. This
>>means that e.g.
>>codecs.lookup_error("replace") != codecs.replace_errors
> Why would this be a problem?

It's just unintuitive.

>>We can fix that by making the name of the Python function object
>>globally visible or by changing the codecs init function to do a
>>lookup and use the result or simply by removing codecs.replace_errors
> I recommend to fix this by implementing the registry in Python.

Even simpler would be to move the initialization of the module
variables from Modules/_codecsmodule.c to Lib/ There is
no need for them to be available in _codecs. All that is required
for this change is to add

    strict_errors = lookup_error("strict")
    ignore_errors = lookup_error("ignore")
    replace_errors = lookup_error("replace")
    xmlcharrefreplace_errors = lookup_error("xmlcharrefreplace")
    backslashreplace_errors = lookup_error("backslashreplace")


The registry should be available via two simple C APIs, just
like the encoding registry.
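
As a rough illustration of the registry behaviour under discussion, the
following modern Python 3 sketch registers a handler and checks that
codecs.lookup_error() hands back the very same object. The handler and
its name "sketch_question_mark" are hypothetical and exist only for this
example.

```python
import codecs


def question_mark(exc):
    """Replace each unencodable character with '?' (illustrative handler)."""
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("don't know how to handle %r" % exc)


# Register under a made-up name, then look it up again.
codecs.register_error("sketch_question_mark", question_mark)
same_object = codecs.lookup_error("sketch_question_mark") is question_mark

encoded = "na\u00efve".encode("ascii", "sketch_question_mark")
print(same_object, encoded)
```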

>>4. Assigning to an attribute of an exception object does not
>>change the appropriate entry in the args attribute. Is this
>>worth changing?
> No. Exception objects should be treated as immutable (even if they
> aren't).

The codecs in the PEP *do* modify attributes of the exception

> If somebody complains, we can fix it; until then, it suffices
> if this is documented.

It can't really be fixed for codecs implemented in Python. For codecs
that use the C functions we could add the functionality that e.g.
PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
but AFAICT it can't be done easily for Python where attribute assignment
directly goes to the instance dict.

If those exception classes were new style classes it would be simple, 
because the attributes would be properties and args would probably
be generated lazily.
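
A small sketch of both halves of this point, in modern Python 3. The first
part shows that assigning to an attribute of a UnicodeEncodeError leaves
args untouched (the index of the reason within args is an observation about
current Python 3, not something stated in the PEP). The second part uses a
hypothetical class, LazyArgsError, to show what "args generated lazily from
the attributes" could look like.

```python
# Part 1: attribute assignment does not update args.
exc = UnicodeEncodeError("ascii", "\u00e4", 0, 1, "ordinal not in range(128)")
exc.reason = "changed by a handler"
stale_reason = exc.args[4]  # still the original reason


# Part 2: hypothetical exception whose args tuple is computed on demand.
class LazyArgsError(Exception):
    def __init__(self, encoding, reason):
        super().__init__()
        self.encoding = encoding
        self.reason = reason

    @property
    def args(self):
        # Built from the current attribute values each time it is read.
        return (self.encoding, self.reason)


lazy = LazyArgsError("ascii", "bad character")
lazy.reason = "worse character"
print(stale_reason, lazy.args)
```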

>>5. UTF-7 decoding does not yet take full advantage of the machinery:
>>When an unterminated shift sequence is encountered (e.g. "+xxx")
>>the faulty byte sequence has already been emitted.
> It would be ok if it works as good as it did in 2.2. UTF-7 is rarely
> used; if it is used, it is machine-generated, so there shouldn't be
> any errors.

It does:
 >>> "+xxx".decode("utf-7", "replace")

although the result should probably have been u'\ufffd'.

    Walter Dörwald


import codecs, time

def xml3(exc):
	if isinstance(exc, UnicodeEncodeError):
		return (u"".join([ u"&#%d;" % ord(c) for c in exc.object[exc.start:exc.end]]), exc.end)
	else:
		raise TypeError("don't know how to handle %r" % exc)

count = 0

def check(exc):
	global count
	count += exc.end-exc.start
	return (u"", exc.end)

codecs.register_error("xmlcheck", check)
codecs.register_error("xml2", codecs.xmlcharrefreplace_errors)
codecs.register_error("xml3", xml3)

l = 100
s = unicode(open("tapferschneider.txt").read(), "latin-1")
s *= l

s.encode("ascii", "xmlcheck")

print "%d chars, %.03f%% unenc" % (len(s), 100.*(float(count)/len(s)))

handlers = ["ignore", "xmlcharrefreplace", "xml2", "xml3"]
times = [0]*(len(handlers)+1)
res = [0]*(len(handlers)+1)
for (i, h) in enumerate(handlers):
	t1 = time.time()
	res[i] = s.encode("ascii", h)
	t2 = time.time()
	times[i] = t2-t1
	print "%s: %.03f (factor=%.03f)" % (handlers[i], times[i], times[i]/times[0])

i = len(handlers)
t1 = time.time()
v = []
for c in s:
	try:
		v.append(c.encode("ascii"))
	except UnicodeError:
		v.append("&#%d;" % ord(c))
res[i] = "".join(v)
t2 = time.time()
times[i] = t2-t1
print "workaround: %.03f (factor=%.03f)" % (times[i], times[i]/times[0])


From  Mon Aug 12 19:47:54 2002
From: (Walter Dörwald)
Date: Mon, 12 Aug 2002 20:47:54 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<> <> <>
Message-ID: <>

M.-A. Lemburg wrote:

> Martin v. Loewis wrote:
 > [...]
>>> codecs.lookup_error("replace") != codecs.replace_errors
>> [...]
>> I recommend to fix this by implementing the registry in Python.
> This doesn't work as I've already explained before. The predefined
> error handling modes of builtin codecs must work with relying on
> the Python import mechanism.

s/with/without/ ?

At least "strict" should be implemented inline, because reading
broken .pyc files which contain (utf-8 encoded) unicode constants
would probably lead to all kinds of interesting problems.


    Walter Dörwald

From  Mon Aug 12 20:05:01 2002
From: (Steve Holden)
Date: Mon, 12 Aug 2002 15:05:01 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <> <> <> <> <> <> <>              <>  <>
Message-ID: <086f01c24233$290dc9d0$>

----- Original Message -----
From: "Guido van Rossum" <>
To: "M.-A. Lemburg" <>
Cc: "Martin v. Loewis" <>; "Jack Jansen"
<>; <>
Sent: Monday, August 12, 2002 11:57 AM
Subject: Re: [Python-Dev] Deprecation warning on integer shifts and such

> > > Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> > > new format code (or more) to accept signed 32-bit ints but also longs
> > > in range(32).
> >
> > Rather than inventing something new to be compatible to the existing
> > old status quo, I'd rather like to see new format codes for unsigned
> > integers and/or longs and have the existing ones support the new
> > status quo.
> That's okay too.  The function could be PyInt_AsUnsignedLong().  It
> could convert negative 32-bit ints to unsigned as a backward
> compatibility measure (with warning?) that will eventually disappear.
> The format code could be 'I' for unsigned int, but I don't know what
> to use for unsigned long.  Or perhaps still use 'k'/'K' for masK?

Does 32 here actually mean 32, or does it mean length of int -- I'm
presuming there are or will be platforms with 64-bit PyInts?

Steve Holden                       
Python Web Programming      

From  Mon Aug 12 20:13:38 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 15:13:38 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 15:05:01 EDT."
References: <> <> <> <> <> <> <> <> <>
Message-ID: <>

> > > > Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> > > > new format code (or more) to accept signed 32-bit ints but also longs
> > > > in range(32).
> > >
> > > Rather than inventing something new to be compatible to the existing
> > > old status quo, I'd rather like to see new format codes for unsigned
> > > integers and/or longs and have the existing ones support the new
> > > status quo.
> >
> > That's okay too.  The function could be PyInt_AsUnsignedLong().  It
> > could convert negative 32-bit ints to unsigned as a backward
> > compatibility measure (with warning?) that will eventually disappear.
> >
> > The format code could be 'I' for unsigned int, but I don't know what
> > to use for unsigned long.  Or perhaps still use 'k'/'K' for masK?
> Does 32 here actually mean 32, or does it mean length of int -- I'm
> presuming there are or will be platforms with 64-bit PyInts?

Good question.  Much code that uses these features assumes 32 bits.
OTOH the same problem occurs for real on 64-bit systems at the 64-bit

--Guido van Rossum (home page:

From  Mon Aug 12 20:56:23 2002
From: (Jack Jansen)
Date: Mon, 12 Aug 2002 21:56:23 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On maandag, augustus 12, 2002, at 01:48 , Martin v. Löwis wrote:

> The PEP describes a Windows-only change to Unicode in file names: On
> Windows NT/2k/XP, Python would allow arbitrary Unicode strings as file
> names and pass them to the OS, instead of converting them to CP_ACP
> first. This applies to open() and all os functions that accept
> filenames.
> In addition, os.list() would return Unicode filenames if the argument
> is Unicode.

This is the bit I still don't like (at least, if I'm not
mistaken I commented on it a while ago too). A routine could be
doing an os.list() expecting strings, but suddenly someone
passes it a unicode directoryname and the return value would
change.

I would much prefer an optional encoding argument whereby you
give the encoding in which you want the return value. Default
would be the local filesystem encoding. If you pass unicode you
will get direct unicode on XP/2K, and a converted string on
other platforms (but always unicode).

Oh yes, the same reasoning would hold for readlink(), getcwd()
and any other call that returns filenames.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -

From  Mon Aug 12 21:07:46 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 16:07:46 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Mon, 12 Aug 2002 21:56:23 +0200."
References: <>
Message-ID: <>

> >
> >
> > The PEP describes a Windows-only change to Unicode in file names: On
> > Windows NT/2k/XP, Python would allow arbitrary Unicode strings as file
> > names and pass them to the OS, instead of converting them to CP_ACP
> > first. This applies to open() and all os functions that accept
> > filenames.
> >
> > In addition, os.list() would return Unicode filenames if the argument
> > is Unicode.
> This is the bit I still don't like (at least, if I'm not 
> mistaken I commented on it a while ago too). A routine could be 
> doing an os.list() expecting strings, but suddenly someone 
> passes it a unicode directoryname and the return value would 
> change.

Hm, that would be the responsibility of whoever passes it Unicode.
Most code works just fine when presented with Unicode where 8-bit
strings are expected.  It's only code that assumes the 8-bit strings
are Latin-1 (or something else besides ASCII) that gets in trouble.
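
To make "the return type follows the argument type" concrete, here is a
minimal sketch using modern Python 3, where os.listdir() returns str names
for a str path and bytes names for a bytes path. This is today's behaviour,
shown as an analogy to the proposal rather than as the 2002 implementation;
the file name "example.txt" is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create one file so the listing is not empty.
    open(os.path.join(directory, "example.txt"), "w").close()

    text_names = os.listdir(directory)                # str in, str out
    byte_names = os.listdir(os.fsencode(directory))   # bytes in, bytes out

print(text_names, byte_names)
```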

But shouldn't it return Unicode whenever there are filenames in the
directory that can't be represented as ASCII?

That's what Tkinter does: Tk gives back UTF-8, which degenerates to
ASCII if there are only ASCII chars; if any high bits are detected,
Tkinter decodes the UTF-8, turning the return string into Unicode.
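
The rule described above can be sketched as follows in modern Python 3.
The function name tk_style_result is hypothetical, and this is only an
illustration of the "decode only when high bits are present" idea, not
Tkinter's actual code.

```python
def tk_style_result(raw):
    """Return raw bytes unchanged if pure ASCII, else decode them as UTF-8."""
    if all(byte < 0x80 for byte in raw):
        return raw                 # ASCII only: leave as an 8-bit string
    return raw.decode("utf-8")     # high bits seen: hand back Unicode


plain = tk_style_result(b"readme.txt")
fancy = tk_style_result("caf\u00e9.txt".encode("utf-8"))
print(type(plain).__name__, type(fancy).__name__)
```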

> I would much prefer an optional encoding argument whereby you 
> give the encoding in which you want the return value. Default 
> would be the local filesystem encoding. If you pass unicode you 
> will get direct unicode on XP/2K, and a converted string on 
> other platforms (but always unicode).

Hm, I don't know if I'd like os.listdir() to have an encoding
argument.  Sounds like the wrong solution somehow.

> Oh yes, the same reasoning would hold for readlink(), getcwd() 
> and any other call that returns filenames.


--Guido van Rossum (home page:

From  Mon Aug 12 21:18:26 2002
From: (Jack Jansen)
Date: Mon, 12 Aug 2002 22:18:26 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On maandag, augustus 12, 2002, at 05:37 , Guido van Rossum wrote:
> Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> new format code (or more) to accept signed 32-bit ints but also longs
> in range(32).  This should be added in 2.3 so extensions can start
> using it now, and user code can start passing longs in range(2**32)
> now.  I propose 'k' (for masK).  We should backport this to 2.2.2 as
> well.  Plus a variant on PyInt_AsLong() with the same semantics, maybe
> named PyInt_AsMask().

Ow, pleeeeeeeeeeeeeeeeeeeeeeeeaaaaaase........

Just before 2.1 was released (or was it 2.0?) on a whim someone 
"fixed" the short integer handling to bother about signs, in a 
backward-incompatible way, despite the fact that about 95% of
the short PyArg_Parse formats in the core were mine, and I asked 
for some form of backward compatibility. I spent about 2 weeks 
going over a few thousand API calls to fix this mess at a time I 
had more than enough other work on my hands.

Can we please make this change in a backwards-compatible way, 
i.e. leave the i and l formats alone and use something new for 
"range-checked-int" and "range-checked-long"?

I already fear that I have to come up with some sort of a fix 
for the range-check warning (more than 6000 lines worth of 
constant definitions that can currently be copied verbatim from 
C header files to Python will have to be parsed, and computed, 
and all these things can contain references to other constants, 
strings and who knows what more, see Mac/Lib/Carbon/*.py), I 
really could do without more work on my plate...
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Mon Aug 12 21:49:54 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 16:49:54 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 22:18:26 +0200."
References: <>
Message-ID: <>

> On maandag, augustus 12, 2002, at 05:37 , Guido van Rossum wrote:
> > Oops.  Darn.  You're right.  Sigh.  That's painful.  We have to add a
> > new format code (or more) to accept signed 32-bit ints but also longs
> > in range(32).  This should be added in 2.3 so extensions can start
> > using it now, and user code can start passing longs in range(2**32)
> > now.  I propose 'k' (for masK).  We should backport this to 2.2.2 as
> > well.  Plus a variant on PyInt_AsLong() with the same semantics, maybe
> > named PyInt_AsMask().
> Ow, pleeeeeeeeeeeeeeeeeeeeeeeeaaaaaase........
> Just before 2.1 was released (or was it 2.0?) on a whim someone 
> "fixed" the short integer handling to bother about signs, in a 
> backward-incompatible way, despite the fact that about 95% of
> the short PyArg_Parse formats in the core were mine, and I asked 
> for some form of backward compatibility. I spent about 2 weeks 
> going over a few thousand API calls to fix this mess at a time I 
> had more than enough other work on my hands.

Oops.  That wasn't intended of course.

> Can we please make this change in a backwards-compatible way, 
> i.e. leave the i and l formats alone and use something new for 
> "range-checked-int" and "range-checked-long"?

Um, they *already* do range checking.  'i' requires that the value
(whether it comes from a Python int or a Python long) is in the range
[INT_MIN, INT_MAX].  'l' doesn't do range checking on Python ints
(because they are defined to fit in a C long), but for a Python long
it requires that the value fits in the range of a signed C long,
i.e. [-sys.maxint-1, sys.maxint].

The problem is that hex constants and shifted values may be
represented by signed Python ints, abusing the sign bit as a mask bit,
*or* by Python longs that represent corresponding values by
nonnegative values in the range [0, 2*sys.maxint+1].  Since
PyArg_Parse* doesn't know whether the value will be used as a mask or
as a real signed int, we must allow negative Python longs (in the
range [-sys.maxint-1, -1]) as well.

That means that the range checking (for long values) has to be
defective from the point of view of a function that doesn't want a
mask value but really expects a signed C int or long: Python code can
pass in a Python long value that's too large, but because it's still
in the 32-bit range the Python code won't be told about the overflow
error, and the C code will happily be given a large negative value.
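
The two representations of a 32-bit mask can be sketched like this in
modern Python 3. The helper names as_unsigned32 and as_signed32 are
hypothetical, and the width is fixed at 32 bits as an assumption for the
example.

```python
def as_unsigned32(value):
    """Map a value onto the nonnegative range [0, 2**32 - 1]."""
    return value & 0xFFFFFFFF


def as_signed32(value):
    """Map a value onto the signed range [-2**31, 2**31 - 1]."""
    value &= 0xFFFFFFFF
    # If the sign bit is set, the same bit pattern denotes a negative number.
    return value - (1 << 32) if value & 0x80000000 else value


mask = 0xFF000000
print(as_signed32(mask), as_unsigned32(as_signed32(mask)) == mask)
```

The same bit pattern is a large positive number in one view and a negative
number in the other, which is why a range check cannot tell a mask from an
overflowed signed value.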

If we really believe that there's more code (in the world, not just in
the core CVS tree) that uses 'i' or 'l' for masks than that uses it
for signed values, we could fix 'i' and 'l' this way, and add new codes
for code that really wants signed values.  Still, all that code would
have to be fixed somehow and we would have to track it down.

And then we'd still be stuck with PyInt_AsLong() -- should it use the
same rule?  I hope not.

> I already fear that I have to come up with some sort of a fix 
> for the range-check warning (more than 6000 lines worth of 
> constant definitions that can currently be copied verbatim from 
> C header files to Python will have to be parsed, and computed, 
> and all these things can contain references to other constants, 
> strings and who knows what more, see Mac/Lib/Carbon/*.py), I 
> really could do without more work on my plate...

Before you start to panic, can you please try to import all those
modules and see how many cause warnings?  I only found one, line 11, but there are many files that I can't run because
they require extension modules I don't have.  I do note that none of
them generate warnings in the parser.  I also found a SyntaxError that
contradicts your assertion about "can currently be copied verbatim
from C header files".

--Guido van Rossum (home page:

From  Mon Aug 12 21:51:08 2002
From: (Jeff Epler)
Date: Mon, 12 Aug 2002 15:51:08 -0500
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
Message-ID: <>

Many Lisp interpreters use 'tagged types' to, among other things, let
small ints reside directly in the machine registers.

Python might wish to take advantage of this by designating pointers to odd
addresses stand for integers according to the following relationship:
    p = (i<<1) | 1
    i = (p>>1)
(due to alignment requirements on all common machines, all valid
pointers-to-struct have 0 in their low bit)  This means that all integers
which fit in 31 bits can be stored without actually allocating or deallocating
anything.
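
The tagging relationship can be modelled in a few lines of modern Python 3.
This is only a sketch of the arithmetic on a simulated 32-bit machine word;
the function names are hypothetical and nothing here touches real pointers.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: a 32-bit machine word


def tag_small_int(i):
    """Encode a 31-bit signed integer as an odd 'pointer' word."""
    if not -(1 << 30) <= i < (1 << 30):
        raise OverflowError("does not fit in 31 bits")
    return ((i << 1) | 1) & WORD_MASK


def is_small_int(word):
    """Real object pointers are aligned, so their low bit is 0."""
    return word & 1 == 1


def untag_small_int(word):
    """Decode with an arithmetic right shift of the signed word."""
    if word & 0x80000000:
        word -= 1 << 32
    return word >> 1


print(tag_small_int(5), untag_small_int(tag_small_int(-3)))
```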

I modified a Python interpreter to the point where it could run simple
programs.  The changes are unfortunately very invasive, because they
make any C code which simply executes
    o->ob_type
or otherwise dereferences a PyObject* invalid when presented with a
small int.  This would obviously affect a huge amount of existing code in
extensions, and is probably enough to stop this from being implemented
before Python 3000.

This also introduces another conditional branch in many pieces of code, such
as any call to PyObject_TypeCheck().

Performance results are mixed.  A small program designed to test the
speed of all-integer arithmetic comes out faster by 14% (3.38 vs 2.90
"user" time on my machine) but pystone comes out 5% slower (14124 vs 13358
"pystones/second").

I don't know if anybody's barked up this tree before, but I think
these results show that it's almost certainly not worth the effort to
incorporate this "performance" hack in Python.  I'll keep my tree around
for awhile, in case anybody else wants to see it, but beware that it
still has serious issues even in the core:
    >>> 0+0j
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: unsupported operand types for +: 'int' and 'complex'
    >>> (0).__class__
    Segmentation fault

PS The program that shows the advantage of this optimization is as follows:

j = 0
for k in range(10):
    for i in range(100000) + range(1<<30, (1<<30) + 100000):
        j = j ^ i
print j

From  Mon Aug 12 22:07:31 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 17:07:31 -0400
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: Your message of "Mon, 12 Aug 2002 15:51:08 CDT."
References: <>
Message-ID: <>

> Many Lisp interpreters use 'tagged types' to, among other things, let
> small ints reside directly in the machine registers.
> Python might wish to take advantage of this by designating pointers to odd
> addresses stand for integers according to the following relationship:
>     p = (i<<1) | 1
>     i = (p>>1)
> (due to alignment requirements on all common machines, all valid
> pointers-to-struct have 0 in their low bit)  This means that all integers
> which fit in 31 bits can be stored without actually allocating or deallocating
> anything.
> I modified a Python interpreter to the point where it could run simple
> programs.  The changes are unfortunately very invasive, because they
> make any C code which simply executes
>     o->ob_type
> or otherwise dereferences a PyObject* invalid when presented with a
> small int.  This would obviously affect a huge amount of existing code in
> extensions, and is probably enough to stop this from being implemented
> before Python 3000.
> This also introduces another conditional branch in many pieces of code, such
> as any call to PyObject_TypeCheck().
> Performance results are mixed.  A small program designed to test the
> speed of all-integer arithmetic comes out faster by 14% (3.38 vs 2.90
> "user" time on my machine) but pystone comes out 5% slower (14124 vs 13358
> "pystones/second").
> I don't know if anybody's barked up this tree before, but I think
> these results show that it's almost certainly not worth the effort to
> incorporate this "performance" hack in Python.  I'll keep my tree around
> for awhile, in case anybody else wants to see it, but beware that it
> still has serious issues even in the core:
>     >>> 0+0j
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     TypeError: unsupported operand types for +: 'int' and 'complex'
>     >>> (0).__class__
>     Segmentation fault

We used *exactly* this approach in ABC.  I decided not to go with it
in Python, for two reasons that are essentially what you write up
here: (1) the changes are very pervasive (in ABC, we kept finding
places where we had pointer-manipulating code that had to be fixed to
deal with the small ints), and (2) it wasn't at all clear if it was a
performance win in the end (all the extra tests and special cases
may cost as much as you gain).

In general, ABC tried to use many tricks from the books (e.g. it used
asymptotically optimal B-tree algorithms to represent dicts, lists and
strings, to guarantee performance of slicing and dicing operations for
strings of absurd lengths).  In Python I decided to stay away from
cleverness except when extensive performance analysis showed there was
a real need to speed something up.  That got us super-fast dicts, for
example, and .pyc files to cache the work of the (slow, but
trick-free) parser.

--Guido van Rossum (home page:

From  Mon Aug 12 22:24:40 2002
From: (Jack Jansen)
Date: Mon, 12 Aug 2002 23:24:40 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On maandag, augustus 12, 2002, at 10:49 , Guido van Rossum wrote:
>> Can we please make this change in a backwards-compatible way,
>> i.e. leave the i and l formats alone and use something new for
>> "range-checked-int" and "range-checked-long"?
> Um, they *already* do range checking.  'i' requires that the value
> (whether it comes from a Python int or a Python long) is in the range
> [INT_MIN, INT_MAX].  'l' doesn't do range checking on Python ints
> (because they are defined to fit in a C long), but for a Python long
> it requires that the value fits in the range of a signed C long,
> i.e. [-sys.maxint-1, sys.maxint].

Yes, but due to the way the parser works everything works fine
for me. In the constant definition file it says "foo = 0xff000000".
The parser turns this into a negative integer. Fine with me, the
bits are the same, and PyArg_Parse doesn't complain.

> If we really believe that there's more code (in the world, not just in
> the core CVS tree) that uses 'i' or 'l' for masks than that uses it
> for signed values, we could fix 'i' and 'l' this way, and add new codes
> for code that really wants signed values.  Still, all that code would
> have to be fixed somehow and we would have to track it down.

I think what it boils down to is what Python's model of the 
world is: C or mathematics. It used to be C, which is probably 
the one reason Python caught on initially (whereas ABC with its
mathematical model didn't, really). I can see the reason behind 
moving towards a more consistent world view, where integers are 
integers, be they 32 bits or more, where strings are strings, be 
they unicode or ascii, and I even agree with it, up to a point.

The drawback is that it will make it more difficult to interface 
Python to the real world, where integers have a size, characters 
are 8 bits, binary data is "char *" too, unicode has funny APIs, 
etc. And I happen to feel responsible for a lot of this real 
world interfacing code:-)

> Before you start to panic, can you please try to import all those
> modules

I just did so, see my other mail. You're right, the problem is 
theoretically big, but pretty manageable in practice.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Mon Aug 12 22:06:29 2002
From: (Jack Jansen)
Date: Mon, 12 Aug 2002 23:06:29 +0200
Subject: [Python-Dev] Correction: Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On maandag, augustus 12, 2002, at 10:18 , Jack Jansen wrote:
> I already fear that I have to come up with some sort of a fix 
> for the range-check warning (more than 6000 lines worth of 
> constant definitions that can currently be copied verbatim from 
> C header files to Python will have to be parsed, and computed, 
> and all these things can contain references to other constants, 
> strings and who knows what more, see Mac/Lib/Carbon/*.py), I 
> really could do without more work on my plate...

I'll retract this statement after a bit of research: it turns 
out there are only very few of those 6000 constants that 
actually run afoul of the warning, so I can fix those by hand.

That is, if there's actually a warning for every bad constant, 
not just once per module...
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Mon Aug 12 22:29:04 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 23:29:04 +0200
Subject: [Python-Dev] 32-bit values (was RE: [Python-checkins] python/dist/src/Lib/test,1.18,1.19)
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Though Tim brought up that then you won't always get back what you put
> in (if you put in a value > sys.maxint, it comes back negative).
> Is that a problem or not?  I tend to think that's not how this is most
> often used.

I withdraw my earlier recommendation: It *is* desirable that
struct.pack/unpack gives back the same value.
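
A short sketch of the round-trip property, using the struct module as it
behaves in modern Python 3 (which may differ from the 2002 behaviour being
debated). The mask value is just an example.

```python
import struct

mask = 0xFF000000

# Unsigned format: the value that goes in is the value that comes out.
round_trip = struct.unpack("<I", struct.pack("<I", mask))[0]

# Reading the same four bytes as a signed int gives the negative view.
reinterpreted = struct.unpack("<i", struct.pack("<I", mask))[0]

# Packing the large value with the signed format is rejected.
try:
    struct.pack("<i", mask)
    signed_pack_ok = True
except struct.error:
    signed_pack_ok = False

print(round_trip == mask, reinterpreted, signed_pack_ok)
```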


From  Mon Aug 12 22:29:12 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 17:29:12 -0400
Subject: [Python-Dev] Correction: Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 23:06:29 +0200."
References: <>
Message-ID: <>

> I'll retract this statement after a bit of research: it turns 
> out there are only very few of those 6000 constants that 
> actually run afoul of the warning, so I can fix those by hand.
> That is, if there's actually a warning for every bad constant, 
> not just once per module...

If you want to get the warnings for each line, use "python -Wall".
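
The difference between the default filter and showing every occurrence can
be sketched with the warnings module in modern Python 3. The "always"
action is used here as the programmatic counterpart of -Wall, which is my
understanding of the option rather than something stated in this thread;
the function emit and its message are made up.

```python
import warnings


def emit():
    # Hypothetical stand-in for a line that triggers the same warning.
    warnings.warn("hex constant out of range", DeprecationWarning)


def count_warnings(action):
    """Call emit() three times and count how many warnings get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(3):
            emit()
        return len(caught)


default_count = count_warnings("default")  # once per location
always_count = count_warnings("always")    # every occurrence
print(default_count, always_count)
```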

--Guido van Rossum (home page:

From  Mon Aug 12 22:31:26 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 17:31:26 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Mon, 12 Aug 2002 23:24:40 +0200."
References: <>
Message-ID: <>

> I think what it boils down to is what Python's model of the 
> world is: C or mathematics. It used to be C, which is probably 
> the one reason Python caught on initially (whereas ABC with it's 
> mathematical model didn't, really). I can see the reason behind 
> moving towards a more consistent world view, where integers are 
> integers, be they 32 bits or more, where strings are strings, be 
> they unicode or ascii, and I even agree with it, up to a point.
> The drawback is that it will make it more difficult to interface 
> Python to the real world, where integers have a size, characters 
> are 8 bits, binary data is "char *" too, unicode has funny APIs, 
> etc. And I happen to feel responsible for a lot of this real 
> world interfacing code:-)

The issue is not that the new approach makes it more difficult to
interface to the real world.  The issue is that you have to change how
you interface to the real world.  Writing something from scratch that
uses the new approach won't take any more work.  It's the backwards
compatibility that bites you.

--Guido van Rossum (home page:

From  Mon Aug 12 22:41:38 2002
From: (Martin v. Loewis)
Date: 12 Aug 2002 23:41:38 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <> <>
Message-ID: <>

"M.-A. Lemburg" <> writes:

> The problem is the same as what we had with the
> module early on in the 1.6 alphas: if this module isn't found
> all kinds of things start failing. The same would happen when
> you start to use builtin codecs which have external error handler
> implementation as .py files, e.g. unicode('utf-8', 'replace')
> could then fail because of an ImportError.

What kinds of things would start failing? If you get an interactive
prompt (i.e. Python still manages to start up), or you get a traceback
indicating the problem in non-interactive mode, I don't see this as a
problem - *of course* Python will stop working if you remove essential

This is like saying you expect the interpreter to continue to work
after you remove python23.dll.

So, if your worry is that things would not work if you remove a Python
file - don't worry. Python already relies on Python files being
present in various places.

> For the charmap codec it's mostly about performance. I don't
> have objections for other codecs which rely on external
> resources.

Please remember that we are still about error handling here, and that
the normal case will be "strict", which usually results in aborting
the computation.

So I don't see the performance issue even for the charmap codec.


From  Mon Aug 12 23:12:31 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:12:31 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <> <>
Message-ID: <>

Walter Dörwald <> writes:

> Output is as follows:
> 1790000 chars, 2.330% unenc
> ignore: 0.022 (factor=1.000)
> xmlcharrefreplace: 0.044 (factor=1.962)
> xml2: 0.267 (factor=12.003)
> xml3: 0.723 (factor=32.506)
> workaround: 5.151 (factor=231.702)
> i.e. a 1.7MB string with 2.3% unencodable characters was
> encoded.

Those numbers are impressive. Can you please add

def xml4(exc):
  if isinstance(exc, UnicodeEncodeError):
    if exc.end-exc.start == 1:
      return u"&#"+str(ord(exc.object[exc.start]))+u";"
    else:
      r = []
      for c in exc.object[exc.start:exc.end]:
        r.extend([u"&#", str(ord(c)), u";"])
      return u"".join(r)
  else:
    raise TypeError("don't know how to handle %r" % exc)

and report how that performs (assuming I made no error)?
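
For readers who want to try this today, here is a minimal sketch of how such a handler could be registered and checked against the built-in one. It assumes present-day Python 3 and the callback protocol of codecs.register_error, where the handler returns a (replacement, resume position) tuple; the draft under discussion here may have differed. The helper name encode_with is hypothetical.

```python
import codecs


def xml4(exc):
    """Replace each unencodable character with an XML character reference."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("&#%d;" % ord(c) for c in bad)
        # Registered handlers return the replacement text and the
        # position at which encoding should resume.
        return replacement, exc.end
    raise TypeError("don't know how to handle %r" % exc)


codecs.register_error("xml4", xml4)


def encode_with(text, errors):
    """Encode text to ASCII using the named error handler."""
    return text.encode("ascii", errors)
```

Timing encode_with(text, "xml4") against encode_with(text, "xmlcharrefreplace") with the timeit module would give the C-versus-Python comparison asked for above.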

> Using a callback instead of the inline implementation is a factor of
> 12 slower than ignore.

For the purpose of comparing C and Python, this isn't relevant, is it?
Only the C version of xmlcharrefreplace and a Python version should be

> It can't really be fixed for codecs implemented in Python. For codecs
> that use the C functions we could add the functionality that e.g.
> PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
> but AFAICT it can't be done easily for Python where attribute assignment
> directly goes to the instance dict.

You could add methods into the class set_reason etc, which error
handler authors would have to use.

Again, these methods could be added through Python code, so no C code
would be necessary to implement them.

You could even implement a setattr method in Python - although you'd
have to search this from C while initializing the class.
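
A rough sketch of the set_reason idea, written against present-day Python 3 as an assumption: the class name SyncedEncodeError is hypothetical, and the reason sits at index 4 of args in today's UnicodeEncodeError rather than the index 3 mentioned in the quoted text.

```python
class SyncedEncodeError(UnicodeEncodeError):
    """UnicodeEncodeError whose setter keeps .reason and .args consistent."""

    def set_reason(self, reason):
        # Update the attribute that error handlers read ...
        self.reason = reason
        # ... and rebuild args so both views of the exception agree.
        # args is (encoding, object, start, end, reason) in Python 3.
        self.args = self.args[:4] + (reason,)
```

Error handler authors would call exc.set_reason(...) instead of assigning to exc.reason directly, which is the discipline the paragraph above describes.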


From  Mon Aug 12 23:27:51 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:27:51 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Trent Mick <> writes:

> We (ActiveState) are committed to getting the functionality in -- which
> means that David and myself can help with coding and testing.  Depending
> on the schedule, I can help with e.g. filling out the test suite.  We
> can also setup some test machines to test things that are probably too
> odd to fit in the test suite.
> Martin, if there is anything we can do to help with for this patch,
> please let us know.

At the moment, a review would be most appreciated. Please consider
Guido's comments as being taken care of, except for the test suite.

Neil's original patch ( has a number of test cases,
for various file names, and for open, os.{stat, rename, mkdir, chdir,
_getfullpathname}. You could start from that, or make your own test
cases - "full coverage" (in terms of tested functions) appears to be
desirable, and it should not print to stdout, but assert things

In addition, I have no W9x system. I assume Neil is right in his
analysis that the *W functions are not useful/not available on W9x?
If so, somebody should test that the resulting Python binary still
runs on W9x, falling back to the FileSystemDefaultEncoding.


From  Mon Aug 12 23:34:54 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:34:54 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> This is the bit I still don't like (at least, if I'm not mistaken I
> commented on it a while ago too). A routine could be doing an
> os.list() expecting strings, but suddenly someone passes it a
> unicode directoryname and the return value would change.

Sure, but within reasonable limitations, "nothing bad" would happen:
those file names most likely use only ASCII, so the default encoding
treats them nicely wherever they appear.

> I would much prefer an optional encoding argument whereby you give the
> encoding in which you want the return value. Default would be the
> local filesystem encoding. If you pass unicode you will get direct
> unicode on XP/2K, and a converted string on other platforms (but
> always unicode).

I would not like that. First of all, it isn't any more portable than
PEP 277: on Unix, to implement that feature, you'll have to know the
encoding of filenames on disk first - which alone is tricky.

Furthermore, it is easy to implement that on top of PEP 277: just
write a wrapper that encodes the result.
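
Such a wrapper might look like the sketch below. It assumes present-day Python 3, where os.listdir returns text when given a text path; the function name listdir_encoded and its defaults are hypothetical.

```python
import os


def listdir_encoded(path, encoding="utf-8", errors="strict"):
    """List a directory and return the entries as byte strings.

    The path is decoded first so that os.listdir yields text names,
    which are then encoded in the caller's chosen encoding.
    """
    names = os.listdir(os.fsdecode(path))
    return [name.encode(encoding, errors) for name in names]
```

The encoding choice stays entirely in application code, so the file system API itself needs no encoding argument.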

> Oh yes, the same reasoning would hold for readlink(), getcwd() and
> any other call that returns filenames.

These are more tricky, indeed. Fortunately, they are not in the domain
of PEP 277: readlink is not supported on Windows, and getcwd not
considered in the PEP. If that is an issue, I'd add a "return_unicode"
flag to getcwd.

Allowing the application to specify an encoding at the file system API
is not really helpful, as the encoding at the file system API is
usually mandated by the application.


From  Mon Aug 12 23:13:16 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:13:16 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
In-Reply-To: <>
References: <> <>
 <> <>
Message-ID: <>

Walter Dörwald <> writes:

> At least "strict" should be implemented inline, because reading
> broken .pyc files which contain (utf-8 encoded) unicode constants
> would probably lead to all kinds of interesting problems.

If you have a broken .pyc file, you have more problems than that,


From  Mon Aug 12 23:50:51 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:50:51 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> But shouldn't it return Unicode whenever there are filenames in the
> directory that can't be represented as ASCII?

Unfortunately, on Windows, there is no way to find out: If you use the
ANSI function (which not only covers ASCII, but the full user's code
page), and you have a file name not representable in this code page,
the system returns a file name that contains question marks.

Of course, you could always use the Win32 Wide API (unicode) function,
and convert the pure-ASCII strings into byte strings. That gives a
number of options:
- always return Unicode for Unicode directory argument,
- return Unicode only for non-ASCII, and only for Unicode argument,
- return Unicode only for non-ASCII, regardless of Unicode argument,
- return Unicode only for non-MBCS (again depending or not depending
  on whether the argument is Unicode).

In the third case, if you have a non-representable file name, you
currently get a string like "??????.txt", whereas you then get
u"\uabcd\uefgh...txt". What might be worse: If the file name is
representable in "mbcs", yet outside ASCII, you currently get a "good"
byte string, and you get a Unicode string under option three.

So the MBCS options sound better. Unfortunately, testing whether a
string encodes as MBCS might be expensive.
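
To make the third option concrete, here is a small sketch in present-day Python 3, where bytes plays the role of the 8-bit string. The function name narrow_if_ascii is hypothetical, and the input is assumed to be the names as returned by a wide (Unicode) API.

```python
def narrow_if_ascii(names):
    """Return byte strings for pure-ASCII names and text for the rest."""
    result = []
    for name in names:
        try:
            result.append(name.encode("ascii"))
        except UnicodeEncodeError:
            # Not representable in ASCII: hand back the Unicode name.
            result.append(name)
    return result
```

Swapping "ascii" for the ANSI code page would give the MBCS variant, at the cost of a trial encode per name.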

> Hm, I don't know if I'd like os.listdir() to have an encoding
> argument.  Sounds like the wrong solution somehow.

I don't like that, either.

> > Oh yes, the same reasoning would hold for readlink(), getcwd() 
> > and any other call that returns filenames.
> Ditto.

For readlink, if you trust FileSystemDefaultEncoding, you could return
a Unicode object if you find non-ASCII in the link contents.

For getcwd, you again have the issue of reliably detecting non-ASCII
if you use the ANSI function; if you use the Wide function, you again
have the choice of returning Unicode only if non-ASCII, or only if
non-MBCS.


From  Mon Aug 12 23:59:41 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 00:59:41 +0200
Subject: [Python-Dev] Re: [PythonLabs] PEP 2
In-Reply-To: <>
References: <>
Message-ID: <>

Martijn Faassen <> writes:

> [Barry]
>  I was going to point David at PEP 2 as the guidelines for getting modules 
>  added to the standard library, but I don't think PEP 2 really describes 
>  current practice.

> What PEP 2 tries to supply is a procedure to follow if people
> have already decided they would like to try to get a module or set of
> modules accepted into the standard library. They can decide this before
> or after they write the module; the PEP doesn't care -- as long as the
> module is there when they submit the library PEP. At least they know
> there'll be Integrators that will review things, and they know they had
> better come up with some maintainers before submitting the PEP.

I always read the PEP in precisely that way, and I think it is just
fine as it stands.

*Of course*, the BDFL can decide to incorporate any new modules any
time he wants. The PEP is to give people a guideline if they want to
get a module "in" that the BDFL doesn't outright want: they need to
offer supporting it, and they need to document it, provide test cases,
etc - then there is a good chance that the BDFL won't object.

This also gives the BDFL the explicit power to remove the module when
problems surface with it and the original authors ran away - it
essentially ties contributors to their contribution, which I see as a
good thing.


From  Tue Aug 13 00:04:48 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 01:04:48 +0200
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> In Python I decided to stay away from cleverness except when
> extensive performance analysis showed there was a real need to speed
> something up.  That got us super-fast dicts, for example, and .pyc
> files to cache the work of the (slow, but trick-free) parser.

For small ints, it also got you the small int cache, which has nearly
the same storage requirements as a pointer-as-int, and is probably as
expensive (you can drop the tests for odd addresses, but need to add
increfs and decrefs for ints).


From  Mon Aug 12 23:21:20 2002
From: (Gordon McMillan)
Date: Mon, 12 Aug 2002 18:21:20 -0400
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: <>
References: Your message of "Mon, 12 Aug 2002 15:51:08 CDT." <>
Message-ID: <3D57FCA0.30691.64578AF5@localhost>

On 12 Aug 2002 at 17:07, Guido van Rossum wrote:

> ...  In Python I decided to
> stay away from cleverness except when extensive
> performance analysis showed there was a real need to
> speed something up. 

For which some of us are very grateful, even
without Tim nagging us.

-- Gordon

From  Tue Aug 13 00:57:17 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 01:57:17 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> Yes, but due to the way the parser works everything works fine for
> me. In the constant definition file it says "foo = 0xff000000". The
> parser turns this into a negative integer. Fine with me, the bits
> are the same, and PyArg_Parse doesn't complain.

Please notice that this will stop working some day: 0xff000000 will be
a positive number, and the "i" parser will raise an OverflowError.

By that time, you might be using the "k" parser, which will accept
0xff000000 both as a negative and a positive number, and fill the int
with 0xff000000.

Before that happens, you might want to anticipate that problem, and
propose an implementation that means minimum changes for you - it then
will likely mean minimum changes for everybody else, as well. Perhaps
"k" isn't such a good solution, perhaps "I" is better, or perhaps "i"
should weaken its range checking, and emit a deprecationwarning when
an unsigned number is passed.
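
The difference between the two converters can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, assuming a 32-bit C int; parse_i and parse_k are hypothetical stand-ins, not the real PyArg_Parse code.

```python
INT_MIN = -2**31        # assumed 32-bit C int
INT_MAX = 2**31 - 1


def parse_i(value):
    """Strict signed conversion: reject anything outside [INT_MIN, INT_MAX]."""
    if not INT_MIN <= value <= INT_MAX:
        raise OverflowError("signed integer out of range: %d" % value)
    return value


def parse_k(value):
    """Lenient conversion: accept signed or unsigned spellings of a
    32-bit pattern and return that pattern as a nonnegative number."""
    if not INT_MIN <= value <= 2 * INT_MAX + 1:
        raise OverflowError("value does not fit in 32 bits: %d" % value)
    return value & 0xFFFFFFFF
```

Under this model parse_k yields the same bits for -16777216 and for 0xff000000 read as a positive number, while parse_i rejects the latter.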


From  Tue Aug 13 01:15:40 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 20:15:40 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 00:50:51 +0200."
References: <> <>
Message-ID: <>

> Unfortunately, on Windows, there is no way to find out: If you use the
> ANSI function (which not only covers ASCII, but the full user's code
> page), and you have a file name not representable in this code page,
> the system returns a file name that contains question marks.
> Of course, you could always use the Win32 Wide API (unicode) function,
> and convert the pure-ASCII strings into byte strings. That gives a
> number of options:
> - always return Unicode for Unicode directory argument,
> - return Unicode only for non-ASCII, and only for Unicode argument,
> - return Unicode only for non-ASCII, regardless of Unicode argument,
> - return Unicode only for non-MBCS (again depending or not depending
>   on whether the argument is Unicode).
> In the third case, if you have a non-representable file name, you
> currently get a string like "??????.txt", whereas you then get
> u"\uabcd\uefgh...txt". What might be worse: If the file name is
> representable in "mbcs", yet outside ASCII, you currently get a "good"
> byte string, and you get a Unicode string under option three.

Why is getting Unicode worse than getting MBCS?  #3 looks right to me...

> So the MBCS options sound better. Unfortunately, testing whether a
> string encodes as MBCS might be expensive.

I still don't fully understand MBCS.  I know there's a variable
assignment of codes to the upper half of the 8-bit space, based on a
user setting.  But is that always a simple mapping to 128 non-ASCII
characters, or are there multi-byte codes that expand the total
character set to more than 256?

> For readlink, if you trust FileSystemDefaultEncoding, you could return
> a Unicode object if you find non-ASCII in the link contents.

What is FileSystemDefaultEncoding and when can you trust it?

> For getcwd, you again have the issue of reliably detecting non-ASCII
> if you use the ANSI function; if you use the Wide function, you again
> have the choice of returning Unicode only if non-ASCII, or only if
> non-MBCS.

Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
app has to deal with MBCS the better, it seems to me.

--Guido van Rossum (home page:

From  Tue Aug 13 01:27:16 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 20:27:16 -0400
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: Your message of "Tue, 13 Aug 2002 01:04:48 +0200."
References: <> <>
Message-ID: <>

> > In Python I decided to stay away from cleverness except when
> > extensive performance analysis showed there was a real need to speed
> > something up.  That got us super-fast dicts, for example, and .pyc
> > files to cache the work of the (slow, but trick-free) parser.
> For small ints, it also got you the small int cache, which has nearly
> the same storage requirements as a pointer-as-int, and is probably as
> expensive (you can drop the tests for odd addresses, but need to add
> increfs and decrefs for ints).

But the increfs and decrefs for ints are goodness, because they
simplify the code.  You can incref/decref any object without having to
know its type.

--Guido van Rossum (home page:

From  Tue Aug 13 01:34:00 2002
From: (Guido van Rossum)
Date: Mon, 12 Aug 2002 20:34:00 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 01:57:17 +0200."
References: <>
Message-ID: <>

> Jack Jansen <> writes:
> > Yes, but due to the way the parser works everything works fine for
> > me. In the constant definition file it says "foo = 0xff000000". The
> > parser turns this into a negative integer. Fine with me, the bits
> > are the same, and PyArg_Parse doesn't complain.

> Please notice that this will stop working some day: 0xff000000 will be
> a positive number, and the "i" parser will raise an OverflowError.
> By that time, you might be using the "k" parser, which will accept
> 0xff000000 both as a negative and a positive number, and fill the int
> with 0xff000000.
> Before that happens, you might want to anticipate that problem, and
> propose an implementation that means minimum changes for you - it then
> will likely mean minimum changes for everybody else, as well. Perhaps
> "k" isn't such a good solution, perhaps "I" is better, or perhaps "i"
> should weaken its range checking, and emit a deprecationwarning when
> an unsigned number is passed.

Why is it so hard to get people to think about what they need?  (I
mean beyond "I don't want anything to change" or vague things like
that.)  I am looking for an API that will make developers like Jack as
well as other extension developers happy, but it feels like pulling
teeth.  Have I not explained the issues and boundary conditions
clearly enough?  (About the only non-negotiable thing is that at some
point there shall be no difference in how ints and longs with the same
mathematical value are treated; and the fact that 0xffffffff shall be
a positive number whose value is 4294967295.)

--Guido van Rossum (home page:

From  Tue Aug 13 01:44:46 2002
From: (Tim Peters)
Date: Mon, 12 Aug 2002 20:44:46 -0400
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: <>
Message-ID: <>

[Jeff Epler]
> Many Lisp interpreters use 'tagged types' to, among other things, let
> small ints reside directly in the machine registers.

And many Lisp interpreters derive from ones written for once-trendy Lisp
hardware, which had special support for tag bits.  Simulating this in
software is a PITA.

> (due to alignment requirements on all common machines, all valid
> pointers-to-struct have 0 in their low bit)

Not so on word-addressed machines, though, or on machines using low-order
pointer bits for their own notion of tag bits.
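
For readers unfamiliar with the scheme being discussed, here is a toy model in Python, with plain integers standing in for machine words. It assumes the convention from the quoted text (aligned pointers have a 0 low bit, so a 1 low bit can mark a small int); the function names are hypothetical.

```python
def tag_int(n):
    """Pack a small integer into a word whose low bit is 1."""
    return (n << 1) | 1


def is_tagged_int(word):
    """A word with its low bit set is an int; otherwise it is a pointer."""
    return word & 1 == 1


def untag_int(word):
    """Recover the integer; the arithmetic shift preserves the sign."""
    return word >> 1
```

Every operation on such a word needs the is_tagged_int test first, which is the software cost referred to above.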

patiently-awaiting-seymour-cray's-resurrection-ly y'rs  - tim

From  Tue Aug 13 04:59:11 2002
From: (Dan Sugalski)
Date: Mon, 12 Aug 2002 23:59:11 -0400
Subject: [Python-Dev] Bugs in the python grammar?
Message-ID: <a05111b09b97e3419ee68@[]>

We've been digging through the python grammar, looking to build up a 
parser for it, and have come across what look to be bugs:

In :

a_expr ::=
              m_expr | aexpr "+" m_expr
               aexpr "-" m_expr

		aexpr "-" m_expr

should be:

		| aexpr "-" m_expr

lambda_form ::=
	"lambda" [parameter_list]: expression

'[]:' doesn't make much sense. Do you mean:

	"lambda" [parameter_list]":" expression

parameter_list ::=
              (defparameter ",")*
                 ("*" identifier [, "**" identifier]
                 | "**" identifier
                   | defparameter [","])

		("*" identifier [, "**" identifier]

should be:
		("*" identifier ["," "**" identifier]

These known issues? Or have we mis-analyzed things somewhere?

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai                         have teddy bears and even
                                       teddy bears get drunk

From  Tue Aug 13 07:15:06 2002
From: (Oren Tirosh)
Date: Tue, 13 Aug 2002 02:15:06 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Tue, Aug 13, 2002 at 01:57:17AM +0200, Martin v. Loewis wrote:
> Jack Jansen <> writes:
> > Yes, but due to the way the parser works everything works fine for
> > me. In the constant definition file it says "foo = 0xff000000". The
> > parser turns this into a negative integer. Fine with me, the bits
> > are the same, and PyArg_Parse doesn't complain.
> Please notice that this will stop working some day: 0xff000000 will be
> a positive number, and the "i" parser will raise an OverflowError.

The problem is that many programmers have 0xFFFFFFFF pretty much hard-wired
into their brains as -1. How about treating plain hexadecimal literals as
signed 32 bits regardless of platform integer size?  I think that this will
produce the smallest number of incompatibilities for existing code and
maintain compatibility with C header files on 32 bit platforms. In this case 
0xff000000 will always be interpreted as -16777216 and the 'i' parser will 
happily convert it to either 0xFF000000 or 0xFFFFFFFFFF000000, depending on
the native platform word size - which is probably what the programmer
meant.

I don't think that interpreting 0xFF000000 as 4278190080 will really help
anyone. This includes users of 64 bit platforms - I don't think many of them
actually think of 0xFF000000 as 4278190080. To them it probably means "Danger,
Will Robinson! Unportable code!". So what's the point of having Python 
interpret it as 4278190080?  If what I really meant was 4278190080 I can
represent it portably as 0xFF000000L and in this case the 'i' parser will 
complain on 32 bit platforms - with a good reason.

To support header files that can be included from Python and C and produce
unambiguous results on both 32 and 64 bit platforms it is possible to add
support for the C suffixes UL/LU and ULL/LLU.  
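
The arithmetic behind this proposal can be checked with a short sketch. The function names are hypothetical, and the code only models the suggested reading of hex literals; it is not how any Python release parses them.

```python
def hex_literal_as_signed32(text):
    """Read a hex literal of at most 32 bits as a signed 32-bit value."""
    value = int(text, 16)
    if value > 0xFFFFFFFF:
        raise ValueError("literal wider than 32 bits: %s" % text)
    # A set top bit means the two's-complement value is negative.
    return value - (1 << 32) if value & 0x80000000 else value


def bit_pattern(value, width):
    """Two's-complement bit pattern of value in a word of the given width."""
    return value & ((1 << width) - 1)
```

With these, "0xff000000" reads as -16777216, and its pattern is 0xFF000000 in a 32-bit word and 0xFFFFFFFFFF000000 in a 64-bit word.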


From  Tue Aug 13 07:51:28 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 08:51:28 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Why is getting Unicode worse than getting MBCS?  #3 looks right to me...

If people do

out = open("names.txt","w")
for f in os.listdir("."):
  print >>out, f

then this will print all filenames in mbcs. Under your proposed
change, it will raise a UnicodeError.
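
The failure mode can be reproduced in present-day Python 3 with an explicit encode step standing in for the implicit one that print performed on Unicode objects. This is a sketch only; dump_names is a hypothetical helper, and cp1252 is used as an example of an ANSI code page.

```python
def dump_names(names, encoding):
    """Encode one file name per line, as a byte-oriented file would store them."""
    return b"".join(name.encode(encoding) + b"\n" for name in names)
```

Encoding a name such as "séminaire" succeeds with "cp1252" but raises UnicodeEncodeError (a subclass of UnicodeError) with "ascii", which is what an unsuspecting script would hit.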

> I still don't fully understand MBCS.  I know there's a variable
> assignment of codes to the upper half of the 8-bit space, based on a
> user setting.  But is that always a simple mapping to 128 non-ASCII
> characters, or are there multi-byte codes that expand the total
> character set to more than 256?

Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
code page", CP_ACP, which varies with the localization. They currently
use:

code page region                 encoding style
1250      Central Europe         8-bit
1251      Cyrillic               8-bit
1252      Western Europe         8-bit
1253      Greek                  8-bit
1254      Turkish                8-bit
1255      Hebrew                 8-bit
1256      Arabic                 8-bit
1257      Baltic                 8-bit
1258      Vietnamese             8-bit

874       Thai                   multi-byte
932       Japan                  Shift-JIS, multi-byte
936       Simplified Chinese     GB2312, multi-byte
949       Korea                  multi-byte
950       Traditional Chinese    BIG5, multi-byte

The multi-byte codes fall in two categories: those that use bytes <128
for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
latter ones restrict themselves to bytes >=128 for multi-byte
characters (I believe this is what the Shift in Shift-JIS tries to
indicate).
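
The difference between the 8-bit and multi-byte code pages is easy to observe with the codecs that ship with present-day Python. A sketch, with encoded_width as a hypothetical helper:

```python
def encoded_width(char, code_page):
    """Number of bytes one character occupies in the given code page."""
    return len(char.encode(code_page))
```

For example, "é" takes one byte in "cp1252", while the kanji "日" takes two bytes in "cp932" and an ASCII letter still takes one.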

> > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > a Unicode object if you find non-ASCII in the link contents.
> What is FileSystemDefaultEncoding and when can you trust it?

It's a global variable (really called Py_FileSystemDefaultEncoding),
introduced by Mark Hammond, and should be set to the encoding that the
operating system uses to encode file names, on the file system API.

On Windows, this is reliably CP_ACP/"mbcs". On Unix, it is the
locale's encoding by convention, which is set only if
setlocale(LC_CTYPE,"") was called. Some Unix users may not follow the
convention, or may have file names which cannot be represented in
their locale's encoding.
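
From Python code the same setting can be inspected through sys.getfilesystemencoding(). The sketch below assumes a present-day Python 3; filesystem_encoding is a hypothetical wrapper that just normalizes the codec name.

```python
import codecs
import sys


def filesystem_encoding():
    """Return the normalized name of the codec Python uses for file names."""
    return codecs.lookup(sys.getfilesystemencoding()).name
```

The value reported depends on platform and configuration, so it should be read at run time rather than assumed.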

> Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> app has to deal with MBCS the better, it seems to me.

Ok, I'll update the PEP.

You may have been under the impression that MBCS is only relevant in
Far East, so let me stress this point: It applies to all windows
versions, e.g. a user of a French installation who has a file named
C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
currently gets a byte string when listing C:\Docs\Boulot, but will
get a Unicode string under the modified PEP 277.


From  Tue Aug 13 08:07:49 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 09:07:49 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> The problem is that many programmers have 0xFFFFFFFF pretty much hard-wired
> into their brains as -1. How about treating plain hexadecimal literals as
> signed 32 bits regardless of platform integer size?  

The idea is that, for any sequence S of digits, S, and SL, should mean
the same thing. So 0xFFFFFFFF should mean the same thing as
0xFFFFFFFFL.

Programmers may have that hard-wired, but this is not a problem; a
problem only arises if their code breaks. In many cases, it won't.

> I think that this will produce the smallest number of
> incompatibilities for existing code and maintain compatibility with
> C header files on 32 bit platforms. In this case 0xff000000 will
> always be interpreted as -16777216 and the 'i' parser will happily
> convert it to either 0xFF000000 or 0xFFFFFFFFFF000000, depending on
> the native platform word size - which is probably what the
> programmer meant.

This means you suggest that PEP 237 is not implemented, or at least
frozen at the current stage.

> So what's the point of having Python interpret it as 4278190080?

It allows ints and longs to be unified, see PEP 237.

> If what I really meant was 4278190080 I can represent it portably as
> 0xFF000000L and in this case the 'i' parser will complain on 32 bit
> platforms - with a good reason.

Yes, but the L suffix will go away one day.


From  Tue Aug 13 10:11:01 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 05:11:01 -0400
Subject: [Python-Dev] Bugs in the python grammar?
In-Reply-To: Your message of "Mon, 12 Aug 2002 23:59:11 EDT."
References: <a05111b09b97e3419ee68@[]>
Message-ID: <>

> We've been digging through the python grammar, looking to build up a 
> parser for it, and have come across what look to be bugs:
> In :

I don't know where that file comes from; it's not the official
grammar.  Fred will fix the typos you found.

This one is correct (we use it to generate our parser):

Or download Python and look at Grammar/Grammar .

--Guido van Rossum (home page:

From  Tue Aug 13 10:41:54 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 05:41:54 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 08:51:28 +0200."
References: <> <> <> <>
Message-ID: <>

> > Why is getting Unicode worse than getting MBCS?  #3 looks right to me...
> If people do
> out = open("names.txt","w")
> for f in os.listdir("."):
>   print >>out, f
> then this will print all filenames in mbcs. Under your proposed
> change, it will raise a UnicodeError.

OK, you've convinced me.  I guess the best compromise then is 8-bit
in, MBCS out, and Unicode in, Unicode out.
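
This "type in, same type out" rule can be observed in present-day Python 3, where os.listdir returns bytes for a bytes path and text for a text path. A sketch, with listdir_both as a hypothetical helper:

```python
import os


def listdir_both(path):
    """List a directory twice: once with a text path, once with a bytes path."""
    text_names = os.listdir(os.fsdecode(path))
    byte_names = os.listdir(os.fsencode(path))
    return text_names, byte_names
```

The first list contains only str objects and the second only bytes objects, so a caller that passes 8-bit paths never receives Unicode unexpectedly.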

> > I still don't fully understand MBCS.  I know there's a variable
> > assignment of codes to the upper half of the 8-bit space, based on a
> > user setting.  But is that always a simple mapping to 128 non-ASCII
> > characters, or are there multi-byte codes that expand the total
> > character set to more than 256?
> Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
> code page", CP_ACP, which varies with the localization. They currently
> use:
> code page region                 encoding style
> 1250      Central Europe         8-bit
> 1251      Cyrillic               8-bit
> 1252      Western Europe         8-bit
> 1253      Greek                  8-bit
> 1254      Turkish                8-bit
> 1255      Hebrew                 8-bit
> 1256      Arabic                 8-bit
> 1257      Baltic                 8-bit
> 1258      Vietnamese             8-bit
> 874       Thai                   multi-byte
> 932       Japan                  Shift-JIS, multi-byte
> 936       Simplified Chinese     GB2312, multi-byte
> 949       Korea                  multi-byte
> 950       Traditional Chinese    BIG5, multi-byte
> The multi-byte codes fall in two categories: those that use bytes <128
> for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
> latter ones restrict themselves to bytes >=128 for multi-byte
> characters (I believe this is what the Shift in Shift-JIS tries to
> indicate).

Aha!  So MBCS is not an encoding: it's an indirection for a variety of
encodings.  (Is there a way to find out what the encoding is?)

> > > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > > a Unicode object if you find non-ASCII in the link contents.
> > 
> > What is FileSystemDefaultEncoding and when can you trust it?
> It's a global variable (really called Py_FileSystemDefaultEncoding),
> introduced by Mark Hammond, and should be set to the encoding that the
> operating system uses to encode file names, on the file system API.
> On Windows, this is reliably CP_ACP/"mbcs".

Do you mean that the condition on

#if defined(HAVE_LANGINFO_H) && defined(CODESET)

is reliably false on Windows?  Otherwise _locale.setlocale() could set

> On Unix, it is the locale's encoding by convention, which is set
> only if setlocale(LC_CTYPE,"") was called. Some Unix users may not
> follow the convention, or may have file names which cannot be
> represented in their locale's encoding.

So as long as they use 8-bit it's not our problem, right.  Another
reason to avoid producing Unicode without a clue that the app expects
Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
sufficient clue.  IOW os.listdir(u".") should return Unicode.)

> > Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> > app has to deal with MBCS the better, it seems to me.
> Ok, I'll update the PEP.

To what?  (It would be bad if I convinced you at the same time you
convinced me of the opposite. :-)

> You may have been under the impression that MBCS is only relevant in
> Far East, so let me stress this point: It applies to all windows
> versions, e.g. a user of a French installation who has a file named
> C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
> currently gets a byte string when listing C:\Docs\Boulot, but will
> get a Unicode string under the modified PEP 277.

No, I was aware of that part.  I guess they should get MBCS on
os.listdir('C:\\Docs\\Boulot') but Unicode on

--Guido van Rossum (home page:

From  Tue Aug 13 11:02:59 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 12:02:59 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>


> Before that happens, you might want to anticipate that problem, and
> propose an implementation that means minimum changes for you - it then
> will likely mean minimum changes for everybody else, as well. Perhaps
> "k" isn't such a good solution, perhaps "I" is better, or perhaps "i"
> should weaken its range checking, and emit a deprecationwarning when
> an unsigned number is passed.

The least amount of work for me would be caused by keeping "i" semantics 
as they are, of course.

If we switch to "k" for integers in the range -2**31..2**31-1 that 
would not be too much work, as a lot of the code is generated (I would 
take the quick and dirty approach of using k for all my integers). Only 
the hand-written code would have to be massaged by hand.

If we have only pure signed and pure unsigned converters it would mean 
an extraordinary amount of work, but luckily it seems that that is not 
going to happen.

On Tuesday, August 13, 2002, at 02:34 , Guido van Rossum wrote:
> Why is it so hard to get people to think about what they need?  (I
> mean beyond "I don't want anything to change" or vague things like
> that.  I am looking for an API that will make developers like Jack as
> well as other extension developers happy, but it feels like pulling
> teeth.

It feels that way because pulling teeth is probably exactly the right 
analogy: what you're doing is probably a good idea in the long run, but 
right now it hurts...

- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Tue Aug 13 11:20:31 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 06:20:31 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 12:02:59 +0200."
References: <>
Message-ID: <>

> The least amount of work for me would be caused by keeping "i" semantics 
> as they are, of course.

So you're using 'i', not 'l'?  Any particular reason?

> If we switch to "k" for integers in the range -2**31..2**31-1 that 
> would not be too much work, as a lot of the code is generated (I would 
> take the quick and dirty approach of using k for all my integers). Only 
> the hand-written code would have to be massaged by hand.

Glad, that's my preferred choice too.  But note that in Python 2.4 and
beyond, 'k' will only accept positive inputs, so you'll really have to
find a way to mark your signed integer arguments up differently.

In 2.3 (and 2.2.2), I propose the following semantics for 'k': if the
argument is a Python int, a signed value within the range
[INT_MIN,INT_MAX] is required; if it is a Python long, a nonnegative
value in the range [0, 2*INT_MAX+1] is required.  These are the same
semantics that are currently used by struct.pack() for 'L', I found
out; I like these.
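
As a rough illustration of the range rule proposed above, here is a sketch in present-day Python. Modern Python has no separate int and long types, so the is_long flag, the name check_k and the 32-bit INT_MAX are assumptions made purely for illustration; this is not the PyArg_ParseTuple code.

```python
INT_MAX = 2**31 - 1          # assumes a 32-bit C int
INT_MIN = -INT_MAX - 1


def check_k(value, is_long):
    """Return the 32-bit pattern for value, or raise OverflowError.

    is_long is a hypothetical flag standing in for the Python 2
    int/long distinction, which no longer exists in Python 3.
    """
    if is_long:
        lo, hi = 0, 2 * INT_MAX + 1      # "long": nonnegative only
    else:
        lo, hi = INT_MIN, INT_MAX        # "int": full signed range
    if not lo <= value <= hi:
        raise OverflowError("value out of range for 'k': %r" % (value,))
    return value & 0xFFFFFFFF
```

Under this rule a negative "int" such as -1 and the "long" 0xffffffff yield the same bit pattern, while a negative "long" is rejected.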

We'll have to niggle about the C type corresponding to 'k'.  Should it
be 'int' or 'long'?  It may not matter for you, since you expect to be
running on 32-bit hardware forever; but it matters for other potential
users of 'k'.  We could also have both 'k' and 'K', where 'k' stores
into a C int and 'K' into a C long.

I also propose to have a C API PyInt_AsUnsignedLong, which will
implement the semantics of 'K'.  Like 'i', 'k' will have to do an
explicit range test.

> If we have only pure signed and pure unsigned converters it would mean 
> an extraordinary amount of work, but luckily it seems that that is not 
> going to happen.

Not until 2.4, that is -- then 'k' (and 'K') will change to pure
unsigned.  But your hex constants and results of left shifts will
*also* be pure unsigned then; the only problem would be with Python
code that uses ~0 or -1 as a shorthand for 0xffffffff (which it ain't
on 64-bit machines today).
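
The relation between -1, ~0 and 0xffffffff can be made concrete with a small sketch in present-day Python, where integers have unlimited precision. The helper names are illustrative only and fix the width at 32 bits by assumption.

```python
MASK32 = 0xFFFFFFFF


def to_unsigned32(n):
    """Keep the low 32 bits of n as a nonnegative value."""
    return n & MASK32


def to_signed32(n):
    """Reinterpret the low 32 bits of n as a two's-complement value."""
    n &= MASK32
    return n - (1 << 32) if n & 0x80000000 else n
```

With these, to_unsigned32(-1) and to_unsigned32(~0) both give 0xffffffff, and to_signed32 maps that pattern back to -1.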

> It feels that way because pulling teeth is probably exactly the right 
> analogy: what you're doing is probably a good idea in the long run, but 
> right now it hurts...

OK, that's a good extension of the analogy.

Glad we're moving forward on this.

--Guido van Rossum (home page:

From  Tue Aug 13 11:55:52 2002
From: (M.-A. Lemburg)
Date: Tue, 13 Aug 2002 12:55:52 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>
Message-ID: <>

Jack Jansen wrote:
> On Tuesday, August 13, 2002, at 02:34 , Guido van Rossum wrote:
>> Why is it so hard to get people to think about what they need?  (I
>> mean beyond "I don't want anything to change" or vague things like
>> that.  I am looking for an API that will make developers like Jack as
>> well as other extension developers happy, but it feels like pulling
>> teeth.
> It feels that way because pulling teeth is probably exactly the right 
> analogy: what you're doing is probably a good idea in the long run, but 
> right now it hurts...

Here's a slightly different idea:

   Bit shifting is hardly ever done on signed data and since
   Python does not provide an unsigned integer object, most
   developers stick to the integer object and interpret its value
   as unsigned object (much like many people interpret strings
   as having a Latin-1 value). If people use integers that way,
   they know what they are doing and they are usually not interested
   in the sign of the value at all, as long as the bits stay the
   same when they pass the value to various Python APIs (including
   C APIs).

   Conclusion: Offer developers a better way to deal with unsigned
   data, e.g. an unsigned 32-bit integer type as subtype of int and
   let the bit manipulation operators return this unsigned type.

   For backward compat. make sure that common parser markers continue
   to work as they do now and add new ones for unsigned values for
   future use. PyInt_AS_LONG(unsignedInteger) would return the
   value of unsignedInteger casted to a signed one and extensions
   would be happy.

   Only if a value gets shifted beyond the first 32 bits,
   convert it to a long.

That should solve most backward compat problems for bit
shifters while still unifying ints and longs.
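
A minimal sketch of how such an unsigned subtype might behave, written in present-day Python. The class name UInt32 and the method as_signed are hypothetical; as_signed only mimics the idea that PyInt_AS_LONG would return the value cast to a signed one. No such type exists in Python.

```python
MASK32 = 0xFFFFFFFF


class UInt32(int):
    """Hypothetical unsigned 32-bit subtype of int (illustration only)."""

    def __new__(cls, value=0):
        value = int(value)
        if not -(1 << 31) <= value <= MASK32:
            raise OverflowError("does not fit in 32 bits: %r" % (value,))
        # Negative inputs keep their bit pattern, e.g. -1 -> 0xffffffff.
        return super().__new__(cls, value & MASK32)

    def __lshift__(self, n):
        result = int(self) << n
        # Shifting beyond 32 bits falls back to a plain (long) integer.
        return UInt32(result) if result <= MASK32 else result

    def __rshift__(self, n):
        return UInt32(int(self) >> n)

    def __invert__(self):
        return UInt32(~int(self) & MASK32)

    def as_signed(self):
        """Return the same 32 bits read as a signed value."""
        v = int(self)
        return v - (1 << 32) if v & 0x80000000 else v
```

For example, UInt32(1) << 31 stays a UInt32, while UInt32(1) << 32 becomes an ordinary integer.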

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug 13 12:24:34 2002
From: (Duncan Booth)
Date: Tue, 13 Aug 2002 12:24:34 +0100
Subject: [Python-Dev] Bugs in the python grammar?
References: <a05111b09b97e3419ee68@[]> <>
Message-ID: <Xns92697D259330Eduncanrcpcouk@>

Guido van Rossum <> wrote in 
> This one is correct (we use it to generate our parser):
> rammar/Grammar?rev=1.48&content-type=text/vnd.viewcvs-markup 
> Or download Python and look at Grammar/Grammar .
Does the program for converting the grammar to a railroad diagram still 
exist anywhere? I've searched Google, but I can't find any trace of it 
anywhere.

Duncan Booth                                   
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?

From  Tue Aug 13 12:31:15 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Tue, 13 Aug 2002 13:31:15 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:

> Walter Dörwald <> writes:
>>Output is as follows:
>>1790000 chars, 2.330% unenc
>>ignore: 0.022 (factor=1.000)
>>xmlcharrefreplace: 0.044 (factor=1.962)
>>xml2: 0.267 (factor=12.003)
>>xml3: 0.723 (factor=32.506)
>>workaround: 5.151 (factor=231.702)
>>i.e. a 1.7MB string with 2.3% unencodable characters was
> Those numbers are impressive. Can you please add
> def xml4(exc):
>   if isinstance(exc, UnicodeEncodeError):
>     if exc.end-exc.start == 1:
>       return u"&#"+str(ord(exc.object[exc.start]))+u";"
>     else:
>       r = []
>       for c in exc.object[exc.start:exc.end]:
>         r.extend([u"&#", str(ord(c)), u";"])
>       return u"".join(r)
>   else:
>     raise TypeError("don't know how to handle %r" % exc)
> and report how that performs (assuming I made no error)?

You must return a tuple (replacement, new input position)
otherwise the code is correct. I tried it and two new versions:

def xml5(exc):
     if isinstance(exc, UnicodeEncodeError):
         return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
     else:
         raise TypeError("don't know how to handle %r" % exc)

def xml6(exc):
     if isinstance(exc, UnicodeEncodeError):
         return (u"&#" + str(ord(exc.object[exc.start])) + u";",
                 exc.start+1)
     else:
         raise TypeError("don't know how to handle %r" % exc)
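
For readers trying this on a current interpreter, the callback contract described above (return a tuple of replacement text and the position at which to resume) can be sketched in Python 3 with codecs.register_error. The handler name "xml_sketch" is made up for this example; it handles the whole unencodable range at once rather than one character at a time.

```python
import codecs


def xml_sketch(exc):
    """Replace unencodable characters with numeric character references."""
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("&#%d;" % ord(c) for c in bad)
        # Second item: the input position where encoding should resume.
        return (replacement, exc.end)
    raise TypeError("don't know how to handle %r" % exc)


codecs.register_error("xml_sketch", xml_sketch)

encoded = "fr\u00f6r \u20ac".encode("ascii", "xml_sketch")
builtin = "fr\u00f6r \u20ac".encode("ascii", "xmlcharrefreplace")
```

Both calls should produce the same bytes; the built-in "xmlcharrefreplace" handler is the C counterpart of this Python callback.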

Here are the results:

1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.042 (factor=1.935)
xml2: 0.264 (factor=12.084)
xml3: 0.733 (factor=33.529)
xml4: 0.504 (factor=23.057)
xml5: 0.474 (factor=21.649)
xml6: 0.481 (factor=22.010)
workaround: 5.138 (factor=234.862)

>>Using a callback instead of the inline implementation is a factor of
>>12 slower than ignore.
> For the purpose of comparing C and Python, this isn't relevant, is it?
> Only the C version of xmlcharrefreplace and a Python version should be
> compared.

I was just too lazy to code this. ;)

Python is a factor of 2.7 slower than the C callback
(or 1.9 for your version).

>>It can't really be fixed for codecs implemented in Python. For codecs
>>that use the C functions we could add the functionality that e.g.
>>PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
>>but AFAICT it can't be done easily for Python where attribute assignment
>>directly goes to the instance dict.
> You could add methods into the class set_reason etc, which error
> handler authors would have to use.
> Again, these methods could be added through Python code, so no C code
> would be necessary to implement them.
> You could even implement a setattr method in Python - although you'd
> have to search this from C while initializing the class.

For me this sounds much more complicated than the current C functions, 
especially for using them from C, which most codecs probably will.

    Walter Dörwald

From  Tue Aug 13 12:38:53 2002
From: (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Tue, 13 Aug 2002 13:38:53 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:

> [...]
>>For the charmap codec it's mostly about performance. I don't
>>have objections for other codecs which rely on external
> Please remember that we are still about error handling here, and that
> the normal case will be "strict", which usually results in aborting
> the computation.
> So I don't see the performance issue even for the charmap codec.

I guess this code might be used inside a webserver that outputs XML
results and that honors the Accept-Charset header from the client,
so it must do encoding on the fly.

So I want the code to be as fast as possible.

    Walter Dörwald

From  Tue Aug 13 12:32:33 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 13:32:33 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

I was going to suggest that if we return mixed sets of unicode/string
values from listdir() we could also do the same thing for platforms
where FileSystemDefaultEncoding is utf-8, such as MacOSX.

But as usual with unicode, when I actually try this it doesn't work, and
I don't understand why not. Why is unicode always something that seems
so simple and logical until you actually try it??!?!?

Here's a transcript of my Python session. The terminal has been set to
render in latin-1. The directory contains one file, "frör"
(fr-o-umlaut-r).

sap!jack- python
Python 2.3a0 (#32, Aug 12 2002, 15:31:25)
[GCC 2.95.2 19991024 (release)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
 >>> import os
 >>> os.listdir('.')
['fro\xcc\x88r']
 >>> utf8name = os.listdir('.')[0]
 >>> unicodename = utf8name.decode('utf-8')
 >>> unicodename
u'fro\u0308r'
 >>> print unicodename.encode('latin-1')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeError: Latin-1 encoding error: ordinal not in range(256)

Sigh. \u0308 is not in the range(256), but the whole point of
encode('latin-1') is to make it so, isn't it? And o-umlaut definitely
has a latin-1 encoding. I tried the same with macroman instead of
latin-1 (just to make sure this wasn't a bug in the latin-1 encoder),
but still no go.

What am I doing wrong?
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma
Goldman -

From  Tue Aug 13 12:50:59 2002
From: (
Date: Tue, 13 Aug 2002 06:50:59 -0500
Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Mon, Aug 12, 2002 at 08:44:46PM -0400, Tim Peters wrote:
> [Jeff Epler]
> > (due to alignment requirements on all common machines, all valid
> > pointers-to-struct have 0 in their low bit)
> Not so on word-addressed machines, though, or on machines using low-order
> pointer bits for their own notion of tag bits.

Of course, "all common machines" simply means "x86 machines".
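
The tagging scheme under discussion can be simulated in a few lines of present-day Python. This is only a model of the idea, assuming a 32-bit word where real pointers always have a zero low bit; the function names are invented for the illustration.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def tag_int(n):
    """Pack a 31-bit signed integer into a word with the low bit set."""
    if not -(1 << 30) <= n < (1 << 30):
        raise OverflowError("does not fit in 31 bits: %r" % (n,))
    return ((n << 1) | 1) & WORD_MASK


def is_tagged_int(word):
    """Aligned pointers have a 0 low bit, tagged integers a 1."""
    return word & 1 == 1


def untag_int(word):
    """Recover the signed integer from a tagged word."""
    if word & (1 << (WORD_BITS - 1)):
        word -= 1 << WORD_BITS          # sign-extend the 32-bit word
    return word >> 1                    # arithmetic shift drops the tag
```

On a word-addressed machine, or one that uses the low pointer bits itself, the is_tagged_int test would no longer be reliable, which is the objection quoted above.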


From  Tue Aug 13 12:46:52 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 13:46:52 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On Tuesday, August 13, 2002, at 12:20 , Guido van Rossum wrote:

>> The least amount of work for me would be caused by keeping "i" 
>> semantics
>> as they are, of course.
> So you're using 'i', not 'l'?  Any particular reason?

No, sorry. It's l everywhere.

>> If we switch to "k" for integers in the range -2**31..2**31-1 that
>> would not be too much work, as a lot of the code is generated (I would
>> take the quick and dirty approach of using k for all my integers). Only
>> the hand-written code would have to be massaged by hand.
> Glad, that's my preferred choice too.  But note that in Python 2.4 and
> beyond, 'k' will only accept positive inputs, so you'll really have to
> find a way to mark your signed integer arguments up differently.

Huh??! Now you've confused me. If "k" means "32 bit mask", why would it 
be changed in 2.4 not to accept negative values? "-1" is a perfectly 
normal way to specify "0xffffffff" in C usage...

> In 2.3 (and 2.2.2), I propose the following semantics for 'k': if the
> argument is a Python int, a signed value within the range
> [INT_MIN,INT_MAX] is required; if it is a Python long, a nonnegative
> value in the range [0, 2*INT_MAX+1] is required.  These are the same
> semantics that are currently used by struct.pack() for 'L', I found
> out; I like these.

I don't see the point, really. Why not allow [INT_MIN, 2*INT_MAX+1]? If 
the "k" specifier is especially meant for bit patterns why not have 
semantics of "anything goes, unless we are absolutely sure it isn't 
going to fit"?

> We'll have to niggle about the C type corresponding to 'k'.  Should it
> be 'int' or 'long'?  It may not matter for you, since you expect to be
> running on 32-bit hardware forever; but it matters for other potential
> users of 'k'.  We could also have both 'k' and 'K', where 'k' stores
> into a C int and 'K' into a C long.

How about k1 for a byte, k2 for a short, k4 for a long and k8 for a long 
long?

> I also propose to have a C API PyInt_AsUnsignedLong, which will
> implement the semantics of 'K'.  Like 'i', 'k' will have to do an
> explicit range test.

In my proposal these would then probably become PyInt_As1Byte, 
PyInt_As2Bytes, PyInt_As4Bytes and PyInt_As8Bytes.
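
A sketch of the "anything goes, unless we are absolutely sure it isn't going to fit" rule for a given byte width, in present-day Python. The function name as_n_bytes is hypothetical and merely stands in for the proposed PyInt_As1Byte ... PyInt_As8Bytes family, which is not an existing API.

```python
def as_n_bytes(value, nbytes):
    """Return value as an unsigned bit pattern of nbytes bytes.

    Accepts anything from the signed minimum up to the unsigned maximum
    of that width and rejects only values that cannot fit at all.
    """
    bits = 8 * nbytes
    lo = -(1 << (bits - 1))      # e.g. INT_MIN for 4 bytes
    hi = (1 << bits) - 1         # e.g. 2*INT_MAX+1 for 4 bytes
    if not lo <= value <= hi:
        raise OverflowError("%r does not fit in %d bytes" % (value, nbytes))
    return value & hi
```

So as_n_bytes(-1, 4) and as_n_bytes(0xffffffff, 4) both yield the same mask.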
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Tue Aug 13 13:08:51 2002
From: (Jason Tishler)
Date: Tue, 13 Aug 2002 08:08:51 -0400
Subject: [Python-Dev] Bugs #544740: test_commands test fails under Cygwin
Message-ID: <>

Neil Norwitz suggested that I discuss the following on python-dev:

The problem is that test_commands does not handle spaces in either user
or group names.  Although this is probably only an issue under Cygwin,
this could affect other Unixes too (yup, I'm clutching at straws).
Anyway, suggestions on how to fix this will be greatly appreciated.
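
One possible direction, sketched in present-day Python: assuming the test matches a line of "ls -ld" style output (the test itself is not quoted in this thread, so that is an assumption), the pattern can skip everything between the link count and the path instead of expecting single-word user and group names. The pattern and the sample lines are invented for illustration.

```python
import re

# Hypothetical pattern for one line of "ls -ld /." style output.
LS_DIR_RE = re.compile(r"""
    d.........      # a directory, followed by nine permission characters
    \s+\d+          # link count
    \s+[^/]*        # user, group, size and date; names may contain spaces
    /\.$            # the path that was listed
""", re.VERBOSE)

cygwin_like = "drwxr-xr-x  20 Jason Tishler Domain Users  4096 Aug 13 08:00 /."
unix_like = "drwxr-xr-x 20 root wheel 4096 Aug 13 08:00 /."
plain_file = "-rw-r--r-- 1 root root 0 Aug 13 08:00 /."
```

The first two sample lines match and the third does not, since it does not describe a directory.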


From  Tue Aug 13 13:08:38 2002
From: (Barry A. Warsaw)
Date: Tue, 13 Aug 2002 08:08:38 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    GvR> In 2.3 (and 2.2.2), I propose the following semantics for
    GvR> 'k': if the argument is a Python int, a signed value within
    GvR> the range [INT_MIN,INT_MAX] is required; if it is a Python
    GvR> long, a nonnegative value in the range [0, 2*INT_MAX+1] is
    GvR> required.  These are the same semantics that are currently
    GvR> used by struct.pack() for 'L', I found out; I like these.

It's too bad struct.pack() and PyArg_ParseTuple() can't share the same
format character for the same semantics.  Py3k.


From  Tue Aug 13 13:13:27 2002
From: (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Tue, 13 Aug 2002 14:13:27 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
References: <>
Message-ID: <>

Jack Jansen wrote:

> [...]
> Here's a transcript of my Python session. The terminal has been set to 
> render in latin-1. The directory contains one file, "frör" (fr-o-umlaut-r).
> sap!jack- python
> Python 2.3a0 (#32, Aug 12 2002, 15:31:25)
> [GCC 2.95.2 19991024 (release)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>  >>> import os
>  >>> os.listdir('.')
> ['fro\xcc\x88r']
>  >>> utf8name = os.listdir('.')[0]
>  >>> unicodename = utf8name.decode('utf-8')
>  >>> unicodename
> u'fro\u0308r'

\u0308 is 'COMBINING DIAERESIS', i.e. the ö got decomposed into
"o" followed by a combining diaeresis.

> [...]

    Walter Dörwald

From  Tue Aug 13 13:17:54 2002
From: (Fredrik Lundh)
Date: Tue, 13 Aug 2002 14:17:54 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
References: <>
Message-ID: <025901c242c3$796b5b50$0900a8c0@spiff>

jack wrote:

> Sigh. \u0308 is not in the range(256), but the whole point of
> encode('latin-1') is to make it so, isn't it?

Define "make it so"?

The encoders convert unicode code points to corresponding code
points in the given 8-bit encoding.  One character in, one character
out (unless the target encoding is a multibyte encoding, like utf-8).

This works perfectly well if producers follow the "early uniform
normalization" rule (everything else is madness).  For some reason,
your listdir implementation doesn't.

Instead of returning LATIN SMALL LETTER O WITH DIAERESIS (\u00f6),
it returns multiple unicode characters.  I'd say it's broken.

As far as I know, there's no standard unicode normalizer in Python.
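
The "one character in, one character out" behaviour is easy to reproduce on a present-day Python 3 interpreter (string literals are Unicode there, so no u prefix is needed). This is a small sketch, with variable names chosen for the example:

```python
composed = "fr\u00f6r"      # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "fro\u0308r"   # "o" followed by COMBINING DIAERESIS

composed_bytes = composed.encode("latin-1")

try:
    decomposed.encode("latin-1")
    decomposed_ok = True
except UnicodeEncodeError:
    # U+0308 has no Latin-1 code point, so the encoder gives up.
    decomposed_ok = False
```

The composed form encodes to a single byte for the ö, whereas the decomposed form fails because the encoder does not combine characters.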


From  Tue Aug 13 13:57:59 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 08:57:59 -0400
Subject: [Python-Dev] Bugs in the python grammar?
In-Reply-To: Your message of "Tue, 13 Aug 2002 12:24:34 BST."
References: <a05111b09b97e3419ee68@[]> <>
Message-ID: <>

> > Or download Python and look at Grammar/Grammar .
> > 
> Does the program for converting the grammar to a railroad diagram still 
> exist anywhere? I've searched Google, but I can't find any trace of it 
> anywhere.

No, I don't have it and I don't think the original author has it
either.  That was 11-12 years ago...

--Guido van Rossum (home page:

From  Tue Aug 13 14:03:21 2002
From: (Stepan Koltsov)
Date: Tue, 13 Aug 2002 17:03:21 +0400
Subject: [Python-Dev] q about __dict__
Message-ID: <>

Hi, Guido, other python developers and other subscribers :-)

Could anybody please explain to me whether this is a bug?

=== begin ===
class Dict(dict):
	def __setitem__(*x):
		raise Exception, "Please, do not touch me!"

class A:
	def __init__(self, *x):
		self.__dict__ = Dict()

A().x = 12
===  end  ===
This doesn't raise the Exception, i.e. my __setitem__ function is not called.

Patching that is simple: in Objects/classobject.c one needs to replace
	PyDict_SetItem -> PyObject_SetItem
	PyDict_GetItem -> PyObject_GetItem
	PyDict_DelItem -> PyObject_DelItem
and maybe something else (not much).

The overhead is minimal, and as a bonus Python gains the ability to assign
an object of any type (not inherited from dict) to __dict__.


Sometimes I want to write strange classes, for example, a class with
ordered attributes. I know that it is possible to implement this by
redefining the class attributes __setattr__, etc., but setting __dict__
looks clearer.
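
For comparison, the __setattr__ route mentioned above might look like the following sketch in present-day Python. The class name OrderedAttrs and its attribute_order method are invented for the example; it records the order in which attributes are first assigned rather than replacing __dict__.

```python
class OrderedAttrs:
    """Remembers the order in which instance attributes are first set."""

    def __init__(self):
        # Go through object.__setattr__ so the bookkeeping list itself
        # is not recorded as a user attribute.
        object.__setattr__(self, "_order", [])

    def __setattr__(self, name, value):
        if name not in self.__dict__:
            self._order.append(name)
        object.__setattr__(self, name, value)

    def attribute_order(self):
        return list(self._order)


obj = OrderedAttrs()
obj.b = 1
obj.a = 2
obj.b = 3      # re-assignment does not change the recorded order
```

After these assignments, obj.attribute_order() lists "b" before "a".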

Thanks for reading this letter till the end ;-)

mailto: Stepan Koltsov <>

From  Tue Aug 13 14:04:30 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 09:04:30 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 14:17:54 +0200."
References: <>
Message-ID: <>

> This works perfectly well if producers follow the "early uniform
> normalization" rule (everything else is madness).  For some reason,
> your listdir implementation doesn't.

My guess it's not his listdir() or filesystem, but the keyboard

--Guido van Rossum (home page:

From  Tue Aug 13 14:01:25 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 09:01:25 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 13:32:33 +0200."
References: <>
Message-ID: <>

> Here's a transcript of my Python session. The terminal has been set to 
> render in latin-1. The directory contains one file, "frör" 
> (fr-o-umlaut-r).
> sap!jack- python
> Python 2.3a0 (#32, Aug 12 2002, 15:31:25)
> [GCC 2.95.2 19991024 (release)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>  >>> import os
>  >>> os.listdir('.')
> ['fro\xcc\x88r']
>  >>> utf8name = os.listdir('.')[0]
>  >>> unicodename = utf8name.decode('utf-8')
>  >>> unicodename
> u'fro\u0308r'
>  >>> print unicodename.encode('latin-1')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
> UnicodeError: Latin-1 encoding error: ordinal not in range(256)
>  >>>
> Sigh. \u0308 is not in the range(256), but the whole point of 
> encode('latin-1') is to make it so, isn't it? And o-umlaut definitely 
> has a latin-1 encoding. I tried the same with macroman instead of 
> latin-1 (just to make sure this wasn't a bug in the latin-1 encoder), 
> but still no go.
> What am I doing wrong?

Looks like it isn't you: the filename somehow contains a character
that's not in the Latin-1 subset of Unicode, and no encoding can fix
that for you.  I don't know why -- you'll have to figure out why your
keyboard generates that character when you type o-umlaut.

--Guido van Rossum (home page:

From  Tue Aug 13 14:08:53 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 09:08:53 -0400
Subject: [Python-Dev] Bugs #544740: test_commands test fails under Cygwin
In-Reply-To: Your message of "Tue, 13 Aug 2002 08:08:51 EDT."
References: <>
Message-ID: <>

> Neil Norwitz suggested that I discuss the following on python-dev:
> The problem is that test_commands does not handle spaces in either user
> or group names.  Although this is probably only an issue under Cygwin,
> this could affect other Unixes too (yup, I'm clutching at straws).
> Anyway, suggestions on how to fix this will be greatly appreciated.

The obvious fix would be a better regular expression.  Please submit a

--Guido van Rossum (home page:

From  Tue Aug 13 14:12:16 2002
From: (Neil Hodgson)
Date: Tue, 13 Aug 2002 23:12:16 +1000
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
References: <>
Message-ID: <04ff01c242cb$0c0bf620$3da48490@neil>

> Please comment on the PEP. There is an updated patch on
>; please comment on the patch as well.

   I received off-list replies to the PEP from about 5 people. All were in
favour but it doesn't show a great deal of interest. It is hard to place a
good limit on how far this PEP should extend. My initial proposal was just
to allow opening files with Unicode names. The extensions to other functions
suggested on the list were worthwhile, especially listdir, but since NT
supports Unicode in all system calls, it could end up being applied to less
useful calls such as popen and getenv.

   There was a suggestion from David Ascher that supporting a Unicode
version of getcwd would be useful and I agree as this will often feed into
the other file handling calls. This one can't be finessed by checking an
input argument for Unicode, so needs an extra name such as getcwdu. It'd be
a good idea here to work out a naming convention for this distinction now so
it can be used for more functions in the future.
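
For readers on a current interpreter: Python 3 ended up with exactly this kind of paired naming for the function that takes no argument, os.getcwd() for text and os.getcwdb() for bytes. A minimal sketch, with variable names chosen for the example:

```python
import os

cwd_text = os.getcwd()     # str (Unicode) path
cwd_bytes = os.getcwdb()   # bytes path, in the filesystem encoding
```

Since there is no input argument whose type could select the result type, the distinction has to live in the function name.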

> Aren't there some #ifdefs missing? posix_[12]str have code
> that's only relevant for Windows but isn't #ifdef'ed out
> like it is elsewhere.

   I didn't have more #ifdefs to shorten the code. The #ifdefs that exist
are to hide symbols (like _wmkdir) that may only be available on Windows.
The Unicode paths are guarded by unicode_file_names() so will be avoided on
other platforms. It doesn't matter greatly to me if there are additional
compile time guards although taking it further to have the extra (wide)
arguments to posix_[12]str only on Windows would obfuscate the code.


From  Tue Aug 13 14:11:18 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 09:11:18 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 12:55:52 +0200."
References: <>
Message-ID: <>

> Here's a slightly different idea:
>    Bit shifting is hardly ever done on signed data and since
>    Python does not provide an unsigned integer object, most
>    developers stick to the integer object and interpret its value
>    as unsigned object (much like many people interpret strings
>    as having a Latin-1 value). If people use integers that way,
>    they know what they are doing and they are usually not interested
>    in the sign of the value at all, as long as the bits stay the
>    same when they pass the value to various Python APIs (including
>    C APIs).
>    Conclusion: Offer developers a better way to deal with unsigned
>    data, e.g. an unsigned 32-bit integer type as subtype of int and
>    let the bit manipulation operators return this unsigned type.
>    For backward compat. make sure that common parser markers continue
>    to work as they do now and add new ones for unsigned values for
>    future use. PyInt_AS_LONG(unsignedInteger) would return the
>    value of unsignedInteger casted to a signed one and extensions
>    would be happy.
>    Only if a value gets shifted beyond the first 32 bits,
>    convert it to a long.
> That should solve most backward compat problems for bit
> shifters while still unifying ints and longs.


We are *already* offering developers a way to deal with unsigned data:
use longs.  Bit shifting works just fine on longs, and the results are
positive unless you "or" in a negative number.  Getting a 32-bit
result in Python is trivial (mask with 0xffffffffL).  The Python C API
already supports getting an unsigned C long out of a Python long

--Guido van Rossum (home page:

From  Tue Aug 13 15:23:28 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 10:23:28 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 23:12:16 +1000."
References: <>
Message-ID: <>

> There was a suggestion from David Ascher that supporting a Unicode
> version of getcwd would be useful and I agree as this will often feed into
> the other file handling calls. This one can't be finessed by checking an
> input argument for Unicode, so needs an extra name such as getcwdu. It'd be
> a good idea here to work out a naming convention for this distinction now so
> it can be used for more functions in the future.

It's gonna be ugly anyhow, so appending a 'u' is fine with me.

> Guido:
> > Aren't there some #ifdefs missing? posix_[12]str have code
> > that's only relevant for Windows but isn't #ifdef'ed out
> > like it is elsewhere.
>    I didn't have more #ifdefs to shorten the code. The #ifdefs that exist
> are to hide symbols (like _wmkdir) that may only be available on Windows.
> The Unicode paths are guarded by unicode_file_names() so will be avoided on
> other platforms. It doesn't matter greatly to me if there are additional
> compile time guards although taking it further to have the extra (wide)
> arguments to posix_[12]str only on Windows would obfuscate the code.

Those are all details.  We can finesse that when we get closer to
agreeing on the semantics.  I think code that we know will never be
executed on Unix should be inside #ifdefs.  Maybe we should reconsider
moving the Windows code to a separate file...

--Guido van Rossum (home page:

From  Tue Aug 13 15:33:19 2002
From: (Dan Sugalski)
Date: Tue, 13 Aug 2002 10:33:19 -0400
Subject: [Python-Dev] Bugs in the python grammar?
In-Reply-To: <>
References: <a05111b09b97e3419ee68@[]>
Message-ID: <a05111b01b97ec9262a7e@[]>

At 5:11 AM -0400 8/13/02, Guido van Rossum wrote:
>  > We've been digging through the python grammar, looking to build up a
>>  parser for it, and have come across what look to be bugs:
>>  In :
>I don't know where that file comes from; it's not the official
>grammar.  Fred will fix the typos you found.
>This one is correct (we use it to generate our parser):

Okay, cool, thanks.

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
                                      have teddy bears and even
                                      teddy bears get drunk

From  Tue Aug 13 15:45:36 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:45:36 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Aha!  So MBCS is not an encoding: it's an indirection for a variety of
> encodings.  (Is there a way to find out what the encoding is?)

Correct. In Python, locale.getdefaultlocale()[1] returns the encoding;
the underlying API function is GetACP, and Python uses it as

    PyOS_snprintf(encoding, sizeof(encoding), "cp%d", GetACP());
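
In Python terms the same "cp%d" naming can be sketched as follows. The code page number 1252 is only an example value, the helper name is invented, and locale.getpreferredencoding() is used here as the portable way to ask a present-day Python 3 for the locale's encoding.

```python
import codecs
import locale


def codepage_to_encoding(acp):
    """Mirror the "cp%d" formatting for a numeric ANSI code page."""
    return "cp%d" % acp


# Encoding Python derives from the current locale settings.
preferred = locale.getpreferredencoding(False)

# Example: code page 1252 maps onto Python's "cp1252" codec.
western = codecs.lookup(codepage_to_encoding(1252))
```

On a Windows machine the preferred encoding would typically be one of these "cpNNNN" names, depending on the configured code page.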

There is a second indirection, the "OEM code page", which they use:
- for on-disk FAT short file names,
- for the cmd.exe window

Python currently offers no access to GetOEMCP().

> Do you mean that the condition on
> #if defined(HAVE_LANGINFO_H) && defined(CODESET)
> is reliably false on Windows?  Otherwise _locale.setlocale() could set
> it.

Correct. nl_langinfo is a Sun invention (I believe) which made it into
Posix; Microsoft ignores it.

> So as long as they use 8-bit it's not our problem, right.  Another
> reason to avoid producing Unicode without a clue that the app expects
> Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
> sufficient clue.  IOW os.listdir(u".") should return Unicode.)

Indeed, that would be consistent. I deliberately want to leave this
out of PEP 277. On Unix, things are not that clear - as Jack points
out, readlink() and getcwd() also need consideration.
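
The "type of the argument is the clue" rule is how os.listdir() behaves in present-day Python 3, where the two string types are str and bytes. A small sketch using a temporary directory (the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    text_names = os.listdir(tmp)                # str in, str out
    byte_names = os.listdir(os.fsencode(tmp))   # bytes in, bytes out
```

The same directory is listed twice; only the type of the path argument differs, and the result type follows it.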

> > Ok, I'll update the PEP.
> To what?  (It would be bad if I convinced you at the same time you
> convinced me of the opposite. :-)

I haven't changed anything yet, and I won't. 

In this terrain, Windows has the cleaner API (they consider file names
as character strings, not as byte strings), so doing the right thing
is easier.


From  Tue Aug 13 15:48:09 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:48:09 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> I was going to suggest that if we return mixed sets of unicode/string
> values from listdir() we could also do the same thing for platforms
> where FileSystemDefaultEncoding is utf-8, such as MacOSX.

Indeed, on MacOS, I think returning Unicode objects is a safe thing to
do as well.


From  Tue Aug 13 15:51:40 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 10:51:40 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 16:45:36 +0200."
References: <> <> <> <> <> <>
Message-ID: <>

> > Aha!  So MBCS is not an encoding: it's an indirection for a variety of
> > encodings.  (Is there a way to find out what the encoding is?)
> Correct. In Python, locale.getdefaultlocale()[1] returns the encoding;
> the underlying API function is GetACP, and Python uses it as
>     PyOS_snprintf(encoding, sizeof(encoding), "cp%d", GetACP());
> There is a second indirection, the "OEM code page", which they use:
> - for on-disk FAT short file names,
> - for the cmd.exe window
> Python currently offers no access to GetOEMCP().
> > Do you mean that the condition on
> > 
> > #if defined(HAVE_LANGINFO_H) && defined(CODESET)
> > 
> > is reliably false on Windows?  Otherwise _locale.setlocale() could set
> > it.
> Correct. nl_langinfo is a Sun invention (I believe) which made it into
> Posix; Microsoft ignores it.
> > So as long as they use 8-bit it's not our problem, right.  Another
> > reason to avoid producing Unicode without a clue that the app expects
> > Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
> > sufficient clue.  IOW os.listdir(u".") should return Unicode.)
> Indeed, that would be consistent. I deliberately want to leave this
> out of PEP 277. On Unix, things are not that clear - as Jack points
> out, readlink() and getcwd() also need consideration.
> > > Ok, I'll update the PEP.
> > 
> > To what?  (It would be bad if I convinced you at the same time you
> > convinced me of the opposite. :-)
> I haven't changed anything yet, and I won't. 
> In this terrain, Windows has the cleaner API (they consider file names
> as character strings, not as byte strings), so doing the right thing
> is easier.

OK.  I leave this further in your capable hands, Martin!
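
The rule suggested in the quoted text, that the type of the argument to
os.listdir() should select the type of the result, can be sketched as
follows. This assumes a present-day Python 3 interpreter rather than the
2.2/2.3 one under discussion: there, str plays the role of the Unicode
type and bytes that of the 8-bit string.

```python
import os

# Assumption: Python 3, where a str argument yields str names and a
# bytes argument yields bytes names for the same directory.
text_names = os.listdir(".")
byte_names = os.listdir(b".")

all_text = all(isinstance(name, str) for name in text_names)
all_bytes = all(isinstance(name, bytes) for name in byte_names)
print(all_text, all_bytes)
```

The caller states which representation it expects simply by the type it
passes in, so no extra function name or flag is needed for this call.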

--Guido van Rossum (home page:

From  Tue Aug 13 15:50:59 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:50:59 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <025901c242c3$796b5b50$0900a8c0@spiff>
References: <>
Message-ID: <>

"Fredrik Lundh" <> writes:

> As far as I know, there's no standard unicode normalizer in Python.

Maybe that example shows that there should be: the codecs which use
combining characters should then normalize the string on error,
probably to NFC, and retry.


From  Tue Aug 13 15:54:08 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:54:08 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Looks like it isn't you: the filename somehow contains a character
> that's not in the Latin-1 subset of Unicode, and no encoding can fix
> that for you.  I don't know why -- you'll have to figure out why your
> keyboard generates that character when you type o-umlaut.

As Walter explains, he has \u006f\u0308, which is


This could be normalized to


which then can be encoded as Latin-1. This, of course, requires the
databases for normalization (canonical composition and decomposition).
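
A minimal sketch of the step described above, assuming the unicodedata
module of a present-day Python 3 (no such normalizer was available in the
release being discussed): the decomposed pair cannot be encoded as
Latin-1 directly, but its NFC form can.

```python
import unicodedata

# LATIN SMALL LETTER O followed by COMBINING DIAERESIS
decomposed = "o\u0308"

# Canonical composition folds the pair into one precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

# The combining character has no Latin-1 equivalent, so this fails.
try:
    decomposed.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

latin1_bytes = composed.encode("latin-1")
print(repr(composed), direct_ok, latin1_bytes)
```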


From  Tue Aug 13 15:59:05 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:59:05 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> If we have only pure signed and pure unsigned converters it would mean
> an extraordinary amount of work, but luckily it seems that that is not
> going to happen.

Now I'm confused: "l" *is* a "pure signed converter", no? I.e. it
won't accept a value above 2**31-1, right?


From  Tue Aug 13 15:56:33 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 16:56:33 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <04ff01c242cb$0c0bf620$3da48490@neil>
References: <>
Message-ID: <>

"Neil Hodgson" <> writes:

>    There was a suggestion from David Ascher that supporting a Unicode
> version of getcwd would be useful and I agree as this will often feed into
> the other file handling calls. This one can't be finessed by checking an
> input argument for Unicode, so needs an extra name such as getcwdu. It'd be
> a good idea here to work out a naming convention for this distinction now so
> it can be used for more functions in the future.

Alternatively, a flag could do. Alas, it currently isn't in the PEP,
and unless there is easy agreement on how it should work, I think this
must be left for further study.
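
For comparison, here is how the "extra name" option looks in a
present-day Python 3, which is an assumption about a later interpreter
and not something the PEP specifies: there the text-returning call keeps
the plain name and the 8-bit variant carries a suffix.

```python
import os

# Assumption: Python 3. getcwd() returns text, getcwdb() returns the
# same path as a byte string.
cwd_text = os.getcwd()
cwd_bytes = os.getcwdb()

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```

A flag argument would keep one name for both results, at the price of a
return type that depends on a runtime value.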


From  Tue Aug 13 16:04:32 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 11:04:32 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 16:54:08 +0200."
References: <> <>
Message-ID: <>

> As Walter explains, he has \u006f\u0308, which is
> This could be normalized to
> which then can be encoded as Latin-1. This, of course, requires the
> databases for normalization (canonical composition and decomposition).

But if you pass the normalized string (or the Latin-1 string) to
open(), will it find the file?  I.e. if the filesystem has the
unnormalized name stored in its directory, will filesystem requests
normalize filenames before comparing them?

Jack, can you try to do that?  Can you try open('fr\xf6r') in that
directory?

--Guido van Rossum (home page:

From  Tue Aug 13 14:56:55 2002
From: (M.-A. Lemburg)
Date: Tue, 13 Aug 2002 15:56:55 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>Here's a slightly different idea:
>>   Bit shifting is hardly ever done on signed data and since
>>   Python does not provide an unsigned integer object, most
>>   developers stick to the integer object and interpret its value
>>   as unsigned object (much like many people interpret strings
>>   as having a Latin-1 value). If people use integers that way,
>>   they know what they are doing and they are usually not interested
>>   in the sign of the value at all, as long as the bits stay the
>>   same when they pass the value to various Python APIs (including
>>   C APIs).
>>   Conclusion: Offer developers a better way to deal with unsigned
>>   data, e.g. an unsigned 32-bit integer type as subtype of int and
>>   let the bit manipulation operators return this unsigned type.
>>   For backward compat. make sure that common parser markers continue
>>   to work as they do now and add new ones for unsigned values for
>>   future use. PyInt_AS_LONG(unsignedInteger) would return the
>>   value of unsignedInteger cast to a signed one and extensions
>>   would be happy.
>>   Only if a value gets shifted beyond the first 32 bits,
>>   convert it to a long.
>>That should solve most backward compat problems for bit
>>shifters while still unifying ints and longs.
> -100.
> We are *already* offering developers a way to deal with unsigned data:
> use longs.  Bit shifting works just fine on longs, and the results are
> positive unless you "or" in a negative number.  Getting a 32-bit
> result in Python is trivial (mask with 0xffffffffL).  The Python C API
> already supports getting an unsigned C long out of a Python long
> (PyLong_AsUnsignedLong()).

You are turning in circles here. Longs are not compatible
to integers at C level. That's what I was trying to
address.

Longs don't offer the performance you'd expect from bit operations,
so they are not a real-life alternative to native 32-bit or 64-bit
integers or bit fields. They are from a language designer's POV,
but then I'd suggest to drop the difference between ints and longs
completely in Py3k instead and make them a single hybrid type for
multi-precision numbers which uses native C number types or arrays
of bytes as necessary.

BTW, what do you mean by:

	"hex()/oct() of negative int will return "
	"a signed string in Python 2.4 and up"

Are you suggesting that hex(0xff000000) returns
"-0x1000000" ?

That looks like another potentially harmful change.

Is it really worth breaking these things just for the sake
of trying to avoid OverflowErrors where a simple explicit
cast by the programmer is all that's needed to avoid them ?

Marc-Andre Lemburg
CEO Software GmbH
_______________________________________________________________________ -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                     
Python Software:          

From  Tue Aug 13 16:15:30 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 11:15:30 -0400
Subject: [Python-Dev] q about __dict__
In-Reply-To: Your message of "Tue, 13 Aug 2002 17:03:21 +0400."
References: <>
Message-ID: <>

> Can anybody please explain to me whether this is a bug?

No and yes.  It's currently defined as a feature -- you can subclass
dict, but it's not always safe to override operations like __getitem__
because Python internally takes shortcuts for dicts used to implement
namespaces.  For dicts used for namespaces, it's only safe to add new
methods; for dicts not used for namespaces, it's safe to override
special methods.

> Code
> === begin ===
> class Dict(dict):
> 	def __setitem__(self, *x):
> 		raise Exception, "Please, do not touch me!"
> class A:
> 	def __init__(self, *x):
> 		self.__dict__ = Dict()
> A().x = 12
> ===  end  ===
> doesn't raise Exception, i.e. my __setitem__ function is not called.
> Patching that is simple: in Objects/classobject.c one needs to replace
> 	PyDict_SetItem -> PyObject_SetItem
> 	PyDict_GetItem -> PyObject_GetItem
> 	PyDict_DelItem -> PyObject_DelItem
> and maybe something else (not much).
> Overhead is minimal, and as a bonus Python gains the ability to assign
> an object of any type (not inherited from dict) to __dict__.

Have you tried this?  Because PyDict_GetItem() doesn't set an
exception condition when the key is not found, a lot of code would
have to be changed.

> Motivation:
> Sometimes I want to write strange classes, for example, a class with
> ordered attributes. I know that it is possible to implement this by
> redefining class attributes such as __setattr__, but setting __dict__
> looks clearer.

For this particular situation (instance variables) I'm not totally
against fixing this, but I don't find it has a high priority.  You can
help by providing a patch that implements your idea above, and showing
some benchmark results (e.g. based on PyBench) that indicate the
minimal performance impact you're claiming.  If you don't feel like
doing this yourself (e.g. because you're not confident about your C
coding skills), ask around on comp.lang.python for help.
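
The behaviour discussed in this message can still be observed. The sketch
below assumes a present-day CPython 3 (other implementations may differ)
and uses Python 3 syntax. The Ordered class is a hypothetical
illustration of the __setattr__ route mentioned in the motivation, not
code from the thread.

```python
class Dict(dict):
    def __setitem__(self, key, value):
        raise Exception("Please, do not touch me!")


class A:
    def __init__(self):
        self.__dict__ = Dict()


a = A()
a.x = 12        # CPython writes into the dict directly; no exception
stored = a.x


class Ordered:
    """Hypothetical class that records the order of first assignment."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the bookkeeping list.
        object.__setattr__(self, "_order", [])

    def __setattr__(self, name, value):
        if name not in self._order:
            self._order.append(name)
        object.__setattr__(self, name, value)


o = Ordered()
o.b = 1
o.a = 2
o.b = 3         # re-assignment does not change the recorded order
order = list(o._order)
print(stored, order)
```

Overriding __setattr__ stays on the documented path, which is why it
works where the overridden __setitem__ of a namespace dict is skipped.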

--Guido van Rossum (home page:

From  Tue Aug 13 16:17:25 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 11:17:25 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 16:59:05 +0200."
References: <>
Message-ID: <>

> Jack Jansen <> writes:
> > If we have only pure signed and pure unsigned converters it would mean
> > an extraordinary amount of work, but luckily it seems that that is not
> > going to happen.

> Now I'm confused: "l" *is* a "pure signed converter", no? I.e. it
> won't accept a value above 2**31-1, right?

Correct.  I think Jack's worry is that in 2.3, mask expressions can be
negative, and if "k" were a pure unsigned converter, negative masks
would not be accepted.

--Guido van Rossum (home page:

From  Tue Aug 13 16:26:26 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 11:26:26 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 15:56:55 +0200."
References: <> <> <>
Message-ID: <>

> > We are *already* offering developers a way to deal with unsigned data:
> > use longs.  Bit shifting works just fine on longs, and the results are
> > positive unless you "or" in a negative number.  Getting a 32-bit
> > result in Python is trivial (mask with 0xffffffffL).  The Python C API
> > already supports getting an unsigned C long out of a Python long
> > (PyLong_AsUnsignedLong()).
> You are turning in circles here. Longs are not compatible
> to integers at C level. That's what I was trying to
> address.

In what sense are longs not compatible to C integers?  Because they
can hold larger values?  PyLong_AsUnsignedLong() does a range check;
Python code can force the value to be in range by using (e.g.)

> Longs don't offer the performance you'd expect from bit operations,

Oh puleeeeeeze.  The overhead of the VM, object creation and
deallocation, overflow checks, etc., completely drown the time it
takes to do the measly a<<b or a|b operation.  Doing the operation for
a Python long isn't significantly slower.

> so they are not a real-life alternative to native 32-bit or 64-bit
> integers or bit fields.

Of course they aren't, and that's not what we need.  We're talking
here about being able to pass various bits and masks to system and
library calls that.

> They are from a language designer's POV,
> but then I'd suggest to drop the difference between ints and longs
> completely in Py3k instead and make them a single hybrid type for
> multi-precision numbers which uses native C number types or arrays
> of bytes as necessary.

Yeah, in Py3k, there will be only one type.  PEP 237 tries to
approximate this without having to change *every* line of code dealing
with ints.  Unless you use isinstance() or type(), eventually you
won't be able to tell the difference (and we'll provide a way to
abstract away from those too, e.g. a baseint class).

> BTW, what do you mean by:
> 	"hex()/oct() of negative int will return "
> 	"a signed string in Python 2.4 and up"
> Are you suggesting that hex(0xff000000) returns
> "-0x1000000" ?

No, because 0xff000000 will be a positive number. :-)

However, hex(-1) will return '-0x1' rather than '0xffffffff'.
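
A small sketch of the end state described here, assuming a present-day
Python 3 in which ints and longs are already unified. The helper name
as_unsigned32 is made up for illustration.

```python
def as_unsigned32(n):
    """Reduce n to its low 32 bits, giving a non-negative result."""
    return n & 0xffffffff


signed_text = hex(-1)                     # signed string, no wraparound
masked_text = hex(as_unsigned32(-1))      # explicit 32-bit view of -1
literal_text = hex(0xff000000)            # the literal is positive
shifted = as_unsigned32(0x80000000 << 1)  # top bit shifted out by the mask

print(signed_text, masked_text, literal_text, shifted)
```

Code that relied on the old 32-bit rendering of negative numbers can
recover it with the mask, which makes the intended width explicit.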

> That looks like another potentially harmful change.

That's why I'm adding warnings now.

I'm frustrated that you apparently didn't read PEP 237 when it was
discussed in the first place.

> Is it really worth breaking these things just for the sake
> of trying to avoid OverflowErrors where a simple explicit
> cast by the programmer is all that's needed to avoid them ?


--Guido van Rossum (home page:

From  Tue Aug 13 17:05:51 2002
From: (Jonathan Riehl)
Date: Tue, 13 Aug 2002 11:05:51 -0500 (CDT)
Subject: [Python-Dev] PEP 269 will live again.
Message-ID: <Pine.BSF.4.33.0208131101140.93711-100000@localhost>

I move to move PEP 269 (pgen module for Python) to zombie monster status
and refer all interested parties to my post today in the parser-sig for
more details.  Apparently I am only interested in parser generators in the
month of August (PEP 269 was drafted in Aug.2001).

From  Tue Aug 13 17:12:40 2002
From: (M.-A. Lemburg)
Date: Tue, 13 Aug 2002 18:12:40 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
References: <> <> <>              <> <>
Message-ID: <>

Guido van Rossum wrote:
>>>We are *already* offering developers a way to deal with unsigned data:
>>>use longs.  Bit shifting works just fine on longs, and the results are
>>>positive unless you "or" in a negative number.  Getting a 32-bit
>>>result in Python is trivial (mask with 0xffffffffL).  The Python C API
>>>already supports getting an unsigned C long out of a Python long
>>>(PyLong_AsUnsignedLong()).
>>You are turning in circles here. Longs are not compatible
>>to integers at C level. That's what I was trying to
>>address.
> In what sense are longs not compatible to C integers? 

PyInt_Check() doesn't accept longs. PyInt_AS_LONG() returns
garbage.

> I'm frustrated that you apparently didn't read PEP 237 when it was
> discussed in the first place.

I was on vacation at the time you discussed this and I
had never expected that you are actually trying to force
long usage instead of integer usage. My impression was that
you were aiming at providing ways to be able to pass longs
to integer aware APIs which is goodness.

Marc-Andre Lemburg
CEO Software GmbH

From  Tue Aug 13 17:20:06 2002
From: (Brian Quinlan)
Date: Tue, 13 Aug 2002 09:20:06 -0700
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <001e01c242e5$49697ff0$bd5d4540@Dell2>

Guido van Rossum wrote:
> But if you pass the normalized string (or the Latin-1 string) to
> open(), will it find the file?  

I tried opening a file using both "o\xcc\x88" and "\xc3\xb6". Both
result in the same file being opened.

> I.e. if the filesystem has the
> unnormalized name stored in its directory, will filesystem requests
> normalize filenames before comparing them?

It could be that Apple is decomposing the filenames before comparing
them. Either way works.
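
The two byte strings tried above are the UTF-8 encodings of the
decomposed and the precomposed spelling of the same character. A short
check, assuming the unicodedata module of a present-day Python 3:

```python
import unicodedata

decomposed_bytes = b"o\xcc\x88"   # 'o' + COMBINING DIAERESIS in UTF-8
composed_bytes = b"\xc3\xb6"      # precomposed o-umlaut in UTF-8

decomposed = decomposed_bytes.decode("utf-8")
composed = composed_bytes.decode("utf-8")

# The strings differ code point by code point...
differ = decomposed != composed
# ...but each normalizes to the other.
nfd_matches = unicodedata.normalize("NFD", composed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == composed

print(differ, nfd_matches, nfc_matches)
```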


From  Tue Aug 13 17:26:26 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 12:26:26 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 18:12:40 +0200."
References: <> <> <> <> <>
Message-ID: <>

> > In what sense are longs not compatible to C integers? 
> PyInt_Check() doesn't accept longs. PyInt_AS_LONG() returns
> garbage.

Since you were proposing a new type, I don't see how that matters.
(Making unsigned a subtype of int won't work.)

> > I'm frustrated that you apparently didn't read PEP 237 when it was
> > discussed in the first place.
> I was on vacation at the time you discussed this and I
> had never expected that you are actually trying to force
> long usage instead of integer usage. My impression was that
> you were aiming at providing ways to be able to pass longs
> to integer aware APIs which is goodness.

PyInt_AsLong() and the 'i' and 'l' formats to PyArg_Parse* have
accepted longs for a long time.  The proper idiom is either to use
PyArg_Parse* with an 'i' or 'l' format, or to call PyInt_AsLong()
*without* first using PyInt_Check().

--Guido van Rossum (home page:

From  Tue Aug 13 17:36:22 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 12:36:22 -0400
Subject: [Python-Dev] PEP 269 will live again.
In-Reply-To: Your message of "Tue, 13 Aug 2002 11:05:51 CDT."
References: <Pine.BSF.4.33.0208131101140.93711-100000@localhost>
Message-ID: <>

> I move to move PEP 269 (pgen module for Python) to zombie monster status
> and refer all interested parties to my post today in the parser-sig for
> more details.  Apparently I am only interested in parser generators in the
> month of August (PEP 269 was drafted in Aug.2001).

I suppose you're referring to this message:

I have not retired your PEP and am glad you're interested in this
subject again.  Let's try to reach a conclusion before August is over.

I don't think you should try to tell the Jython folks what to do.  A
pgen module that only works in CPython is still valuable.  If you want
to port pgen to Jython and support it as a module, that's fine, but I
don't think you should try to get the Jython developers to use pgen as
their parser.  After all, Jython's *implementation* is *supposed* to
be Javaesque.

Are you interested in implementing PEP 269 as it currently stands?
Then fine, let's do it and get it into Python 2.3.

If you want to expand the scope, I predict that it'll never happen, so
then let's retire the PEP.  It's up to you.

Note that Jeremy has a new Python compiler package (Lib/python/ in the
Python 2.3 CVS tree), which currently uses parse trees as produced by
the old 'parser' module as input, and then restructures them into more
abstract syntax trees.  This compiler is easily retargetable to other
input and output structures though -- I believe Finn Bock already has
a Jython version of it.  I don't know what it generates, I doubt it
generates CPython bytecode, maybe it generates Java source or JVM
assembler; I believe it takes the same parse tree that Jython uses as

I think it would be useful if you use the same form of abstract syntax
trees as Jeremy's parser uses (not the parser module output, but the
restructured abstract syntax trees); I think they are quite flexible
and useful.

If you don't want to do this, you'll have to motivate why your
alternative is better, and also show how Jeremy's compiler package can
be easily adapted to use your form of parse trees.

--Guido van Rossum (home page:

From  Tue Aug 13 17:37:20 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 12:37:20 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Tue, 13 Aug 2002 09:20:06 PDT."
References: <001e01c242e5$49697ff0$bd5d4540@Dell2>
Message-ID: <>

> Guido van Rossum wrote:
> > But if you pass the normalized string (or the Latin-1 string) to
> > open(), will it find the file?  
> I tried opening a file using both "o\xcc\x88" and "\xc3\xb6". Both
> result in the same file being opened.
> > I.e. if the filesystem has the
> > unnormalized name stored in its directory, will filesystem requests
> > normalize filenames before comparing them?
> It could be that Apple is decomposing the filenames before comparing
> them. Either way works.

Hm, that sucks (either way) -- because you get unnormalized Unicode
out of directory listings, which is harder to turn into local
encodings.

--Guido van Rossum (home page:

From  Tue Aug 13 17:50:14 2002
From: (Brian Quinlan)
Date: Tue, 13 Aug 2002 09:50:14 -0700
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <001f01c242e9$7f45cfd0$bd5d4540@Dell2>

Guido van Rossum wrote:
> > It could be that Apple is decomposing the filenames before comparing
> > them. Either way works.
> Hm, that sucks (either way) -- because you get unnormalized Unicode
> out of directory listings, which is harder to turn into local
> encodings.

Here is a relevant URI:

"""In addition, all code that calls BSD system routines should ensure
that the const *char parameters of these routines are in UTF-8 encoding.
All BSD system functions expect their string parameters to be in UTF-8
encoding and nothing else. An additional caveat is that string
parameters for files, paths, and other file-system entities must be in
canonical UTF-8. In a canonical UTF-8 Unicode string, all decomposable
characters are decomposed; for example, é (0x00E9) is represented as e
(0x0065) + ´ (0x0301). To put things in canonical UTF-8 encoding, use
"file-system representation" APIs defined in Cocoa and Carbon (including
Core Foundation). For example, to get a canonical UTF-8 character string
in Cocoa, use NSString's fileSystemRepresentation method; for
noncanonical UTF-8 strings, use NSString's UTF8String method"""
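
As a rough illustration of the "canonical UTF-8" form quoted above, the
sketch below decomposes a name with NFD and then encodes it. The helper
canonical_utf8 is hypothetical, and NFD from a present-day Python 3
unicodedata module is only a stand-in for Apple's own decomposition
rules, which may differ in detail.

```python
import unicodedata


def canonical_utf8(name):
    """Decompose name (NFD, an assumption) and encode it as UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


e_acute = canonical_utf8("\u00e9")     # e + COMBINING ACUTE ACCENT
filename = canonical_utf8("fr\u00f6r")  # the name used in this thread

print(e_acute, filename)
```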


From  Tue Aug 13 17:54:49 2002
From: (M.-A. Lemburg)
Date: Tue, 13 Aug 2002 18:54:49 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
References: <001e01c242e5$49697ff0$bd5d4540@Dell2> <>
Message-ID: <>

Guido van Rossum wrote:
>>Guido van Rossum wrote:
>>>But if you pass the normalized string (or the Latin-1 string) to
>>>open(), will it find the file?  
>>I tried opening a file using both "o\xcc\x88" and "\xc3\xb6". Both
>>result in the same file being opened.
>>>I.e. if the filesystem has the
>>>unnormalized name stored in its directory, will filesystem requests
>>>normalize filenames before comparing them?
>>It could be that Apple is decomposing the filenames before comparing
>>them. Either way works.

The recommended way of doing normalization is to go by
Normalization Form C: Canonical Decomposition,
followed by Canonical Composition.


Note that for proper collation support, Unicode strings must first be
normalized. See

> Hm, that sucks (either way) -- because you get unnormalized Unicode
> out of directory listings, which is harder to turn into local
> encodings.

You can easily normalize it again (provided you have a normalization
lib at hand).

Marc-Andre Lemburg
CEO Software GmbH

From  Tue Aug 13 18:41:43 2002
From: (Samuele Pedroni)
Date: Tue, 13 Aug 2002 19:41:43 +0200
Subject: [Python-Dev] Re: PEP 269 will live again.
Message-ID: <015401c242f0$b08cd8c0$6d94fea9@newmexico>

>Note that Jeremy has a new Python compiler package (Lib/python/ in the
>Python 2.3 CVS tree), which currently uses parse trees as produced by
>the old 'parser' module as input, and then restructures them into more
>abstract syntax trees.  This compiler is easily retargetable to other
>input and output structures though -- I believe Finn Bock already has
>a Jython version of it.  I don't know what it generates, I doubt it
>generates CPython bytecode, maybe it generates Java source or JVM
>assembler; I believe it takes the same parse tree that Jython uses as

yup, it is more than a prototype,
the compilers in the current Jython CVS are based on that.

>I think it would be useful if you use the same form of abstract syntax
>trees as Jeremy's parser uses (not the parser module output, but the
>restructured abstract syntax trees); I think they are quite flexible
>and useful.

Yup, and the point of the exercise is to make it possible for
future versions of PyChecker etc. to work with Jython too.

>If you don't want to do this, you'll have to motivate why your
>alternative is better, and also show how Jeremy's compiler package can
>be easily adapted to use your form of parse trees.

yes, ideally it should output a superset of that, or something with
small changes that can be easily backported to the above effort in
Jython; otherwise it is a kind of step backward:

when first proposed the PEP would have been a further blessing
for the awful parser module output format,

now the efforts of Jeremy and Finn have moved a bit both
Python and Jython away from that and on a parallel track.


From  Tue Aug 13 19:02:27 2002
From: (Andrew Koenig)
Date: Tue, 13 Aug 2002 14:02:27 -0400 (EDT)
Subject: [Python-Dev] type categories
Message-ID: <>

While I was driving to work today, I had a thought about the
iterator/iterable discussion of a few weeks ago.  My impression is
that that discussion was inconclusive, but a few general principles
emerged from it:

	1) Some types are iterators -- that is, they support calls
	   to next() and raise StopIteration when they have no more
	   information to give.

	2) Some types are iterables -- that is, they support calls
	   to __iter__() that yield an iterator as the result.

	3) Every iterator is also an iterable, because iterators are
	   required to implement __iter__() as well as next().

	4) The way to determine whether an object is an iterator
	   is to call its next() method and see what happens.

	5) The way to determine whether an object is an iterable
	   is to call its __iter__() method and see what happens.

I'm uneasy about (4) because if an object is an iterator, calling its
next() method is destructive.  The implication is that you had better
not use this method to test if an object is an iterator until you are
ready to take irrevocable action based on that test.  On the other
hand, calling __iter__() is safe, which means that you can test
nondestructively whether an object is an iterable, which includes
all iterators.
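
The difference between the two tests can be sketched as follows. This
assumes a present-day Python 3, where the method spelled next() above is
named __next__() and is reached through the built-in next().

```python
numbers = [1, 2, 3]
it = iter(numbers)

# Asking an iterator for its iterator is harmless: it returns itself.
same = iter(it) is it

# Calling next() to "see what happens" consumes an element for good.
consumed = next(it)
remaining = list(it)

# A list is an iterable but not an iterator: it has no __next__ method.
list_has_next = hasattr(numbers, "__next__")

print(same, consumed, remaining, list_has_next)
```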

Here is what I realized this morning.  It may be obvious to you,
but it wasn't to me (until after I realized it, of course):

     ``iterator'' and ``iterable'' are just two of many type
     categories that exist in Python.

Some other categories:

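     callable
     sequence
     generator
     class
     instance
     type
     number
     integer
     floating-point number
     complex number
     mutable
     tuple
     mapping
     method
     built-in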

As far as I know, there is no uniform method of determining into which
category or categories a particular object falls.  Of course, there
are non-uniform ways of doing so, but in general, those ways are, um,
nonuniform.  Therefore, if you want to check whether an object is in
one of these categories, you haven't necessarily learned much about
how to check if it is in a different one of these categories.

So what I wonder is this:  Has there been much thought about making
these type categories more explicitly part of the type system?

From  Tue Aug 13 20:43:20 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 21:43:20 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Those are all details.  We can finesse that when we get closer to
> agreeing on the semantics.  I think code that we know will never be
> executed on Unix should be inside #ifdefs.  Maybe we should reconsider
> moving the Windows code to a separate file...

Indeed, I had PEP 277 in mind when I created the ntmodule patch :-)


From  Tue Aug 13 20:46:29 2002
From: (Jason Tishler)
Date: Tue, 13 Aug 2002 15:46:29 -0400
Subject: [Python-Dev] Bugs #544740: test_commands test fails under Cygwin
In-Reply-To: <>
References: <>
Message-ID: <>

On Tue, Aug 13, 2002 at 09:08:53AM -0400, Guido van Rossum wrote:
> > Anyway, suggestions on how to fix this will be greatly appreciated.
> The obvious fix would be a better regular expression.

I thought of that before I submitted my bug report.  Unfortunately,
deriving the better regular expression is not obvious.

The following is what I have come up with:

pat = r'''d.........             # directory
          \s+\d+                 # number of links
          (\s+\w+)+              # user and group which can contain spaces
          \s+\d+                 # size
          \s+\w+\s+\d+\s+[\d:]+  # date  <*** assumed format ***>
          \s+/\.                 # file name
       '''

Unfortunately, I had to make an assumption on the date format in order
to match the user and group names regardless of the number of embedded
spaces.  Is my date regular expression acceptable?  Will it work in non
US locales?
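
To make the locale question concrete, here is a sketch that runs the
pattern against three sample lines, assuming a present-day Python 3. The
sample lines are made up for illustration and are not real "ls -ld /."
output from any of the systems discussed.

```python
import re

pat = re.compile(r'''d.........             # directory
          \s+\d+                 # number of links
          (\s+\w+)+              # user and group, possibly with spaces
          \s+\d+                 # size
          \s+\w+\s+\d+\s+[\d:]+  # date: month name, day, time or year
          \s+/\.                 # file name
       ''', re.VERBOSE)

# Hypothetical inputs: same entry, three different date layouts.
us_style = "drwxr-xr-x   15 Administ Domain U     4096 Aug 12 12:50 /."
day_first = "drwxr-xr-x   15 Administ Domain U     4096 12 Aug 12:50 /."
iso_style = "drwxr-xr-x   15 Administ Domain U     4096 2002-08-12 12:50 /."

us_ok = pat.match(us_style) is not None
day_first_ok = pat.match(day_first) is not None
iso_ok = pat.match(iso_style) is not None

print(us_ok, day_first_ok, iso_ok)
```

Only the month-then-day layout matches, so under these assumed inputs
the date part of the pattern is the piece that depends on the locale.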

> Please submit a patch.

I will do so, once I get some feedback.


From  Tue Aug 13 20:45:29 2002
From: (Michael McLay)
Date: Tue, 13 Aug 2002 15:45:29 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

On Tuesday 13 August 2002 02:02 pm, Andrew Koenig wrote:
> I'm uneasy about (4) because if an object is an iterator, calling its
> next() method is destructive.  The implication is that you had better
> not use this method to test if an object is an iterator until you are
> ready to take irrevocable action based on that test.  

The test would be non-destructive if the test only checks for the existence of 
the next() method. 
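
A sketch of such an existence check, assuming a present-day Python 3
(where the method is named __next__). The helper looks_like_iterator is
hypothetical. The collections.abc classes used for comparison come from
later Python versions and were not available at the time of this thread.

```python
from collections.abc import Iterable, Iterator


def looks_like_iterator(obj):
    """Check for the iterator methods without calling either of them."""
    return hasattr(obj, "__next__") and hasattr(obj, "__iter__")


gen = (n for n in range(3))

gen_is_iterator = looks_like_iterator(gen)
list_is_iterator = looks_like_iterator([1, 2, 3])

# The abstract base classes give the same answers via isinstance().
abc_agrees = isinstance(gen, Iterator) and not isinstance([], Iterator)
list_is_iterable = isinstance([], Iterable)

# Nothing was consumed by any of the checks above.
untouched = list(gen)

print(gen_is_iterator, list_is_iterator, abc_agrees, untouched)
```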

> Here is what I realized this morning.  It may be obvious to you,
> but it wasn't to me (until after I realized it, of course):
>      ``iterator'' and ``iterable'' are just two of many type
>      categories that exist in Python.
> Some other categories:
>      callable
>      sequence
>      generator
>      class
>      instance
>      type
>      number
>      integer
>      floating-point number
>      complex number
>      mutable
>      tuple
>      mapping
>      method
>      built-in
> As far as I know, there is no uniform method of determining into which
> category or categories a particular object falls.  Of course, there
> are non-uniform ways of doing so, but in general, those ways are, um,
> nonuniform.  Therefore, if you want to check whether an object is in
> one of these categories, you haven't necessarily learned much about
> how to check if it is in a different one of these categories.
> So what I wonder is this:  Has there been much thought about making
> these type categories more explicitly part of the type system?

The category names look like general purpose interface names. The addition of 
interfaces has been discussed quite a bit. While many people are interested 
in having interfaces added to Python, there are many design issues that will 
have to be resolved before it happens. Hopefully the removal of the 
class/type wart and the use of interfaces in Zope will hasten the addition of 

I like your list of the basic Python interfaces. Perhaps a weak version of 
interface definitions could be added to Python prior to a full featured 
capability. The weak version would simply add a __category__ attribute to 
each type definition. This attribute would reference an object that defines 
the distinguishing features of the category interface.  Enforcement would be 
optional, but at least the definition would be published.  Adding just the 
definition of the type interface would not create a direct benefit, but it would 
provide a hook for developers to use in work on optimization and testing.
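
One possible reading of this weak version, sketched in present-day
Python 3. Everything here is hypothetical: Category, ITERATOR, Countdown
and conforms() are names invented for illustration, and the attribute
list merely stands in for "the distinguishing features" of a category.

```python
class Category:
    """Published description of a category: a name plus required attributes."""

    def __init__(self, name, required):
        self.name = name
        self.required = tuple(required)

    def missing(self, obj):
        """Return the required attribute names that obj lacks."""
        return [attr for attr in self.required if not hasattr(obj, attr)]


ITERATOR = Category("iterator", ["__iter__", "__next__"])


class Countdown:
    __category__ = ITERATOR     # the proposed marker attribute

    def __init__(self, start):
        self.n = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1


def conforms(obj):
    """Optional enforcement: does obj provide what its category publishes?"""
    category = getattr(type(obj), "__category__", None)
    return category is not None and not category.missing(obj)


print(conforms(Countdown(3)), list(Countdown(3)), conforms(object()))
```

The check is opt-in, in keeping with the idea that enforcement would be
optional while the definition itself is always published.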

From  Tue Aug 13 20:49:33 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 21:49:33 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> But if you pass the normalized string (or the Latin-1 string) to
> open(), will it find the file?  I.e. if the filesystem has the
> unnormalized name stored in its directory, will filesystem requests
> normalize filenames before comparing them?
> Jack, can you try to do that?  Can you try open('fr\xf6r') in that
> directory?

If my understanding of OS X is correct, then this won't work: OS X
demands UTF-8 for all file names.

The interesting question is whether u"fr\xf6r".encode("utf-8") allows
one to open the file. If that won't work, it could be considered a bug
in OS X, and I trust Apple that they can get such things right (if
they had considered them).

BTW, the same question holds on Windows: If you create a file on NTFS
with \xf6 in it, can you open it by passing \x6f\u0308? I can't try at
the moment...


From  Tue Aug 13 21:02:21 2002
From: (Brian Quinlan)
Date: Tue, 13 Aug 2002 13:02:21 -0700
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <002b01c24304$56108a90$bd5d4540@Dell2>

> If my understanding of OS X is correct, then this won't work: OS X
> demands UTF-8 for all file names.

That is correct, at least at the BSD API level.

> The interesting question is whether u"fr\xf6r".encode("utf-8") allows
> one to open the file. If that won't work, it could be considered a bug
> in OS X, and I trust Apple that they can get such things right (if
> they had considered them).

It will work.


From  Tue Aug 13 21:02:07 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 22:02:07 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <001e01c242e5$49697ff0$bd5d4540@Dell2>
Message-ID: <>

Guido van Rossum <> writes:

> > It could be that Apple is decomposing the filenames before comparing
> > them. Either way works.
> Hm, that sucks (either way) -- because you get unnormalized Unicode
> out of directory listings, which is harder to turn into local
> encodings.

Notice that, most likely, Apple *does* normalize them - they just use
Normal Form D (which favours decomposition, instead of using
precomposed characters) - this is what Apple apparently calls
"canonical".

That choice is not surprising - NFD is "more logical", as precomposed
characters are available only arbitrarily (e.g. the WITH TILDE
combinations exist for a, i, e, n, o, u, v, y, but not for, say, x).

The Unicode FAQ says

Q: Which forms of normalization should I support?

A: The choice of which to use depends on the particular program or
system.  The most commonly supported form is NFC, since it is more
compatible with strings converted from legacy encodings. This is also
the choice for the web, as per the recommendations in "Character Model
for the World Wide Web" from the W3C. The other normalization forms
are useful for other domains.

So I guess Python should at least provide NFC - precisely because of
the legacy encodings.
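
For illustration, here is how the two forms compare in a later Python,
assuming the unicodedata.normalize() function that newer versions of the
standard library provide (it was not available when this message was
written). The file name is the one from this thread.

```python
import unicodedata

composed = "fr\u00f6r"                       # o-umlaut as one code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits the umlaut into 'o' plus U+0308 COMBINING DIAERESIS.
print([hex(ord(c)) for c in decomposed])
print(decomposed.encode("utf-8"))            # bytes a UTF-8 filesystem API would see

# NFC puts the precomposed character back.
recomposed = unicodedata.normalize("NFC", decomposed)
```

The two strings compare unequal as they stand; only after normalizing both
to the same form do they match.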


From  Tue Aug 13 21:27:29 2002
From: (Martin v. Loewis)
Date: 13 Aug 2002 22:27:29 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> So what I wonder is this:  Has there been much thought about making
> these type categories more explicitly part of the type system?

Certainly. Such a feature has been called "interface" or "protocol"; I
usually associate with "interface" a static property (a type
implements an interface, by means of a declaration) and with
"protocol" a dynamic property (an object conforms to a protocol, by
acting according to the rules that the protocol set).

Your question exists in many variations. One of them led to the creation
of the types-sig, another one triggered papers titled "Optional Static
Typing", see

The most recent version of an attempt to make interfaces part of
Python is PEP 245,

I believe there is agreement by now that there will be a difference
between declared interfaces and implemented protocols: an object may
follow the protocol even if it did not declare the interface, and an
object may violate a protocol even if its type did declare the
interface.

Beyond that, there is little agreement.
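
The gap between declaring and conforming can be sketched with the abc
module that much later Python versions ship. This is only an illustration
of the distinction, not something proposed in this thread; Readable, Liar
and Duck are made-up names.

```python
from abc import ABC


class Readable(ABC):
    """A declared interface: things that claim to have read()."""


class Liar:
    pass                      # declares Readable below, but has no read()


class Duck:
    def read(self):           # follows the protocol, declares nothing
        return "quack"


Readable.register(Liar)       # declaration only; nothing is verified

declared = isinstance(Liar(), Readable)      # True, yet read() is missing
undeclared = isinstance(Duck(), Readable)    # False, yet read() works
```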


From  Tue Aug 13 21:29:09 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 16:29:09 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 13:46:52 +0200."
References: <>
Message-ID: <>

> >> If we switch to "k" for integers in the range -2**31..2**31-1 that
> >> would not be too much work, as a lot of the code is generated (I would
> >> take the quick and dirty approach of using k for all my integers). Only
> >> the hand-written code would have to be massaged by hand.
> >
> > Glad, that's my preferred choice too.  But note that in Python 2.4 and
> > beyond, 'k' will only accept positive inputs, so you'll really have to
> > find a way to mark your signed integer arguments up differently.
> Huh??! Now you've confused me. If "k" means "32 bit mask", why would it 
> be changed in 2.4 not to accept negative values? "-1" is a perfectly 
> normal way to specify "0xffffffff" in C usage...

Hm, in Python I'd hope that people would write 0xffffffff if they want
32 one bits -- -1L has an infinite number of one bits, and on 64-bit
systems, -1 has 64 one-bits instead of 32.  Most masks are formed by
taking a small positive constant (e.g. 1 or 0xff) and shifting it
left.  In Python 2.4 that will always return a positive value.

But if you really don't like this, we could do something different --
'k' could simply give you the lower 32 bits of the value.  (Or the
lower sizeof(long)*8 bits???).

> > In 2.3 (and 2.2.2), I propose the following semantics for 'k': if the
> > argument is a Python int, a signed value within the range
> > [INT_MIN,INT_MAX] is required; if it is a Python long, a nonnegative
> > value in the range [0, 2*INT_MAX+1] is required.  These are the same
> > semantics that are currently used by struct.pack() for 'L', I found
> > out; I like these.
> I don't see the point, really. Why not allow [INT_MIN, 2*INT_MAX+1]? If 
> the "k" specifier is especially meant for bit patterns why not have 
> semantics of "anything goes, unless we are absolutely sure it isn't 
> going to fit"?

In the end, I see two possibilities: lenient, taking the lower N bits,
or strict, requiring [0 .. 2**32-1].  The proposal I made above was an
intermediate move on the way to the strict approach (given the reality
that in 2.3, 1<<31 is negative).
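
The two end points can be modelled in pure Python with unbounded integers.
This is only a sketch of the semantics being discussed, not of the C
implementation; k_lenient and k_strict are hypothetical names.

```python
MASK32 = 0xFFFFFFFF


def k_lenient(value):
    """Keep the low 32 bits, two's complement for negatives."""
    return value & MASK32


def k_strict(value):
    """Accept only 0 .. 2**32-1."""
    if not 0 <= value <= MASK32:
        raise OverflowError("value out of range for a 32-bit mask")
    return value
```

Under the lenient rule -1 becomes 0xffffffff and anything above bit 31 is
silently dropped; under the strict rule both cases raise.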

> > We'll have to niggle about the C type corresponding to 'k'.  Should it
> > be 'int' or 'long'?  It may not matter for you, since you expect to be
> > running on 32-bit hardware forever; but it matters for other potential
> > users of 'k'.  We could also have both 'k' and 'K', where 'k' stores
> > into a C int and 'K' into a C long.
> How about k1 for a byte, k2 for a short, k4 for a long and k8 for a long 
> long?

Hm, the format characters typically correspond to a specific C type.
We already have 'b' for unsigned char and 'B' for signed/unsigned
char, 'h' for unsigned short and 'H' for signed/unsigned short.  These
are unfortunately inconsistent with 'i' for signed int and 'l' for
signed long.

So I'd rather you pick a C type for 'k' (and a policy about range
checks).

> > I also propose to have a C API PyInt_AsUnsignedLong, which will
> > implement the semantics of 'K'.  Like 'i', 'k' will have to do an
> > explicit range test.
> In my proposal these would then probably become PyInt_As1Byte, 
> PyInt_As2Bytes, PyInt_As4Bytes and PyInt_As8Bytes.

And what would their return types be?

--Guido van Rossum

From  Tue Aug 13 22:02:01 2002
From: (M.-A. Lemburg)
Date: Tue, 13 Aug 2002 23:02:01 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
References: <001e01c242e5$49697ff0$bd5d4540@Dell2>	<> <>
Message-ID: <>

Martin v. Loewis wrote:
> Guido van Rossum <> writes:
>>>It could be that Apple is decomposing the filenames before comparing
>>>them. Either way works.
>>Hm, that sucks (either way) -- because you get unnormalized Unicode
>>out of directory listings, which is harder to turn into local
>>encodings.
> Notice that, most likely, Apple *does* normalize them - they just use
> Normal Form D (which favours decomposition, instead of using
> precomposed characters) - this is what Apple apparently calls
> "canonical".

Both the decomposition and the composition are called "canonical" --
simply because both operations lead to predefined results (those
defined by the Unicode database).

has all the details.

As always with Unicode, things are slightly more complicated than
what people are normally used to (but for good reasons). The introduction
of that tech report describes these things in detail. Canonical
equivalence basically means that the graphemes for the Unicode
code points when rendered look the same to the user -- even though
the code point combinations may be different.

Normalization takes care of mapping this visual equivalence to
an algorithm.

Now, if the OS uses canonical equivalence to find file names,
then all possible combinations of code points resulting in the
same sequence of graphemes will give you a match; for a good
reason: because the user of a GUI file manager wouldn't be
able to distinguish between two canonically equivalent file
names.
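
A lookup along those lines could be sketched as follows, assuming the
unicodedata.normalize() function of later Python versions. The helper name
find_equivalent and the sample listing are hypothetical, and how a real
filesystem performs the comparison is not known from this thread.

```python
import unicodedata


def find_equivalent(name, listing):
    """Return the stored entry canonically equivalent to name, or None."""
    wanted = unicodedata.normalize("NFD", name)
    for entry in listing:
        if unicodedata.normalize("NFD", entry) == wanted:
            return entry
    return None


listing = [".DS_Store", "fro\u0308r"]        # names as the directory stores them
```

Asking for the precomposed "fr\u00f6r" then finds the decomposed entry,
because both sides are reduced to the same form before comparing.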

> That choice is not surprising - NFD is "more logical", as precomposed
> characters are available only arbitrarily (e.g. the WITH TILDE
> combinations exist for a, i, e, n, o, u, v, y, but not for, say, x).

... but in a well-defined manner and that's what's important.

> The Unicode FAQ says
> Q: Which forms of normalization should I support?
> A: The choice of which to use depends on the particular program or
> system.  The most commonly supported form is NFC, since it is more
> compatible with strings converted from legacy encodings. This is also
> the choice for the web, as per the recommendations in "Character Model
> for the World Wide Web" from the W3C. The other normalization forms
> are useful for other domains.
> So I guess Python should at least provide NFC - precisely because of
> the legacy encodings.

At least is good :-) NFC is NFD + canonical composition. Decomposition
isn't all that hard (using unicodedata.decomposition()). For
composition the situation is different: not all information is
available in the unicodedata database (the exclusion list) and
the database also doesn't provide the reverse mapping from
decomposed code points to composed ones. See the Annexes to the
tech report to get an impression of just how hard combining is...
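
As a rough sketch of the easy half, a single decomposition step can be
built on unicodedata.decomposition(), shown here in current Python syntax.
The helper name decompose_once is made up, and a full NFD would also have
to apply the step recursively and reorder combining marks, which this does
not attempt.

```python
import unicodedata


def decompose_once(ch):
    """One canonical decomposition step; compatibility mappings are skipped."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch                  # no mapping, or a tagged compatibility one
    return "".join(chr(int(cp, 16)) for cp in mapping.split())
```

Going the other way needs the reverse table and the composition exclusion
list, which is exactly the information described above as missing.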

Still, would be nice to have (written in C for speed, since
this would be a very common operation). Zope Corp. will certainly
be interested in this for Zope3 ;-)

Marc-Andre Lemburg
CEO Software GmbH

From  Tue Aug 13 22:15:58 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 17:15:58 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 13 Aug 2002 14:02:27 EDT."
References: <>
Message-ID: <>

> While I was driving to work today, I had a thought about the
> iterator/iterable discussion of a few weeks ago.  My impression is
> that that discussion was inconclusive, but a few general principles
> emerged from it:
> 	1) Some types are iterators -- that is, they support calls
> 	   to next() and raise StopIteration when they have no more
> 	   information to give.
> 	2) Some types are iterables -- that is, they support calls
> 	   to __iter__() that yield an iterator as the result.
> 	3) Every iterator is also an iterable, because iterators are
> 	   required to implement __iter__() as well as next().
> 	4) The way to determine whether an object is an iterator
> 	   is to call its next() method and see what happens.
> 	5) The way to determine whether an object is an iterable
> 	   is to call its __iter__() method and see what happens.
> I'm uneasy about (4) because if an object is an iterator, calling its
> next() method is destructive.  The implication is that you had better
> not use this method to test if an object is an iterator until you are
> ready to take irrevocable action based on that test.  On the other
> hand, calling __iter__() is safe, which means that you can test
> nondestructively whether an object is an iterable, which includes
> all iterators.

Alex Martelli introduced the "Look Before You Leap" (LBYL) syndrome
for your uneasiness with (4) (and (5), I might add -- I don't know
that __iter__ is always safe).  He contrasts it with a different
attitude, which might be summarized as "It's easier to ask forgiveness
than permission."  In many cases, there is no reason for LBYL
syndrome, and it can actually cause subtle bugs.  For example, a LBYL
programmer could write

  if not os.path.exists(fn):
    print "File doesn't exist:", fn
    return
  fp = open(fn)
  ...use fp...

A "forgiveness" programmer would write this as follows instead:

  try:
    fp = open(fn)
  except IOError, msg:
    print "Can't open", fn, ":", msg
    return
  ...use fp...

The latter is safer; there are many reasons why the open() call in the
first version could fail despite the exists() test succeeding,
including insufficient permissions, lack of operating resources, a
hard file lock, or another process that removed the file in the mean
time.

While it's not an absolute rule, I tend to dislike interface/protocol
checking as an example of LBYL syndrome.  I prefer to write this:

  def f(x):
    print x[0]

rather than this:

  def f(x):
    if not hasattr(x, "__getitem__"):
      raise TypeError, "%r doesn't support __getitem__" % x
    print x[0]

Admittedly this is an extreme example that looks rather silly, but
similar type checks are common in Python code written by people coming
from languages with stronger typing (and a bit of paranoia).

The exception is when you need to do something different based on the
type of an object and you can't add a method for what you want to do.
But that is relatively rare.
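
One such case, sketched in current Python syntax: flattening nested
containers, where strings must be treated as leaves even though they are
iterable, and no method can be added to the built-in types. The function
name flatten is just an example; everything else is probed by trying
iter() and catching the failure rather than inspecting the type.

```python
def flatten(obj):
    """Yield leaves of nested containers; strings count as leaves."""
    if isinstance(obj, (str, bytes)):
        yield obj                 # the one place where the type decides
        return
    try:
        children = iter(obj)      # ask forgiveness rather than inspect the type
    except TypeError:
        yield obj                 # not iterable: a leaf
        return
    for child in children:
        yield from flatten(child)
```

Without the string check the recursion would never end, since iterating a
one-character string yields that same string again.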

> Here is what I realized this morning.  It may be obvious to you,
> but it wasn't to me (until after I realized it, of course):
>      ``iterator'' and ``iterable'' are just two of many type
>      categories that exist in Python.
> Some other categories:
>      callable
>      sequence
>      generator
>      class
>      instance
>      type
>      number
>      integer
>      floating-point number
>      complex number
>      mutable
>      tuple
>      mapping
>      method
>      built-in

You missed the two that are most commonly needed in practice: string
and file. :-)  I believe that the notion of an informal or "lore" (as
Jim Fulton likes to call it) protocol first became apparent when we
started to use the idea of a "file-like object" as a valid value for
sys.stdout.

> As far as I know, there is no uniform method of determining into which
> category or categories a particular object falls.  Of course, there
> are non-uniform ways of doing so, but in general, those ways are, um,
> nonuniform.  Therefore, if you want to check whether an object is in
> one of these categories, you haven't necessarily learned much about
> how to check if it is in a different one of these categories.
> So what I wonder is this:  Has there been much thought about making
> these type categories more explicitly part of the type system?

I think this has been answered by other respondents.

Interestingly enough, Jim Fulton asked me to critique the Interface
package as it exists in Zope 3, from the perspective of adding
(something like) it to Python 2.3.

This is a descendant of the "scarecrow" proposal, (see

The Zope3 implementation can be viewed here:

--Guido van Rossum

From  Tue Aug 13 22:14:30 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 23:14:30 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On dinsdag, augustus 13, 2002, at 03:01 , Guido van Rossum wrote:
> Looks like it isn't you: the filename somehow contains a character
> that's not in the Latin-1 subset of Unicode, and no encoding can fix
> that for you.  I don't know why -- you'll have to figure out why your
> keyboard generates that character when you type o-umlaut.

No, it's the way the filesystem stores filenames, apparently.
Or, at least, it's the way the filesystem API's expose those
filenames. Here's a session again (this time I'm using the
terminal in utf-8 mode):

 >>> x = "fr\xc3\xb6r"
 >>> os.listdir(".")
 >>> open(x, "w")
<open file 'frör', mode 'w' at 0x130838>
 >>> os.listdir(".")
['.DS_Store', 'fro\xcc\x88r']
 >>> os.path.exists('fro\xcc\x88r')
 >>> os.path.exists("fr\xc3\xb6r")

If I create a file with an o-umlaut it gets decomposed into an o
and a combining umlaut.

[Jack goes off and wrestles his way through a gazillion websites
with Unicode information]

If I understand the unicode standard (according to
correctly this means that MacOS stores filenames in NFD
normalized form, with all combining characters split out, and
this is the preferred normalized form. Am I correct here?

But, even if NFC is the preferred normalized form (the documents
I saw hinted that this may have been the case in previous
Unicode standards:-): both NFC and NFD renditions of this string
are legal unicode, aren't they? And if they are then both should
be converted to the same latin-1 string, shouldn't they?

Do I misunderstand something, or is this a bug (limitation?)
in the unicode->latin-1 decoder?

- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -

From  Tue Aug 13 22:27:19 2002
From: (Andrew Koenig)
Date: Tue, 13 Aug 2002 17:27:19 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Guido
 van Rossum on Tue, 13 Aug 2002 17:15:58 -0400)
References: <> <>
Message-ID: <>

Guido> Alex Martelli introduced the "Look Before You Leap" (LBYL) syndrome
Guido> for your uneasiness with (4) (and (5), I might add -- I don't know
Guido> that __iter__ is always safe).  He contrasts it with a different
Guido> attitude, which might be summarized as "It's easier to ask forgiveness
Guido> than permission."  In many cases, there is no reason for LBYL
Guido> syndrome, and it can actually cause subtle bugs.  For example, a LBYL
Guido> programmer could write

Guido>   if not os.path.exists(fn):
Guido>     print "File doesn't exist:", fn
Guido>     return
Guido>   fp = open(fn)
Guido>   ...use fp...

Guido> A "forgiveness" programmer would write this as follows instead:

Guido>   try:
Guido>     fp = open(fn)
Guido>   except IOError, msg:
Guido>     print "Can't open", fn, ":", msg
Guido>     return
Guido>   ...use fp...

Guido> The latter is safer; there are many reasons why the open() call in the
Guido> first version could fail despite the exists() test succeeding,
Guido> including insufficient permissions, lack of operating resources, a
Guido> hard file lock, or another process that removed the file in the mean
Guido> time.

Guido> While it's not an absolute rule, I tend to dislike interface/protocol
Guido> checking as an example of LBYL syndrome.  I prefer to write this:

Guido>   def f(x):
Guido>     print x[0]

Guido> rather than this:

Guido>   def f(x):
Guido>     if not hasattr(x, "__getitem__"):
Guido>       raise TypeError, "%r doesn't support __getitem__" % x
Guido>     print x[0]

I completely agree with you so far.  If you have an object that you
know that you intend to use in only a single way, it is usually right
to just go ahead and use it that way rather than asking first.

Guido> Admittedly this is an extreme example that looks rather silly,
Guido> but similar type checks are common in Python code written by
Guido> people coming from languages with stronger typing (and a bit of
Guido> paranoia).

Guido> The exception is when you need to do something different based
Guido> on the type of an object and you can't add a method for what
Guido> you want to do.  But that is relatively rare.

Perhaps the reason it's rare is that it's difficult to do.

One of the cases I was thinking of was the built-in * operator,
which does something completely different if one of its operands
is an integer.  Another one was the buffering iterator we were
discussing earlier, which ideally would omit buffering entirely
if asked to buffer a type that already supports multiple iteration.
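
A sketch of that second case in current Python: it relies on the
convention that an iterator returns itself from iter(), which is a
heuristic rather than a guarantee. The name reiterable is hypothetical.

```python
def reiterable(source):
    """Return something that can be looped over more than once."""
    if iter(source) is source:
        # An iterator is its own iterator: single pass, so buffer it.
        return list(source)
    return source                  # assume a container; no copy needed
```

A list passes through untouched, while a generator is drained into a list
once and can then be traversed repeatedly.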

>> Some other categories:

>> callable
>> sequence
>> generator
>> class
>> instance
>> type
>> number
>> integer
>> floating-point number
>> complex number
>> mutable
>> tuple
>> mapping
>> method
>> built-in

Guido> You missed the two that are most commonly needed in practice:
Guido> string and file. :-)

Actually, I thought of them but omitted them to avoid confusion between
a type and a category with a single element.

Guido> I believe that the notion of an informal or "lore" (as Jim
Guido> Fulton likes to call it) protocol first became apparent when we
Guido> started to use the idea of a "file-like object" as a valid
Guido> value for sys.stdout.

OK.  So what I'm asking about is a way of making notions such as
"file-like object" more formal and/or automatic.

Of course, one reason for my interest is my experience with a
language that supports compile-time overloading -- what I'm really
seeing on the horizon is some kind of notion of overloading in
Python, perhaps along the lines of ML's clausal function definitions
(which I think are truly elegant).

Guido> Interestingly enough, Jim Fulton asked me to critique the Interface
Guido> package as it exists in Zope 3, from the perspective of adding
Guido> (something like) it to Python 2.3.

Guido> This is a descendant of the "scarecrow" proposal,
Guido> (see
Guido> also

Guido> The Zope3 implementation can be viewed here:

I'll have a look; thanks!

From  Tue Aug 13 22:34:43 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 23:34:43 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On dinsdag, augustus 13, 2002, at 11:14 , Jack Jansen wrote:
> If I create a file with an o-umlaut it gets decomposed into an 
> o and a combining umlaut.

After a few more experiments I did manage to confuse the 
filesystem APIs: it turns out ligatures are not correctly 
decomposed. I.e. if you create a file "\uFB03" you cannot open 
it as "ffi".
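
One possible explanation, offered here as an assumption rather than a
statement about what OS X actually does: the Unicode database gives the
ffi ligature only a compatibility decomposition, so a canonical (NFD)
decomposition leaves it alone. With the unicodedata.normalize() function
of later Python versions this can be checked directly.

```python
import unicodedata

ligature = "\ufb03"                       # LATIN SMALL LIGATURE FFI

# Canonical decomposition does not touch the ligature ...
canonical = unicodedata.normalize("NFD", ligature)

# ... only the compatibility form expands it to three letters.
compatible = unicodedata.normalize("NFKD", ligature)
```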

- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Tue Aug 13 22:51:29 2002
From: (Jack Jansen)
Date: Tue, 13 Aug 2002 23:51:29 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On dinsdag, augustus 13, 2002, at 10:29 , Guido van Rossum wrote:
>> Huh??! Now you've confused me. If "k" means "32 bit mask", why 
>> would it
>> be changed in 2.4 not to accept negative values? "-1" is a perfectly
>> normal way to specify "0xffffffff" in C usage...
> Hm, in Python I'd hope that people would write 0xffffffff if they want
> 32 one bits -- -1L has an infinite number of one bits, and on 64-bit
> systems, -1 has 64 one-bits instead of 32.  Most masks are formed by
> taking a small positive constant (e.g. 1 or 0xff) and shifting it
> left.  In Python 2.4 that will always return a positive value.

That is all fine if you're on an island. But if you transcribe 
existing C code to Python, or use examples or manuals written 
for C, then I would think there's no reason not to be lenient.

But (you hear Jack's reasoning collapsing in the distance) I 
haven't checked that Apple still uses -1 to mean "all ones" in 
their sample code. They used to do that a lot, but they may have 
stopped that. I don't know.

> In the end, I see two possibilities: lenient, taking the lower N bits,
> or strict, requiring [0 .. 2**32-1].  The proposal I made above was an
> intermediate move on the way to the strict approach (given the reality
> that in 2.3, 1<<31 is negative).

I would say strict on positive values, lenient on negative ones. 
Not too lenient, of course: -0x100000000L should not be a passed 
as 0 but give an exception.

This would correspond to everyday use in languages such as C.
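
In pure Python terms that rule might look like the sketch below. The lower
bound of INT_MIN is an assumption taken from the earlier
[INT_MIN, 2*INT_MAX+1] suggestion, and k_hybrid is a hypothetical name.

```python
def k_hybrid(value):
    """Strict on positives, lenient on negatives that fit in 32 bits."""
    if value > 0xFFFFFFFF or value < -0x80000000:
        raise OverflowError("value does not fit in 32 bits")
    return value & 0xFFFFFFFF      # negatives map to their two's complement
```

With this, -1 is accepted as 0xffffffff while -0x100000000 raises instead
of quietly becoming 0.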

>>> We'll have to niggle about the C type corresponding to 'k'.  
>>> Should it
>>> be 'int' or 'long'?  It may not matter for you, since you 
>>> expect to be
>>> running on 32-bit hardware forever; but it matters for other 
>>> potential
>>> users of 'k'.  We could also have both 'k' and 'K', where 'k' stores
>>> into a C int and 'K' into a C long.
>> How about k1 for a byte, k2 for a short, k4 for a long and k8 
>> for a long
>> long?
> Hm, the format characters typically correspond to a specific C type.
> We already have 'b' for unsigned char and 'B' for signed/unsigned
> char, 'h' for unsigned short and 'H' for signed/unsigned short.  These
> are unfortunately inconsistent with 'i' for signed int and 'l' for
> signed long.
> So I'd rather you pick a C type for 'k' (and a policy about range
> checks).

ok. How about uint32_t? And, while we're at it, add Q for uint64_t?

>> In my proposal these would then probably become PyInt_As1Byte,
>> PyInt_As2Bytes, PyInt_As4Bytes and PyInt_As8Bytes.
> And what would their return types be?

uint8_t, uint16_t, uint32_t and uint64_t.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Tue Aug 13 22:49:16 2002
From: (Paul Svensson)
Date: Tue, 13 Aug 2002 17:49:16 -0400 (EDT)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On Tue, 13 Aug 2002, Guido van Rossum wrote:

>> >> If we switch to "k" for integers in the range -2**31..2**31-1 that
>> >> would not be too much work, as a lot of the code is generated (I would
>> >> take the quick and dirty approach of using k for all my integers). Only
>> >> the hand-written code would have to be massaged by hand.
>> >
>> > Glad, that's my preferred choice too.  But note that in Python 2.4 and
>> > beyond, 'k' will only accept positive inputs, so you'll really have to
>> > find a way to mark your signed integer arguments up differently.
>> Huh??! Now you've confused me. If "k" means "32 bit mask", why would it
>> be changed in 2.4 not to accept negative values? "-1" is a perfectly
>> normal way to specify "0xffffffff" in C usage...
>Hm, in Python I'd hope that people would write 0xffffffff if they want
>32 one bits -- -1L has an infinite number of one bits, and on 64-bit
>systems, -1 has 64 one-bits instead of 32.  Most masks are formed by
>taking a small positive constant (e.g. 1 or 0xff) and shifting it
>left.  In Python 2.4 that will always return a positive value.
>But if you really don't like this, we could do something different --
>'k' could simply give you the lower 32 bits of the value.  (Or the
>lower sizeof(long)*8 bits???).

For a mask, it makes some kind of sense to require all the high bits,
those not part of the mask, to be all the same; it makes it less likely
that something important gets lost when they're trimmed off.
But, don't all ones make just as much sense as all zeros ?

Even with unified numbers, -1 (or ~0) seems to be a reasonable
way to spell a bitmask with all bits set, without having to know
how many "all" are.
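
Both ideas are easy to express with unbounded Python integers. The helper
names below are made up for this sketch.

```python
def all_ones(width):
    """-1 masked to the requested width: every bit set, whatever the width."""
    return -1 & ((1 << width) - 1)


def high_bits_uniform(value, width):
    """True if the bits above `width` are all zeros or all ones."""
    return (value >> width) in (0, -1)
```

The caller writes -1 once and the width is applied wherever the value is
finally stored, which is the point being made about not needing to know
how many "all" are.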


From  Tue Aug 13 22:36:55 2002
From: (David Abrahams)
Date: Tue, 13 Aug 2002 17:36:55 -0400
Subject: [Python-Dev] type categories
References: <>  <>
Message-ID: <0c0501c24311$8cebbdc0$>

From: "Guido van Rossum" <>

> Alex Martelli introduced the "Look Before You Leap" (LBYL) syndrome
> for your uneasiness with (4) (and (5), I might add -- I don't know
> that __iter__ is always safe).  He contrasts it with a different
> attitude, which might be summarized as "It's easier to ask forgiveness
> than permission."  In many cases, there is no reason for LBYL
> syndrome, and it can actually cause subtle bugs.

> While it's not an absolute rule, I tend to dislike interface/protocol
> checking as an example of LBYL syndrome.


> The exception is when you need to do something different based on the
> type of an object and you can't add a method for what you want to do.
> But that is relatively rare.

The main reason I want to be able to LBYL (and, AFAICT, it's the same as
Alex's reason) is to support multiple dispatch. In other words, it wouldn't
be user code doing the looking. The best reason to support protocol
introspection is so that we can provide users with a way to write
more-elegant code, instead of messing around with manual type inspection.
What's your position on multiple dispatch?
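
To make the term concrete, here is a deliberately naive sketch of
dispatching on the types of all arguments, in current Python syntax. The
register/dispatch names are hypothetical, and the first matching entry
wins, with no attempt to rank candidates by specificity.

```python
_registry = {}


def register(*types):
    """Decorator: record an implementation for one tuple of argument types."""
    def decorator(func):
        _registry[types] = func
        return func
    return decorator


def dispatch(*args):
    """Call the first registered implementation whose types match all args."""
    for types, func in _registry.items():
        if len(types) == len(args) and all(
            isinstance(a, t) for a, t in zip(args, types)
        ):
            return func(*args)
    raise TypeError("no implementation for %r" % (tuple(map(type, args)),))


@register(int, int)
def _(a, b):
    return "int/int"


@register(str, int)
def _(a, b):
    return "str/int"
```

The looking-before-leaping happens once, inside dispatch(), rather than in
every piece of user code.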


From  Tue Aug 13 23:30:00 2002
From: (Skip Montanaro)
Date: Tue, 13 Aug 2002 17:30:00 -0500
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <001e01c242e5$49697ff0$bd5d4540@Dell2>
Message-ID: <15705.34920.804857.914875@localhost.localdomain>

    mal> As always with Unicode, things are slightly more complicated than
    mal> what people are normally used to ...

What's the current behavior?  If my program receives an input in utf-8
(let's say it comes from a form on a website), what form will it be in, or
can't I tell?  Is it possible I will get spurious inequalities today if I
compare two different unicode objects which were created from different
sources and in different normal forms?  What about a string and a unicode
object?  Where can I read all about it (Python and unicode normalization)?


From  Wed Aug 14 03:39:09 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 22:39:09 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Tue, 13 Aug 2002 23:51:29 +0200."
References: <>
Message-ID: <>

> > In the end, I see two possibilities: lenient, taking the lower N bits,
> > or strict, requiring [0 .. 2**32-1].  The proposal I made above was an
> > intermediate move on the way to the strict approach (given the reality
> > that in 2.3, 1<<31 is negative).
> I would say strict on positive values, lenient on negative ones. 
> Not too lenient, of course: -0x100000000L should not be a passed 
> as 0 but give an exception.

I'm not sure I like this.  On the one hand this is what the struct
module does currently for 'L'.  On the other hand it seems not to
provide any more safety than simply taking the low N bits (using 2's
complement for negative values) and throwing the rest away.

> This would correspond to everyday use in languages such as C.

Actually, C is fairly careful: AFAIK on a 32-bit machine the type of
0xffffffff is unsigned long, so it's not strictly -1, and you'll have
to use a cast somewhere to be able to compare it to an int.

> > Hm, the format characters typically correspond to a specific C type.
> > We already have 'b' for unsigned char and 'B' for signed/unsigned
> > char, 'h' for unsigned short and 'H' for signed/unsigned short.  These
> > are unfortunately inconsistent with 'i' for signed int and 'l' for
> > signed long.
> >
> > So I'd rather you pick a C type for 'k' (and a policy about range
> > checks).
> ok. How about uint32_t? And, while we're at it, add Q for uint64_t?
> >> In my proposal these would then probably become PyInt_As1Byte,
> >> PyInt_As2Bytes, PyInt_As4Bytes and PyInt_As8Bytes.
> >
> > And what would their return types be?
> uint8_t, uint16_t, uint32_t and uint64_t.

Hm.  This is a big deviation from tradition.  Those types aren't
currently used or defined.

How about the following counterproposal.  This also changes some of
the other format codes to be a little more regular.

Code    C type                  Range check

b       unsigned char           0..UCHAR_MAX
B       unsigned char           none **
h       unsigned short          0..USHRT_MAX
H       unsigned short          none **
i       int                     INT_MIN..INT_MAX
I *     unsigned int            0..UINT_MAX
k *     unsigned long           none
K *     unsigned long long      none


* New format codes.

** Changed from previous "range-and-a-half" to "none"; the
   range-and-a-half checking wasn't particularly useful.

If you need a uint32 mask, you can use the 'k' format and cast the
unsigned long you got to uint32; this should do the right thing.
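
The effect of that cast can be imitated from Python with the ctypes module
of later releases, whose integer types do no overflow checking. This only
models the arithmetic, it is not the proposed C code, and the helper names
are hypothetical.

```python
import ctypes


def as_k(value):
    """Mimic an unchecked 'k': keep whatever fits in a C unsigned long."""
    return ctypes.c_ulong(value).value


def as_uint32(value):
    """The cast from unsigned long down to a 32-bit mask."""
    return ctypes.c_uint32(as_k(value)).value
```

Whether unsigned long is 32 or 64 bits wide on the platform, -1 ends up as
0xffffffff after the second step.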

If you really prefer your proposal with specific sized types, perhaps
you can show some coding example that would be easier using specific
sizes rather than char/short/int/long/long long?

--Guido van Rossum

From  Wed Aug 14 03:42:12 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 22:42:12 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 13 Aug 2002 17:36:55 EDT."
References: <> <>
Message-ID: <>

> The main reason I want to be able to LBYL (and, AFAICT, it's the same as
> Alex's reason) is to support multiple dispatch.

But isn't your application one where the types are mapped from C++?
Then you should be able to dispatch on type() of the arguments.  Or am
I misunderstanding, and do you want to make multi-dispatch a standard
paradigm in Python?

> In other words, it wouldn't
> be user code doing the looking. The best reason to support protocol
> introspection is so that we can provide users with a way to write
> more-elegant code, instead of messing around with manual type inspection.
> What's your position on multiple dispatch?

That it's too inefficient in a language with run-time dispatch to even
think about it.

--Guido van Rossum (home page:

From  Wed Aug 14 04:16:55 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 23:16:55 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 13 Aug 2002 17:27:19 EDT."
References: <> <>
Message-ID: <>

> Guido> The exception is when you need to do something different based
> Guido> on the type of an object and you can't add a method for what
> Guido> you want to do.  But that is relatively rare.
> Perhaps the reason it's rare is that it's difficult to do.

Perhaps...  Is it the chicken or the egg?

> One of the cases I was thinking of was the built-in * operator,
> which does something completely different if one of its operands
> is an integer.

Really?  I suppose you're thinking of sequence repetition.  I consider
that one of my early mistakes (it didn't make it to my "regrets" list
but probably should have).  It would have been much simpler if
sequences simply supported multiplication, and in fact repeated changes
to the implementation (and subtle edge cases of the semantics) are
slowly nudging into this direction.

> Another one was the buffering iterator we were
> discussing earlier, which ideally would omit buffering entirely
> if asked to buffer a type that already supports multiple iteration.

How do you do that in C++?  I guess you overload the function that
asks for the iterator, and call that function in a template.  I think
in Python we can ask the caller to provide a buffering iterator when a
function needs one.  Since we really have very little power at compile
time, we sometimes need to do a little more work at run time.  But the
resulting language appears to be easier to understand (for most people
anyway) despite the theoretical deficiency.
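
One way to do that "little more work at run time" can be sketched in
modern Python 3.  The function name is hypothetical, and the test used
(an iterator returns itself from iter(), a re-iterable container does
not) is a common convention rather than anything guaranteed by the
language for every object.

```python
def rebufferable(iterable):
    """Return an object that can safely be iterated more than once."""
    # An iterator hands back itself from iter(); a container hands back
    # a fresh iterator each time, so it needs no buffering.
    if iter(iterable) is iterable:
        return list(iterable)  # one-shot iterator: buffer it
    return iterable            # already re-iterable: leave it alone
```

A list passed in comes back unchanged, while a generator comes back as
a list holding its items.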

I'm not quite sure why that is, but I am slowly developing a theory,
based on a remark by Samuele Pedroni; at least I believe it was he who
remarked at some point "Python has only run time", which got me
thinking.  My theory, partially developed though it is, is that it is
much harder (again, for most people :-) to understand in your head
what happens at compile time than it is to understand what goes on at run
time.  Or perhaps that understanding *both* is harder than
understanding only one.

But I believe that for most people acquiring a sufficient mental model
for what goes on at run time is simpler than the mental model for what
goes on at compile time.  Possibly this is because compilers really
*do* rely on very sophisticated algorithms (such as deciding which
overloading function is called based upon type information and
available conversions).  Run time on the other hand is dead simple
most of the time -- it has to be, since it has to be executed by a
machine that has a very limited time to make its decisions.

All this reminds me of a remark that I believe is due to John
Ousterhout at the VHLL conference in '94 in Santa Fe, where you & I
first met.  (Strangely it was Perl's Tom Christiansen who was in a
large part responsible for the eclectic program.)  You gave a talk
about ML, and I believe it was in response to your talk that John
remarked that ML was best suited for people with an IQ of over 150.
That rang true to me, since the only other person besides you that I
know who is a serious ML user definitely falls into that category.
And ML is definitely a language that does more than the average
language at compile time.

> >> Some other categories:
> >> callable
> >> sequence
> >> generator
> >> class
> >> instance
> >> type
> >> number
> >> integer
> >> floating-point number
> >> complex number
> >> mutable
> >> tuple
> >> mapping
> >> method
> >> built-in
> Guido> You missed the two that are most commonly needed in practice:
> Guido> string and file. :-)
> Actually, I thought of them but omitted them to avoid confusion between
> a type and a category with a single element.

Can you explain?  Neither string (which has Unicode and 8-bit, plus a
few other objects that are sufficiently string-like to be
regex-searchable, like arrays) nor file (at least in the "lore
protocol" interpretation of file-like object) are categories with a
single element.

> Guido> I believe that the notion of an informal or "lore" (as Jim
> Guido> Fulton likes to call it) protocol first became apparent when we
> Guido> started to use the idea of a "file-like object" as a valid
> Guido> value for sys.stdout.
> OK.  So what I'm asking about is a way of making notions such as
> "file-like object" more formal and/or automatic.

Yeah, that's the holy Grail of interfaces in Python.
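
For illustration only, here is how the "file-like object" lore protocol
can be written down and checked at run time in modern Python 3.  The
typing.Protocol machinery did not exist when this thread was written,
and the names Writable, Collector and capture are hypothetical.  The
run-time check only looks for the presence of a write attribute; it
does not verify signatures or behavior.

```python
from contextlib import redirect_stdout
from typing import Protocol, runtime_checkable


@runtime_checkable
class Writable(Protocol):
    """The part of the file protocol that print needs from sys.stdout."""

    def write(self, text: str) -> int: ...


class Collector:
    """A file-like object: it has write() but is not a real file."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)
        return len(text)


def capture(message):
    """Print message while sys.stdout is temporarily a Collector."""
    sink = Collector()
    with redirect_stdout(sink):
        print(message)
    return "".join(sink.chunks)
```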

> Of course, one reason for my interest is my experience with a
> language that supports compile-time overloading -- what I'm really
> seeing on the horizon is some kind of notion of overloading in
> Python, perhaps along the lines of ML's clausal function definitions
> (which I think are truly elegant).

Honestly, I hadn't read this far ahead when I brought up ML above. :-)

I really hope that the holy grail can be found at run time rather than
compile time.  Python's compile time doesn't have enough information
easily available, and to gather the necessary information is very
expensive (requiring whole-program analysis) and not 100% reliable
(due to Python's extreme dynamic side).

> Guido> Interestingly enough, Jim Fulton asked me to critique the Interface
> Guido> package as it exists in Zope 3, from the perspective of adding
> Guido> (something like) it to Python 2.3.
> Guido> This is a descendant of the "scarecrow" proposal,
> Guido> (see
> Guido> also
> Guido> The Zope3 implementation can be viewed here:
> Guido>
> I'll have a look; thanks!

BTW, the original scarecrow proposal is at

--Guido van Rossum (home page:

From  Wed Aug 14 04:31:40 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 23:31:40 -0400
Subject: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
Message-ID: <>

Lysator's snake-farm, which does regular builds of CVS Python
checkouts on a variety of uncommon platforms, has started reporting
two warnings that I don't understand.  (Never mind the
warnings; they're shallow; someone should fix them.)

The problem is the two exceptions ignored in __del__ methods.  If I
look at the code of the new module and its unittests, I see that there's a class mkstemped
defined in, which has a __del__ method that closes
the file descriptor.  The only way I can see this failing with an
AttributeError exception is if the instance never makes it through its
__init__ call.  But in that case I would have expected a failure
reported; the only instantiation of mkstemped() is inside a try/except
where the except clause calls self.failOnException() which causes the
unit tests to fail.  But the unittest doesn't report any failures?!
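
The failure mode described above can be reproduced with a small
stand-alone sketch in modern Python 3 (3.8 or later, for
sys.unraisablehook).  The Handle class is a hypothetical stand-in for
mkstemped, and the sketch relies on CPython's reference counting to run
__del__ promptly.

```python
import gc
import sys


class Handle:
    """Stand-in for a class whose __del__ assumes __init__ completed."""

    def __init__(self, fail):
        if fail:
            raise OSError("could not create the temporary file")
        self.fd = 3  # never reached when fail is true

    def __del__(self):
        _ = self.fd  # AttributeError if __init__ raised before setting fd


def demo():
    """Return the names of exceptions that were ignored in __del__."""
    caught = []
    previous_hook = sys.unraisablehook
    sys.unraisablehook = lambda info: caught.append(type(info.exc_value).__name__)
    try:
        try:
            Handle(fail=True)
        except OSError:
            pass  # the half-built instance is released after this block
        gc.collect()
    finally:
        sys.unraisablehook = previous_hook
    return caught
```

The instance is created before __init__ runs, so __del__ is still called
on it even though __init__ raised.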

I don't see this happening on Linux, so it's hard to go beyond

--Guido van Rossum (home page:

------- Forwarded Message

Date:    Tue, 13 Aug 2002 23:04:56 -0400
Subject: [farm-report] Build python-HP_UX-B.11.00-9000_829-taylor was successfu

Build test succeeded. Any warnings are appended below.
- --
/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/ DeprecationWarnin
g: hex/oct constants > sys.maxint will return positive values in Python 2.4 and
  LE_MAGIC = 0x950412de
/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/ DeprecationWarnin
g: hex/oct constants > sys.maxint will return positive values in Python 2.4 and
  BE_MAGIC = 0xde120495
/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/ DeprecationWarnin
g: hex/oct constants > sys.maxint will return positive values in Python 2.4 and
  MASK = 0xffffffff
Exception exceptions.AttributeError: "mkstemped instance has no attribute 'fd'"
 in <bound method mkstemped.__del__ of <test.test_tempfile.mkstemped instance a
t 0x40705ee0>> ignored
Exception exceptions.AttributeError: "mkstemped instance has no attribute 'fd'"
 in <bound method mkstemped.__del__ of <test.test_tempfile.mkstemped instance a
t 0x40afcdc8>> ignored


Snake-farm-report mailing list

------- End of Forwarded Message

From  Wed Aug 14 04:39:52 2002
From: (Guido van Rossum)
Date: Tue, 13 Aug 2002 23:39:52 -0400
Subject: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: Your message of "Tue, 13 Aug 2002 23:31:40 EDT."
References: <>
Message-ID: <>

> The problem is the two exceptions ignored in __del__ methods.  If I
> look at the code of the new module and its
> unittests, I see that there's a class mkstemped
> defined in, which has a __del__ method that closes
> the file descriptor.  The only way I can see this failing with an
> AttributeError exception is if the instance never makes it through its
> __init__ call.  But in that case I would have expected a failure
> reported; the only instantiation of mkstemped() is inside a try/except
> where the except clause calls self.failOnException() which causes the
> unit tests to fail.  But the unittest doesn't report any failures?!

Mmm, it seems the test script doesn't show the test output.  Maybe one
of the tests is failing, but "make test" doesn't fail as a result?  Or
only the first test run is failing?  "make test" ignores the result of
the first test run (the tests are run twice, once without .pyc files
in place, once with).

--Guido van Rossum (home page:

From  Wed Aug 14 04:21:00 2002
From: (David Abrahams)
Date: Tue, 13 Aug 2002 23:21:00 -0400
Subject: [Python-Dev] type categories
References: <> <>              <0c0501c24311$8cebbdc0$>  <>
Message-ID: <0ccc01c24341$d839b130$>

From: "Guido van Rossum" <>

> > The main reason I want to be able to LBYL (and, AFAICT, it's the same as
> > Alex's reason) is to support multiple dispatch.
> But isn't your application one where the types are mapped from C++?

Not all of them, not hardly! Boost.Python is about interoperability, not
just about wrapping C++. My users are writing functions that want to accept
any Python sequence as one argument (for some definition of "sequence").
They'd like to dispatch to different implementations of that function based
on whether that argument is a sequence or a scalar numeric type.
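
As a rough illustration of dispatching on a "type category" rather than
a concrete type, here is a sketch in modern Python 3.  It uses
functools.singledispatch and the abstract base classes in
collections.abc and numbers, none of which existed at the time of this
thread; the function name describe is hypothetical, and this is not how
Boost.Python does it.

```python
from collections.abc import Sequence
from functools import singledispatch
from numbers import Number


@singledispatch
def describe(arg):
    """Fallback when the argument belongs to no registered category."""
    raise TypeError("unsupported argument: %r" % (arg,))


@describe.register(Number)
def _(arg):
    return "scalar"


@describe.register(Sequence)
def _(arg):
    return "sequence"
```

Lists, tuples and strings all land in the Sequence implementation, and
ints and floats in the Number one, without the caller inspecting types.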

> Then you should be able to dispatch on type() of the arguments.  Or am
> I misunderstanding, and do you want to make multi-dispatch a standard
> paradigm in Python?


> > In other words, it wouldn't
> > be user code doing the looking. The best reason to support protocol
> > introspection is so that we can provide users with a way to write
> > more-elegant code, instead of messing around with manual type
> > inspection.
> > What's your position on multiple dispatch?
> That it's too inefficient in a language with run-time dispatch to even
> think about it.

That's funny, my users are very happy with how fast it works in
Boost.Python. I don't see any reason it should have to be much less
efficient in pure Python for most cases... the important "type categories"
could be builtins. And as others have pointed out, it could even be used to
get certain optimizations.


           David Abrahams * Boost Consulting *

From  Wed Aug 14 05:00:04 2002
From: (Tim Peters)
Date: Wed, 14 Aug 2002 00:00:04 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

> ...
> Actually, C is fairly careful: AFAIK on a 32-bit machine the type of
> 0xffffffff is unsigned long,

Close:  it's unsigned int.

> so it's not strictly -1,

Not close:  it's nothing at all like -1!  Try this:

#include <stdio.h>
int main(void) {printf("%d\n", 5 / 0xffffffff); return 0;}

If it "acted like" -1 this would print -5 instead of 0 (and if it doesn't
print 0, your compiler is broken).  Maybe more obvious is to do

    printf("%g\n", (double)0xffffffff);

That should print something close to <wink>




> and you'll have to use a cast somewhere to be able to compare it
> to an int.

If you want it treated like -1, definitely, because it's not -1.  If you
want it treated like 4294967295, then in the absence of an explicit cast the
int you're comparing it to will get silently promoted to unsigned int too
(or with a warning msg, if your compiler is helpful).
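
The same point can be checked from modern Python 3 with ctypes, whose
fixed-width integer types wrap silently instead of raising.  This is
only a sketch of the arithmetic, assuming the 32-bit int of the
discussion; it is not a substitute for reading the C promotion rules.

```python
import ctypes

# The bit pattern 0xffffffff read as a 32-bit unsigned and signed value.
as_unsigned = ctypes.c_uint32(0xFFFFFFFF).value
as_signed = ctypes.c_int32(0xFFFFFFFF).value

# Dividing 5 by each interpretation gives very different answers.
quotient_unsigned = 5 // as_unsigned
quotient_signed = 5 // as_signed
```

Only the signed reinterpretation behaves like -1; the value the C
literal actually has is the unsigned one.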

>> uint8_t, uint16_t, uint32_t and uint64_t.

> Hm.  This is a big deviation from tradition.  Those types aren't
> currently used or defined.

Nor are they required to exist, not even in C99, where all the new "exact
size" typedefs are optional -- some boxes simply don't have these types.
Most Cray boxes don't have a two-byte type, for example, and some don't have
a 32-bit type.

> ...
> If you really prefer your proposal with specific sized types, perhaps
> you can show some coding example that would be easier using specific
> sizes rather than char/short/int/long/long long?

Since we can't promise to supply specific-sized types, let's cut that short.
You never need specific-sized types, and Python-Dev has had this argument
before.  Whenever it's come up, the code that relied on specific-sized types
got simpler after making it portable.  What you do need is a type *at least*
as big as the size you need in the end (and C99 has required typedefs for
that concept; Python could grow some too).

From  Wed Aug 14 05:05:54 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 00:05:54 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 13 Aug 2002 23:21:00 EDT."
References: <> <> <0c0501c24311$8cebbdc0$> <>
Message-ID: <>

> That's funny, my users are very happy with how fast it works in
> Boost.Python. I don't see any reason it should have to be much less
> efficient in pure Python for most cases... the important "type categories"
> could be builtins. And as others have pointed out, it could even be used to
> get certain optimizations.

Time to write a PEP.  Maybe there's an implementation trick you
haven't told us about?

--Guido van Rossum (home page:

From David Abrahams" <  Wed Aug 14 04:59:29 2002
From: David Abrahams" < (David Abrahams)
Date: Tue, 13 Aug 2002 23:59:29 -0400
Subject: [Python-Dev] type categories
References: <> <> <0c0501c24311$8cebbdc0$> <>              <0ccc01c24341$d839b130$>  <>
Message-ID: <0ce601c24346$ff967f60$>

From: "Guido van Rossum" <>

> > That's funny, my users are very happy with how fast it works in
> > Boost.Python. I don't see any reason it should have to be much less
> > efficient in pure Python for most cases... the important "type
> > categories" could be builtins. And as others have pointed out, it
> > could even be used to get certain optimizations.
> Time to write a PEP.

I don't know how these things usually work, but isn't it a bit early for
that? I would like to have some discussion about multiple dispatch (and
especially matching criteria) before investing in a formal proposal. That's
what my earlier posting which got banished to the types-sig was trying to
do. Getting a feel for what people are thinking about this, and getting
feedback from those with lots more experience than I in matters Pythonic is
important to me.

> Maybe there's an implementation trick you haven't told us about?

There's not all that much to what I'm doing. I have a really simple-minded
dispatching scheme which checks each overload in sequence, and takes the
first one which can get a match for all arguments. That causes some
problems for people who want to overload on Python float vs. int types, for
example, because each one matches the other. When I get some time I plan to
move to a more-sophisticated scheme which rates each match and picks the
best one. It doesn't seem like it should cause a significant slowdown, but
that's just intuition (AKA bullshit) talking. My users generally think

    C++ = fast (but hard)
    Python = slow (but easy)

[no rude remarks from the peanut gallery, please!]
So they don't tend to worry too much about the speed at the Python/C++
boundary, where this mechanism lies. It could be that they don't notice the
cost because they're putting all time-critical functionality completely
inside the C++ part.
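
The simple-minded first-match scheme described above can be sketched in
modern Python 3 as follows.  This is an illustration under stated
assumptions, not Boost.Python's actual implementation: the FirstMatch
class and the converts_to_* predicates are hypothetical, and "matches"
is modeled as "is implicitly convertible".

```python
class FirstMatch:
    """Try each overload in registration order; call the first that fits."""

    def __init__(self):
        self._overloads = []

    def register(self, *matchers):
        def decorator(fn):
            self._overloads.append((matchers, fn))
            return fn
        return decorator

    def __call__(self, *args):
        for matchers, fn in self._overloads:
            if len(matchers) == len(args) and all(
                matches(arg) for matches, arg in zip(matchers, args)
            ):
                return fn(*args)
        raise TypeError("no matching overload")


def converts_to_int(x):
    return isinstance(x, (int, float))  # a float can be converted to int


def converts_to_float(x):
    return isinstance(x, (int, float))  # an int can be converted to float


describe = FirstMatch()


@describe.register(converts_to_int)
def _(x):
    return "int overload"


@describe.register(converts_to_float)
def _(x):
    return "float overload"
```

Because each predicate accepts the other type, describe(2.5) reaches the
int overload simply because it was registered first, which is the
float-vs-int problem mentioned above.  A rating scheme would score each
candidate and pick the best instead of the first.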

           David Abrahams * Boost Consulting *

From  Wed Aug 14 05:53:14 2002
From: (Tim Peters)
Date: Wed, 14 Aug 2002 00:53:14 -0400
Subject: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: <>
Message-ID: <>

> Lysator's snake-farm, which does regular builds of CVS Python
> checkouts on a variety of uncommon platforms, has started reporting
> two warnings that I don't understand.  (Never mind the
> warnings; they're shallow; someone should fix them.)

I submitted a patch for that to SF and assigned it to Barry (I have no idea
how to test

> The problem is the two exceptions ignored in __del__ methods.  If I
> look at the code of the new module and its
> unittests, I see that there's a class mkstemped
> defined in, which has a __del__ method that closes
> the file descriptor.  The only way I can see this failing with an
> AttributeError exception is if the instance never makes it through its
> __init__ call.

I agree, and, indeed, that's what would happen if it did fail during the
call to mkstemped.__init__().  So the call to tempfile._mkstemp_inner()
fails in two test cases (there were two distinct instances of the "no
attribute 'fd'" message), but we don't know which ones.

> ...
> But in that case I would have expected a failure reported; the only
> instantiation of mkstemped() is inside a try/except where the
> except clause calls self.failOnException() which causes the
> unit tests to fail.  But the unittest doesn't report any failures?!

Well, I didn't see *any* test output in the report, neither successes nor
failures, just Python-produced exceptions and warnings.  Maybe the script
only captures stderr?  A failing unittest run *under* doesn't
normally print anything to stderr.  It would have printed this to stdout,

test test_tempfile failed -- errors occurred; run in verbose mode for
1 test failed:

So even if we had that, it wouldn't have helped.  stdout from a regrtest -v
run is what we need, or from running directly (w/o

From  Wed Aug 14 06:06:35 2002
From: (Barry A. Warsaw)
Date: Wed, 14 Aug 2002 01:06:35 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
Message-ID: <>

Ok, I admit that I've only tangentially followed the thread on PEP 237
deprecation warnings, and I've just skimmed PEP 237 but I'm pretty
tired, so I must be missing something.

The deprecation warnings on compiling Lib/ are complaining
about these three hex constants:

    # Magic number of .mo files
    LE_MAGIC = 0x950412de
    BE_MAGIC = 0xde120495

    def _parse(self, fp):
        """Override this method to support alternative .mo formats."""
        # We need to & all 32 bit unsigned integers with 0xffffffff for
        # portability to 64 bit machines.
        MASK = 0xffffffff

These really are intended as 32 bit patterns, not signed integers.
Hex constants seem like the most straightforward way to spell such bit
patterns.  If I wanted MASK to be -1 I would have spelled it that way!

What gettext wants to do is to read the first 4 bytes from a file, &
it with the MASK to get a 32 bit pattern and then compare that against
two known patterns to see if we're looking at big-endian or
little-endian.  This is as recommended in the GNU gettext docs.
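
A minimal sketch of that check in modern Python 3, using the unsigned
'I' struct code so that no masking is needed.  The two magic numbers are
the ones quoted above; the function name mo_byte_order is hypothetical.

```python
import struct

LE_MAGIC = 0x950412DE
BE_MAGIC = 0xDE120495


def mo_byte_order(header):
    """Return '<' or '>' depending on how the .mo magic number was written."""
    # Always read the first four bytes as a little-endian unsigned int;
    # a big-endian file then shows up as the byte-swapped magic number.
    (magic,) = struct.unpack("<I", header[:4])
    if magic == LE_MAGIC:
        return "<"
    if magic == BE_MAGIC:
        return ">"
    raise ValueError("bad magic number")
```

The returned character can be used as the byte-order prefix for the
struct format strings that read the rest of the file.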

Now, if I add a trailing `L' to each of those constants the warnings
go away, which seems odd to me given that PEP 237 is trying to do away
with the int/long distinction and will eventually make the trailing
`L' illegal!

So I clearly don't understand why I need to add the trailing-L to
quiet the warnings, and PEP 237 doesn't quite help me understand why
hex and oct constants > sys.maxint have to have warnings (I understand
why shifts and such need warnings).  Maybe it's just me, but if I want
a bit pattern, I write a hex constant; I'm never going to write -1 as

So if "0x950412de" isn't the right way to write a 32 bit pattern, what
is? "0x950412deL"?  If so, what happens when the trailing-L becomes
illegal?  Seems like I'll be caught in a trap -- help me out! :)


From  Wed Aug 14 06:16:58 2002
From: (Tim Peters)
Date: Wed, 14 Aug 2002 01:16:58 -0400
Subject: [Python-Dev] hex constants, bit patterns,
 PEP 237 warnings and gettext
In-Reply-To: <>
Message-ID: <>

[Barry A. Warsaw]
> ...
> So if "0x950412de" isn't the right way to write a 32 bit pattern,

It isn't today, but will be in 2.4.

> what is? "0x950412deL"?

That's what my patch did (along with using 'I' codes in unpack,
and getting rid of all the "& MASK" fiddling) -- check it out, it's already
assigned to you for your convenience <wink>.

> If so, what happens when the trailing-L becomes illegal?

I think that's more of a Python 3 thing.  But if not:

> Seems like I'll be caught in a trap -- help me out! :)

Easy:  we take the trailing-L away again someday <heh>.  My bet is that
trailing-L will never go away, though (why bother?).

From  Wed Aug 14 06:18:27 2002
From: (Barry A. Warsaw)
Date: Wed, 14 Aug 2002 01:18:27 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
References: <>
Message-ID: <>

>>>>> "BAW" == Barry A Warsaw <> writes:

    BAW> These really are intended as 32 bit patterns, not signed
    BAW> integers.  Hex constants seem like the most straightforward
    BAW> way to spell such bit patterns.  If I wanted MASK to be -1 I
    BAW> would have spelled it that way!

>>>>> "TP" == Tim Peters <> writes:

    Guido> (Never mind the warnings; they're shallow;
    Guido> someone should fix them.)

    TP> I submitted a patch for that to SF and assigned it to Barry (I
    TP> have no idea how to test

BTW, I grok the other changes in Tim's patch.  struct's got an `I'
code for unpacking unsigned ints now, so it makes perfect sense to use
that instead of `i' w/ masking.
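
The equivalence between the two spellings can be checked directly.
This sketch uses modern Python 3 and the little-endian byte sequence of
the .mo magic number quoted earlier in the thread.

```python
import struct

raw = bytes.fromhex("de120495")  # 0x950412de stored little-endian

# Old approach: unpack as a signed int, then mask back to 32 bits.
(signed_value,) = struct.unpack("<i", raw)
masked_value = signed_value & 0xFFFFFFFF

# New approach: unpack as an unsigned int; no masking required.
(unsigned_value,) = struct.unpack("<I", raw)
```

Both routes end at the same bit pattern; the 'I' code just gets there
without the intermediate negative number.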


From  Wed Aug 14 07:23:42 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 08:23:42 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <15705.34920.804857.914875@localhost.localdomain>
References: <001e01c242e5$49697ff0$bd5d4540@Dell2>
Message-ID: <>

Skip Montanaro <> writes:

> What's the current behavior?  If my program receives an input in utf-8
> (let's say it comes from a form on a website), what form will it be in, or
> can't I tell?  

In general, you cannot tell in advance - it will depend on the data

W3C advocates "early normalization" towards "NFC", meaning that in the
Internet, you should always see NFC data - unless you are the primary data
source, e.g. by reading from a terminal, or after decoding some legacy
encoding. It turns out that most Python codecs will produce NFC
already, so normalization to NFC would be required only for user input,
and - as it turns out - when reading file names on OS X.

> Is it possible I will get spurious inequalities today if I compare
> two different unicode objects which were created from different
> sources and in different normal forms?

If they are in different normal forms, you *will* get inequalities
reliably. In the real world, inequalities will be spurious.
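
For illustration, the inequality is easy to reproduce in modern Python 3
with unicodedata.normalize, which was not yet available when this
message was written.  The helper name same_text is hypothetical.

```python
import unicodedata

nfc = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, precomposed (NFC)
nfd = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (NFD)


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here nfc == nfd is false even though both render as the same character,
while same_text(nfc, nfd) is true.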

> What about a string and a unicode object?  Where can I read all
> about it (Python and unicode normalization)?

Python does no normalization, so there is nothing to read. For
Unicode, you may want to start with the Normalization FAQ


From  Wed Aug 14 07:28:36 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 08:28:36 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> After a few more experiments I did manage to confuse the filesystem
> APIs: it turns out ligatures are not correctly decomposed. I.e. if you
> create a file "\uFB03" you cannot open it as "ffi".

LATIN SMALL LIGATURE FFI is a compatibility character. Those are not
normalized under NFD, only under NFKD (in which case it would decay to
ffi). Since NFKD loses information (of typographical nature in this
case), NFKD is only recommended for restricted domains (identifiers
being an explicit example).
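
The distinction can be observed with unicodedata.normalize in modern
Python 3 (a sketch; the function postdates this message):

```python
import unicodedata

ligature = "\ufb03"  # LATIN SMALL LIGATURE FFI

canonical = unicodedata.normalize("NFD", ligature)       # left untouched
compatibility = unicodedata.normalize("NFKD", ligature)  # decays to "ffi"
```

Only the compatibility form turns the ligature into three separate
letters, which is why the NFD-based filesystem lookup does not match.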


From  Wed Aug 14 07:33:13 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 08:33:13 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> If I understand the unicode standard (according to
> correctly this means that MacOS stores filenames in NFD normalized
> form, with all combining characters split out, and this is the
> preferred normalized form. Am I correct here?

You are correct that this is likely the form that OS X uses on-disk,
and at the APIs. This is not really the preferred form - W3C favours
and advocates NFC - precisely because it is easier to transform into
legacy encodings (as you just observed).

> But, even if NFC is the preferred normalized form (the documents I saw
> hinted that this may have been the case in previous Unicode
> standards:-): both NFC and NFD renditions of this string are legal
> unicode, aren't they? And if they are then both should be converted to
> the same latin-1 string, shouldn't they?

Yes, and yes.

> Do I misunderstand something, or is this a bug (limitation?) in the
> unicode->latin-1 decoder?

It's a limitation, in all codecs. Contributions of normalization code
are welcome. Since this is hard work, this is unlikely to be fixed in
Python 2.3 - unless somebody has a really good incentive for fixing


From  Wed Aug 14 08:25:23 2002
From: (Oren Tirosh)
Date: Wed, 14 Aug 2002 10:25:23 +0300
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>; from on Tue, Aug 13, 2002 at 09:07:49AM +0200
References: <> <> <> <>
Message-ID: <>

On Tue, Aug 13, 2002 at 09:07:49AM +0200, Martin v. Loewis wrote:
> Oren Tirosh <> writes:
> > I think that this will produce the smallest number of
> > incompatibilities for existing code and maintain compatibility with
> > C header files on 32 bit platforms. In this case 0xff000000 will
> > always be interpreted as -16777216 and the 'i' parser will happily
> > convert it to either 0xFF000000 or 0xFFFFFFFFFF000000, depending on
> > the native platform word size - which is probably what the
> > programmer meant.
> This means you suggest that PEP 237 is not implemented, or at least
> frozen at the current stage.

Not at all! Removing the differences between ints and longs is good. 
My reservations are about the hexadecimal representation.

    - Currently, the '%u', '%x', '%X' and '%o' string formatting
      operators and the hex() and oct() built-in functions behave
      differently for negative numbers: negative short ints are
      formatted as unsigned C long, while negative long ints are
      formatted with a minus sign.  This will be changed to use the
      long int semantics in all cases (but without the trailing 'L'
      that currently distinguishes the output of hex() and oct() for
      long ints).  Note that this means that '%u' becomes an alias for
      '%d'.  It will eventually be removed.

In Python up to 2.2 it's inconsistent between ints and longs:
>>> hex(-16711681)
>>> hex(-16711681L)
'-0xff0001L'		# ??!?!?

The hex representation of ints gives me useful information about their 
bit structure. After all, it is not immediately apparent to most mortals 
that the number above is a mask for bits 16-23.
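
In modern Python 3, where the long semantics won, the bit structure can
still be recovered by masking to the desired width.  A small sketch; the
helper name twos_complement_hex is hypothetical.

```python
def twos_complement_hex(value, bits=32):
    """Hex string of value as a two's-complement number of the given width."""
    return hex(value & ((1 << bits) - 1))


signed_form = hex(-16711681)
masked_32 = twos_complement_hex(-16711681)
masked_64 = twos_complement_hex(-16711681, bits=64)
```

The masked form shows zeros in bits 16-23, which the minus-sign form
hides.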

The hex representation of longs is something I find quite misleading and 
I think it's also unprecedented.  This wart has bothered me for a long 
time now but I didn't have any use for it so I didn't mind too much. Now 
it is proposed to extend this useless representation to ints so I do.

So we have two elements of the language that are inconsistent. One of 
them is in widespread use and the other is... ahem... 

Which one of them should be changed to conform to the other? 

My proposal: 

On 32 bit platforms:
>>> hex(-16711681)
>>> hex(-16711681L)

On 64 bit platforms:
>>> hex(-16711681)
>>> hex(-16711681L)

The 'LL' suffix means that this number is to be treated as a 64 bit
*signed* number. This is consistent with the way it is interpreted by 
GCC and other unix compilers on both 32 and 64 bit platforms.  

What to do about numbers from 2**31 to 2**32-1?

>>> hex(4278255615)

The U suffix, also borrowed from C, makes it unambiguous on 32 and 64 bit 
platforms for both Python and C. 

Representation of positive numbers:

 0x00000000   -         0x7fffffff   : unambiguous on all platforms
 0x80000000U  -         0xffffffffU  : representation adds U suffix
0x100000000LL - 0x7fffffffffffffffLL : representation adds LL suffix

Representation of negative numbers:
 0x80000000  - 0xffffffff (-2147483648 to -1):
	8 digits on 32 bit platforms
 0xffffffff80000000LL  - 0xffffffffffffffffLL  (same range):
	16 digits and LL suffix on 64 bit platforms

 other negative numbers: 16 digits and LL suffix on all platforms.
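
The ranges listed above can be sketched as a function, written here in
modern Python 3.  The name proposed_hex and the word_bits parameter are
hypothetical, positive values are printed without zero padding as an
assumption, and values outside the 64-bit range raise because the
proposal leaves them open.

```python
def proposed_hex(value, word_bits=32):
    """Format value following the proposed U/LL suffix rules."""
    if 0 <= value <= 0x7FFFFFFF:
        return "0x%x" % value            # unambiguous everywhere
    if 0x80000000 <= value <= 0xFFFFFFFF:
        return "0x%xU" % value           # needs the unsigned marker
    if 0x100000000 <= value <= 0x7FFFFFFFFFFFFFFF:
        return "0x%xLL" % value          # needs 64 bits
    if -0x80000000 <= value < 0 and word_bits == 32:
        return "0x%08x" % (value & 0xFFFFFFFF)
    if -0x8000000000000000 <= value < 0:
        return "0x%016xLL" % (value & 0xFFFFFFFFFFFFFFFF)
    raise ValueError("outside the 64-bit range; not covered by the proposal")
```

For example, proposed_hex(-16711681) gives an 8-digit string on a 32-bit
platform and a 16-digit string with an LL suffix when word_bits=64.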

This makes the hex representation of a number informative and consistent 
between int and long on all platforms. It is also consistent with the
C compiler on the same platform. Yes, it will produce a different text
representation of some numbers on different platforms but this conveys
important information about the bit structure of the number which really
is different between platforms. eval()ing it back to a number is still 

When converting in the other direction (hex representation to number) 
there is an ambiguous range from 0x80000000 to 0xffffffff.  Should it be 
treated as signed or unsigned?  The current interpretation is signed. PEP
237 proposes to change it to unsigned. I propose to do neither - this range
should be deprecated and some explicit notation should be used instead.

There's no need to be in a hurry about deprecating it, though. The
overwhelming majority of Python code will run on 32 bit platforms for some
time yet.

I propose that on 32 bit platforms this will produce a silent warning. No 
code will break. Running the program with -Wall will inform the programmer 
that the code may not work for some future version of Python.

On 64 bit platforms this will be interpreted the same way as on a 32 bit 
platform (signed 32 bits) but produce a noisy warning.  If the code was 
written on a 64 bit platform and the programmer meant the number to be 
treated as unsigned an explicit U suffix can be added to make it 
unambiguously unsigned. If the code was written on a 32 bit platform and 
the programmer meant the number to be treated as signed it's possible to 
just live with the warning (the code should still run correctly) or add 8 
leading 'F's and an 'LL' suffix to make it unambiguously signed. The 
modified code will run without warning on both 32 and 64 bit platforms.


The number 4000000000 would be represented in hex as 0xEE6B2800U whether 
it's as an int on a 64 bit platform or a long on either 32 or 64 bit 
platforms.  The representation depends only on the numeric value, not the 
type. This proposal therefore does not contradict the purpose of PEP 237
because ints and longs are treated identically.

What's the hex representation of numbers outside the range of 64 bit 
integers? Frankly, I don't care.  I'll go with any proposed solution as
long as eval(hex(x)) == x.

On Microsoft platforms 64 bit literals use the suffix 'i64', not 'LL'.
Python may either use 'LL' exclusively or produce 'i64' on Microsoft
platforms and 'LL' on other platforms. In the latter case it should 
accept either suffix on all platforms.

Yes, this proposal is more complicated and has special treatment for
different ranges but that is because the issue is not trivial and cannot
be brushed aside using a one-size-doesn't-fit-anyone approach. This
reminds me a lot of unicode issues.

What about the L suffix? This proposal adopts the LL and U suffixes from
C and ensures that they are interpreted consistently on both languages.
But the L suffix is not consistent with C for the range 0x80000000L to 
0xFFFFFFFFL. Should the L suffix be deprecated? Should it produce a 
warning for the possibly ambiguous range?


From  Wed Aug 14 09:43:34 2002
From: (Oren Tirosh)
Date: Wed, 14 Aug 2002 04:43:34 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Tue, Aug 13, 2002 at 05:15:58PM -0400, Guido van Rossum wrote:
> Alex Martelli introduced the "Look Before You Leap" (LBYL) syndrome
> for your uneasiness with (4) (and (5), I might add -- I don't know
> that __iter__ is always safe).  He contrasts it with a different
> attitude, which might be summarized as "It's easier to ask forgiveness
> than permission."  In many cases, there is no reason for LBYL
> syndrome, and it can actually cause subtle bugs.  For example, a LBYL
> programmer could write
>   if not os.path.exists(fn):
>     print "File doesn't exist:", fn
>     return
>   fp = open(fn)
>   ...use fp...
> A "forgiveness" programmer would write this as follows instead:
>   try:
>     fp = open(fn)
>   except IOError, msg:
>     print "Can't open", fn, ":", msg
>     return
>   ...use fp...

So far I have proposed two "forgiveness" solutions to the re-iterability 

One was to raise an error if .next() is called after StopIteration so an 
attempt to iterate twice over an iterator would fail noisily.  You have 
rejected this idea, probably because too much code depends on the current 
documented behavior.
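
A minimal sketch of that first idea, written as a wrapper rather than a
change to the iterator protocol itself. It assumes present-day Python 3
(__next__ instead of .next()), and the class name StrictIterator and the
choice of RuntimeError are hypothetical.

```python
class StrictIterator:
    """Hypothetical wrapper: calling next() again after exhaustion is an error."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._exhausted:
            raise RuntimeError("iterator already exhausted")
        try:
            return next(self._it)
        except StopIteration:
            # Remember that the end was reached, then report it normally.
            self._exhausted = True
            raise

it = StrictIterator([1, 2, 3])
first_pass = list(it)        # consumes the iterator
try:
    second_pass = list(it)   # a second pass fails noisily
except RuntimeError as exc:
    second_pass = None
    error = str(exc)
```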

My other proposed solution is at
I suspect it got lost in the noise, though.


From  Wed Aug 14 10:29:42 2002
From: (Jack Jansen)
Date: Wed, 14 Aug 2002 11:29:42 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On Wednesday, August 14, 2002, at 04:39 , Guido van Rossum wrote:
> How about the following counterproposal.  This also changes some of
> the other format codes to be a little more regular.
> Code    C type          	Range check
> b	unsigned char		0..UCHAR_MAX
> B	unsigned char		none **
> h	unsigned short		0..USHRT_MAX
> H	unsigned short		none **
> i	int			INT_MIN..INT_MAX
> I *	unsigned int		0..UINT_MAX
> l	long			LONG_MIN..LONG_MAX
> k *	unsigned long		none
> L	long long		LLONG_MIN..LLONG_MAX
> K *	unsigned long long	none
> Notes:
> * New format codes.
> ** Changed from previous "range-and-a-half" to "none"; the
>    range-and-a-half checking wasn't particularly useful.

Fine with me.

My only reason for suggesting the uint32_t and friends was because I was 
under the impression that you were unhappy with "unsigned long" having  
a different size on different platforms. I'm perfectly happy with 
char/short/long/long long.
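
For readers unfamiliar with the distinction in the table, here is a rough
pure-Python sketch of the two behaviours. It assumes that "none" means the
value is simply truncated to the width of the C type and that unsigned
char is 8 bits wide; the helper names are hypothetical and this is not the
C implementation of the format codes.

```python
UCHAR_MAX = 255  # assumes an 8-bit unsigned char

def convert_checked(value, max_value=UCHAR_MAX):
    """Range-checked conversion, like the 0..UCHAR_MAX row."""
    if not 0 <= value <= max_value:
        raise OverflowError("value out of range: %r" % (value,))
    return value

def convert_unchecked(value, bits=8):
    """Conversion with no range check: keep only the low bits."""
    return value & ((1 << bits) - 1)
```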
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Wed Aug 14 10:46:52 2002
From: (Jack Jansen)
Date: Wed, 14 Aug 2002 11:46:52 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On Wednesday, August 14, 2002, at 08:33 , Martin v. Loewis wrote:
>> Do I misunderstand something, or is this a bug (limitation?) in the
>> unicode->latin-1 decoder?
> It's a limitation, in all codecs. Contributions of normalization code
> are welcome. Since this is hard work, this is unlikely to be fixed in
> Python 2.3 - unless somebody has a really good incentive for fixing
> it.

Why is this hard work? I would guess that a simple table lookup would 
suffice, after all there are only a finite number of unicode characters 
that can be split up, and each one can be split up in only a small 
number of ways.

Wouldn't something like
for c in input:
    if not canbestartofcombiningsequence.has_key(c):
        nlookahead = MAXCHARSTOCOMBINE
        while nlookahead > 1:
            attempt = lookahead next nlookahead bytes from input
            if combine.has_key(attempt):
                skip the lookahead in input
do the trick, if the two dictionaries are initialized intelligently?
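
The table-lookup idea can be sketched as runnable code. This is a toy
under stated assumptions: it is written for present-day Python 3, the
two-entry COMBINE table and all names are hypothetical, and it ignores the
canonical ordering of multiple combining marks, which is one of the things
full normalization has to handle.

```python
# Hypothetical table: sequence of code points -> precomposed character.
COMBINE = {
    "e\u0301": "\u00e9",  # e + combining acute -> e-acute
    "a\u0308": "\u00e4",  # a + combining diaeresis -> a-umlaut
}
MAX_CHARS_TO_COMBINE = max(len(key) for key in COMBINE)
CAN_START = {key[0] for key in COMBINE}

def combine(text):
    """Greedy longest-match replacement of combinable sequences."""
    out = []
    i = 0
    while i < len(text):
        matched = False
        if text[i] in CAN_START:
            # Try the longest candidate first, then shorter ones.
            for n in range(MAX_CHARS_TO_COMBINE, 1, -1):
                attempt = text[i:i + n]
                if attempt in COMBINE:
                    out.append(COMBINE[attempt])
                    i += n
                    matched = True
                    break
        if not matched:
            out.append(text[i])
            i += 1
    return "".join(out)
```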
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma 
Goldman -

From  Wed Aug 14 11:18:19 2002
From: (Oren Tirosh)
Date: Wed, 14 Aug 2002 06:18:19 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Tue, Aug 13, 2002 at 03:45:29PM -0400, Michael McLay wrote:
> > So what I wonder is this:  Has there been much thought about making
> > these type categories more explicitly part of the type system?
> The category names look like general purpose interface names. The addition of 
> interfaces has been discussed quite a bit. While many people are interested 
> in having interfaces added to Python, there are many design issues that will 
> have to be resolved before it happens. 

Nope. Type categories are fundamentally different from interfaces.  An 
interface must be declared by the type while a category can be an 
observation about an existing type. 

Two types that are defined independently in different libraries may in 
fact fit under the same category because they implement the same protocol.
With named interfaces they may in fact be compatible but they will not 
expose the same explicit interface. Requiring them to import the interface 
from a common source starts to sound more like Java than Python and would
introduce dependencies and interface version issues in a language that is 
wonderfully free from such arbitrary complexities.

Python is a dynamic language. It deserves a dynamic type category system,
not static interfaces that must be declared. It's fine to write a class and
somehow say "I intend this class to be in category X, please warn me if I 
write a method that will make it incompatible". But I don't want declarations 
to be a *requirement* for being considered compatible with a protocol. I 
have noticed that a lot of protocols are defined retroactively by 
observation of the behavior of existing code. There shouldn't be any need 
to go tag someone else's code as conforming to a protocol or put a wrapper
around it just to be able to use it.

A category is defined mathematically by a membership predicate. So what we
need for type categories is a system for writing predicates about types.

Standard Python expressions should not be used for defining a category
membership predicate. A Python expression is not a pure function. This
makes it impossible to cache the results of which type belongs to what
category for efficiency. Another problem is that many different expressions 
may be equivalent but if two independently defined categories use equivalent 
predicates they should *be* the same category.  They should be merged at 
runtime just like interned strings. 
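
A small sketch of the caching problem, assuming present-day Python 3. The
class Stream and the predicate is_readable_writable are hypothetical; the
point is only that an ordinary expression over a class can give different
answers at different times, so a cached answer can go stale.

```python
class Stream:
    def read(self):
        return "data"

def is_readable_writable(cls):
    """A category predicate written as an ordinary Python expression."""
    return hasattr(cls, "read") and hasattr(cls, "write")

cache = {}
cache[Stream] = is_readable_writable(Stream)  # evaluated before the change

# The class is modified later, as Python allows at any time.
Stream.write = lambda self, data: None

fresh = is_readable_writable(Stream)  # evaluated after the change
stale = cache[Stream]                 # the cached answer no longer agrees
```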

About a year ago I worked on a system for predicates having a canonical 
representation for security applications. While I was working on it I 
realized that it would be perfect for implementing a type category system
for Python. It would be useful at runtime for error detection and runtime
queries of protocols. It would also be useful at compile time for early
detection of some errors and possibly for optimization. By implementing
an optional strict mode the early error detection could be improved to the
point where it's effectively a static type system.

Just a quick example of the usefulness of canonical predicates: if I
calculate the intersection of two predicates and reduce it to canonical
form it will reduce to the FALSE predicate if no input will satisfy both
predicates. It will be equal to one of the predicates if it is contained
by the other.
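
To illustrate the two properties with something runnable, here is a toy
model in which a predicate is a closed integer interval and its canonical
form is the pair (lo, hi). This is only an assumption for illustration and
not the system described above; all names are hypothetical.

```python
FALSE = None  # canonical form of the predicate that nothing satisfies

def interval(lo, hi):
    """Canonical form of the predicate lo <= x <= hi (FALSE if empty)."""
    return (lo, hi) if lo <= hi else FALSE

def intersect(p, q):
    """Canonical form of 'p and q'."""
    if p is FALSE or q is FALSE:
        return FALSE
    return interval(max(p[0], q[0]), min(p[1], q[1]))

small = interval(10, 20)
large = interval(0, 100)
other = interval(50, 60)
```

Because the form is canonical, plain equality is enough to detect both
the empty case and the containment case.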

I spent countless hours thinking about these issues, probably more than 
most people on this list... I think I have the foundation for a powerful 
yet unobtrusive type category system. Unfortunately it will take me some 
time to put it in writing and I don't have enough free time (who does?)


From  Wed Aug 14 11:47:47 2002
From: (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Wed, 14 Aug 2002 12:47:47 +0200
Subject: [Python-Dev] PEP 293, Codec Error Handling Callbacks
References: <> <>	<>	<>	<>	<>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:

> "M.-A. Lemburg" <> writes:
>>What ? That exceptions are immutable ? I think it's a big win that
>>exceptions are in fact mutable -- they are great for transporting
>>extra information up the chain...
> I see. So this is an open issue.

Yes, but I think this is not that much of a problem, because when
the code that catches the exception wants to do something with
exc.args it has to know what the entries mean, which depends on
the type. And if this code knows that it is dealing with a
UnicodeEncodeError it can simply use exc.start instead of

    Walter Dörwald

From  Wed Aug 14 12:53:03 2002
From: (Kalle Svensson)
Date: Wed, 14 Aug 2002 13:53:03 +0200
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: <>
References: <> <>
Message-ID: <>

[Tim Peters]
> So even if we had that, it wouldn't have helped.  stdout from a
> regrtest -v run is what we need, or from running
> directly (w/o regrtest).

Here you go.

: kalle@taylor [python-HP_UX-B.11.00-9000_829-taylor]$ ; ./python ../python/dist/src/Lib/test/ 
There are no surprising symbols in the tempfile module ... ok
_once initializes its argument ... ok
_once calls the callback just once ... ok
_once does not modify anything but its argument ... ok
_RandomNameSequence returns a six-character string ... ok
_RandomNameSequence returns no duplicate strings (stochastic) ... ok
_RandomNameSequence supports the iterator protocol ... ok
_candidate_tempdir_list returns a nonempty list of strings ... ok
_candidate_tempdir_list contains the expected directories ... ok
_get_candidate_names returns a _RandomNameSequence object ... ok
_get_candidate_names always returns the same object ... ok
_mkstemp_inner can create files ... ok
_mkstemp_inner can create many files (stochastic) ... FAIL
Exception exceptions.AttributeError: "mkstemped instance has no attribute 'fd'" in <bound method mkstemped.__del__ of <__main__.mkstemped instance at 0x400e6fa8>> ignored
_mkstemp_inner can create files in a user-selected directory ... ok
_mkstemp_inner creates files with the proper mode ... ok
_mkstemp_inner file handles are not inherited by child processes ... ok
_mkstemp_inner can create files in text mode ... ok
gettempprefix returns a nonempty prefix string ... ok
gettempprefix returns a usable prefix string ... ok
gettempdir returns a directory which exists ... ok
gettempdir returns a directory writable by the user ... ok
gettempdir always returns the same object ... ok
mkstemp can create files ... ok
mkstemp can create directories in a user-selected directory ... ok
mkdtemp can create directories ... ok
mkdtemp can create many directories (stochastic) ... ok
mkdtemp can create directories in a user-selected directory ... ok
mkdtemp creates directories with the proper mode ... ok
mktemp can choose usable file names ... ok
mktemp can choose many usable file names (stochastic) ... ok
mktemp issues a warning when used ... ok
NamedTemporaryFile can create files ... ok
NamedTemporaryFile creates files with names ... ok
A NamedTemporaryFile is deleted when closed ... ok
A NamedTemporaryFile can be closed many times without error ... ok
TemporaryFile can create files ... ok
TemporaryFile creates files with no names (on this system) ... ok
A TemporaryFile can be closed many times without error ... ok

FAIL: _mkstemp_inner can create many files (stochastic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "../python/dist/src/Lib/test/", line 295, in test_basic_many
  File "../python/dist/src/Lib/test/", line 278, in do_create
  File "../python/dist/src/Lib/test/", line 33, in failOnException
  File "/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/", line 260, in fail
AssertionError: _mkstemp_inner raised exceptions.OSError: [Errno 24] Too many open files: '/tmp/aaU3irrA'

----------------------------------------------------------------------
Ran 38 tests in 43.182s

FAILED (failures=1)
Traceback (most recent call last):
  File "../python/dist/src/Lib/test/", line 719, in ?
  File "../python/dist/src/Lib/test/", line 716, in test_main
  File "/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/test/", line 188, in run_suite
    raise TestFailed(err)
test.test_support.TestFailed: Traceback (most recent call last):
  File "../python/dist/src/Lib/test/", line 295, in test_basic_many
  File "../python/dist/src/Lib/test/", line 278, in do_create
  File "../python/dist/src/Lib/test/", line 33, in failOnException
  File "/mp/slaskdisk/tmp/sfarmer/python/dist/src/Lib/", line 260, in fail
AssertionError: _mkstemp_inner raised exceptions.OSError: [Errno 24] Too many open files: '/tmp/aaU3irrA'

Hmm, I wonder how many that is, and how to change it.  I'll look around.

-- 
Kalle Svensson,
Student, root and saint in the Church of Emacs.


From  Wed Aug 14 12:54:44 2002
From: (Samuele Pedroni)
Date: Wed, 14 Aug 2002 13:54:44 +0200
Subject: [Python-Dev] Multiple dispatch
Message-ID: <002001c24389$61d49ee0$6d94fea9@newmexico>

[David Abrahams]
>I don't know how these things usually work, but isn't it a bit early for
>that? I would like to have some discussion about multiple dispatch (and
>especially matching criteria) before investing in a formal proposal. That's
>what my earlier posting which got banished to the types-sig was trying to
>do. Getting a feel for what people are thinking about this, and getting
>feedback from those with lots more experience than I in matters Pythonic is
>important to me.

I'm interested in multiple dispatch [but have limited
band-width], once I have even written
a pure Python implementation of it (only on classes).  It's quite
expressive for some designs. But:

[- Jython too internally uses a kind of multiple-dispatch
 in order to dispatch to overloaded Java methods.
But such a mechanism is really quite a limited beast
wrt adding multiple-dispatch to Python in general. ]

- I'm not sure it is all that Pythonic or easy to grasp,
 one remark that I have read sometimes is that
 with multimethods the program logic is easily scattered
 in many places with so-to-say non-local effects.

- It is yet another paradigm that should be integrated
 with the rest of Python. For example how does it interact
  with the current single-dispatched methods, does it?
[It is not just a theoretical question, it influences whether
this can be used to model e.g. the dispatch of Jython for
overloaded Java methods or not, simplifying the picture
or adding confusion]
- Syntax and semantics: in Python definitions are assignments.
  Now one needs at least a (maybe implicit) define generic function and
 an add method to generic function. (Should def be abused?)
- Should all function (method) definitions define generic function methods
  under the hood?

- Do we dispatch on only foo.__class__ or
  do we want to dispatch on protocols/interfaces/categories,
  now at the moment these are not first-class in Python.
-  How do we solve dispatch ambiguities, the more
  predictable and uncomplicated the more Pythonic.

- Sometimes it is useful to substitute functions and methods
  with wrapped re-editions, the equivalent for multi-method
 are at least before,after,around combinators, I think
 they are useful, but make the picture more complex.
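
To give the questions above something concrete to point at, here is a
minimal sketch of a generic function that dispatches on the classes of
all its arguments. It assumes present-day Python 3 and plain classes
only, and every name in it is hypothetical. Ambiguities are resolved by
comparing MRO positions left to right, which is just one possible answer
to the ambiguity question and not a recommendation.

```python
class GenericFunction:
    """Toy generic function dispatching on the classes of all arguments."""

    def __init__(self, name):
        self.name = name
        self.methods = {}  # tuple of classes -> implementation

    def add_method(self, *types):
        def register(func):
            self.methods[types] = func
            return func
        return register

    def __call__(self, *args):
        applicable = [
            (types, func) for types, func in self.methods.items()
            if len(types) == len(args)
            and all(isinstance(a, t) for a, t in zip(args, types))
        ]
        if not applicable:
            raise TypeError("no applicable method for %s" % self.name)

        # Smaller MRO index means a more specific class for that argument.
        def distance(item):
            types, _ = item
            return tuple(type(a).__mro__.index(t) for a, t in zip(args, types))

        _, best = min(applicable, key=distance)
        return best(*args)

class Shape: pass
class Circle(Shape): pass

collide = GenericFunction("collide")

@collide.add_method(Shape, Shape)
def collide_generic(a, b):
    return "generic"

@collide.add_method(Circle, Circle)
def collide_circles(a, b):
    return "circle-circle"
```

Note that this sketch needs both an explicit "define generic function"
step and an explicit "add method" step, which is exactly the syntax
question raised in the list.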

So the question is more what is the most pythonic way
we can find to add multiple dispatch, then maybe it
is Pythonic enough or not. [It seems a SIG task
but I have not really written that word <.5 wink>]

Related: Smallscript, CLOS, Dylan, various overloading flavors

Smallscript is interesting because it adds multiple dispatch
to the single-dispatch semantics of Smalltalk, so it's
very overlapping with our case, OTOH I have not played
with it and I don't know the details of the actual semantics,
[and in general it gives a PL/I-esque impression, at least from far away].


From  Wed Aug 14 13:02:30 2002
From: (Barry A. Warsaw)
Date: Wed, 14 Aug 2002 08:02:30 -0400
Subject: [Python-Dev] hex constants, bit patterns,
 PEP 237 warnings and gettext
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

    TP> [Barry A. Warsaw]
    >> ...  So if "0x950412de" isn't the right way to write a 32 bit
    >> pattern,

    TP> It isn't today, but will be in 2.4.

But isn't that wasteful?  Today I have to add the L to my hex
constants, but in a year from now, I can just turn around and remove
them again.  What's the point?

The deeper question is: what's wrong with "0x950412de"?  What bits
have I lost by writing my hex constant this way?  I'm trying to
understand why hex constants > sys.maxint have to be deprecated.


From  Wed Aug 14 13:13:05 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 08:13:05 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: Your message of "Wed, 14 Aug 2002 08:33:13 +0200."
References: <>
Message-ID: <>

> > Do I misunderstand something, or is this a bug (limitation?) in the
> > unicode->latin-1 decoder?
> It's a limitation, in all codecs. Contributions of normalization code
> are welcome. Since this is hard work, this is unlikely to be fixed in
> Python 2.3 - unless somebody has a really good incentive for fixing
> it.

Note that normalization doesn't belong in the codecs (except perhaps
as a separate Unicode->Unicode codec, since codecs seem to be useful
for all string->string transformations).  It's a separate step that
the application has to request; only the app knows whether a
particular Unicode string is already normalized or not, and whether
the expense is useful for the app, or not.

--Guido van Rossum (home page:

From  Wed Aug 14 13:27:49 2002
From: (Kalle Svensson)
Date: Wed, 14 Aug 2002 14:27:49 +0200
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

[me, on the HP-UX snake farm build]
> AssertionError: _mkstemp_inner raised exceptions.OSError: [Errno 24]
> Too many open files: '/tmp/aaU3irrA'
> Hmm, I wonder how many that is, and how to change it.  I'll look
> around.
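
On Unix systems where the standard resource module is available, the
per-process limit on open files can be inspected from Python itself. This
is a sketch for present-day Python 3; whether the value it reports
corresponds to the HP-UX maxfiles kernel parameter is an assumption.

```python
import resource

# Soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
```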

I've raised maxfiles from 200 to 2048, and the test now runs without

-- 
Kalle Svensson,
Student, root and saint in the Church of Emacs.


From  Wed Aug 14 13:24:58 2002
From: (Oren Tirosh)
Date: Wed, 14 Aug 2002 08:24:58 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Wed, Aug 14, 2002 at 08:02:30AM -0400, Barry A. Warsaw wrote:
> >>>>> "TP" == Tim Peters <> writes:
>     TP> [Barry A. Warsaw]
>     >> ...  So if "0x950412de" isn't the right way to write a 32 bit
>     >> pattern,
>     TP> It isn't today, but will be in 2.4.
> But isn't that wasteful?  Today I have to add the L to my hex
> constants, but in a year from now, I can just turn around and remove
> them again.  What's the point?
> The deeper question is: what's wrong with "0x950412de"?  What bits
> have I lost by writing my hex constant this way?  I'm trying to
> understand why hex constants > sys.maxint have to be deprecated.

Unifying ints and longs means that there is no predefined bit width for
numbers. Conceptually they are all infinite. Positive numbers have an
infinite number of leading '0's and negative numbers have an infinite number
of leading 'F's. Numbers that have less than 8/16 digits to the right of
this infinite sequence of '0's or 'F's happen to get a more efficient 
internal representation and a different ob_type, but other than that it 
should be impossible to tell the difference between an int and a long.

What's wrong with 0x950412de is that with a word width of 32 bits it is 
negative and therefore the invisible bits to the left are all set. With a 
word width of 64 bits or with an infinite width they are cleared.

That's why I propose borrowing the 'U' suffix from C. 0x950412deU would
mean that the bits to the left are cleared. This way you could change your
code only once, document your intentions clearly and get a number that is
guaranteed to be equivalent on Python and C compilers with different native
word sizes.
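
The "invisible bits to the left" can be observed directly once integers
are unbounded. The sketch below assumes present-day Python 3, where the
unification has already happened; the variable names are hypothetical.

```python
WORD32 = 0xFFFFFFFF

bits = 0x950412DE

# Unsigned reading: the invisible bits to the left are all clear.
as_unsigned = bits & WORD32

# Signed 32-bit reading: the invisible bits to the left are all set.
as_signed = bits - (1 << 32) if bits & 0x80000000 else bits

# A negative number behaves as if it had infinitely many leading 1 bits:
# shifting right never reaches zero, and masking recovers the pattern.
still_negative = as_signed >> 1000
recovered = as_signed & WORD32
```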


From  Wed Aug 14 13:26:58 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 08:26:58 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Wed, 14 Aug 2002 11:29:42 +0200."
References: <>
Message-ID: <>

> > How about the following counterproposal.  This also changes some of
> > the other format codes to be a little more regular.
> >
> > Code    C type          	Range check
> >
> > b	unsigned char		0..UCHAR_MAX
> > B	unsigned char		none **
> > h	unsigned short		0..USHRT_MAX
> > H	unsigned short		none **
> > i	int			INT_MIN..INT_MAX
> > I *	unsigned int		0..UINT_MAX
> > l	long			LONG_MIN..LONG_MAX
> > k *	unsigned long		none
> > L	long long		LLONG_MIN..LLONG_MAX
> > K *	unsigned long long	none
> >
> > Notes:
> >
> > * New format codes.
> >
> > ** Changed from previous "range-and-a-half" to "none"; the
> >    range-and-a-half checking wasn't particularly useful.
> Fine with me.

OK, I've added this to my TODO list (, assigned to
me -- but if someone else wants to do it, please assign to yourself or
submit a patch and leave a note in the bug item!).

--Guido van Rossum (home page:

From  Wed Aug 14 13:40:47 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 08:40:47 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 13 Aug 2002 23:59:29 EDT."
References: <> <> <0c0501c24311$8cebbdc0$> <> <0ccc01c24341$d839b130$> <>
Message-ID: <>

> > Time to write a PEP.
> I don't know how these things usually work, but isn't it a bit early
> for that?

Not at all.  If you want multiple dispatch to go into the language,
you'll have to educate the rest of us here, both about the advantages,
and how it can be implemented with reasonable, Pythonic semantics.
A PEP is the perfect vehicle for that.  A PEP doesn't have to *start*
as a full formal proposal.  It can go through stages and eventually
end up being rejected before there ever was a full formal proposal,
*or* it will eventually evolve into a full formal proposal.  (For
example, PEPs 245 and 246 are examples of PEPs in the very early
stages.  I expect PEP 245 was too early, but PEP 246 strikes me as
just the right thing to get a meaningful discussion started.)

--Guido van Rossum (home page:

From  Wed Aug 14 13:53:39 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 08:53:39 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Wed, 14 Aug 2002 10:25:23 +0300."
References: <> <> <> <>
Message-ID: <>

> Not at all! Removing the differences between ints and longs is good. 
> My reservations are about the hexadecimal representation.
>     - Currently, the '%u', '%x', '%X' and '%o' string formatting
>       operators and the hex() and oct() built-in functions behave
>       differently for negative numbers: negative short ints are
>       formatted as unsigned C long, while negative long ints are
>       formatted with a minus sign.  This will be changed to use the
>       long int semantics in all cases (but without the trailing 'L'
>       that currently distinguishes the output of hex() and oct() for
>       long ints).  Note that this means that '%u' becomes an alias for
>       '%d'.  It will eventually be removed.
> In Python up to 2.2 it's inconsistent between ints and longs:
> >>> hex(-16711681)
> '0xff00ffff'
> >>> hex(-16711681L)
> '-0xff0001L'		# ??!?!?
> The hex representation of ints gives me useful information about their 
> bit structure. After all, it is not immediately apparent to most mortals 
> that the number above is a mask for bits 16-23.

If you want to see the bit mask, all you have to do is and it with a
positive mask, e.g. 0xffff to see it as a 16-bit mask, 0xffffffffL for
a 32-bit mask, or 0xffffffffffffffff for a 64-bit mask.  And you can
go higher.
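
Spelled out for the number in question, as a sketch in present-day
Python 3 (where no L suffix is needed or accepted):

```python
n = -16711681

mask16 = hex(n & 0xFFFF)
mask32 = hex(n & 0xFFFFFFFF)
mask64 = hex(n & 0xFFFFFFFFFFFFFFFF)
```

The mask chooses the width, so the caller and not hex() decides how many
bits are of interest.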

> The hex representation of longs is something I find quite misleading and 
> I think it's also unprecedented.  This wart has bothered me for a long 
> time now but I didn't have any use for it so I didn't mind too much. Now 
> it is proposed to extend this useless representation to ints so I do.

Just yesterday I got a proposal for a hex calculator that proposed
-0x1 to represent the mathematical value -1 in hex.  I don't think
it's unprecedented at all, although it may be unconventional.

> So we have two elements of the language that are inconsistent. One of 
> them is in widespread use and the other is... ahem... 
> Which one of them should be changed to conform to the other? 
> My proposal: 
> On 32 bit platforms:
> >>> hex(-16711681)
> '0xff00ffff'
> >>> hex(-16711681L)
> '0xff00ffff'
> On 64 bit platforms:
> >>> hex(-16711681)
> '0xffffffffff00ffffLL'
> >>> hex(-16711681L)
> '0xffffffffff00ffffLL'
> The 'LL' suffix means that this number is to be treated as a 64 bit
> *signed* number. This is consistent with the way it is interpreted by 
> GCC and other unix compilers on both 32 and 64 bit platforms.  


Python doesn't have the concept of 64-bit signed numbers.  It also
doesn't have the 'LL' syntax on input -- or do you propose to add that
too?  Why should the hex representation have to contain the conceptual
size of the number?  Do you propose to add LL to the hex
representations of positive numbers too?

> What to do about numbers from 2**31 to 2**32-1?
> >>> hex(4278255615)
> 0xff00ffffU
> The U suffix, also borrowed from C, makes it unambiguous on 32 and 64 bit 
> platforms for both Python and C. 

Another -1.  Python doesn't have this on input.

> Representation of positive numbers:
>  0x00000000   -         0x7fffffff   : unambiguous on all platforms
>  0x80000000U  -         0xffffffffU  : representation adds U suffix
> 0x100000000LL - 0x7fffffffffffffffLL : representation adds LL suffix

What does the addition of the U or LL suffix give you?  If I really
want to know how many bits there are I can count the digits, right?
And usually the app that does the printing knows in how many bits it
is interested.

> Representation of negative numbers:
>  0x80000000  - 0xffffffff (-2147483648 to -1):
> 	8 digits on 32 bit platforms
>  0xffffffff80000000LL  - 0xffffffffffffffffLL  (same range):
> 	16 digits and LL suffix on 64 bit platforms
>  other negative numbers: 16 digits and LL suffix on all platforms.

And what do you suppose we do with hex(-100**100)?

> This makes the hex representation of a number informative and consistent 
> between int and long on all platforms. It is also consistent with the
> C compiler on the same platform. Yes, it will produce a different text
> representation of some numbers on different platforms but this conveys
> important information about the bit structure of the number which really
> is different between platforms. eval()ing it back to a number is still 
> consistent.

Why is the bit structure so important to you?

> When converting in the other direction (hex representation to number) 
> there is an ambiguous range from 0x80000000 to 0xffffffff.  Should it be 
> treated as signed or unsigned?  The current interpretation is signed. PEP
> 237 proposes to change it to unsigned. I propose to do neither - this range
> should be deprecated and some explicit notation should be used instead.

Now that's really helpful. :-(  What is someone to do who wants to
enter a hex constant they got from some documentation?  E.g. the AIFC
magic number is 0xA2805140.  Why shouldn't I be able to write that?
What's the use of having a discontinuity in our notation?

(I wanted to write much stronger words but I'm trying to respond to
the proposal only.  I guess I'm -1000000 on this.)

--Guido van Rossum (home page:

From  Wed Aug 14 13:59:44 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 08:59:44 -0400
Subject: [Python-Dev]
Message-ID: <>

The mkstemp() function in the rewritten tempfile has an argument with
a curious name and default: binary=True.  This caused confusion (even
the docstring in the original patch was confused :-).  It would be
much easier to explain if this was changed to text=False.  That is, to
deviate from the default mode, i.e. use text mode, you'll have to
write mkstemp(text=True) rather than mkstemp(binary=False).
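
For reference, a short usage sketch with the text keyword, assuming the
tempfile module as it exists in present-day Python 3. The file content
and variable names are arbitrary.

```python
import os
import tempfile

# text=True requests text mode; the default (text=False) is binary mode.
fd, path = tempfile.mkstemp(text=True)
try:
    os.write(fd, b"hello\n")
finally:
    os.close(fd)

with open(path) as f:
    content = f.read()
os.remove(path)
```

mkstemp() returns a raw file descriptor and a path, and the caller is
responsible for closing the descriptor and deleting the file.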

This might require a few changes to the standard library and to
anybody's code who has aggressively started using this, but given the
freshness of the patch I think that's OK.  If anybody sees a good
reason *not* to do this, please let me know (here or on the SF patch,

--Guido van Rossum (home page:

From  Wed Aug 14 14:09:19 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 09:09:19 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Wed, 14 Aug 2002 06:18:19 EDT."
References: <> <>
Message-ID: <>

> Type categories are fundamentally different from interfaces.  An 
> interface must be declared by the type while a category can be an 
> observation about an existing type. 

Yup.  (In Python these have often been called "protocols".  Jim Fulton
calls them "lore protocols".)

> Two types that are defined independently in different libraries may
> in fact fit under the same category because they implement the same
> protocol.  With named interfaces they may in fact be compatible but
> they will not expose the same explicit interface. Requiring them to
> import the interface from a common source starts to sound more like
> Java than Python and would introduce dependencies and interface
> version issues in a language that is wonderfully free from such
> arbitrary complexities.

Hm, I'm not sure if you can solve the version incompatibility problem
by ignoring it. :-)

> Python is a dynamic language. It deserves a dynamic type category
> system, not static interfaces that must be declared. It's fine to
> write a class and somehow say "I intend this class to be in category
> X, please warn me if I write a method that will make it
> incompatible". But I don't want declarations to be a *requirement*
> for being considered compatible with a protocol. I have noticed that
> a lot of protocols are defined retroactively by observation of the
> behavior of existing code. There shouldn't be any need to go tag
> someone else's code as conforming to a protocol or put a wrapper
> around it just to be able to use it.

Are you familiar with Zope's Interface package?  It solves this
problem (nicely, IMO) by allowing you to place an interface
declaration inside a class but also allowing you to make calls to an
interface registry that declare interfaces for pre-existing classes.
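
This is not Zope's Interface API, but a loosely analogous sketch may help:
the abc module in present-day Python 3 also lets a declaration be made
after the fact through a register() call. The names Readable and
LegacyFile are hypothetical.

```python
from abc import ABC, abstractmethod

class Readable(ABC):
    """Hypothetical category: things with a read() method."""

    @abstractmethod
    def read(self):
        ...

# A pre-existing class written with no knowledge of Readable.
class LegacyFile:
    def read(self):
        return "data"

# Declare the relationship afterwards, without touching LegacyFile's source.
Readable.register(LegacyFile)
```

The registration is a statement of intent only; nothing checks that
LegacyFile really provides read().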

> A category is defined mathematically by a membership predicate. So
> what we need for type categories is a system for writing predicates
> about types.

Now I think you've lost me.  How can a category on the one hand be
observed after the fact and on the other hand defined by a rigorous
mathematical definition?  How could a program tell by looking at a
class whether it really is an implementation of a given protocol?

> Standard Python expressions should not be used for defining a
> category membership predicate. A Python expression is not a pure
> function. This makes it impossible to cache the results of which
> type belongs to what category for efficiency. Another problem is
> that many different expressions may be equivalent but if two
> independently defined categories use equivalent predicates they
> should *be* the same category.  They should be merged at runtime
> just like interned strings.

Again you've lost me.  I expect there's something here that you assume
well-known.  Can you please clarify this?  What on earth do you mean
by "A Python expression is not a pure function" ?

> About a year ago I worked on a system for predicates having a
> canonical representation for security applications. . While I was
> working on it I realized that it would be perfect for implementing a
> type category system for Python. It would be useful at runtime for
> error detection and runtime queries of protocols. It would also be
> useful at compile time for early detection of some errors and
> possibly for optimization. By implementing an optional strict mode
> the early error detection could be improved to the point where it's
> effectively a static type system.

So let's see a proposal already.  I can't guess what you are proposing
from this description except that you think highly of your own
invention.  I wouldn't expect you to mention it otherwise, so that's 
bits of information. :-)

> Just a quick example of the usefulness of canonical predicates: if I
> calculate the intersection of two predicates and reduce it to
> canonical form it will reduce to the FALSE predicate if no input
> will satisfy both predicates. It will be equal to one of the
> predicates if it is contained by the other.
> I spent countless hours thinking about these issues, probably more than 
> most people on this list...

How presumptuous.

> I think I have the foundation for a powerful yet unobtrusive type
> category system. Unfortunately it will take me some time to put it
> in writing and I don't have enough free time (who does?)

I say vaporware. :-)

Tell us about it when you have time.

--Guido van Rossum (home page:

From  Wed Aug 14 14:12:21 2002
From: (Fredrik Lundh)
Date: Wed, 14 Aug 2002 15:12:21 +0200
Subject: [Python-Dev]
References: <>
Message-ID: <016a01c24394$3a620b80$0900a8c0@spiff>

guido wrote:
> The mkstemp() function in the rewritten tempfile has an argument with
> a curious name and default: binary=True.  This caused confusion (even
> the docstring in the original patch was confused :-).  It would be
> much easier to explain if this was changed to text=False.

fwiw, it would probably be even easier to use/explain if it
used a mode string.


From  Wed Aug 14 14:11:44 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 09:11:44 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
In-Reply-To: Your message of "Wed, 14 Aug 2002 08:02:30 EDT."
References: <> <>
Message-ID: <>

>     TP> [Barry A. Warsaw]
>     >> ...  So if "0x950412de" isn't the right way to write a 32 bit
>     >> pattern,
>     TP> It isn't today, but will be in 2.4.
> But isn't that wasteful?  Today I have to add the L to my hex
> constants, but in a year from now, I can just turn around and remove
> them again.  What's the point?

Think 5 years rather than 1 year.

> The deeper question is: what's wrong with "0x950412de"?  What bits
> have I lost by writing my hex constant this way?  I'm trying to
> understand why hex constants > sys.maxint have to be deprecated.

We're not deprecating them.  Instead, the type of hex constants
in range(sys.maxint, 2*sys.maxint+2) will change from int to long, to
be consistent with other hex constants.  Currently:

  >>> 0xf > 0
  True
  >>> 0xffff > 0
  True
  >>> 0xfffffff > 0
  True
  >>> 0xffffffff > 0
  False                      <----------- This anomaly will disappear
  >>> 0xfffffffff > 0
  True
  >>> 0xffffffffffffffff > 0
  True
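
A minimal sketch of the same point in present-day Python 3, where
every hex literal is a non-negative integer.  The helper as_signed32
is hypothetical (it is not part of the original message); it recovers
the old 32-bit signed reading of a bit pattern for code that really
wants it.

```python
def as_signed32(bits):
    """Reinterpret the low 32 bits of `bits` as a two's-complement value."""
    bits &= 0xFFFFFFFF
    if bits & 0x80000000:          # top bit set: negative in 32-bit arithmetic
        return bits - (1 << 32)
    return bits

literal = 0xffffffff
print(literal > 0)                 # the literal itself is a positive integer
print(as_signed32(literal))        # the old 32-bit reading of the same bits
```

Keeping the conversion explicit means the literal has one meaning on
every platform, and the signed view is requested only where needed.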

--Guido van Rossum (home page:

From  Wed Aug 14 14:12:36 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 09:12:36 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
In-Reply-To: Your message of "Wed, 14 Aug 2002 08:02:30 EDT."
References: <> <>
Message-ID: <>

> I'm trying to understand why hex constants > sys.maxint have to
> be deprecated.

Hm, maybe I should use a different warning category rather than
DeprecationWarning?  Any suggestions?

--Guido van Rossum (home page:

From  Wed Aug 14 14:14:21 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 09:14:21 -0400
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: Your message of "Wed, 14 Aug 2002 14:27:49 +0200."
References: <> <> <>
Message-ID: <>

> [me, on the HP-UX snake farm build]
> > AssertionError: _mkstemp_inner raised exceptions.OSError: [Errno 24]
> > Too many open files: '/tmp/aaU3irrA'
> > 
> > Hmm, I wonder how many that is, and how to change it.  I'll look
> > around.
> I've raised maxfiles from 200 to 2048, and the test now runs without
> error.

Thanks!  Maybe the test was a little too eager though -- perhaps it
could be happy with creating 100 instead of 1000 files.

--Guido van Rossum (home page:

From  Wed Aug 14 15:08:59 2002
From: (Andrew Koenig)
Date: 14 Aug 2002 10:08:59 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>> The category names look like general purpose interface names. The
>> addition of interfaces has been discussed quite a bit. While many
>> people are interested in having interfaces added to Python, there
>> are many design issues that will have to be resolved before it
>> happens.

Oren> Nope. Type categories are fundamentally different from
Oren> interfaces.  An interface must be declared by the type while a
Oren> category can be an observation about an existing type.

Why?  That is, why can't you imagine making a claim that type
X meets interface Y, even though the author of neither X nor Y
made that claim?

However, now that you bring it up... One difference I see between
interfaces and categories is that I can imagine categories carrying
semantic information to the human reader of the code that is not
actually expressed in the category itself.  As a simple example,
I can imagine a PartialOrdering category that I might like as part
of the specification for an argument to a sort function.

Oren> Two types that are defined independently in different libraries
Oren> may in fact fit under the same category because they implement
Oren> the same protocol.  With named interfaces they may in fact be
Oren> compatible but they will not expose the same explicit
Oren> interface. Requiring them to import the interface from a common
Oren> source starts to sound more like Java than Python and would
Oren> introduce dependencies and interface version issues in a
Oren> language that is wonderfully free from such arbitrary
Oren> complexities.

Why is importing an interface any worse than importing a library?

I see both interfaces and categories as claims about types.  Those
claims might be made by the types' authors, or they might be made by
the types' users.  I see no reason why they should have to be any
more static than the definitions of the types themselves.
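
As a rough illustration of a claim made by a type's user rather than
its author, here is a sketch using the abc module of present-day
Python, which postdates this thread.  Readable and LegacyFile are
made-up names used only for the example.

```python
from abc import ABC

class Readable(ABC):
    """Hypothetical interface: objects that provide read()."""

class LegacyFile:
    """Stands in for a class from someone else's library."""
    def read(self):
        return "data"

# Neither class mentions the other; the user makes the claim afterwards.
Readable.register(LegacyFile)

print(isinstance(LegacyFile(), Readable))
print(issubclass(LegacyFile, Readable))
```

The registration is as dynamic as the class definitions themselves:
it is an ordinary call executed at run time, and LegacyFile's base
classes are left untouched.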

Oren> Python is a dynamic language. It deserves a dynamic type
Oren> category system, not static interfaces that must be
Oren> declared. It's fine to write a class and somehow say "I intend
Oren> this class to be in category X, please warn me if I write a
Oren> method that will make it incompatible". But I don't want
Oren> declarations to be a *requirement* for being considered
Oren> compatible with a protocol. I have noticed that a lot of
Oren> protocols are defined retroactively by observation of the
Oren> behavior of existing code. There shouldn't be any need to go tag
Oren> someone else's code as conforming to a protocol or put a wrapper
Oren> around it just to be able to use it.

Oren> A category is defined mathematically by a membership
Oren> predicate. So what we need for type categories is a system for
Oren> writing predicates about types.

Indeed, that's what I was thinking about initially.  Guido pointed out
that the notion could be expanded to making concrete assertions about
the interface to a class.  I had originally considered that those
assertions could be just that--assertions, but then when Guido started
talking about interfaces, I realized that my original thought of
expressing satisfaction of a predicate by inheriting it could be
extended by simply adding methods to those predicates.  Of course,
this technique has the disadvantage that it's not easy to add base
classes to a class after it has been defined.
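
One way to picture a category as a membership predicate over types,
without adding base classes after the fact, is the __subclasshook__
mechanism of present-day Python's abc module.  This is only a sketch:
SupportsClose and Door are invented names, and checking for a method
name is a much weaker predicate than the ones discussed in this
thread.

```python
from abc import ABC

class SupportsClose(ABC):
    """Hypothetical category: any type that defines a close() method."""

    @classmethod
    def __subclasshook__(cls, candidate):
        # The predicate: look for 'close' anywhere in the candidate's MRO.
        if cls is SupportsClose and any(
            "close" in vars(base) for base in candidate.__mro__
        ):
            return True
        return NotImplemented      # fall back to the normal subclass rules

class Door:
    def close(self):
        return "closed"

print(issubclass(Door, SupportsClose))
print(issubclass(int, SupportsClose))
```

Door never inherits from or registers with SupportsClose; membership
is decided purely by evaluating the predicate on the type.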

Oren> Standard Python expressions should not be used for defining a
Oren> category membership predicate. A Python expression is not a pure
Oren> function. This makes it impossible to cache the results of which
Oren> type belongs to what category for efficiency. Another problem is
Oren> that many different expressions may be equivalent but if two
Oren> independently defined categories use equivalent predicates they
Oren> should *be* the same category.  They should be merged at runtime
Oren> just like interned strings.


Oren> About a year ago I worked on a system for predicates having a
Oren> canonical representation for security applications. While I
Oren> was working on it I realized that it would be perfect for
Oren> implementing a type category system for Python. It would be
Oren> useful at runtime for error detection and runtime queries of
Oren> protocols. It would also be useful at compile time for early
Oren> detection of some errors and possibly for optimization. By
Oren> implementing an optional strict mode the early error detection
Oren> could be improved to the point where it's effectively a static
Oren> type system.

Oren> Just a quick example of the usefulness of canonical predicates:
Oren> if I calculate the intersection of two predicates and reduce it
Oren> to canonical form it will reduce to the FALSE predicate if no
Oren> input will satisfy both predicates. It will be equal to one of
Oren> the predicates if it is contained by the other.

Oren> I spent countless hours thinking about these issues, probably
Oren> more than most people on this list... I think I have the
Oren> foundation for a powerful yet unobtrusive type category
Oren> system. Unfortunately it will take me some time to put it in
Oren> writing and I don't have enough free time (who does?)

Is there room to scribble it in the margin somewhere?  <wink>

Andrew Koenig,,

From  Wed Aug 14 15:06:07 2002
From: (Steve Holden)
Date: Wed, 14 Aug 2002 10:06:07 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
References: <> <>              <>  <>
Message-ID: <009d01c2439b$c2e562c0$>

> > I'm trying to understand why hex constants > sys.maxint have to
> > be deprecated.
> Hm, maybe I should use a different warning category rather than
> DeprecationWarning?  Any suggestions?

SignExtensionWarning? IntegerPrecisionWarning?

Steve Holden                       
Python Web Programming      

From  Wed Aug 14 15:53:10 2002
From: (Barry A. Warsaw)
Date: Wed, 14 Aug 2002 10:53:10 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
References: <>
Message-ID: <>

>>>>> "OT" == Oren Tirosh <> writes:

    OT> What's wrong with 0x950412de is that with a word width of 32
    OT> bits it is negative and therefore the invisible bits to the
    OT> left are all set. With a word width of 64 bits or with an
    OT> infinite width they are cleared.

My point is that if I write a hex constant I never think about it as a
negative number; it's always an unsigned bit pattern.  I know Python
currently disagrees when the bit pattern is 32-bits in width and the
top bit is set, and that PEP 237 is the roadmap to get there from

>>>>> "GvR" == Guido van Rossum <> writes:

    >> I'm trying to understand why hex constants > sys.maxint have to
    >> be deprecated.

    GvR> Hm, maybe I should use a different warning category rather
    GvR> than DeprecationWarning?  Any suggestions?

I think that would help a lot, yes.  We had a lively internal
discussion this morning about it and we came up with FutureWarning.
Maybe Guido will come up with a better name, but I don't think it
should be DeprecationWarning.  The code that causes the warning isn't
being deprecated, its semantics are destined to be changed, and that
seems like an important distinction.
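
A small sketch of how a separate warning category keeps the two
cases apart, written for present-day Python where FutureWarning
exists in the warnings machinery.  The function parse_hex32 and its
message text are hypothetical; they only illustrate issuing the
warning for constructs whose meaning will change.

```python
import warnings

def parse_hex32(text):
    """Hypothetical helper: warn when a 32-bit hex constant has its top bit set."""
    value = int(text, 16)
    if 0x80000000 <= value <= 0xFFFFFFFF:
        warnings.warn(
            "hex constant %s will keep its positive value" % text,
            FutureWarning,
            stacklevel=2,
        )
    return value

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parse_hex32("0x950412de")

print(caught[0].category.__name__)
```

Because FutureWarning does not derive from DeprecationWarning, users
can filter "this will go away" and "this will change meaning"
independently.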


From  Wed Aug 14 15:58:32 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 10:58:32 -0400
Subject: [Python-Dev]
In-Reply-To: Your message of "Wed, 14 Aug 2002 15:12:21 +0200."
References: <>
Message-ID: <>

> guido wrote:
> > The mkstemp() function in the rewritten tempfile has an argument with
> > a curious name and default: binary=True.  This caused confusion (even
> > the docstring in the original patch was confused :-).  It would be
> > much easier to explain if this was changed to text=False.
> fwiw, it would probably be even easier to use/explain if it
> used a mode string.

The [Named]TemporaryFile() functions do that.  mkstemp() returns an
OS-level file descriptor.  I guess the 'binary' flag is coming from
Windows thinking, where you have to add os.O_BINARY to the open()
flags for binary mode.

I'll change mkstemp() to having a text=False argument instead.
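
A short sketch of the resulting call as it looks in present-day
Python.  The suffix and the bytes written are arbitrary choices for
the example; the point is that mkstemp() hands back a raw descriptor
and a path rather than a file object.

```python
import os
import tempfile

# text=False (the default) opens the descriptor in binary mode.
fd, path = tempfile.mkstemp(suffix=".bin", text=False)
try:
    os.write(fd, b"\x00\r\n\xff")      # bytes go through untranslated
finally:
    os.close(fd)

with open(path, "rb") as f:
    data = f.read()
os.remove(path)                        # the caller is responsible for cleanup

print(data)
```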

--Guido van Rossum (home page:

From  Wed Aug 14 15:59:51 2002
From: (Barry A. Warsaw)
Date: Wed, 14 Aug 2002 10:59:51 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
References: <>
Message-ID: <>

>>>>> "SH" == Steve Holden <> writes:

    SH> SignExtensionWarning? IntegerPrecisionWarning?

ItsGonnaBeABitDifferentWarning




From  Wed Aug 14 16:16:05 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 11:16:05 -0400
Subject: [Python-Dev] hex constants, bit patterns, PEP 237 warnings and gettext
In-Reply-To: Your message of "Wed, 14 Aug 2002 10:59:51 EDT."
References: <> <> <> <> <009d01c2439b$c2e562c0$>
Message-ID: <>

>     SH> SignExtensionWarning? IntegerPrecisionWarning?
> ItsGonnaBeABitDifferentWarning

Let it be FutureWarning.

--Guido van Rossum (home page:

From  Wed Aug 14 16:54:40 2002
From: (Neil Schemenauer)
Date: Wed, 14 Aug 2002 08:54:40 -0700
Subject: [Python-Dev] type categories
In-Reply-To: <0ce601c24346$ff967f60$>; from on Tue, Aug 13, 2002 at 11:59:29PM -0400
References: <> <> <0c0501c24311$8cebbdc0$> <> <0ccc01c24341$d839b130$> <> <0ce601c24346$ff967f60$>
Message-ID: <>

David Abrahams wrote:
> There's not all that much to what I'm doing. I have a really simple-minded
> dispatching scheme which checks each overload in sequence, and takes the
> first one which can get a match for all arguments.

Can you explain in more detail how the matching is done?  Wouldn't
having some kind of type declarations be a precondition to implementing
multiple dispatch?
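
For concreteness, a first-match scheme of the kind David describes
could look roughly like the following present-day Python.  This is a
guess at the general shape, not his implementation; FirstMatch and
describe are hypothetical names, and the "declarations" are simply
the types passed when an overload is registered.

```python
class FirstMatch:
    """Try overloads in registration order; call the first that fits."""

    def __init__(self):
        self._overloads = []           # list of (type tuple, function)

    def register(self, *types):
        def decorator(func):
            self._overloads.append((types, func))
            return func
        return decorator

    def __call__(self, *args):
        for types, func in self._overloads:
            if len(types) == len(args) and all(
                isinstance(arg, t) for arg, t in zip(args, types)
            ):
                return func(*args)
        raise TypeError("no overload matches %r" % (args,))

describe = FirstMatch()

@describe.register(int, int)
def _(a, b):
    return "two ints"

@describe.register(object, object)
def _(a, b):
    return "anything"

print(describe(1, 2))
print(describe(1, "x"))
```

Registration order matters here: a general overload registered first
would shadow every more specific one after it.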


From  Wed Aug 14 17:10:43 2002
From: (Tim Peters)
Date: Wed, 14 Aug 2002 12:10:43 -0400
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: <>
Message-ID: <>

> Thanks!  Maybe the test was a little too eager though -- perhaps it
> could be happy with creating 100 instead of 1000 files.

Just noting that this change has been made in current CVS, so there
shouldn't be a need to boost the HP default anymore.

From  Wed Aug 14 17:17:14 2002
From: (Samuele Pedroni)
Date: Wed, 14 Aug 2002 18:17:14 +0200
Subject: [Python-Dev] multiple dispatch (ii)
Message-ID: <007901c243ae$0db918c0$6d94fea9@newmexico>

Here is my old code.
It is alpha-quality prototype code:
no syntactic sugar, no integration, pure Python.

The "_redispatch" mechanism is the moral
equivalent of

class A:
  def meth(self): ...

class B(A):
  def meth(self):
    A.meth(self)  # explicit call to the base-class method

it is used both for call-next-method functionality
(that means super for multiple dispatch)
and to solve ambiguities.

(This is pre-2.2 stuff; nowadays
the MRO of the actual argument type can be used
to resolve ambiguities, as CLOS and Dylan do. If you add
interfaces/protocols to the picture, you have to
decide how to merge them into the MRO, where applicable.)

[it uses memoization and so you can't fiddle
with __bases__]

print "** mdisp test"
import mdisp

class Panel: pass

class PadPanel(Panel): pass

class Specific: pass

present = mdisp.Generic()

panel = PadPanel()
spec = Specific()

def pan(p,o):
    print "generic panel present"

def pad(p,o):
    print "pad panel present"

def speci(p,o):
    print "generic panel <specific> present"

def padspeci(p,o):
    print "pad panel <specific> present"






except mdisp.AmbiguousMethodError:
    print "ambiguity"

print "_redispatch = (None,Any)",




print "* again... panel:obj tierule"







except mdisp.AmbiguousMethodError:
    print "ambiguity"



** mdisp test
generic panel present
generic panel <specific> present
_redispatch = (None,Any) pad panel present
pad panel <specific> present
* again... panel:obj tierule
generic panel present
generic panel <specific> present
pad panel present
pad panel <specific> present


import types
import re

def class_of(obj):
    # Old-style instances all share types.InstanceType, so use __class__.
    if type(obj) is types.InstanceType:
        return obj.__class__
    else:
        return type(obj)

NonComparable = None
class Any: pass

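def class_le(cl1,cl2):
    # 1 if cl1 is cl2 or a subclass of it, 0 if it is a proper
    # superclass, NonComparable if the two classes are unrelated.
    if cl1 == cl2: return 1
    if cl2 == Any: return 1
    try:
        cl_lt = issubclass(cl1,cl2)
        cl_gt = issubclass(cl2,cl1)
        if not (cl_lt or cl_gt): return NonComparable
        return cl_lt
    except TypeError: # issubclass() rejects arguments that are not classes
        return NonComparable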

def classes_tuple_le(tup1,tup2):
    if len(tup1) != len(tup2): return NonComparable
    tup_le = 0
    tup_gt = 0
    for cl1,cl2 in zip(tup1,tup2):
        cl_le = class_le(cl1,cl2)
        if cl_le == NonComparable:
            return NonComparable
        if cl_le:
            tup_le |= 1
        else:
            tup_gt |= 1
        if tup_le and tup_gt: return NonComparable
    return tup_le

def classes_tuple_le_ex(tup1,tup2, tierule = None):
    if len(tup1) != len(tup2): return NonComparable
    if not tierule: tierule = (len(tup1),)
    last = 0
    for upto in tierule:
        sl1 = tup1[last:upto]
        sl2 = tup2[last:upto]
        last = upto
        if sl1 == sl2: continue
        if len(sl1) == 1:
            return class_le(sl1[0],sl2[0])
        sl_le = 0
        sl_gt = 0
        for cl1,cl2 in zip(sl1,sl2):
            cl_le = class_le(cl1,cl2)
            if cl_le == NonComparable:
                return NonComparable
            if cl_le:
                sl_le |= 1
            else:
                sl_gt |= 1
            if sl_le and sl_gt: return NonComparable
        return sl_le
    return 1

_id_regexp = re.compile(r"\w+")

def build_tierule(patt):
    tierule = []
    last = 0
    for uni in patt.split(':'):
        c = 0
        for arg in uni.split(','):
            if not _id_regexp.match(arg):
                raise ValueError("invalid Generic (tierule) pattern")
            c += 1
        last += c
        tierule.append(last) # cumulative end index of this tie group
    return tierule

def forge_classes_tuple(model,tup):
    # Each non-empty entry of model overrides the matching class in tup.
    return tuple ( map ( lambda (m,cl): m or cl,
                         zip(model,tup)))

class GenericDispatchError(TypeError): pass

class NoApplicableMethodError(GenericDispatchError): pass

class AmbiguousMethodError(GenericDispatchError): pass

class Generic:
    def __init__(self,args=None):
        self.cache = {}
        self.methods = {}
        if args:
            self.args = args
            self.tierule = build_tierule(args)
        else:
            self.args = "???"
            self.tierule = None

    def add_method(self,cltup,func):
        self.methods[cltup] = func
        new_meth = (cltup,func)
        self.cache[cltup] = new_meth
        for d_cltup,(meth_cltup,meth_func) in self.cache.items():
            if classes_tuple_le(d_cltup,cltup):
                le = classes_tuple_le_ex(cltup,meth_cltup,self.tierule)
                if le == NonComparable:
                    del self.cache[d_cltup]
                elif le:
                    self.cache[d_cltup] = new_meth

    def __call__(self,*args,**kw):
        redispatch = kw.get('_redispatch',None)
        d_cltup = map(class_of,args)
        if redispatch:
            d_cltup = forge_classes_tuple(redispatch,d_cltup)
        else:
            d_cltup = tuple(d_cltup)

        if self.cache.has_key(d_cltup):
            return self.cache[d_cltup][1](*args) # 1 retrieves func

        cands = []
        for cltup in self.methods.keys():
            if d_cltup == cltup:
                return self.methods[cltup](*args)
            if classes_tuple_le(d_cltup,cltup): # applicable?
                i = len(cands)
                app = not i
                i -= 1
                while i>=0:
                    cand = cands[i]
                    le = classes_tuple_le_ex(cltup,cand,self.tierule)
                    #print cltup,"<=",cand,"?",le
                    if le == NonComparable:
                        app = 1
                    elif le:
                        if cand != cltup:
                            app = 1
                            #print "remove",cand
                            del cands[i]
                    i -= 1
                if app:
                    cands.append(cltup)
                #print cands
        if len(cands) == 0:
            raise NoApplicableMethodError
        if len(cands)>1:
            raise AmbiguousMethodError
        cltup = cands[0]
        func = self.methods[cltup]
        self.cache[d_cltup] = (cltup,func)
        return func(*args)
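
The selection rule at the heart of the module can be restated
compactly in present-day Python.  The sketch below is a simplified
reading of it, not a port: it has no cache, no tie rule and no
_redispatch, and the helper names at_least_as_specific and select are
made up.  It reuses the Panel, PadPanel and Specific classes from the
test script above.

```python
class AmbiguousMethodError(TypeError):
    pass

def at_least_as_specific(sig_a, sig_b):
    """True if every class in sig_a is a subclass of its partner in sig_b."""
    return all(issubclass(a, b) for a, b in zip(sig_a, sig_b))

def select(methods, args):
    """Pick the unique most specific method applicable to args."""
    arg_types = tuple(type(a) for a in args)
    applicable = [
        sig for sig in methods
        if len(sig) == len(arg_types) and at_least_as_specific(arg_types, sig)
    ]
    if not applicable:
        raise TypeError("no applicable method")
    # A winner must be at least as specific as every other applicable method.
    best = [
        sig for sig in applicable
        if all(at_least_as_specific(sig, other) for other in applicable)
    ]
    if len(best) != 1:
        raise AmbiguousMethodError(arg_types)
    return methods[best[0]]

class Panel: pass
class PadPanel(Panel): pass
class Specific: pass

methods = {
    (Panel, object): lambda p, o: "generic panel",
    (PadPanel, object): lambda p, o: "pad panel",
    (Panel, Specific): lambda p, o: "generic panel <specific>",
}
```

With these three methods, a call on (PadPanel, Specific) arguments is
ambiguous, because neither (PadPanel, object) nor (Panel, Specific)
is more specific than the other; registering a (PadPanel, Specific)
method removes the ambiguity.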

From  Wed Aug 14 19:35:41 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 20:35:41 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> Why is this hard work? I would guess that a simple table lookup would
> suffice, after all there are only a finite number of unicode
> characters that can be split up, and each one can be split up in only
> a small number of ways.

Canonical decomposition requires more than that: you not only need to
apply the canonical decomposition mapping, but also need to put the
resulting characters into canonical order (if more than one combining
character applies to a base character).
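
To make the ordering step concrete, here is a sketch using the
unicodedata module of present-day Python (normalize() was not yet
available when this was written).  The two input strings are chosen
for the example: both attach an acute accent and a dot below to "a",
in opposite orders.

```python
import unicodedata

acute_then_dot = "a\u0301\u0323"   # COMBINING ACUTE ACCENT, then DOT BELOW
dot_then_acute = "a\u0323\u0301"   # the same marks in the other order

# Canonical combining classes decide the canonical order of the marks.
print([unicodedata.combining(c) for c in acute_then_dot])

nfd_1 = unicodedata.normalize("NFD", acute_then_dot)
nfd_2 = unicodedata.normalize("NFD", dot_then_acute)
print(nfd_1 == nfd_2)
```

Neither input contains anything that a decomposition table would
change, yet the two only become equal once the marks are sorted by
combining class.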

In addition, a naïve implementation will consume large amounts of
memory. Hangul decomposition is better done algorithmically, as we
are talking about 11172 precombined characters for Hangul alone.
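
The arithmetic decomposition looks roughly like this in present-day
Python.  The constants are the ones given for Hangul syllables in the
Unicode Standard; the function name decompose_hangul is made up, and
the comparison against unicodedata.normalize() is there only as a
cross-check.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT        # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT        # 11172 precomposed syllables

def decompose_hangul(ch):
    """Return the canonical decomposition of one Hangul syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                  # not a precomposed Hangul syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail

syllable = "\ud55c"                # HANGUL SYLLABLE HAN
print([hex(ord(c)) for c in decompose_hangul(syllable)])
print(decompose_hangul(syllable) == unicodedata.normalize("NFD", syllable))
```

A handful of integer constants replaces what would otherwise be
11172 table entries.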

> Wouldn't something like
> for c in input:
>     if not canbestartofcombiningsequence.has_key(c):
>         output.append(c)
>     nlookahead = MAXCHARSTOCOMBINE
>     while nlookahead > 1:
>         attempt = lookahead next nlookahead bytes from input
>         if combine.has_key(attempt):
>             output.append(combine[attempt])
>             skip the lookahead in input
>             break
>     else:
>         output.append(c)
> do the trick, if the two dictionaries are initialized intelligently?

No, that doesn't do canonical ordering. There is a lot more to
normalization; the hard work is really in understanding what has to be


From  Wed Aug 14 19:46:04 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 20:46:04 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> >>> hex(-16711681)
> '0xff00ffff'
> >>> hex(-16711681L)
> '-0xff0001L'		# ??!?!?
> The hex representation of longs is something I find quite misleading and 
> I think it's also unprecedented.  This wart has bothered me for a long 
> time now but I didn't have any use for it so I didn't mind too much. Now 
> it is proposed to extend this useless representation to ints so I do.

I don't find it misleading - in fact, the C representation is
misleading: 0xff00ffff looks like a positive number (it does not have
a sign) - this is misleading, as the number is, in fact, negative.

The representation is not misleading: it does not make you believe it
is something that it actually isn't. It might be surprising, but after
thinking about it, it should be clear that it is correct: -N is the
number that, when added to N, gives zero. Indeed:

>>> -16711681L+0xff0001L

If you want the bitmask for the lowest 32 bits, you can write

>>> hex(-16711681L & (2**32-1))
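
In present-day Python 3 there is a single integer type and no L
suffix; the sketch below restates the two expressions from this
message under that assumption.

```python
n = -16711681

print(hex(n))                  # sign plus magnitude
print(n + 0xff0001)            # -N added to N gives zero
print(hex(n & (2**32 - 1)))    # the lowest 32 bits as an unsigned pattern
```

The mask makes the intended word width explicit instead of leaving it
to the platform.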

Notice that -16711681 is a number with an infinite amount of leading
ones - just as 16711681 is a number with an infinite amount of leading


From  Wed Aug 14 19:49:25 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 14:49:25 -0400
Subject: [Python-Dev] Alternative implementation of interning
Message-ID: <>


I think Oren did a good job on this.  Could somebody please do an
independent review of the code before I check it in?

--Guido van Rossum (home page:

From  Wed Aug 14 20:21:57 2002
From: (Paul Svensson)
Date: Wed, 14 Aug 2002 15:21:57 -0400 (EDT)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On 14 Aug 2002, Martin v. Loewis wrote:

>Oren Tirosh <> writes:
>> >>> hex(-16711681)
>> '0xff00ffff'
>> >>> hex(-16711681L)
>> '-0xff0001L'		# ??!?!?
>> The hex representation of longs is something I find quite misleading and
>> I think it's also unprecedented.  This wart has bothered me for a long
>> time now but I didn't have any use for it so I didn't mind too much. Now
>> it is proposed to extend this useless representation to ints so I do.
>I don't find it misleading - in fact, the C representation is
>misleading: 0xff00ffff looks like a positive number (it does not have
>a sign) - this is misleading, as the number is, in fact, negative.
>The representation is not misleading: it does not make you believe it
>is something that it actually isn't. It might be surprising, but after
>thinking about it, it should be clear that it is correct: -N is the
>number that, when added to N, gives zero. Indeed:
>>>> -16711681L+0xff0001L
>If you want the bitmask for the lowest 32 bits, you can write
>>>> hex(-16711681L & (2**32-1))
>Notice that -16711681 is a number with an infinite amount of leading
>ones - just as 16711681 is a number with an infinite amount of leading

Just a thought: if it's true that those using hex() and %x are more
interested in the bit values than the numerical value of the whole number,
would a format like ~0xff000 be easier to interpret (and stop this debate) ?
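
A sketch of what such a format might look like, in present-day
Python.  The function name tilde_hex is invented for the example and
nothing like it exists in the standard library; it prints a negative
number as the complement of a non-negative bit pattern.

```python
def tilde_hex(n):
    """Format n in hex, writing negative values as ~0x... (hypothetical)."""
    if n >= 0:
        return hex(n)
    return "~" + hex(~n)       # ~n is non-negative whenever n is negative

def parse_tilde_hex(text):
    """Inverse of tilde_hex, to show that the notation is unambiguous."""
    if text.startswith("~"):
        return ~int(text[1:], 16)
    return int(text, 16)

for value in (255, -1, -16711681):
    print(value, tilde_hex(value))
```

The output is itself a valid Python expression that evaluates to the
original number, which hex() also guarantees for its signed form.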


From  Wed Aug 14 20:35:01 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 21:35:01 +0200
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
References: <>
Message-ID: <>

Paul Svensson <> writes:

> Just a thought: if it's true that those using hex() and %x are more
> interested in the bit values than the numerical value of the whole
> number, would a format like ~0xff000 be easier to interpret (and
> stop this debate) ?

I like this.


From  Wed Aug 14 20:39:00 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 15:39:00 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Wed, 14 Aug 2002 15:21:57 EDT."
References: <>
Message-ID: <>

> Just a thought: if it's true that those using hex() and %x are more
> interested in the bit values than the numerical value of the whole number,
> would a format like ~0xff000 be easier to interpret (and stop this debate) ?

Hmm...  It has a perverse Pythonic smell...  But I fear it would
introduce more backwards incompatibilities, because it would have to
apply to longs as well, and hence change the output whenever a
negative long is converted to hex or octal.  (And what about %u?
Should "%u" % -1 return "~0" too?)

--Guido van Rossum (home page:

From  Wed Aug 14 20:43:29 2002
From: (Andrew Koenig)
Date: 14 Aug 2002 15:43:29 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <>

>> PS: is pure substring testing such a common idiom?
>> I have not found so many
>> matches for   find\(.*\)\s*>  in the std lib

Greg> For more generality, maybe

Greg>   re in string

Greg> should be made to work too, where re is a regular
Greg> expression object?

Then the core language would have to know about regular
expressions, right?

Andrew Koenig,,

From  Wed Aug 14 20:52:00 2002
From: (Jack Jansen)
Date: Wed, 14 Aug 2002 21:52:00 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On woensdag, augustus 14, 2002, at 02:13 , Guido van Rossum wrote:
> Note that normalization doesn't belong in the codecs (except perhaps
> as a separate Unicode->Unicode codec, since codecs seem to be useful
> for all string->string transformations).  It's a separate step that
> the application has to request; only the app knows whether a
> particular Unicode string is already normalized or not, and whether
> the expense is useful for the app, or not.

I don't like this, I don't like it at all.

Python jumps through hoops to make 'jack' and u'jack' compare
identical and be interchangeable in dict keys and what have you,
and now suddenly I find out that there's two ways to say u'jäck'
and they won't compare equal. Not good.

I sympathise with the fact that this is difficult (although I
still don't understand why: whereas when you want to create the
decomposed version I can imagine there's N! ways to notate a
character with N combining chars, I would think there's one and
only one way to write a combined character), but that shouldn't
stop us at least planning to fix this.

And I don't think the burden should fall on the application.
That same reasoning could have been followed for making ascii
and unicode-ascii-subset compare equal: the application will
know it has to convert ascii to unicode before comparing.
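
The situation being objected to can be reproduced in present-day
Python, where comparison is still by code point and normalization is
an explicit step via unicodedata.normalize().  The two spellings are
written with escapes so that the difference is visible in the source.

```python
import unicodedata

composed = "j\u00e4ck"        # a-umlaut as a single code point
decomposed = "ja\u0308ck"     # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

As dict keys the two spellings are likewise distinct until the
application normalizes them to a common form.
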
- Jack Jansen        <>        -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -

From  Wed Aug 14 21:13:09 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 16:13:09 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Wed, 14 Aug 2002 15:43:29 EDT."
References: <>
Message-ID: <>

> Greg>   re in string
> Greg> should be made to work too, where re is a regular
> Greg> expression object?
> Then the core language would have to know about regular
> expressions, right?

Um, yes.  That kills the idea (unless you want to write this as
"string in re", which almost makes sense :-).

--Guido van Rossum (home page:

From  Wed Aug 14 21:18:22 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 16:18:22 -0400
Subject: [Python-Dev] SET_LINENO killer
Message-ID: <>


Looks like Michael Hudson did an *outstanding* and very thorough job
on this.  Does anybody see a reason why I shouldn't let him check this

--Guido van Rossum (home page:

From  Wed Aug 14 21:30:23 2002
From: (Paul Svensson)
Date: Wed, 14 Aug 2002 16:30:23 -0400 (EDT)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

On Wed, 14 Aug 2002, Guido van Rossum wrote:

>> Just a thought: if it's true that those using hex() and %x are more
>> interested in the bit values than the numerical value of the whole number,
>> would a format like ~0xff000 be easier to interpret (and stop this debate) ?
>Hmm...  It has a perverse Pythonic smell...  But I fear it would
>introduce more backwards incompatibilities, because it would have to
>apply to longs as well, and hence change the output whenever a
>negative long is converted to hex or octal.  (And what about %u?
>Should "%u" % -1 return "~0" too?)

Didn't you say "%u" would be going away ?
You're right about octal, that would be nice to change, too.
Maybe the right time to do the change would be when the L goes away,
since that would be similarly invasive ?


From  Wed Aug 14 21:39:01 2002
From: (Skip Montanaro)
Date: Wed, 14 Aug 2002 15:39:01 -0500
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
References: <>
Message-ID: <15706.49125.814428.988008@localhost.localdomain>

    Guido> I think Oren did a good job on this.  Could somebody please do an
    Guido> independent review of the code before I check it in?

Since I haven't actually looked at the patch yet, this doesn't qualify as a
review, but how about renaming PyString_InternInPlace to
PyString_InternImmortal?  My guess is "InPlace" refers to some structural
difference between mortal and immortal interned strings which doesn't give
the programmer any hints about intended usage of either function.


From  Wed Aug 14 21:34:55 2002
From: (Andrew Koenig)
Date: Wed, 14 Aug 2002 16:34:55 -0400 (EDT)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <> (message from Guido
 van Rossum on Wed, 14 Aug 2002 16:13:09 -0400)
References: <>
 <> <>
Message-ID: <>

Greg> re in string
Greg> should be made to work too, where re is a regular
Greg> expression object?
>> Then the core language would have to know about regular
>> expressions, right?

Guido> Um, yes.  That kills the idea (unless you want to write this as
Guido> "string in re", which almost makes sense :-).

Or unless the notion of ``x in y'' could be reinterpreted
in terms of a new attribute that strings, chars, and regexps
would share.

That is, I can imagine defining ``x in y'' analogously to ``x+y''
as follows:

   If x has an attribute __in__, then ``x in y'' means ``x.__in__(y)''

   Otherwise, if y has an attribute __rin__, then ``x in y'' means
   ``y.__rin__(x)''

and so on.

This is an example of the kind of situation where I imagine type
categories would be useful.
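
A minimal sketch of that lookup order (ask x for __in__ first, then ask y
for __rin__), emulated with an ordinary function since the ``in'' operator
itself does not work this way.  The names proposed_in, Pattern and Text
are invented for the example; __in__ and __rin__ are the hypothetical
attributes from the proposal, not real Python hooks.

```python
import re


def proposed_in(x, y):
    """Emulate the proposed lookup order for ``x in y`` (hypothetical rule)."""
    if hasattr(x, "__in__"):
        return bool(x.__in__(y))
    if hasattr(y, "__rin__"):
        return bool(y.__rin__(x))
    raise TypeError("neither operand supports the proposed 'in' hooks")


class Pattern:
    """Toy stand-in for a regular expression object (assumption)."""

    def __init__(self, source):
        self._re = re.compile(source)

    def __in__(self, text):
        # "pattern in text" means: the pattern matches somewhere in text.
        return self._re.search(text) is not None


class Text(str):
    """A string type that answers the reflected question."""

    def __rin__(self, needle):
        # Used when the left operand has no __in__: plain substring test.
        return self.find(needle) >= 0


print(proposed_in(Pattern("l+o"), "hello"))   # True
print(proposed_in("ell", Text("hello")))      # True
```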

From  Wed Aug 14 22:01:14 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 17:01:14 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Wed, 14 Aug 2002 16:30:23 EDT."
References: <>
Message-ID: <>

> >> Just a thought: if it's true that those using hex() and %x are more
> >> interested in the bit values than the numerical value of the whole number,
> >> would a format like ~0xff000 be easier to interpret (and stop this debate) ?
> >
> >Hmm...  It has a perverse Pythonic smell...  But I fear it would
> >introduce more backwards incompatibilities, because it would have to
> >apply to longs as well, and hence change the output whenever a
> >negative long is converted to hex or octal.  (And what about %u?
> >Should "%u" % -1 return "~0" too?)
> Didn't you say "%u" would be going away ?

Yes, but not any time soon.

> You're right about octal, that would be nice to change, too.
> Maybe the right time to do the change would be when the L goes away,
> since that would be similarly invasive ?

I see, you meant this idea for Python 3000, not for 2.3 or even 2.4.
That's fine, but doesn't help for the immediate pain.

--Guido van Rossum (home page:

From  Wed Aug 14 22:04:49 2002
From: (Skip Montanaro)
Date: Wed, 14 Aug 2002 16:04:49 -0500
Subject: [Python-Dev] Alternative implementation of interning
Message-ID: <15706.50673.81267.900261@localhost.localdomain>

A couple minor nits from scanning the patch:

* Probably makes no difference, but it seems oddly asymmetric to fiddle with
  the interned string's refcount in string_dealloc, call PyObject_DelItem,
  then not restore the refcount to zero.

* Should be Py_DECREF(keys) (not Py_XDECREF(keys)) in
  _Py_ReleaseInternedStrings.  If you've gotten that far keys can't be
  NULL.  If you're worried about keys being NULL, you should check it before
  the for loop (PyMapping_Size() will barf on a NULL arg).

Also, regarding the name of PyString_InternInPlace, I see now that's the
original name.  I suggest that name be deprecated in favor of
PyString_InternImmortal with a macro defined in stringobject.h for


From  Wed Aug 14 22:12:28 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 17:12:28 -0400
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: Your message of "Wed, 14 Aug 2002 16:34:55 EDT."
References: <> <> <>
Message-ID: <>

> Or unless the notion of ``x in y'' could be reinterpreted
> in terms of a new attribute that strings, chars, and regexps
> would share.
> That is, I can imagine defining ``x in y'' analogously to ``x+y''
> as follows:
>    If x has an attribute __in__, then ``x in y'' means ``x.__in__(y)''
>    Otherwise, if y has an attribute __rin__, then ``x in y'' means
>    ``y.__rin__(x)''
> and so on.
> This is an example of the kind of situation where I imagine type
> categories would be useful.

It is already done this way, except the attribute is called
__contains__ and we only ask the right argument for it: "x in y" calls
"y.__contains__(x)" [if it exists; otherwise there's a fallback that
loops over y's items comparing them to x].
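
A small illustration of both paths, written for a current Python 3
interpreter.  The class names Evens and Countdown are invented for the
example; the `visited` list is only there to make the fallback visible.

```python
class Evens:
    """Defines __contains__, so ``x in Evens()`` never iterates."""

    def __contains__(self, x):
        return isinstance(x, int) and x % 2 == 0


class Countdown:
    """No __contains__; ``in`` falls back to iterating and comparing."""

    def __init__(self, start):
        self.start = start
        self.visited = []

    def __iter__(self):
        for i in range(self.start, 0, -1):
            self.visited.append(i)
            yield i


c = Countdown(5)
print(4 in Evens())   # True
print(3 in c)         # True
print(c.visited)      # [5, 4, 3]
```

Note that only the right-hand operand is consulted in either case.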

I suppose we could add __rcontains__ that was tried next, analogously
to __add__ and __radd__; or maybe it could be called __in__
instead. :-)

Unfortunately that would be a significant change in internal shit.
I'm not convinced that this particular example is worth that
(especially since chars are already taken care of -- they're just
1-char strings).

--Guido van Rossum (home page:

From  Wed Aug 14 22:14:04 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 17:14:04 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Wed, 14 Aug 2002 15:39:01 CDT."
References: <>
Message-ID: <>

> Since I haven't actually looked at the patch yet, this doesn't qualify as a
> review, but how about renaming PyString_InternInPlace to
> PyString_InternImmortal?  My guess is "InPlace" refers to some structural
> difference between mortal and immortal interned strings which doesn't give
> the programmer any hints about intended usage of either function.

Better still, I think we could safely make all interned strings mortal
-- I don't see any use for immortal strings.  (I see a use for immoral
strings but that's a topic for over a couple beers. :-)

--Guido van Rossum (home page:

From  Wed Aug 14 22:26:00 2002
From: (Skip Montanaro)
Date: Wed, 14 Aug 2002 16:26:00 -0500
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
References: <>
Message-ID: <15706.51944.872639.806768@localhost.localdomain>

    >> ... how about renaming PyString_InternInPlace to
    >> PyString_InternImmortal?

    Guido> Better still, I think we could safely make all interned strings
    Guido> mortal -- I don't see any use for immortal strings.

Wasn't this part of the original discussion?  Extension modules are free to
call PyString_InternInPlace and may well expect immortal strings, so for
backward compatibility, the functionality probably has to remain for a time,
yes?

Of course, I'm speaking with my fake expert hat on.  I've never even
considered interning a string, immortal, immoral, or otherwise.


From  Wed Aug 14 22:22:35 2002
From: (Andrew Koenig)
Date: 14 Aug 2002 17:22:35 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>> Perhaps the reason it's rare is that it's difficult to do.

Guido> Perhaps...  Is it the chicken or the egg?

Did you hear about the two philosophers in the diner?  One ordered a
chicken-salad sandwich and the other ordered an egg-salad sandwich,
because they wanted to see which would come first.

>> One of the cases I was thinking of was the built-in * operator,
>> which does something completely different if one of its operands
>> is an integer.

Guido> Really?  I suppose you're thinking of sequence repetition.


Guido> I consider that one of my early mistakes (it didn't make it to
Guido> my "regrets" list but probably should have).  It would have
Guido> been much simpler if sequences simply supported multiplication,
Guido> and in fact repeated changes to the implementation (and subtle
Guido> edge cases of the semantics) are slowly nudging into this
Guido> direction.

It's still a plausible example, I think.

>> Another one was the buffering iterator we were discussing earlier,
>> which ideally would omit buffering entirely if asked to buffer a
>> type that already supports multiple iteration.

Guido> How do you do that in C++?  I guess you overload the function
Guido> that asks for the iterator, and call that function in a
Guido> template.  I think in Python we can ask the caller to provide a
Guido> buffering iterator when a function needs one.  Since we really
Guido> have very little power at compile time, we sometimes need to do
Guido> a little more work at run time.  But the resulting language
Guido> appears to be easier to understand (for most people anyway)
Guido> despite the theoretical deficiency.

I understand that, I think.

The C++ library has a notion of ``iterator traits'' that is implemented
by a template class named, of all things, iterator_traits.  So, for example,
if T is an iterator type, then iterator_traits<T>::value_type is the
type that dereferencing an object of type T will yield.  To reveal what
operations an iterator supports, iterator_traits<T>::iterator_category
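is one of the following five types, depending on the iterator:

        input_iterator_tag
        output_iterator_tag
        forward_iterator_tag
        bidirectional_iterator_tag
        random_access_iterator_tag
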
Each of the last three of these types is derived from the one before it.
It is possible to instantiate objects of any of these types, but the
objects carry no information beyond their type and identity.

Now, suppose you want to implement an algorithm that requires a
bidirectional iterator, but can be done more efficiently with a
random-access iterator.  Then you might write something like this:

        // The bidirectional-iterator case
        template<class It>
        void foo_aux(It begin, It end, bidirectional_iterator_tag) {
                // ...
        }

        // The random-access-iterator case
        template<class It>
        void foo_aux(It begin, It end, random_access_iterator_tag) {
                // ...
        }

and then you can select the appropriate algorithm at compile time this way:

        template<class It>
        void foo(It begin, It end) {
                foo_aux(begin, end,
                   typename iterator_traits<It>::iterator_category());
        }

This code creates an extra object (the anonymous object created by the
expression ``typename iterator_traits<It>::iterator_category()'') for the
sole purpose of using its type to distinguish between the two overloaded
versions of foo_aux.  This distinction is made at compile time, and if
the compiler is smart enough, it will also optimize away the empty,
anonymous object.

So this is an example of what I mean by ``dispatching based on a type
category.''  In C++ it's done at compile time, but what I care about
in the context of Python is not when it is done, but rather how
convenient it is to express.  (I don't think the C++ mode of
expression is particularly convenient, but at least it's possible)
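
For comparison, a sketch of the same idea done at run time in Python: one
entry point that picks a faster algorithm when the argument belongs to a
"random access" category.  It uses collections.abc.Sequence from the
modern standard library as a stand-in for such a category (it did not
exist when this was written), and the function name last is invented for
the example.

```python
from collections.abc import Sequence


def last(items):
    """Return the final element, picking the algorithm by category at run time."""
    if isinstance(items, Sequence):
        # Random-access case: index directly, no traversal needed.
        if not items:
            raise ValueError("empty input")
        return items[-1]
    # General case: any iterable, walked from start to finish.
    sentinel = object()
    result = sentinel
    for result in items:
        pass
    if result is sentinel:
        raise ValueError("empty input")
    return result


print(last([1, 2, 3]))         # 3
print(last(iter([1, 2, 3])))   # 3
```

The test is an explicit isinstance() check inside the function body, which
is exactly the kind of ad-hoc run-time test the discussion is about.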

Guido> I'm not quite sure why that is, but I am slowly developing a
Guido> theory, based on a remark by Samuele Pedroni; at least I
Guido> believe it was he who remarked at some point "Python has only
Guido> run time", which got me thinking.  My theory, partially
Guido> developed though it is, is that it is much harder (again, for
Guido> most people :-) to understand in your head what happens at
Guido> compile time than it is to understand what goes on at run time.
Guido> Or perhaps that understanding *both* is harder than
Guido> understanding only one.

I have no problem believing that.

Guido> But I believe that for most people acquiring a sufficient
Guido> mental model for what goes on at run time is simpler than the
Guido> mental model for what goes on at compile time.  Possibly this
Guido> is because compilers really *do* rely on very sophisticated
Guido> algorithms (such as deciding which overloaded function is
Guido> called based upon type information and available conversions).
Guido> Run time on the other hand is dead simple most of the time --
Guido> it has to be, since it has to be executed by a machine that has
Guido> a very limited time to make its decisions.

That's OK with me.  But I'd still like a less ad-hoc way of making
those run-time tests.

Guido> All this reminds me of a remark that I believe is due to John
Guido> Ousterhout at the VHLL conference in '94 in Santa Fe, where you & I
Guido> first met.  (Strangely it was Perl's Tom Christiansen who was in a
Guido> large part responsible for the eclectic program.)  You gave a talk
Guido> about ML, and I believe it was in response to your talk that John
Guido> remarked that ML was best suited for people with an IQ of over 150.

I'm still not convinced that's necessarily true -- I think it depends
a great deal on how ML is taught.  I do believe that most of what has
been written about ML is hard to follow for people who have grown up
in the imperative world, but I don't think it has to be that way.

Guido> That rang true to me, since the only other person besides you
Guido> that I know who is a serious ML user definitely falls into that
Guido> category.

Thanks for the compliment!

Guido> And ML is definitely a language that does more than the average
Guido> language at compile time.

Yes.  One of the reasons I find it interesting, incidentally, is that
it still manages to generate surprisingly efficient machine code.

>> Actually, I thought of them but omitted them to avoid confusion
>> between a type and a category with a single element.

Guido> Can you explain?  Neither string (which has Unicode and 8-bit,
Guido> plus a few other objects that are sufficiently string-like to
Guido> be regex-searchable, like arrays) nor file (at least in the
Guido> "lore protocol" interpretation of file-like object) are
Guido> categories with a single element.

Fair enough.  I just didn't have examples at my fingertips and thought
at first that using those examples would confuse matters.  I don't
mind adding them.

Guido> I believe that the notion of an informal or "lore" (as Jim
Guido> Fulton likes to call it) protocol first became apparent when we
Guido> started to use the idea of a "file-like object" as a valid
Guido> value for sys.stdout.

>> OK.  So what I'm asking about is a way of making notions such as
>> "file-like object" more formal and/or automatic.

Guido> Yeah, that's the holy Grail of interfaces in Python.

Cool!  (I care much less about type checking because, as I mentioned
in another message, there are uncheckable things such as being an
order relation that I would like to use for dispatching anyway).

>> Of course, one reason for my interest is my experience with a
>> language that supports compile-time overloading -- what I'm really
>> seeing on the horizon is some kind of notion of overloading in
>> Python, perhaps along the lines of ML's clausal function
>> definitions (which I think are truly elegant).

Guido> Honestly, I hadn't read this far ahead when I brought up ML
Guido> above. :-)


Guido> I really hope that the holy grail can be found at run time
Guido> rather than compile time.  Python's compile time doesn't have
Guido> enough information easily available, and to gather the
Guido> necessary information is very expensive (requiring
Guido> whole-program analysis) and not 100% reliable (due to Python's
Guido> extreme dynamic side).

I have no problem with that.  So here's a simple example of ML's clausal
function definitions:

        fun len([]) = 0
          | len(h::t) = len(t) + 1

Here, [] is an empty list, and h::t is ML's way of spelling cons(h,t).
The two clauses (one per line) are checked in order *at run time* --
we're dispatching on the *value*, not the type of the argument.
If you like, this example is equivalent to the following:

        fun len(x) = if x = [] then 0 else len(tl(x))+1

(well, not really, but only an ML expert will see why, and it's not germane)

In the Python domain, I imagine something like this:

        def f(arg: Category1):
        or  f(arg: Category2):
        or  f(arg: Category3):

I would like the implementation to try each version of f until it
finds one that passes the constraints, and then executes that one.  If
none of them fits the bill, then it should throw an exception.
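
Something close to this can be approximated today without new syntax.
The sketch below is only an illustration: the class Clauses, its when()
method and the example predicates are all invented here, and each
"category" is just a callable that returns true or false for an argument.

```python
class Clauses:
    """Try each registered clause in order; run the first whose test passes."""

    def __init__(self):
        self._clauses = []

    def when(self, test):
        def register(func):
            self._clauses.append((test, func))
            return self  # so the decorated name stays bound to the dispatcher
        return register

    def __call__(self, arg):
        for test, func in self._clauses:
            if test(arg):
                return func(arg)
        raise TypeError("no clause accepts %r" % (arg,))


describe = Clauses()


@describe.when(lambda arg: isinstance(arg, str))
def describe(arg):
    return "text"


@describe.when(lambda arg: hasattr(arg, "__iter__"))
def describe(arg):
    return "iterable"


print(describe("abc"))    # text
print(describe([1, 2]))   # iterable
```

As in the ML example, the clauses are checked in order at run time, so a
string is handled by the first clause even though it would also satisfy
the second.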

Andrew Koenig,,

PS: Please forgive the erratic replies -- apparently our mail gateway
decided to hang onto a bunch of messages for a day or so...

From  Wed Aug 14 22:23:54 2002
From: (Andrew Koenig)
Date: Wed, 14 Aug 2002 17:23:54 -0400 (EDT)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <> (message from Guido
 van Rossum on Wed, 14 Aug 2002 17:12:28 -0400)
References: <> <> <>
 <> <>
Message-ID: <>

Guido> It is already done this way, except the attribute is called
Guido> __contains__ and we only ask the right argument for it: "x in y" calls
Guido> "y.__contains__(x)" [if it exists; otherwise there's a fallback that
Guido> loops over y's items comparing them to x].

Ah, that's why you said that it could be done backwards.

From  Wed Aug 14 22:25:46 2002
From: (Martin v. Loewis)
Date: 14 Aug 2002 23:25:46 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen <> writes:

> I sympathise with the fact that this is difficult (although I still
> don't understand why

Feel free to contribute. Answering all your questions already took
considerable time (to answer the previous one, I did an hour of online
research, just because I had never looked into normalization in that
level of detail - to find out you need a print copy of the Unicode
standard, which I have only at the university library).


From  Wed Aug 14 22:28:25 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 00:28:25 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Wed, Aug 14, 2002 at 09:09:19AM -0400
References: <> <> <> <>
Message-ID: <>

On Wed, Aug 14, 2002 at 09:09:19AM -0400, Guido van Rossum wrote:
> Now I think you've lost me.  How can a category on the one hand be
> Again you've lost me.  I expect there's something here that you assume

Oh dear. Here we go again. I'm afraid that it may take several frustrating 
iterations just to get our terminology and assumptions in sync and be able 
to start talking about the actual issues.

> > Type categories are fundamentally different from interfaces.  An 
> > interface must be declared by the type while a category can be an 
> > observation about an existing type. 
> Yup.  (In Python these have often been called "protocols".  Jim Fulton
> calls them "lore protocols".)

Nope. For me protocols are conventions to follow for performing a certain 
task.  A type category is a formally defined set of types.  

For example, the 'iterable' protocol defines conventions for a programmer
to follow for doing iteration.  The 'iterable' category is a set defined
by the membership predicate "hasattr(t, '__iter__')".  The types in the
'iterable' category presumably conform to the 'iterable' protocol so there 
is a mapping between protocols and type categories but it's not quite 1:1.
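
A minimal sketch of a category as a membership predicate over types,
assuming nothing beyond hasattr().  The Category class and its & operator
are invented for this example; they only show that set operations on
categories reduce to logical operations on the predicates.

```python
class Category:
    """A set of types, defined by a membership predicate (illustrative only)."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

    def __contains__(self, tp):
        return bool(self.predicate(tp))

    def __and__(self, other):
        # Intersection of two categories: both predicates must hold.
        return Category("(%s & %s)" % (self.name, other.name),
                        lambda tp: tp in self and tp in other)


iterable = Category("iterable", lambda tp: hasattr(tp, "__iter__"))
sized = Category("sized", lambda tp: hasattr(tp, "__len__"))
generator_type = type(x for x in ())

print(list in iterable)                      # True
print(int in iterable)                       # False
print(generator_type in (iterable & sized))  # False
```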

Protocols live in documentation and lore. Type categories live in the same 
place where vector spaces and other formal systems live.

> > Two types that are defined independently in different libraries may
> > in fact fit under the same category because they implement the same
> > protocol.  With named interfaces they may in fact be compatible but
> > they will not expose the same explicit interface. Requiring them to
> > import the interface from a common source starts to sound more like
> > Java than Python and would introduce dependencies and interface
> > version issues in a language that is wonderfully free from such
> > arbitrary complexities.
> Hm, I'm not sure if you can solve the version incompatibility problem
> by ignoring it. :-)

Oops, I meant interface version *numbers*, not interface versions. A
version number is a unidimensional entity. Variations on protocols and
subprotocols have many dimensions. I find that set theory ("an object that 
has a method called foo and another method called bar") works better than
arithmetic ("an object with version number 2.13 of interface voom").

> Are you familiar with Zope's Interface package?  It solves this
> problem (nicely, IMO) by allowing you to place an interface
> declaration inside a class but also allowing you to make calls to an
> interface registry that declare interfaces for pre-existing classes.

I don't like the bureaucracy of declaring interfaces and maintaining
registries. I like the ad-hoc nature of Python protocols and I want a
type system that gives me the tools to use it better, not replace it with 
something more formal.

> > A category is defined mathematically by a membership predicate. So
> > what we need for type categories is a system for writing predicates
> > about types.
> Now I think you've lost me.  How can a category on the one hand be
> observed after the fact and on the other hand defined by a rigorous
> mathematical definition?  How could a program tell by looking at a
> class whether it really is an implementation of a given protocol?

A category is defined mathematically. A protocol is a somewhat more fuzzy
meatspace concept.  A protocol can be associated with a category with
reasonable accuracy so the result of a set operation on categories is
reasonably applicable to the associated protocols. 

Even a human can't always tell whether a class is *really* an implementation
of a given protocol. But many protocols can be inferred with pretty good 
accuracy from the presence of methods or members. You can always add a 
member as a flag indicating compliance with a certain protocol if that is
not enough.

My basic assumption is that programmers are fundamentally lazy. It hasn't
ever failed me so far.

This way there is no need to declare all the protocols a class conforms to.
This is important since in many cases the protocol is only "discovered" 
later.  The user of the class knows what protocol is expected and only 
needs to declare that.  It should reduce the tendency to use relatively
coarse-grained "fat" interfaces because there is no need to declare every
minor protocol the type conforms to - it may be observed by users of this
type using a type category.

> > Standard Python expressions should not be used for defining a
> > category membership predicate. A Python expression is not a pure
> > function. This makes it impossible to cache the results of which
> > type belongs to what category for efficiency. Another problem is
> > that many different expressions may be equivalent but if two
> > independently defined categories use equivalent predicates they
> > should *be* the same category.  They should be merged at runtime
> > just like interned strings.
> Again you've lost me.  I expect there's something here that you assume
> well-known.  Can you please clarify this?  What on earth do you mean
> by "A Python expression is not a pure function" ?

A function whose result depends only on its inputs and has no side effects.
In this case I would add "and can be evaluated without triggering any 
Python code". Set operations on membership predicates, caching and other
optimizations need such guarantees.
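
One way to picture those guarantees, as a sketch only: restrict the
predicate to "these attribute names are defined somewhere along the
type's MRO".  That test reads class dictionaries directly instead of
calling getattr(), so two categories with the same name set can be merged
into one object and the answer can be cached per type.  The class name
AttrCategory is invented here, and the cache is deliberately naive: it is
never invalidated if a class is modified later.

```python
class AttrCategory:
    """Category defined by a set of required attribute names (hypothetical)."""

    _interned = {}

    def __new__(cls, *names):
        key = frozenset(names)
        try:
            # Equivalent predicates are merged, much like interned strings.
            return cls._interned[key]
        except KeyError:
            self = super().__new__(cls)
            self.required = key
            self._cache = {}
            cls._interned[key] = self
            return self

    def __contains__(self, tp):
        try:
            return self._cache[tp]
        except KeyError:
            # Look in the class dictionaries only; no attribute hooks are run.
            result = all(any(name in vars(klass) for klass in tp.__mro__)
                         for name in self.required)
            self._cache[tp] = result
            return result


a = AttrCategory("__iter__", "__len__")
b = AttrCategory("__len__", "__iter__")
print(a is b)      # True
print(list in a)   # True
print(int in a)    # False
```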


From  Wed Aug 14 22:51:23 2002
From: (Martijn Faassen)
Date: Wed, 14 Aug 2002 23:51:23 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

Andrew Koenig wrote:
> >> The category names look like general purpose interface names. The
> >> addition of interfaces has been discussed quite a bit. While many
> >> people are interested in having interfaces added to Python, there
> >> are many design issues that will have to be resolved before it
> >> happens.
> Oren> Nope. Type categories are fundamentally different from
> Oren> interfaces.  An interface must be declared by the type while a
> Oren> category can be an observation about an existing type.
> Why?  That is, why can't you imagine making a claim that type
> X meets interface Y, even though the author of neither X nor Y
> made that claim?

That's entirely possible, and as Guido mentioned earlier in the thread,
the Zope 3 interface package allows that. I think that doesn't work
with built-in types yet, but that's an implementation detail,
not a fundamental problem.

(it's in Interface.Implements, the implements() function)
> However, now that you bring it up... One difference I see between
> interfaces and categories is that I can imagine categories carrying
> semantic information to the human reader of the code that is not
> actually expressed in the category itself.  As a simple example,
> I can imagine a PartialOrdering category that I might like as part
> of the specification for an argument to a sort function.

But isn't that exactly what interfaces are? Of course you may not want
to make all interfaces explicit as it is too much programming overhead;
that's in part what's nice about a dynamically typed language. However,
an interface does carry semantic information to the human reader of the
code that is not actually expressed in the category itself. By making
interfaces explicit the human reader can also write code that introspects
interface information.

Or do you mean sometimes it is not useful to make interfaces explicit
at all, as you're never going to introspect on them anyway? I'd say
they may still be useful as documentation, in which case they seem to work
like your 'category'. Or of course you can not specify them at all.



From  Wed Aug 14 22:54:35 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 17:54:35 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Wed, 14 Aug 2002 16:26:00 CDT."
References: <> <15706.49125.814428.988008@localhost.localdomain> <>
Message-ID: <>

>     Guido> Better still, I think we could safely make all interned strings
>     Guido> mortal -- I don't see any use for immortal strings.
> Wasn't this part of the original discussion?  Extension modules are free to
> call PyString_InternInPlace and may well expect immortal strings, so for
> backward compatibility, the functionality probably has to remain for a time,
> yes?

In core Python, there are two common usage patterns for interning.

The most common pattern uses PyString_InternFromString() to intern
some string constant that will be used as a frequent key
(e.g. "__class__") and then stores the resulting object in a static
variable.  Those strings are immortal because the static variable has
a reference that is never released.

The other common pattern uses PyString_InternInPlace() to intern a
string object (usually a function argument) that's being used as a
dictionary key or attribute name, in the hope that the dict lookup
will be faster.  In these cases, the dict will keep the interned
string alive as long as it makes sense, and when it's no longer a key
in the dict, there's no point in having the interned object around.
(It's also fairly pointless since PyObject_SetAttr() already does
this; even that seems questionable and should probably be done by the
setattro handler of individual object types.)

Making such keys mortal might cause some churning, if non-existing
keys are frequently constructed (say, from user data) and then thrown
away -- each time the key is thrown away it is removed from the
interned dict now, and each time it is recreated and used as a key, it
is interned again -- to no avail.  But I think that's pretty rare (a
non-existing key) and it certainly isn't going to cause any breakage.
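
For readers more at home in Python than in the C API, here is the
key-interning pattern seen from the Python side.  It uses sys.intern(),
the Python 3 spelling of what was the builtin intern() at the time; the
key text is made up for the example.

```python
import sys

# Two equal strings built at run time, e.g. from user data.
a = "".join(["user_", "key"])
b = "".join(["user", "_key"])

# Interning maps every equal string to one shared object.
ia = sys.intern(a)
ib = sys.intern(b)

print(a == b)      # True
print(ia is ib)    # True

table = {ia: 1}
print(table[ib])   # 1
```

Whether the shared object is kept forever or dropped once nothing refers
to it is the mortal/immortal question discussed here; the sketch behaves
the same either way.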

I expect that the usage patterns in 3rd party extensions are pretty
much the same.

Tim once posted a theoretical example that depended on interned
strings staying alive while no user object references a particular
string object.  But that was highly theoretical.

> Of course, I'm speaking with my fake expert hat on.  I've never even
> considered interning a string, immortal, immoral, or otherwise.


--Guido van Rossum (home page:

From  Wed Aug 14 22:55:47 2002
From: (Andrew Koenig)
Date: Wed, 14 Aug 2002 17:55:47 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Martijn Faassen
 on Wed, 14 Aug 2002 23:51:23 +0200)
References: <> <> <> <> <>
Message-ID: <>

>> However, now that you bring it up... One difference I see between
>> interfaces and categories is that I can imagine categories carrying
>> semantic information to the human reader of the code that is not
>> actually expressed in the category itself.  As a simple example,
>> I can imagine a PartialOrdering category that I might like as part
>> of the specification for an argument to a sort function.

Martijn> But isn't that exactly what interfaces are?

Not really.  I can see how an interface can claim that a particular
method exists, but not how it can claim that the method implements a
function that is antisymmetric and transitive.
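
To make the distinction concrete, here is a sketch of the most a program
can do about such a claim: test the two laws on a finite sample.  The
function name check_order_laws is invented for this example.  Passing the
check proves nothing about values outside the sample, which is why such
properties stay in the human-readable part of a category.

```python
from itertools import product


def check_order_laws(leq, samples):
    """Check antisymmetry and transitivity of leq, on the given samples only."""
    samples = list(samples)
    for x, y in product(samples, repeat=2):
        # Antisymmetry: x <= y and y <= x imply x == y.
        if leq(x, y) and leq(y, x) and x != y:
            return False
    for x, y, z in product(samples, repeat=3):
        # Transitivity: x <= y and y <= z imply x <= z.
        if leq(x, y) and leq(y, z) and not leq(x, z):
            return False
    return True


sets = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]
print(check_order_laws(lambda x, y: x <= y, sets))                   # True
print(check_order_laws(lambda x, y: abs(x - y) <= 1, [0, 1, 2]))     # False
```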

From  Wed Aug 14 23:11:16 2002
From: (Aahz)
Date: Wed, 14 Aug 2002 18:11:16 -0400
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

On Wed, Aug 14, 2002, Guido van Rossum wrote:
>> [Kalle, on the HP-UX snake farm build]
>>> AssertionError: _mkstemp_inner raised exceptions.OSError: [Errno 24]
>>> Too many open files: '/tmp/aaU3irrA'
>>> Hmm, I wonder how many that is, and how to change it.  I'll look
>>> around.
>> I've raised maxfiles from 200 to 2048, and the test now runs without
>> error.
> Thanks!  Maybe the test was a little too eager though -- perhaps it
> could be happy with creating 100 instead of 1000 files.

Hrm.  Aren't there OSes with a default limit of 63 open files per
process?  I'm pretty sure there are some with 127 (signed char).
Aahz (           <*>

Project Vote Smart:

From  Wed Aug 14 23:12:51 2002
From: (Martijn Faassen)
Date: Thu, 15 Aug 2002 00:12:51 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

Guido van Rossum wrote:
> > Guido> The exception is when you need to do something different based
> > Guido> on the type of an object and you can't add a method for what
> > Guido> you want to do.  But that is relatively rare.
> > 
> > Perhaps the reason it's rare is that it's difficult to do.
> Perhaps...  Is it the chicken or the egg?

Once you've defined interfaces you do end up using them this way,
I've found in my own experience. It can be more clear than the
alternative if you have a set of objects in some collection that
fall into a number of kinds -- 'content' versus 'container'
type things in Zope for instance. It's nice to be able to say
'is this a container' without having to think about implementation 
inheritance hierarchies or trying to call a method that should exist on
a container and not on a content object.

And of course Zope3 uses interfaces in more advanced ways to associate
objects together automatically -- a view for a content object is looked up
automatically by interface, and you can automatically hook adapters that
translate one interface to another together by looking them up in
an interface registry as well.

> BTW the original scarecrow proposal is at 

I recall looking at that for the first time and not understanding too 
much about the reasoning behind it, but by now I have some decent experience
with the descendant of that behind me (the interface package in Zope),
and it's quite nice. Many people seem to react to interfaces by 
associating them with static types and then rejecting the notion, but Python
interface checking is just as run-time as anything else.

By the way, the Twisted people are starting to use interfaces in their 
package; a home grown very simple implementation at first but they are 
trying to stay compatible with the Zope ones and are looking into adopting
the Zope interface package proper. When I first discussed interfaces
with some Twisted developers a year ago or so their thinking seemed
quite negative, but they seem to be changing their minds, at least slowly.
That's a good sign for interfaces, and I imagine it will happen with
more people.

Interfaces in Python are almost too trivial to understand, but surprisingly
useful. I imagine this is why so many smart Python users don't get it;
they either reject the notion because it seems too trivial and 'therefore
useless', or because they think it must involve far more complication
(static typing) and therefore it's too complicated and not in the spirit
of Python. :)



From  Wed Aug 14 23:42:46 2002
From: (Martin v. Loewis)
Date: 15 Aug 2002 00:42:46 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Oren Tirosh <> writes:

> Nope. For me protocols are conventions to follow for performing a certain 
> task.  A type category is a formally defined set of types.  

ODP (Reference Model For Open Distributed Processing, ISO 10746)
defines that a type is a predicate; it implies a set (of which it is
the characteristic function).

By your definition, a type category, as a formally-defined means to
determine whether something belongs to the category, is a predicate,
and thus still a type.

> For example, the 'iterable' protocol defines conventions for a programmer
> to follow for doing iteration.  The 'iterable' category is a set defined
> by the membership predicate "hasattr(t, '__iter__')".  

It is not so clear that this is what defines the iterable category. It
could also be defined as "the programmer can use it for doing
iteration, by means of the iterable protocol".
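
A small sketch of how the two definitions can disagree, assuming a
current Python 3 interpreter. The class name Squares and both helper
functions are made up for illustration. An object that supports only
the older __getitem__ sequence convention can be iterated, yet it
fails the hasattr() predicate:

```python
class Squares:
    """Hypothetical class: iterable only through the __getitem__ fallback."""

    def __getitem__(self, index):
        if index >= 4:
            raise IndexError(index)
        return index * index


def in_iterable_category(obj):
    # The membership predicate quoted above.
    return hasattr(obj, '__iter__')


def follows_iterable_protocol(obj):
    # Ask the interpreter instead: iter() accepts anything it can iterate.
    try:
        iter(obj)
    except TypeError:
        return False
    return True


s = Squares()
values = list(s)   # works, even though s has no __iter__
```

So the predicate is only an approximation of "can be used for
iteration"; whether that approximation is good enough is the point
under discussion.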

> Protocols live in documentation and lore. Type categories live in the same 
> place where vector spaces and other formal systems live.

By that definition, I'd say that Andrew's list enumerates protocols,
not type categories: they all live in lore, not in a formalism.

> A category is defined mathematically. A protocol is a somewhat more fuzzy
> meatspace concept.  

A protocol can certainly be formalized, if there is need. Of all the
possible interaction sequences, you define those that follow the
protocol. Then, an object that follows the protocol in all interaction
sequences in which it participates is said to implement the protocol.

> > Again you've lost me.  I expect there's something here that you assume
> > well-known.  Can you please clarify this?  What on earth do you mean
> > by "A Python expression is not a pure function" ?
> A function whose result depends only on its inputs and has no side effects.
> In this case I would add "and can be evaluated without triggering any 
> Python code".

For being a pure function, requiring that it does not trigger Python
code seems a bit too restrictive.

In any case, I think it is incorrect to say that a Python expression
is not a function. Instead, it is correct to say that it is not
necessarily a function. There are certainly expressions that are


From  Wed Aug 14 23:45:43 2002
From: (Martin v. Loewis)
Date: 15 Aug 2002 00:45:43 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> Martijn> But isn't that exactly what interfaces are?
> Not really.  I can see how an interface can claim that a particular
> method exists, but not how it can claim that the method implements a
> function that is antisymmetric and transitive.

An interface can certainly claim such things, in its documentation -
and indeed, the documentation of interfaces typically associates
certain semantics with the objects implementing the interface (and in
some cases, even semantics for objects using the interface).

Of course, there is typically no way to automatically *validate* such
claims; you can only validate conformance to signatures. It turns out
that, in Python, you cannot even do that.


From  Thu Aug 15 00:06:52 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 11:06:52 +1200 (NZST)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>

Oren Tirosh <>:

> In Python up to 2.2 it's inconsistent between ints and longs:
> >>> hex(-16711681)
> '0xff00ffff'
> >>> hex(-16711681L)
> '-0xff0001L'		# ??!?!?

The more I think about it, the more I like the suggestion
that was made of representing this as

  1x00ffff

which both makes the bit pattern apparent and unambiguously
indicates the sign, all without any assumptions about length.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 00:51:43 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 11:51:43 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

Andrew Koenig <>:

> In the Python domain, I imagine something like this:
>         def f(arg: Category1):
>                 ....
>         or  f(arg: Category2):
>                 ....
>         or  f(arg: Category3):
> I would like the implementation to try each version of f until it
> finds one that passes the constraints

Would all the versions of f have to be written together
like that? I think when most people talk of multiple
dispatch they have something more flexible in mind.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 01:01:19 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 12:01:19 +1200 (NZST)
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
Message-ID: <>

> > Greg>   re in string
> > 
> > Greg> should be made to work too, where re is a regular
> > Greg> expression object?
> > 
> > Then the core language would have to know about regular
> > expressions, right?
> Um, yes.  That kills the idea (unless you want to write this as
> "string in re", which almost makes sense :-).

Maybe there should be an __in__ method that gets called
on the left operand if the __contains__ of the right
operand doesn't know what to do?
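
A rough model of that fallback, written as an ordinary function since
no such hook exists in the language. The names contains(), Pattern and
__in__ are all hypothetical, and a TypeError from the right operand is
taken here to mean "doesn't know what to do":

```python
import re


def contains(left, right):
    """Hypothetical model of `left in right` with an __in__ fallback."""
    try:
        return left in right
    except TypeError:
        # The right operand's __contains__ could not handle `left`;
        # give the left operand a chance.
        hook = getattr(type(left), '__in__', None)
        if hook is None:
            raise
        return hook(left, right)


class Pattern:
    """Hypothetical wrapper around a compiled regular expression."""

    def __init__(self, source):
        self._re = re.compile(source)

    def __in__(self, text):
        return self._re.search(text) is not None


found = contains(Pattern("a+b"), "xaab")
```

With something like this the string type would not need to know about
regular expressions; only the pattern object would.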

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 01:50:02 2002
From: (Andrew Koenig)
Date: 14 Aug 2002 20:50:02 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Greg> Would all the versions of f have to be written together like
Greg> that?

I'm not sure.  In ML, they do, but in ML, the tests are on
values, not types (ML has neither inheritance nor overloading).
Obviously, it would be nice not to have to write the versions
of f together, but I haven't thought about how such a feature
would be defined or implemented.

Greg> I think when most people talk of multiple
Greg> dispatch they have something more flexible in mind.

Probably true.

Andrew Koenig,,

From  Thu Aug 15 02:07:37 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 21:07:37 -0400
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: Your message of "Thu, 15 Aug 2002 11:06:52 +1200."
References: <>
Message-ID: <>

> > In Python up to 2.2 it's inconsistent between ints and longs:
> > >>> hex(-16711681)
> > '0xff00ffff'
> > >>> hex(-16711681L)
> > '-0xff0001L'		# ??!?!?
> The more I think about it, the more I like the suggestion
> that was made of representing this as 
>   1x00ffff
> which both makes the bit pattern apparent and unambiguously
> indicates the sign, all without any assumptions about length.

That won't help with %o, %u or %x.

I don't expect there will be much of a need to write negative hex
constants in practice: people only end up creating negative numbers
using hex constants because they want to represent 32-bit bit patterns
in a signed 32-bit int.  In Python 2.4, the recommended way will be to
write 0xffffffff and not worry about the fact that it's a positive
long; extensions that take bit masks will be fixed by then to deal
with this just fine (probably through the 'k' format code in

The issue of printing negative hex constants is more a theoretical
issue: hex(-1) has to return *something*, and 0xffffffff simply isn't
acceptable.  I'd like it to return something that evaluates back to -1
when used in a Python expression, so "-0x1" and "~0x0" are still the
best candidates.
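
A quick check of the round-trip property, assuming a current Python 3
interpreter (where hex(-1) does return '-0x1'). The variable names are
only for illustration:

```python
candidates = ["-0x1", "~0x0"]

# Both spellings evaluate back to -1 as Python expressions.
values = [eval(text) for text in candidates]

# Whatever hex() returns for -1 should round-trip the same way.
round_trip = eval(hex(-1))

# Masking recovers the 32-bit pattern when that is what is wanted.
pattern = hex(-1 & 0xffffffff)
```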

--Guido van Rossum (home page:

From  Thu Aug 15 02:15:32 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 21:15:32 -0400
Subject: [snake-farm] RE: [Python-Dev] strange warnings from tempfile.mkstemped.__del__ on HP
In-Reply-To: Your message of "Wed, 14 Aug 2002 18:11:16 EDT."
References: <> <> <> <> <>
Message-ID: <>

> Hrm.  Aren't there OSes with a default limit of 63 open files per
> process?  I'm pretty sure there are some with 127 (signed char).

We'll deal with those as we encounter them.  All the mainstream
platforms go much beyond that (just as we have left the 640 KB limit
behind us :-).

--Guido van Rossum (home page:

From  Thu Aug 15 02:34:03 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 21:34:03 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Thu, 15 Aug 2002 00:12:51 +0200."
References: <> <> <> <>
Message-ID: <>

> Interfaces in Python are almost too trivial to understand, but
> surprisingly useful. I imagine this is why so many smart Python
> users don't get it; they either reject the notion because it seems
> too trivial and 'therefore useless', or because they think it must
> involve far more complication (static typing) and therefore it's too
> complicated and not in the spirit of Python. :)

No, I think it's because they only work well if they are used
pervasively (not necessarily everywhere).  That's why they work in
Zope: not only does almost everything in Zope have an interface, but
interfaces are used to implement many Zope features.

I haven't made up my mind yet whether Python could benefit as much as
Zope, but I am cautiously looking into adding something derived from
Zope's interface package.  Jim & I have rather different ideas on what
the ideal interfaces API should look like though, so it'll be a while.
Maybe I should pull down the Twisted interfaces package and see how I
like their subset (I'm sure it must be a subset -- the Zope package is
a true kitchen sink :-).

--Guido van Rossum (home page:

From  Thu Aug 15 02:35:37 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 21:35:37 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Wed, 14 Aug 2002 17:55:47 EDT."
References: <> <> <> <> <>
Message-ID: <>

> Not really.  I can see how an interface can claim that a particular
> method exists, but not how it can claim that the method implements a
> function that is antisymmetric and transitive.

That's done in the docs, usually.  Zope even has the notion of a
"marker" interface -- an interface that says "this object has property
such-and-such" but which does not assert any methods or attributes.

--Guido van Rossum (home page:

From  Thu Aug 15 02:38:18 2002
From: (Andrew Koenig)
Date: Wed, 14 Aug 2002 21:38:18 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <>
 (message from Guido van Rossum on Wed, 14 Aug 2002 21:35:37 -0400)
References: <> <> <> <> <>
 <> <>
Message-ID: <>

>> Not really.  I can see how an interface can claim that a particular
>> method exists, but not how it can claim that the method implements a
>> function that is antisymmetric and transitive.

Guido> That's done in the docs, usually.  Zope even has the notion of a
Guido> "marker" interface -- an interface that says "this object has property
Guido> such-and-such" but which does not assert any methods or attributes.

So perhaps what I mean by a category is the set of all types that
implement a particular marker interface.

From  Thu Aug 15 02:50:02 2002
From: (Tim Peters)
Date: Wed, 14 Aug 2002 21:50:02 -0400
Subject: [Python-Dev] FW: multimethod-0.1
Message-ID: <>

I haven't studied this, but from a quick glance it looks competent.

-----Original Message-----
    On Behalf Of Aric Coady <>
Sent: Wednesday, August 14, 2002 8:59 PM
Subject: ANN: multimethod-0.1

Multimethod-0.1 is another python module for implementing multimethods 
(a.k.a.  generic functions, multiple-argument method dispatch).  This 
one features:

- support for Python2.2 type/class unification
- a precedence graph for more efficient dispatching
- a best-fit resolution algorithm, in which the method closest in 
inheritance distance is called
- a versatile 'call-next-method' or 'super' function.

Available at and the Vaults of Parnassus.



From  Thu Aug 15 03:24:07 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 22:24:07 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Wed, 14 Aug 2002 20:50:02 EDT."
References: <>
Message-ID: <>

> Greg> Would all the versions of f have to be written together like
> Greg> that?
> I'm not sure.  In ML, they do, but in ML, the tests are on
> values, not types (ML has neither inheritance nor overloading).
> Obviously, it would be nice not to have to write the versions
> of f together, but I haven't thought about how such a feature
> would be defined or implemented.
> Greg> I think when most people talk of multiple
> Greg> dispatch they have something more flexible in mind.
> Probably true.

I can see how it could be done using some additional syntax similar to
what ML uses, e.g.:

  def f(a: Cat1):
      ...code for Cat1...
  else f(a: Cat2):
      ...code for Cat2...
  else f(a: Cat3):
      ...code for Cat3...

Don't take this syntax too seriously!  I just mean that there is a
single statement that provides the different alternative versions.

Another approach would be more in the spirit of properties in 2.2:

  def f1(a: Cat1):
    ...code for Cat1...

  def f2(a: Cat2):
    ...code for Cat2...

  def f3(a: Cat3):
    ...code for Cat3...

  f = multimethod(f1, f2, f3)

(There could be a way to spell this without having the type
declaration syntax in the argument list, and do it in the
multimethod() call instead, e.g. with keyword arguments or passing a
list of tuples: [(f1, Cat1), (f2, Cat2), ...].  I suppose this could
be extended to more arguments as well.)

It might also be possible to modify a multimethod dynamically,
e.g. later one could write:

  def f4(a: Cat4):
    ...code for Cat4...

  f.add(f4)

This is more in the spirit of Python than your original proposal,
which appeared like the compiler would have to gather all the
definitions from different places and fuse them.  That would be too
complex for Python's simple-minded compiler!
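
A minimal sketch of the list-of-tuples spelling, without any new
syntax. This is not an existing library: the multimethod class and its
add() method are hypothetical, plain classes stand in for categories,
and isinstance() stands in for the category membership test:

```python
class multimethod:
    """Hypothetical dispatcher: tries each (function, category) pair in order."""

    def __init__(self, *pairs):
        self._pairs = list(pairs)

    def add(self, func, category):
        # Register another version after the object has been created.
        self._pairs.append((func, category))

    def __call__(self, arg):
        for func, category in self._pairs:
            if isinstance(arg, category):
                return func(arg)
        raise TypeError("no version accepts %r" % type(arg).__name__)


def f1(a):
    return "int: %d" % a


def f2(a):
    return "str: %s" % a


f = multimethod((f1, int), (f2, str))


def f4(a):
    return "list of %d" % len(a)


f.add(f4, list)
```

The first matching pair wins here, which is the simplest rule; a real
design would have to say what happens when several versions match.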

--Guido van Rossum (home page:

From  Thu Aug 15 03:39:42 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 14:39:42 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

> I don't like the bureaucracy of declaring interfaces and maintaining 
> registries. I like the ad-hoc nature of Python protocols and I want a 
> type system that gives me the tools to use it better, not replace it with 
> something more formal.

But you seem to want *something* that's more formal.
How formal exactly do you have in mind? How does it
differ from what Zope does? (Which I know nothing
about, by the way...)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 03:42:16 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 14:42:16 +1200 (NZST)
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

Jack Jansen <>:

> Python jumps through hoops to make 'jack' and u'jack' compare
> identical and be interchangeable in dict keys and what have you,
> and now suddenly I find out that there's two ways to say u'jäck'
> and they won't compare equal. Not good.

To me, this says that Python should pick one of the
canonical forms and make sure all its Unicode strings
are normalised to it. (Or at least make it appear
as if they are.)
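
To make the problem concrete, here is a short illustration using the
standard unicodedata module, assuming a current Python 3 interpreter.
NFC is picked only as an example of a canonical form:

```python
import unicodedata

composed = "j\u00e4ck"      # 'ä' as a single code point
decomposed = "ja\u0308ck"   # 'a' followed by a combining diaeresis

# The two spellings look the same but do not compare equal as-is.
same_raw = composed == decomposed

# After normalising both to one canonical form they do.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
```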

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 03:58:09 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 14:58:09 +1200 (NZST)
Subject: [Python-Dev] Deprecation warning on integer shifts and such
In-Reply-To: <>
Message-ID: <>


> In Python 2.4, the recommended way will be to write 0xffffffff and not
> worry about the fact that it's a positive long;

Yes, it won't be so much of an issue then. But you can still get a
negative long from a positive one when bit twiddling by complementing,
meaning that you have to remember to mask the result before displaying
it as hex, or end up with a hex representation that displays the bit
pattern in a way that's hard to interpret.

That's the usage I had in mind when I mentioned the 1x notation --
for display, not for input.

But thinking about it now, it would be better to provide a new
function for hexifying that you could tell how many bits you're
interested in, and it would show you that many, unsigned. Maybe
also a new format operator for this as well.
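
Something along these lines, perhaps. The name hexbits() is made up,
and zero-padding to the full width is just one possible choice:

```python
def hexbits(value, bits):
    """Hypothetical helper: show the low `bits` bits of value, unsigned."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    digits = (bits + 3) // 4            # hex digits needed for that many bits
    masked = value & ((1 << bits) - 1)  # discard the sign extension
    return "0x%0*x" % (digits, masked)


# Complementing a positive value gives a negative one; masking makes
# the bit pattern readable again.
shown = hexbits(~0x00ff0000, 32)
```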

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 04:02:27 2002
From: (Guido van Rossum)
Date: Wed, 14 Aug 2002 23:02:27 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Wed, 14 Aug 2002 16:04:49 CDT."
References: <15706.50673.81267.900261@localhost.localdomain>
Message-ID: <>

> * Probably makes no difference, but it seems oddly asymmetric to fiddle with
>   the interned string's refcount in string_dealloc, call PyObject_DelItem,
>   then not restore the refcount to zero.

That's unnecessary because the next line executed simply frees the
object.  free() doesn't check the refcount.  It's a bit optimistic in
that it doesn't check the DelItem for an error; but that's a separate
issue, and I don't know what it should do when it gets an error at
that point.  Probably call Py_FatalError(); if it wanted to recover,
it would have to call PyErr_Fetch() / PyErr_Restore() around the
DelItem() call, because we're in a dealloc handler here and that
shouldn't change the exception state.

> * Should be Py_DECREF(keys) (not Py_XDECREF(keys)) in
>   _Py_ReleaseInternedStrings.  If you've gotten that far keys can't be
>   NULL.  If you're worried about keys being NULL, you should check it before
>   the for loop (PyMapping_Size() will barf on a NULL arg).

You're right.  Also, I think it should use PyDict_Keys() and
PyDict_Size() -- it knows that interned is a dict so all the hoopla
that PyMapping_Keys() adds is unnecessary.  Maybe the best thing to do
is to remove _Py_ReleaseInternedStrings() and let Barry worry about
how to implement it the next time he wants to use Insure++.

> Also, regarding the name of PyString_InternInPlace, I see now that's the
> original name.  I suggest that name be deprecated in favor of
> PyString_InternImmortal with a macro defined in stringobject.h for
> compatibility.

Yeah, if we keep the immortality feature at all.

BTW, it looks like Oren was careless with error checking in a few
places.  The whole patch needs to be checked carefully.

--Guido van Rossum (home page:

From  Thu Aug 15 04:13:53 2002
From: (Greg Ewing)
Date: Thu, 15 Aug 2002 15:13:53 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>


> I can see how it could be done using some additional syntax similar to
> what ML uses, e.g.:
>   def f(a: Cat1):
>       ...code for Cat1...
>   else f(a: Cat2):
>       ...code for Cat2...
>   else f(a: Cat3):
>       ...code for Cat3...

As long as all the implementations have to be in one place,
this is equivalent to

  def f(a):
    if belongstocategory(a, Cat1):
      ...
    elif belongstocategory(a, Cat2):
      ...
    elif belongstocategory(a, Cat3):
      ...

so you're not gaining much from the new syntax.

> It might also be possible to modify a multimethod dynamically,
> e.g. later one could write:
>   def f4(a: Cat4):
>     ...code for Cat4...
>   f.add(f4)

This sort of scheme makes me uneasy, because it means that any module
can change the behaviour of any call of f() in any other
module. Currently, if you know the types involved in a method call,
you can fairly easily track down in the source which piece of code
will be called. With this sort of generic function, that will no
longer be possible. It's kind of like an "import *" in reverse -- you
won't know what's coming from where, and you can get things that
you never even asked for.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 15 05:10:25 2002
From: (Aahz)
Date: Thu, 15 Aug 2002 00:10:25 -0400
Subject: [Python-Dev] Mac forever! (was Re: The memo of pickle)
Message-ID: <>

On Sun, Aug 11, 2002, Tim Peters wrote:
> Cool!  I was just wondering the other day whether there are any Mac users
> left apart from Jack and Guido's brother.  It's a landslide <wink>.

I'm an OS X user, does that count?  <0.8 wink>

Incidentally, O'Reilly is looking for presentations for the OS X
conference in San Jose end of September.  I'll post the e-mail address
of the program chair if more than one person wants it -- or you can do
as I did and look at the web site.
Aahz (           <*>

Project Vote Smart:

From David Abrahams" <  Thu Aug 15 05:54:42 2002
From: David Abrahams" < (David Abrahams)
Date: Thu, 15 Aug 2002 00:54:42 -0400
Subject: [Python-Dev] FW: multimethod-0.1
References: <>
Message-ID: <115c01c24418$76902fe0$>

From: "Tim Peters" <>
> I haven't studied this, but from a quick glance it looks competent.
> Multimethod-0.1 is another python module for implementing multimethods
> (a.k.a.  generic functions, multiple-argument method dispatch).  This
> one features:
> - support for Python2.2 type/class unification
> - a precedence graph for more efficient dispatching
> - a best-fit resolution algorithm, in which the method closest in
> inheritance distance is called
> - a versatile 'call-next-method' or 'super' function.
> Available at and the Vaults of Parnassus.
> -Coady

It's a good start, but from the docs it doesn't appear to deal with:

a. Type categories -- it seems as though the only way for a multimethod
implementation to match an actual argument is if the formal argument has an
inheritance relationship with it.

b. Implicit conversions -- If I declare a function that accepts a Python
int, can I pass a Python float?

I think both of the above are important for any Python multimethod

           David Abrahams * Boost Consulting *


From David Abrahams" <  Thu Aug 15 05:56:23 2002
From: David Abrahams" < (David Abrahams)
Date: Thu, 15 Aug 2002 00:56:23 -0400
Subject: [Python-Dev] type categories
References: <>              <>  <>
Message-ID: <115d01c24418$772766d0$>

From: "Guido van Rossum" <>
> be extended to more arguments as well.)
> It might also be possible to modify a multimethod dynamically,
> e.g. later one could write:
>   def f4(a: Cat4):
>     ...code for Cat4...
>   f.add(f4)
> This is more in the spirit of Python than your original proposal,
> which appeared like the compiler would have to gather all the
> definitions from different places and fuse them.  That would be too
> complex for Python's simple-minded compiler!

This is most like what I had in mind.

           David Abrahams * Boost Consulting *

From  Thu Aug 15 07:30:31 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 09:30:31 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Wed, Aug 14, 2002 at 09:38:18PM -0400
References: <> <> <> <> <> <> <> <>
Message-ID: <>

On Wed, Aug 14, 2002 at 09:38:18PM -0400, Andrew Koenig wrote:
> >> Not really.  I can see how an interface can claim that a particular
> >> method exists, but not how it can claim that the method implements a
> >> function that is antisymmetric and transitive.
> Guido> That's done in the docs, usually.  Zope even has the notion of a
> Guido> "marker" interface -- an interface that says "this object has property
> Guido> such-and-such" but which does not assert any methods or attributes.
> So perhaps what I mean by a category is the set of all types that
> implement a particular marker interface.

I propose that any method or attribute may serve as a marker. This makes it
possible to use an existing practice as a marker so protocols can be
defined retroactively for an existing code base. It's also possible, of 
course, to add an attribute called 'has_property_such_and_such' to serve 
as an explicit marker.

A type category is defined by a predicate that tests for the presence of
one or more markers.  Predicates can test not only for the presence of
markers but also for the type category of the marker object and for call 
signatures. When optional type checking is implemented they should also be
able to test for the categories of arguments and return values.

A new category may be defined as a union or intersection of two existing
categories. This is done by ANDing or ORing the membership predicates of
the two categories and reducing them back to canonical form. Canonizing
a predicate is done by conversion into Disjunctive Normal Form, elimination 
of redundant terms and products, sorting and a few other steps.

A global dictionary of canonical predicates is kept (similar to interning
of strings) so any equivalent categories are merged. Each type object
can store a cache of categories in which it is a member so evaluation of
a membership predicate only needs to be done once for each type.
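
A rough sketch of the canonical-form idea, under simplifying
assumptions: markers are plain attribute names, a predicate is an OR
of terms where each term is an AND of markers, and the only reduction
shown is dropping redundant terms. The names Category, _canonical and
_interned are hypothetical, and the per-type cache is left out:

```python
_interned = {}   # canonical predicate -> the one Category object for it


def _canonical(terms):
    # Drop any term that is a strict superset of another term,
    # since (A) or (A and B) reduces to (A).
    terms = set(terms)
    return frozenset(t for t in terms if not any(o < t for o in terms))


class Category:
    """Hypothetical category: an OR of terms, each term an AND of markers."""

    def __new__(cls, terms):
        key = _canonical(frozenset(t) for t in terms)
        if key not in _interned:
            self = super().__new__(cls)
            self.terms = key
            _interned[key] = self
        return _interned[key]

    def __or__(self, other):
        # Union of categories: OR the predicates together.
        return Category(self.terms | other.terms)

    def __and__(self, other):
        # Intersection: distribute AND over the OR terms.
        return Category(a | b for a in self.terms for b in other.terms)

    def __contains__(self, tp):
        return any(all(hasattr(tp, name) for name in term)
                   for term in self.terms)


iterable = Category([{'__iter__'}])
sized = Category([{'__len__'}])
both = iterable & sized
```

Because equivalent predicates reduce to the same key, `iterable & sized`
and `sized & iterable` are the very same object, much like interned
strings.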

This may sound complicated but here's how it might work in practice:

Extracting a category from an existing class:
foobarlike = like(FooBar)

The members of the foobarlike category are any classes that implement the
same methods and attributes as FooBar, whether or not they are actually
descended from it. They may be defined independently in another library.
FooBar may be an abstract class used just as a template for a category.
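
A toy version of like(), assuming membership means only "has the same
public names". Call signatures and the categories of the attributes
are not checked, and the example classes are invented for
illustration:

```python
def like(cls):
    """Hypothetical like(): predicate built from a class's public names."""
    required = frozenset(n for n in dir(cls) if not n.startswith('_'))

    def is_member(candidate):
        return all(hasattr(candidate, name) for name in required)

    is_member.required = required
    return is_member


class FooBar:
    def foo(self): ...
    def bar(self): ...


class Unrelated:            # not descended from FooBar
    def foo(self): return 1
    def bar(self): return 2
    def extra(self): return 3


class OnlyFoo:
    def foo(self): return 1


foobarlike = like(FooBar)
```

Here Unrelated is a member of foobarlike even though it shares no
ancestry with FooBar, while OnlyFoo is not because it lacks bar().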

Asserting that a class must be a member of a category:

class SomeClass:
   __category__ = like(AnotherClass)

At the end of the class definition it will be checked whether it really is
a member of that category (like(SomeClass) issubsetof like(AnotherClass))
This attribute is inherited by subclasses.  Any subclass of this class
will be checked whether it is still a member of the category.  A subclass
may also override this attribute:

class InheritImplementationButNotTheCategoryCheckFrom(SomeClass):
   __category__ = some_other_category

class AddAdditionalRestrictionsTo(SomeClass):
   __category__ = __category__ & like(YetAnotherClass)

If there is a conflict between the two categories the new category will
reduce to the empty set and an error will be generated. The error can be
quite informative by extracting a category from the new class, subtracting
it from the defined category and printing the difference.

When a backward compatible change is made to a protocol (e.g. adding a new
method) any modules that use the old category should still work because
the new category is a subcategory of the old one. When a non backward
compatible change is made (e.g. removing a method, changing its call
signature) existing code may still run without complaining depending on
the category it uses to do the checking. If it's a wider category that
doesn't check for the method it should be ok.

A non backward compatible change must change the exposed interface. This
may be ensured by adding an attribute or method that serves as an explicit
marker and includes a version number or is renamed in some other way when
making incompatible changes. Category union may be used to check for two
incompatible versions that are known to implement a common subset even if
it has never been given a name, etc.


From  Thu Aug 15 08:26:46 2002
From: (Martin Sjögren)
Date: 15 Aug 2002 09:26:46 +0200
Subject: [Python-Dev] string.find() again (was Re: timsort for jython)
In-Reply-To: <>
References: <>
Message-ID: <1029396406.30031.3.camel@ratthing-b3cf>

ons 2002-08-14 klockan 23.12 skrev Guido van Rossum:
> Unfortunately that would be a significant change in internal shit.

Just curious, is "internal shit" a technical term in Python? ;-)

*ducks and starts running*


Martin Sjögren              ICQ : 41245059
  Phone: +46 (0)31 7710870       Cell: +46 (0)739 169191
  GPG key:

From  Thu Aug 15 09:08:01 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 04:08:01 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Wed, Aug 14, 2002 at 10:08:59AM -0400, Andrew Koenig wrote:
> >> The category names look like general purpose interface names. The
> >> addition of interfaces has been discussed quite a bit. While many
> >> people are interested in having interfaces added to Python, there
> >> are many design issues that will have to be resolved before it
> >> happens.
> Oren> Nope. Type categories are fundamentally different from
> Oren> interfaces.  An interface must be declared by the type while a
> Oren> category can be an observation about an existing type.
> Why?  That is, why can't you imagine making a claim that type
> X meets interface Y, even though the author of neither X nor Y
> made that claim?

It's not a failure of imagination, it's a failure of terminology. In
contexts where the term 'interface' is used (Java, COM, etc) it usually 
means something you explicitly expose from your objects. I find that the 
term 'category' implies something you observe after the fact without 
modifying the object - "these objects both have property so-and-so, let's 
group them together and call it a category".

> However, now that you bring it up... One difference I see between
> interfaces and categories is that I can imagine categories carrying
> semantic information to the human reader of the code that is not
> actually expressed in the category itself.  As a simple example,
> I can imagine a PartialOrdering category that I might like as part
> of the specification for an argument to a sort function.

You can define any category you like and attach a semantic meaning to it
as long as you can write a membership predicate for the category. It may
be based on a marker that the type must have or, in case you can't change
the type (e.g. a builtin type) you can write a membership predicate that
also tests for some set of specific types. 

> Oren> A category is defined mathematically by a membership
> Oren> predicate. So what we need for type categories is a system for
> Oren> writing predicates about types.
> Indeed, that's what I was thinking about initially.  Guido pointed out
> that the notion could be expanded to making concrete assertions about
> the interface to a class.  I had originally considered that those
> assertions could be just that--assertions, but then when Guido started
> talking about interfaces, I realized that my original thought of
> expressing satisfaction of a predicate by inheriting it could be
> extended by simply adding methods to those predicates.  Of course,
> this technique has the disadvantage that it's not easy to add base
> classes to a class after it has been defined.

That's why the intelligence should be in the membership predicate, not in 
the classes it selects. Nothing needs to be changed about types. 
Conceptually, categories apply to *references*, not to *objects*. They help 
you ensure that during execution certain references may only point to 
objects from a limited category of types so that the operations you perform 
on them are meaningful (though not necessarily correct). A situation that 
may lead to a reference pointing to an object outside the valid category 
should be detected as early as possible. Detecting this during compilation 
is great. On module import is good. At runtime it's ok.

Can you think of a better name than 'categories' to describe a set of 
types selected by a membership predicate? 


From  Thu Aug 15 09:40:13 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 04:40:13 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <> <>
Message-ID: <>

On Thu, Aug 15, 2002 at 12:42:46AM +0200, Martin v. Loewis wrote:
> Oren Tirosh <> writes:
> > Nope. For me protocols are conventions to follow for performing a certain 
> > task.  A type category is a formally defined set of types.  
> ODP (Reference Model For Open Distributed Processing, ISO 10746)
> defines that a type is a predicate; it implies a set (of which it is
> the characteristic function).

A type is a predicate about an object. A category is a predicate about a

Objects have a type. References have a category. 

Well, Python references currently all have the 'any' category because 
Python has no type checking. Any Python reference may point to an object 
of any type. 

In a dynamically typed language there is no such thing as an 'integer
variable' but it can be simulated by a reference that may only point to
objects in the 'integer' category.

> It is not so clear that this is what defines the iterable category. It
> could also be defined as "the programmer can use it for doing
> iteration, by means of the iterable protocol".

Your definition is not formal and cannot be evaluated by a program.  

The iterable category matches the set of types implementing the iterable
protocol with reasonable accuracy. It doesn't have to be perfect to be 

> > Protocols live in documentation and lore. Type categories live in the same 
> > place where vector spaces and other formal systems live.
> By that definition, I'd say that Andrew's list enumerates protocols,
> not type categories: they all live in lore, not in a formalism.

> For being a pure function, requiring that it does not trigger Python
> code seems a bit too restrictive.

That's not a formal requirement; it's for robustness and efficiency. 


From  Thu Aug 15 10:42:29 2002
From: (Michael Hudson)
Date: 15 Aug 2002 10:42:29 +0100
Subject: [Python-Dev] type categories
In-Reply-To: Greg Ewing's message of "Thu, 15 Aug 2002 15:13:53 +1200 (NZST)"
References: <>
Message-ID: <>

Greg Ewing <> writes:

> As long as all the implementations have to be in one place,
> this is equivalent to
>   def f(a):
>     if belongstocategory(a, Cat1):
>       ...
>     elif belongstocategory(a, Cat2):
>       ...
>     elif belongstocategory(a, Cat3):
>       ...
> so you're not gaining much from the new syntax.

Good point.

> > It might also be possible to modify a multimethod dynamically,
> > e.g. later one could write:
> > 
> >   def f4(a: Cat4):
> >     ...code for Cat4...
> > 
> >   f.add(f4)
> This sort of scheme makes me uneasy, because it means that any module
> can change the behaviour of any call of f() in any other
> module.

True, but I don't think this is a problem in practice with CLOS, is it?

I mean, you can currently do

import mod

mod.func = my_func # evil cackle!

but you don't.

> Currently, if you know the types involved in a method call,
> you can fairly easily track down in the source which piece of code
> will be called. With this sort of generic function, that will no
> longer be possible. 

I would sincerely hope that any core implementation of such an idea
would be introspective enough to allow finding method implementations.
Obviously this would only work at run time, but it would be a help
(imagine running under pdb).
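
A minimal sketch of what such introspection could look like. This is not the API of any existing multimethod package: the names MultiMethod, add and find are hypothetical, and dispatch is on the type of the first argument only.

```python
class MultiMethod:
    """Hypothetical sketch: dispatch on the type of the first argument."""

    def __init__(self, name):
        self.name = name
        self.registry = []  # (type, implementation) pairs, in registration order

    def add(self, cls, func):
        """Register func as the implementation for instances of cls."""
        self.registry.append((cls, func))

    def find(self, arg):
        """Return the implementation that would handle arg, without calling it."""
        for cls, func in self.registry:
            if isinstance(arg, cls):
                return func
        raise TypeError(f"{self.name}: no implementation for {type(arg).__name__}")

    def __call__(self, arg):
        return self.find(arg)(arg)


f = MultiMethod("f")


def f_int(a):
    return "int"


def f_str(a):
    return "str"


f.add(int, f_int)
f.add(str, f_str)
```

From a debugger, f.find(value) says which function a call would reach. In this sketch the first registered match wins, so the result depends on registration order.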


  Well, yes.  I don't think I'd put something like "penchant for anal
  play" and "able to wield a buttplug" in a CV unless it was relevant
  to the gig being applied for...
                                 -- Matt McLeod, alt.sysadmin.recovery

From  Wed Aug 14 20:52:00 2002
From: (Jack Jansen)
Date: Wed, 14 Aug 2002 21:52:00 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
Message-ID: <>

On Wednesday, August 14, 2002, at 02:13, Guido van Rossum wrote:
> Note that normalization doesn't belong in the codecs (except perhaps
> as a separate Unicode->Unicode codec, since codecs seem to be useful
> for all string->string transformations).  It's a separate step that
> the application has to request; only the app knows whether a
> particular Unicode string is already normalized or not, and whether
> the expense is useful for the app, or not.

I don't like this, I don't like it at all.

Python jumps through hoops to make 'jack' and u'jack' compare
identical and be interchangeable in dict keys and what have you,
and now suddenly I find out that there's two ways to say u'jäck'
and they won't compare equal. Not good.

I sympathise with the fact that this is difficult (although I
still don't understand why: whereas when you want to create the
decomposed version I can imagine there's N! ways to notate a
character with N combining chars, I would think there's one and
only one way to write a combined character), but that shouldn't
stop us at least planning to fix this.

And I don't think the burden should fall on the application.
That same reasoning could have been followed for making ascii
and unicode-ascii-subset compare equal: the application will
know it has to convert ascii to unicode before comparing.
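
The comparison problem can be reproduced in a few lines. This is a sketch under stated assumptions: it uses Python 3 strings and the standard unicodedata module, neither of which is part of the proposal being discussed, and it picks NFC as the normalization form for illustration.

```python
import unicodedata

composed = "j\u00e4ck"      # 'ä' as the single code point U+00E4
decomposed = "ja\u0308ck"   # 'a' followed by COMBINING DIAERESIS (U+0308)

# Plain comparison looks at code points, so the two spellings differ.
same_without_normalizing = composed == decomposed

# After normalizing both sides to the same form they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))
```

Here the normalization step is requested explicitly by the caller, which is the arrangement the quoted text argues for and this message objects to.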
- Jack Jansen        <>        -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -

From  Thu Aug 15 13:13:52 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 08:13:52 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Thu, 15 Aug 2002 09:30:31 +0300."
References: <> <> <> <> <> <> <> <>
Message-ID: <>

> Extracting a category from an existing class:
> foobarlike = like(FooBar)
> The members of the foobarlike category are any classes that
> implement the same methods and attributes as FooBar, whether or not
> they are actually descended from it.  They may be defined
> independently in another library.

This seems fairly useless in practice -- you almost never want to use
*all* methods and attributes of a class as the characteristic.  (Even
if you skip names starting with _.)

> FooBar may be an abstract class used just as a template for a category.

Then the like() syntax seems unnecessary.  It then becomes similar to
Zope's Interfaces.

> Asserting that a class must be a member of a category:
> class SomeClass:
>    __category__ = like(AnotherClass)
>    ...

In Zope:

  class SomeClass:
    __implements__ = AnotherClass

By convention, AnotherClass usually has a name that indicates it
is an interface: IAnotherClass.

> At the end of the class definition it will be checked whether it
> really is a member of that category (like(SomeClass) issubsetof
> like(AnotherClass)) This attribute is inherited by subclasses.  Any
> subclass of this class will be checked whether it is still a member
> of the category.

I've been mulling over another way to spell this; perhaps you can
add categories to the inheritance list:

  class SomeClass(IAnotherClass):

There's ambiguity here though: extending an interface already uses the
same syntax:

  class IExtendedClass(IAnotherClass):

Disambiguating based on name conventions seems wrong and unpythonic.
In C++, abstract classes are those that have one or more abstract
methods; maybe we can borrow from that.

> A subclass
> may also override this attribute:
> class InheritImplementationButNotTheCategoryCheckFrom(SomeClass):
>    __category__ = some_other_category
>    ...

My alternative spelling idea currently has no way to do this; but one
is needed, and preferably one that's not too ugly.

> class AddAdditionalRestrictionsTo(SomeClass):
>    __category__ = __category__ & like(YetAnotherClass)

There's a (shallow) problem here, in that __category__ is not
initially in your class's namespace: at the start of executing the
class statement, you begin with an empty local namespace.
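
The namespace problem is easy to demonstrate. In this sketch the string values are placeholders standing in for the like(...) expressions, and the class named Explicit is a hypothetical workaround, not something proposed in the thread.

```python
class SomeClass:
    __category__ = "base-category"   # placeholder for like(AnotherClass)


error = None
try:
    class AddAdditionalRestrictionsTo(SomeClass):
        # The class body starts with an empty local namespace, so the
        # inherited attribute is not visible here as a bare name.
        __category__ = __category__ + "+extra"
except NameError as exc:
    error = exc


class Explicit(SomeClass):
    # Naming the base class works, at the cost of repeating its name.
    __category__ = SomeClass.__category__ + "+extra"
```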

--Guido van Rossum (home page:

From  Thu Aug 15 13:20:06 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 08:20:06 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Thu, 15 Aug 2002 04:40:13 EDT."
References: <> <> <> <> <> <>
Message-ID: <>

> In a dynamically typed language there is no such thing as an 'integer
> variable' but it can be simulated by a reference that may only point to
> objects in the 'integer' category.

This seems a game with words.  I don't see the difference between an
integer variable and a reference that must point to an integer.
(Well, I see a difference, in the sharing semantics, but that's just
the difference between a value and a pointer in C.  They're both
variables.)

--Guido van Rossum (home page:

From  Thu Aug 15 13:31:43 2002
From: (Martijn Faassen)
Date: Thu, 15 Aug 2002 14:31:43 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <>
Message-ID: <>

Oren Tirosh wrote:
> On Wed, Aug 14, 2002 at 09:38:18PM -0400, Andrew Koenig wrote:
> > >> Not really.  I can see how an interface can claim that a particular
> > >> method exists, but not how it can claim that the method implements a
> > >> function that is antisymmetric and transitive.
> > 
> > Guido> That's done in the docs, usually.  Zope even has the notion of a
> > Guido> "marker" interface -- an interface that says "this object has property
> > Guido> such-and-such" but which does not assert any methods or attributes.
> > 
> > So perhaps what I mean by a category is the set of all types that
> > implement a particular marker interface.
> I propose that any method or attribute may serve as a marker. This makes it
> possible to use an existing practice as a marker so protocols can be
> defined retroactively for an existing code base. It's also possible, of 
> course, to add an attribute called 'has_property_such_and_such' to serve 
> as an explicit marker.

This is an interesting idea. I'd say you could plug such a 
thing into an interface system, by making 'interface.isImplementedBy()' 
calling some hooks that may dynamically claim an object implements
an interface, based on methods and attributes.

I'm not sure if it's a good idea, as if you're going to state this in
code anyway it seems to me it's clearer to actually explicitly use marker
interfaces instead of writing some code that guesses based on the presence
of particular attributes, but it's definitely an interesting idea.

> A new category may be defined as a union or intersection of two existing
> categories. This is done by ANDing or ORing the membership predicates of
> the two categories and reducing them back to canonical form.

This is similar to some ideas I came up with a couple of years ago on the
types-SIG, and Guido told me to talk about it at a conference over some beer,
instead. :)
(see bottom: Interfaces can be implied:)

and here's Guido's finding my idea too absurd:

But they're still interesting ideas. :) You can basically deduce what
attributes (and methods) are on an interface by just giving a whole bunch 
of classes that you claim implement the interface, for instance.



From  Thu Aug 15 14:02:22 2002
From: (Andrew Koenig)
Date: 15 Aug 2002 09:02:22 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>> Why?  That is, why can't you imagine making a claim that type
>> X meets interface Y, even though the author of neither X nor Y
>> made that claim?

Oren> It's not a failure of imagination, it's a failure of
Oren> terminology. In contexts where the term 'interface' is used
Oren> (Java, COM, etc) it usually means something you explicitly
Oren> expose from your objects. I find that the term 'category'
Oren> implies something you observe after the fact without modifying
Oren> the object - "these objects both have property so-and-so, let's
Oren> group them together and call it a category".

But what if it is possible to express property so-and-so as an
interface?  It's like observing that a particular set, that someone
else defined, is a group, so now all the group theorems apply to it.
Similarly, if someone has defined a class, and I happen to notice that
that class is really a reversible iterator, I would like a way saying
so that will let anyone who wants to use that class in a context that
requires a reversible iterator to do so.
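
A minimal sketch of such an after-the-fact claim. Everything here is hypothetical: the Interface class, its claim and is_implemented_by methods, and the choice of __next__ and prev as the methods that make up a "reversible iterator".

```python
class Interface:
    """Hypothetical interface object that records claims made by third parties."""

    def __init__(self, name, required):
        self.name = name
        self.required = tuple(required)
        self._claimed = set()

    def claim(self, cls):
        """Assert, from outside both cls and the interface, that cls satisfies it."""
        missing = [m for m in self.required if not hasattr(cls, m)]
        if missing:
            raise TypeError(f"{cls.__name__} lacks {missing} needed by {self.name}")
        self._claimed.add(cls)

    def is_implemented_by(self, obj):
        return any(isinstance(obj, cls) for cls in self._claimed)


class Cursor:
    """Imagine this class lives in someone else's library."""

    def __init__(self, items):
        self.items = list(items)
        self.pos = 0

    def __next__(self):
        if self.pos >= len(self.items):
            raise StopIteration
        value = self.items[self.pos]
        self.pos += 1
        return value

    def prev(self):
        if self.pos == 0:
            raise StopIteration
        self.pos -= 1
        return self.items[self.pos]


ReversibleIterator = Interface("ReversibleIterator", ["__next__", "prev"])
ReversibleIterator.claim(Cursor)   # neither author made this claim
```

The method check only confirms that the names exist. Whether prev really undoes __next__ remains a semantic claim made by whoever calls claim().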

>> However, now that you bring it up... One difference I see between
>> interfaces and categories is that I can imagine categories carrying
>> semantic information to the human reader of the code that is not
>> actually expressed in the category itself.  As a simple example,
>> I can imagine a PartialOrdering category that I might like as part
>> of the specification for an argument to a sort function.

Oren> You can define any category you like and attach a semantic
Oren> meaning to it as long as you can write a membership predicate
Oren> for the category. It may be based on a marker that the type must
Oren> have or, in case you can't change the type (e.g. a builtin type)
Oren> you can write a membership predicate that also tests for some
Oren> set of specific types.

Or perhaps a membership predicate that tests whether a type satisfies
a particular interface.

Oren> A category is defined mathematically by a membership
Oren> predicate. So what we need for type categories is a system for
Oren> writing predicates about types.

And, perhaps, a way for defining predicates that determine whether
types meet interfaces.

>> Indeed, that's what I was thinking about initially.  Guido pointed
>> out that the notion could be expanded to making concrete assertions
>> about the interface to a class.  I had originally considered that
>> those assertions could be just that--assertions, but then when
>> Guido started talking about interfaces, I realized that my original
>> thought of expressing satisfaction of a predicate by inheriting it
>> could be extended by simply adding methods to those predicates.  Of
>> course, this technique has the disadvantage that it's not easy to
>> add base classes to a class after it has been defined.

Oren> That's why the intelligence should be in the membership
Oren> predicate, not in the classes it selects. Nothing needs to be
Oren> changed about types.  Conceptually, categories apply to
Oren> *references*, not to *objects*.

I don't see why categories should not also apply to class objects.

Oren> They help you ensure that during execution certain references
Oren> may only point to objects from a limited category of types so
Oren> that the operations you perform on them are meaningful (though
Oren> not necessarily correct). A situation that may lead to a
Oren> reference pointing to an object outside the valid category
Oren> should be detected as early as possible. Detecting this during
Oren> compilation is great. On module import is good. At runtime it's
Oren> ok.


Oren> Can you think of a better name than 'categories' to describe
Oren> a set of types selected by a membership predicate?

Not offhand.

Andrew Koenig,,

From  Thu Aug 15 13:58:53 2002
From: (Martijn Faassen)
Date: Thu, 15 Aug 2002 14:58:53 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

Oren Tirosh wrote:
> On Wed, Aug 14, 2002 at 10:08:59AM -0400, Andrew Koenig wrote:
> > Why?  That is, why can't you imagine making a claim that type
> > X meets interface Y, even though the author of neither X nor Y
> > made that claim?
> It's not a failure of imagination, it's a failure of terminology. In
> contexts where the term 'interface' is used (Java, COM, etc) it usually 
> means something you explicitly expose from your objects. I find that the 
> term 'category' implies something you observe after the fact without 
> modifying the object - "these objects both have property so-and-so, let's 
> group them together and call it a category".

Okay, but in Python interfaces as we know them (the Scarecrow descended
interfaces as in use in Zope 2 and Zope 3), you can say things like this
(with some current limitations concerning basic types). Usually however
one does use the __implements__ class attribute, but that's because it's
often clearer and easier when one is writing a new class.

> That's why the intelligence should be in the membership predicate, not in 
> the classes it selects. Nothing needs to be changed about types. 
> Conceptually, categories apply to *references*, not to *objects*. They help 
> you ensure that during execution certain references may only point to 
> objects from a limited category of types so that the operations you perform 
> on them are meaningful (though not necessarily correct).

This is quite different from the Zope interfaces approach, I think. 
Zope interfaces do talk about objects, not references. This is leaning
towards a more static feel of typing (even though it may be quite different
from static typing in the details), which I think should be clearly
marked as quite independent from a discussion on interfaces. 

Though I'd still call your beasties Interfaces and not categories, even
though you want to use them in a statically typed way -- but please let's
not reject simple interfaces just because we may want to do something
complicated and involved with static typing later..



From  Thu Aug 15 14:06:21 2002
From: (Andrew Koenig)
Date: 15 Aug 2002 09:06:21 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>> I can see how it could be done using some additional syntax similar to
>> what ML uses, e.g.:

Guido> def f(a: Cat1):
Guido>    ...code for Cat1...
Guido> else f(a: Cat2):
Guido>    ...code for Cat2...
Guido> else f(a: Cat3):
Guido>    ...code for Cat3...

Greg> As long as all the implementations have to be in one place,
Greg> this is equivalent to

Greg>   def f(a):
Greg>     if belongstocategory(a, Cat1):
Greg>       ...
Greg>     elif belongstocategory(a, Cat2):
Greg>       ...
Greg>     elif belongstocategory(a, Cat3):
Greg>       ...

Greg> so you're not gaining much from the new syntax.

I'm not so sure.  The code is already somewhat simpler here, and it
would be substantially simpler in examples such as

        def arctan(x):
        else arctan(y, x):

>> It might also be possible to modify a multimethod dynamically,
>> e.g. later one could write:
>> def f4(a: Cat4):
>> ...code for Cat4...
>> f.add(f4)

Greg> This sort of scheme makes me uneasy, because it means that any module
Greg> can change the behaviour of any call of f() in any other
Greg> module.

It makes me uneasy because the behavior of programs might depend on the
order in which modules are loaded.  That's why I didn't suggest a way
of defining the variations on f in separate places.

Andrew Koenig,,

From  Thu Aug 15 14:08:25 2002
From: (Martijn Faassen)
Date: Thu, 15 Aug 2002 15:08:25 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum wrote:
> > Interfaces in Python are almost too trivial to understand, but
> > surprisingly useful. I imagine this is why so many smart Python
> > users don't get it; they either reject the notion because it seems
> > too trivial and 'therefore useless', or because they think it must
> > involve far more complication (static typing) and therefore it's too
> > complicated and not in the spirit of Python. :)
> No, I think it's because they only work well if they are used
> pervasively (not necessarily everywhere).  That's why they work in
> Zope: not only does almost everything in Zope have an interface, but
> interfaces are used to implement many Zope features.

That is not true for Zope 2, and I do use interfaces in Zope 2. Zope 2
certainly doesn't use interfaces pervasively.

I use them pervasively in some of my code which is a framework
on top of Zope, which may weaken my argument, but it's still not
true interfaces are not useful unless they're used pervasively.
It's definitely a lot more powerful if you do, of course, though.

I also think it may help that one can declare a class implements an
interface outside said class itself, in a different section of the code.
I do not have any practical experience with that outside some Zope 3
hackery, however, so I can't really defend this one very well.

> I haven't made up my mind yet whether Python could benefit as much as
> Zope, but I am cautiously looking into adding something derived from
> Zope's interface package.  Jim & I have rather different ideas on what
> the ideal interfaces API should look like though, so it'll be a while.
> Maybe I should pull down the Twisted interfaces package and see how I
> like their subset (I'm sure it must be a subset -- the Zope package is
> a true kitchen sink :-).

It's an extremely small subset and very trivial, and last I checked
they used 'implements' in a different way than Zope, unfortunately (I pointed
it out and they may have fixed that by now, not sure).

But if you are looking for another API then the Twisted version doesn't
help (except for the inadvertent 'implements()' difference).

I don't consider the Zope 3 interface package to be a kitchen sink
myself, but I've been working with it for a while now. I would note
that some of its extensibility and introspection features are quite
useful when implementing Schema (a special kind of interfaces with
descriptions about non-method attributes). If a new package is to
be designed I hope that those use cases will be taken into account.



From  Thu Aug 15 14:09:05 2002
From: (Andrew Koenig)
Date: 15 Aug 2002 09:09:05 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

Martijn> Oren Tirosh wrote:

Oren> I propose that any method or attribute may serve as a
Oren> marker. This makes it possible to use an existing practice as a
Oren> marker so protocols can be defined retroactively for an existing
Oren> code base. It's also possible, of course, to add an attribute
Oren> called 'has_property_such_and_such' to serve as an explicit
Oren> marker.

Martijn> This is an interesting idea. I'd say you could plug such a
Martijn> thing into an interface system, by making
Martijn> 'interface.isImplementedBy()' calling some hooks that may
Martijn> dynamically claim an object implements an interface, based on
Martijn> methods and attributes.

In that case, a marker is really just an interface with a single element.

Andrew Koenig,,

From  Thu Aug 15 14:13:35 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 09:13:35 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <> <> <> <>
Message-ID: <>

On Thu, Aug 15, 2002 at 08:20:06AM -0400, Guido van Rossum wrote:
> > In a dynamically typed language there is no such thing as an 'integer
> > variable' but it can be simulated by a reference that may only point to
> > objects in the 'integer' category.
> This seems a game with words.  I don't see the difference between an
> integer variable and a reference that must point to an integer.
> (Well, I see a difference, in the sharing semantics, but that's just
> the difference between a value and a pointer in C.  They're both
> variables.)

In C a pointer and a value are both "objects".  But Python references are 
not objects. In a language where almost everything is an object they are a 
conspicuous exception. A slot in a list is bound to an object but there is 
no introspectable object that represents the slot *itself*.

And yes, sharing semantics make a big difference.

My basic distinction is that type categories are not a property of objects. 
An object is what it is. It doesn't need "type checking". Type categories 
are useful to check *references* and ensure that operations on a reference 
are meaningful. A useful type checking system can be built that makes no 
change at all to objects and type, only applying tests to references. The
__category__ attribute I proposed for classes is not much more than a 
convenient way to spell:

class Foo:
    ...

assert Foo in category

The category is not stored inside the class. It is an observation about 
the class, not a property of the class.
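
A minimal sketch of a category as a membership predicate over types. The Category class and the has() helper are hypothetical names. The & and | operators simply compose predicates and make no attempt at the reduction to canonical form mentioned earlier in the thread.

```python
class Category:
    """Hypothetical: a set of types selected by a membership predicate."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __contains__(self, cls):
        return bool(self.predicate(cls))

    def __and__(self, other):
        # Intersection: a type must satisfy both predicates.
        return Category(lambda cls: cls in self and cls in other)

    def __or__(self, other):
        # Union: a type must satisfy at least one predicate.
        return Category(lambda cls: cls in self or cls in other)


def has(*names):
    """Category of types that define every attribute in names."""
    return Category(lambda cls: all(hasattr(cls, n) for n in names))


iterable = has("__iter__")
sized = has("__len__")


class Foo:
    def __iter__(self):
        return iter(())


# Nothing is stored on Foo; the category only observes it.
assert Foo in iterable
```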


From  Thu Aug 15 14:55:04 2002
From: (Kevin Jacobs)
Date: Thu, 15 Aug 2002 09:55:04 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

[I'm just jumping into this thread -- please forgive me if my reply does not
make sense in the context of the past discussions on this thread -- I've
only had time to read part of the archive.]

On Thu, 15 Aug 2002, Oren Tirosh wrote:
> In C a pointer and a value are both "objects".  But Python references are 
> not objects.

References are so transparent that you can treat them as 'instance
aliases'.

> In a language where almost everything is an object they are a 
> conspicuous exception.

What semantics do you propose for "reference objects"?

> A slot in a list is bound to an object but there is no introspectable
> object that represents the slot *itself*.

Of course there is -- slots bind a memory location in an object via a
descriptor object stored in its class.

> And yes, sharing semantics make a big difference. My basic distinction is
> that type categories are not a property of objects.

Why not deconstruct these ideas a little more and really explore what we
want, and not how we want to go about implementing it.  Let us consider the
following characteristics:

  1) Static interfaces/categories vs. dynamic interfaces/categories?

  i.e., does a particular instance always implement a given interface or
        category, or can it change over the lifetime of an object (due to
        state changes within the object, or changes within the
        interface/category registry system).
  2) Unified interface/category registry vs. distributed interface/category

  i.e., are the supported interfaces/categories intrinsic to a class or are
        they reflections of the desires of the code wishing to use an
        instance?  For example, there are many possible and sensible subsets
        of the "file-like" interface.  Must any class implementing
        "file-like" objects know about and advertize that it implements a
        predefined set of interfaces?  Or, should the code wishing to use
        the object instance be able to construct a specialized
        interface/category query with which to check the specific
        capabilities of the instance?

  3) (related) Should interfaces/categories of a particular class/instance
     be enumerable?

  4) Should interfaces/categories be monotonic with respect to inheritance?

  i.e., is it sufficient that base class of an instance implements a given
        interface/category for a derived instance to implement that

> An object is what it is. It doesn't need "type checking". Type categories 
> are useful to check *references* and ensure that operations on a reference 
> are meaningful.

This distinction is meaningless for Python.  Object references are not
typed, and thus any reference can potentially be to any type.  Putting too
fine a point on the semantic difference between the type of an object and
the type of an object referred to by some reference is just playing games.

> A useful type checking system can be built that makes no 
> change at all to objects and type, only applying tests to references. The
> __category__ attribute I proposed for classes is not much more than a 
> convenient way to spell:
> class Foo:
>     ...
> assert Foo in category
> The category is not stored inside the class. It is an observation about 
> the class, not a property of the class.

Given this description, I am guessing that these are your answers to my
previous questions:

1) dynamic, since your categories are not fixed at class creation
2) ?, either is possible, though I suspect you are advocating a standard
   unified category registry.
3) Yes, depending on implementation details
4) No

Please correct my guesses if I am mistaken.


Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19         E-mail:
Fax:   (216) 986-0714              WWW:

From  Thu Aug 15 14:58:46 2002
From: (Samuele Pedroni)
Date: Thu, 15 Aug 2002 15:58:46 +0200
Subject: [Python-Dev] FW: multimethod-0.1
Message-ID: <001801c24463$e06d3660$6d94fea9@newmexico>

[David Abrahams]
>> I haven't studied this, but from a quick glance it looks competent.
>> Multimethod-0.1 is another python module for implementing multimethods
>> (a.k.a.  generic functions, multiple-argument method dispatch).  This
>> one features:
>> - support for Python2.2 type/class unification

It works only with new-style classes and types, not with old-style
classes. I don't know if that's a problem; with old-style classes,
mutability of __bases__ becomes a problem.

>> - a precedence graph for more efficient dispatching

>> - a best-fit resolution algorithm, in which the method closest in
>> inheritance distance is called

This makes me uneasy: either we go the CLOS way, where arguments further
to the left take priority, or the Dylan way, i.e. in the face of ambiguity
throw an exception (in that case we could add a mechanism to
force dispatch on a supplied supertype, like my _redisptach).

>> - a versatile 'call-next-method' or 'super' function.

FYI which uses a dictionaries on frames

>It's a good start, but from the docs it doesn't appear to deal with:

it's a start

>a. Type categories -- it seems as though the only way for a multimethod
>implementation to match an actual argument is if the formal argument has an
>inheritance relationship with it.

but it's really an orthogonal problem (the concrete problem wrt
multidispatch is just how to merge categories in the mro),
making categories/protocols first class is a different can of worms.

>b. Implicit conversions -- If I declare a function that accepts a Python
>int, can I pass a Python float?

maybe if it accepts a float you can pass an int <wink>. I see the issue
but I don't know if this should be the default on the Python side in general.
It should be configurable for the single multimethod.

c. there should be a version that can be put in a class and behave
like a method (e.g. to implement the moral of overloading), producing
versions bound and unbound in the first argument when retrieved through the
class or instances. It should probably take a function to be called
in case no matching method is found; this can be used to redispatch
to the super classes, or something similar, unless we want to redefine how the
whole single dispatch works.


From  Thu Aug 15 15:02:31 2002
From: (Andrew Koenig)
Date: 15 Aug 2002 10:02:31 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>> In a language where almost everything is an object they are a 
>> conspicuous exception.

Kevin> What semantics do you propose for "reference objects"?

I think the idea is to constrain the circumstances under which a
reference can be created in the first place.

Andrew Koenig,,

From  Thu Aug 15 15:11:30 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 10:11:30 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Thu, 15 Aug 2002 14:58:53 +0200."
References: <> <> <> <> <>
Message-ID: <>

I think we may have retired the types-sig a week or two too
early... :-)

This kind of discussion is great for sharpening our intellect; the
types-sig had outbursts like this maybe twice a year.  But I have yet
to see something come out of it that was practical enough to be added
to Python "now".  Maybe Zope Interfaces are our best bet.  They surely
have been used and refined for almost four years now...

--Guido van Rossum (home page:

From  Thu Aug 15 15:53:02 2002
From: (Oren Tirosh)
Date: Thu, 15 Aug 2002 10:53:02 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Thu, Aug 15, 2002 at 09:55:04AM -0400, Kevin Jacobs wrote:
> [I'm just jumping into this thread -- please forgive me if my reply does not
> make sense in the context of the past discussions on this thread -- I've
> only had time to read part of the archive.]
> On Thu, 15 Aug 2002, Oren Tirosh wrote:
> > In C a pointer and a value are both "objects".  But Python references are 
> > not objects.
> References are so transparent that you can treat them as 'instance
> aliases'.
> > In a language where almost everything is an object they are a 
> > conspicuous exception.
> What semantics do you propose for "reference objects"?

No no no! I am not proposing anything like that.

What I'm saying is that interfaces/categories/whateveryouwannacallit are
more about references to objects than about the objects themselves and 
pointed out that references are not even Python objects.

Two references to the same object may have very different expectations about 
what they are pointing to. I went a step further and decided to completely 
decouple it from the object: All the intelligence is in the category that 
makes observations about the object's form without requiring any change to 
the objects or types.


From  Thu Aug 15 17:13:31 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 12:13:31 -0400
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: Your message of "Wed, 14 Aug 2002 14:51:25 EDT."
Message-ID: <>

> python/sf/587993
> Looks like Michael Hudson did an *outstanding* and very thorough job
> on this.  Does anybody see a reason why I shouldn't let him check this
> in?

OK, Michael's checked it in, after some comments from Martin.  Woo
hoo!

But here's some sad news.  I only see a speed increase of 0.5%!  I
believe that when we first looked at this patch the speedup was about
5%...  Worse, Tim claims that on his Windows box it's actually 5%
slower.  What happened?

--Guido van Rossum (home page:

From  Thu Aug 15 17:20:37 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 12:20:37 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Wed, 14 Aug 2002 23:02:27 EDT."
Message-ID: <>

Again: python/sf/576101

I'd like to make all interned strings mortal; this allows some
simplifications to the patch.  This would mean that in the following
example:

  x = intern('12345'*4)
  nx = id(x)
  del x
  ...something else...
  y = intern('12345'*4)
  ny = id(y)

nx doesn't necessarily equal ny any more.  This is a backward
incompatibility but I'm willing to break programs that rely on this;
it sounds highly unlikely that the author of any such code as exists
would mind it being broken.

Opinions?


(Reminder: this is python-dev, not types-sig. :-)

--Guido van Rossum (home page:

From  Thu Aug 15 17:22:59 2002
From: (Michael Hudson)
Date: 15 Aug 2002 17:22:59 +0100
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: Guido van Rossum's message of "Thu, 15 Aug 2002 12:13:31 -0400"
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> > python/sf/587993
> > 
> > Looks like Michael Hudson did an *outstanding* and very thorough job
> > on this.  Does anybody see a reason why I shouldn't let him check this
> > in?
> OK, Michael's checked it in, after some comments from Martin.  Woo
> hoo!


> But here's some sad news.  I only see a speed increase of 0.5%!  I
> believe that when we first looked at this patch the speedup was about
> 5%...  Worse, Tim claims that on his Windows box it's actually 5%
> slower.  What happened?

Beats me.  I still see a healthy speed up:

Before:

$ ./python ../Lib/test/
Pystone(1.1) time for 50000 passes = 3.99
This machine benchmarks at 12531.3 pystones/second

After:

$ ./python ../Lib/test/
Pystone(1.1) time for 50000 passes = 3.65
This machine benchmarks at 13698.6 pystones/second

(which is nosing on for 10% faster, actually).

You're not testing a debug vs a release build or anything like that
are you?


  That's why the smartest companies use Common Lisp, but lie about it
  so all their competitors think Lisp is slow and C++ is fast.  (This
  rumor has, however, gotten a little out of hand. :)
                                        -- Erik Naggum, comp.lang.lisp

From  Thu Aug 15 17:37:12 2002
From: (Neil Schemenauer)
Date: Thu, 15 Aug 2002 09:37:12 -0700
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: <>; from on Thu, Aug 15, 2002 at 12:13:31PM -0400
References: <>
Message-ID: <>

Guido van Rossum wrote:
> But here's some sad news.  I only see a speed increase of 0.5%!

Based on pystone, the current CVS tree seems to be about 8% faster on my
machine than it was two days ago.


From  Thu Aug 15 17:46:25 2002
From: (Tim Peters)
Date: Thu, 15 Aug 2002 12:46:25 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
Message-ID: <>

> I'd like to make all interned strings mortal; this allows some
> simplifications to the patch.  This would mean that in the following
> example:
>   x = intern('12345'*4)
>   nx = id(x)
>   del x
> something else...
>   y = intern('12345'*4)
>   ny = id(y)
> nx doesn't necessarily equal ny any more.  This is a backward
> incompatibility but I'm willing to break programs that rely on this;
> it sounds highly unlikely that the author of any such code as exists
> would mind it being broken.
> Opinions?

As the only person to have posted an example relying on this behavior, it's
OK by me if that example breaks -- it was made up just to illustrate the
possibility and raise a caution flag.  I don't take it seriously.

> (Reminder: this is python-dev, not types-sig. :-)

Oops!  I guess I should take it more seriously then <wink>.

From  Thu Aug 15 17:54:29 2002
From: (Tim Peters)
Date: Thu, 15 Aug 2002 12:54:29 -0400
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: <>
Message-ID: <>

[Michael Hudson]
> Beats me.  I still see a healthy speed up:
> Before:
> $ ./python ../Lib/test/
> Pystone(1.1) time for 50000 passes = 3.99
> This machine benchmarks at 12531.3 pystones/second
> After:
> $ ./python ../Lib/test/
> Pystone(1.1) time for 50000 passes = 3.65
> This machine benchmarks at 13698.6 pystones/second
> (which is nosing on for 10% faster, actually).
> You're not testing a debug vs a release build or anything like that
> are you?

I'm not, but I was comparing -O times (in release builds).  Three runs
before patch:

C:\Code\python\PCbuild>python -O ../lib/test/
Pystone(1.1) time for 50000 passes = 3.49756
This machine benchmarks at 14295.7 pystones/second

C:\Code\python\PCbuild>python -O ../lib/test/
Pystone(1.1) time for 50000 passes = 3.49881
This machine benchmarks at 14290.6 pystones/second

C:\Code\python\PCbuild>python -O ../lib/test/
Pystone(1.1) time for 50000 passes = 3.52653
This machine benchmarks at 14178.2 pystones/second

Three runs after patch:

C:\Code\python\PCbuild>python -O  ../lib/test/
Pystone(1.1) time for 50000 passes = 3.74291
This machine benchmarks at 13358.6 pystones/second

C:\Code\python\PCbuild>python -O  ../lib/test/
Pystone(1.1) time for 50000 passes = 3.74544
This machine benchmarks at 13349.6 pystones/second

C:\Code\python\PCbuild>python   ../lib/test/
Pystone(1.1) time for 50000 passes = 3.74487
This machine benchmarks at 13351.6 pystones/second

Three runs after commenting out the new

		if (tstate->c_tracefunc != NULL && !tstate->tracing) {
			/* see maybe_call_line_trace
			   for expository comments */
					      f, &instr_lb, &instr_ub);

on the eval-loop critical path:

C:\Code\python\PCbuild>python   ../lib/test/
Pystone(1.1) time for 50000 passes = 3.59444
This machine benchmarks at 13910.4 pystones/second

C:\Code\python\PCbuild>python   ../lib/test/
Pystone(1.1) time for 50000 passes = 3.59211
This machine benchmarks at 13919.4 pystones/second

C:\Code\python\PCbuild>python   ../lib/test/
Pystone(1.1) time for 50000 passes = 3.59742
This machine benchmarks at 13898.9 pystones/second

OTOH, MSVC 6 has been generating faster ceval.c code than gcc for a long
time; given how touchy this is, maybe it's just time for gcc to win 587 coin
flips in a row <wink>.
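
For anyone wanting to recompute the percentages being traded in this
thread, here is a small sketch. It is written in present-day Python 3
rather than the interpreter under discussion, the helper names mean()
and pct_change() are made up for illustration, and the inputs are
simply the figures quoted in the messages above.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pct_change(old, new):
    """Percentage change going from old to new (positive: new is larger)."""
    return (new - old) / old * 100.0


# Seconds for 50000 pystone passes, copied from the three sets of runs above.
before_patch = [3.49756, 3.49881, 3.52653]
after_patch = [3.74291, 3.74544, 3.74487]
trace_check_removed = [3.59444, 3.59211, 3.59742]

# Extra run time relative to the pre-patch build.
slowdown_after = pct_change(mean(before_patch), mean(after_patch))
slowdown_removed = pct_change(mean(before_patch), mean(trace_check_removed))

# Michael Hudson's pystones/second figures, before and after the patch.
speedup_michael = pct_change(12531.3, 13698.6)

print("after patch:         %+.2f%% run time" % slowdown_after)
print("trace check removed: %+.2f%% run time" % slowdown_removed)
print("Michael's machine:   %+.2f%% pystones/second" % speedup_michael)
```

On these inputs the patched Windows build takes somewhat under 7% longer
than the pre-patch one, commenting out the trace check brings that down
to about 2.5%, and Michael's numbers work out to a bit over 9% more
pystones per second.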

From  Thu Aug 15 18:31:04 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 13:31:04 -0400
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: Your message of "Thu, 15 Aug 2002 17:22:59 BST."
References: <>
Message-ID: <>

> > But here's some sad news.  I only see a speed increase of 0.5%!  I
> > believe that when we first looked at this patch the speedup was about
> > 5%...  Worse, Tim claims that on his Windows box it's actually 5%
> > slower.  What happened?
> Beats me.  I still see a healthy speed up:
> Before:
> $ ./python ../Lib/test/ 
> Pystone(1.1) time for 50000 passes = 3.99
> This machine benchmarks at 12531.3 pystones/second
> After:
> $ ./python ../Lib/test/ 
> Pystone(1.1) time for 50000 passes = 3.65
> This machine benchmarks at 13698.6 pystones/second
> (which is nosing on for 10% faster, actually).
> You're not testing a debug vs a release build or anything like that
> are you?

Absolutely not.  I did a "cvs update -D yesterday" and built a
Python binary.  That was only about 1.5% slower than today's binary
built in the same directory minutes earlier.

On the other hand, on a different machine, I had a checkout that was
approximately 2 days old, and there the latest checkout was about 6%
faster (all without -O).  Perhaps something else has happened in the
last two days that actually is responsible for the speedup?

I'm also happy to report that with current cvs, -O makes almost no
difference: it's only 0.18% faster.  Current cvs without -O is about
the same speed as two days ago with -O.

So it's still a mystery.  What happened yesterday that could have
caused a speedup?

--Guido van Rossum (home page:

From  Thu Aug 15 18:43:20 2002
From: (Christian Tismer)
Date: Thu, 15 Aug 2002 19:43:20 +0200
Subject: [Python-Dev] Re: SET_LINENO killer
References: <>              <> <>
Message-ID: <>

Guido van Rossum wrote:

> So it's still a mystery.  What happened yesterday that could have
> caused a speedup?

Without looking into it, but I have spent weeks in
speeding up the main loop, and I can tell you that
it shows some fractal behavior, at least under
Windows. It appears to make a difference, when small
code moves happen, or some variable vanishes and the
compiler decides to re-arrange registers or change
code ordering, and especially folding. I experienced
that some changes of mine caused the compiler to
re-use a common code sequence, which caused one
more jump and made it slower.

Just to tell you, it isn't always under your control,
and something that *should* run faster is actually
slower, especially when you're fiddling with
fractions of percentiles.

SET_LINENO was so cheap, that after some tests,
I decided to keep it in, also since I found it
useful for line-wise interrupts.

Btw., with the new patch, how is tracing done now?
(sorry, I could read the sources but I'm under pressure)

cheers - chris

Christian Tismer             :^)   <>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship*
14109 Berlin                 :     PGP key ->
work +49 30 89 09 53 34  home +49 30 802 86 56  pager +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?

From  Thu Aug 15 18:37:39 2002
From: (Samuele Pedroni)
Date: Thu, 15 Aug 2002 19:37:39 +0200
Subject: [Python-Dev] Alternative implementation of interning
Message-ID: <001201c24482$7458a940$6d94fea9@newmexico>

In Jython as long as we want to support Java 1.1
(and AFAIK Finn still will) we cannot make interned
string always mortal.
So it is OK if CPython goes this route, but the Python
manual should say that it is unspecified whether
intern results are mortal or immortal or nothing on the subject
(now it explicitly says immortal).


From  Thu Aug 15 18:58:16 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Thu, 15 Aug 2002 19:58:16 +0200
Subject: [Python-Dev] mimetypes patch #554192
Message-ID: <>

Patch adds a function to the mimetypes module that returns all known
extensions for a mimetype:

 >>> import mimetypes
 >>> mimetypes.guess_all_extensions("image/jpeg")
['.jpg', '.jpe', '.jpeg']

Martin v. Loewis and I were discussing whether it would make
sense to make the helper method add_type (which is used for
adding a mapping between one type and one extension) visible
on the module level.

Any comments?

    Walter Dörwald

From  Thu Aug 15 19:04:55 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 14:04:55 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Thu, 15 Aug 2002 19:37:39 +0200."
References: <001201c24482$7458a940$6d94fea9@newmexico>
Message-ID: <>

> In Jython as long as we want to support Java 1.1
> (and AFAIK Finn still will) we cannot make interned
> string always mortal.
> So it is OK if CPython goes this route, but the Python
> manual should say that it is unspecified whether
> intern results are mortal or immortal or nothing on the subject
> (now it explicitly says immortal).

That's okay.  Immortality of interned strings is mostly an issue for
very long running server processes that take connections from
arbitrary clients; the issue is that arbitrary client data
accidentally gets immortalized because it is tried as an attribute
name or mapping key.  While Jython *could* be used in JSP server
setups, I expect that most long-running Python servers are using
CPython and a framework like Zope, Twisted or Quixote.

--Guido van Rossum (home page:

From  Thu Aug 15 19:04:11 2002
From: (Aahz)
Date: Thu, 15 Aug 2002 14:04:11 -0400
Subject: [Python-Dev] Mutable exceptions? (was Re: PEP 293, Codec Error Handling Callbacks)
Message-ID: <>

On Mon, Aug 12, 2002, Martin v. Loewis wrote:
> "M.-A. Lemburg" <> writes:
>> What ? That exceptions are immutable ? I think it's a big win that
>> exceptions are in fact mutable -- they are great for transporting
>> extra information up the chain...
> I see. So this is an open issue.

This looks like an issue that potentially deserves more community
feedback, so I'm ripping it out and starting a new thread: should
exception objects be treated as mutable as exceptions get caught and
re-raised?

(I'm not suggesting any code changes, just trying to get a feel for what
"standard practice" ought to be, partly for the book I'm writing.)
Aahz (           <*>

Project Vote Smart:

From  Thu Aug 15 19:05:57 2002
From: (Aahz)
Date: Thu, 15 Aug 2002 14:05:57 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <001201c24482$7458a940$6d94fea9@newmexico>
References: <001201c24482$7458a940$6d94fea9@newmexico>
Message-ID: <>

On Thu, Aug 15, 2002, Samuele Pedroni wrote:
> In Jython as long as we want to support Java 1.1 (and AFAIK Finn still
> will) we cannot make interned string always mortal.  So it is OK if
> CPython goes this route, but the Python manual should say that it is
> unspecified whether intern results are mortal or immortal or nothing
> on the subject (now it explicitly says immortal).

Isn't this irrelevant, anyway, because Jython doesn't implement
CPython's refcount mechanisms?
Aahz (           <*>

Project Vote Smart:

From  Thu Aug 15 19:15:51 2002
From: (Guido van Rossum)
Date: Thu, 15 Aug 2002 14:15:51 -0400
Subject: [Python-Dev] Mutable exceptions? (was Re: PEP 293, Codec Error Handling Callbacks)
In-Reply-To: Your message of "Thu, 15 Aug 2002 14:04:11 EDT."
References: <>
Message-ID: <>

> This looks like an issue that potentially deserves more community
> feedback, so I'm ripping it out and starting a new thread: should
> exception objects be treated as mutable as exceptions get caught and
> re-raised?

(A new thread in python-dev hardly counts as "community feedback".)

I'd say definitely.  Code like this looks reasonable to me:

  def some_function(arg):
    try:
      ...
    except SomeExpectedExceptionClass, obj:
      ...

Then some outer piece of code catches exceptions and produces a
traceback augmented by information added by various calls to

--Guido van Rossum (home page:

From  Thu Aug 15 19:19:24 2002
From: (Barry A. Warsaw)
Date: Thu, 15 Aug 2002 14:19:24 -0400
Subject: [Python-Dev] mimetypes patch #554192
References: <>
Message-ID: <>

>>>>> "WD" == Walter Dörwald <> writes:

    WD> Martin v. Loewis and I were discussing whether it would make
    WD> sense to make the helper method add_type (which is used for
    WD> adding a mapping between one type and one extension) visible
    WD> on the module level.

    WD> Any comments?

+1 on add_type() being public, but it should probably have a strict
flag to decide whether to add the new entry to the standard types dict
or the common types dict.
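
As a rough illustration of what a module-level add_type() with a strict
flag looks like in use, here is a sketch against the mimetypes module as
it exists in current Python 3 (an assumption relative to the patch being
discussed here). The type "application/x-hypothetical" and the extension
".hypo" are invented for the example.

```python
import mimetypes

# Invented type/extension pair, used only for this illustration.
FAKE_TYPE = "application/x-hypothetical"
FAKE_EXT = ".hypo"

# strict=False files the mapping with the non-standard ("common") types
# rather than with the standard ones.
mimetypes.add_type(FAKE_TYPE, FAKE_EXT, strict=False)

# A strict lookup consults only the standard types...
strict_exts = mimetypes.guess_all_extensions(FAKE_TYPE, strict=True)
# ...while a non-strict lookup also includes the non-standard ones.
lenient_exts = mimetypes.guess_all_extensions(FAKE_TYPE, strict=False)

print(strict_exts, lenient_exts)
```

The point of the flag is that a caller who only trusts registered types
never sees the extra entry, while a more permissive caller does.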


From  Thu Aug 15 19:21:38 2002
From: (Barry A. Warsaw)
Date: Thu, 15 Aug 2002 14:21:38 -0400
Subject: [Python-Dev] Mutable exceptions? (was Re: PEP 293, Codec Error Handling Callbacks)
References: <>
Message-ID: <>

>>>>> "A" == Aahz  <> writes:

    >> "M.-A. Lemburg" <> writes:
    >> What ? That exceptions are immutable ? I think it's a big win
    >> that exceptions are in fact mutable -- they are great for
    >> transporting extra information up the chain...
    >> I see. So this is an open issue.

    A> This looks like an issue that potentially deserves more
    A> community feedback, so I'm ripping it out and starting a new
    A> thread: should exception objects be treated as mutable as
    A> exceptions get caught and re-raised?

    A> (I'm not suggesting any code changes, just trying to get a feel
    A> for what "standard practice" ought to be, partly for the book
    A> I'm writing.)

MAL's right, it /is/ occasionally useful to do this.  A call higher up
the chain may have more information about the failing condition, and
it can be useful to augment the exception object with this extra
information.  That's one of the reasons why exception classes are so
much nicer than exception strings!
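
A minimal sketch of that pattern, written in present-day Python 3 syntax
(so "except ... as exc" rather than the "except Class, obj" form used in
this thread). The class SomeExpectedError, the attribute extra_info and
the three functions are all hypothetical names chosen for illustration.

```python
class SomeExpectedError(Exception):
    """Hypothetical exception class, for illustration only."""


def inner(arg):
    raise SomeExpectedError("bad value: %r" % (arg,))


def some_function(arg):
    try:
        inner(arg)
    except SomeExpectedError as exc:
        # Mutate the caught exception object, then re-raise the same object.
        notes = getattr(exc, "extra_info", [])
        notes.append("while in some_function(%r)" % (arg,))
        exc.extra_info = notes
        raise


def outer(arg):
    # The outer layer sees both the original message and the added context.
    try:
        some_function(arg)
    except SomeExpectedError as exc:
        return str(exc), exc.extra_info


message, context = outer(42)
print(message, context)
```

Because the bare raise re-raises the very same object, every layer on
the way up can append to extra_info without losing what earlier layers
recorded.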


From  Thu Aug 15 20:02:31 2002
From: (Neil Schemenauer)
Date: Thu, 15 Aug 2002 12:02:31 -0700
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>; from on Thu, Aug 15, 2002 at 02:05:57PM -0400
References: <001201c24482$7458a940$6d94fea9@newmexico> <>
Message-ID: <>

Aahz wrote:
> Isn't this irrelevant, anyway, because Jython doesn't implement
> CPython's refcount mechanisms?

I don't think mark & sweep or copying GC saves you.  If Jython keeps a
reference to interned strings then the GC cannot free that memory.


From  Thu Aug 15 19:51:54 2002
From: (Samuele Pedroni)
Date: Thu, 15 Aug 2002 20:51:54 +0200
Subject: [Python-Dev] Alternative implementation of interning
References: <001201c24482$7458a940$6d94fea9@newmexico>  <>
Message-ID: <005a01c2448c$d37a19e0$6d94fea9@newmexico>

From: Guido van Rossum <>
> > In Jython as long as we want to support Java 1.1
> > (and AFAIK Finn still will) we cannot make interned
> > string always mortal.
> > So it is OK if CPython goes this route, but the Python
> > manual should say that it is unspecified whether
> > intern results are mortal or immortal or nothing on the subject
> > (now it explicitly says immortal).
> That's okay.  Immortality of interned strings is mostly an issue for
> very long running server processes that take connections from
> arbitrary clients; the issue is that arbitrary client data
> accidentally gets immortalized because it is tried as an attribute
> name or mapping key.  While Jython *could* be used in JSP server
> setups, I expect that most long-running Python servers are using
> CPython and a framework like Zope, Twisted or Quixote.

Ok, thinking a bit more it's a kind of trade-off
('is' speed for Python strings and 2 ref plus an int vs. a ref a boolean and an
of space required for Python strings (which is kind of VM dependent and should
be measured)), we could make
the Python interned strings mortal but anyway:

- we use Java interned strings (immortal anyway) for class,module, and instance
dictionaries anyway.
- for the rest Python interned strings are just the result of intern()
  with the property that the wrapped Java string is also a Java interned one
(so immortal).

so the point for us is a bit muddy.


From  Thu Aug 15 22:02:04 2002
From: (David Abrahams)
Date: Thu, 15 Aug 2002 17:02:04 -0400
Subject: [Python-Dev] type categories
References: <> <> <0c0501c24311$8cebbdc0$> <> <0ccc01c24341$d839b130$> <> <0ce601c24346$ff967f60$> <>
Message-ID: <13b001c2449f$08423240$>

From: "Neil Schemenauer" <>

> David Abrahams wrote:
> > There's not all that much to what I'm doing. I have a really
> > dispatching scheme which checks each overload in sequence, and takes
> > the first one which can get a match for all arguments.
> Can you explain in more detail how the matching is done?  Wouldn't
> having some kind of type declarations be a precondition to implementing
> multiple dispatch.

Since in Boost.Python we are ultimately wrapping C++ function and member
function pointers, the type declarations are available to us. For each C++
type, any number of from_python converters may be registered with the
system. Each converter can have its own matching criterion. For example,
there is a pre-registered converter for each of the built-in C++ integral
types which checks the source object's tp_int field to decide
convertibility. When you wrap a C++ class, a from_python converter is
registered whose convertibility criterion checks to see if the source
object is one of my extension classes, then asks if it contains a C++
object of the appropriate type. Since we have C++ types corresponding to
some of the built-in Python types (e.g. list, dict, str), the
convertibility criterion for those just checks to see whether the Python
object has the appropriate type. However, we're not limited to matching
precise types: we could easily make a C++ type called "sequence" whose
converter would match any Python sequence (if we could decide exactly what
constitutes a Python sequence <.02 wink>).
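
For readers who want the shape of this without the C++, here is a toy
model in Python. It is only a sketch of the first-match idea described
above, not Boost.Python's actual implementation: the CONVERTERS table,
convertible() and OverloadSet are invented names, and checking for
__int__ merely stands in for the tp_int test.

```python
# Target type name -> list of convertibility predicates (all hypothetical).
CONVERTERS = {
    "int": [lambda obj: hasattr(obj, "__int__")],   # stand-in for tp_int
    "str": [lambda obj: isinstance(obj, str)],
    "list": [lambda obj: isinstance(obj, list)],
}


def convertible(obj, target):
    """True if any converter registered for target accepts obj."""
    return any(accepts(obj) for accepts in CONVERTERS.get(target, []))


class OverloadSet:
    """Tries overloads in registration order; the first full match wins."""

    def __init__(self):
        self._overloads = []

    def add(self, signature, func):
        self._overloads.append((tuple(signature), func))

    def __call__(self, *args):
        for signature, func in self._overloads:
            if len(signature) == len(args) and all(
                convertible(arg, target)
                for arg, target in zip(args, signature)
            ):
                return func(*args)
        raise TypeError("no overload matches the given arguments")


describe = OverloadSet()
describe.add(("int",), lambda n: "integral")
describe.add(("str",), lambda s: "string")
describe.add(("int", "list"), lambda n, items: "integral + list")

print(describe(3), describe("x"), describe(2, [1]))
```

Note that in this toy a float also matches the "int" slot, because the
criterion is "has __int__" rather than "is exactly an int"; that is the
sense in which matching is not limited to precise types.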


P.S. If you want even /more/ gory details, just ask: I have plenty ;-)

           David Abrahams * Boost Consulting *

From  Fri Aug 16 00:19:23 2002
From: (David Abrahams)
Date: Thu, 15 Aug 2002 19:19:23 -0400
Subject: [Python-Dev] type categories
References: <> <>
Message-ID: <14c201c244b2$8cc651f0$>

From: "Andrew Koenig" <>

> Greg> so you're not gaining much from the new syntax.
> I'm not so sure.  The code is already somewhat simpler here, and it
> would be substantially simpler in examples such as
>         def arctan(x):
>             ...
>         else arctan(y, x):
>             ...
> >> It might also be possible to modify a multimethod dynamically,
> >> e.g. later one could write:
> >>
> >> def f4(a: Cat4):
> >> ...code for Cat4...
> >>
> >> f.add(f4)
> Greg> This sort of scheme makes me uneasy, because it means that any
> Greg> module can change the behaviour of any call of f() in any other
> Greg> module.
> It makes me uneasy because the behavior of programs might depend on the
> order in which modules are loaded.  That's why I didn't suggest a way
> of defining the variations on f in separate places.

This concern seems most un-pythonic to my eye, since there are already all
kinds of ways any module can change the behavior of any call in another
module. The most direct way is by rebinding the implementation of another
module's function. Python is a dynamic language, and that is usually seen
as a strength.

More importantly, though, forcing all the definitions to be in one place
prevents an important (you might even say the most important) use case: the
author of a new type should be able to provide a multimethod
implementation corresponding to that type. For example, if I write a
rational number class, I should be able to plug in a corresponding arctan

I'm extra-surprised to see that Andy's uneasy about this, since a C++
feature which (colloquially) bears his name was purpose-built to make this
sort of thing possible. Koenig lookup raises a similar issue: that the
behavior of a function call can be changed depending on which headers are
#included, and even the order they're #included in.

[I personally have many other concerns about how that feature worked out in
C++ - paper available on request - but the Python implementation being
suggested here suffers none of those problems because of its explicit


From  Fri Aug 16 01:18:41 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 12:18:41 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

> I mean, you can currently do
> import mod
> mod.func = my_func # evil cackle!

That's my point -- f.add(method) is just like doing that.

As with import *, no doubt the problems can be managed with
appropriate discipline. But it's not clear to me what sort of
discipline is needed.

Suppose you have a module H defining class Hobbit, and a module E
defining class Elf. Now you want to be able to add hobbits and elves,
but you don't want to clutter up either H or E with stuff concerning
the other one, so you put it in a third module M.

Now suppose module X creates a Hobbit, and module Y creates an Elf. In
the course of processing, they both end up in module Z, which adds
them together.

It's not clear how you can tell, when looking at Z, where to look to
find out what method will be called -- even if you know you're dealing
with a Hobbit and an Elf.

There's another problem, too -- who is responsible for *importing*
module M? It's not E or H, neither of which knows about the
other. It's not X, which only knows about Hobbits, or Y, which only
knows about Elves. It's not Z, which doesn't know about either of them
-- it just gets two things to add together.

So, unless you arbitrarily insert an import of M into one of these
modules, for no reason that's apparent from looking at that module --
M won't be imported at all, and the method it contains won't be added
to the generic function.
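
The problem can be acted out in a single file. What follows is a sketch
under stated assumptions: GenericFunction is a toy that dispatches on the
exact types of its arguments (no inheritance), and import_module_m() is
a stand-in for "import M", whose only effect is to register the
Hobbit/Elf method. None of these names come from an existing library.

```python
class GenericFunction:
    """Toy multimethod: dispatches on the exact types of all arguments."""

    def __init__(self, name):
        self.name = name
        self._methods = {}

    def add(self, types, implementation):
        self._methods[tuple(types)] = implementation

    def __call__(self, *args):
        key = tuple(type(arg) for arg in args)
        if key not in self._methods:
            names = ", ".join(t.__name__ for t in key)
            raise TypeError("%s: no method for (%s)" % (self.name, names))
        return self._methods[key](*args)


class Hobbit:       # would live in module H
    pass


class Elf:          # would live in module E
    pass


add = GenericFunction("add")    # the shared generic function


def import_module_m():
    """Stands in for 'import M'; registration is M's import-time effect."""
    add.add((Hobbit, Elf), lambda hobbit, elf: "hobbit+elf")


def module_z(a, b):
    """Module Z: adds whatever two things it is handed."""
    return add(a, b)


# Nobody has imported M yet, so Z's call fails.
try:
    module_z(Hobbit(), Elf())
    failed_before_import = False
except TypeError:
    failed_before_import = True

# Once something, somewhere, imports M, the same call starts working.
import_module_m()
result_after_import = module_z(Hobbit(), Elf())
print(failed_before_import, result_after_import)
```

Nothing in module_z() changes between the two calls; what it does
depends entirely on whether some unrelated piece of code happened to
trigger M's registration first.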

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 01:48:38 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 12:48:38 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

Oren Tirosh <>:

> Two references to the same object may have very different expectations
> about what they are pointing to.

It sounds a bit odd to talk about references having expectations.

I think all Oren is trying to say is that different pieces of code may
have different requirements of the same object, or the same piece of
code at different times, and that it's not practical to precalculate
all the requirements that might exist and put information about them
in the object itself or its class, or anywhere else for that matter.
So he wants to check the requirements procedurally whenever such
a check is needed.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 01:51:51 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 12:51:51 +1200 (NZST)
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
Message-ID: <>

>  x = intern('12345'*4)
>  nx = id(x)
>  del x
> something else...
>  y = intern('12345'*4)
>  ny = id(y)
> nx doesn't necessarily equal ny any more.  This is a backward
> incompatibility

If you wrote something like that *expecting* the strings
to be immortal, there would be no reason to bother with
the ids -- just keep references to the strings themselves.

If you *weren't* expecting them to be immortal, there would
be no reason to expect the ids to be equal anyway.

So I agree -- it's not a problem.
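
To make the "keep references, not ids" point concrete, here is a sketch
using sys.intern() from present-day Python 3 (the thread is about the
intern() builtin of the time, so treat the translation as an assumption).
The helper build() is invented; it exists only so that the strings are
created at run time as separate objects.

```python
import sys


def build():
    # Built at run time so each call produces a fresh string object.
    return "".join(["12345"] * 4)


# While a reference to the interned string is kept alive, interning an
# equal string hands back the very same object.
kept = sys.intern(build())
again = sys.intern(build())
same_object = kept is again

print(same_object)
```

If every reference were dropped between the two intern() calls, nothing
would guarantee that the second result has the same id as the first,
which is exactly the case the example at the top of this message
describes.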

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 02:10:31 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 13:10:31 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <14c201c244b2$8cc651f0$>
Message-ID: <>

David Abrahams <>:

> This concern seems most un-pythonic to my eye, since there are already
> all kinds of ways any module can change the behavior of any call in
> another module.

Yes, but most of the time you don't have to use them!
With this feature, it would be the *normal* way of using
it.

> forcing all the definitions to be in one place prevents an important
> (you might even say the most important) use case: the author of a new
> type should be able to provide a multimethod implementation
> corresponding to that type.

You can get that without the notion of a generic function as a
separate entity. Just have a dispatch mechanism that looks in all the
arguments of a call for a method to use, instead of just the first
one.

That would be relatively tractable, since at least you'd know that the
method must be found in one of the argument classes somewhere.

It also doesn't suffer from the who-imports-the-module problem, since
someone must have imported it in order to get an object of that class
in the first place.
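
One way such a mechanism might look, purely as a sketch: dispatch() and
the Rational class below are invented for illustration, the candidate
function is written as a staticmethod so that it receives the call's
arguments in their original order, and returning NotImplemented means
"not mine, keep looking".

```python
def dispatch(name, *args):
    """Look in the class of each argument, in order, for a function
    called name, and use the first one that accepts the arguments."""
    for arg in args:
        candidate = getattr(type(arg), name, None)
        if candidate is None:
            continue
        result = candidate(*args)
        if result is not NotImplemented:
            return result
    raise TypeError("no argument class provides %r for these arguments"
                    % (name,))


class Rational:
    def __init__(self, num, den):
        self.num, self.den = num, den

    @staticmethod
    def scale(a, b):
        # Handles Rational-by-int scaling in either argument order.
        if isinstance(a, Rational) and isinstance(b, int):
            return Rational(a.num * b, a.den)
        if isinstance(a, int) and isinstance(b, Rational):
            return Rational(a * b.num, b.den)
        return NotImplemented


# int has no 'scale', so the method is found via the second argument.
scaled = dispatch("scale", 3, Rational(1, 2))
print(scaled.num, scaled.den)
```

The method is always found in the class of one of the arguments, so a
reader of the call site knows where to look, and the class must already
have been imported for the argument to exist.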

The use case that this doesn't cover is where you're not defining a
new class, just trying to add behaviour to handle a previously
unanticipated combination of existing classes.  The organisational
problems involved in that aren't unique to Python, and seem to me an
inherent feature of the problem itself. Where does functionality
belong that isn't owned by any class?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 02:26:10 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 13:26:10 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <14c201c244b2$8cc651f0$>
Message-ID: <>

David Abrahams <>:

> Koenig lookup raises a similar issue: that the behavior of a function
> call can be changed depending on which headers are #included, and even
> the order they're #included in.

But at least you can, in principle, figure out what will be done by a
particular call in the source, by examining the files included by that
source file.

With the proposed generic function mechanism in Python, that wouldn't
be true.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 02:19:44 2002
From: (David Abrahams)
Date: Thu, 15 Aug 2002 21:19:44 -0400
Subject: [Python-Dev] type categories
References: <>
Message-ID: <14db01c244c3$0cc231c0$>

From: "Greg Ewing" <>

> David Abrahams <>:
> > This concern seems most un-pythonic to my eye, since there are already
> > all kinds of ways any module can change the behavior of any call in
> > another module.
> Yes, but most of the time you don't have to use them!
> With this feature, it would be the *normal* way of using
> it.

I don't understand. You still don't have to use it. Nobody would force you
to add or encourage multimethod overloads. In fact, I think it would be
most appropriate if multimethods meant to be overloaded had to be declared

> > forcing all the definitions to be in one place prevents an important
> > (you might even say the most important) use case: the author of a new
> > type should be able to provide a multimethod implementation
> > corresponding to that type.
> You can get that without the notion of a generic function as a
> separate entity. Just have a dispatch mechanism that looks in all the
> arguments of a call for a method to use, instead of just the first
> one.
> That would be relatively tractable, since at least you'd know that the
> method must be found in one of the argument classes somewhere.


(Sorry, I'm overreacting... but just a little)

That approach suffers from all the problems of Koenig lookup in C++.
Namely, if I provide a method foo in my class, and two different modules
are invoking "foo", whose idea of the "foo" semantics am I implementing?
That really becomes a problem for authors of generic functions (the ones
that call the multimethods) because every time they call a function it
essentially reserves the name of that function for the semantics they expect
it to have. This is currently, IMO, one of the most-intractable problems in
C++ and I'd hate to see Python go down that path.** If you want to see the
gory details, ask me to send you my paper about it.

> It also doesn't suffer from the who-imports-the-module problem, since
> someone must have imported it in order to get an object of that class
> in the first place.

I don't think that's a serious problem. Multimethod definitions that apply
to a given type will typically be supplied by the same module as the type.

> The use case that this doesn't cover is where you're not defining a
> new class, just trying to add behaviour to handle a previously
> unanticipated combination of existing classes.  The organisational
> problems involved in that aren't unique to Python, and seem to me an
> inherent feature of the problem itself. Where does functionality
> belong that isn't owned by any class?

Often there's behavior associated with combinations of classes from the
same package or module. It's reasonable to supply that at module scope.
Besides the practical problems mentioned above, I think it's unnatural to
try to tie MULTImethod implementations to a single class. When you try to
generalize that arrangement to two arguments, you end up with something
like the __add__/__radd__ system, and generalizing it to three arguments is
next to impossible.

Where to supply multimethods that work for types defined in different
modules/packages is an open question, but it's a question that applies to
both the class-scope and module-scope approaches


** Python is very nice about using explicit qualification to associate
semantics with implementation (i.e. we write and not just
foo(x)), and this would be a major break with that tradition. Explicit is
better than implicit.

From David Abrahams" <  Fri Aug 16 02:26:26 2002
From: David Abrahams" < (David Abrahams)
Date: Thu, 15 Aug 2002 21:26:26 -0400
Subject: [Python-Dev] type categories
References: <>
Message-ID: <150601c244c3$f43264d0$>

From: "Greg Ewing" <>

> David Abrahams <>:
> > Koenig lookup raises a similar issue: that the behavior of a function
> > call can be changed depending on which headers are #included, and even
> > the order they're #included in.
> But at least you can, in principle, figure out what will be done by a
> particular call in the source, by examining the files included by that
> source file.
> With the proposed generic function mechanism in Python, that wouldn't
> be true.

Like anything in Python, you can figure out what will happen by 

a. examining all the source that will be executed
b. examining the state of things at runtime.

What's new?

           David Abrahams * Boost Consulting *


From  Fri Aug 16 03:11:09 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 14:11:09 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <14db01c244c3$0cc231c0$>
Message-ID: <>

David Abrahams <>:

> I think it's unnatural to try to tie MULTImethod implementations to a
> single class.

I'm not sure what you mean by that. What I was talking about wouldn't
be tied to a single class. Any given method implementation would have
to reside in some class, but the dispatch mechanism would be
symmetrical with respect to all its arguments.

> When you try to generalize that arrangement to two arguments, you
> end up with something like the __add__/__radd__ system, and
> generalizing it to three arguments is next to impossible.

But it's exactly the same problem as implementing a generic function
dispatch mechanism. If you can solve one, you can solve the other.

I'm talking about replacing

  f(a, b, c)

where f is a generic function, with

  (a, b, c).f

(not necessarily that syntax, but that's more or less what it would
mean.) The dispatch mechanism -- whatever it is -- is the same,
but the generic function entity itself doesn't exist.
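A minimal sketch of the kind of lookup described above, assuming a simple left-to-right search of the argument classes. The message leaves the dispatch mechanism open, so `dispatch`, the `position` parameter, and the `Circle`/`Square` classes are all hypothetical names chosen for illustration.

```python
def dispatch(name, *args):
    """Search the class of every argument, left to right, for `name`."""
    for position, arg in enumerate(args):
        method = getattr(type(arg), name, None)
        if method is not None:
            # The owning class receives every argument, plus the position
            # at which one of its own instances appeared in the call.
            result = method(position, *args)
            if result is not NotImplemented:
                return result
    raise TypeError("no argument class implements %r" % name)


class Circle:
    pass


class Square:
    @staticmethod
    def overlap(position, a, b):
        return "Square.overlap handled (%s, %s)" % (
            type(a).__name__, type(b).__name__)


# There is no generic function object called `overlap`; the name is
# resolved only by looking inside the classes of the arguments.
print(dispatch("overlap", Circle(), Square()))
print(dispatch("overlap", Square(), Circle()))
```

Note that the method still has to live in exactly one class (here `Square`), which is the point under dispute in this thread.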

> Where to supply multimethods that work for types defined in
> different modules/packages is an open question, but it's a question
> that applies to both the class-scope and module-scope approaches

The class-scope approach would be saying effectively that
you're not allowed to have a method that doesn't belong in 
any class -- you have to pick a class and put it there.

That doesn't solve the problem, I know, but at least it
would be explicit about not solving it!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From David Abrahams" <  Fri Aug 16 03:09:59 2002
From: David Abrahams" < (David Abrahams)
Date: Thu, 15 Aug 2002 22:09:59 -0400
Subject: [Python-Dev] type categories
References: <>
Message-ID: <154301c244ca$0c338220$>

From: "Greg Ewing" <>

> David Abrahams <>:
> > I think it's unnatural to try to tie MULTImethod implementations to a
> > single class.
> I'm not sure what you mean by that. What I was talking about wouldn't
> be tied to a single class. Any given method implementation would have
> to reside in some class,

"reside in" is approximately equivalent to what I meant by "tied to". I
think it's unnatural to force users to associate a function designed to be
considered symmetrically over a combination of types (and "type
categories", I hope) with a single one of those types.

That approach also prevents another important use-case:

I want to use type X with generic function F, and can write a plausible
implementation of some multimethod call used by F for X, but the author of
X didn't supply it

> > When you try to generalize that arrangement to two arguments, you
> > end up with something like the __add__/__radd__ system, and
> > generalizing it to three arguments is next to impossible.
> But it's exactly the same problem as implementing a generic function
> dispatch mechanism. If you can solve one, you can solve the other.
> I'm talking about replacing
>   f(a, b, c)
> where f is a generic function, with
>   (a, b, c).f
> (not necessarily that syntax, but that's more or less what it would
> mean.) The dispatch mechanism -- whatever it is -- is the same,
> but the generic function entity itself doesn't exist.

Are you talking about allowing the "self" argument of a multimethod to
appear in any position in the argument list? Otherwise you get a
proliferation of __add__, __radd__, __r2add__, etc. methods

> > Where to supply multimethods that work for types defined in
> > different modules/packages is an open question, but it's a question
> > that applies to both the class-scope and module-scope approaches
> The class-scope approach would be saying effectively that
> you're not allowed to have a method that doesn't belong in
> any class -- you have to pick a class and put it there.
> That doesn't solve the problem, I know, but at least it
> would be explicit about not solving it!

And that's progress?

Anyway, you've managed to avoid the most important problem with this
approach (and this is a way of rephrasing my analogy to the problems with
Koenig lookup in C++): it breaks namespaces. When a module author defines a
generic function he shouldn't be forced to go to some name distribution
authority to find a unique name: some_module.some_function should be enough
to ensure uniqueness. Class authors had better be able to say precisely
which module's idea of "some_function" they're implementing. If you want
class authors to write something like:

    def some_module.some_function(T1: x, T2: y)

within the class body, I guess it's OK with me, but it seems rather
pointless to force the association with a class in that case, since the
really important association is with the module defining the generic
function's semantics. Explicit is better than implicit.

           David Abrahams * Boost Consulting *

From  Fri Aug 16 03:36:01 2002
From: (Steve Holden)
Date: Thu, 15 Aug 2002 22:36:01 -0400
Subject: [Python-Dev] CGIHTTPServer interactions with Internet Explorer
Message-ID: <07da01c244cd$ac5d3810$>

I'm currently researching some changes needed to solve a couple of bugs
(430160 and 428345) where Internet Explorer (ironically in the name of
Netscape compatibility, though as far as I can see Netscape stopped doing
this at about release 2) will send an extra CRLF over and above the
advertised Content-Length in a POST method input stream.

If the server closes the socket before removing this input, IE somehow gets
confused, and will (usually) send a second POST request, (most often)
followed by a GET request. [This had me tearing my hair out for three days
when writing PWP].

With Kevin Altis' help I have what appears to be a basic fix for
CGIHTTPServer, but there are a couple of points I'd appreciate some advice

1) Although the basic code can use select() to ensure the input stream is no
longer readable (and therefore presumably flushed), I'm not confident enough
about the modifications to assert that they'll work when assembled with
Forking or Threading mixins. If anyone knows the code well enough to offer
an opinion it would be helpful.

2) I understand that the appropriate RFC mandates that SCRIPTS must not read
more than Content-Length bytes and believe this is the relevant quote:

> > When a CGI gets a POSTed request, the "message-body" appears on standard
> > input:
> >
> >   6.2. Request Message-Bodies
> >
> >    As there may be a data entity attached to the request, there
> MUST be a
> >    system defined method for the script to read these data.
> Unless defined
> >    otherwise, this will be via the 'standard input' file descriptor.
> >
> >    If the CONTENT_LENGTH value (see section 6.1.2) is non-NULL,
> the server
> >    MUST supply at least that many bytes to scripts on the standard input
> >    stream. Scripts are not obliged to read the data. Servers MAY signal
> >    an EOF condition after CONTENT_LENGTH bytes have been read, but are
> >    not obligated to do so. Therefore, scripts MUST NOT attempt to read
> >    more than CONTENT_LENGTH bytes, even if more data are available.

Clearly this would be significant for HTTP/1.1. Technically the change would
be the *server* reading the extra bytes and not the *script*. Under HTTP/1.0
I suspect I can assume nothing will break. I'm less happy if a persistent
connection is invoked, since I'm just sucking on the socket until it comes
up empty. This could clearly interfere with a request with a "Connection:
Keep-Alive" header. Does anyone know whether IE uses this header when it's
indulging in the error behavior?
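A minimal sketch of one cautious way to remove the stray bytes, assuming the server holds the client socket object. This is not the patch referred to in this message: instead of reading until the socket is empty, it peeks at the pending input and discards it only if it is exactly a CRLF, so a pipelined request on a persistent connection would be left alone. The function name `discard_trailing_crlf` is hypothetical.

```python
import select
import socket


def discard_trailing_crlf(sock, timeout=0.0):
    """Consume a pending CRLF on sock, if there is one.

    Returns True when two bytes were discarded, False otherwise.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False                      # nothing waiting to be read
    # MSG_PEEK inspects the pending bytes without removing them.
    if sock.recv(2, socket.MSG_PEEK) == b"\r\n":
        sock.recv(2)                      # now actually consume the CRLF
        return True
    return False                          # something else: leave it alone
```

One limitation of this sketch: if only the CR has arrived when the check runs, nothing is discarded, so a real fix would need to decide how long to wait for the second byte.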

The current first-round  patch is available under

if anyone wants to test it and let me know of any problems or suggestions
for improvement.

Steve Holden                       
Python Web Programming      

From  Fri Aug 16 04:42:13 2002
From: (Greg Ewing)
Date: Fri, 16 Aug 2002 15:42:13 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <154301c244ca$0c338220$>
Message-ID: <>

David Abrahams <>:

> I think it's unnatural to force users to associate a function
> designed to be considered symmetrically over a combination of types
> ... with a single one of those types.

I agree, but the alternative seems to be for it to reside
"out there" somewhere in an unknown location from the point
of view of code which uses it.

> Are you talking about allowing the "self" argument of a multimethod to
> appear in any position in the argument list?

Something like that. I haven't really thought it through
very much. Maybe the method name could be systematically
modified to indicate which argument position is "self",
or something like that.

> > That doesn't solve the problem, I know, but at least it
> > would be explicit about not solving it!
> And that's progress?

Maybe. I don't know. At least it would generalise and make available
to the user what's already there in an ad-hoc way to deal with numeric

> Class authors had better be able to say precisely
> which module's idea of "some_function" they're implementing. If you want
> class authors to write something like:
>     def some_module.some_function(T1: x, T2: y)

That would only be an issue if T1 and T2 were already both using the
name some_function for incompatible purposes. That can happen now
anyway with multiple inheritance -- name clashes can occur whenever
you inherit from two pre-existing classes. I don't see that it's any
more of a problem here.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 16 07:50:51 2002
From: (Barry A. Warsaw)
Date: Fri, 16 Aug 2002 02:50:51 -0400
Subject: [Python-Dev]
Message-ID: <>

I don't know how many of you are hip to this yet, but in case you
occasionally feel overwhelmed by traffic on the various python lists,
or if you just prefer a news interface over your mail reader, or if
you occasionally want to drop in on a thread for a list you don't
normally follow, you should check out  This domain is run
by Lars Magne Ingebrigtsen, the guy who wrote the GNUS mail and news
reader for Emacs.  It's basically a non-expiring mail/news gateway for
mailing lists.

Among the tons of lists it carries, it's got all the Python lists I've
heard of and many I haven't, although it takes a little figuring out
the mapping (e.g. python-list <-> gmane.comp.python.general).  There's
a web page that can help you find the list you're interested in.

Check out for details.


From  Fri Aug 16 09:27:17 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Fri, 16 Aug 2002 10:27:17 +0200
Subject: [Python-Dev] Mutable exceptions? (was Re: PEP 293, Codec Error
 Handling Callbacks)
References: <> <>
Message-ID: <>

Guido van Rossum wrote:

>>This looks like an issue that potentially deserves more community
>>feedback, so I'm ripping it out and starting a new thread: should
>>exception objects be treated as mutable as exceptions get caught and
> (A new thread in python-dev hardly counts as "community feedback".)
> I'd say definitely.  Code like this looks reasonable to me:
>   def some_function(arg):
>     try:
>       call_some_other_function(arg)
>     except SomeExpectedExceptionClass, obj:
>       obj.add_context(arg)
>       raise
> Then some outer piece of code catches exceptions and produces a
> traceback augmented by information added by various calls to
> add_context().

So, if add_context() changes any exception attribute that was
originally specified in the constructor and is thus part of
the args attribute, should this change be reflected in the
args attribute?

    Walter Dörwald

From  Fri Aug 16 09:36:16 2002
From: (Michael Hudson)
Date: 16 Aug 2002 09:36:16 +0100
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: "Tim Peters"'s message of "Thu, 15 Aug 2002 12:54:29 -0400"
References: <>
Message-ID: <>

"Tim Peters" <> writes:

> [Michael Hudson]
> > Beats me.  I still see a healthy speed up:
> >
> > Before:
> >
> > $ ./python ../Lib/test/
> > Pystone(1.1) time for 50000 passes = 3.99
> > This machine benchmarks at 12531.3 pystones/second
> >
> > After:
> >
> > $ ./python ../Lib/test/
> > Pystone(1.1) time for 50000 passes = 3.65
> > This machine benchmarks at 13698.6 pystones/second
> >
> > (which is nosing on for 10% faster, actually).
> >
> > You're not testing a debug vs a release build or anything like that
> > are you?
> I'm not, but I was comparing -O times (in release builds).

Ah.  FWIW gcc makes my patch a small win even with -O.

> Three runs before patch:
> This machine benchmarks at 14295.7 pystones/second
> Three runs after patch:
> This machine benchmarks at 13351.6 pystones/second


> Three runs after commenting out the new
> on the eval-loop critical path:
> This machine benchmarks at 13910.4 pystones/second

This makes no sense; after you've commented out the trace stuff, the
only difference left is that the switch is smaller!

Actually, there are some other changes, like always updating
f->f_lasti, and allocating 8 more bytes on the stack.  Does commenting
out the definition of instr_lb & instr_ub make any difference?

> OTOH, MSVC 6 has been generating faster ceval.c code than gcc for a long
> time; given how touchy this is, maybe it's just time for gcc to win 587 coin
> flips in a row <wink>.

Does reading assembly give any clues?  Not that I'd really expect
anyone to read all of the main loop...

I'm baffled.  Perhaps you can put SET_LINENO back in for the Windows
build <1e-6 wink>.


  Programming languages should be designed not by piling feature on
  top of feature, but by removing the weaknesses and restrictions
  that make the additional features appear necessary.
               -- Revised(5) Report on the Algorithmic Language Scheme

From  Fri Aug 16 10:20:23 2002
From: (Michael Hudson)
Date: 16 Aug 2002 10:20:23 +0100
Subject: [Python-Dev] type categories
In-Reply-To: Greg Ewing's message of "Fri, 16 Aug 2002 12:18:41 +1200 (NZST)"
References: <>
Message-ID: <>

Greg Ewing <> writes:

> > I mean, you can currently do
> > 
> > import mod
> > 
> > mod.func = my_func # evil cackle!
> That's my point -- f.add(method) is just like doing that.
> [stuff]

You raise reasonable questions.  The thought that occurs to me is that
people using CLOS must face similar issues, and I don't hear of them
as being great hold ups.  I know CL a bit, but I've never really used
CLOS in anger -- anyone else?

Some of the issues might be less because of CL not being a Lisp1 and
modularization working differently, I guess.


  First of all, email me your AOL password as a security measure. You
  may find that won't be able to connect to the 'net for a while. This
  is normal. The next thing to do is turn your computer upside down
  and shake it to reboot it.                     -- Darren Tucker, asr

From  Fri Aug 16 12:35:41 2002
From: (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Fri, 16 Aug 2002 13:35:41 +0200
Subject: [Python-Dev] mimetypes patch #554192
References: <> <>
Message-ID: <>

Barry A. Warsaw wrote:
>>>>>>"WD" == Walter Dörwald <> writes:
>     WD> Martin v. Loewis and I were discussing whether it would make
>     WD> sense to make the helper method add_type (which is used for
>     WD> adding a mapping between one type and one extension) visible
>     WD> on the module level.
>     WD> Any comments?
> +1 on add_types() being public, but it should probably have a strict
> flag to decide whether to add the new entry to the standard types dict
> or the common types dict.

OK, so we probably need a reverse mapping for common_types too, but 
shouldn't we consider common_types to be fixed?

Maybe we should add a guess_all_types too, so we can handle duplicate 
extensions, i.e.
 >>> mimetypes.guess_all_types(".cdf")
['application/x-cdf', 'application/x-netcdf']

This would of course require changing the initialization of types_map
from a dict constant to many calls to add_type.
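A minimal sketch of that idea, assuming a small stand-alone registry rather than the real mimetypes module. `TypeRegistry` and its methods are hypothetical names; only the `.cdf` example comes from the message above.

```python
class TypeRegistry:
    """Hypothetical registry; not the mimetypes module's actual API."""

    def __init__(self):
        self.types_by_ext = {}          # extension -> list of MIME types

    def add_type(self, mime_type, ext):
        # Each call adds one (type, extension) pair; duplicates are ignored.
        known = self.types_by_ext.setdefault(ext.lower(), [])
        if mime_type not in known:
            known.append(mime_type)

    def guess_all_types(self, ext):
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self.types_by_ext.get(ext.lower(), []))


registry = TypeRegistry()
# The table is built by repeated add_type calls instead of a dict constant,
# which is what lets one extension map to several types.
for mime_type, ext in [
    ("application/x-cdf", ".cdf"),
    ("application/x-netcdf", ".cdf"),
    ("image/jpeg", ".jpeg"),
]:
    registry.add_type(mime_type, ext)

print(registry.guess_all_types(".cdf"))
# ['application/x-cdf', 'application/x-netcdf']
```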

Even better would be, if we could assign priorities to the mappings,
so that for e.g. image/jpeg the preferred extension is .jpeg.
Then guess_type() and guess_extension() would return the preferred

    Walter Dörwald

From  Fri Aug 16 14:04:15 2002
From: (Samuele Pedroni)
Date: Fri, 16 Aug 2002 15:04:15 +0200
Subject: [Python-Dev] Multimethods (quel horreur?)
Message-ID: <003101c24525$6cfcab80$6d94fea9@newmexico>

[f,g,... are functions ;  T1,T2,T3 are type tuples a.k.a multi-method
signatures, T1<T2 means the corresponding elements
are "subtypes" ]

1. if I see things correctly, you are really fiddling with a multi-method
M = { (f,T1), (g,T2), ... } from a library

only if you do e.g.

add(M,(h,T3))  with T3==T2 (*) or T1<T3<T2.

[Things one typically does not do without thinking

[Btw up to (*), missing a module or calling
a generic function before all modules are loaded,
load order does not count.]

If T3 is < or incomparable with all the signatures in M,
is not fiddling. Incomparable can mean dispatch ambiguities
but that's a different issue.

import lib

class Sub(lib.Class1):

class New:

addmethod Sub,y: Sub):

addmethod lib.meth(arg: Sub):

is no different than defining new classes and subclassing and
overriding methods. Also the kind of resulting program
logic scattering is not that different under normal usage.

2. Dispatch ambiguities: the more predictable the rule
the better, the best-fit of multimethod does not match
such a criterion, see my previous postings.

Rules for CLOS:
(NB things are eminently configurable in CLOS)

Rules for Dylan:

The class precedence list is the same notion
as Python 2.2 mro.

See my posted code for the idea of redispatching
on forced types, which seems to me reasonably Pythonic
and allows OTOH to choose a very strict approach
in face of ambiguity because there's anyway a user
controllable escape.

My opinion: left-to-right and refuse-ambiguity are both
reasonable approaches, depending on the generic function.

3. IMO documentation, doc strings, and introspection
should be enough to tell generic functions apart.
The proposed notation or whatever should be at most
 just syntax sugar:

(a,b,c).f(d) === f(a,b,c,d) in general.

Generic functions should be first-class object
that can be passed around and used everywhere
functions can be. Already today f(a,b) 
can invoke a function, a callable instance
(maybe implementing some multimethod logic given that
one can write multimethod support in pure Python),
a bound method ...

4.  AFAIK multimethods were invented in environments
with code developed and defined in memory incrementally
and libraries loaded, that means that through introspection
one could list the methods of a generic function and jump
to the various definitions points, and warnings could
be issued for redefinitions and such (see 1.).
For more static approaches to introspection 
syntactic sugar would be probably useful: 

defgeneric addmethod vs. f.add(...)

Of course a Python impl could also
optionally emit warnings etc, this requiring
the good practice to load library code before
user code, and being maximally useful in
an incremental environment.

5. It is true that once you have multimethods you have
the choice:

class C: 
  def meth(...): ...


class C: ...

defgeneric meth

addmethod meth(obj: C): ...

6. It is true that generic functions kind of add
a new degree of freedom to the modularity problem

7. The less disruptive and convenient usage for 
multimethods is to get something like overloading:

defgeneric f

addmethod f(x: int): ...

addmethod f(s: str): ...


class A:
  defgenericmethod meth

  addmethodmethod meth(self,x: int): ...

  addmethodmethod meth(self,s: str): ...

class B(A):
  defgenericmethod meth 
  # in case of no match it is supposed to redispatch
  # to the superclasses

  addmethodmethod f(self,x: int): ...

Syntax and semantics are just hypothetical
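For comparison, the overloading-style usage in point 7 can be approximated with `functools.singledispatch`, which was added to the standard library in later Python versions. It dispatches on the type of the first argument only, so this is a sketch of the single-argument case and not of full multimethods; the function bodies are placeholders.

```python
from functools import singledispatch


@singledispatch
def f(x):
    # Plays the role of "defgeneric f": the fallback when nothing matches.
    raise TypeError("no method of f for %s" % type(x).__name__)


@f.register(int)
def _(x):
    # Roughly "addmethod f(x: int)".
    return "int method"


@f.register(str)
def _(s):
    # Roughly "addmethod f(s: str)".
    return "str method"


print(f(3))      # int method
print(f("abc"))  # str method
```

The generic function `f` is an ordinary first-class object here, which matches point 3: it can be passed around and called like any other function.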

-- * --

For an overview on CLOS with some
words about generic functions vs. message passing
and modularity issues:

The Common Lisp Object System: An Overview 
Linda G. DeMichiel and Richard P. Gabriel

[Unrelated: is the personal site of
Richard P. Gabriel, some interesting writings there on

For the description of a large generic function
based system:

Common Lisp Interface Manager 
CLIM II Specification

[I said large.]

I'm not claiming to be an expert on multi-methods,
just I have played with the notion, read about,
and thought of it before this discussion.


PS: this is my input together with what I have already
posted, if something is unclear please ask.

From  Fri Aug 16 14:11:33 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 09:11:33 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <14c201c244b2$8cc651f0$>
References: <>
Message-ID: <>

ark> It makes me uneasy because the behavior of programs might depend
ark> on the order in which modules are loaded.  That's why I didn't
ark> suggest a way of defining the variations on f in separate places.

David> This concern seems most un-pythonic to my eye, since there are
David> already all kinds of ways any module can change the behavior of
David> any call in another module. The most direct way is by
David> rebinding the implementation of another module's
David> function. Python is a dynamic language, and that is usually
David> seen as a strength.

Indeed.  What concerns me is not dynamic behavior, but order-dependent
behavior that might be occurring behind the scenes.  I would really like
to be confident that if I write

        import x, y

it has the same effect as

        import y, x

I understand that there is no guarantee of that property now, but I suspect
that most people write programs in a way that does guarantee it.  I would
hate to see the language evolve in ways that makes it substantially more
difficult to avoid such order dependencies, so I am reluctant to propose
a feature that would increase that difficulty.

David> More importantly, though, forcing all the definitions to be in
David> one place prevents an important (you might even say the most
David> important) use case: the author of a new type should be able to
David> provide a multimethod implementation corresponding to that
David> type. For example, if I write a rational number class, I should
David> be able to plug in a corresponding arctan implementation.

Yes.  I'm not saying such a feature shouldn't exist; just that I
don't know what form it should take.

David> I'm extra-surprised to see that Andy's uneasy about this, since
David> a C++ feature which (colloquially) bears his name was
David> purpose-built to make this sort of thing possible. Koenig
David> lookup raises a similar issue: that the behavior of a function
David> call can be changed depending on which headers are #included,
David> and even the order they're #included in.

The C++ #include mechanism, based as it is on copying source text,
offers almost no hope of sensible behavior without active
collaboration from programmers.

David> [I personally have many other concerns about how that feature
David> worked out in C++ - paper available on request - but the Python
David> implementation being suggested here suffers none of those
David> problems because of its explicit nature]

Which doesn't mean it can't suffer from other problems.  In
particular, if I know that modules x and y overload the same function,
and I want to be sure that x's case is tested first, one would think I
could ensure it by writing

        import x, y

But in fact I can't, because someone else may have imported y already,
in which case the second import is a no-op.
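A small demonstration of that effect, assuming three throwaway modules written to a temporary directory. The module names `demo_registry`, `demo_x` and `demo_y` are invented for this sketch; each of the last two registers itself in a shared list when its body runs.

```python
import importlib
import pathlib
import sys
import tempfile

tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "demo_registry.py").write_text("methods = []\n")
for name in ("demo_x", "demo_y"):
    # Each module appends its own name when its body is executed.
    pathlib.Path(tmp, name + ".py").write_text(
        "import demo_registry\n"
        "demo_registry.methods.append(%r)\n" % name
    )
sys.path.insert(0, tmp)

importlib.import_module("demo_y")   # "someone else" imported y earlier
importlib.import_module("demo_x")   # our own code: import x, y
importlib.import_module("demo_y")   # already in sys.modules, body not re-run

import demo_registry
print(demo_registry.methods)        # ['demo_y', 'demo_x']
```

The registration order follows the first import of each module anywhere in the program, not the order written in any one import statement.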

Andrew Koenig,,

From  Fri Aug 16 14:17:40 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 09:17:40 -0400
Subject: [Python-Dev] Mutable exceptions? (was Re: PEP 293, Codec Error Handling Callbacks)
In-Reply-To: Your message of "Fri, 16 Aug 2002 10:27:17 +0200."
References: <> <>
Message-ID: <>

> > I'd say definitely.  Code like this looks reasonable to me:
> > 
> >   def some_function(arg):
> >     try:
> >       call_some_other_function(arg)
> >     except SomeExpectedExceptionClass, obj:
> >       obj.add_context(arg)
> >       raise
> > 
> > Then some outer piece of code catches exceptions and produces a
> > traceback augmented by information added by various calls to
> > add_context().
> So, if add_context() changes any exception attribute that was
> originally specified in the constructor and is thus part of
> the args attribute, should this change be reflected in the
> args attribute?

Usually yes, but that's up to the class that defines add_context().
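A minimal sketch of an exception class that makes that choice, written in current Python syntax (`except ... as`). The class `ContextError` and the functions `inner` and `outer` are hypothetical; the only assumption taken from the thread is an `add_context()` method that mutates the exception before it is re-raised.

```python
class ContextError(Exception):
    """Exception that accumulates context as it propagates."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message
        self.context = []

    def add_context(self, item):
        self.context.append(item)
        # This class chooses to keep args in step with the mutated state.
        self.args = (self.message, tuple(self.context))


def inner(arg):
    raise ContextError("failed")


def outer(arg):
    try:
        inner(arg)
    except ContextError as exc:
        exc.add_context(arg)   # mutate the exception in flight
        raise                  # re-raise the same object


try:
    outer(42)
except ContextError as exc:
    print(exc.args)            # ('failed', (42,))
```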

--Guido van Rossum (home page:

From David Abrahams" <  Fri Aug 16 14:00:40 2002
From: David Abrahams" < (David Abrahams)
Date: Fri, 16 Aug 2002 09:00:40 -0400
Subject: [Python-Dev] type categories
References: <><><14c201c244b2$8cc651f0$> <>
Message-ID: <166701c24525$1859a5b0$>

From: "Andrew Koenig" <>

> ark> It makes me uneasy because the behavior of programs might depend
> ark> on the order in which modules are loaded.  That's why I didn't
> ark> suggest a way of defining the variations on f in separate places.
> David> This concern seems most un-pythonic to my eye, since there are
> David> already all kinds of ways any module can change the behavior of
> David> any call in another module. The most direct way is by
> David> rebinding the implementation of another module's
> David> function. Python is a dynamic language, and that is usually
> David> seen as a strength.
> Indeed.  What concerns me is not dynamic behavior, but order-dependent
> behavior that might be occurring behind the scenes.  I would really like
> to be confident that if I write
>         import x, y
> it has the same effect as
>         import y, x
> I understand that there is no guarantee of that property now, but I suspect
> that most people write programs in a way that does guarantee it.  I would
> hate to see the language evolve in ways that makes it substantially more
> difficult to avoid such order dependencies, so I am reluctant to propose
> a feature that would increase that difficulty.

Oh, easily solved: "in the face of ambiguity, refuse the temptation to
guess". There should be a best match rule, and if there are two best
matches, it's an error.
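A minimal sketch of such a rule, assuming that "best" means a signature at least as specific, position by position, as every other applicable signature. `BestMatchFunction` and `AmbiguousCall` are hypothetical names, and other definitions of "best" are possible.

```python
class AmbiguousCall(TypeError):
    pass


class BestMatchFunction:
    def __init__(self):
        self.methods = []                 # (signature, implementation)

    def add(self, signature, implementation):
        self.methods.append((tuple(signature), implementation))

    def __call__(self, *args):
        applicable = [
            (sig, impl) for sig, impl in self.methods
            if len(sig) == len(args)
            and all(isinstance(a, t) for a, t in zip(args, sig))
        ]
        if not applicable:
            raise TypeError("no applicable method")
        # A best match is at least as specific as every applicable signature.
        best = [
            (sig, impl) for sig, impl in applicable
            if all(all(issubclass(s, o) for s, o in zip(sig, other))
                   for other, _ in applicable)
        ]
        if len(best) != 1:
            raise AmbiguousCall("no unique best match")
        return best[0][1](*args)


class A:
    pass


class B(A):
    pass


g = BestMatchFunction()
g.add((A, A), lambda x, y: "AA")
g.add((A, B), lambda x, y: "AB")
g.add((B, A), lambda x, y: "BA")
print(g(A(), B()))                        # AB
```

With these three methods, `g(B(), B())` raises `AmbiguousCall` whatever order they were added in, because neither `(A, B)` nor `(B, A)` is more specific than the other.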

           David Abrahams * Boost Consulting *

From  Fri Aug 16 14:28:48 2002
From: (Michael Hudson)
Date: 16 Aug 2002 14:28:48 +0100
Subject: [Python-Dev] type categories
In-Reply-To: Andrew Koenig's message of "16 Aug 2002 09:11:33 -0400"
References: <> <> <14c201c244b2$8cc651f0$> <>
Message-ID: <>

Andrew Koenig <> writes:

> ark> It makes me uneasy because the behavior of programs might depend
> ark> on the order in which modules are loaded.  That's why I didn't
> ark> suggest a way of defining the variations on f in separate places.
> David> This concern seems most un-pythonic to my eye, since there are
> David> already all kinds of ways any module can change the behavior of
> David> any call in another module. The most direct way is by
> David> rebinding the implementation of another module's
> David> function. Python is a dynamic language, and that is usually
> David> seen as a strength.
> Indeed.  What concerns me is not dynamic behavior, but order-dependent
> behavior that might be occurring behind the scenes.  I would really like
> to be confident that if I write
>         import x, y
> it has the same effect as
>         import y, x
> I understand that there is no guarantee of that property now, but I suspect
> that most people write programs in a way that does guarantee it.  I would
> hate to see the language evolve in ways that makes it substantially more
> difficult to avoid such order dependencies, so I am reluctant to propose
> a feature that would increase that difficulty.

I may be getting lost in subthreads here, but are we still talking
about multimethods?  If we are, then surely any sane multimethod
system's method resolution has to be independent of the order of
method definition.  There are ways of doing this.

An implementation along the lines of:

def match(args, spec):
  for a, t in zip(args, spec):
    if not isinstance(a, t):
      return False
  return True

class multi_method:
  def __init__(self):
    self.methods = []
  def add(self, f, typespec):
    self.methods.append((f, typespec))
  def __call__(self, *args):
    for meth, typespec in self.methods:
      if match(args, typespec):
        return meth(*args)

is just insane.
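To make the order dependence concrete, here is a small self-contained sketch of the same first-match strategy. The helper `first_match` and the two method functions are hypothetical names; the point is only that the same two methods give different answers depending on registration order.

```python
def first_match(methods, *args):
    """Call the first registered method whose type spec matches args."""
    for impl, spec in methods:
        if all(isinstance(a, t) for a, t in zip(args, spec)):
            return impl(*args)
    raise TypeError("no applicable method")


def on_int(x):
    return "int"


def on_object(x):
    return "object"


# The same two methods, registered in two different orders.
specific_first = [(on_int, (int,)), (on_object, (object,))]
general_first = [(on_object, (object,)), (on_int, (int,))]

print(first_match(specific_first, 1))   # int
print(first_match(general_first, 1))    # object
```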

If I've missed the source of concern, I'm sorry...


  incidentally, asking why things are "left out of the language" is
  a good sign that the asker is fairly clueless.
                                        -- Erik Naggum, comp.lang.lisp

From  Fri Aug 16 14:30:51 2002
From: (Michael Chermside)
Date: Fri, 16 Aug 2002 09:30:51 -0400
Subject: [Python-Dev] Re:
Message-ID: <>

 > [ offers mailing list <--> news gateway]

Thanks very much! I've always wanted to be able to use a newsreader to 
follow python newsgroups (c.p.l for instance) but since my source of 
connectivity doesn't provide access to news, I've had to make do with a 
limited "newsreader" I wrote which plucked its input from

I don't really need the non-expiring feature of gmane, and its mail <--> 
news gateway isn't as important to me, but the fact that it's an open 
news server is really appreciated!

-- Michael Chermside

From  Fri Aug 16 14:37:11 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 09:37:11 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <166701c24525$1859a5b0$>
References: <><><14c201c244b2$8cc651f0$> <> <166701c24525$1859a5b0$>
Message-ID: <>

David> Oh, easily solved: "in the face of ambiguity, refuse the
David> temptation to guess".  There should be a best match rule, and
David> if there are two best matches, it's an error.

In the ML example I showed earlier:

       fun len([]) = 0
         | len(h::t) = len(t) + 1

ordering is crucial: As long as the argument is not empty, both cases
match, so the language is defined to test the clauses in sequence.
My intuition is that people will often want to define category tests
to be done in a particular order.  There is no problem with such ordering
as long as all of the tests are specified together.

Once the tests are distributed, ordering becomes a problem, because
one person's intentional order dependency is another person's
ambiguity.  Which means that how one specifies distributed tests will
probably be different from how one specifies tests all in one place.

That's yet another reason I think it may be right to consider the
two problems separately.
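
A small Python sketch of the same point, using overlapping tests tried in
sequence.  The helper classify_ordered and the two clause lists are
hypothetical; the overlap comes from the fact that bool is a subclass of
int, so both tests accept True.

```python
def classify_ordered(value, clauses):
    # Try each (test, label) pair in sequence and stop at the first hit.
    for test, label in clauses:
        if test(value):
            return label
    raise ValueError("no clause matched")


def is_bool(v):
    return isinstance(v, bool)


def is_int(v):
    # Overlaps is_bool, because bool is a subclass of int.
    return isinstance(v, int)


specific_first = [(is_bool, "bool"), (is_int, "int")]
general_first = [(is_int, "int"), (is_bool, "bool")]

print(classify_ordered(True, specific_first))  # prints "bool"
print(classify_ordered(True, general_first))   # prints "int"
```

When both lists are written in one place the author can see and choose the
order; when the clauses arrive from different modules, the second result is
just as likely as the first.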

From  Fri Aug 16 14:45:31 2002
From: (Tim Peters)
Date: Fri, 16 Aug 2002 09:45:31 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>


has lots of good papers from the Cecil project, a pioneering
multiple-dispatch language.  Or you could save time reading and learn by
repeating their early mistakes <wink>.

From  Fri Aug 16 14:55:05 2002
From: (Martin Sjögren)
Date: 16 Aug 2002 15:55:05 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <><yu99sn1g46gi.fsf@europ><14c201c244b2$8cc651f0$>
Message-ID: <1029506105.4254.3.camel@ratthing-b3cf>

On Fri 2002-08-16 at 15:37, Andrew Koenig wrote:
> David> Oh, easily solved: "in the face of ambiguity, refuse the
> David> temptation to guess".  There should be a best match rule, and
> David> if there are two best matches, it's an error.
> In the ML example I showed earlier:
>        fun len([]) = 0
>          | len(h::t) = len(t) + 1
> ordering is crucial: As long as the argument is not empty, both cases
> match, so the language is defined to test the clauses in sequence.
> My intuition is that people will often want to define category tests
> to be done in a particular order.  There is no problem with such ordering
> as long as all of the tests are specified together.

What does "not empty" mean in this context? "not []"? Does h::t match []
or does [2] match []? Why is the ordering crucial? In Haskell:

f [] = 0
f (x:xs) = 1 + f xs

is totally equivalent with:

f (x:xs) = 1 + f xs
f [] = 0

Of course, if the different patterns overlap, THEN ordering is crucial,
I just find it odd that [] and h::t would overlap...
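
The disjoint case can be checked with a rough Python analogue of the two
Haskell definitions.  This is only a sketch: make_length and the clause
tuples are hypothetical names, and each clause is a (test, body) pair tried
in list order.

```python
def make_length(clauses):
    # Build a length function that tries the clauses in the order given.
    def length(xs):
        for applies, body in clauses:
            if applies(xs):
                return body(xs, length)
        raise ValueError("no clause matched")
    return length


# The two tests are disjoint: a list is either empty or it is not.
empty = (lambda xs: len(xs) == 0, lambda xs, rec: 0)
cons = (lambda xs: len(xs) > 0, lambda xs, rec: 1 + rec(xs[1:]))

len_a = make_length([empty, cons])
len_b = make_length([cons, empty])

print(len_a([2, 3, 5]), len_b([2, 3, 5]))  # prints "3 3"
```

Because no argument satisfies both tests, swapping the clauses cannot change
the result.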


Martin Sjögren              ICQ : 41245059
  Phone: +46 (0)31 7710870       Cell: +46 (0)739 169191
  GPG key:

From  Fri Aug 16 15:18:34 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 10:18:34 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <1029506105.4254.3.camel@ratthing-b3cf>
References: <>
Message-ID: <>

>> In the ML example I showed earlier:

>> fun len([]) = 0
>> | len(h::t) = len(t) + 1

>> ordering is crucial: As long as the argument is not empty, both cases
>> match, so the language is defined to test the clauses in sequence.
>> My intuition is that people will often want to define category tests
>> to be done in a particular order.  There is no problem with such ordering
>> as long as all of the tests are specified together.

Martin> What does "not empty" mean in this context? "not []"? Does h::t match []
Martin> or does [2] match []? Why is the ordering crucial? In Haskell:

Martin> f [] = 0
Martin> f (x:xs) = 1 + f xs

Martin> is totally equivalent with:

Martin> f (x:xs) = 1 + f xs
Martin> f [] = 0

I'm sorry, you're right.  In this particular example, there is no
overlap, so order doesn't matter.  However, the general point still
stands: ML patterns are order-sensitive in cases where there is

Andrew Koenig,,

From  Fri Aug 16 15:19:07 2002
From: (Barry A. Warsaw)
Date: Fri, 16 Aug 2002 10:19:07 -0400
Subject: [Python-Dev] Re:
References: <>
Message-ID: <>

>>>>> "MC" == Michael Chermside <> writes:

    MC> I don't really need the non-expiring feature of gmane, and its
    MC> mail <--> news gateway isn't as important to me, but the fact
    MC> that it's an open news server is really appreciated!

The other interesting thing is that it's faster than my ISP's
newsfeed, at least for and!


From  Fri Aug 16 15:11:45 2002
From: (Michael Hudson)
Date: 16 Aug 2002 15:11:45 +0100
Subject: [Python-Dev] Re: type categories
References: <> <> <14c201c244b2$8cc651f0$> <> <166701c24525$1859a5b0$> <>
Message-ID: <>

Andrew Koenig <> writes:

> David> Oh, easily solved: "in the face of ambiguity, refuse the
> David> temptation to guess".  There should be a best match rule, and
> David> if there are two best matches, it's an error.
> In the ML example I showed earlier:
>        fun len([]) = 0
>          | len(h::t) = len(t) + 1
> ordering is crucial: As long as the argument is not empty, both cases
> match, so the language is defined to test the clauses in sequence.
> My intuition is that people will often want to define category tests
> to be done in a particular order.  There is no problem with such ordering
> as long as all of the tests are specified together.

If multimethods make it into Python, I think (hope!) it's a safe bet
that they will look more like CLOS's multimethods than ML's pattern
matching.


  ZAPHOD:  You know what I'm thinking?
    FORD:  No.
  ZAPHOD:  Neither do I.  Frightening isn't it?
                   -- The Hitch-Hikers Guide to the Galaxy, Episode 11

From  Fri Aug 16 15:22:46 2002
From: (Samuele Pedroni)
Date: Fri, 16 Aug 2002 16:22:46 +0200
Subject: [Python-Dev] type categories
Message-ID: <005d01c24530$6511b040$6d94fea9@newmexico>

>has lots of good papers from the Cecil project, a pioneering
>multiple-dispatch language.  Or you could save time reading and learn by
>repeating their early mistakes <wink>.

it's prototype based, not class based so not everything is 
relevant, but at least the survey part
in (not the algo descr) is relevant to the discussion at hand:

Efficient Multiple and Predicate Dispatching

btw predicate dispatching is a generalization of multimethod dispatch
but still is not the same as ML pattern matching form of function

"""Functions can actually perform pattern matching on the argument. The form:
fun f (x:t1):t2 => (case x
    of pat_1 => exp_1
    | ...
    | pat_n => exp_n)
can be written directly as:
    fun f pat_1 = exp_1
    | ...
    | f pat_n = exp_n

"""  []
for 'case' order is relevant as Andrew Koenig said.

I don't think it makes sense to generalize this in a non-local
way. From the point of view of predicate dispatch, all
the single patterns would be different "predicates",
so potentially ambiguous.

And you people are driving me mad: you cannot even
agree on the terminology and so avoid the tiniest
non-problems. Argh <wink>

everybody-should-do-his-homework'ly y'rs.

From  Fri Aug 16 15:35:56 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 10:35:56 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Michael> I may be getting lost in subthreads here, but are we still
Michael> talking about multimethods?

Well, I started by talking about type categories and ways of
writing programs that tested them.  Dave Abrahams said, in effect,
that I was really just talking about multimethods.  I'm still
not convinced.

Andrew Koenig,,

From  Fri Aug 16 15:48:10 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 10:48:10 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
Message-ID: <>

I can build Python 2.2.1 just fine on my Solaris 2.8 machine using gcc
3.1.1 and binutils 2.12.1

If I install either binutils 2.13 or the just-released gcc 3.2, I can
no longer build Python -- it dumps core quite far into the build process.

I don't really have a clue as to whether it's a gcc problem, a binutils
problem, or a Python problem.  Any suggestions as to how to proceed?

From  Fri Aug 16 16:04:16 2002
From: (Barry A. Warsaw)
Date: Fri, 16 Aug 2002 11:04:16 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
References: <>
Message-ID: <>

>>>>> "AK" == Andrew Koenig <> writes:

    AK> I can build Python 2.2.1 just fine on my Solaris 2.8 machine
    AK> using gcc 3.1.1 and binutils 2.12.1

    AK> If I install either binutils 2.13 or the just-released gcc
    AK> 3.2, I can no longer build Python -- it dumps core quite far
    AK> into the build process.

Stack trace?

    AK> I don't really have a clue as to whether it's a gcc problem, a
    AK> binutils problem, or a Python problem.  Any suggestions as to
    AK> how to proceed?

I started to try to build Py2.3cvs w/ gcc 3.2 but I had problems with
the gcc build.  I foolishly attempted to install it with an
alternative suffix so it wouldn't interfere with my existing gcc, but
that seems like a broken process.  I'm trying again, but it takes a
while to build gcc.


From  Fri Aug 16 16:09:00 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 11:09:00 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: Your message of "Fri, 16 Aug 2002 10:48:10 EDT."
References: <>
Message-ID: <>

> I can build Python 2.2.1 just fine on my Solaris 2.8 machine using gcc
> 3.1.1 and binutils 2.12.1
> If I install either binutils 2.13 or the just-released gcc 3.2, I can
> no longer build Python -- it dumps core quite far into the build process.
> I don't really have a clue as to whether it's a gcc problem, a binutils
> problem, or a Python problem.  Any suggestions as to how to proceed?

I haven't heard this before.  I guess gcc 3.2 is brand new?  I'm not
generally following gcc releases except from hearsay.

I suppose you *did* do a "make clean" before trying with a different
compiler?  Maybe even re-run configure.

If that doesn't help, try turning off optimization (edit the generated
Makefile to delete the "-O3" option, then make clean).  If that helps,
it must be a gcc optimizer problem.

If that doesn't help, it's still most likely to be a gcc or binutils
problem.

A SourceForge bug report might be in order regardless.

--Guido van Rossum (home page:

From  Fri Aug 16 16:22:58 2002
From: (Oren Tirosh)
Date: Fri, 16 Aug 2002 11:22:58 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Thu, Aug 15, 2002 at 12:46:25PM -0400, Tim Peters wrote:
> As the only person to have posted an example relying on this behavior, it's
> OK by me if that example breaks -- it was made up just to illustrate the
> possibility and raise a caution flag.  I don't take it seriously.

In Python it's easier to just use the string so there is no real incentive 
to use the id.  I would say that making the result of the intern() builtin
mortal is probably safe.
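
A short sketch of why holding the string is enough.  It assumes Python 3,
where the intern() builtin discussed here is spelled sys.intern();
interned_pair is a hypothetical helper.

```python
import sys


def interned_pair(text):
    # Build two equal strings at run time so that they start out as
    # separate objects, then intern both.
    first = "".join(list(text))
    second = "".join(list(text))
    return sys.intern(first), sys.intern(second)


a, b = interned_pair("python-dev")

# Interning maps equal strings to one shared object, so an identity
# test works for as long as the caller holds a reference.
print(a is b)  # prints "True"
```

Code that keeps a or b alive never depends on the interned string being
immortal; only code that stores id(a) and then drops the string would.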

The problem is in C extension modules. In C there is an incentive to rely
on the immortality of interned strings because it makes the code simpler.
There was an example of this in the Mac import code. PyString_InternInPlace 
should probably create immortal interned strings for backward compatibility 
(and deprecated, of course)

Maybe PyString_Intern should be renamed to PyString_InternReference to
make it more obvious that it modifies the pointer "in place".


From  Fri Aug 16 16:36:04 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 11:36:04 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
 (message from Guido van Rossum on Fri, 16 Aug 2002 11:09:00 -0400)
References: <> <>
Message-ID: <>

Guido> I haven't heard this before.  I guess gcc 3.2 is brand new?  I'm not
Guido> generally following gcc releases except from hearsay.

gcc 3.2 was released yesterday.

Guido> I suppose you *did* do a "make clean" before trying with a different
Guido> compiler?  Maybe even re-run configure.

Whenever I build Python, I start by unpacking the source-code
distribution into an empty directory, so I am quite confident that
there are no dregs left over from earlier builds.

Guido> If that doesn't help, try turning off optimization (edit the
Guido> generated Makefile to delete the "-O3" option, then make
Guido> clean).  If that helps, it must be a gcc optimizer problem.

I'll give that a try.

Guido> If that doesn't help, it's still most likely to be a gcc or
Guido> binutils problem.

There was definitely a problem with binutils 2.13 -- when handed
the distributed by ActiveState, it dumps core.
However, that problem does not occur if I build from
the tcl source distribution.  Nor does it occur with binutils 2.12.

Guido> A SourceForge bug report might be in order regardless.

I'll file one once I have identified the failure conditions more accurately.

From  Fri Aug 16 17:38:24 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 12:38:24 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>>> "AK" == Andrew Koenig <> writes:

AK> I can build Python 2.2.1 just fine on my Solaris 2.8 machine
AK> using gcc 3.1.1 and binutils 2.12.1

AK> If I install either binutils 2.13 or the just-released gcc
AK> 3.2, I can no longer build Python -- it dumps core quite far
AK> into the build process.

Barry> Stack trace?

OK, here's what I've been able to find so far.

It fails at the point in the installation process where it is trying to do this:

$ CC='gcc' LDSHARED='gcc -shared' OPT='-DNDEBUG -g -O3 -Wall -Wstrict-prototypes' ./python -E ./ build  
running build
running build_ext
skipping 'struct' extension (up-to-date)
Segmentation Fault - core dumped

Here's the back trace:

#0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0, 
    dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83
#1  0xfed517ec in frame_dummy ()
   from /export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
#2  0xfed516d4 in _init ()
   from /export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
#3  0xff3bc174 in ?? ()
#4  0xff3c0a8c in ?? ()
#5  0xff3c0ba8 in ?? ()
#6  0x0007b384 in _PyImport_GetDynLoadFunc (fqname=0x2 <Address 0x2 out of bounds>, 
    shortname=0x1002c8 "", 
    pathname=0xffbedbd8 "/export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", fp=0xfc1d8) at Python/dynload_shlib.c:90
#7  0x0006edd8 in _PyImport_LoadDynamicModule (name=0xffbee0c8 "struct", 
    pathname=0xffbedbd8 "/export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", fp=0xfc1d8) at Python/importdl.c:42
#8  0x0006b63c in load_module (name=0xffbee0c8 "struct", fp=0xfc1d8, 
    buf=0xffbedbd8 "/export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", type=3) at Python/import.c:1365
#9  0x0006c4e4 in import_submodule (mod=0xe08b8, subname=0xffbee0c8 "struct", 
    fullname=0xffbee0c8 "struct") at Python/import.c:1895
#10 0x0006c008 in load_next (mod=0xe08b8, altmod=0xe08b8, p_name=0xffbee0c8, 
    buf=0xffbee0c8 "struct", p_buflen=0xffbee0c4) at Python/import.c:1751
#11 0x0006de54 in import_module_ex (name=0x0, globals=0xe08b8, locals=0x0, 
    fromlist=0x0) at Python/import.c:1602
#12 0x0006d024 in PyImport_ImportModuleEx (name=0x268cac "struct", globals=0x0, 
    locals=0x0, fromlist=0x0) at Python/import.c:1643
#13 0x000bca6c in builtin___import__ (self=0x0, args=0x268cac)
    at Python/bltinmodule.c:40
#14 0x000ba3a0 in PyCFunction_Call (func=0x101ca8, arg=0x29d638, kw=0x0)
    at Objects/methodobject.c:69
#15 0x00047ed0 in eval_frame (f=0x2c0198) at Python/ceval.c:2004
#16 0x00048b38 in PyEval_EvalCodeEx (co=0x2603a8, globals=0x2c0198, locals=0x0, 
    args=0x2a74a8, argcount=2, kws=0x2a74b0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#17 0x00049f28 in fast_function (func=0x0, pp_stack=0x2a74a8, n=2782384, na=2, nk=0)
    at Python/ceval.c:3161
#18 0x00047de8 in eval_frame (f=0x2a7350) at Python/ceval.c:2024
#19 0x00048b38 in PyEval_EvalCodeEx (co=0x283138, globals=0x2a7350, locals=0x0, 
    args=0x28d124, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#20 0x000aaaa4 in function_call (func=0x27bd78, arg=0x28d118, kw=0x0)
    at Objects/funcobject.c:374
#21 0x00095d08 in PyObject_Call (func=0x27bd78, arg=0x28d118, kw=0x0)
    at Objects/abstract.c:1684
#22 0x0009e64c in instancemethod_call (func=0x27bd78, arg=0x28d118, kw=0x0)
    at Objects/classobject.c:2276
#23 0x00095d08 in PyObject_Call (func=0x27bd78, arg=0x28d118, kw=0x0)
    at Objects/abstract.c:1684
#24 0x00049fec in do_call (func=0x10ffb8, pp_stack=0xffbeebe0, na=1, nk=2674968)
    at Python/ceval.c:3262
#25 0x00047d30 in eval_frame (f=0x2b3f48) at Python/ceval.c:2027
#26 0x00048b38 in PyEval_EvalCodeEx (co=0x261510, globals=0x2b3f48, locals=0x0, 
    args=0x1c5684, argcount=1, kws=0x1c5688, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#27 0x00049f28 in fast_function (func=0x0, pp_stack=0x1c5684, n=1857160, na=1, nk=0)
    at Python/ceval.c:3161
#28 0x00047de8 in eval_frame (f=0x1c5520) at Python/ceval.c:2024
#29 0x00048b38 in PyEval_EvalCodeEx (co=0x2821d0, globals=0x1c5520, locals=0x0, 
    args=0x175fa8, argcount=1, kws=0x175fac, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#30 0x00049f28 in fast_function (func=0x0, pp_stack=0x175fa8, n=1531820, na=1, nk=0)
    at Python/ceval.c:3161
#31 0x00047de8 in eval_frame (f=0x175e50) at Python/ceval.c:2024
#32 0x00048b38 in PyEval_EvalCodeEx (co=0x2495c0, globals=0x175e50, locals=0x0, 
    args=0x1762fc, argcount=2, kws=0x176304, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#33 0x00049f28 in fast_function (func=0x0, pp_stack=0x1762fc, n=1532676, na=2, nk=0)
    at Python/ceval.c:3161
#34 0x00047de8 in eval_frame (f=0x1761a8) at Python/ceval.c:2024
#35 0x00048b38 in PyEval_EvalCodeEx (co=0x20c908, globals=0x1761a8, locals=0x0, 
    args=0x1734b8, argcount=2, kws=0x1734c0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#36 0x00049f28 in fast_function (func=0x0, pp_stack=0x1734b8, n=1520832, na=2, nk=0)
    at Python/ceval.c:3161
#37 0x00047de8 in eval_frame (f=0x173360) at Python/ceval.c:2024
#38 0x00048b38 in PyEval_EvalCodeEx (co=0x293e50, globals=0x173360, locals=0x0, 
    args=0x298c00, argcount=1, kws=0x298c04, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#39 0x00049f28 in fast_function (func=0x0, pp_stack=0x298c00, n=2722820, na=1, nk=0)
    at Python/ceval.c:3161
#40 0x00047de8 in eval_frame (f=0x298aa8) at Python/ceval.c:2024
#41 0x00048b38 in PyEval_EvalCodeEx (co=0x2495c0, globals=0x298aa8, locals=0x0, 
    args=0x15e008, argcount=2, kws=0x15e010, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#42 0x00049f28 in fast_function (func=0x0, pp_stack=0x15e008, n=1433616, na=2, nk=0)
    at Python/ceval.c:3161
#43 0x00047de8 in eval_frame (f=0x15deb0) at Python/ceval.c:2024
#44 0x00048b38 in PyEval_EvalCodeEx (co=0x233e28, globals=0x15deb0, locals=0x0, 
    args=0x1325f4, argcount=1, kws=0x1325f8, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#45 0x00049f28 in fast_function (func=0x0, pp_stack=0x1325f4, n=1254904, na=1, nk=0)
    at Python/ceval.c:3161
#46 0x00047de8 in eval_frame (f=0x132488) at Python/ceval.c:2024
#47 0x00048b38 in PyEval_EvalCodeEx (co=0x25d8a8, globals=0x132488, locals=0x0, 
    args=0x28, argcount=0, kws=0x273fd4, kwcount=5, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#48 0x00049f28 in fast_function (func=0x0, pp_stack=0x28, n=2572284, na=0, nk=5)
    at Python/ceval.c:3161
#49 0x00047de8 in eval_frame (f=0x273e80) at Python/ceval.c:2024
#50 0x00048b38 in PyEval_EvalCodeEx (co=0x2618a8, globals=0x273e80, locals=0x0, 
    args=0x1130b0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, 
    closure=0x0) at Python/ceval.c:2585
#51 0x00049f28 in fast_function (func=0x0, pp_stack=0x1130b0, n=1126576, na=0, nk=0)
    at Python/ceval.c:3161
#52 0x00047de8 in eval_frame (f=0x112f60) at Python/ceval.c:2024
#53 0x00048b38 in PyEval_EvalCodeEx (co=0x267978, globals=0x112f60, locals=0x0, 
    args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:2585
#54 0x0004b79c in PyEval_EvalCode (co=0x267978, globals=0x1114b8, locals=0x0)
    at Python/ceval.c:483
#55 0x00076fe8 in run_node (n=0xffe20, filename=0x1114b8 "", globals=0x1114b8, 
    locals=0x1114b8, flags=0x1114b8) at Python/pythonrun.c:1079
#56 0x000766e8 in PyRun_SimpleFileExFlags (fp=0xfc1d8, 
    filename=0xffbefe48 "./", closeit=1, flags=0xffbefcec)
    at Python/pythonrun.c:685
#57 0x0001c544 in Py_Main (argc=0, argv=0x1) at Modules/main.c:364

Andrew Koenig,,

From  Fri Aug 16 17:44:22 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 12:44:22 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Fri, 16 Aug 2002 11:22:58 EDT."
References: <> <>
Message-ID: <>

> On Thu, Aug 15, 2002 at 12:46:25PM -0400, Tim Peters wrote:
> > As the only person to have posted an example relying on this behavior, it's
> > OK by me if that example breaks -- it was made up just to illustrate the
> > possibility and raise a caution flag.  I don't take it seriously.

> In Python it's easier to just use the string so there is no real incentive 
> to use the id.  I would say that making the result of the intern() builtin
> mortal is probably safe.

OK, there seems consensus on this one.

> The problem is in C extension modules. In C there is an incentive to rely
> on the immortality of interned strings because it makes the code simpler.
> There was an example of this in the Mac import code. PyString_InternInPlace 
> should probably create immortal interned strings for backward compatibility 
> (and deprecated, of course)

But the vast majority of C code does *not* depend on this.  I'd rather
keep PyString_InternInPlace(), so we don't have to change all call
locations, only the very rare ones that rely on this (Martin found
another two).

Maybe we can even detect the abusing cases by putting a test in
PyString_InternInPlace() like this:

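if (s->ob_refcnt == 1) {
    /* The call line is missing here; PyErr_Warn and the category are assumed. */
    PyErr_Warn(PyExc_DeprecationWarning,
               "interning won't keep your string alive");
    PyErr_Clear(); /* In case the warning was an error, ignore it */
    Py_INCREF(s); /* Make s immortal */
}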

> Maybe PyString_Intern should be renamed to PyString_InternReference to
> make it more obvious that it modifies the pointer "in place".

The perfect name for that API already exists: PyString_InternInPlace(). :-)

--Guido van Rossum (home page:

From  Fri Aug 16 17:53:26 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 12:53:26 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Guido> If that doesn't help, try turning off optimization (edit the generated
Guido> Makefile to delete the "-O3" option, then make clean).  If that helps,
Guido> it must be a gcc optimizer problem.

Guido> If that doesn't help, it's still most likely to be a gcc or binutils
Guido> problem.

It doesn't help.

Andrew Koenig,,

From  Fri Aug 16 18:28:59 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 10:28:59 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Fri, Aug 16, 2002 at 12:38:24PM -0400, Andrew Koenig wrote:
> #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0, 
>     dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83

Er, is the directory name misleading, or have you picked up from 3.1.1?  In theory that shouldn't be a problem; in
practice it could well be the problem.

> #1  0xfed517ec in frame_dummy ()
>    from /export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
> #2  0xfed516d4 in _init ()
>    from /export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
> #3  0xff3bc174 in ?? ()
> #4  0xff3c0a8c in ?? ()
> #5  0xff3c0ba8 in ?? ()
> #6  0x0007b384 in _PyImport_GetDynLoadFunc (fqname=0x2 <Address 0x2 out of bounds>, 
>     shortname=0x1002c8 "", 
>     pathname=0xffbedbd8 "/export/spurr1/homes1/ark/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", fp=0xfc1d8) at Python/dynload_shlib.c:90

Can you disable all use of dynamic loading and try the build again?
Unfortunately, the only practical way to do this seems to be to edit and force DYNLOADFILE to be dynload_stub.o (right before
the line saying AC_MSG_RESULT($DYNLOADFILE)), then regenerate
configure.  (Might be a good idea to add an --enable switch.)

This will obviously not get you an installable build, but it will let
us narrow down the problem a bit.


From  Fri Aug 16 18:34:05 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 13:34:05 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 10:28:59 -0700)
References: <> <> <> <>
Message-ID: <>

Zack> On Fri, Aug 16, 2002 at 12:38:24PM -0400, Andrew Koenig wrote:

>> #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0, 
>> dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83

Zack> Er, is the directory name misleading, or have you picked up
Zack> from 3.1.1?  In theory that shouldn't be a problem; in
Zack> practice it could well be the problem.

This particular test was done with gcc 3.1.1 and binutils 2.13.

As I said at the beginning of the discussion, if I use gcc 3.1.1
and binutils 2.12.1, the Python install works.  If I use gcc 3.2
*or* binutils 2.13, the Python install fails.

From  Fri Aug 16 18:36:26 2002
From: (Barry A. Warsaw)
Date: Fri, 16 Aug 2002 13:36:26 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
References: <>
Message-ID: <>

One quick problem that I ran into when trying to build with gcc 3.2,
installed in /usr/local/bin on a RH 7.3 system: I had to use
--with-cxx=/usr/local/bin/c++ otherwise I got this error:

-------------------- snip snip --------------------
% ./configure --with-pydebug
checking MACHDEP... linux2
checking for --without-gcc... no
checking for --with-cxx=<compiler>... no
checking for c++... c++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
-------------------- snip snip --------------------

Even though:

% c++ --version
c++ (GCC) 3.2
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO

No time to dig into this right now, but I thought I'd report it.

From  Fri Aug 16 18:36:40 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 13:36:40 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: Your message of "Fri, 16 Aug 2002 13:34:05 EDT."
References: <> <> <> <>
Message-ID: <>

> As I said at the beginning of the discussion, if I use gcc 3.1.1
> and binutils 2.12.1, the Python install works.  If I use gcc 3.2
> *or* binutils 2.13, the Python install fails.

I'm guessing that gcc 3.2 somehow also installs binutils 2.13 and that
the bug is in the latter.

--Guido van Rossum (home page:

PS: Mail to works again.  A comcast outage caused the
forwarding service to bounce, probably from 11:30 till 1:30 EDT today.
Bad Exim!  Thanks to Barry for the quick fix.

From  Fri Aug 16 18:42:24 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 13:42:24 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 10:28:59 -0700)
References: <> <> <> <>
Message-ID: <>

Zack> Can you disable all use of dynamic loading and try the build again?

I can, but I'm not sure it will help.

I had a very similar problem earlier, which I definitely traced to
a bug in gnu ld:  If you say


where is as distributed by ActiveTcl, it dumps core.
What I found was that was ultimately invoking the gnu linker
which, in turn, was causing the problem by crashing.

So it may be that the problem is still in the linker; I just don't have
a clue as to where.   Anyway, I'll try some experiments and see if
I can find something interesting.

From  Fri Aug 16 18:52:19 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 13:52:19 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
 (message from Guido van Rossum on Fri, 16 Aug 2002 13:36:40 -0400)
References: <> <> <> <>
 <> <>
Message-ID: <>

>> As I said at the beginning of the discussion, if I use gcc 3.1.1
>> and binutils 2.12.1, the Python install works.  If I use gcc 3.2
>> *or* binutils 2.13, the Python install fails.

Guido> I'm guessing that gcc 3.2 somehow also installs binutils 2.13 and that
Guido> the bug is in the latter.

I see no evidence that gcc 3.2 installs binutils 2.13.
In particular, if I install gcc 3.2, *then* install
binutils 2.12.1, it still fails.

From  Fri Aug 16 19:15:36 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 14:15:36 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 10:28:59 -0700)
References: <> <> <> <>
Message-ID: <>

Zack> This will obviously not get you an installable build, but it
Zack> will let us narrow down the problem a bit.

I've narrowed it down somewhat.

Apparently what happens is that successfully installs the
"struct" extension (or so it thinks), then crashes when it is trying
to import it.  After it has done so, I can duplicate the crash by
executing ./python and then typing

	  import struct

so I don't have to run at all to cause the crash at that
point.

I'm trying to rebuild without dynamic loading now, to see what happens.
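
When narrowing down a crash like this, it can help to run the import in a
child interpreter so that a core dump does not take the test driver with
it.  The sketch below assumes a modern Python 3 with subprocess.run;
import_exit_status is a hypothetical helper.

```python
import subprocess
import sys


def import_exit_status(module_name, python=sys.executable):
    """Import module_name in a fresh interpreter and return its exit status.

    0 means the import worked.  A positive status usually means an
    exception such as ImportError.  On POSIX a negative status means the
    child was killed by a signal, for example a segmentation fault.
    """
    completed = subprocess.run(
        [python, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


print("struct", import_exit_status("struct"))
```

Passing the path of the freshly built ./python as the second argument would
test that binary rather than the interpreter running the script.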

From  Fri Aug 16 19:43:05 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 14:43:05 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 10:28:59 -0700)
References: <> <> <> <>
Message-ID: <>

Zack> Can you disable all use of dynamic loading and try the build
Zack> again?  Unfortunately, the only practical way to do this seems
Zack> to be to edit and force DYNLOADFILE to be
Zack> dynload_stub.o (right before the line saying
Zack> AC_MSG_RESULT($DYNLOADFILE)), then regenerate configure.  (Might
Zack> be a good idea to add an --enable switch.)

As I sort of expected, this makes the crash go away.  However, it is now
replaced by lots of messages like

building 'grp' extension
gcc -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/export/spurr1/homes1/ark/Python-2.2.1/./Include -I/usr/local/include -IInclude/ -c /export/spurr1/homes1/ark/Python-2.2.1/Modules/grpmodule.c -o build/temp.solaris-2.7-sun4u-2.2/grpmodule.o
gcc -shared build/temp.solaris-2.7-sun4u-2.2/grpmodule.o -L/usr/local/lib -o build/lib.solaris-2.7-sun4u-2.2/
WARNING: removing "grp" since importing it failed

If I put back the dynamic loading stuff and rebuild everything from scratch,
I again get a python that crashes when I try to import struct.
It occurs to me that the traceback from that might be useful.
Needless to say, it is much shorter than the earlier one.

I must say that the "Address 0x2 out of bounds" note makes me suspicious.

Here's the traceback:

#0  __register_frame_info_bases (begin=0xfed40000, ob=0xfed40000, tbase=0x0, 
    dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83
#1  0xfed417ec in frame_dummy ()
   from /export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
#2  0xfed416d4 in _init ()
   from /export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/
#3  0xff3bc174 in ?? ()
#4  0xff3c0a8c in ?? ()
#5  0xff3c0ba8 in ?? ()
#6  0x0007b384 in _PyImport_GetDynLoadFunc (fqname=0x2 <Address 0x2 out of bounds>, 
    shortname=0x1002d0 "", 
    pathname=0xffbeedb8 "/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", fp=0xfc1e0) at Python/dynload_shlib.c:90
#7  0x0006edd8 in _PyImport_LoadDynamicModule (name=0xffbef2a8 "struct", 
    pathname=0xffbeedb8 "/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", fp=0xfc1e0) at Python/importdl.c:42
#8  0x0006b63c in load_module (name=0xffbef2a8 "struct", fp=0xfc1e0, 
    buf=0xffbeedb8 "/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", type=3) at Python/import.c:1365
#9  0x0006c4e4 in import_submodule (mod=0xe08c0, subname=0xffbef2a8 "struct", 
    fullname=0xffbef2a8 "struct") at Python/import.c:1895
#10 0x0006c008 in load_next (mod=0xe08c0, altmod=0xe08c0, p_name=0xffbef2a8, 
    buf=0xffbef2a8 "struct", p_buflen=0xffbef2a4) at Python/import.c:1751
#11 0x0006de54 in import_module_ex (name=0x0, globals=0xe08c0, locals=0x111548, 
    fromlist=0xe08c0) at Python/import.c:1602
#12 0x0006d024 in PyImport_ImportModuleEx (name=0x182eec "struct", 
    globals=0x111548, locals=0x111548, fromlist=0xe08c0) at Python/import.c:1643
#13 0x000bca6c in builtin___import__ (self=0x0, args=0x182eec)
    at Python/bltinmodule.c:40
#14 0x000ba3a0 in PyCFunction_Call (func=0x101cb0, arg=0x10a440, kw=0x0)
    at Objects/methodobject.c:69
#15 0x00095d08 in PyObject_Call (func=0x101cb0, arg=0x10a440, kw=0x0)
    at Objects/abstract.c:1684
#16 0x00049c9c in PyEval_CallObjectWithKeywords (func=0x101cb0, arg=0x10a440, 
    kw=0x0) at Python/ceval.c:3049
#17 0x00047810 in eval_frame (f=0x186898) at Python/ceval.c:1839
#18 0x00048b38 in PyEval_EvalCodeEx (co=0x18d7b0, globals=0x186898, locals=0x0, 
    args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:2585
#19 0x0004b79c in PyEval_EvalCode (co=0x18d7b0, globals=0x111548, locals=0x0)
    at Python/ceval.c:483
#20 0x00076fe8 in run_node (n=0xffd68, filename=0x111548 "", globals=0x111548, 
    locals=0x111548, flags=0x111548) at Python/pythonrun.c:1079
#21 0x00076450 in PyRun_InteractiveOneFlags (fp=0xffffffff, 
    filename=0xc0e70 "<stdin>", flags=0xffbefcf4) at Python/pythonrun.c:590
#22 0x000761f0 in PyRun_InteractiveLoopFlags (fp=0xfc1b0, 
    filename=0xc0e70 "<stdin>", flags=0xffbefcf4) at Python/pythonrun.c:526
#23 0x00077c78 in PyRun_AnyFileExFlags (fp=0xfc1b0, filename=0xc0e70 "<stdin>", 
    closeit=0, flags=0xffbefcf4) at Python/pythonrun.c:489
#24 0x0001c544 in Py_Main (argc=1, argv=0x1) at Modules/main.c:364

From  Fri Aug 16 19:51:18 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 14:51:18 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: Your message of "Fri, 16 Aug 2002 14:15:36 EDT."
References: <> <> <> <>
Message-ID: <>

> I've narrowed it down somewhat.
> Apparently what happens is that successfully installs the
> "struct" extension (or so it thinks), then crashes when it is trying
> to import it.  After it has done so, I can duplicate the crash by
> executing ./python and then typing
> 	  import struct
> so I don't have to run at all to cause the crash at that
> point.
> I'm trying to rebuild without dynamic loading now, to see what happens.

I'm sure the 'struct' module is implicated only because it happens to
be the first module that tries to build.

This points to a problem with the dynamic linker.

--Guido van Rossum (home page:

From  Fri Aug 16 19:56:46 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 11:56:46 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

On Fri, Aug 16, 2002 at 02:15:36PM -0400, Andrew Koenig wrote:
> Zack> This will obviously not get you an installable build, but it
> Zack> will let us narrow down the problem a bit.
> I've narrowed it down somewhat.
> Apparently what happens is that successfully installs the
> "struct" extension (or so it thinks), then crashes when it is trying
> to import it.  After it has done so, I can duplicate the crash by
> executing ./python and then typing
> 	  import struct
> so I don't have to run at all to cause the crash at that
> point.
> I'm trying to rebuild without dynamic loading now, to see what happens.

Another thing to try is gcc 3.2 with the Sun assembler and linker.  It
could be that 3.2 triggers a bug in all extant versions of GNU ld on


From  Fri Aug 16 19:53:26 2002
From: (Samuele Pedroni)
Date: Fri, 16 Aug 2002 20:53:26 +0200
Subject: [Python-Dev] type categories (redux?)
Message-ID: <007401c24556$47b5dc80$6d94fea9@newmexico>

[Andrew Koenig]
>Michael> I may be getting lost in subthreads here, but are we still
>Michael> talking about multimethods?
>Well, I started by talking about type categories and ways of
>writing programs that tested them.  Dave Abrahams said, in effect,
>that I was really just talking about multimethods.  I'm still
>not convinced.

On one hand we have

[0. destructuring pattern matching which
is a kind of local control-flow construct]
1. Multiple dispatch [and predicate dispatch]
which are about generic functions defined
as a set of tuples (function,signature),
and where, given a tuple of arguments,
one finds the applicable function subset based on the signatures
and then tries to induce a total/partial order
on the subset based on the arguments and calls the inferior.
The signatures typically involve types(classes)
and the order is about subtype(subclass) relationships.
[again see

On the other hand:

Dave Abrahams wants multiple dispatch and
 also wants to dispatch e.g. on arg being a mapping.

Now to define  in Python what a mapping is
as first-class object/formal language construct
is a kind of Holy Grail.

So the discussion
i) type categories as (Zope) interfaces
ii) type categories in terms of hasattr
[not totally safe but used in practice]
iii) type categories in terms of predicates

It is maybe worth underlining that
[this was somehow implicit in much of the discussion]:
no amount of formalism  and what-not can make an approach extending Python
as-is safer than (ii) unless with the addition of some
kind of explicitness (explicit tagging or labeling, explicit
predicate (forced) assertion, explicit interfaces with
central register or not).

IMHO Dave can wait for a long long time <wink>
or go the pragmatic route:

*) either integrating (Zope) interfaces in the dispatch
*) or adopting some minimal form of predicate dispatching too,
noticing that once you have multi-method dispatch
you can define e.g.

defgeneric ismapping

addmethod ismapping(obj: Any): return False
addmethod ismapping(obj: dict): return True

[in CLIM e.g. the notion of protocols is defined
in this way too (or through abstract superclasses)]

and you gain some flexibility because a user
 - composing a system -
can redefine this for a part of the type
hierarchy in terms of hasattr or what-not ...
or define ismapping for a pre-existent type
a posteriori.
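
The ismapping idea above can be sketched in present-day Python. This is
only an illustration under stated assumptions: the Generic class, its
addmethod() method and the Registry type are hypothetical names made up
here, and the sketch dispatches on a single argument for brevity rather
than on a tuple of arguments.

```python
class Generic:
    """Tiny one-argument generic function: a set of (type, function) pairs."""

    def __init__(self, name):
        self.name = name
        self.methods = {}  # maps a type to the function registered for it

    def addmethod(self, typ, func):
        # Registration can happen a posteriori, for types defined elsewhere.
        self.methods[typ] = func

    def __call__(self, obj):
        # Walk the MRO so the most specific applicable method is called.
        for klass in type(obj).__mro__:
            if klass in self.methods:
                return self.methods[klass](obj)
        raise TypeError("no applicable method for %r" % (obj,))


ismapping = Generic("ismapping")
ismapping.addmethod(object, lambda obj: False)  # addmethod ismapping(obj: Any)
ismapping.addmethod(dict, lambda obj: True)     # addmethod ismapping(obj: dict)


class Registry:
    """A pre-existing type that does not inherit from dict (hypothetical)."""

    def __getitem__(self, key):
        return key

    def keys(self):
        return []


# A user composing a system defines ismapping for Registry after the fact,
# here in terms of hasattr.
ismapping.addmethod(Registry, lambda obj: hasattr(obj, "keys"))
```

Calling ismapping({}) picks the dict method, ismapping(3) falls back to
the object method, and ismapping(Registry()) uses the method added last.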

It's a bit more flexible than the registry
 approach in Zope interfaces (if I understand
that correctly).
But you don't have a direct notion of
 subcategory (this can be a problem or not)

*) or a mixture of the two approaches
[about which I admit I should think more]

[I start to feel that Python's obsession
about not being explicit about protocols
has gone a bit too far (it's just a very personal feeling),
even in Smalltalk people add to Object things like

and then redefine them to return true somewhere down
the hierarchy, and use these predicates to select

It is used sparingly, but it is used.

"We" couldn't do that because
in Python there was no modifiable ur-object,
but both with a registry or multi-methods one
can enable essentially this.]

with-PEP-246'ly y'rs.

"In my experience, much of language design is like this. We think we know how
it will all come out, but we don't always. Usage patterns are often surprising,
as one learns if one is around long enough to design a language or two and then
watch how expectations play out in reality over a course of years. So it's a
gamble. But the only way not to gamble is not to move ahead."
  -- Kent M. Pitman

From  Fri Aug 16 19:58:03 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 20:58:03 +0200
Subject: [Python-Dev] mimetypes patch #554192
In-Reply-To: <>
References: <>
Message-ID: <>

Walter D=F6rwald <> writes:

> OK, so we probably need a reverse mapping for common_types too, but
> shouldn't we consider common_types to be fixed?

If anything, types_map should be fixed: Those are the official
IANA-supported types (including the official x- extension mechanism).

The common types are those that violate IANA specs, yet found in real

If you wanted to support strictness in add_type, then you would
require that the type starts with x-; since should have
all registered types incorporated (if it misses some, that's a bug).
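
For readers following along, the split between the official and the
non-standard tables is visible in today's standard library. The sketch
below assumes the current mimetypes API (add_type and guess_type with a
strict flag); the type "application/x-example" and the extension ".xmpl"
are invented for illustration.

```python
import mimetypes

# Rebuild the default tables so the example starts from a known state.
mimetypes.init()

# Register a made-up x- type in the non-standard table (strict=False).
mimetypes.add_type("application/x-example", ".xmpl", strict=False)

# A strict lookup consults only the official table and finds nothing.
strict_guess = mimetypes.guess_type("report.xmpl", strict=True)

# A lenient lookup also consults the non-standard table.
lenient_guess = mimetypes.guess_type("report.xmpl", strict=False)

# The new entry lives in common_types, not in types_map.
in_common = ".xmpl" in mimetypes.common_types
in_official = ".xmpl" in mimetypes.types_map
```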

> Even better would be, if we could assign priorities to the mappings,
> so that for e.g. image/jpeg the preferred extension is .jpeg.
> Then guess_type() and guess_extension() would return the preferred
> mimetype/extension.

Do you have a specific application for that in mind? It sounds like


From  Fri Aug 16 19:58:36 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 14:58:36 -0400
Subject: [Python-Dev] Last call: mortal interned strings
Message-ID: <>


I'm beginning to be convinced that mortal interned strings are a good
idea.  I've uploaded a patch that defaults interned strings to mortal
status unless explicitly requested with
PyString_InternImmortal(). There are no calls to that function in the

I'm very tempted to check this in and see how it goes.  It's not that
hard to change our mind about the default closer to the 2.3 release

Any objections?

--Guido van Rossum (home page:

From  Fri Aug 16 20:03:41 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 15:03:41 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
 (message from Guido van Rossum on Fri, 16 Aug 2002 14:51:18 -0400)
References: <> <> <> <>
 <> <>
Message-ID: <>

Guido> I'm sure the 'struct' module is implicated only because it happens to
Guido> be the first module that tries to build.

Guido> This points to a problem with the dynamic linker.

Yes, it does.  Indeed, if I look in Python/dynload_shlib.c, near
the end, I see code like this:

        if (Py_VerboseFlag)
                printf("dlopen(\"%s\", %x);\n", pathname, dlopenflags);

        handle = dlopen(pathname, dlopenflags);

        if (handle == NULL) {
                PyErr_SetString(PyExc_ImportError, dlerror());
                return NULL;
        }

If I insert, immediately after the call to dlopen, the following:

        if (Py_VerboseFlag)
                printf("after dlopen(\"%s\", %x);\n", pathname, dlopenflags);

and then run "./python -v" and try to import struct, it does not print
the second set of output:

$ ./python -v
<lots of output>
[GCC 3.1.1] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
import readline # builtin
>>> import struct
dlopen("/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", 2);
Segmentation Fault - core dumped

This behavior strongly suggests to me that it is crashing in dlopen.

However, when I write a little C program that just calls dlopen with the
file in question:

#include <stdio.h>
void *dlopen(const char *, int);

main()
{
        void *handle = dlopen("/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", 2);
        printf ("Handle = %x\n", handle);
}

it quietly succeeds and prints "Handle = 0"

At this point I am way out of my depth.

From  Fri Aug 16 20:04:29 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 21:04:29 +0200
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> Guido> I'm guessing that gcc 3.2 somehow also installs binutils 2.13 and that
> Guido> the bug is in the latter.
> I see no evidence that gcc 3.2 installs binutils 2.13.
> In particular, if I install gcc 3.2, *then* install
> binutils 2.12.1, it still fails.

You can't do that (if installing 2.12.1 means to downgrade from
2.13). gcc configuration analyses features of binutils at configure
time, and relies on those features to be present at run-time.

Are you sure that gcc picks up the binutils you had installed when you
configured gcc? In particular, what happens if you do

gcc --print-prog-name=as
gcc --print-prog-name=ld

Are those the ones that you had in PATH when configuring?


From  Fri Aug 16 20:04:27 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 15:04:27 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 11:56:46 -0700)
References: <> <> <> <> <> <>
Message-ID: <>

Zack> Another thing to try is gcc 3.2 with the Sun assembler and
Zack> linker.  It could be that 3.2 triggers a bug in all extant
Zack> versions of GNU ld on Solaris.

It could be.  However, I seem to remember that gcc 3.x does not work
well with the Sun assembler and linker at all on Solaris.

From  Fri Aug 16 20:09:13 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 21:09:13 +0200
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

(Barry A. Warsaw) writes:

> checking whether the C++ compiler works... configure: error: cannot run C++ compiled programs.
> If you meant to cross compile, use `--host'.
> -------------------- snip snip --------------------
> Even though:
> % c++ --version
> c++ (GCC) 3.2

That means that the c++ that you have installed fails to build working
binaries. This, in turn, most likely means that was not

To correct this, either
- install into /usr/local/lib, and re-run ldconfig, or
- add the path that has to /etc/, and
  re-run ldconfig.

Alternatively, configure --without-cxx.


From  Fri Aug 16 20:08:58 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 15:08:58 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: Your message of "Fri, 16 Aug 2002 12:44:22 EDT."
Message-ID: <>

[I wrote]
> Maybe we can even detect the abusing cases by putting a test in
> PyString_InternInPlace() like this:
> if (s->ob_refcnt == 1) {
>     PyErr_Warn(PyExc_DeprecationWarning,
>                "interning won't keep your string alive");
>     PyErr_Clear(); /* In case the warning was an error, ignore it */
>     Py_INCREF(s); /* Make s immortal */
> }

I tried this, and alas it doesn't work; there are many legit places
where there's only one reference.  So we'll have to use more
traditional ways of tracking down C code that makes assumptions of
immortality so it can drop its own reference.  (Apart from
getclassname() I've seen none.)

--Guido van Rossum (home page:

From  Fri Aug 16 20:09:17 2002
From: (Jeremy Hylton)
Date: Fri, 16 Aug 2002 15:09:17 -0400
Subject: [Python-Dev] pystone(object)
Message-ID: <>

Anyone interested in making pystone use a new-style class?  I just
tried it and it slowed pystone down by 12%.  Using __slots__ bought
back 5%.

On the one hand, we end up comparing the new-style class
implementation of one Python with the classic class version of older
Pythons.  On the other hand, we seem to think that new-style classes
are preferred.  I think we ought to measure them.


RCS file: /cvsroot/python/python/dist/src/Lib/test/,v
retrieving revision 1.7
diff -c -c -r1.7
***	6 Aug 2002 17:21:20 -0000	1.7
---	16 Aug 2002 18:49:56 -0000
*** 40,46 ****
  [Ident1, Ident2, Ident3, Ident4, Ident5] = range(1, 6)
! class Record:
      def __init__(self, PtrComp = None, Discr = 0, EnumComp = 0,
                         IntComp = 0, StringComp = 0):
--- 40,46 ----
  [Ident1, Ident2, Ident3, Ident4, Ident5] = range(1, 6)
! class Record(object):
      def __init__(self, PtrComp = None, Discr = 0, EnumComp = 0,
                         IntComp = 0, StringComp = 0):
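
A rough way to measure this kind of difference today is sketched below.
It is not pystone: Python 3 has no classic classes, so the sketch
compares an ordinary class with a __slots__ class, and the class and
function names (Plain, Slotted, bump) are made up for the illustration.
No particular timing result is claimed.

```python
import timeit


class Plain:
    """Ordinary class: attributes live in the instance __dict__."""

    def __init__(self):
        self.IntComp = 0


class Slotted:
    """Same attribute, stored in a slot instead of an instance dict."""

    __slots__ = ("IntComp",)

    def __init__(self):
        self.IntComp = 0


def bump(record, rounds=1000):
    # One getattr and one setattr per round, as in a record-heavy benchmark.
    for _ in range(rounds):
        record.IntComp = record.IntComp + 1
    return record.IntComp


# Wall-clock seconds for 50 runs of each variant; the values vary by machine.
plain_time = timeit.timeit(lambda: bump(Plain()), number=50)
slotted_time = timeit.timeit(lambda: bump(Slotted()), number=50)
```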

From  Fri Aug 16 20:12:07 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 21:12:07 +0200
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0, 
>     dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83

That is the initialization for exception handling regions (which is
irrelevant for C, but linked into every shared library just in case
C++ objects are also present).

My guess is that you have been using the system linker to link this
binary (


From  Fri Aug 16 20:13:53 2002
From: (Andrew Koenig)
Date: 16 Aug 2002 15:13:53 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Martin> You can't do that (if installing 2.12.1 means to downgrade from
Martin> 2.13). gcc configuration analyses features of binutils at configure
Martin> time, and relies on those features to be present at run-time.

Martin> Are you sure that gcc picks up the binutils you had installed when you
Martin> configured gcc? In particular, what happens if you do

Martin> gcc --print-prog-name=as
Martin> gcc --print-prog-name=ld

Martin> Are those the ones that you had in PATH when configuring?

Yes.  The way I install stuff on this particular machine is to build
each package (gcc, binutils, etc.) in a completely separate directory,
then make symbolic links to that directory from a common directory
in which everything is actually executed.

So gcc always thinks the linker is in a single place, and "installing
binutils 2.12.1" means removing all the symlinks to the version of
binutils that was previously in place and making new symlinks to
the binutils 2.12.1 binaries.

Andrew Koenig,,

From  Fri Aug 16 20:13:57 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 15:13:57 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: Your message of "Fri, 16 Aug 2002 14:43:05 EDT."
References: <> <> <> <>
Message-ID: <>

> As I sort of expected, this makes the crash go away.  However, it is now
> replaced by lots of messages like
> building 'grp' extension
> gcc -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/export/spurr1/homes1/ark/Python-2.2.1/./Include -I/usr/local/include -IInclude/ -c /export/spurr1/homes1/ark/Python-2.2.1/Modules/grpmodule.c -o build/temp.solaris-2.7-sun4u-2.2/grpmodule.o
> gcc -shared build/temp.solaris-2.7-sun4u-2.2/grpmodule.o -L/usr/local/lib -o build/lib.solaris-2.7-sun4u-2.2/
> WARNING: removing "grp" since importing it failed

Yeah, you'd have to enable all the modules you're interested in by
editing Modules/Setup.  That's a pain, which is why we generally use
dynamic loading. :-)

--Guido van Rossum (home page:

From  Fri Aug 16 20:15:35 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 15:15:35 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

Martin> Andrew Koenig <> writes:
>> #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0, 
>> dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83

Martin> That is the initialization for exception handling regions
Martin> (which is irrelevant for C, but linked into every shared
Martin> library just in case C++ objects are also present).

Martin> My guess is that you have been using the system linker to link
Martin> this binary (

Seems unlikely -- the system linker isn't in my search path
and "ls -ltu" shows that it hasn't been executed in 10 days.

From  Fri Aug 16 20:17:19 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 21:17:19 +0200
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> Maybe we can even detect the abusing cases by putting a test in
> PyString_InternInPlace() like this:
> if (s->ob_refcnt == 1) {
>     PyErr_Warn(PyExc_DeprecationWarning,
>                "interning won't keep your string alive");
>     PyErr_Clear(); /* In case the warning was an error, ignore it */
>     Py_INCREF(s); /* Make s immortal */
> }

I believe this will trigger very often; the first usage of
InternFromString (for a certain string) will likely trigger it.

If people do

PyObject *__foo__;

int init(){
  __foo__ = PyString_InternFromString("__foo__");
}

then this is perfectly safe: you get an extra reference back (on top
of ones that the intern dictionary just stops holding); you can keep
this reference as long as you want.

So I would agree with your analysis that this will cause only few
problems. Unfortunately, those will be hard to track down.
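
The safe pattern, keeping your own reference to the interned string, has
a Python-level analogue. The sketch assumes the modern spelling
sys.intern() (the built-in intern() of this era was later moved there);
the strings are assembled at run time so that neither is a compile-time
constant.

```python
import sys

# Build the text at run time so it is not interned by the compiler.
first = sys.intern("".join(["__f", "oo__"]))
second = sys.intern("".join(["__fo", "o__"]))

# Both calls hand back the same object. It stays alive for as long as
# the caller holds a reference such as `first`.
same_object = first is second
```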


From  Fri Aug 16 20:47:39 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 12:47:39 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Aug 16, 2002 at 02:43:05PM -0400, Andrew Koenig wrote:
> Zack> Can you disable all use of dynamic loading and try the build
> Zack> again?  Unfortunately, the only practical way to do this seems
> Zack> to be to edit and force DYNLOADFILE to be
> Zack> dynload_stub.o (right before the line saying
> Zack> AC_MSG_RESULT($DYNLOADFILE)), then regenerate configure.  (Might
> Zack> be a good idea to add an --enable switch.)
> As I sort of expected, this makes the crash go away.

Okay, so that pretty much guarantees it's a bug in the toolchain, or
in Solaris

> If I put back the dynamic loading stuff and rebuild everything from scratch,
> I again get a python that crashes when I try to import struct.
> It occurs to me that the traceback from that might be useful.
> Needless to say, it is much shorter than the earlier one.
> I must say that the "Address 0x2 out of bounds" note makes me suspicious.

That's almost certainly GDB screwing up.  In any case,
dynload_shlib.c's version of _PyImport_GetDynLoadFunc does not use
that argument, so that can't be the cause of the problem.

> #0  __register_frame_info_bases (begin=0xfed40000, ob=0xfed40000, tbase=0x0, 
>     dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83

Having tbase and dbase be 0x0 is not a problem.  The begin and ob
pointers should _not_ be the same. ob should point to a fairly large
data object in the .bss section, and begin should point to the
beginning of the .eh_frame section.  This could be GDB screwing up
again, but unwind-dw2-fde.c is compiled with less aggressive
optimization than dynload_shlib.c, so it's more likely to be accurate.
Also, this particular screw-up is a plausible linker or dynamic linker
bug.  I suspect was loaded at address 0xfed40000, which
leaves both these pointers aimed at the beginning of the (unwritable)
.text section -- __r_f_i_b tries to modify the object pointed to by
ob, crash.

Please execute the attached shell script with CC set to your test gcc
3.2 and/or binutils 2.13.x installation and see what happens.  If we
do have a toolchain bug, it ought to be provoked by this test.


Content-Type: application/x-sh
Content-Disposition: attachment; filename=""

#! /bin/sh

mkdir /tmp/t.$$  || exit 3
cd /tmp/t.$$     || exit 3

cat >main.c <<'EOF'
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle, *sym;
    char *error;

    puts("calling dlopen");
    handle = dlopen("./", RTLD_NOW);
    if (!handle) {
        printf("%s\n", dlerror());
        return 1;
    }

    puts("calling dlsym");
    sym = dlsym(handle, "sym");
    if ((error = dlerror()) != 0) {
        printf("%s\n", error);
        return 1;
    }
    puts("calling sym");
    ((void (*)(void))sym)();
    puts("done");
    return 0;
}
EOF

cat >dyn.c <<'EOF'
#include <stdio.h>
void sym(void)
{
    puts("in sym");
}
EOF

[ -n "$SHFLAGS" ] || SHFLAGS="-fPIC -shared"
[ -n "$CC" ]  || CC=gcc

set -x

$CC $CFLAGS $SHFLAGS dyn.c -o$CC $CFLAGS main.c -o main -ldl

./main || exit $?

cd /tmp
rm -rf t.$$

From  Fri Aug 16 20:51:15 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 12:51:15 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <> <> <> <> <> <> <>
Message-ID: <>

On Fri, Aug 16, 2002 at 03:03:41PM -0400, Andrew Koenig wrote:
> However, when I write a little C program that just calls dlopen with the
> file in question:
> #include <stdio.h>
> void *dlopen(const char *, int);
> main()
> {
>         void *handle = dlopen("/export/spurr1/homes1/ark/test-python/Python-2.2.1/build/lib.solaris-2.7-sun4u-2.2/", 2);
>         printf ("Handle = %x\n", handle);
> }
> it quietly succeeds and prints "Handle = 0"

Handle = 0 indicates a *failure*.  Either try the test script I sent
you, or change your test program to read

#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
   void *handle = dlopen("/export/spurr1/homes1/ark/test-python/"
			 "", 2);
   printf ("Handle = %p\n", handle);
   if (handle == 0)
     printf("Error: %s\n", dlerror());
   return 0;
}

and try it again, or, better, both.
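
The same check (a null handle means failure, so ask the loader why) can
also be made from Python without a compiler. This is a minimal sketch
using ctypes, which calls the platform loader and raises OSError when
loading fails; the helper name try_dlopen and the path below are
hypothetical.

```python
import ctypes


def try_dlopen(path):
    """Return (handle, error); exactly one of the two is None."""
    try:
        return ctypes.CDLL(path), None
    except OSError as exc:
        # The loader's own diagnostic, comparable to what dlerror() reports.
        return None, str(exc)


# A path that is assumed not to exist, to show the failure branch.
handle, error = try_dlopen("/nonexistent/dir/hypothetical_module.so")
```

With a real shared object the first element is a usable handle and the
second is None.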


From  Fri Aug 16 20:54:34 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 12:54:34 -0700
Subject: [Python-Dev] Re:
In-Reply-To: <>
References: <>
Message-ID: <>

On Wed, Aug 14, 2002 at 08:59:44AM -0400, Guido van Rossum wrote:
> The mkstemp() function in the rewritten tempfile has an argument with
> a curious name and default: binary=True.  This caused confusion (even
> the docstring in the original patch was confused :-).  It would be
> much easier to explain if this was changed to text=False.  That is, to
> deviate from the default mode, i.e. use text mode, you'll have to
> write mkstemp(text=True) rather than mkstemp(binary=False).
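
The renamed flag quoted above can be exercised as follows. This is a
small sketch assuming the tempfile.mkstemp() signature as it exists in
current Python, where text defaults to False.

```python
import os
import tempfile

# Ask for text mode explicitly; the default (text=False) is binary mode.
fd, path = tempfile.mkstemp(text=True)
try:
    # mkstemp returns an OS-level descriptor, so wrap it in a file object.
    with os.fdopen(fd, "w") as handle:
        handle.write("line one\n")
    with open(path) as handle:
        content = handle.read()
finally:
    # The caller is responsible for removing the file.
    os.remove(path)
```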

I see you've already done this, but in any case I do think it's a good


From  Fri Aug 16 20:54:47 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 15:54:47 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 12:51:15 -0700)
References: <> <> <> <> <> <> <> <>
Message-ID: <>

Zack> Handle = 0 indicates a *failure*.

Yeah, but it didn't crash.

I'll try your other stuff after I'm out of a meeting that I'm in now.

From  Fri Aug 16 20:55:14 2002
From: (Martin v. Loewis)
Date: 16 Aug 2002 21:55:14 +0200
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> Seems unlikely -- the system linker isn't in my search path
> and "ls -ltu" shows that it hasn't been executed in 10 days.

Absence in the search path is irrelevant - gcc just knows that the
system linker is in /usr/ccs/bin/ld (see gcc/config/svr4.h).

It would help enormously if you'd focus on a single failing
scenario, and, for this scenario, would confirm that the binutils that
you had at configuration time of your compiler are also the ones that
it uses at run-time.

To confirm the latter, it would help if you would report the output
that you get from adding "-v" to one of the linker lines (e.g. for


From  Fri Aug 16 20:55:52 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 15:55:52 -0400
Subject: [Python-Dev] pystone(object)
In-Reply-To: Your message of "Fri, 16 Aug 2002 15:09:17 EDT."
References: <>
Message-ID: <>

> Anyone interested in making pystone use a new-style class?  I just
> tried it and it slowed pystone down by 12%.  Using __slots__ bought
> back 5%.

Yeah, I've noticed the same (though I think I got back more than 5%
with slots).

> On the one hand, we end up comparing the new-style class
> implementation of one Python with the classic class version of older
> Pythons.  On the other hand, we seem to think that new-style classes
> are preferred.  I think we ought to measure them.

It'll be hard to close the gap completely.  For new-style classes,
instance getattr and setattr operations require at least two dict
lookups: each must look in the instance dict as well as in the class
dict (and in the base classes, in MRO order).  This is so that
properties (and other descriptors that support the __set__ protocol)
can override instance variables: setattr can't just store into the
instance dict, it has to check for a property first; and similar for
getattr (it shouldn't trust the instance dict unless there's nothing
in the class).

Slots can get you back most of this, but not all.  Dict lookup is
already extremely tight code, and when I profiled this, most of the
time was spent there -- twice as many lookup calls with new-style
classes as with classic classes.
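
The lookup order described above can be observed directly. The sketch
uses present-day Python semantics and made-up class names: a property
(which supports __set__) takes precedence over the instance dict, while
a plain method (no __set__) does not.

```python
class WithProperty:
    @property
    def value(self):
        return "from property"


class WithPlainMethod:
    def value(self):
        return "from class"


a = WithProperty()
a.__dict__["value"] = "from instance dict"  # bypass setattr on purpose
prop_result = a.value                        # the property still wins

b = WithPlainMethod()
b.__dict__["value"] = "from instance dict"
plain_result = b.value                       # the instance dict wins here

# setattr has to check the class first: a read-only property refuses.
try:
    a.value = "new"
    set_blocked = False
except AttributeError:
    set_blocked = True
```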

--Guido van Rossum (home page:

From  Fri Aug 16 22:54:53 2002
From: (Guido van Rossum)
Date: Fri, 16 Aug 2002 17:54:53 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
Message-ID: <>

(I know there's a sets list at SF, but SF refuses mail from this
machine, and that list is essentially dead.)

In CVS at python/nondist/sandbox/sets/ you'll find a module
(and its unit tests) that implement a versatile Set class roughly
according to PEP 218.  This code has three authors: Greg V. Wilson
wrote the first version; Alex Martelli changed the strategy around
(im)mutability and removed inheritance from dict in favor of having a
dict attribute; and I cleaned it up and changed some implementation
approaches to what I think are faster algorithms (at the cost of
somewhat more verbose code), also changing the API a little bit.

I'd like to do a little more cleanup and then move it into the
standard library.  The API is roughly that of PEP 218, although it
supports several methods and operations that aren't listed in the PEP,
such as comparisons and subset/superset tests.

I'm not sure whether I want to go fix the PEP to match this
implementation, or whether to skip that step and simply document the
module in the standard library docs (that has to be done anyway).

The biggest difference with the PEP is a different approach to the
(im)mutability issue.  The problem is that on the one hand, sets are
containers that hold a potentially large number of elements, and as
such they ought to be mutable: you should be able to start with an
empty set and then add elements to it one at a time.  On the other
hand, the only efficient implementation strategy is to represent a set
internally as the keys of a dictionary whose values are ignored.  This
way insertion and removal can be done very efficiently, at the cost of
a restriction: set elements must be immutable.  The implication is
that you can't have sets of sets -- but occasionally those are handy!
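
The restriction is easy to see with the built-in set and frozenset types
of later Python versions, which are used here as a stand-in for the
sandbox module (strictly, the requirement is that elements be hashable).

```python
inner = {1, 2}

# A mutable set is unhashable, so it cannot be an element of another set.
try:
    outer = {inner}
    error = None
except TypeError as exc:
    error = str(exc)

# An immutable copy is hashable and can be nested.
outer = {frozenset(inner)}
```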

The PEP proposed a strategy based on "freezing" a set as soon as it is
incorporated into another set; but in practice this didn't work out
very well, in part because the test 's1 in s2' would cause s1 to be
frozen: its hash value has to be computed to implement the membership
test, and computing the hash value signals freezing the set.  This
caused too many surprises.

Alex Martelli implemented a different strategy: there are two set classes, (mutable) Set and
ImmutableSet.  They derive from a common base class (BaseSet, an
abstract class) that implements most set operations.  ImmutableSet
adds __hash__, and Set adds mutating operations like update(),
clear(), in-place union, and so on.  The ImmutableSet constructor
accepts a sequence or another set as its argument.  While this is an
easy enough way to construct an immutable set from a mutable set, for
added convenience, if a mutable set is added to another set (except in
the constructor), an immutable shallow copy is automatically made
under the covers and added instead.  Also, when 's1 in s2' is
requested and s1 is a mutable set, it is wrapped in a
temporarily-immutable wrapper class (an internal class that is not
exposed) which compares equal to s1 and has a hash value equal to the
hash value that would be computed for an immutable copy of s1.  The
temporary wrapper does not make a copy; none of the code here is
thread-safe anyway, so if multiple threads are going to share a
mutable set, they will have to use their own locks.
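
A minimal sketch of the two-class design described above, written for
a current Python rather than taken from the checked-in module.  The
class names follow the message; the method bodies, the XOR-based hash
and the copy made inside __contains__ are assumptions for illustration
(the message says the real membership test uses a non-copying wrapper).

```python
class BaseSet:
    """Shared behaviour; elements are stored as the keys of a dict."""

    def __init__(self, iterable=()):
        self._data = dict.fromkeys(iterable, True)

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __eq__(self, other):
        return isinstance(other, BaseSet) and self._data.keys() == other._data.keys()

    def __contains__(self, element):
        # Assumption: a frozen copy stands in for the temporary wrapper.
        if isinstance(element, Set):
            element = ImmutableSet(element)
        return element in self._data


class ImmutableSet(BaseSet):
    def __hash__(self):
        # Order-independent hash so equal sets hash alike.
        result = 0
        for element in self._data:
            result ^= hash(element)
        return result


class Set(BaseSet):
    # Defining __eq__ in BaseSet already leaves Set unhashable.
    def add(self, element):
        # A mutable set added to another set is replaced by a frozen copy.
        if isinstance(element, Set):
            element = ImmutableSet(element)
        self._data[element] = True


inner = Set([1, 2])
outer = Set()
outer.add(inner)   # stores ImmutableSet([1, 2])
inner.add(3)       # does not affect the copy held by outer
```

After this runs, outer still contains the set {1, 2}, while inner
itself (now {1, 2, 3}) is no longer a member.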

I deviated from the original API in a few places:

- I renamed the method is_subset_of() to issubset(), since I don't
  like underscores in method names, and don't think leaving the 'of'
  off will cause ambiguity; I also added issuperset().

- I renamed intersect() to intersection() and sym_difference() to
  symmetric_difference(); ditto for the corresponding _update()
  methods.

- Alex's code explicitly disallowed in-place operators (e.g. __ior__
  meaning in-place union) for immutable sets.  I decided against this;
  instead, when s1 is a variable referencing an immutable set, the
  statement 's1 |= s2' will compute s1|s2 as an immutable set and
  store that in the variable s1.  On the other hand, if s1 references
  a mutable set, 's1 |= s2' will update the object referenced by s1
  in-place.  This is similar to the behavior of lists and tuples under
  the += operator.

- The left operand of a union, intersection or difference operation
  will decide the type of the result.  This means that generally when
  you've got immutable sets, you'll produce more immutable sets; and
  when you've got mutable sets, you'll produce mutable sets.  When you
  mix the two kinds, the left argument of binary operations wins.
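
The two rules above (in-place operators rebind for immutable operands,
and the left operand decides the result type) can be illustrated with
the built-in set and frozenset types of later Python versions, which
behave the same way.  This is only an analogy: those built-ins postdate
this message and are not the Set and ImmutableSet classes discussed
here.

```python
# In-place union on an immutable set rebinds the variable to a new object.
s1 = frozenset([1, 2])
s1_alias = s1
s1 |= {3}
rebinding_made_new_object = s1 is not s1_alias

# In-place union on a mutable set updates the existing object.
m1 = set([1, 2])
m1_alias = m1
m1 |= {3}
mutation_kept_same_object = m1 is m1_alias

# Lists and tuples show the same split under +=.
t = (1, 2)
t_alias = t
t += (3,)
lst = [1, 2]
lst_alias = lst
lst += [3]

# When the two kinds are mixed, the left operand decides the result type.
left_frozen = frozenset([1]) | set([2])
left_mutable = set([1]) | frozenset([2])
```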

Some minor open questions:

- The set constructors have an optional second argument, sort_repr,
  defaulting to False, which decides whether the elements are sorted
  when str() or repr() is taken.  I'm not sure if there would be
  negative consequences of removing this argument and always sorting
  the string representation.  It would simplify the interface and make
  the output nicer (since usually you don't think about setting this
  argument until it's too late).  I don't think that the extra cost of
  sorting the list of elements will be a burden in practice, since
  normally one would only print small sets (if a set has thousands of
  elements, printing it isn't very user friendly).

- I'd like to change the module name from set to sets.  Somehow
  it makes more sense to write

    from sets import Set, ImmutableSet

  than

    from set import Set, ImmutableSet

- I'm aware that this set implementation is not the be-all and end-all
  of sets.  I've seen a set implementation written in C, but I was not
  very impressed -- it was a massive copy-paste-edit job done on the
  dict implementation, and we don't need such code duplication.  But
  eventually there may be a C implementation which will change some
  implementation details.

- Like other concrete types in Python, these Set classes aren't really
  designed to mix well with other set implementations.  For example,
  sets of small cardinal numbers may be represented efficiently by
  long ints, but the union method currently requires that the other
  argument uses the same implementation.  I'm not sure whether this
  will eventually require changing.

Any comments?  Or shall I just check this in?

--Guido van Rossum

From  Fri Aug 16 22:59:05 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 17:59:05 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 12:47:39 -0700)
References: <> <> <> <> <> <>
Message-ID: <>

Zack> Please execute the attached shell script with CC set to your test gcc
Zack> 3.2 and/or binutils 2.13.x installation and see what happens.  If we
Zack> do have a toolchain bug, it ought to be provoked by this test.

Excellent!

$ sh test-dynload
+ gcc -fPIC -shared dyn.c -o 
+ gcc main.c -o main -ldl 
+ ./main 
calling dlopen
Segmentation Fault - core dumped
+ exit 139 

Now to see if I can figure out why...

From  Fri Aug 16 22:49:13 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 17:49:13 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

Martin> Andrew Koenig <> writes:
>> Seems unlikely -- the system linker isn't in my search path
>> and "ls -ltu" shows that it hasn't been executed in 10 days.

Martin> Absence in the search path is irrelevant - gcc just knows that the
Martin> system linker is in /usr/ccs/bin/ld (see gcc/config/svr4.h).

But the system linker has not been executed in 10 days.  So whatever
gcc might or might not know, that proves that gcc did not execute
/usr/ccs/bin/ld during any of this testing.

Martin> It would help enormously if you'd focus on a single failing
Martin> scenario, and, for this scenario, would confirm that the
Martin> binutils that you had at configuration time of your compiler
Martin> are also the ones that it uses at run-time.

Martin> To confirm the latter, it would help if you would report the
Martin> output that you get from adding "-v" to one of the linker
Martin> lines (e.g. for

which I do how, exactly?

From  Fri Aug 16 23:30:40 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 15:30:40 -0700
Subject: [Python-Dev] A few lessons from the rewrite
Message-ID: <>

While doing the tempfile rewrite I discovered some places where
improvements could be made to the rest of the standard library.  I'd
like to discuss these here.

1) Dummy threads module.

Currently, a library module that wishes to be thread-safe but still
work on platforms where threads are not implemented, has to jump
through hoops.  In tempfile we have

| try:
|     import thread as _thread
|     _allocate_lock = _thread.allocate_lock
| except (ImportError, AttributeError):
|     class _allocate_lock:
|         def acquire(self):
|             pass
|         release = acquire

It would be nice if the thread and threading modules existed on all
platforms, providing these sorts of dummy locks on the platforms that
don't actually implement threads.  I notice that Queue uses 'import
thread' unconditionally -- perhaps this is already the case?  I can't
find any evidence of it.

2) pthread_once equivalent.

pthread_once is a handy function in the C pthreads library which
can be used to guarantee that some data object is initialized exactly
once, and no thread sees it in a partially initialized state.  I had
to implement a fake version in tempfile:

| _once_lock = _allocate_lock()
| def _once(var, initializer):
|     """Wrapper to execute an initialization operation just once,
|     even if multiple threads reach the same point at the same time.
|     var is the name (as a string) of the variable to be entered into
|     the current global namespace.
|     initializer is a callable which will return the appropriate initial
|     value for variable.  It will be called only if variable is not
|     present in the global namespace, or its current value is None.
|     Do not call _once from inside an initializer routine, it will deadlock.
|     """
|     vars = globals()
|     # Check first outside the lock.
|     if vars.get(var) is not None:
|         return
|     try:
|         _once_lock.acquire()
|         # Check again inside the lock.
|         if vars.get(var) is not None:
|             return
|         vars[var] = initializer()
|     finally:
|         _once_lock.release()

I call it fake for three reasons.  First, it should be using
threading.RLock so that recursive calls do not deadlock.  That's a
trivial fix (this sort of high level API probably belongs in anyway).  Second, it uses globals(), which means that all
symbols it initializes live in the namespace of its own module, when
what's really wanted is the caller's module.  And most important, I'm
certain that this interface is Not The Python Way To Do It.
Unfortunately, I've not been able to figure out what the Python Way To
Do It is, for this problem.

3) test_support.TestSkipped and

Simple - you can't use TestSkipped in a test set.
This is a missing feature of unittest, which has no notion of skipping
a given test.  Any exception thrown from inside one of its test
routines is taken to indicate a failure.

I think the right fix here is to add a skip() method to
unittest.TestCase which works with both a bare test
framework, and Python's own



From  Fri Aug 16 23:32:15 2002
From: (Tim Peters)
Date: Fri, 16 Aug 2002 18:32:15 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

> ...
> Some minor open questions:
> - The set constructors have an optional second argument, sort_repr,
>   defaulting to False, which decides whether the elements are sorted
>   when str() or repr() is taken.  I'm not sure if there would be
>   negative consequences of removing this argument and always sorting
>   the string representation.

I'd rather you left this alone.  Being sortable is a much stronger
requirement on set elements than just supporting __hash__ and __eq__, and,
e.g., it would suck if I could create a set of complex numbers but couldn't
print it(!).

>   It would simplify the interface and make the output nicer (since
>   usually you don't think about setting this argument until it's too
>   late).  I don't think that the extra cost of sorting the list of
>   elements will be a burden in practice, since normally one would only
>   print small sets (if a set has thousands of elements, printing it
>   isn't very user friendly).

I think you could (and should <wink>) get 95% of the benefit here by
changing the sort_repr default to True.  I'm happy to say False when I know
I've got unsortable keys.
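
A small sketch of the trade-off being discussed, for a current Python.
The helper name format_elements is hypothetical; only the sort_repr
flag and the suggested default of True come from the thread.  Complex
numbers are hashable but cannot be ordered, so sorting them raises
TypeError, and the caller opts out with sort_repr=False.

```python
def format_elements(elements, sort_repr=True):
    """Return a printable listing of the elements, sorted if requested."""
    items = list(elements)
    if sort_repr:
        # Raises TypeError when the elements cannot be ordered.
        items = sorted(items)
    return "[" + ", ".join(repr(item) for item in items) + "]"


sortable = format_elements([3, 1, 2])
unsorted_complex = format_elements([2j, 1j], sort_repr=False)

try:
    format_elements([2j, 1j])
    sorting_failed = False
except TypeError:
    sorting_failed = True
```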

> - I'd like to change the module name from set to sets.  Somehow
>   it makes more sense to write
>     from sets import Set, ImmutableSet
>   than
>     from set import Set, ImmutableSet


> - I'm aware that this set implementation is not the be-all and end-all
>   of sets.

That's OK -- there is no such set implementation in existence.  This one
covers all simple uses, and many advanced uses -- go for it!

>   I've seen a set implementation written in C, but I was not very
>   impressed -- it was a massive copy-paste-edit job done on the
>   dict implementation, and we don't need such code duplication.

Ditto (& I've said so before about what was most likely the same
implementation).

> ...
> - Like other concrete types in Python, these Set classes aren't really
>   designed to mix well with other set implementations.  For example,
>   sets of small cardinal numbers may be represented efficiently by
>   long ints, but the union method currently requires that the other
>   argument uses the same implementation.  I'm not sure whether this
>   will eventually require changing.

We could rehabilitate __coerce__ in a hypergeneralized form <wink>.

From  Fri Aug 16 23:35:38 2002
From: (Zack Weinberg)
Date: Fri, 16 Aug 2002 15:35:38 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <> <> <> <> <> <> <>
Message-ID: <>

On Fri, Aug 16, 2002 at 05:59:05PM -0400, Andrew Koenig wrote:
> Zack> Please execute the attached shell script with CC set to your test gcc
> Zack> 3.2 and/or binutils 2.13.x installation and see what happens.  If we
> Zack> do have a toolchain bug, it ought to be provoked by this test.
> Excellent!
> $ sh test-dynload
> + gcc -fPIC -shared dyn.c -o 
> + gcc main.c -o main -ldl 
> + ./main 
> calling dlopen
> Segmentation Fault - core dumped
> + exit 139 

Bingo.  That demonstrates conclusively that this isn't a Python bug.

Please repeat the test like so:

$ CC="gcc -v" sh test-dynload

Send the complete output, the result of "uname -a", and the script
itself to both and  And I
think we can stop bothering python-dev.


From  Fri Aug 16 23:42:13 2002
From: (Andrew Koenig)
Date: Fri, 16 Aug 2002 18:42:13 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <> (message from Zack
 Weinberg on Fri, 16 Aug 2002 15:35:38 -0700)
References: <> <> <> <> <> <> <> <>
Message-ID: <>

Zack> Bingo.  That demonstrates conclusively that this isn't a Python bug.

Zack> Please repeat the test like so:

Zack> $ CC="gcc -v" sh test-dynload

Zack> Send the complete output, the result of "uname -a", and the script
Zack> itself to both and  And I
Zack> think we can stop bothering python-dev.

Will do.  Thanks for the help!

From  Sat Aug 17 03:08:34 2002
From: (Greg Ewing)
Date: Sat, 17 Aug 2002 14:08:34 +1200 (NZST)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

Andrew Koenig <>:

> if I know that modules x and y overload the same function,
> and I want to be sure that x's case is tested first, one would think I
> could ensure it by writing
>         import x, y
> But in fact I can't, because someone else may have imported y already,
> in which case the second import is a no-op.

So far no-one has addressed the other importing problem
I mentioned, which is how to ensure that the relevant
modules get imported *at all*.

Currently in Python, a module gets imported because
some other module needs to use a name from it. If no
other module needs to do so, the module is not needed.

But with generic functions, this will no longer be
true. It will be possible for a module to be needed
by the system as a whole, yet no other module knows
that it is needed!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Sat Aug 17 07:57:14 2002
From: (Martin Sjögren)
Date: 17 Aug 2002 08:57:14 +0200
Subject: [Python-Dev] A few lessons from the rewrite
In-Reply-To: <>
References: <>
Message-ID: <1029567435.597.3.camel@winterfell>

On Sat 2002-08-17 at 00.30, Zack Weinberg wrote:
> 2) pthread_once equivalent.
> pthread_once is a handy function in the C pthreads library which
> can be used to guarantee that some data object is initialized exactly
> once, and no thread sees it in a partially initialized state.  I had
> to implement a fake version in
> | _once_lock = _allocate_lock()
> |
> | def _once(var, initializer):
> |     """Wrapper to execute an initialization operation just once,
> |     even if multiple threads reach the same point at the same time.
> |
> |     var is the name (as a string) of the variable to be entered into
> |     the current global namespace.
> |
> |     initializer is a callable which will return the appropriate initial
> |     value for variable.  It will be called only if variable is not
> |     present in the global namespace, or its current value is None.
> |
> |     Do not call _once from inside an initializer routine, it will deadlock.
> |     """
> |
> |     vars = globals()
> |     # Check first outside the lock.
> |     if vars.get(var) is not None:
> |         return

Wouldn't it make more sense to use has_key (or 'in')? If var is assigned
to None before _once is called, that value is overridden...
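
A short illustration of the difference, for a current Python.  The
helper names are hypothetical; the point is only that a value test and
a presence test disagree when the variable has been explicitly set to
None.

```python
def needs_init_by_value(namespace, name):
    # The test used in _once: missing and None are treated alike.
    return namespace.get(name) is None


def needs_init_by_presence(namespace, name):
    # The alternative suggested here: only a missing name needs initializing.
    return name not in namespace


namespace = {"tempdir": None}
by_value = needs_init_by_value(namespace, "tempdir")
by_presence = needs_init_by_presence(namespace, "tempdir")
```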

> |     try:
> |         _once_lock.acquire()
> |         # Check again inside the lock.
> |         if vars.get(var) is not None:
> |             return
> |         vars[var] = initializer()
> |     finally:
> |         _once_lock.release()

I agree that pthread_once is useful, and it would be nice to have
something like this in the standard library.


From  Sat Aug 17 08:29:37 2002
From: (Martin v. Loewis)
Date: 17 Aug 2002 09:29:37 +0200
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <>
References: <>
Message-ID: <>

Andrew Koenig <> writes:

> Martin> To confirm the latter, it would help if you would report the
> Martin> output that you get from adding "-v" to one of the linker
> Martin> lines (e.g. for
> which I do how, exactly?

You copy the line that is used to link, e.g. 

gcc -shared build/temp.linux-i686-2.3/structmodule.o -L/usr/local/lib -o build/lib.linux-i686-2.3/

and add a -v option to it. Actually, adding -Wl,-V is even better; on
my system, this gives

GNU ld version 20011021 (SuSE)
  Supported emulations:


From  Sat Aug 17 13:14:47 2002
From: (Guido van Rossum)
Date: Sat, 17 Aug 2002 08:14:47 -0400
Subject: [Python-Dev] A few lessons from the rewrite
In-Reply-To: Your message of "Fri, 16 Aug 2002 15:30:40 PDT."
References: <>
Message-ID: <>

> 1) Dummy threads module.
> Currently, a library module that wishes to be thread-safe but still
> work on platforms where threads are not implemented, has to jump
> through hoops.  In we have
> | try:
> |     import thread as _thread
> |     _allocate_lock = _thread.allocate_lock
> | except (ImportError, AttributeError):
> |     class _allocate_lock:
> |         def acquire(self):
> |             pass
> |         release = acquire
> It would be nice if the thread and threading modules existed on all
> platforms, providing these sorts of dummy locks on the platforms that
> don't actually implement threads.  I notice that uses 'import
> thread' unconditionally -- perhaps this is already the case?  I can't
> find any evidence of it.

The Queue module was never intended for use in an unthreaded program;
it's specifically intended for safe communication between threads.  So
if you import Queue on a threadless platform, the import thread will
fail for a good reason.

The question with a dummy module is, how far do you want it to go?  I
guess one thing we could do would be to always make the thread module
available, and implement a dummy lock.  The lock would have acquire
and release methods that simply set and test a flag; acquire() raises
an exception if the flag is already set (to simulate deadlock), and
release() raises an exception if it isn't set.  But it should *not*
provide start_new_thread; this can be used as a flag to indicate the
real presence of threads.  Then the threading module would import
successfully, but calling start() on a Thread object would fail.  Hm,
I'm not sure if I like that; maybe instantiating the Thread class
should fail already.

There's a backwards incompatibility problem.  There is code that
currently tries to import the thread or threading module and if this
succeeds expects it can use threads (not just locks), and has an
alternative implementation if threads do not exist.  Such code would
break if the thread module always existed.

How about a compromise: we can put a module dummy_thread in the
standard library, and at the top of tempfile, you can write

try:
    import thread as _thread
except ImportError:
    import dummy_thread as _thread
_allocate_lock = _thread.allocate_lock

If you can provide an implementation of dummy_thread, I'll gladly
check it in.
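
A minimal sketch of what such a dummy lock might look like, following
the description above: acquire and release only set and test a flag,
and misuse raises instead of blocking.  The names DummyLock and
DummyLockError are hypothetical, and this is not the module that was
eventually checked in.  As suggested above, no start_new_thread is
provided.

```python
class DummyLockError(Exception):
    """Raised where a real lock would deadlock or be misused."""


class DummyLock:
    def __init__(self):
        self._locked = False

    def acquire(self):
        # With no other threads, waiting on a held lock can never succeed.
        if self._locked:
            raise DummyLockError("acquire() on a held lock would deadlock")
        self._locked = True
        return True

    def release(self):
        if not self._locked:
            raise DummyLockError("release() on a lock that is not held")
        self._locked = False

    def locked(self):
        return self._locked


def allocate_lock():
    """Stand-in for thread.allocate_lock on a threadless platform."""
    return DummyLock()
```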

> 2) pthread_once equivalent.
> pthread_once is a handy function in the C pthreads library which
> can be used to guarantee that some data object is initialized exactly
> once, and no thread sees it in a partially initialized state.  I had
> to implement a fake version in
> | _once_lock = _allocate_lock()
> | 
> | def _once(var, initializer):
> |     """Wrapper to execute an initialization operation just once,
> |     even if multiple threads reach the same point at the same time.
> | 
> |     var is the name (as a string) of the variable to be entered into
> |     the current global namespace.
> | 
> |     initializer is a callable which will return the appropriate initial
> |     value for variable.  It will be called only if variable is not
> |     present in the global namespace, or its current value is None.
> | 
> |     Do not call _once from inside an initializer routine, it will deadlock.
> |     """
> | 
> |     vars = globals()
> |     # Check first outside the lock.
> |     if vars.get(var) is not None:
> |         return

(Martin Sj. commented at this line that this would overwrite var if it
was defined as None; from the docstring I gather that that's
intentional behavior.)

> |     try:
> |         _once_lock.acquire()

IMO this has a subtle bug: the acquire() should come *before* the try:
call.  If for whatever reason acquire() fails, you'd end up doing a
release() on a lock you didn't acquire.  It's true that

    _once_lock.acquire()
    try:

is not atomic since a signal handler could raise an exception between
them, but there's a race condition either way, and I don't know how to
fix them both at the same time (not without adding a construct to
shield signals, which IMO is overkill -- be Pythonic, live
dangerously, accept the risk that a ^C can screw you.  It can
anyway. :-)

> |         # Check again inside the lock.
> |         if vars.get(var) is not None:
> |             return
> |         vars[var] = initializer()
> |     finally:
> |         _once_lock.release()
> I call it fake for three reasons.  First, it should be using
> threading.RLock so that recursive calls do not deadlock.  That's a
> trivial fix (this sort of high level API probably belongs in
> anyway).

What's the use case for that?  Surely an initialization function can
avoid calling itself.  I'd say YAGNI.

> Second, it uses globals(), which means that all
> symbols it initializes live in the namespace of its own module, when
> what's really wanted is the caller's module.  And most important, I'm
> certain that this interface is Not The Python Way To Do It.
> Unfortunately, I've not been able to figure out what the Python Way To
> Do It is, for this problem.

In the case of tempfile, I think the code will improve in clarity
if you simply write it out.  I tried this and it saved 10 lines of
code (mostly the docstring in _once() -- but that's fair game, since
_once embodies more tricks than the expanded code).

In addition, since gettempdir() is called for the default argument
value of mkstemp(), it would be much simpler to initialize tempdir at
module initialization time; the module initialization is guaranteed to
run only once.  If I do this, I save another 8 lines; but I believe
you probably wanted to postpone calling gettempdir() until any of the
functions that have gettempdir() as their argument get *called*, which
means that in fact all those functions have to be changed to have None
for their default and insert

  if dir is None:
      dir = gettempdir()

at the top of their bodies.
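
A sketch of the difference, for a current Python.  The function
make_name is hypothetical; tempfile.gettempdir() is the real library
call.  A default written as dir=gettempdir() would be evaluated once,
when the def statement runs, whereas a None default defers the call
until the function is actually used.

```python
import os
import tempfile


def make_name(prefix="tmp", dir=None):
    """Build a path inside dir, looking up the temp directory lazily."""
    if dir is None:
        dir = tempfile.gettempdir()   # evaluated at call time, not import time
    return os.path.join(dir, prefix + "example")


default_path = make_name()
explicit_path = make_name(dir=os.path.join("some", "dir"))
```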

> 3) test_support.TestSkipped and
> Simple - you can't use TestSkipped in a test set.
> This is a missing feature of unittest, which has no notion of skipping
> a given test.  Any exception thrown from inside one of its test
> routines is taken to indicate a failure.
> I think the right fix here is to add a skip() method to
> unittest.TestCase which works with both a bare test
> framework, and Python's own

Maybe you can bring this one up in the PyUnit list?  I don't know
where that is, but we're basically tracking Steve Purcell's code.
Maybe he has a good argument against this feature, or a better way to
do it; or maybe he thinks it is cool.

Personally, I think the thing to do is put tests that can't always run
in a separate test suite class and only add that class to the list of
suites when applicable.  I see you *almost* stumbled upon this idiom
with the dummy_test_TemporaryFile; but that approach seems overkill:
why not simply skip test_TemporaryFile when it's the same as

--Guido van Rossum

From  Sat Aug 17 13:17:26 2002
From: (Samuele Pedroni)
Date: Sat, 17 Aug 2002 14:17:26 +0200
Subject: [Python-Dev] multimethods: missing parts (was: type categories)
Message-ID: <002801c245e8$0cf1d620$6d94fea9@newmexico>

[Greg Ewing]
>So far no-one has addressed the other importing problem
>I mentioned, which is how to ensure that the relevant
>modules get imported *at all*.

We were busy agreeing on the basic stuff  :(.

>Currently in Python, a module gets imported because
>some other module needs to use a name from it. If no
>other module needs to do so, the module is not needed.
>But with generic functions, this will no longer be
>true. It will be possible for a module to be needed
>by the system as a whole, yet no other module knows
>that it is needed!

What a dramatic personification <wink>!

Yes it's a real bookkeeping problem.

Sidenote: Common Lisp
systems deliver programs as whole
system images, and one often uses an image
to store development snapshots.
So it's not really an issue for the program
at startup. Libraries nevertheless
should come with "scripts" that assure
that all their relevant parts are loaded.
OTOH as long as a programmer immediately
loads his definitions/redefinitions and uses
the image for snapshots he can forget about
the issue. But even he should still care about being
able to reload/reconstruct the system from
the source files and from scratch.
So in the end the problem is not
only a Python problem.

At the moment the following bookkeeping
rules come to my mind to address the issue:

- a module in a library that exposes
a generic function defined by the library
(defgeneric) should make sure that all the definitions
of methods in the library are already added to the
generic function once the generic function is exposed.

- a module in a library that exposes classes for which
it adds specialized methods
to a generic function defined in and imported
from another library, should assure that once
the class instances can be obtained the specialized
versions are already added.

It seems to me that this should cover most of
the sane cases.
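
A sketch of the first bookkeeping rule, using functools.singledispatch
from later Python versions as a stand-in.  That is single dispatch
only, not the multimethods under discussion, and the names describe and
Point are hypothetical.  The point is that every method the library
owns is registered in the same module that exposes the generic
function, so importing that module is enough to get all of them.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    """Generic function exposed by the (hypothetical) library."""
    return "unknown object"


class Point:
    pass


# Rule 1: the library's own methods are registered before describe
# is visible to any importer.
@describe.register(int)
def _describe_int(obj):
    return "integer"


@describe.register(Point)
def _describe_point(obj):
    return "point"
```

Under the second rule, a module defining a new class for describe would
do its describe.register call in the same module as the class, so that
instances cannot exist without the specialized method.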

Issues' orthogonal and not so orthogonal decomposition:

- Support for multimethods can be written in (pure) Python
today, at this level the question is whether to have such
support in the std lib or not;
- if my understanding is correct Dave Abrahams wants
multimethod dispatch as the moral equivalent of static
overloading, for that use gf-method definitions would in most cases
be just in one place (especially if what we are "overloading"
is a normal class method)  and the generic function is expected to
be used just as a normal function;
- nice to have: dispatch on protocols/categories; this intersects
a well-known, ever-recurring issue;
- how much should multimethods become part of the language?
are they a useful addition?  what about newbies? can multimethods be made to
play well with the rest of the language (especially with single dispatch and
normal class methods [*])?
will they eventually deserve syntax sugar? are they a Py3K thing?

-should-buy-the-AMOP-ly y'rs - Samuele.

[*] at the moment personally I see two possibilities:
- have a type of generic function that can live inside a class (namespace)
and is able to redispatch (maybe only if appropriately configured)
to superclasses' methods if no matching gf-method is found.
 [after, before, around methods would be difficult (impossible?) to
implement with the right semantics]
- redefine the normal dispatch rules (obj.meth(...)) in order to directly
take care of generic functions inside classes [larger change]

From  Sat Aug 17 14:29:32 2002
From: (Andrew Koenig)
Date: Sat, 17 Aug 2002 09:29:32 -0400 (EDT)
Subject: [Python-Dev] Python build trouble with the new gcc/binutils -- last word for now (I hope)
In-Reply-To: <>
References: <>
 <> <>
Message-ID: <>

After trying various long-running tests on various machines, I have
determined the following:

1) The culprit appears to be binutils 2.13, not gcc 3.2.  If I use
   binutils 2.13 with gcc 3.2 or gcc 3.1.1, Python does not install.
   If I use binutils 2.12.1 with gcc 3.2 or gcc 3.1.1, Python installs.
   What was misleading me yesterday was some kind of version skew,
   perhaps caused by compiling gcc 3.2 with binutils 2.13; when I
   recompiled gcc 3.2 after installing binutils 2.12.1, all was well.

2) The little dynamic-linker test program is a reliable indicator
   for this problem.

I have filed a binutils bug report.

I will also file a sourceforge bug report for python, suggesting that
the dynamic-linker test program should be included as part of the
configuration process, as an early warning against this problem.

Thanks for all the help!

From  Sat Aug 17 16:12:13 2002
From: (Tim Peters)
Date: Sat, 17 Aug 2002 11:12:13 -0400
Subject: [Python-Dev] A few lessons from the rewrite
In-Reply-To: <>
Message-ID: <>

[Zack Weinberg]
> ...
> 2) pthread_once equivalent.
> pthread_once is a handy function in the C pthreads library which
> can be used to guarantee that some data object is initialized exactly
> once, and no thread sees it in a partially initialized state.

I don't know that it comes up enough in Python to bother doing something
about it -- as Guido said, there's an import lock under the covers that
ensures only one thread executes module init code (== all "top level" code
in a module).  So modules that need one-shot initialization can simply do it
at module level.  tempfile has traditionally gone overboard in avoiding use
of this feature, though.

A more Pythonic approach may be gotten via emulating pthread_once more
closely, forgetting the "data object" business in favor of executing
arbitrary functions "just once".  Like so, maybe:

def do_once(func, lock=threading.RLock(), done={}):
    if func not in done:
        lock.acquire()
        try:
            if func not in done:
                func()
                done[func] = True
        finally:
            lock.release()

"done" is a set of function objects that have already been run, represented
by a dict mapping function objects to bools (although the dict values make
no difference, only key presence matters).  Default arguments are abused
here to give do_once persistent bindings to objects without polluting the
global namespace.  A more purist alternative is

def do_once(func):
    if func not in do_once.done:
        do_once.lock.acquire()
        try:
            if func not in do_once.done:
                func()
                do_once.done[func] = True
        finally:
            do_once.lock.release()

do_once.lock = threading.RLock()
do_once.done = {}

This is "more Pythonic", chiefly in not trying to play presumptive games
with namespaces.  If some module M wants to set its own attr goob, fine, M
can do

def setgoob():
    global goob
    goob = 42

do_once(setgoob)

and regardless of which module do_once came from.  Now what setgoob does is
utterly obvious, and do_once() doesn't make helpful assumptions that get in
the way <wink>.

From  Sat Aug 17 16:36:22 2002
From: (Samuele Pedroni)
Date: Sat, 17 Aug 2002 17:36:22 +0200
Subject: [Python-Dev] Re: Cecil papers (clarification)
Message-ID: <009301c24603$d723f7a0$6d94fea9@newmexico>

> [Tim Peters]
> >FYI,
> >
> >
> >
> >has lots of good papers from the Cecil project, a pioneering
> >multiple-dispatch language.  Or you could save time reading and learn by
> >repeating their early mistakes <wink>.
> it's prototype based, not class based so not everything is 
> relevant

I forgot to mention that Cecil is meant for
static compilation under a closed-world assumption, so
the implementation aspects basically do not translate.


From  Sat Aug 17 19:34:35 2002
From: (Oren Tirosh)
Date: Sat, 17 Aug 2002 14:34:35 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Fri, Aug 16, 2002 at 06:32:15PM -0400, Tim Peters wrote:
> > - The set constructors have an optional second argument, sort_repr,
> >   defaulting to False, which decides whether the elements are sorted
> >   when str() or repr() is taken.  I'm not sure if there would be
> >   negative consequences of removing this argument and always sorting
> >   the string representation.
> I'd rather you left this alone.  Being sortable is a much stronger
> requirement on set elements than just supporting __hash__ and __eq__, and,
> e.g., it would suck if I could create a set of complex numbers but couldn't
> print it(!).

How about sorting with this comparison function:

errors = (TypeError, ...?)

def cmpval(x):
    try:
        cmp(0, x)
    except errors:
        try:
            h = hash(x)
        except errors:
            h = -1
        return (1, h, id(x))
    return (0, x)

def robust_cmp(a,b):
    try:
        return cmp(a,b)
    except errors:
        try:
            return cmp(cmpval(a), cmpval(b))
        except errors:
            return 0

>>> l=[3j, 2j, 4, 4j, 1, 2, 1j, 3]
>>> l.sort(robust_cmp)
>>> l
[1, 2, 3, 4, 1j, 2j, 3j, 4j]

It's equivalent to standard cmp if no errors are encountered. For lists
containing uncomparable objects it produces a pretty consistent order.
It's not perfect but should be good enough for aesthetic purposes.
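
For readers on a current Python, where cmp() and the comparison-function
argument to list.sort() no longer exist, the same idea can be sketched
roughly as follows. This is only an approximation of the function above:
it assumes TypeError is the only error worth catching, and it uses
functools.cmp_to_key to adapt the three-way comparison.

```python
from functools import cmp_to_key

def cmpval(x):
    """Return a fallback sort value for x (mirrors the cmpval above)."""
    try:
        0 < x  # probe: can x be ordered against a plain number?
    except TypeError:
        try:
            h = hash(x)
        except TypeError:
            h = -1
        # Unorderable values sort after orderable ones, by hash, then id.
        return (1, h, id(x))
    return (0, x)

def robust_cmp(a, b):
    """Three-way comparison that falls back to cmpval() on TypeError."""
    try:
        return (a > b) - (a < b)
    except TypeError:
        try:
            ka, kb = cmpval(a), cmpval(b)
            return (ka > kb) - (ka < kb)
        except TypeError:
            return 0

values = [3j, 2j, 4, 4j, 1, 2, 1j, 3]
ordered = sorted(values, key=cmp_to_key(robust_cmp))
print(ordered)  # the integers come first, then the complex values
```

As in the original, the order among unorderable values depends on hash()
and id(), so it is consistent within one run rather than meaningful.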


From David Abrahams" <  Sat Aug 17 19:18:56 2002
From: David Abrahams" < (David Abrahams)
Date: Sat, 17 Aug 2002 14:18:56 -0400
Subject: [Python-Dev] type categories
References: <><><14c201c244b2$8cc651f0$><><> <>
Message-ID: <009a01c2461a$a82a3e70$>

From: "Andrew Koenig" <>

> Michael> I may be getting lost in subthreads here, but are we still
> Michael> talking about multimethods?
> Well, I started by talking about type categories and ways of
> writing programs that tested them.  Dave Abrahams said, in effect,
> that I was really just talking about multimethods.  I'm still
> not convinced.

Huh? That's certainly not what I thought I was saying. I was saying that a
reason I thought it was important to be able to test type categories (what
Guido calls "look before you leap") was for implementing multiple dispatch.
In other words, an idiom which most people agree is usually a bad choice
for user code might be a great choice for a generalized library or language
facility.

It's pretty hard to see how you could construe my remarks as asserting some
interpretation of what you were saying.

           David Abrahams * Boost Consulting *

From David Abrahams" <  Sat Aug 17 19:43:33 2002
From: David Abrahams" < (David Abrahams)
Date: Sat, 17 Aug 2002 14:43:33 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
References: <> <> <> <>
Message-ID: <012b01c2461e$259992e0$>

From: "Zack Weinberg" <>

> On Fri, Aug 16, 2002 at 12:38:24PM -0400, Andrew Koenig wrote:
> >
> > #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000, tbase=0x0,
> >     dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83
> Er, is the directory name misleading, or have you picked up
> from 3.1.1?  In theory that shouldn't be a problem; in
> practice it could well be the problem.

I'm still ploughing through several days of messages here (so this may have
been discussed already) but I have recently learned that despite the
existence of a "-V" option, it has long been impossible to correctly
install new versions of GCC on systems with existing versions without
using --prefix= to select a unique location. Why GCC's configure doesn't
issue a warning about this when you do it wrong, I don't know. The only
clue that this is going to be a problem was buried in a FAQ somewhere as of
three months ago.

           David Abrahams * Boost Consulting *

From  Sat Aug 17 20:16:15 2002
From: (Skip Montanaro)
Date: Sat, 17 Aug 2002 14:16:15 -0500
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <15710.41215.451302.851810@localhost.localdomain>

    Tim> I think you could (and should <wink>) get 95% of the benefit here
    Tim> by changing the sort_repr default to True.  I'm happy to say False
    Tim> when I know I've got unsortable keys.

Why not just get rid of sort_repr, always attempt to sort for printing, and
just discard the TypeError resulting from the attempted sort?


From  Sat Aug 17 21:00:12 2002
From: (Andrew Koenig)
Date: Sat, 17 Aug 2002 16:00:12 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <009a01c2461a$a82a3e70$>
References: <><><14c201c244b2$8cc651f0$><><> <> <009a01c2461a$a82a3e70$>
Message-ID: <>

ark> Well, I started by talking about type categories and ways of
ark> writing programs that tested them.  Dave Abrahams said, in
ark> effect, that I was really just talking about multimethods.  I'm
ark> still not convinced.

David> Huh? That's certainly not what I thought I was saying. I was
David> saying that a reason I thought it was important to be able to
David> test type categories (what Guido calls "look before you leap")
David> was for implementing multiple dispatch.  In other words, an
David> idiom which most people agree is usually a bad choice for user
David> code might be a great choice for a generalized library or
David> language facility.

David> It's pretty hard to see how you could construe my remarks as
David> asserting some interpretation of what you were saying.

It sure looked that way to me.

In any event, I can think of other contexts in which LBYL can be
useful.  To go back to Guido's example, I agree completely that
testing whether a file exists, and then opening it in a separate
operation, is a bad idea.  One reason is that by the time you get
around to opening the file, it may no longer exist, so the open
has to test anyway.

On the other hand, it does make sense to test whether what you have is
a valid file name as soon as you know that you are going to open the
file, even if you aren't going to open it for a while.  The principle
here is that when failure is certain, failing early is usually better
than failing late.
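
A minimal sketch of the distinction drawn above, assuming two
hypothetical helpers (check_file_name and read_config are illustrative
names, not anything from the thread): the name is validated as soon as
it is known, while existence is never tested separately from the open.

```python
def check_file_name(name):
    """Hypothetical early check: reject names that can never be opened."""
    if not isinstance(name, str) or not name:
        raise ValueError("file name must be a non-empty string")
    if "\0" in name:
        raise ValueError("file name must not contain a NUL character")
    return name

def read_config(name):
    """Open without a prior existence test; the open itself is the test."""
    try:
        with open(name) as f:
            return f.read()
    except OSError:
        return None  # missing or unreadable at the moment of use

# Fail early: a name that is certain to fail is rejected up front.
name = check_file_name("settings.cfg")
# Fail late only where it cannot be avoided: the file may vanish meanwhile.
contents = read_config("no-such-dir-for-sketch/missing.cfg")
```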

From  Sat Aug 17 20:43:13 2002
From: (David Abrahams)
Date: Sat, 17 Aug 2002 15:43:13 -0400
Subject: [Python-Dev] type categories
References: <><><14c201c244b2$8cc651f0$><><> <> <009a01c2461a$a82a3e70$> <>
Message-ID: <041501c24626$562063a0$>

From: "Andrew Koenig" <>

> David> It's pretty hard to see how you could construe my remarks as
> David> asserting some interpretation of what you were saying.
> It sure looked that way to me.

Maybe you were just reading fast and confused yourself with Mr. Alex
Martelli?

> In any event, I can think of other contexts in which LBYL can be
> useful.

Of course; I never meant to imply that multiple dispatch was the only
reason to LBYL; it just happens to be the most important one to me.

> To go back to Guido's example, I agree completely that
> testing whether a file exists, and then opening it in a separate
> operation, is a bad idea.  One reason is that by the time you get
> around to opening the file, it may no longer exist, so the open
> has to test anyway.
> On the other hand, it does make sense to test whether what you have is
> a valid file name as soon as you know that you are going to open the
> file, even if you aren't going to open it for a while.  The principle
> here is that when failure is certain, failing early is usually better
> than failing late.

Usually this goes to the same question we were discussing about
re-iterability detection. You want to fail early because it's faster, but
also because you don't want to mutate important program state in some
un-recoverable way in systems that are actually supposed to recover from
errors.

           David Abrahams * Boost Consulting *

From  Sat Aug 17 21:10:57 2002
From: (Guido van Rossum)
Date: Sat, 17 Aug 2002 16:10:57 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Sat, 17 Aug 2002 14:34:35 EDT."
References: <> <>
Message-ID: <>

> How about sorting with this comparison function:
> errors = (TypeError, ...?)
> def cmpval(x):
>     try:
>         cmp(0, x)
>     except errors:
>         try:
>             h = hash(x)
>         except errors:
>             h = -1
>         return (1, h, id(x))
>     return (0, x)
> def robust_cmp(a,b):
>     try:
>         return cmp(a,b)
>     except errors:
>         try:
>             return cmp(cmpval(a), cmpval(b))
>         except errors:
>             return 0
> >>> l=[3j, 2j, 4, 4j, 1, 2, 1j, 3]
> >>> l.sort(robust_cmp)
> >>> l
> [1, 2, 3, 4, 1j, 2j, 3j, 4j]
> It's equivalent to standard cmp if no errors are encountered. For lists
> containing uncomparable objects it produces a pretty consistent order.
> It's not perfect but should be good enough for aesthetic purposes.

Too convoluted.  Explicit is better than implicit.

--Guido van Rossum (home page:

From  Sat Aug 17 21:12:17 2002
From: (Guido van Rossum)
Date: Sat, 17 Aug 2002 16:12:17 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Sat, 17 Aug 2002 14:16:15 CDT."
References: <> <>
Message-ID: <>

> Why not just get rid of sort_repr, always attempt to sort for
> printing, and just discard the TypeError resulting from the
> attempted sort?

I don't like discarding TypeErrors.  Who knows what bug you're hiding?

I like Tim's suggestion just fine, and checked it in: let sort_repr
default to True.  Unsortable values aren't that common in Python.
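
The behaviour being agreed on here can be sketched as below. SketchSet
is a hypothetical stand-in written for illustration, not the class in
the proposed sets module; it only assumes a sort_repr flag that defaults
to True and lets a TypeError from sorting propagate instead of hiding it.

```python
class SketchSet:
    """Hypothetical stand-in, not the sets module's actual class."""

    def __init__(self, iterable=(), sort_repr=True):
        self._data = dict.fromkeys(iterable, True)
        self._sort_repr = sort_repr

    def __repr__(self):
        elements = list(self._data)
        if self._sort_repr:
            elements.sort()  # a TypeError here reaches the caller
        return "%s(%r)" % (self.__class__.__name__, elements)

print(repr(SketchSet([3, 1, 2])))                  # sorted for display
print(repr(SketchSet([2j, 1j], sort_repr=False)))  # unsortable, so opt out
```

With the default left at True, printing a set of complex numbers raises
TypeError, which is the cue to pass sort_repr=False explicitly.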

--Guido van Rossum (home page:

From  Sat Aug 17 21:14:47 2002
From: (Guido van Rossum)
Date: Sat, 17 Aug 2002 16:14:47 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Sat, 17 Aug 2002 14:16:15 CDT."
References: <> <>
Message-ID: <>

Does anybody have any *other* comments on the proposed sets module
besides convoluted suggestions on the sorted representation? :-)

Here's the source code for your web perusal:

--Guido van Rossum (home page:

From  Sat Aug 17 21:17:05 2002
From: (Andrew Koenig)
Date: 17 Aug 2002 16:17:05 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <041501c24626$562063a0$>
References: <>
Message-ID: <>

David> Usually this goes to the same question we were discussing about
David> re-iterability detection. You want to fail early because it's
David> faster, but also because you don't want to mutate important
David> program state in some un-recoverable way in systems that are
David> actually supposed to recover from errors.

And also because it is usually much easier to explain what went
wrong if the failure is detected early.

Andrew Koenig,,

From David Abrahams" <  Sat Aug 17 21:06:47 2002
From: David Abrahams" < (David Abrahams)
Date: Sat, 17 Aug 2002 16:06:47 -0400
Subject: [Python-Dev] type categories
References: <><><14c201c244b2$8cc651f0$><><> <> <009a01c2461a$a82a3e70$> <> <041501c24626$562063a0$>
Message-ID: <043101c24629$9f4ef2f0$>

From: "David Abrahams" <>

> From: "Andrew Koenig" <>
> > David> It's pretty hard to see how you could construe my remarks as
> > David> asserting some interpretation of what you were saying.
> >
> > It sure looked that way to me.
> Maybe you were just reading fast and confused yourself with Mr. Alex
> Martelli?

Oh, and just for reference to what I was talking about in the above
message, here's where Alex explains why he thinks LBYL is sometimes good:


From  Sat Aug 17 21:48:39 2002
From: (David Abrahams)
Date: Sat, 17 Aug 2002 16:48:39 -0400
Subject: [Python-Dev] Re: Multimethods (quel horreur?)
References: <003101c24525$6cfcab80$6d94fea9@newmexico>
Message-ID: <065b01c2462f$86053290$>

From: "Samuele Pedroni" <>

> [f,g,... are functions ;  T1,T2,T3 are type tuples a.k.a multi-method
> signatures, T1<T2 means the corresponding elements
> are "subtypes" ]
> 1. if I see things correctly, you are really fiddling with a multi-method
> M = { (f,T1), (g,T2), ... }from a library
> only if you do e.g.
> add(M,(h,T3))  with T3==T2 (*) or T1<T3<T2.
> [Things one typically does not do without thinking
> twice]

Let's see if I can understand this in plain English. Hmm, maybe I shouldn't
try. You've mixed hard-core formalism and lax informalism here. What does
"fiddling" mean?

Against my better judgement, I'll try anyway. I think you're saying:

    "The only way you're changing the behavior of a multimethod [what I
think you mean by "fiddling"] is if

    1. the type signature of the implementation you add duplicates one
       that's already been added,

    or

    2. it can match all the same types as another one (g,T2), and for
       which there's another implementation (f,T1) which can match all
       the same types as the new implementation. In other words, it
       falls between two other signatures in specificity."

Well, I still don't get it. I clearly don't know what "fiddling" means,
since any added signature can change the behavior of the multimethod. I
think I would be inclined to forbid your first case, where you're adding a
multimethod implementation whose signature exactly matches another one
that's already in the multimethod.

> [Btw up to (*), missing a module or calling
> a generic function before all modules are loaded,
> load order does not count.]

The above sentence is completely incomprehensible to me.

> If T3 is < or incomparable with all the signatures in M,
> is not fiddling.

Okay, you're homing in on a formal definition of "fiddling" here, but I
still don't know what its significance is.

> Incomparable can mean dispatch ambiguities
> but that's a different issue.

I guess so, by definition (yours <wink>).

> import lib
> class Sub(lib.Class1):
>   ...
> class New:
>  ...
> addmethod Sub,y: Sub):
>   ...
> addmethod lib.meth(arg: Sub):
>   ...
> is no different than defining new classes and subclassing and
> overriding methods. Also the kind of resulting program
> logic scattering is not that different under normal usage.

I agree.

> 2. Dispatch ambiguities: the more predictable the rule
> the better,


> the best-fit of multimethod does not match
> such a criterion, see my previous postings.
> Rules for CLOS:
> (NB things are eminently configurable in CLOS)
> Rules for Dylan:

I'm away from the internet at the moment, so I can't look...

> The class precedence list is the same notion
> as Python 2.2 mro.
> See my posted code for the idea of redispatching
> on forced types, which seems to me reasonably Pythonic
> and allows OTOH to choose a very strict approach
> in face of ambiguity because there's anyway a user
> controllable escape.

Could you please explain your scheme in plain English?
What is a "forced type"?

> My opinion: left-to-right and refuse ambiguity
> are depending on the generic function both
> reasonable approaches.

I assume that "left-to-right" is some kind of precedence ordering for
ambiguous multimethod implementations. Can you give an example where that
would be appropriate?

> 3. IMO documentation, doc strings, and introspection
> should be enough to tell generic functions apart.


> The proposed notation or whatever should be at most
>  just syntax sugar:
> (a,b,c).f(d) === f(a,b,c,d) in general.

All notations (except really ugly ones) are syntax sugar. What point are
you trying to make?

> Generic functions should be first-class object
> that can be passed around and used everywhere
> functions can be. Already today f(a,b)
> can invoke a function, a callable instance
> (maybe implementing some multimethod logic given that
> one can write multimethod support in pure Python),
> a bound method ...

of course.

> 4.  AFAIK multimethods were invented in environments
> with code developed and defined in memory incrementally
> and libraries loaded, that means that through introspection
> one could list the methods of a generic functions and jump
> to the various definitions points, and warnings could
> be issued for redefinitions and such (see 1.).
> For more static approaches to introspection
> syntactic sugar would be probably useful:
> defgeneric addmethod vs. f.add(...)

I don't understand why you keep applying the term "generic" here. Ordinary
python functions are already as generic as you can get. Multimethod
implementations are less-so.

> Of course a Python impl could also
> optionally emit warnings etc, this requiring
> the good practice to load library code before
> user code, and being maximally useful in
> an incremental environment.

of course.

> 5. It is true that once you have multimethods you have
> the choice:
> class C:
>   def meth(...): ...
> vs.
> class C: ...
> defgeneric meth
> addmethod meth(obj: C): ...

It now looks like you were trying to say in 3. that multimethods should be
invokable and definable in the same way as single-methods. Well, I'm not
sure that this elegant idea from Dylan is essential or even necessarily
good for Python. It's cute, but what does it really buy us?

> 6. It is true that generic functions kind of add
> a new degree of freedom to the modularity problem
> space.


> PS: this is my input together with what I have already
> posted, if something is unclear please ask.


           David Abrahams * Boost Consulting *

From  Sat Aug 17 22:47:16 2002
From: (Samuele Pedroni)
Date: Sat, 17 Aug 2002 23:47:16 +0200
Subject: [Python-Dev] Re: Multimethods (quelle horreur?)
Message-ID: <013301c24637$a7d9a740$6d94fea9@newmexico>

[generic function (abbreviated gf) is the used terminology
(CLOS, Dylan, goo, research papers) for
a multidispatching function,  and yes
normal Common Lisp functions are as generic
as you can get, still that's the terminology]
the question was whether
adding a method to a gf 
is always the moral equivalent of
class A:
   def meth(self,...): ...

class B(A): ...

class C(B):
   def meth(self,...): ...

from lib import A
def foo(...): ...
A.meth=foo

> Well, I still don't get it. I clearly don't know what "fiddling" means,
> since any added signature can change the behavior of the multimethod. I
> think I would be inclined to forbid your first case, where you're adding a
> multimethod implementation whose signature exactly matches another one
> that's already in the multimethod.

The point is whether the behavior is changed in an undetected way
with respect to sets of arguments for which some matching signature/
method is already defined. So my conditions

 add(M,(h,T3))  with T3==T2 (*) or T1<T3<T2.

(assuming that T3==T2 triggers substitution).

[T3==T2 case corresponds to the above

 A.meth = foo

T1<T3<T2 corresponds to the single dispatch case:

 from lib import B
 B.meth=... ]

If T3 is < or uncomparable with all the signatures
already in M:

 - you are doing the moral equivalent of overriding
   in the single dispatch case
 - or you are defining the gf for some unrelated
   class hierarchies
 - or some case that was unambiguous
   will become ambiguous and the outcome
   will depend on the rules you choose to
   deal with ambiguity (which is a general
   problem with multidispatching).

> > [Btw up to (*), missing a module or calling
> > a generic function before all modules are loaded,
> > load order does not count.]
> The above sentence is completely incomprehensible to me.

Dispatching outcomes are invariant wrt
the order by which you add gf-methods to a gf.

> > import lib
> >
> > class Sub(lib.Class1):
> >   ...
> >
> > class New:
> >  ...
> >
> > addmethod Sub,y: Sub):
> >   ...
> >
> > addmethod lib.meth(arg: Sub):
> >   ...
> >
> > is no different than defining new classes and subclassing and
> > overriding methods. Also the kind of resulting program
> > logic scattering is not that different under normal usage.
> I agree.

this was just an example of the incomprehensible theory
above <wink>. So we agree.

> > See my posted code for the idea of redispatching
> > on forced types, which seems to me reasonably Pythonic
> > and allows OTOH to choose a very strict approach
> > in face of ambiguity because there's anyway a user
> > controllable escape.
> Could you please explain your scheme in plain English?
> What is a "forced type"?

the idea is to allow optionally to specify together
with an argument a supertype of the argument and to have
the dispatching mechanism use the supertype instead
of the type of the argument for dispatching:

 gf(a,b,_redispatch=(None,SuperTypeOf_b))

the dispatch mechanism will consider the
tuple (type(a),SuperTypeOf_b) instead
of (type(a),type(b)) for dispatching.

This is the moral equivalent of
single dispatching:

 SuperTypeOf_b.meth(b)

or super(SuperTypeOf_b).meth(b)

and can be used as a kind of "super" mechanism
or more interestingly to disambiguate ambiguous calls
on the user's behalf.

> > My opinion: left-to-right and refuse ambiguity
> > are depending on the generic function both
> > reasonable approaches.
> I assume that "left-to-right" is some kind of precedence ordering for
> ambiguous multimethod implementations. Can you give an example where that
> would be appropriate?

is the default used by CLOS, that simply means that
signature (type tuples) are compared using the lexico-order,
given

   class A: pass
   class B(A): pass

then (B,A)<(A,B)

you get the same effect as multidispatching
simulated through chained single dispatching.

> > The proposed notation or whatever should be at most
> >  just syntax sugar:
> >
> > (a,b,c).f(d) === f(a,b,c,d) in general.
> All notations (except really ugly ones) are syntax sugar. What point are
> you trying to make?
what I was trying to convey is that it would
be bad to have multimethod invocation be a special
operation different from the usual function invocation
(which currently is at work also for method invocation).
> > 5. It is true that once you have multimethods you have
> > the choice:
> >
> > class C:
> >   def meth(...): ...
> >
> > vs.
> >
> > class C: ...
> >
> > defgeneric meth
> >
> > addmethod meth(obj: C): ...
> It now looks like you were trying to say in 3. that multimethods should be
> invokable and definable in the same way as single-methods.

no, I was saying that at most

 (a,).gf(b) should just be equivalent to gf(a,b) (*)

but plain a.meth() should still mean what it means today.
But honestly I find (*) unnecessary and ugly.
I'm not even sure one can really disambiguate
such syntax:

 (a,b).__contains__(2)
 (a,).__contains__(2)

are valid Python.

> Well, I'm not
> sure that this elegant idea from Dylan is essential or even necessarily
> good for Python. It's cute, but what does it really buy us?

nothing. My point is that once we have multimethods, 
one has the choice, one can
define classes without normal methods and just use multimethods instead,

 I was not advocating that C().meth() should be equivalent
 to meth(C()) in every respect and that the method definitions
 inside the class definitions should trigger gf-method
 definitions. It would be a disruptive change for Python.

So we agree. 

Multidispatching functions should be an extension of the notion
of function in Python not of class methods. OTOH
class methods are defined by defining functions
inside class namespaces, so it should be possible 
to get class methods also from gfs defined
in a class namespace.
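
The left-to-right rule mentioned earlier in this message can be sketched
in pure Python along these lines. GenericFunction and its add() method
are hypothetical names used only for illustration, not the code posted
earlier in the thread; the sketch assumes ordinary classes whose
signatures appear in the MRO of the argument types.

```python
class GenericFunction:
    """Hypothetical pure-Python multimethod; names are illustrative only."""

    def __init__(self, name):
        self.name = name
        self.methods = {}  # signature (tuple of classes) -> implementation

    def add(self, signature, func):
        self.methods[tuple(signature)] = func

    def _applicable(self, types):
        return [sig for sig in self.methods
                if len(sig) == len(types)
                and all(issubclass(t, s) for t, s in zip(types, sig))]

    def __call__(self, *args):
        types = tuple(type(a) for a in args)
        candidates = self._applicable(types)
        if not candidates:
            raise TypeError("no applicable method for %r" % (types,))

        # Left-to-right rule: rank each signature by the position of its
        # classes in the MRO of the matching argument type, and compare
        # the ranks lexicographically so earlier arguments dominate.
        def rank(sig):
            return tuple(t.__mro__.index(s) for t, s in zip(types, sig))

        best = min(candidates, key=rank)
        return self.methods[best](*args)


class A: pass
class B(A): pass

gf = GenericFunction("gf")
gf.add((B, A), lambda x, y: "BA")
gf.add((A, B), lambda x, y: "AB")

# Both signatures apply to (B, B); (B, A) wins because (B,A) < (A,B).
result = gf(B(), B())
```

Because the ranking depends only on the registered signatures and the
argument types, the outcome does not depend on the order of the add()
calls.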

From David Abrahams" <  Sat Aug 17 22:57:39 2002
From: David Abrahams" < (David Abrahams)
Date: Sat, 17 Aug 2002 17:57:39 -0400
Subject: [Python-Dev] Re: Multimethods (quelle horreur?)
References: <013301c24637$a7d9a740$6d94fea9@newmexico>
Message-ID: <06f301c24639$1cee88b0$>

From: "Samuele Pedroni" <>

> the question was whether
> adding a method to a gf
> is always the moral equivalent of
> class A:
>    def meth(self,...): ...
>  class B(A): ...
>  class C(B):
>    def meth(self,...): ...
>  from lib import A
>  def foo(...): ...
>  A.meth=foo

And the answer is, "clearly not always".

> > Well, I still don't get it. I clearly don't know what "fiddling" means,
> > since any added signature can change the behavior of the multimethod. I
> > think I would be inclined to forbid your first case, where you're adding a
> > multimethod implementation whose signature exactly matches another one
> > that's already in the multimethod.
>  The point is whether the behavior is changed in an undetected way
> with respect to sets of arguments for which some matching signature/
> method is already defined. So my conditions
>  add(M,(h,T3))  with T3==T2 (*) or T1<T3<T2.
>  (assuming that T3==T2 triggers substitution) .
>  [T3==T2 case corresponds to the above
>  A.meth = foo

I hope you are covering this case just for generality's sake. It's easy
enough to forbid.

>  T1<T3<T2 correspond to the single dispatch case:
>  from lib import B
>  B.meth=... ]

I don't understand why you're using such a complicated condition; you can
change the behavior "in an undetected way WRT sets of arguments for which
some matching signature/method is already defined" simply by adding a
signature T4 s.t. T4 < X for some X in the signatures of the multimethod.

>  If T3 is < or uncomparable with all the signatures
>  already in M:
>  - you are doing the moral equivalent of overriding
>  in the single dispatch case

Sort of. You might not be the same person that supplies the types in T3.

>  - or you are defining the gf for some unrelated
>  class hierarchies


>  - or some case that was unambiguous
>  will become ambiguous and the outcome
>  will depend on the rules you choose to
>  deal with ambiguity (which is a general
>  problem with multidispatching).


> > > [Btw up to (*), missing a module or calling
> > > a generic function before all modules are loaded,
> > > load order does not count.]
> > The above sentence is completely incomprehensible to me.
>  Dispatching outcomes are invariant wrt
> the order by which you add gf-methods to a gf.

It depends on your dispatching rules, of course. However, I'd like to pick
order-independent rules.

> > > See my posted code for the idea of redispatching
> > > on forced types, which seems to me reasonably Pythonic
> > > and allows OTOH to choose a very strict approach
> > > in face of ambiguity because there's anyway a user
> > > controllable escape.
> >
> > Could you please explain your scheme in plain English?
> > What is a "forced type"?
>  the idea is to allow optionally to specify together
> with an argument a supertype of the argument and to have
> the dispatching mechanism use the supertype instead
> of the type of the argument for dispatching:
> gf(a,b,_redispatch=(None,SuperTypeOf_b))
> the dispatch mechanism will consider the
> tuple (type(a),SuperTypeOf_b) instead
>  of (type(a),type(b)) for dispatching.

More "sugarily:"

    gf(a, dispatch_as(b, SuperTypeOf_b))

Interesting. Not sure how I feel about this.
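
A rough sketch of how such a wrapper might behave, purely as an
assumption for illustration: dispatch_as and call_generic below are
hypothetical, and the method table is a plain dict from signatures to
functions rather than anything from Samuele's posted code.

```python
class dispatch_as:
    """Hypothetical wrapper: a value plus the type to dispatch on."""

    def __init__(self, value, as_type):
        if not isinstance(value, as_type):
            raise TypeError("%r is not an instance of %s"
                            % (value, as_type.__name__))
        self.value = value
        self.as_type = as_type


def call_generic(methods, *args):
    """Pick an implementation from {signature: func}, honoring dispatch_as."""
    types, values = [], []
    for arg in args:
        if isinstance(arg, dispatch_as):
            types.append(arg.as_type)   # forced type replaces type(arg)
            values.append(arg.value)
        else:
            types.append(type(arg))
            values.append(arg)
    applicable = [sig for sig in methods
                  if len(sig) == len(types)
                  and all(issubclass(t, s) for t, s in zip(types, sig))]
    if not applicable:
        raise TypeError("no applicable method")
    # Most specific first, comparing argument positions left to right.
    best = min(applicable,
               key=lambda sig: tuple(t.__mro__.index(s)
                                     for t, s in zip(types, sig)))
    return methods[best](*values)


class Shape: pass
class Square(Shape): pass

describe = {
    (Shape,): lambda s: "some shape",
    (Square,): lambda s: "a square",
}

normal = call_generic(describe, Square())
forced = call_generic(describe, dispatch_as(Square(), Shape))
```

Here normal selects the (Square,) implementation, while forced selects
the (Shape,) one even though the argument is still a Square instance.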

>  This is the moral equivalent of
>  single dispatching:
>  SuperTypeOf_b.meth(b)
>  or super(SuperTypeOf_b).meth(b)

Hmm. OK, I see the analogy. I hardly ever have to do that even in the
single case, but I get what you're up to.

> > > My opinion: left-to-right and refuse ambiguity
> > > are depending on the generic function both
> > > reasonable approaches.
> > I assume that "left-to-right" is some kind of precedence ordering for
> > ambiguous multimethod implementations. Can you give an example where that
> > would be appropriate?
> is the default used by CLOS, that simply means that
> signature (type tuples) are compared using the lexico-order,
> given
>    class A: pass
>    class B(A): pass
>  then (B,A)<(A,B)
>  you get the same effect as multidispatching
>  simulated through chained single dispatching.

That seems a bit arbitrary, but I guess there are other precedents in
Python for an arbitrary ordering (e.g. ordering on type names for
heterogeneous object comparison).

> > > The proposed notation or whatever should be at most
> > >  just syntax sugar:
> > >
> > > (a,b,c).f(d) === f(a,b,c,d) in general.
> >
> > All notations (except really ugly ones) are syntax sugar. What point are
> > you trying to make?
> what I was trying to convey is that it would
> be bad to have multimethod invocation be a special
> operation different from the usual function invocation
> (which currently is at work also for method invocation).


> > It now looks like you were trying to say in 3. that multimethods should be
> > invokable and definable in the same way as single-methods.
>  no, I was saying that at most
>  (a,).gf(b) should just be equivalent to gf(a,b) (*)
>  but  plain a.meth() should still mean what it means today.
>  But honestly I find (*) unnecessary and ugly.
> I'm not even sure one can really disambiguate
> such syntax:
>  (a,b).__contains__(2)
>  (a,).__contains__(2)
>  are valid Python.
> > Well, I'm not
> > sure that this elegant idea from Dylan is essential or even necessarily
> > good for Python. It's cute, but what does it really buy us?
> nothing. My point is that once we have multimethods,
> one has the choice, one can
> define classes without normal methods and just use multimethods instead,

There's the (minor?) issue of access to "private" members whose names begin
with two underscores.

>  I was not advocating that C().meth() should be equivalent
>  to meth(C()) in every respect and that the method definitions
>  inside the class definitions should triggers gf-method
>  definitions. It would be a disruptive change for Python.
> So we agree.


> Multidispatching functions should be an extension of the notion
> of function in Python not of class methods. OTOH
> class methods are defined by defining functions
> inside class namespaces, so it should be possible
> to get class methods also from gfs defined
> in a class namespace.

Yes, I strongly agree.


From  Sun Aug 18 00:01:18 2002
From: (Samuele Pedroni)
Date: Sun, 18 Aug 2002 01:01:18 +0200
Subject: [Python-Dev] Re: Multimethods (quelle horreur?)
References: <013301c24637$a7d9a740$6d94fea9@newmexico> <06f301c24639$1cee88b0$>
Message-ID: <014901c24641$ff060b80$6d94fea9@newmexico>

From: David Abrahams <>
> >  [T3==T2 case corresponds to the above
> >
> >  A.meth = foo
> I hope you are covering this case just for generality's sake. It's easy
> enough to forbid.

Yes, but Python is a dynamic language, you should allow
for redefinition in some way.

> >  the idea is to allow optionally to specify together
> > with an argument a supertype of the argument and to have
> > the dispatching mechanism use the supertype instead
> > of the type of the argument for dispatching:
> >
> > gf(a,b,_redispatch=(None,SuperTypeOf_b))
> >
> > the dispatch mechanism will consider the
> > tuple (type(a),SuperTypeOf_b) instead
> >  of (type(a),type(b)) for dispatching.
> More "sugarily:"
>     gf(a, dispatch_as(b, SuperTypeOf_b))
> Interesting. Not sure how I feel about this.
> >  This is the moral equivalent of
> >  single dispatching:
> >
> >  SuperTypeOf_b.meth(b)
> >
> >  or super(SuperTypeOf_b).meth(b)
> Hmm. OK, I see the analogy. I hardly ever have to do that even in the
> single case, but I get what you're up to.

more than as a super, it can be useful if you are picky
about ambiguities.

> >  you get the same effect as multidispatching
> >  simulated through chained single dispatching.
> That seems a bit arbitrary, but I guess there are other precedents in
> Python for an arbitrary ordering (e.g. ordering on type names for
> heterogeneous object comparison).

Yes, e.g. the mro used for single dispatch in case of multiple

The other option is to totally refuse ambiguity.
Which is also reasonable.

Honestly there is no big agreement about whether
automatically solving ambiguities is a good thing in
general (even for the single dispatch case).

See for a short opinionated survey:

The Cecil Language
Specification and Rationale

Craig Chambers

2.7 Method lookup
2.7.1 Philosophy

The predictability depends not only on
 the rule but also on the
complexities of the class hierarchies at hand,
especially in the presence of multiple inheritance.

For the stuff  I tried with my multimethods
impl, it seemed that the CLOS rule made
sense, and getting ambiguities seemed
more an impediment. As I said it corresponds
to chained single dispatch and that was basically
what I needed in a sweeter way.

Anyway this can be made configurable
for the single gf.
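
The "refuse ambiguity" option can be sketched as follows. most_specific
is a hypothetical helper, assumed here only to illustrate the rule: a
signature wins only if it is at least as specific as every other
applicable signature in every argument position.

```python
def most_specific(signatures, types):
    """Return the unique best signature for types, or raise TypeError."""
    applicable = [sig for sig in signatures
                  if len(sig) == len(types)
                  and all(issubclass(t, s) for t, s in zip(types, sig))]
    if not applicable:
        raise TypeError("no applicable method")

    def at_least_as_specific(s1, s2):
        # Pointwise subtype test: no argument position is favoured.
        return all(issubclass(a, b) for a, b in zip(s1, s2))

    winners = [sig for sig in applicable
               if all(at_least_as_specific(sig, other)
                      for other in applicable)]
    if len(winners) != 1:
        raise TypeError("ambiguous call: %r" % (applicable,))
    return winners[0]


class A: pass
class B(A): pass

# (B, A) and (A, B) both apply to (B, B) and neither dominates: refused.
try:
    most_specific([(B, A), (A, B)], (B, B))
    refused = False
except TypeError:
    refused = True

# Adding an exact (B, B) signature resolves the ambiguity.
chosen = most_specific([(B, A), (A, B), (B, B)], (B, B))
```

Under the CLOS-style left-to-right rule the first call would instead
select (B, A); which behaviour is preferable is the open question
discussed above.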


From  Sun Aug 18 00:45:12 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 17 Aug 2002 19:45:12 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

I felt comfortable, or at least I think so, with all the contents of the
message.  All described compromises, not repeated here, seemed reasonable
to me.  Except maybe for the following:

> - The set constructors have an optional second argument, sort_repr,
>   defaulting to False, which decides whether the elements are sorted
>   when str() or repr() is taken.  I'm not sure if there would be
>   negative consequences of removing this argument and always sorting
>   the string representation.

Unless there is something deep attached to the properties of the sets
themselves, I do not understand why the sorting/non-sorting virtues of
`repr' should be tied with the constructor.

There is a precedent with dicts.  They print non-sorted, but they
pretty-print (through the `pprint' module) sorted.  Maybe the same could
be done for sets: use `pprint' if you want a sorted representation.
But otherwise, sets as well as dicts should print using the same order
by which elements are to be iterated upon or listed, in various other
circumstances.
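
A small sketch of the dict precedent, written for a current Python 3
rather than the 2.3-era interpreter under discussion. It assumes only
the standard pprint module; the sample data is invented. For sets, the
sketch uses sorted() explicitly instead of relying on any printing
behaviour.

```python
import pprint

d = {"pear": 1, "apple": 2, "mango": 3}

plain = repr(d)              # follows the dict's own iteration order
pretty = pprint.pformat(d)   # pprint sorts dict keys by default

s = {"pear", "apple", "mango"}
sorted_elements = sorted(s)  # explicit sorting when a stable order is wanted
```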

François Pinard

From  Sun Aug 18 00:49:45 2002
From: (Aahz)
Date: Sat, 17 Aug 2002 19:49:45 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sat, Aug 17, 2002, François Pinard wrote:
> [Guido van Rossum]
>> - The set constructors have an optional second argument, sort_repr,
>>   defaulting to False, which decides whether the elements are sorted
>>   when str() or repr() is taken.  I'm not sure if there would be
>>   negative consequences of removing this argument and always sorting
>>   the string representation.
> Unless there is something deep attached to the properties of the sets
> themselves, I do not understand why the sorting/non-sorting virtues of
> `repr' should be tied with the constructor.
> There is a precedent with dicts.  They print non-sorted, but they
> pretty-print (through the `pprint' module) sorted.  Maybe the same could
> be done for sets: use `pprint' if you want a sorted representation.
> But otherwise, sets as well as dicts should print using the same order
> by which elements are to be iterated upon or listed, in various other
> circumstances.

Aahz (           <*>

Project Vote Smart:

From  Sun Aug 18 05:11:31 2002
From: (Zack Weinberg)
Date: Sat, 17 Aug 2002 21:11:31 -0700
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
In-Reply-To: <012b01c2461e$259992e0$>
References: <> <> <> <> <012b01c2461e$259992e0$>
Message-ID: <>

On Sat, Aug 17, 2002 at 02:43:33PM -0400, David Abrahams wrote:
> From: "Zack Weinberg" <>
> > On Fri, Aug 16, 2002 at 12:38:24PM -0400, Andrew Koenig wrote:
> > >
> > > #0  __register_frame_info_bases (begin=0xfed50000, ob=0xfed50000,
> > >     tbase=0x0, dbase=0x0) at /tmp/build1165/gcc-3.1.1/gcc/unwind-dw2-fde.c:83
> >
> > Er, is the directory name misleading, or have you picked up
> > from 3.1.1?  In theory that shouldn't be a problem; in
> > practice it could well be the problem.
> I'm still ploughing through several days of messages here (so this may have
> been discussed already) but I have recently learned that despite the
> existence of a "-V" option, it has long been impossible to correctly
> install new versions of GCC on systems with existing versions without
> using --prefix= to select a unique location. Why GCC's configure doesn't
> issue a warning about this when you do it wrong, I don't know. The only
> clue that this is going to be a problem was buried in a FAQ somewhere as of
> three months ago.

I believe that this has been addressed in the current documentation --
the INSTALL file clearly states that multiple versions shouldn't be
installed in the same prefix, and the -V option (which never worked
properly) has been removed.  Would you mind reading the docs shipped
with 3.2, and reporting any remaining confusion as a bug?

(note that we're not going to add a configure check as you suggest,
because there are conditions where it's safe, and we don't want to
make life harder for people who really do want that.)


From  Sun Aug 18 05:11:08 2002
From: (David Abrahams)
Date: Sun, 18 Aug 2002 00:11:08 -0400
Subject: [Python-Dev] Python build trouble with the new gcc/binutils
References: <> <> <> <> <012b01c2461e$259992e0$> <>
Message-ID: <075501c2466d$4b2041e0$>

From: "Zack Weinberg" <>
To: "David Abrahams" <>

> > I'm still ploughing through several days of messages here (so this may have
> > been discussed already) but I have recently learned that despite the
> > existence of a "-V" option, it has long been impossible to correctly
> > install new versions of GCC on systems with existing versions without
> > using --prefix= to select a unique location. Why GCC's configure doesn't
> > issue a warning about this when you do it wrong, I don't know. The only
> > clue that this is going to be a problem was buried in a FAQ somewhere as of
> > three months ago.
> I believe that this has been addressed in the current documentation --
> the INSTALL file clearly states that multiple versions shouldn't be
> installed in the same prefix, and the -V option (which never worked
> properly) has been removed.  Would you mind reading the docs shipped
> with 3.2, and reporting any remaining confusion as a bug?

The *only* place I have ever looked for documentation about how to install
was here:

...and I still see nothing about this issue. There should be a prominent,
eye-catching warning about this, for people like me who have gotten used to
the procedure and have been doing it wrong for a while without knowing

> (note that we're not going to add a configure check as you suggest,
> because there are conditions where it's safe, and we don't want to
> make life harder for people who really do want that.)

Optimizing for the 1% case? Seems like a bad choice to me. How much more
difficult could it be for those 1%-ers if you added a warning?


           David Abrahams * Boost Consulting *

From  Sun Aug 18 05:50:30 2002
From: (Oren Tirosh)
Date: Sun, 18 Aug 2002 00:50:30 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <009a01c2461a$a82a3e70$>
References: <> <009a01c2461a$a82a3e70$>
Message-ID: <>

On Sat, Aug 17, 2002 at 02:18:56PM -0400, David Abrahams wrote:
> Huh? That's certainly not what I thought I was saying. I was saying that a
> reason I thought it was important to be able to test type categories (what
> Guido calls "look before you leap") was for implementing multiple dispatch.
> In other words, an idiom which most people agree is usually a bad choice
> for user code might be a great choice for a generalized library or language
> facility.

Multiple dispatch is one possible use for testing type categories. Another
important use is early detection of errors and more informative error 
messages.  When using a library the errors resulting from a bad argument 
to one of its published entry points are often raised deep within someone 
else's code with an uninformative message, making them hard to trace. If 
an object of the incorrect category is passed to a method that just stores 
a reference to it the actual error may only be raised much later, making it
even harder to trace.  There is also the issue of trusting someone else's 
code - maybe it's a bug in the library? With explicit category checks it's 
easier to tell the source of the problem.
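
A minimal sketch of such an entry-point check, assuming a made-up
Recorder class that merely stores a reference to a file-like object and
uses it later. The class, its method names and the error text are
hypothetical; the check itself is a plain duck-typing test, not a
proposed category API.

```python
import io


class Recorder(object):
    """Stores a sink now and writes to it later (hypothetical example)."""

    def __init__(self, sink):
        # Check the "writable" category at the entry point, while the
        # caller's mistake is still on the stack.
        if not callable(getattr(sink, "write", None)):
            raise TypeError("sink must provide a write() method, got %s"
                            % type(sink).__name__)
        self.sink = sink

    def flush_events(self, events):
        # Without the check above, a bad sink would only fail here,
        # far from the call that passed it in.
        for event in events:
            self.sink.write(event + "\n")


buffer = io.StringIO()
recorder = Recorder(buffer)
recorder.flush_events(["start", "stop"])
```

Passing an integer to Recorder() fails immediately with a message naming
the offending type, instead of an AttributeError raised much later
inside flush_events().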

The extreme form of early detection is static typing, of course. Forcing
category checks on all arguments passed is too much overhead for me.
I prefer explicit checks for protocol compliance at some well-defined
interface points between different domains of code. When an exception is
raised the region of uncertainty about its real source can sometimes be
quite large. Category checks can serve as a kind of fire door to try to
limit the spread of uncertainty. The problem with putting too many fire 
doors is that they hinder passage because they must be kept closed at all
times.


From  Sun Aug 18 10:59:03 2002
From: (Samuele Pedroni)
Date: Sun, 18 Aug 2002 11:59:03 +0200
Subject: [Python-Dev] Re: multimethods (quelle horreur?) (clarification)
Message-ID: <002401c2469d$e23aa860$6d94fea9@newmexico>

[David Abrahams]
>> gf(a,b,_redispatch=(None,SuperTypeOf_b))
>> the dispatch mechanism will consider the
>> tuple (type(a),SuperTypeOf_b) instead
>>  of (type(a),type(b)) for dispatching.
>More "sugarily:"
>    gf(a, dispatch_as(b, SuperTypeOf_b))
>Interesting. Not sure how I feel about this.
>>  This is the moral equivalent of
>>  single dispatching:
>>  SuperTypeOf_b.meth(b)

what I described above corresponds
to this.

>>  or super(SuperTypeOf_b).meth(b)

to be equivalent to that,

gf(a, dispatch_as(b, SuperTypeOf_b))

would have to consider
SuperTypeOf_b, but together with
the order induced by type(b).mro().
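
The difference between the two readings can be sketched for a single
argument as follows. Everything here is hypothetical: dispatch_as is
modelled as a small wrapper class, lookup_order and first_match are
illustration helpers, and the lookup table stands in for a real
multimethod registry.

```python
class dispatch_as(object):
    """Ask the dispatcher to treat obj as an instance of cls (hypothetical)."""

    def __init__(self, obj, cls):
        if not isinstance(obj, cls):
            raise TypeError("obj must be an instance of cls")
        self.obj = obj
        self.cls = cls


def lookup_order(arg, super_like=False):
    """Return the classes a dispatcher would try for one argument."""
    if not isinstance(arg, dispatch_as):
        return type(arg).__mro__
    if super_like:
        # Keep the order induced by type(obj).mro(), starting at cls.
        mro = type(arg.obj).__mro__
        return mro[mro.index(arg.cls):]
    # Plain redispatch: pretend the argument's type is cls.
    return arg.cls.__mro__


def first_match(table, classes):
    """Pick the first class in lookup order that has a registered method."""
    for cls in classes:
        if cls in table:
            return table[cls]
    raise TypeError("no applicable method")


class Base(object):
    pass


class Left(Base):
    pass


class Right(Base):
    pass


class Both(Left, Right):
    pass


table = {Right: "right", Base: "base"}
b = Both()
plain = first_match(table, lookup_order(dispatch_as(b, Left)))
like_super = first_match(
    table, lookup_order(dispatch_as(b, Left), super_like=True))
```

In this diamond, the plain reading walks Left's own MRO and lands on the
Base method, while the super-like reading follows Both's MRO from Left
onwards and finds the Right method first.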


From  Sun Aug 18 13:00:59 2002
From: (Skip Montanaro)
Date: Sun, 18 Aug 2002 07:00:59 -0500
Subject: [Python-Dev] Weekly Python Bug/Patch Summary
Message-ID: <>

Bug/Patch Summary

267 open / 2767 total bugs (-1)
106 open / 1653 total patches (-12)

New Bugs

python-mode.el disables raw_input() (2002-08-13)
printing email object deletes whitespace (2002-08-13)
test_nis test fails on TRU64 5.1 (2002-08-14)
inspect.getsource shows incorrect output (2002-08-14)
Support for masks in getargs.c (2002-08-14)
AESend on Jaguar (2002-08-14)
asynchat problems multi-threaded (2002-08-14)
string method bugs w/ 8bit, unicode args (2002-08-14)
pythonw has a console on Win98 (2002-08-15)
file (& socket) I/O are not thread safe (2002-08-15)
Get rid of FutureWarnings in Carbon (2002-08-15)
IDLE/Command Line Output Differ (2002-08-15)
pickle_complex in (2002-08-15)
popenN return only text mode pipes (2002-08-16)
build dumps core (binutils 2.13/solaris) (2002-08-17)
textwrap has problems wrapping hyphens (2002-08-17)
NetBSD 1.4.3, a.out, shared modules (2002-08-17)

New Patches

PEP 277: Unicode file name support (2002-08-12)
turtle tracer bugfixes and new functions (2002-08-14)
Update environ for (2002-08-15)
urllib.splituser(): '@' in usrname (2002-08-17)

Closed Bugs

Replace strcat, strcpy (2001-11-30)
whichdb lies about db type (2001-12-11)
pickle interns strings (2002-01-11)
pydoc doesn't show C types (2002-04-25)
list(xrange(1e9))  -->  seg fault (2002-05-14)
illegal use of malloc/free (2002-05-16)
could do email obfuscation (2002-05-19)
pydoc(.org) does not find file.flush() (2002-06-26)
convert_path fails with empty pathname (2002-06-26)
multiple inheritance w/ slots dumps core (2002-06-28)
System Error with slots and multi-inh (2002-07-05)
pickle error message unhelpful (2002-07-16)
Empty genindex.html pages (2002-07-26)
pydoc -w fails with path specified (2002-07-26)
Bug with deepcopy and new style objects (2002-08-08)
u'%c' % large value: broken result (2002-08-10)

Closed Patches

Changing the preferences mechanism (2001-10-18)
Distutils -- set runtime library path (2001-11-26)
Make more friendly to PDAs (2001-12-20)
Remove eval in pickle and cPickle (2002-01-19)
suppress type restrictions on locals() (2002-01-31)
bug in pydoc on python 2.2 release (2002-02-07)
help asyncore recover from repr() probs (2002-03-25)
Set softspace to 0 in raw_input() (2002-04-29)
Karatsuba multiplication (2002-05-24)
Better token-related error messages (2002-07-25)
alternative SET_LINENO killer (2002-07-29)
_locale library patch (2002-07-30)
Mindless editing, DL_EXPORT/IMPORT (2002-07-31)
rewrite (2002-08-01)
Split-out ntmodule.c (2002-08-08)
Static names (2002-08-11)

From  Sun Aug 18 14:07:57 2002
From: (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Sun, 18 Aug 2002 15:07:57 +0200
Subject: [Python-Dev] mimetypes patch #554192
References: <>	<>	<> <>
Message-ID: <>

Martin v. Loewis wrote:

> Walter Dörwald <> writes:
>>OK, so we probably need a reverse mapping for common_types too, but
>>shouldn't we consider common_types to be fixed?
> If anything, types_map should be fixed: Those are the official
> IANA-supported types (including the official x- extension mechanism).
> The common types are those that violate IANA specs, yet found in real
> life.
> If you wanted to support strictness in add_type, then you would
> require that the type starts with x-; since should have
> all registered types incorporated (if it misses some, that's a bug).

OK, but then adding the entries to types_map must be done differently.
I'd prefer if both can be done by add_type (but then we'd need three
modes: initialising types_map, adding further mappings to types_map
(checking that only x- types/subtypes are used), and adding mappings
to common_types).

>>Even better would be, if we could assign priorities to the mappings,
>>so that for e.g. image/jpeg the preferred extension is .jpeg.
>>Then guess_type() and guess_extension() would return the preferred
> Do you have a specific application for that in mind? It sounds like
> overkill.

I'm using a web mirror script which uses the extensions from
guess_extension to save all downloaded resources, and I hate it
when the HTML files are named .htm and JPEG images are named .jpe.
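
One way to get this effect without changing the module is a small
preference table consulted before the standard lookup. The PREFERRED
table and the preferred_extension helper are hypothetical names for this
sketch; only mimetypes.guess_extension is the real library call.

```python
import mimetypes

# Hypothetical preference table; the entries are this sketch's choices.
PREFERRED = {
    "text/html": ".html",
    "image/jpeg": ".jpeg",
}


def preferred_extension(mime_type):
    """Return the preferred extension, falling back to the stdlib guess."""
    if mime_type in PREFERRED:
        return PREFERRED[mime_type]
    # May return None when the type is unknown.
    return mimetypes.guess_extension(mime_type)


html_ext = preferred_extension("text/html")
jpeg_ext = preferred_extension("image/jpeg")
other_ext = preferred_extension("text/plain")
```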

    Walter Dörwald

From  Sun Aug 18 15:06:55 2002
From: (Steve Holden)
Date: Sun, 18 Aug 2002 10:06:55 -0400
Subject: [Python-Dev] Platforms missing both fork and popen2/3
Message-ID: <0a6501c246c0$84209c30$>

The "third leg" of the CGIHTTPServer run_cgi() method is only taken if the
platform's os module contains neither fork nor popen2 nor popen3 attributes.
My platform experience is pretty much limited to the mainstream . Which
platforms are we looking at here besides Macintosh OS 9 and prior?
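
The branch selection described above can be sketched like this. The
function name and the returned labels are hypothetical, and the os
module is passed in as a parameter only so the three cases can be tried
on one machine; the sketch does not reproduce the CGIHTTPServer code
itself.

```python
import os
import types


def cgi_strategy(os_module=os):
    """Name the branch a run_cgi()-style method would take (illustrative)."""
    if hasattr(os_module, "fork"):
        return "fork"
    if hasattr(os_module, "popen2") or hasattr(os_module, "popen3"):
        return "popen"
    # The "third leg": neither fork nor popen2 nor popen3 is available.
    return "in-process"


with_fork = cgi_strategy(types.SimpleNamespace(fork=object()))
with_popen = cgi_strategy(types.SimpleNamespace(popen2=object()))
with_neither = cgi_strategy(types.SimpleNamespace())
```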

Steve Holden                       
Python Web Programming      

From  Sun Aug 18 16:59:19 2002
From: (Fredrik Lundh)
Date: Sun, 18 Aug 2002 17:59:19 +0200
Subject: [Python-Dev] Platforms missing both fork and popen2/3
References: <0a6501c246c0$84209c30$>
Message-ID: <017a01c246d0$3842a9b0$ced241d5@hagrid>

Steve wrote:

> The "third leg" of the CGIHTTPServer run_cgi() method is only taken if the
> platform's os module contains neither fork nor popen2 nor popen3 attributes.
> My platform experience is pretty much limited to the mainstream . Which
> platforms are we looking at here besides Macintosh OS 9 and prior?

Python, before 2.0, on non-unix platforms.


From  Sun Aug 18 19:40:01 2002
From: (Steve Holden)
Date: Sun, 18 Aug 2002 14:40:01 -0400
Subject: [Python-Dev] Platforms missing both fork and popen2/3
References: <0a6501c246c0$84209c30$> <017a01c246d0$3842a9b0$ced241d5@hagrid>
Message-ID: <0c0101c246e6$b0a54a00$>

> Steve wrote:
> > The "third leg" of the CGIHTTPServer run_cgi() method is only taken if the
> > platform's os module contains neither fork nor popen2 nor popen3 attributes.
> > My platform experience is pretty much limited to the mainstream . Which
> > platforms are we looking at here besides Macintosh OS 9 and prior?
> Python, before 2.0, on non-unix platforms.

Thanks. I'm happy to ignore that one for the purposes of a 2.3 fix. I'm not
sure whether I ought to be looking at backporting this one to 2.2, though.
The *HTTPServer modules are so patently not production-quality code I
suspect it won't matter.

Steve Holden                       
Python Web Programming      

From  Sun Aug 18 21:04:56 2002
From: (Guido van Rossum)
Date: Sun, 18 Aug 2002 16:04:56 -0400
Subject: [Python-Dev] Platforms missing both fork and popen2/3
In-Reply-To: Your message of "Sun, 18 Aug 2002 14:40:01 EDT."
References: <0a6501c246c0$84209c30$> <017a01c246d0$3842a9b0$ced241d5@hagrid>
Message-ID: <>

> Thanks. I'm happy to ignore that one for the purposes of a 2.3 fix. I'm not
> sure whether I ought to be looking at backporting this one to 2.2, though.
> The *HTTPServer modules are so patently not production-quality code I
> suspect it won't matter.

I imagine a backport would be easy, because not much has changed in
that code.

--Guido van Rossum (home page:

From  Mon Aug 19 04:15:08 2002
From: (Tim Peters)
Date: Sun, 18 Aug 2002 23:15:08 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
Message-ID: <>

[Oren Tirosh]
> ...
> The problem is in C extension modules. In C there is an incentive to rely
> on the immortality of interned strings because it makes the code simpler.

Are you sure about that?  I haven't seen it.

> There was an example of this in the Mac import code.

Code *inside* the core can play any dirty tricks it likes, because we
guarantee to keep it working as things change across releases.  But, AFAICT,
we have no evidence that anything outside the core abuses this stuff.

> PyString_InternInPlace should probably create immortal interned strings
> for backward compatibility (and deprecated, of course)

I still doubt it matters to anything outside the core.

> Maybe PyString_Intern should be renamed to PyString_InternReference to
> make it more obvious that it modifies the pointer "in place".

You're talking about a function that doesn't exist now, right (I don't
recognize the name PyString_Intern, and neither it nor
PyString_InternReference scream anything obvious to me)?

From  Mon Aug 19 04:52:11 2002
From: (Tim Peters)
Date: Sun, 18 Aug 2002 23:52:11 -0400
Subject: [Python-Dev] pystone(object)
In-Reply-To: <>
Message-ID: <>

> ...
> Slots can get you back most of this, but not all.  Dict lookup is
> already extremely tight code, and when I profiled this, most of the
> time was spent there -- twice as many lookup calls using new-style
> classes than for classic classes.

As I've said, and as Oren later demonstrated with code, the cost of a
namespace dict lookup now is more in the layers of function call overhead
than in the actual lookup.  We could whittle that down in Oren-like ways,
although I'd rather we spent whatever time we can devote to stuff like this
on advancing one of the more-general optimization schemes that were a hot
topic before the Python conference.

From  Mon Aug 19 05:15:12 2002
From: (Tim Peters)
Date: Mon, 19 Aug 2002 00:15:12 -0400
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: <>
Message-ID: <>

[Michael Hudson]
> ...
> This makes no sense; after you've commented out the trace stuff, the
> only difference left is that the switch is smaller!

When things like this don't make sense, it just means we're naive <wink>.
The eval loop overwhelms most optimizers via a crushing overload of "too
many" variables and "too many" basic blocks connected via a complex
topology, and compiler optimization phases are in the business of using
(mostly) linear-time heuristics to solve exponential-time optimization
problems.  IOW, the performance of the eval loop is as touchy as a
heterosexual sailor coming off 2 years at sea, and there's no predicting
what minor changes will do to speed.  This has been observed repeatedly by
everyone who has tried to speed it, across many platforms, and across a
decade of staring at it:  the eval loop is in unstable equilibrium on its
best days.

In the limit, the eval loop "should be" a little slower now under -O, just
because we've added another test + taken-branch to the normal path.  From
that POV, your

> FWIW gcc makes my patch a small win even with -O.

is as much "a mystery" as why MSVC 6 hates it.

> Actually, there are some other changes, like always updating f->f_lasti,
> and allocating 8 more bytes on the stack.  Does commenting out the
> definition of instr_lb & instr_ub make any difference?

I'll try that on Tuesday, but don't hold your breath.  It could be that I
can get back all the loss by declaring tstate volatile -- or doing any other
random thing <wink>.

> ...
> Does reading assembly give any clues?  Not that I'd really expect
> anyone to read all of the main loop...

I will if it's important, but a good HW simulator is a better tool for this
kind of thing, and in any case I doubt I can make enough time to do what
would be needed to address this for real.

> I'm baffled.

Join the club -- we've held this invitation open for you for years <wink>.

> Perhaps you can put SET_LINENO back in for the Windows build
> <1e-6 wink>.

If it's an unfortunate I-cache conflict among heavily-hit code addresses
(something a good HW simulator can tell you), that could actually solve it!
Then anything that manages to move one of the colliding code chunks to a
different address could yield "a mysterious speedup".  These mysteries are
only irritating when they work against you <wink>.

relax-be-happy-ly y'rs  - tim

From  Mon Aug 19 05:46:35 2002
From: (Oren Tirosh)
Date: Mon, 19 Aug 2002 00:46:35 -0400
Subject: [Python-Dev] Alternative implementation of interning
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sun, Aug 18, 2002 at 11:15:08PM -0400, Tim Peters wrote:
> [Oren Tirosh]
> > ...
> > The problem is in C extension modules. In C there is an incentive to rely
> > on the immortality of interned strings because it makes the code simpler.
> Are you sure about that?  I haven't seen it.
> > There was an example of this in the Mac import code.
> Code *inside* the core can play any dirty tricks it likes, because we
> guarantee to keep it working as things change across releases.  But, AFAICT,
> we have no evidence that anything outside the core abuses this stuff.

I doubt that whoever wrote that code was thinking "hey, this is part of the
core so I can do this".  More likely he was just following the *documented*
promise that interned strings are immortal. There may be extension modules
outside the core that also rely on this promise.

> > PyString_InternInPlace should probably create immortal interned strings
> > for backward compatibility (and deprecated, of course)
> I still doubt it matters to anything outside the core.

Perhaps I'm being overly cautious about breaking promises.  If both you and 
Guido say it's OK who am I to argue...

> > Maybe PyString_Intern should be renamed to PyString_InternReference to
> > make it more obvious that it modifies the pointer "in place".
> You're talking about a function that doesn't exist now, right (I don't
> recognize the name PyString_Intern, and neither it nor
> PyString_InternReference scream anything obvious to me)?

PyString_Intern is the name of the function in my patch that creates a
mortal interned string. PyString_InternInPlace creates immortal interned
strings for compatibility.
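
The mortal/immortal distinction can be illustrated in Python with weak
references. This is only an analogy under stated assumptions: the patch
works on C string objects, whereas this sketch uses a hypothetical
Symbol class (plain str objects cannot be weakly referenced), and the
immediate cleanup shown relies on CPython's reference counting.

```python
import gc
import weakref


class Symbol(object):
    """Stand-in for an interned string (hypothetical)."""

    def __init__(self, text):
        self.text = text


_immortal = {}                           # strong references: entries never die
_mortal = weakref.WeakValueDictionary()  # entries vanish with their last user


def intern_immortal(text):
    if text not in _immortal:
        _immortal[text] = Symbol(text)
    return _immortal[text]


def intern_mortal(text):
    sym = _mortal.get(text)
    if sym is None:
        sym = Symbol(text)
        _mortal[text] = sym
    return sym


first = intern_mortal("spam")
second = intern_mortal("spam")
shared = first is second          # both callers get the same object
del first, second
gc.collect()
mortal_survives = "spam" in _mortal

keeper = intern_immortal("eggs")
del keeper
gc.collect()
immortal_survives = "eggs" in _immortal
```

Code that keeps only a borrowed pointer to an interned object is safe
with the immortal table and unsafe with the mortal one, which is the
compatibility concern raised in this thread.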


From  Mon Aug 19 10:39:05 2002
From: (Michael Hudson)
Date: 19 Aug 2002 10:39:05 +0100
Subject: [Python-Dev] Re: SET_LINENO killer
In-Reply-To: Tim Peters's message of "Mon, 19 Aug 2002 00:15:12 -0400"
References: <>
Message-ID: <>

Tim Peters <> writes:

> [Michael Hudson]
> > ...
> > This makes no sense; after you've commented out the trace stuff, the
> > only difference left is that the switch is smaller!
> When things like this don't make sense, it just means we're naive <wink>.
> The eval loop overwhelms most optimizers via a crushing overload of "too
> many" variables and "too many" basic blocks connected via a complex
> topology, and compiler optimization phases are in the business of using
> (mostly) linear-time heuristics to solve exponential-time optimization
> problems.  IOW, the performance of the eval loop is as touchy as a
> heterosexual sailor coming off 2 years at sea, and there's no predicting
> what minor changes will do to speed.  This has been observed repeatedly by
> everyone who has tried to speed it, across many platforms, and across a
> decade of staring at it:  the eval loop is in unstable equilibrium on its
> best days.

I knew all this, but was still surprised by the magnitude of the slowdown.

> In the limit, the eval loop "should be" a little slower now under -O, just
> because we've added another test + taken-branch to the normal path.  From
> that POV, your
> > FWIW gcc makes my patch a small win even with -O.
> is as much "a mystery" as why MSVC 6 hates it.

No kidding.

I wonder if some of the slow comes from repeatedly hauling the
threadstate into the cache.  I guess wonderings like this are almost
exactly valueless.

> > Actually, there are some other changes, like always updating f->f_lasti,
> > and allocating 8 more bytes on the stack.  Does commenting out the
> > definition of instr_lb & instr_ub make any difference?
> I'll try that on Tuesday, but don't hold your breath.  It could be that I
> can get back all the loss by declaring tstate volatile -- or doing any other
> random thing <wink>.
> > ...
> > Does reading assembly give any clues?  Not that I'd really expect
> > anyone to read all of the main loop...
> I will if it's important, but a good HW simulator is a better tool for this
> kind of thing, and in any case I doubt I can make enough time to do what
> would be needed to address this for real.

On linux there's cachegrind which comes with valgrind and might prove
helpful.  But that only runs on linux, and I'm not sure I want to
explain the linux mystery, as it might go away :)

> > I'm baffled.
> Join the club -- we've held this invitation open for you for years <wink>.

Attempting a PhD in mathematics is providing enough bafflement for this
schmuck, but thanks for the offer.

> > Perhaps you can put SET_LINENO back in for the Windows build
> > <1e-6 wink>.
> If it's an unfortunate I-cache conflict among heavily-hit code addresses
> (something a good HW simulator can tell you), that could actually solve it!
> Then anything that manages to move one of the colliding code chunks to a
> different address could yield "a mysterious speedup".  These mysteries are
> only irritating when they work against you <wink>.

Well, quite.  Let's send Julian Seward an email asking him if he wants
to port valgrind to Windows <wink>.


  surely, somewhere, somehow, in the history of computing, at least
  one manual has been written that you could at least remotely
  attempt to consider possibly glancing at.              -- Adam Rixey

From  Mon Aug 19 13:53:15 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Mon, 19 Aug 2002 14:53:15 +0200
Subject: [Python-Dev] PyString_DecodeEscape and PEP293
Message-ID: <>

A recent checkin added a function PyString_DecodeEscape()
to stringobject.c. To make this function PEP293 compatible
it would need access to unicode_decode_call_errorhandler
which is defined static in unicodeobject.c. Does
PyString_DecodeEscape() really need an errors argument?

If yes, we could either move it to unicodeobject.c or make
unicode_decode_call_errorhandler externally visible.

Another problem that I noticed is that string-escape can't
be used for encoding Unicode objects:

 >>> u"\u0100".encode("string-escape")
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
TypeError: escape_encode() argument 1 must be str, not unicode

    Walter Dörwald

From  Mon Aug 19 14:36:41 2002
From: (Matthias Urlichs)
Date: Mon, 19 Aug 2002 15:36:41 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
Message-ID: <p05111702b986a3c4d470@[]>

>  My guess it's not his listdir() or filesystem, but the keyboard
>  driver.
No, it's MacOSX. It always uses the decomposed form.

That is very noticeable via NFS volumes where files with combined 
character names are unopenable from the GUI.
I've filed a bug report about that - I don't know whether OSX 10.2 
will allow NFC filenames, at least read-only.
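
A short sketch of the composed/decomposed difference, using the standard
unicodedata module from a current Python 3. The file name and the
same_filename helper are invented for the example.

```python
import unicodedata

name_nfc = unicodedata.normalize("NFC", "\u00e4.txt")  # precomposed a-umlaut
name_nfd = unicodedata.normalize("NFD", name_nfc)      # 'a' + combining mark


def same_filename(a, b):
    """Compare two names after bringing both to one normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


lengths = (len(name_nfc), len(name_nfd))
byte_lengths = (len(name_nfc.encode("utf-8")), len(name_nfd.encode("utf-8")))
equal_raw = name_nfc == name_nfd
equal_normalized = same_filename(name_nfc, name_nfd)
```

The two names look identical when displayed but differ both as character
strings and as UTF-8 byte strings, so a lookup that compares them without
normalizing will miss the file.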

Matthias Urlichs

From  Mon Aug 19 14:44:08 2002
From: (Matthias Urlichs)
Date: Mon, 19 Aug 2002 15:44:08 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
Message-ID: <p05111701b9869e037b33@[]>

>  Indeed, that would be consistent. I deliberately want to leave this
>  out of PEP 277. On Unix, things are not that clear - as Jack points
>  out, readlink() and getcwd() also need consideration.
Linux and MacOSX use UTF-8 and should probably be treated as such,
i.e. I want to open("äöü"), not open("äöü".encode("utf-8"))

One interesting tidbit is that MacOSX requires Unicode filenames to be
in NFD.
I don't know whether anybody agreed on a standard normal form for Linux

>  In this terrain, Windows has the cleaner API (they consider file names
>  as character strings, not as byte strings), so doing the right thing
>  is easier.
Byte strings are perfectly OK if they have a common encoding (meaning
UTF-8, in some accepted normal form). Character strings are bad if
their interpretation, or indeed their usability, changes with the
presence of some random environment variable / registry entry /
whatever. Under these constraints, calling it a character string vs.
a byte string, and/or using it as such, is a matter of programmers'

Matthias Urlichs

From  Mon Aug 19 15:06:47 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 10:06:47 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "17 Aug 2002 19:45:12 EDT."
References: <>
Message-ID: <>

> > - The set constructors have an optional second argument, sort_repr,
> >   defaulting to False, which decides whether the elements are sorted
> >   when str() or repr() is taken.  I'm not sure if there would be
> >   negative consequences of removing this argument and always sorting
> >   the string representation.

> Unless there is something deep attached to the properties of the sets
> themselves, I do not understand why the sorting/non-sorting virtues of
> `repr' should be tied with the constructor.
> There is a precedent with dicts.  They print non-sorted, but they
> pretty-print (through the `pprint' module) sorted.  Maybe the same could
> be done for sets: use `pprint' if you want a sorted representation.
> But otherwise, sets as well as dicts should print using the same order
> by which elements are to be iterated upon or listed, in various other
> circumstances.

This is a pretty convincing argument.  If dicts can survive being
rendered unsorted, then so can Sets.  Maybe I should remove the
sort_repr argument altogether; it's easy enough for the test suite to
use some other trick.  But for now, I'll just leave sort_repr=False
in.  I'm gonna check this in now, but that doesn't mean we can't tweak
the API or implementation, so keep those comments coming!

--Guido van Rossum (home page:

From  Mon Aug 19 15:51:29 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 19 Aug 2002 10:51:29 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> [François]
> > [...] why the sorting/non-sorting virtues of `repr' should be tied
> > with the constructor.  [...]

> If dicts can survive being rendered unsorted, then so can Sets.  Maybe I
> should remove the sort_repr argument altogether; it's easy enough for
> the test suite to use some other trick.

I presume you already have a solution when testing dicts?

> But for now, I'll just leave sort_repr=False in.

As long as users do not discover it, they will not use it! :-)

By the way (this was discussed on Python list a while ago), it might
be worth stressing in the official documentation that dicts, and maybe
Sets as well, all have a "natural" iteration order which remains fixed at
least while the dict or Set does not lose or acquire keys, and that this
same fixed order is used for .items(), .keys(), .values(), and all three
.iter* flavours.  It is sometimes useful being able to rely on this fact,
especially if Python clearly commits itself through the documentation.

The printing order for dicts and Sets could be documented as a simple
way to reveal the current "natural" fixed order.
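
The property in question can be checked directly. This sketch uses a
current Python 3, where keys(), values() and items() replace the older
list-returning and .iter* methods; the sample data is invented.

```python
d = {"pear": 1, "apple": 2, "mango": 3}

keys = list(d.keys())
values = list(d.values())
items = list(d.items())

# As long as the dict is not modified, the three views line up pairwise.
consistent = list(zip(keys, values)) == items

s = {"pear", "apple", "mango"}
# Two passes over an unmodified set visit the elements in the same order.
stable = list(s) == list(s)
```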

François Pinard

From  Mon Aug 19 16:18:35 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 11:18:35 -0400
Subject: [Python-Dev] pystone(object)
In-Reply-To: Your message of "Sun, 18 Aug 2002 23:52:11 EDT."
References: <>
Message-ID: <>

> > Slots can get you back most of this, but not all.  Dict lookup is
> > already extremely tight code, and when I profiled this, most of the
> > time was spent there -- twice as many lookup calls using new-style
> > classes than for classic classes.
> As I've said, and as Oren later demonstrated with code, the cost of a
> namespace dict lookup now is more in the layers of function call overhead
> than in the actual lookup.  We could whittle that down in Oren-like ways,
> although I'd rather we spent whatever time we can devote to stuff like this
> on advancing one of the more-general optimization schemes that were a hot
> topic before the Python conference.

Here's some info taken from a profile of a program that requests an
instance attribute of a new-style class without slots or properties ten
million times (using a for-loop over xrange(100000) and then 100
attribute lookups in the for-loop body).

The following functions are called for each attribute lookup:

#calls  seconds  name

3       1.72     lookdict_string
1       1.17     PyObject_GenericGetAttr
1       1.10     _PyType_Lookup
3       1.00     PyDict_GetItem
1       0.45     _PyObject_GetDictPtr
1       0.38     PyObject_GetAttr

10      5.82     Subtotal

        3.28     eval_frame (one call!)

        9.10     Total

Here, "seconds" is the total time spent in 10 million times the number
of calls.  In addition, the program spent 3.28 seconds in 500 calls to
eval_frame, I assume nearly all of it in the one call that corresponds
to the body of the test function, so I've added that.

The call graph is as follows:

eval_frame -> (10 million times)
    PyObject_GetAttr ->
        PyObject_GenericGetAttr ->
            _PyType_Lookup ->
                PyDict_GetItem ->
                    lookdict_string
                PyDict_GetItem ->
                    lookdict_string
            _PyObject_GetDictPtr
            PyDict_GetItem ->
                lookdict_string

If we want to be really aggressive about this, I suppose we could
inline all of that in PyObject_GenericGetAttr, for the case that the
name passes the PyString_CheckExact test and has a pre-calculated
hash.  In particular, PyDict_GetItem then pretty much boils down to
"mp->ma_lookup(mp, key, hash)->me_value".  That should cut out 5
function calls.

A quick small gain would be to inline just the call to
_PyObject_GetDictPtr.  (I tried this; it saves about 2% on the total
running time of this particular test when not using the profiler.)

An intermediate gain would be to inline the call to _PyType_Lookup.

Here's the code I profiled:

class C(object):
    pass

def main():
    a = C()
    a.foo = 42
    for i in xrange(100000):
        a.foo; a.foo; a.foo  # ... and so on, 100 lookups of a.foo per iteration
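
For anyone who wants to repeat the comparison without a C-level profiler,
here is a rough sketch in present-day Python 3 syntax (not the 2.2-era code
above).  The class names Plain and Slotted and the helper time_lookups are
made up for illustration, and the timings it prints depend entirely on the
interpreter and machine, so it is only a way to eyeball the difference
between dict-backed and slot-backed attribute lookup.

```python
import timeit


class Plain(object):
    """New-style class; the attribute lives in the instance __dict__."""


class Slotted(object):
    """Same attribute, stored through a slot descriptor instead of a dict."""
    __slots__ = ['foo']


def time_lookups(cls, number=100000):
    """Return the seconds spent running ten `a.foo` lookups `number` times."""
    a = cls()
    a.foo = 42
    # One timed statement performs ten attribute lookups in a row.
    stmt = "; ".join(["a.foo"] * 10)
    return timeit.timeit(stmt, globals={"a": a}, number=number)


if __name__ == "__main__":
    for cls in (Plain, Slotted):
        print("%-8s %.3f s" % (cls.__name__, time_lookups(cls)))
```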


If I add __slots__ = ['foo'] to the class definition, here's the call
graph I get (prefixed with the total seconds for each
function; each function is called exactly once per attribute lookup in
this case):

3.22    eval_frame -> (10 million times)
0.33        PyObject_GetAttr ->
1.05            PyObject_GenericGetAttr ->
0.35                PyDescr_IsData
0.36                member_get ->
0.15                    descr_check ->
0.27                        PyObject_IsInstance
0.44                    PyMember_GetOne
0.49                _PyType_Lookup ->
0.35                    PyDict_GetItem ->
1.17                        lookdict_string

8.18    Total

This profile points out a bug in descr_check!  It calls
PyObject_IsInstance, which is a very general routine and hence
relatively expensive.  But descr_check's call to it always passes a
genuine PyTypeObject as the second argument, and we can in-line this
by writing PyObject_TypeCheck(obj, descr->d_type); that's a macro that
may call PyType_IsSubtype but in this case never needs to, saving
about 6% on the total running time of this particular test when not
using the profiler.

--Guido van Rossum (home page:

From  Mon Aug 19 16:25:38 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 11:25:38 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 10:51:29 EDT."
References: <> <> <>
Message-ID: <>

> > [François]
> > > [...] why the sorting/non-sorting virtues of `repr' should be tied
> > > with the constructor.  [...]
> > If dicts can survive being rendered unsorted, then so can Sets.  Maybe I
> > should remove the sort_repr argument altogether; it's easy enough for
> > the test suite to use some other trick.
> I presume you already have a solution when testing dicts?

These tests just require an equality test between the actual outcome
and the expected outcome.  Since sets support equality testing,
there's no reason not to use that.  (I guess the original test was
being paranoid, or was written before __eq__ was implemented.)
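
A small sketch of what such a test can look like.  It uses the built-in
set type of present-day Python as a stand-in for the Set class under
discussion, and check_union is a made-up helper, so treat it as an
illustration of the idea rather than the actual test-suite code.

```python
def check_union(left, right, expected_items):
    """Compare a computed union with the expected result by equality.

    No repr() is involved, so the order in which either set happens to
    list its elements cannot make the test fail.
    """
    actual = set(left) | set(right)
    expected = set(expected_items)
    return actual == expected


if __name__ == "__main__":
    print(check_union([1, 2], [2, 3], [3, 1, 2]))
```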

> > But for now, I'll just leave sort_repr=False in.
> As long as users do not discover it, they will not use it! :-)

We can mull that over until the first beta release.

> By the way (this was discussed on Python list a while ago), it might
> be worth stressing in the official documentation that dicts, and maybe
> Sets as well, all have a "natural" iteration order which remains fixed at
> least while the dict or Set does not lose or acquire keys, and that this
> same fixed order is used for .items(), .keys(), .values(), and all three
> .iter* flavours.  It is sometimes useful to be able to rely on this fact,
> especially if Python clearly commits itself through the documentation.

AFAIK that's well documented.

--Guido van Rossum (home page:

From  Mon Aug 19 16:27:16 2002
From: (Raymond Hettinger)
Date: Mon, 19 Aug 2002 11:27:16 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <><><> <>
Message-ID: <001201c24794$e709b460$98f8a4d8@othello>

> > But for now, I'll just leave sort_repr=False in.

> As long as users do not discover it, they will not use it! :-)

Just to make sure, why not move sort_repr=False out of the parameter
list and into the code body?

> By the way (this was discussed on Python list a while ago), it might
> be worth stressing in the official documentation that dicts, and maybe
> Sets as well, all have a "natural" iteration order which remains fixed at
> least while the dict or Set does not lose or acquire keys, and that this
> same fixed order is used for .items(), .keys(), .values(), and all three
> .iter* flavours.  It is sometimes useful to be able to rely on this fact,
> especially if Python clearly commits itself through the documentation.

Just like stability for the new list.sort(), this promise ought to remain
a hidden, undocumented implementation detail.  Because of collision
resolution, the "natural" order can vary depending on the order that
the keys are inserted.  While the ordering stays constant until there
is a change, it is fragile and could be changed by a resize operation
even if the keys remain the same.  Let's keep the options open here
in case someday we want GC or a memory manager to rebuild the
dictionary at an arbitrary time.

Raymond Hettinger

From  Mon Aug 19 16:38:01 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 11:38:01 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 11:27:16 EDT."
References: <> <> <> <>
Message-ID: <>

> Just to make sure, why not move sort_repr=False out of the parameter
> list and into the code body?

Good idea.  It can be a non-public class variable.

> [FP]
> > By the way (this was discussed on Python list a while ago), it might
> > be worth stressing in the official documentation that dicts, and maybe
> > Sets as well, all have a "natural" iteration order which remains fixed at
> > least while the dict or Set does not lose or acquire keys, and that this
> > same fixed order is used for .items(), .keys(), .values(), and all three
> > .iter* flavours.  It is sometimes useful to be able to rely on this fact,
> > especially if Python clearly commits itself through the documentation.
> Just like stability for the new list.sort(), this promise ought to remain
> a hidden, undocumented implementation detail.  Because of collision
> resolution, the "natural" order can vary depending on the order that
> the keys are inserted.  While the ordering stays constant until there
> is a change, it is fragile and could be changed by a resize operation
> even if the keys remain the same.  Let's keep the options open here
> in case someday we want GC or a memory manager to rebuild the
> dictionary at an arbitrary time.

I don't think François was stating that the order was only dependent
on the inserted keys.  I believe he was merely referring to the fact
that the order doesn't change as long as you don't mutate a dict, and
that it's the same for items(), keys(), values(), iterators, and
display order.  There's no reason to keep that hidden, and I believe
it's documented (though François didn't find it).

--Guido van Rossum (home page:

From  Mon Aug 19 16:57:43 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 11:57:43 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 11:38:18 EDT."
References: <> <15710.41215.451302.851810@localhost.localdomain> <>
Message-ID: <>

[Michael McLay]
> >
> >x/sets/
> Did you consider making BaseSet._data a slot?

Hm, maybe I should.  If this is a proposed standard data type, we
might as well get people used to the fact that they can't add random
new instance variables without subclassing first.

OTOH what then to do with _sort_repr -- make it a class var or an
instance var?
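
To make the remark about slots concrete, here is a minimal sketch.
SlottedSet and ExtensibleSet are hypothetical names, not the classes in
the sets module; the point is only that an instance of a class with
__slots__ rejects unknown attributes, while a subclass that does not
declare __slots__ gets its per-instance __dict__ back.

```python
class SlottedSet(object):
    """Hypothetical stand-in for BaseSet with _data declared as a slot."""
    __slots__ = ['_data']

    def __init__(self, items=()):
        self._data = dict.fromkeys(items, True)


class ExtensibleSet(SlottedSet):
    """No __slots__ here, so instances accept arbitrary new attributes."""


def can_add_attribute(obj):
    """Return True if obj accepts a brand-new instance attribute."""
    try:
        obj.extra = 1
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    print(can_add_attribute(SlottedSet([1, 2])))
    print(can_add_attribute(ExtensibleSet([1, 2])))
```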

--Guido van Rossum (home page:

From  Mon Aug 19 17:36:09 2002
From: (Tim Peters)
Date: Mon, 19 Aug 2002 12:36:09 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

From the Library Reference manual, section "Mapping Types":

    Keys and values are listed in random order.  If keys() and values()
    are called with no intervening modifications to the dictionary,
    the two lists will directly correspond.  This allows the creation of
    (value, key) pairs using zip(): "pairs = zip(a.values(), a.keys())".

The same footnote should be reworked to cover, and be referenced from, the
.iter{keys, values, items} methods too.
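
A short sketch of the guarantee in that footnote, written in present-day
Python 3 syntax (where keys() and values() return views rather than
lists).  The helper name value_key_pairs is invented for the example.

```python
def value_key_pairs(d):
    """Build (value, key) pairs, relying on keys() and values() lining up.

    This is only safe because the dict is not modified between the two
    calls.
    """
    return list(zip(d.values(), d.keys()))


a = {"spam": 1, "eggs": 2, "ham": 3}
pairs = value_key_pairs(a)

# Every (value, key) pair agrees with the mapping itself.
assert all(a[key] == value for value, key in pairs)

# items() walks the dict in that same order.
assert pairs == [(value, key) for key, value in a.items()]
```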

From  Mon Aug 19 20:35:48 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 15:35:48 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 11:57:43 EDT."
Message-ID: <>

By the way, I've checked in the "sets" module in Lib.  The unit tests
in test/ need work (no tests for ImmutableSet for example)
and there's no latex documentation; however "import sets; help(sets)"
shows a wealth of information derived from docstrings.  I plan to fix
the unit tests but could use help with the docs.

--Guido van Rossum (home page:

From  Mon Aug 19 20:44:22 2002
From: (Kevin Jacobs)
Date: Mon, 19 Aug 2002 15:44:22 -0400 (EDT)
Subject: [Python-Dev] Standard datetime objects?
Message-ID: <>

I know it has been asked before, but I was wondering where we are with our
new standard datetime objects?  I'm re-working some of my date/time code,
and will be in a position to also work on whatever is keeping the prototype
from being completed.


Kevin Jacobs
The OPAL Group - Enterprise Systems Architect
Voice: (216) 986-0710 x 19         E-mail:
Fax:   (216) 986-0714              WWW:

From  Mon Aug 19 20:50:00 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 15:50:00 -0400
Subject: [Python-Dev] Standard datetime objects?
In-Reply-To: Your message of "Mon, 19 Aug 2002 15:44:22 EDT."
References: <>
Message-ID: <>

> I know it has been asked before, but I was wondering where we are with our
> new standard datetime objects?  I'm re-working some of my date/time code,
> and will be in a position to also work on whatever is keeping the prototype
> from being completed.

Please have a look at the prototype in
python/nondist/sandbox/datetime/.  Note that there are comments
pointing to a Wiki with design discussions too.

Fred's working on completing the C reimplementation (also there); in
fact, I'm expecting a checkpoint checkin from him any moment now.

--Guido van Rossum (home page:

From  Mon Aug 19 20:55:31 2002
From: (Raymond Hettinger)
Date: Mon, 19 Aug 2002 15:55:31 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
References: <>
Message-ID: <001b01c247ba$60865260$af66accf@othello>

From: "Guido van Rossum" <>

> By the way, I've checked in the "sets" module in Lib.  The unit tests
> in test/ need work (no tests for ImmutableSet for example)
> and there's no latex documentation; however "import sets; help(sets)"
> shows a wealth of information derived from docstrings.  I plan to fix
> the unit tests but could use help with the docs.

I'll do the docs.

Raymond Hettinger

From  Mon Aug 19 21:23:27 2002
From: (Brett Cannon)
Date: Mon, 19 Aug 2002 13:23:27 -0700 (PDT)
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU>

[Guido van Rossum]

> OTOH what then to do with _sort_repr -- make it a class var or an
> instance var?

Well, how often can you imagine someone printing out a single set sorted,
but having other sets that they didn't want printed out sorted?  I would
suspect that it is going to be a very rare case when someone wants just
part of their sets printing sorted and the rest not.

I say make it a class var.

-Brett C.

From  Mon Aug 19 21:31:18 2002
From: (Michael Chermside)
Date: Mon, 19 Aug 2002 16:31:18 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
Message-ID: <>

[Guido van Rossum]

 > OTOH what then to do with _sort_repr -- make it a class var or an
 > instance var?

Setting a class var in a standard library class is like playing with a
global variable, with all the attendant problems. Scenario... I want my
sets sorted, but I import some library that uses sets of complex numbers
for internal purposes. Or (slightly more plausible) I want my sets
UNsorted, but I use some library whose author counted on the string
output being sorted (ok... the author shouldn't have depended on it
because of the existence of the rarely used class variable, but even
non-experts write libraries using the standard library).

-- Michael Chermside

From  Mon Aug 19 21:38:41 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 16:38:41 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 13:23:27 PDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU>
Message-ID: <>

> [Guido van Rossum]
> > OTOH what then to do with _sort_repr -- make it a class var or an
> > instance var?

[Brett C]
> Well, how often can you imagine someone printing out a single set sorted,
> but having other sets that they didn't want printed out sorted?  I would
> suspect that it is going to be a very rare case when someone wants just
> part of their sets printing sorted and the rest not.
> I say make it a class var.

Hm, but what if two different library modules have conflicting
requirements?  E.g. module A creates sets of complex numbers and must
have sort_repr=False, while module B needs sort_repr=True for
user-friendliness (or because it relies on this).

My current approach (now in CVS!) is to remove the sort_repr flag to
the constructor, but to provide a method that can produce a sorted or
an unsorted representation.  __repr__ will always return the items
unsorted, which matches what repr(dict) does.  After all, I think it
could be confusing to a user when 'print s' shows

    Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

but

    for i in s: print i,

prints

    9 8 7 6 5 4 3 2 1 0

--Guido van Rossum (home page:

From  Mon Aug 19 22:02:52 2002
From: (Neil Schemenauer)
Date: Mon, 19 Aug 2002 14:02:52 -0700
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Mon, Aug 19, 2002 at 04:31:18PM -0400
References: <>
Message-ID: <>

Michael Chermside wrote:
> Setting a class var in a standard library class is like playing with a
> global variable, with all the attendant problems. Scenario... I want my
> sets sorted, but I import some library that uses sets of complex numbers
> for internal purposes.

I think the intention is that you would subclass to override the class
variable.  As you point out, modifying a class variable in a library is
asking for trouble.
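
A sketch of the subclassing route, with made-up names (SketchSet,
SortedReprSet) and a deliberately simplified list-backed container rather
than the real Set implementation.  The subclass changes the policy for
its own instances only, so other modules using the base class never see
the difference.

```python
class SketchSet(object):
    """Hypothetical set-like class whose repr policy is a class variable."""
    _sort_repr = False

    def __init__(self, items=()):
        # Keep unique items in first-seen order (simplified storage).
        self._data = list(dict.fromkeys(items))

    def __repr__(self):
        items = sorted(self._data) if self._sort_repr else self._data
        return "%s(%r)" % (self.__class__.__name__, list(items))


class SortedReprSet(SketchSet):
    """Overrides the class variable without touching SketchSet itself."""
    _sort_repr = True


if __name__ == "__main__":
    print(repr(SketchSet([3, 1, 2])))
    print(repr(SortedReprSet([3, 1, 2])))
```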


From  Mon Aug 19 21:56:01 2002
From: (François Pinard)
Date: 19 Aug 2002 16:56:01 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU>
Message-ID: <>

[Guido van Rossum]

> My current approach (now in CVS!) is [...] to provide a method that
> can produce a sorted or an unsorted representation.

Could a method with a similar name be available for dicts as well?

François Pinard

From  Mon Aug 19 22:18:13 2002
From: (Skip Montanaro)
Date: Mon, 19 Aug 2002 16:18:13 -0500
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Objects bufferobject.c,2.19,2.20 complexobject.c,2.62,2.63 floatobject.c,2.114,2.115 intobject.c,2.91,2.92 stringobject.c,2.180,2.181 tupleobject.c,2.71,2.72
In-Reply-To: <>
References: <>
Message-ID: <15713.24725.984557.521006@gargle.gargle.HOWL>

    guido> Call me anal, but there was a particular phrase that was spreading
    guido> to comments everywhere that bugged me: /* Foo is inlined */
    guido> instead of /* Inline Foo */.  Somehow the "is inlined" phrase
    guido> always confused me for half a second (thinking, "No it isn't"
    guido> until I added the missing "here").  The new phrase is hopefully
    guido> unambiguous.

Perhaps a comment at the definition of Foo that says "this code has been
inlined elsewhere" makes sense so that if people fix bugs or enhance them
they will be prompted to scout around for other places that need fixing.  (I
hesitate to suggest that all the places a piece of code is inlined should be
recorded, but perhaps that's another option.)


From  Mon Aug 19 22:20:40 2002
From: (Andrew Koenig)
Date: 19 Aug 2002 17:20:40 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

In the message that started all of this type-category discussion,
I said:

   As far as I know, there is no uniform method of determining into
   which category or categories a particular object falls.  Of course,
   there are non-uniform ways of doing so, but in general, those ways
   are, um, nonuniform.  Therefore, if you want to check whether an
   object is in one of these categories, you haven't necessarily
   learned much about how to check if it is in a different one of
   these categories.

As it happens, I'm presently working on a program in which I
would like to be able to determine whether a given value is:

        -- a member of a particular class hierarchy that I've defined;
        -- a callable object;
        -- a compiled regular expression; or
        -- anything else.

and do something different in each of these four cases.  Testing for
the first category is easy: I evaluate isinstance(x, B), where B is the
base class of my hierarchy.

Testing for the second is also easy: I evaluate callable(x).

How do I test for the third?  I guess I need to know the name of the
type of a compiled regular expression object.  Hmmm... A quick scan
through the documentation doesn't reveal it.  So I do an experiment:

        >>> import re
        >>> re.compile("foo")
        <_sre.SRE_Pattern object at 0x111018>

Hmmm... This doesn't look good -- Can I really count on a compiled
regular expression being an instance of _sre.SRE_Pattern for the
future?

From  Mon Aug 19 22:27:35 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 17:27:35 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 16:56:01 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <>
Message-ID: <>

> Could a method with a similar name be available for dicts as well?

Well, it wouldn't have any advantage over doing this "by hand",
extracting the keys into a list and sorting that.  The same reasoning
applies to the sets class, which is why I've made it a non-public
method (named '_repr').  It may go if I find a solution to the one use
there is in the test suite.
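
For illustration, the arrangement described above might look roughly like
this.  Only the method name _repr comes from the discussion; the class
name ReprSet, the sorted_items parameter, and the use of a built-in set
for storage are assumptions made for the sketch, in present-day Python 3
syntax.

```python
class ReprSet(object):
    """Hypothetical set wrapper with an unsorted __repr__ and a helper."""

    def __init__(self, items=()):
        self._data = set(items)

    def __iter__(self):
        return iter(self._data)

    def _repr(self, sorted_items=False):
        """Return the representation, optionally with sorted elements."""
        elements = list(self._data)
        if sorted_items:
            elements.sort()
        return "%s(%r)" % (self.__class__.__name__, elements)

    def __repr__(self):
        # Always unsorted, so repr(s) lists elements in iteration order.
        return self._repr()


if __name__ == "__main__":
    s = ReprSet([3, 1, 2])
    print(repr(s))
    print(s._repr(sorted_items=True))
```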

--Guido van Rossum (home page:

From  Mon Aug 19 22:43:01 2002
From: (Jeremy Hylton)
Date: Mon, 19 Aug 2002 17:43:01 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "AK" == Andrew Koenig <> writes:

  AK> How do I test for the third?  I guess I need to know the name of
  AK> the type of a compiled regular expression object.  Hmmm... A
  AK> quick scan through the documentation doesn't reveal it.  So I do
  AK> an experiment:

  AK>         >>> import re
  AK>         >>> re.compile("foo")
  AK>         <_sre.SRE_Pattern object at 0x111018>

  AK> Hmmm... This doesn't look good -- Can I really count on a
  AK> compiled regular expression being an instance of
  AK> _sre.SRE_Pattern for the future?

I'd put this at the module level:

compiled_re_type = type(re.compile(""))

Then you can use isinstance() to test:

isinstance(re.compile("spam+"), compiled_re_type)
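
Put together with the other three cases from the original question, the
whole dispatch could be sketched as below.  Base and categorize are
hypothetical names standing in for the poster's own class hierarchy and
code, and the sketch assumes, as discussed, that re.compile() keeps
returning objects of a single type.

```python
import re

# Captured once at module level, as suggested above.
compiled_re_type = type(re.compile(""))


class Base(object):
    """Hypothetical root of the user's own class hierarchy."""


def categorize(x):
    """Sort a value into one of the four categories from the question."""
    if isinstance(x, Base):
        return "hierarchy member"
    if callable(x):
        return "callable"
    if isinstance(x, compiled_re_type):
        return "compiled regular expression"
    return "anything else"


if __name__ == "__main__":
    for value in (Base(), len, re.compile("spam+"), 42):
        print(categorize(value))
```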


From  Mon Aug 19 22:44:30 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 17:44:30 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Objects bufferobject.c,2.19,2.20 complexobject.c,2.62,2.63 floatobject.c,2.114,2.115 intobject.c,2.91,2.92 stringobject.c,2.180,2.181 tupleobject.c,2.71,2.72
In-Reply-To: Your message of "Mon, 19 Aug 2002 16:18:13 CDT."
References: <>
Message-ID: <>

From  Mon Aug 19 22:45:02 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 17:45:02 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Objects bufferobject.c,2.19,2.20 complexobject.c,2.62,2.63 floatobject.c,2.114,2.115 intobject.c,2.91,2.92 stringobject.c,2.180,2.181 tupleobject.c,2.71,2.72
In-Reply-To: Your message of "Mon, 19 Aug 2002 16:18:13 CDT."
References: <>
Message-ID: <>

> Perhaps a comment at the definition of Foo that says "this code has
> been inlined elsewhere" makes sense so that if people fix bugs or
> enhance them they will be prompted to scout around for other places
> that need fixing.  (I hesitate to suggest that all the places a
> piece of code is inlined should be recorded, but perhaps that's
> another option.)

That's a good idea.  Maybe one of the "code janitors" can help with

--Guido van Rossum (home page:

From  Mon Aug 19 22:48:15 2002
From: (Tim Peters)
Date: Mon, 19 Aug 2002 17:48:15 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

> ...
> My current approach (now in CVS!) is to remove the sort_repr flag to
> the constructor, but to provide a method that can produce a sorted or
> an unsorted representation.

+1.  That's the best way to go.

> __repr__ will always return the items unsorted, which matches what
> repr(dict) does.  After all, I think it could be confusing to a user when
> 'print s' shows
>     Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> but
>     for i in s: print i,
> prints
>     9 8 7 6 5 4 3 2 1 0

>>> from sets import Set
>>> print Set(range(10))
Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

When I optimized a useless ~ out of the dict code for 2.2, it became much
more likely that the traversal order for an int-keyed dict would match
numeric order.  I have evidence that this has fooled newbies into believing
that dicts are ordered maps!  If it wouldn't cost an extra cycle, I'd be
tempted to slop the ~ back in again <0.9 wink>.

From  Mon Aug 19 23:23:02 2002
From: (Brett Cannon)
Date: Mon, 19 Aug 2002 15:23:02 -0700 (PDT)
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <Pine.SOL.4.44.0208191519550.18725-100000@death.OCF.Berkeley.EDU>

[Guido van Rossum]

> My current approach (now in CVS!) is to remove the sort_repr flag to
> the constructor, but to provide a method that can produce a sorted or
> an unsorted representation.  __repr__ will always return the items
> unsorted, which matches what repr(dict) does.  After all, I think it

I just updated my CVS copy and I like your implementation.  +1 from me for
how you are handling it.

> could be confusing to a user when 'print s' shows
>     Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> but
>     for i in s: print i,
> prints
>     9 8 7 6 5 4 3 2 1 0

Good point.  As François pointed out, newbies do expect dicts to always come
out in the same order (and I must say that I, like François, have never seen
any docs saying that it does always come out the same order as long as
nothing has mutated) and I would expect that expectation to carry over to

-Brett C.

From  Mon Aug 19 23:25:45 2002
From: (Andrew Koenig)
Date: Mon, 19 Aug 2002 18:25:45 -0400 (EDT)
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <> (message from Jeremy
 Hylton on Mon, 19 Aug 2002 17:43:01 -0400)
References: <>
 <> <>
Message-ID: <>

Jeremy> I'd put this at the module level:

Jeremy> compiled_re_type = type(re.compile(""))

Jeremy> Then you can use isinstance() to test:

Jeremy> isinstance(re.compile("spam+"), compiled_re_type)

But is it guaranteed that re.compile will always yield
an object of the same type?

From  Mon Aug 19 23:32:11 2002
From: (Jeremy Hylton)
Date: Mon, 19 Aug 2002 18:32:11 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "AK" == Andrew Koenig <> writes:

  Jeremy> I'd put this at the module level: compiled_re_type =
  Jeremy> type(re.compile(""))

  Jeremy> Then you can use isinstance() to test:

  Jeremy> isinstance(re.compile("spam+"), compiled_re_type)

  AK> But is it guaranteed that re.compile will always yield an object
  AK> of the same type?

Hard to say.  I can read the code and see that the current
implementation will always return objects of the same type.  In fact,
it's using type(sre_compile.compile("", 0)) internally to represent
that type.

That's not a guarantee.  Perhaps Fredrik wants to reserve the right to
change this in the future.  It's not unusual for Python modules to be
under-specified in this way.


From  Mon Aug 19 23:45:39 2002
From: (Andrew Koenig)
Date: 19 Aug 2002 18:45:39 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

Jeremy> Hard to say.  I can read the code and see that the current
Jeremy> implementation will always return objects of the same type.
Jeremy> In fact, it's using type(sre_compile.compile("", 0))
Jeremy> internally to represent that type.

Jeremy> That's not a guarantee.  Perhaps Fredrik wants to reserve the
Jeremy> right to change this in the future.  It's not unusual for
Jeremy> Python modules to be under-specified in this way.

The real point is that this is an example of why a uniform way
of checking for such types would be nice.  I shouldn't have
to read the source to figure out how to tell if something is
a compiled regular expression.

Andrew Koenig,,

From  Mon Aug 19 23:49:44 2002
From: (Brett Cannon)
Date: Mon, 19 Aug 2002 15:49:44 -0700 (PDT)
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
Message-ID: <Pine.SOL.4.44.0208191543080.19618-100000@death.OCF.Berkeley.EDU>

[Jeremy Hylton]

> >>>>> "AK" == Andrew Koenig <> writes:
>   Jeremy> I'd put this at the module level: compiled_re_type =
>   Jeremy> type(re.compile(""))
>   Jeremy> Then you can use isinstance() to test:
>   Jeremy> isinstance(re.compile("spam+"), compiled_re_type)
>   AK> But is it guaranteed that re.compile will always yield an object
>   AK> of the same type?
> Hard to say.  I can read the code and see that the current
> implementation will always return objects of the same type.  In fact,
> it's using type(sre_compile.compile("", 0)) internally to represent
> that type.
> That's not a guarantee.

This might be a stupid question, but why wouldn't
isinstance(re.compile("spam+"), type(re.compile(''))) always work (this is
Jeremy's code, just inlined)?  Unless the instance being tested was
marshalled (I think), the test should always work.  Even using an
unpickled instance (I think, again) should work since it would use the
current implementation of a pattern object.  So as long as the instance
being tested is not somehow being stored and then brought back using a
newer version of Python it should always work.

If not true, then I have been lied to. =)

-Brett C.

From  Mon Aug 19 23:59:14 2002
From: (Jeremy Hylton)
Date: Mon, 19 Aug 2002 18:59:14 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "ARK" == Andrew Koenig <> writes:

  Jeremy> Hard to say.  I can read the code and see that the current
  Jeremy> implementation will always return objects of the same type.
  Jeremy> In fact, it's using type(sre_compile.compile("", 0))
  Jeremy> internally to represent that type.

  Jeremy> That's not a guarantee.  Perhaps Fredrik wants to reserve
  Jeremy> the right to change this in the future.  It's not unusual
  Jeremy> for Python modules to be under-specified in this way.

  ARK> The real point is that this is an example of why a uniform way
  ARK> of checking for such types would be nice.  I shouldn't have to
  ARK> read the source to figure out how to tell if something is a
  ARK> compiled regular expression.

Let's assume for the moment that the re module wants to define an
explicit type of compiled regular expression objects.  This seems a
sensible thing to do, and it already has such a type internally.

I'm not sure how this relates to your real point.  You didn't have to
read the source code to figure out if something is a compiled regular
expression.  Instead, I recommended that you use type(obj) where obj
was a compiled regular expression.  It might have been convenient if
there was a module constant, such that re.CompiledRegexType ==

Then you asked if re.compile() was guaranteed to return an object of
the same type.  That question is all about the contract of the re
module.  The answer might have been: "No.  In version X, it happens to
always return objects of the same type, but in version Z, I may want
to change this."

I suppose we could get at the general question of checking types by
assuming that re.compile() returned instances of two apparently
unrelated classes and that we wanted a way to declare their
relationship.  I'm thinking of something like Haskell typeclasses


From  Tue Aug 20 01:33:56 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 20:33:56 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Mon, 19 Aug 2002 17:48:15 EDT."
References: <>
Message-ID: <>

> When I optimized a useless ~ out of the dict code for 2.2, it became much
> more likely that the traversal order for an int-keyed dict would match
> numeric order.  I have evidence that this has fooled newbies into believing
> that dicts are ordered maps!  If it wouldn't cost an extra cycle, I'd be
> tempted to slop the ~ back in again <0.9 wink>.

Maybe add a ~ to the int hash code?  That means it's not on the
critical path for dicts with string keys.

--Guido van Rossum (home page:

From  Tue Aug 20 01:35:53 2002
From: (David Abrahams)
Date: Mon, 19 Aug 2002 20:35:53 -0400
Subject: [Python-Dev] nested extension modules?
Message-ID: <012401c247e1$8c4570d0$>


Using the source (Luke), I was trying to figure out the best way to add a
nested submodule from within an extension module. I noticed that the module
initialization code will set the module name from the package context (if
set), altogether discarding any name passed explicitly:

[modsupport.c: Py_InitModule4()]

 if (_Py_PackageContext != NULL) {
  char *p = strrchr(_Py_PackageContext, '.');
  if (p != NULL && strcmp(name, p+1) == 0) {
   name = _Py_PackageContext;
   _Py_PackageContext = NULL;
  }
 }

This _Py_PackageContext is set up from within _PyImport_LoadDynamicModule

 oldcontext = _Py_PackageContext;
 _Py_PackageContext = packagecontext;
 (*p)();  /* call the extension module's init function */
 _Py_PackageContext = oldcontext;

IIUC, this means that when an extension module is loaded as part of a
package, any submodules I create by calling Py_InitModule<whatever> will
come out with the same name.


a. Have I got the analysis right?
b. Is there a more-sanctioned way around this other than touching
_Py_PackageContext (which seems to be intended to be private)?


           David Abrahams * Boost Consulting *

From  Tue Aug 20 01:42:50 2002
From: (Guido van Rossum)
Date: Mon, 19 Aug 2002 20:42:50 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: Your message of "Mon, 19 Aug 2002 18:25:45 EDT."
References: <> <> <>
Message-ID: <>

> But is it guaranteed that re.compile will always yield
> an object of the same type?

There are no guarantees in life, but I expect that that is something
that plenty of code depends on, so it will likely stay that way.

--Guido van Rossum (home page:

From  Tue Aug 20 02:12:09 2002
From: (Jonathan Riehl)
Date: Mon, 19 Aug 2002 20:12:09 -0500 (CDT)
Subject: [Python-Dev] PEP 269 versus 283.
Message-ID: <Pine.BSF.4.33.0208191959430.45577-100000@localhost>

	I was looking over some of the PEP's and I saw that 269 was
considered dead according to PEP 283.  This is kind of odd because I was
planning to have an implementation by the end of the week.  This is
subject to the constraints of reality; I am taking a whopping huge
vacation starting this next weekend.  It is either going to be ready for
python-dev to play with this week or in the middle of next month.
	My posts to the parser-sig are trying to be deferential to the
charter of the SIG (starting w/requirements for a general purpose parser
generator, not implementation of PEP 269).
	I am certainly going to try to wrangle the parser-sig onwards, but
a pgen module is way overdue.


From  Tue Aug 20 02:09:38 2002
From: (Greg Ewing)
Date: Tue, 20 Aug 2002 13:09:38 +1200 (NZST)
Subject: [Python-Dev] Another command line parser
Message-ID: <>

In view of the recent discussion on command line parsers,
you may be interested in the attached module which I wrote
in response to a posting.

The return values are designed so that they can be used
as the *args and/or **kwds arguments to a function if

#  A Pythonically minimalistic command line parser
#  Inspired by ideas from Huaiyu Zhu 
#  <> and Robert Biddle
#  <>.
#  Author: Greg Ewing <>

class CommandLineError(Exception):
  pass

def clparse(switches, flags, argv = None):
  """clparse(switches, flags, argv = None)

  Parse command line arguments.

  switches = string of option characters not taking arguments
  flags = string of option characters taking an argument
  argv = command line to parse (including program name), defaults
         to sys.argv

  Returns (args, options) where:

  args = list of non-option arguments
  options = dictionary mapping switch character to number of
            occurrences of the switch, and flag character to
            list of arguments specified with that flag

  Arguments following "--" are regarded as non-option arguments
  even if they start with a hyphen.
  """
  if not argv:
    import sys
    argv = sys.argv
  argv = argv[1:]
  opts = {}
  args = []
  for c in switches:
    opts[c] = 0
  for c in flags:
    if c in switches:
      raise ValueError("'%c' both switch and flag" % c)
    opts[c] = []
  seen_dashdash = 0
  while argv:
    arg = argv.pop(0)
    if arg == "--":
      seen_dashdash = 1
    elif not seen_dashdash and arg.startswith("-"):
      for c in arg[1:]:
        if c in switches:
          opts[c] += 1
        elif c in flags:
          try:
            val = argv.pop(0)
          except IndexError:
            raise CommandLineError("Missing argument for option -%c" % c)
          opts[c].append(val)
        else:
          raise CommandLineError("Unknown option -%c" % c)
    else:
      args.append(arg)
  return args, opts

if __name__ == "__main__":
  def spam(args, a, b, c, x, y, z):
    print "a =", a
    print "b =", b
    print "c =", c
    print "x =", x
    print "y =", y
    print "z =", z
    print "args =", args
  args, kwds = clparse("abc", "xyz")
  spam(args, **kwds)
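
To illustrate the point about using the return values as *args and **kwds, here is a minimal sketch in present-day Python 3. The literal values are what clparse("abc", "xyz", ...) would be expected to return for the command line "prog -a -x 1 file.txt"; they were worked out by hand from the code above, not captured from a run, and this spam() is only a stand-in for the one in the test block.

```python
def spam(args, a, b, c, x, y, z):
    # Switches arrive as occurrence counts, flags as lists of values.
    return {"args": args, "switches": (a, b, c), "flags": (x, y, z)}

# Hand-derived stand-in for:
#   clparse("abc", "xyz", ["prog", "-a", "-x", "1", "file.txt"])
args = ["file.txt"]
opts = {"a": 1, "b": 0, "c": 0, "x": ["1"], "y": [], "z": []}

# Each option character becomes a keyword argument of the same name.
result = spam(args, **opts)
print(result)
```

Because every switch and flag character is pre-seeded in the dictionary, the call succeeds even for options that never appeared on the command line.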

From  Tue Aug 20 04:41:58 2002
From: (Jeremy Hylton)
Date: Mon, 19 Aug 2002 23:41:58 -0400
Subject: [Python-Dev] PEP 269 versus 283.
In-Reply-To: <Pine.BSF.4.33.0208191959430.45577-100000@localhost>
References: <Pine.BSF.4.33.0208191959430.45577-100000@localhost>
Message-ID: <>

I lobbied in favor of PEP 269 earlier, because it is mostly exposing
functionality that already exists but is difficult to reuse.  That
seems like a good thing even if some people want to use other parser
generators.

I've only seen pgen used for one application, and that one application
has been modestly successful.  So a language can do worse than
starting with pgen.


From  Tue Aug 20 04:51:15 2002
From: (Aahz)
Date: Mon, 19 Aug 2002 23:51:15 -0400
Subject: [Python-Dev] Names again (was Re: type categories)
Message-ID: <>

On Thu, Aug 15, 2002, Guido van Rossum wrote:
>> In a dynamically typed language there is no such thing as an 'integer
>> variable' but it can be simulated by a reference that may only point to
>> objects in the 'integer' category.
> This seems a game with words.  I don't see the difference between an
> integer variable and a reference that must point to an integer.
> (Well, I see a difference, in the sharing semantics, but that's just
> the difference between a value and a pointer in C.  They're both
> variables.)

Going off on a tangent (and riding one of my favorite hobby horses),
Python doesn't have variables.
Aahz (           <*>

Project Vote Smart:

From  Tue Aug 20 05:29:18 2002
From: (Greg Ewing)
Date: Tue, 20 Aug 2002 16:29:18 +1200 (NZST)
Subject: [Python-Dev] PEP 269 versus 283.
In-Reply-To: <>
Message-ID: <>

Jeremy Hylton <>:

> I lobbied in favor of PEP 269 earlier, because it is mostly exposing
> functionality that already exists but is difficult to reuse.  That
> seems like a good thing even if some people want to use other parser
> generators.

There's a downside that you'd then be committed to supporting
it, even if Python stopped using pgen itself some time in the
future.

If that's not a worry, then fine -- just pointing out that
exposing previously unexposed functionality isn't necessarily
without cost.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Tue Aug 20 05:31:17 2002
From: (Greg Ewing)
Date: Tue, 20 Aug 2002 16:31:17 +1200 (NZST)
Subject: [Python-Dev] Names again (was Re: type categories)
In-Reply-To: <>
Message-ID: <>

Aahz <>:

> Going off on a tangent (and riding one of my favorite hobby horses),
> Python doesn't have variables.

Only for some definitions of the word "variable". And
not the definition we have in mind when we use the
word "variable" in a Python context (if we do at all).

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
                                   +--------------------------------------+

From  Tue Aug 20 05:55:50 2002
From: (Oren Tirosh)
Date: Tue, 20 Aug 2002 07:55:50 +0300
Subject: [Python-Dev] Re: Last call: mortal interned strings
In-Reply-To: <>; from on Fri, Aug 16, 2002 at 02:58:36PM -0400
References: <>
Message-ID: <>

I see that this has been checked in. 

My version:

	PyString_InternInPlace - immortal
	PyString_Intern - mortal 

Your version:

	PyString_InternInPlace - mortal
	PyString_InternImmortal - immortal

My version favors backward compatibility - existing modules will not break 
if they rely on the immortality of interned strings.

Your version appears to maximize the benefit of interned strings - existing
modules automatically get the new mortal semantics without requiring any
changes.

I was wondering what was the rationale behind this decision.

If the only reason was that the name PyString_Intern is not descriptive
enough it can be renamed to something like PyString_InternReference to
make it clear that it operates on a reference to a string and modifies 
it "in place".


From  Tue Aug 20 06:31:51 2002
From: (Tim Peters)
Date: Tue, 20 Aug 2002 01:31:51 -0400
Subject: [Python-Dev] Re: Last call: mortal interned strings
In-Reply-To: <>
Message-ID: <>

[Oren Tirosh, to Guido]
> I see that this has been checked in.
> My version:
> 	PyString_InternInPlace - immortal
> 	PyString_Intern - mortal
> Your version:
> 	PyString_InternInPlace - mortal
> 	PyString_InternImmortal - immortal
> My version favors backward compatibility - existing modules will
> not break if they rely on the immortality of interned strings.
> Your version appears to maximize the benefit of interned strings
> - existing modules automatically get the new mortal semantics without
> requiring any changes.
> I was wondering what was the rationale behind this decision.

My guess is that it was so existing modules automatically get the benefit of
mortal semantics without requiring any changes <wink> -- coupled with that
nobody believes any module outside the core relies on immortality (Jack's
Mac support code is part of the core, and Jack knows that).

> If the only reason was that the name PyString_Intern is not descriptive
> enough it can be renamed to something like PyString_InternReference to
> make it clear that it operates on a reference to a string and modifies
> it "in place".

If that's all there were to it, I expect Guido would have renamed
PyString_Intern to PyString_InternMortal (a "reference" suffix still doesn't
mean anything to me -- and you've explained it twice <wink>).

If we're wrong that extension modules don't rely on immortality, the alpha
and beta releases should shake that out for all major extensions, including
any that work their way under the PBF umbrella.

From  Tue Aug 20 11:17:42 2002
From: (Fredrik Lundh)
Date: Tue, 20 Aug 2002 12:17:42 +0200
Subject: [Python-Dev] type categories -- an example
References: <Pine.SOL.4.44.0208191543080.19618-100000@death.OCF.Berkeley.EDU>
Message-ID: <028401c24832$d3f48fa0$0900a8c0@spiff>

brett wrote:

> This might be a stupid question, but why wouldn't
> isinstance(re.compile("spam+"), type(re.compile('')))
> always work.

re.compile is a factory function, and it might (in theory) return
different types for different kinds of patterns.
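
For reference, a small check of the idiom under discussion. In current CPython it holds for every pattern tried here, but that is the observed behaviour of one implementation, not a guarantee made in this thread. The sample patterns are arbitrary.

```python
import re

# Type of whatever re.compile happens to return for the empty pattern.
pattern_type = type(re.compile(""))

samples = ["spam+", "^foo", r"\d{3}", "(a|b)*"]
results = [isinstance(re.compile(s), pattern_type) for s in samples]
print(results)
```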


From  Tue Aug 20 15:00:20 2002
From: (Casey Duncan)
Date: Tue, 20 Aug 2002 10:00:20 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Monday 19 August 2002 06:45 pm, Andrew Koenig wrote:
> Jeremy> Hard to say.  I can read the code and see that the current
> Jeremy> implementation will always return objects of the same type.
> Jeremy> In fact, it's using type(sre_compile.compile("", 0))
> Jeremy> internally to represent that type.
> Jeremy> That's not a guarantee.  Perhaps Fredrik wants to reserve the
> Jeremy> right to change this in the future.  It's not unusual for
> Jeremy> Python modules to be under-specified in this way.
> The real point is that this is an example of why a uniform way
> of checking for such types would be nice.  I shouldn't have
> to read the source to figure out how to tell if something is
> a compiled regular expression.

In general you wouldn't care whether it was a sre_foo or an sre_bar, just if
it acts like a compiled regular expression, and therefore supports that
interface. So, the real solution would be to have re assert that interface on
whatever the compiler returns so that you can check for it, something like:

if ISre.isImplementedBy(unknown_ob):
  # It's a regex

Where ISre is the compiled regular expression interface object. If the
implementation varies the test would still work. Even if the interface
varied, the test would work (but it might break other stuff).
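
A rough sketch of the idea, using the standard library's abc module from present-day Python as a stand-in for an interface object. ISre, SreFoo and SreBar are hypothetical names, and isImplementedBy is approximated here with isinstance; none of this is an existing part of re.

```python
from abc import ABC

class ISre(ABC):
    """Hypothetical marker for 'behaves like a compiled regular expression'."""

# Two unrelated implementation classes, standing in for sre_foo and sre_bar.
class SreFoo:
    def match(self, text):
        return text.startswith("foo")

class SreBar:
    def match(self, text):
        return "bar" in text

# The module asserts the interface on whatever its compiler may return.
ISre.register(SreFoo)
ISre.register(SreBar)

def is_regex(unknown_ob):
    # Plays the role of ISre.isImplementedBy(unknown_ob).
    return isinstance(unknown_ob, ISre)

print(is_regex(SreFoo()), is_regex(SreBar()), is_regex("not a regex"))
```

The caller never names a concrete class, so the implementation is free to change which classes it registers.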


From  Tue Aug 20 14:56:52 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 09:56:52 -0400
Subject: [Python-Dev] PEP 269 versus 283.
In-Reply-To: Your message of "Mon, 19 Aug 2002 20:12:09 CDT."
References: <Pine.BSF.4.33.0208191959430.45577-100000@localhost>
Message-ID: <>

> 	I was looking over some of the PEP's and I saw that 269 was
> considered dead according to PEP 283.  This is kind of odd because I was
> planning to have an implementation by the end of the week.

Well, but you could've told me! :-)

I'll gladly revive it.

> This is subject to the constraints of reality; I am taking a
> whopping huge vacation starting this next weekend.  It is either
> going to be ready for python-dev to play with this week or in the
> middle of next month.

Are you sure it's safe to expect your interest in this subject to
extend beyond the month of August? :-)

> 	My posts to the parser-sig are trying to be deferential to the
> charter of the SIG (starting w/requirements for a general purpose parser
> generator, not implementation of PEP 269).
> 	I am certainly going to try to wrangle the parser-sig onwards, but
> a pgen module is way overdue.


--Guido van Rossum (home page:

From  Tue Aug 20 15:14:26 2002
From: (Andrew Koenig)
Date: Tue, 20 Aug 2002 10:14:26 -0400 (EDT)
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <> (message from Casey Duncan on
 Tue, 20 Aug 2002 10:00:20 -0400)
References: <> <> <> <>
Message-ID: <>

Casey> In general you wouldn't care whether it was a sre_foo or an sre_bar, just if
Casey> it acts like a compiled regular expression, and therefore supports that 
Casey> interface. So, the real solution would be to have re assert that interface on 
Casey> whatever the compiler returns so that you can check for it, something like: 

Casey> if ISre.isImplementedBy(unknown_ob):
Casey>   # It's a regex

Casey> Where ISre is the compiled regular expression interface object. If the 
Casey> implementation varies the test would still work. Even if the interface 
Casey> varied, the test would work (but it might break other stuff).


From  Tue Aug 20 15:17:54 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 10:17:54 -0400
Subject: [Python-Dev] PEP 269 versus 283.
In-Reply-To: Your message of "Tue, 20 Aug 2002 16:29:18 +1200."
References: <>
Message-ID: <>

> There's a downside that you'd then be committed to supporting
> it, even if Python stopped using pgen itself some time in the
> future.

I'm not worried about that in this case.  First of all, supporting
pgen shouldn't be too much of an effort (I can see translating it into
Python at some point :-).  It could also be degraded into a 3rd party
module.  And if we switch to something better, the something better
will probably act as a better replacement for pgen (though with a
different API), coaxing people to upgrade anyway.

--Guido van Rossum (home page:

From  Tue Aug 20 15:19:35 2002
From: (Andrew Koenig)
Date: Tue, 20 Aug 2002 10:19:35 -0400 (EDT)
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
 (message from Guido van Rossum on Mon, 19 Aug 2002 20:42:50 -0400)
References: <> <> <>
 <> <>
Message-ID: <>

>> But is it guaranteed that re.compile will always yield
>> an object of the same type?

Guido> There are no guarantees in life, but I expect that that is something
Guido> that plenty of code depends on, so it will likely stay that way.

The kind of situation I imagine is that a regular expression might be
implemented not just as a single type but as a whole hierarchy of
them, with the particular type used for a regular expression depending
on the value of the regular expression.  For example:

   class Compiled_regexp(object):
      # ...

   class Anchored_regexp(Compiled_regexp):
      # ...

   class Unanchored_regexp(Compiled_regexp):
      # ...

where whether a regexp is anchored or unanchored depends on whether it
begins with "^".  (Contrived, but you get the idea).  In that case, it
is entirely possible that re.compile("") and re.compile("^foo") return
types such that neither is an instance of the other.

I understand that the regexp library doesn't work this way, and will
probably never work this way, but I'm using this example to show why
the technique of using the type returned by a particular library function
call to identify the results of future calls doesn't work in general.
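
To make that concrete, here is a runnable sketch of the contrived hierarchy. The factory compile_regexp is hypothetical and only stands in for re.compile; the real library does not work this way.

```python
class Compiled_regexp(object):
    def __init__(self, pattern):
        self.pattern = pattern

class Anchored_regexp(Compiled_regexp):
    pass

class Unanchored_regexp(Compiled_regexp):
    pass

def compile_regexp(pattern):
    # Hypothetical factory: the concrete type depends on the pattern's value.
    if pattern.startswith("^"):
        return Anchored_regexp(pattern)
    return Unanchored_regexp(pattern)

# The technique in question: remember the type of one sample result.
probe_type = type(compile_regexp(""))
anchored = compile_regexp("^foo")

print(isinstance(anchored, probe_type))        # check against the probe type
print(isinstance(anchored, Compiled_regexp))   # check against the common base
```

The first check fails because the two results are siblings; only a test against something both types share, such as the base class, identifies them both.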

From  Tue Aug 20 15:21:28 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 10:21:28 -0400
Subject: [Python-Dev] Re: Last call: mortal interned strings
In-Reply-To: Your message of "Tue, 20 Aug 2002 07:55:50 +0300."
References: <>
Message-ID: <>

> I see that this has been checked in. 
> My version:
> 	PyString_InternInPlace - immortal
> 	PyString_Intern - mortal 
> Your version:
> 	PyString_InternInPlace - mortal
> 	PyString_InternImmortal - immortal
> My version favors backward compatibility - existing modules will not break 
> if they rely on the immortality of interned strings.
> Your version appears to maximize the benefit of interned strings - existing
> modules automatically get the new mortal semantics without requiring any 
> changes.
> I was wondering what was the rationale behind this decision.

I can only repeat what I said before about this:

"""But the vast majority of C code does *not* depend on this.  I'd
rather keep PyString_InternInPlace(), so we don't have to change all
call locations, only the very rare ones that rely on this."""

> If the only reason was that the name PyString_Intern is not descriptive
> enough it can be renamed to something like PyString_InternReference to
> make it clear that it operates on a reference to a string and modifies 
> it "in place".

It wasn't that.

--Guido van Rossum (home page:

From  Tue Aug 20 15:24:37 2002
From: (Andrew Koenig)
Date: 20 Aug 2002 10:24:37 -0400
Subject: [Python-Dev] type categories -- an example
In-Reply-To: <>
References: <>
Message-ID: <>

Jeremy> Then you asked if re.compile() was guaranteed to return an
Jeremy> object of the same type.  That question is all about the
Jeremy> contract of the re module.  The answer might have been: "No.
Jeremy> In version X, it happens to always return objects of the same
Jeremy> type, but in version Z, I may want to change this."

Jeremy> I suppose we could get at the general question of checking
Jeremy> types by assuming that re.compile() returned instances of two
Jeremy> apparently unrelated classes and that we wanted a way to
Jeremy> declare their relationship.  I'm thinking of something like
Jeremy> Haskell typeclasses here.

Right.  And the classes don't even have to be unrelated -- it's
enough that neither one is derived from the other (for instance,
that they both be derived from a third class).

Andrew Koenig,,

From  Tue Aug 20 17:13:50 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 12:13:50 -0400
Subject: [Python-Dev] nested extension modules?
In-Reply-To: Your message of "Mon, 19 Aug 2002 20:35:53 EDT."
References: <012401c247e1$8c4570d0$>
Message-ID: <>

> Using the source (Luke), I was trying to figure out the best way to add a
> nested submodule from within an extension module. I noticed that the module
> initialization code will set the module name from the package context (if
> set), altogether discarding any name passed explicitly:
> [modsupport.c: Py_InitModule4()]
>  ...
>  if (_Py_PackageContext != NULL) {
>   char *p = strrchr(_Py_PackageContext, '.');
>   if (p != NULL && strcmp(name, p+1) == 0) {
>    name = _Py_PackageContext;
>    _Py_PackageContext = NULL;
>   }
>  }
> This _Py_PackageContext is set up from within _PyImport_LoadDynamicModule
> [importdl.c:]
>  ...
>  oldcontext = _Py_PackageContext;
>  _Py_PackageContext = packagecontext;
>  (*p)();
>  _Py_PackageContext = oldcontext;
> IIUC, this means that when an extension module is loaded as part of a
> package, any submodules I create by calling Py_InitModule<whatever> will
> come out with the same name.
> Questions:
> a. Have I got the analysis right?

Not quite, if I understand what you're saying.  The package context,
despite its name, is not the package name, but the full name of the
*module*, when the shared library is found inside a package.

If, e.g., a package directory P contains an extension module file, the package context is set to "P.E".  The initE() function is
supposed to call Py_InitModule4() with "E" as the module name.
Py_InitModule4() then sees that this is the last component of the
package context, and changes the module name to "P.E".  It also nulls
out the package context.

The checkin comment I made back in 1997 explains this:

    Fix importing of shared libraries from inside packages.
    This is a bit of a hack: when the shared library is loaded, the
    module name is "package.module", but the module calls
    Py_InitModule*() with just "module" for the name.  The shared
    library loader squirrels away the true name of the module in
    _Py_PackageContext, and Py_InitModule*() will substitute this (if
    the name actually matches).

> b. Is there a more-sanctioned way around this other than touching
> _Py_PackageContext (which seems to be intended to be private)

I think using _Py_PackageContext is your only hope.  If you contribute
some docs for it we'll gladly add them to the API docs.

--Guido van Rossum (home page:

From David Abrahams" <  Tue Aug 20 17:54:49 2002
From: David Abrahams" < (David Abrahams)
Date: Tue, 20 Aug 2002 12:54:49 -0400
Subject: [Python-Dev] nested extension modules?
References: <012401c247e1$8c4570d0$>  <>
Message-ID: <047c01c2486a$db15afc0$>

From: "Guido van Rossum" <>

> > Using the source (Luke), I was trying to figure out the best way to add a
> > nested submodule from within an extension module. I noticed that the module
> > initialization code will set the module name from the package context (if
> > set), altogether discarding any name passed explicitly:
> >
> > [modsupport.c: Py_InitModule4()]
> >
> >  ...
> >  if (_Py_PackageContext != NULL) {
> >   char *p = strrchr(_Py_PackageContext, '.');
> >   if (p != NULL && strcmp(name, p+1) == 0) {
> >    name = _Py_PackageContext;
> >    _Py_PackageContext = NULL;
> >   }
> >  }
> >
> > This _Py_PackageContext is set up from within _PyImport_LoadDynamicModule
> > [importdl.c:]
> >
> >  ...
> >  oldcontext = _Py_PackageContext;
> >  _Py_PackageContext = packagecontext;
> >  (*p)();
> >  _Py_PackageContext = oldcontext;
> >
> > IIUC, this means that when an extension module is loaded as part of a
> > package, any submodules I create by calling Py_InitModule<whatever> will
> > come out with the same name.
> >
> > Questions:
> >
> > a. Have I got the analysis right?
> Not quite, if I understand what you're saying.  The package context,
> despite its name, is not the package name, but the full name of the
> *module*, when the shared library is found inside a package.

I think I understood that part.

> If, e.g., a package directory P contains an extension module file
>, the package context is set to "P.E".  The initE() function is
> supposed to call Py_InitModule4() with "E" as the module name.
> Py_InitModule4() then sees that this is the last component of the
> package context, and changes the module name to "P.E".

Yeah, that's what I expected.

> It also nulls out the package context.

Oops! I missed that part. Maybe that makes my problem imaginary, except
that you go on to say...

> The checkin comment I made back in 1997 explains this:
>     Fix importing of shared libraries from inside packages.
>     This is a bit of a hack: when the shared library is loaded, the
>     module name is "package.module", but the module calls
>     Py_InitModule*() with just "module" for the name.  The shared
>     library loader squirrels away the true name of the module in
>     _Py_PackageContext, and Py_InitModule*() will substitute this (if
>     the name actually matches).
> > b. Is there a more-sanctioned way around this other than touching
> > _Py_PackageContext (which seems to be intended to be private)
> I think using _Py_PackageContext is your only hope.  If you contribute
> some docs for it we'll gladly add them to the API docs.

Hmm, my only hope for what? What I was worried about was that if I tried to
create a nested sub-extension module from within my extension module by
calling Py_InitModuleXXX() directly, its name would be forced to be the
same as that of the outer extension module. Since you pointed out that
_Py_PackageContext was being nulled out, I don't think that's much of an
issue. What issues /do/ I need to be aware of when doing this?


           David Abrahams * Boost Consulting *

From  Tue Aug 20 18:06:24 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 13:06:24 -0400
Subject: [Python-Dev] nested extension modules?
In-Reply-To: Your message of "Tue, 20 Aug 2002 12:54:49 EDT."
References: <012401c247e1$8c4570d0$> <>
Message-ID: <>

> > It also nulls out the package context.
> Oops! I missed that part. Maybe that makes my problem imaginary, except
> that you go on to say...
> > The checkin comment I made back in 1997 explains this:
> >
> >     Fix importing of shared libraries from inside packages.
> >     This is a bit of a hack: when the shared library is loaded, the
> >     module name is "package.module", but the module calls
> >     Py_InitModule*() with just "module" for the name.  The shared
> >     library loader squirrels away the true name of the module in
> >     _Py_PackageContext, and Py_InitModule*() will substitute this (if
> >     the name actually matches).
> >
> > > b. Is there a more-sanctioned way around this other than touching
> > > _Py_PackageContext (which seems to be intended to be private)
> >
> > I think using _Py_PackageContext is your only hope.  If you contribute
> > some docs for it we'll gladly add them to the API docs.
> Hmm, my only hope for what? What I was worried about was that if I tried to
> create a nested sub-extension module from within my extension module by
> calling Py_InitModuleXXX() directly, its name would be forced to be the
> same as that of the outer extension module. Since you pointed out that
> _Py_PackageContext was being nulled out, I don't think that's much of an
> issue. What issues /do/ I need to be aware of when doing this?

I guess I misunderstood what you were trying to accomplish; I thought
you were asking if there was a more accepted way of doing this besides
setting _Py_PackageContext?

I don't understand why the nulling out of _Py_PackageContext makes a
difference for what you were trying to do -- unless the last component
of your submodule's name is the same as its parent's (e.g. you're
creating a submodule X.X inside a module X), the strcmp() could never
succeed.  Also, if the name passed to Py_InitModuleXXX() contains a
dot, the strcmp() can never succeed (since it's applied to the last
component of the package context).

--Guido van Rossum (home page:

From  Tue Aug 20 18:10:18 2002
From: (Michael McLay)
Date: Tue, 20 Aug 2002 13:10:18 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <>
Message-ID: <>

On Monday 19 August 2002 04:38 pm, Guido van Rossum wrote:
> > [Guido van Rossum]
> >
> > > OTOH what then to do with _sort_repr -- make it a class var or an
> > > instance var?
> [Brett C]
> > Well, how often can you imagine someone printing out a single set sorted,
> > but having other sets that they didn't want printed out sorted?  I would
> > suspect that it is going to be a very rare case when someone wants just
> > part of their sets printing sorted and the rest not.
> >
> > I say make it a class var.
> Hm, but what if two different library modules have conflicting
> requirements?  E.g. module A creates sets of complex numbers and must
> have sort_repr=False, while module B needs sort_repr=True for
> user-friendliness (or because it relies on this).

Adding a SortedSet class to the module would partially solve the problem of 
globally clobbering the usage of Set in other modules.  The downside is that 
the selection of the sort function would be a static decision made when
a set instance is created.

>>> class Set(object):
...     _sort_repr=False
...     __slots__ = ["_data"]
...     def __init__(self....

>>> class SortedSet(Set):
...     _sort_repr=True
...
>>> ss = SortedSet()
>>> s = Set()
>>> s._sort_repr
False
>>> ss._sort_repr
True

From  Tue Aug 20 18:22:32 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 13:22:32 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 13:10:18 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <>
Message-ID: <>

> Adding a SortedSet class to the module would partially solve the problem of 
> globally clobbering the usage of Set in other modules.

I say YAGNI.

I am still perplexed that I received *no* feedback on the sets module
except on this issue of sort order (which I consider solved by adding
a method _repr() that takes an optional 'sorted' argument).
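
Roughly, the idea looks like the following. This is a stripped-down sketch in present-day Python 3 and not the actual sets.py code; only the _repr() name and its 'sorted' argument come from the description above, and the dictionary-backed storage is an assumption made for illustration.

```python
class Set:
    def __init__(self, iterable=()):
        # Elements are stored as dictionary keys, in insertion order.
        self._data = dict.fromkeys(iterable, True)

    def _repr(self, sorted=False):
        elements = list(self._data)
        if sorted:
            elements.sort()  # raises TypeError for unorderable elements
        return "%s(%r)" % (self.__class__.__name__, elements)

    def __repr__(self):
        # The default representation never sorts.
        return self._repr()

s = Set([3, 1, 2])
print(repr(s))
print(s._repr(sorted=True))
```

Sorting is requested per call, so a module holding sets of complex numbers and a module wanting sorted output do not have to agree on a class-wide setting.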

--Guido van Rossum (home page:

From  Tue Aug 20 18:25:13 2002
From: (Michael Hudson)
Date: 20 Aug 2002 18:25:13 +0100
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Guido van Rossum's message of "Tue, 20 Aug 2002 13:22:32 -0400"
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <>
Message-ID: <>

Guido van Rossum <> writes:

> I am still perplexed that I received *no* feedback on the sets module
> except on this issue of sort order (which I consider solved by adding
> a method _repr() that takes an optional 'sorted' argument).

This is hardly without precedent, though, is it?  (I mean only
getting feedback on trivia).

I haven't looked at the set implementation in detail, but given that
its principal authors are you and Alex, I'm sure it must be
wonderful.  Is that better? :)


  Or here's an even simpler indicator of how much C++ sucks: Print
  out the C++ Public Review Document.  Have someone  hold it about
  three feet  above your head and then drop it.  Thus  you will be
  enlightened.                                        -- Thant Tessman

From  Tue Aug 20 21:07:52 2002
From: (Fredrik Lundh)
Date: Tue, 20 Aug 2002 22:07:52 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <> <> <>              <>  <>
Message-ID: <026c01c24885$460247c0$ced241d5@hagrid>

guido wrote:    

> > As long as users do not discover it, they will not use it! :-)
> We can mull that over until the first beta release.

is there a list somewhere of things that should be mulled over?

(e.g. set api issues, textfile(filename, mode, encoding) instead of
that ugly "U" flag, datetime/basetime stuff, bwidgets additions to
tkinter, tk 8.4 updates, etc)


From  Tue Aug 20 21:28:54 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 16:28:54 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 22:07:52 +0200."
References: <> <> <> <> <>
Message-ID: <>

> is there a list somewhere of things that should be mulled over?
> (e.g. set api issues, textfile(filename, mode, encoding) instead of
> that ugly "U" flag, datetime/basetime stuff, bwidgets additions to
> tkinter, tk 8.4 updates, etc)

I've added these to PEP 283.  Anybody who has a suggestion please edit
that PEP (or mail it to me if you don't have checkin perms).

--Guido van Rossum (home page:

From  Tue Aug 20 21:36:11 2002
From: (Jack Jansen)
Date: Tue, 20 Aug 2002 22:36:11 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <p05111702b986a3c4d470@[]>
Message-ID: <>

On Monday, August 19, 2002, at 03:36, Matthias Urlichs wrote:

> Guido:
>>  My guess it's not his listdir() or filesystem, but the keyboard
>>  driver.
> No, it's MacOSX. It always uses the decomposed form.
> That is very noticeable via NFS volumes where files with 
> combined character names are unopenable from the GUI.
> I've filed a bug report about that - I don't know whether OSX 
> 10.2 will allow NFC filenames, at least read-only.

This must be an oversight (or maybe something they didn't 
implement because of lack of time?). They have all the machinery 
in place to do on the fly conversion of filenames, it is used 
for HFS (old-style HFS, not HFS+) and SMB filesystems, where you 
specify the character set of the filesystem at mount time, and 
they do NFC-NFD conversion in the system call interface.
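
To make the NFC/NFD distinction concrete, here is a small sketch using the unicodedata module in present-day Python 3. The name "café" is made up for illustration, and the snippet says nothing about what any particular filesystem actually stores.

```python
import unicodedata

composed = "caf\u00e9"  # NFC: the e-acute is the single code point U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two forms display identically but compare unequal, which is why a name stored in one form cannot be opened using the other unless something converts in between.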
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Tue Aug 20 21:31:10 2002
From: (Michael McLay)
Date: Tue, 20 Aug 2002 16:31:10 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <>
Message-ID: <>

On Tuesday 20 August 2002 01:22 pm, Guido van Rossum wrote:

> I am still perplexed that I received *no* feedback on the sets module
> except on this issue of sort order (which I consider solved by adding
> a method _repr() that takes an optional 'sorted' argument).

I haven't read the entire thread, but I was puzzled by the implementation 
approach. Did you consider kjbuckets for the standard Python distribution? 
While the claim is rather old, the following quote from Aaron's intro [1]  to 
the module suggests it might improve performance:

   For suitably large compute intensive uses these types should provide up to
   an order of magnitude speedup versus an implementation that uses analogous
   operations implemented directly in Python. 

Adding the gadfly SQL database to the standard library would also be useful, 
but since it is back under development it would be best for gadfly to live on 
a separate release cycle. The kjbuckets software, however, doesn't seem  to 
be changing.

One more reason for adding kjbuckets: Tim Berners-Lee might find the kjGraphs 
class useful for the semantic web work. 


From  Tue Aug 20 21:49:25 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 16:49:25 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 16:31:10 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <>
Message-ID: <>

> > I am still perplexed that I received *no* feedback on the sets module
> > except on this issue of sort order (which I consider solved by adding
> > a method _repr() that takes an optional 'sorted' argument).
> I haven't read the entire thread, but I was puzzled by the implementation 
> approach. Did you consider kjbuckets for the standard Python distribution? 

No.  I think that would be the wrong idea at this point for two
reasons: (1) never change two variables at the same time; (2) let's
gather some experience with the new set API first, before we start
worrying about implementation speed.

I also believe that kjbuckets maintains its data in a sorted order,
which is unnecessary for sets -- a hash table is much faster.  After
all we use a very fast hash table implementation to represent sets.
(The only improvement would be that we could save maybe 4 bytes per
hash table entry because we don't need a value pointer.)

> While the claim is rather old, the following quote from Aaron's
> intro [1] to the module suggests it might improve performance:
>    For suitably large compute intensive uses these types should
>    provide up to an order of magnitude speedup versus an
>    implementation that uses analogous operations implemented
>    directly in Python.

The sets module does not implement analogous operations directly in
Python.  Almost all the implementation work is done by the dict

> Adding the gadfly SQL database to the standard library would also be
> useful, but since it is back under development it would be best for
> gadfly to live on a separate release cycle. The kjbuckets software,
> however, doesn't seem to be changing.

Because nobody is maintaining it any more.

> One more reason for adding kjbuckets: Tim Berners-Lee might find the
> kjGraphs class useful for the semantic web work.
> [1]

kjbuckets may be nice, but adding it to the core would add a serious
new maintenance burden for the core developers.  I don't see anyone
raising their hand to help out here.

--Guido van Rossum (home page:

From  Tue Aug 20 22:17:13 2002
From: (Skip Montanaro)
Date: Tue, 20 Aug 2002 16:17:13 -0500
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <>
Message-ID: <15714.45529.615062.997294@gargle.gargle.HOWL>

    tim> Straight character n-grams are very appealing because they're the
    tim> simplest and most language-neutral; I didn't have any luck with
    tim> them over the weekend, but the size of my training data was
    tim> trivial.

Anybody up for pooling corpi (corpora?)?


From  Tue Aug 20 22:27:06 2002
From: (Raymond Hettinger)
Date: Tue, 20 Aug 2002 17:27:06 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <>              <>  <>
Message-ID: <009801c24890$565b26e0$1bf8a4d8@othello>

From: "Guido van Rossum" <>
> I am still perplexed that I received *no* feedback on the sets module
> except on this issue of sort order (which I consider solved by adding
> a method _repr() that takes an optional 'sorted' argument).

I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet.  It should be replaced by code raising a
TypeError just like we do for basestring:

>>> basestring('abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: The basestring type cannot be instantiated
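
For illustration, a minimal sketch of that suggestion. The class bodies are
hypothetical stand-ins, not the module's actual code; only the names BaseSet,
Set and ImmutableSet come from the module under discussion:

```python
class BaseSet:
    """Abstract base: holds shared behaviour, must not be instantiated."""

    def __init__(self):
        # Hypothetical guard: refuse direct instantiation of the base class.
        if self.__class__ is BaseSet:
            raise TypeError("BaseSet is an abstract class; "
                            "use Set or ImmutableSet")

    def __len__(self):
        return len(self._data)


class Set(BaseSet):
    def __init__(self, iterable=()):
        # Construction logic lives in the concrete subclass.
        self._data = dict.fromkeys(iterable, True)


class ImmutableSet(BaseSet):
    def __init__(self, iterable=()):
        self._data = dict.fromkeys(iterable, True)
        self._hashcode = None  # placeholder for a lazily computed hash
```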

Raymond Hettinger

P.S.  More comments are on the way as we play with, profile, review, optimize, and document the module ;)

From  Tue Aug 20 22:27:11 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 17:27:11 -0400
Subject: [Python-Dev] What is a backport candidate?
In-Reply-To: Your message of "Tue, 20 Aug 2002 17:15:20 EDT."
References: <>
Message-ID: <>

> When we say "backport candidate", does that mean we need to think
> about it more or that it is waiting for someone like me to pounce on
> it and get it done?

It means somebody (like you :-) should do triage on the feasibility of
it.  The triage can have several outcomes:

- Trivial yes: the patch applies directly to the 2.2 branch and
  doesn't cause problems there.  In this case, you can apply it right
  away and be done with it.

- Trivial no: the patch doesn't make sense at all -- this should only
  happen when the patch patches code that was added in 2.3; in this
  case the backport/bugfix marking was a mistake, but mistakes happen.

- Needs work: the idea behind the patch applies to 2.2, but the code
  there is sufficiently different that patch (or cvs update -j)
  doesn't quite work.  There are gradations of this, depending on
  what's in the way.  In this case, you may put it off.

We need a database of these triage decisions; the new RoundUp-based
tracker (prototype at is supposed to have a feature
to add this info to the tracker, but I don't know how it works or
whether it is adequate yet.

I'm cc'ing this to python-dev since others may be interested in this
topic.  Also note that I believe we've been inconsistent in marking up
candidates: some say "bugfix candidate", some say "backport
candidate", some may not be marked at all. :-(

--Guido van Rossum (home page:

From  Tue Aug 20 22:23:12 2002
From: (Barry A. Warsaw)
Date: Tue, 20 Aug 2002 17:23:12 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
Message-ID: <>

>>>>> "SM" == Skip Montanaro <> writes:

    tim> Straight character n-grams are very appealing because they're
    tim> the simplest and most language-neutral; I didn't have any
    tim> luck with them over the weekend, but the size of my training
    tim> data was trivial.

    SM> Anybody up for pooling corpi (corpora?)?

I've got collections from python-dev, python-list, edu-sig,
mailman-developers, and zope3-dev, chopped at Feb 2002, which is
approximately when Greg installed SpamAssassin.  The collections are
/all/ known good, but pretty close (they should be verified by hand).

The idea is to take some random subsets of these, cat them together
and use them as both training and test data, along with some
'net-available known spam collections.

No time more to play with this today though...

From  Tue Aug 20 22:41:13 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 17:41:13 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 17:27:06 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <>
Message-ID: <>

> From: "Guido van Rossum" <>
> > I am still perplexed that I received *no* feedback on the sets module
> > except on this issue of sort order (which I consider solved by adding
> > a method _repr() that takes an optional 'sorted' argument).
> I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet.  It should be replaced by code raising a
> TypeError just like we do for basestring:
> >>> basestring('abc')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: The basestring type cannot be instantiated

Good idea.  I checked this in, raising NotImplementedError.

> Raymond Hettinger
> P.S.  More comments are on the way as we play with, profile, review,
> optimize, and document the module ;)

Didn't you submit a SF patch/bug?  I think I replied to that.

--Guido van Rossum (home page:

From  Tue Aug 20 22:49:05 2002
From: (Matthias Urlichs)
Date: Tue, 20 Aug 2002 23:49:05 +0200
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>; from on Tue, Aug 20, 2002 at 10:36:11PM +0200
References: <p05111702b986a3c4d470@[]> <>
Message-ID: <>


Jack Jansen:
> > That is very noticeable via NFS volumes where files with 
> > combined character names are unopenable from the GUI.
> This must be an oversight (or maybe something they didn't 
> implement because of lack of time?). They have all the machinery 
> in place to do on the fly conversion of filenames, it is used 
> for HFS (old-style HFS, not HFS+) and SMB filesystems, where you 
> specify the character set of the filesystem at mount time, and 
> they do NFC-NFD conversion in the system call interface.

Specifying the charset at mount time doesn't work for mount_nfs -- maybe
they fix that in Jaguar (10.2).

Matthias Urlichs     |     noris network AG     |

From  Tue Aug 20 22:51:02 2002
From: (Tim Peters)
Date: Tue, 20 Aug 2002 17:51:02 -0400
Subject: [Python-Dev] RE: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <15714.45529.615062.997294@gargle.gargle.HOWL>
Message-ID: <>

[Skip Montanaro]
> Anybody up for pooling corpi (corpora?)?

Barry is collecting clean data from mailing-list archives for lists hosted
at  It's unclear that this will be useful for anything other
than mailing lists hosted at (which I expect have a lot of topic

There's a lovely spam archive here:

From  Tue Aug 20 23:39:56 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 20 Aug 2002 18:39:56 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU>
Message-ID: <>

[Guido van Rossum]

> I am still perplexed that I received *no* feedback on the sets module

As I previously said, I feel comfortable with what I read and saw.
I'd probably have to use sets for having more circumstantiated comments.
Unless you offer the source on and ask for more users' opinions?

Maybe some people would have preferred to see more usual notation, like `+'
for union and `*' for intersection, rather than `or' and `and'?  There are
tiny pros and cons in each direction.  For one, I'll gladly use what is
available, I'm not really going to crusade for either notation...

Should there be special provisions for Sets to interoperate magically with
lists or iterators?  Lists and iterators could be considered as ordered sets
with duplicates allowed.  Even if it could be tinily useful, it is surely
not difficult to explicitly "cast" lists and iterators using the `Set'
constructor.  It is already easy to build an iterator or a list out of a set.

Criticism?  OK!  What about supporting infinite sets? :-) Anything else?
Hmph!  The module doc-string has the word "actually" with three `l'! :-)

François Pinard

From  Wed Aug 21 00:06:34 2002
From: (Paul Prescod)
Date: Tue, 20 Aug 2002 16:06:34 -0700
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
 <15714.45529.615062.997294@gargle.gargle.HOWL> <>
Message-ID: <>

Some perhaps relevant links (with no off-topic discussion):




"""My finding is that it is _nowhere_ near sufficient to have two
populations, "spam" versus "not spam."  

If you muddle together the Nigerian Pyramid schemes with the "Penis
enhancement" ads along with the offers of new credit cards as well as
the latest sites where you can talk to "hot, horny girls LIVE!", the
statistics don't work out nearly so well.

It's hard to tell, on the face of it, why Nigerian scams _should_ be
considered textually similar to phone sex ads, and in practice, the
result of throwing them all together"

There are a few things left to improve about Ifile, and I'd like to
redo it in some language fundamentally less painful to work with than
C """

"Barry A. Warsaw" wrote:
> >>>>> "SM" == Skip Montanaro <> writes:
>     tim> Straight character n-grams are very appealing because they're
>     tim> the simplest and most language-neutral; I didn't have any
>     tim> luck with them over the weekend, but the size of my training
>     tim> data was trivial.
>     SM> Anybody up for pooling corpi (corpora?)?
> I've got collections from python-dev, python-list, edu-sig,
> mailman-developers, and zope3-dev, chopped at Feb 2002, which is
> approximately when Greg installed SpamAssassin.  The collections are
> /all/ known good, but pretty close (they should be verified by hand).
> The idea is to take some random subsets of these, cat them together
> and use them as both training and test data, along with some
> 'net-available known spam collections.
> No time more to play with this today though...
> -Barry
> _______________________________________________
> Python-Dev mailing list

 Paul Prescod

From  Wed Aug 21 00:17:38 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 19:17:38 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <>
Message-ID: <>

[Guido van Rossum] 
> I am still perplexed that I received *no* feedback on the sets module

It should have powerset and cartesian-product methods.  Shall I code them?
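
Roughly what I have in mind, sketched as free functions rather than methods.
The names powerset and cartesian_product are placeholders, and the sketch
leans on the itertools module and the built-in set/frozenset types of later
Python releases, so treat it as an illustration under those assumptions:

```python
from itertools import combinations, product


def powerset(s):
    """Return the set of all subsets of s, each as a frozenset."""
    items = list(s)
    return {frozenset(combo)
            for size in range(len(items) + 1)
            for combo in combinations(items, size)}


def cartesian_product(a, b):
    """Return the set of all (x, y) pairs with x in a and y in b."""
    return set(product(a, b))
```

A powerset of an n-element set has 2**n members, so this is only practical
for small inputs.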
		Eric S. Raymond

From  Wed Aug 21 00:23:46 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 19:23:46 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <15714.45529.615062.997294@gargle.gargle.HOWL> <> <>
Message-ID: <>

Paul Prescod <>:
> Some perhaps relevant links (with no off-topic discussion):
>  *

I'm in the process of speed-tuning this now.  I intend for it to be
blazingly fast, usable for sites that process 100K mails a day, and I
think I know how to do that.  This is not a natural application for
Python :-).
> """My finding is that it is _nowhere_ near sufficient to have two
> populations, "spam" versus "not spam."  

Well, except it seems to work quite well.  The Nigerian trigger-word
population is distinct from the penis-enlargement population, but they
both show up under Bayesian analysis.
		Eric S. Raymond

From  Wed Aug 21 00:23:30 2002
From: (Raymond Hettinger)
Date: Tue, 20 Aug 2002 19:23:30 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <>              <009801c24890$565b26e0$1bf8a4d8@othello>  <>
Message-ID: <011001c248a0$99575580$1bf8a4d8@othello>

> > > I am still perplexed that I received *no* feedback on the sets module
> > > except on this issue of sort order (which I consider solved by adding
> > > a method _repr() that takes an optional 'sorted' argument).

> > P.S.  More comments are on the way as we play with, profile, review,
> > optimize, and document the module ;)

> Didn't you submit a SF patch/bug?  I think I replied to that.

Yes.  I've now revised the patch accordingly.

More thoughts:

1. Rename .remove() to __del__().  Its usage is inconsistent with list.remove(element) which can leave other instances of element
in the list.  It is more consistent with 'del adict[element]'.

2.  discard() looks like a useful standard API.  Perhaps it should be added to the dictionary API.

3.  Should we add .as_temporarily_immutable  to dictionaries and lists so that they will also be potential elements of a set?

4. remove(), update(), add(), and __contains__() all work hard to check for .as_temporarily_immutable().  Should this be propagated
to other methods that add set members(i.e. replace all instances of data[element] = value with self.add(element) or use
self.update() in the code for __init__())?

The answer is tough because it causes an enormous slowdown in the common use cases of uniquifying a sequence.  OTOH, why check in
some places but not others -- why is .add(aSetInstance) okay but not Set([aSetInstance]).

If the answer is yes, then the code for update() should be super-optimized by moving the try/except outside the for-loop
and wrapping the whole thing in a while 1.  Also, we could bypass the slower .add() method when incoming source of elements is
known to be an instance of BaseSet.

5. Add a quick pre-check to issubset() and issuperset() along the lines of:

        def issubset(self, other):
            """Report whether another set contains this set."""
            if len(self) > len(other): return False   # Fast check for the obvious case
            for elt in self:
                if elt not in other:
                    return False
            return True

6.  For clarity and foolish consistency, replace all occurrences of 'elt' with 'element'.
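
To make the restructuring in point 4 concrete, here is a rough sketch of
update() with the try/except hoisted out of the for-loop and the whole thing
wrapped in a while loop.  MiniSet and FrozenView are toy stand-ins invented
for this example, and _as_immutable is an assumed hook name; none of this is
the module's actual code:

```python
class FrozenView:
    """Hypothetical hashable snapshot of a MiniSet's elements."""

    def __init__(self, items):
        self.items = frozenset(items)

    def __hash__(self):
        return hash(self.items)

    def __eq__(self, other):
        return isinstance(other, FrozenView) and self.items == other.items


class MiniSet:
    """Toy dict-backed mutable set; deliberately unhashable."""

    __hash__ = None  # inserting a MiniSet into a dict raises TypeError

    def __init__(self, iterable=()):
        self._data = {}
        self.update(iterable)

    def __len__(self):
        return len(self._data)

    def _as_immutable(self):
        return FrozenView(self._data)

    def update(self, iterable):
        data = self._data
        it = iter(iterable)
        while True:
            try:
                # Fast path: no per-element exception handling.
                for element in it:
                    data[element] = True
                return
            except TypeError:
                # Slow path: only reached for an unhashable element.
                transform = getattr(element, "_as_immutable", None)
                if transform is None:
                    raise  # genuinely unhashable, e.g. a list
                data[transform()] = True
                # Loop again; the iterator resumes after this element.
```

Because the constructor calls update(), this sketch also makes
MiniSet([aSetInstance]) behave like .add(aSetInstance), which is one possible
answer to the consistency question above.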

Raymond Hettinger

From  Wed Aug 21 00:27:53 2002
From: (Paul Svensson)
Date: Tue, 20 Aug 2002 19:27:53 -0400 (EDT)
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

On Tue, 20 Aug 2002, Guido van Rossum wrote:

>> From: "Guido van Rossum" <>
>> > I am still perplexed that I received *no* feedback on the sets module
>> > except on this issue of sort order (which I consider solved by adding
>> > a method _repr() that takes an optional 'sorted' argument).
>> I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet.  It should be replaced by code raising a
>> TypeError just like we do for basestring:
>> >>> basestring('abc')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>> TypeError: The basestring type cannot be instantiated
>Good idea.  I checked this in, raising NotImplementedError.

Is there any particular reason BaseSet and basestring need to raise different
exceptions on an attempt at instantiation ?


From  Wed Aug 21 00:44:52 2002
From: (Tim Peters)
Date: Tue, 20 Aug 2002 19:44:52 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

[Paul Prescod]
> Some perhaps relevant links (with no off-topic discussion):
>  *

Damn -- wish I'd read that before.  Among other things, Eric found a good
use for Judy arrays <wink>.

>  *

Knew about that.  Good stuff.


Seems confused, assuming Graham's approach is a minor variant of ifile's.
But Graham's computation is to classic Bayesian classifiers (like ifile) as
Python's lambda is to Lisp's <0.7 wink>.  Heart of the confusion:

    Integrating the whole set of statistics together requires adding up
    statistics for _all_ the words found in a message, not just the
    words "sex" and "sexy."

The rub is that Graham doesn't try to add up the statistics for all the
words found in a msg.  To the contrary, it ends up ignoring almost all of
the words.  In particular, if the database indicates that "sex" and "sexy"
aren't good spam-vs-non-spam discriminators, Graham's approach ignores them
completely (their presence or absence doesn't affect the final outcome at
all -- it's like the words don't exist; this isn't what ifile does, and
ifile probably couldn't get away with this because it's trying to do N-way
classification instead of strictly 2-way -- someone who understands the math
and reads Graham's article carefully will likely have a hard time figuring
out what Bayes has to do with it at all!  I sure did.).
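
A rough sketch of that selection rule as I read Graham's article: keep only
the tokens whose spam probability is farthest from 0.5, then combine those.
The function name, the default of 0.4 for unseen tokens, and the sample
probabilities are assumptions for illustration, not the code I checked in:

```python
from math import prod


def graham_score(token_probs, tokens, max_discriminators=15, unknown=0.4):
    """Combine only the most extreme per-token spam probabilities.

    token_probs maps token -> P(spam | token); values are assumed to lie
    strictly between 0 and 1 so the denominator below is never zero.
    """
    probs = [token_probs.get(tok, unknown) for tok in set(tokens)]
    # Most "interesting" first: largest distance from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_discriminators]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

Tokens sitting at 0.5 sort last and fall off the end of the list whenever
enough stronger discriminators are present, which is how "sex" and "sexy" can
end up having no effect on the score at all.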

> """My finding is that it is _nowhere_ near sufficient to have two
> populations, "spam" versus "not spam."

In ifile I believe that.  But the data will speak for itself soon enough, so
I'm not going to argue about this.

From  Wed Aug 21 00:56:53 2002
From: (Aahz)
Date: Tue, 20 Aug 2002 19:56:53 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <011001c248a0$99575580$1bf8a4d8@othello>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <009801c24890$565b26e0$1bf8a4d8@othello> <> <011001c248a0$99575580$1bf8a4d8@othello>
Message-ID: <>

On Tue, Aug 20, 2002, Raymond Hettinger wrote:
> 1. Rename .remove() to __del__().  Its usage is inconsistent with
> list.remove(element) which can leave other instances of element in the
> list.  It is more consistent with 'del adict[element]'.

You mean __delitem__, I think.  __del__ is only for deleting the object
itself when its refcount goes to zero.
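
A small illustration of the difference; Bag is a made-up class for this
example only:

```python
class Bag:
    """Toy container showing which hook 'del obj[key]' calls."""

    def __init__(self, items):
        self._data = dict.fromkeys(items, True)

    def __contains__(self, element):
        return element in self._data

    def __delitem__(self, element):
        # Invoked by 'del bag[element]'; raises KeyError if absent.
        del self._data[element]


bag = Bag("abc")
del bag["b"]  # calls Bag.__delitem__, not __del__
```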

> 3.  Should we add .as_temporarily_immutable to dictionaries and lists
> so that they will also be potential elements of a set?

There's been some talk in the past about creating lockable dicts and
lists (emphasis on dicts because lists have tuple-equivalence).
Aahz (           <*>

Project Vote Smart:

From  Wed Aug 21 00:57:04 2002
From: (Tim Peters)
Date: Tue, 20 Aug 2002 19:57:04 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> I'm in the process of speed-tuning this now.  I intend for it to be
> blazingly fast, usable for sites that process 100K mails a day, and I
> think I know how to do that.  This is not a natural application for
> Python :-).

I'm not sure about that.  The all-Python version I checked in added 20,000
Python-Dev messages to the database in 2 wall-clock minutes.  The time for
computing the statistics, and for scoring, is simply trivial (this wouldn't
be true of a "normal" Bayesian classifier (NBC), but Graham skips most of
the work an NBC does, in particular favoring fast classification time over
fast model-update time).

What we anticipate is that the vast bulk of the time will end up getting
spent on better tokenization, such as decoding base64 portions, and giving
special care to header fields and URLs.  I also *suspect* (based on a
previous life in speech recognition) that experiments will show that a
mixture of character n-grams and word bigrams is significantly more
effective than a "naive" tokenizer that just looks for US ASCII alphanumeric
runs.

>> """My finding is that it is _nowhere_ near sufficient to have two
>> populations, "spam" versus "not spam."

> Well, except it seems to work quite well.  The Nigerian trigger-word
> population is distinct from the penis-enlargement population, but they
> both show up under Bayesian analysis.

In fact, I'm going to say "Nigerian" and "penis enlargement" one more time
each here, just to demonstrate that *this* message won't be a false positive
when the smoke settles <wink>.  Human Growth Hormone too, while I'm at it.

From  Wed Aug 21 00:59:33 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 19:59:33 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> >  *
> Damn -- wish I'd read that before.  Among other things, Eric found a good
> use for Judy arrays <wink>.

It's a freaking *ideal* use for Judy arrays.  Platonically perfect.  They
couldn't fit better if they'd been designed for this application.  Bogofilter
was actually born in the moment that I realized this.
		Eric S. Raymond

From  Wed Aug 21 01:02:05 2002
From: (Raymond Hettinger)
Date: Tue, 20 Aug 2002 20:02:05 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <009801c24890$565b26e0$1bf8a4d8@othello> <> <011001c248a0$99575580$1bf8a4d8@othello> <>
Message-ID: <016e01c248a5$fca0de40$1bf8a4d8@othello>

From: "Aahz" <>
> > 1. Rename .remove() to __del__().  Its usage is inconsistent with
> > list.remove(element) which can leave other instances of element in the
> > list.  It is more consistent with 'del adict[element]'.
> You mean __delitem__, I think.  __del__ is only for deleting the object
> itself when its refcount goes to zero.


From  Wed Aug 21 01:09:57 2002
From: (Tim Peters)
Date: Tue, 20 Aug 2002 20:09:57 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> It's a freaking *ideal* use for Judy arrays.  Platonically perfect.  They
> couldn't fit better if they'd been designed for this application.
> Bogofilter was actually born in the moment that I realized this.

I believe that so long as it stays in memory.  But, as you mention in your

    startup is too slow for sites handling thousands of mails an hour

That likely makes a Zope OOBTree stored under ZODB a better choice still, as
that's designed for efficient update and access in a persistent database
(the version of this we've got now does update during scoring, to keep track
of when tokens were last used, and how often they've proved useful in
discriminating -- there needs to be a way to expire tokens over time, else
the database will grow without bound).

I've corresponded with Douglas Baskins about "this kind of thing", and he's
keen to address it (along with every other problem in the world <0.9 wink>);
it would help if HP weren't laying off the people who have worked on this

From  Wed Aug 21 01:37:49 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 20:37:49 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 18:39:56 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <>
Message-ID: <>

> > I am still perplexed that I received *no* feedback on the sets
> > module
> As I previously said, I feel comfortable with what I read and saw.
> I'd probably have to use sets for having more circumstantiated
> comments.

Fair enough.

> Unless you offer the source on and ask for more users' opinions?

Last time I tried that it turned out a bad idea.  I prefer feedback
over a flame war.

> Maybe some people would have preferred to see more usual notation,
> like `+' for union and `*' for intersection, rather than `or' and
> `and'?  There are tiny pros and cons in each direction.  For one,
> I'll gladly use what is available, I'm not really going to crusade
> for either notation...

Um, the notation is '|' and '&', not 'or' and 'and', and those are
what I learned in school.  Seems pretty conventional to me (Greg
Wilson actually tried this out on unsuspecting newbies and found that
while '+' worked okay, '*' did not -- read the PEP).

But yes, this is decent feedback (with good enough arguments, Greg's
conclusion might even be overturned).

> Should there be special provisions for Sets to interoperate
> magically with lists or iterators?  Lists and iterators could be
> considered as ordered sets with duplicates allowed.  Even if it
> could be tinily useful, it is surely not difficult to explicitly
> "cast" lists and iterators using the `Set' constructor.  It is
> already easy to build an iterator or a list out of a set.

You can do an in-place union of a Set and a sequence or iterable with
set.update(seq).  If you want intersection or a difference, or your
set is immutable, you'd have to cast the sequence to a set.  What's
the use case?
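
For example (shown here with the built-in set type of later Python releases
as a stand-in for the Set class under discussion, which is an assumption of
this sketch):

```python
s = set([1, 2, 3])

# In-place union accepts any sequence or iterable directly.
s.update([3, 4])

# Intersection via the operator needs a set on both sides,
# so the sequence is cast explicitly.
common = s & set([2, 4, 6])
```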

Which brings me to another open issue.

set.update(seq) and set.add(element) have a provision to transform the
inserted element(s) to an ImmutableSet if needed.  Should the
constructor do the same?

> Criticism?  OK!  What about supporting infinite sets? :-) Anything else?
> Hmph!  The module doc-string has the word "actually" with three `l'! :-)

Not any more, thanks to Raymond Hettinger. :-)

--Guido van Rossum (home page:

From  Wed Aug 21 01:46:23 2002
From: (Greg Ewing)
Date: Wed, 21 Aug 2002 12:46:23 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

> Um, the notation is '|' and '&', not 'or' and 'and', and those are
> what I learned in school.

Really? The notation I learned in school was big-rounded-U
for union and big-upside-down-rounded-U for intersection.
Not available in the ASCII character set, unfortunately.

But I agree that | and & are fairly intuitive substitutes
for these, and they agree with what you use for bit twiddling.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Wed Aug 21 01:52:52 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 20:52:52 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

(Copied to Paul Graham.  Paul, this is the mailing list of the Python
maintainers.  I thought you'd find the bits about lexical analysis in
bogofilter interesting.  Pythonistas, Paul is one of the smartest and
sanest people in the LISP community, as evidenced partly by the fact
that he hasn't been too proud to learn some lessons from Python :-).
It would be a good thing for some bridges to be built here.)

Tim Peters <>:
> What we anticipate is that the vast bulk of the time will end up getting
> spent on better tokenization, such as decoding base64 portions, and giving
> special care to header fields and URLs.

This is one of bogofilter's strengths.  It already does this stuff at
the lexical level using a speed-tuned flex scanner (I spent a lot of
the development time watching token strings go by and tweaking the
scanner rules to throw out cruft).  

In fact, look at this.  It's a set of lex rules with interpolated comments:

BASE64		[A-Za-z0-9/+]
IPADDR		[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
MIME_BOUNDARY	^--[^[:blank:]\n]*$


# Recognize and discard From_ headers
^From\ 						{return(FROM);}

# Recognize and discard headers that contain dates and tumblers.
^Date:.*|Delivery-Date:.*			;
^Message-ID:.*					;

# Throw away BASE64 enclosures.  No point in using this as a discriminator;
# spam that has these in it always has other triggers in the headers.
# Therefore don't pay the overhead to decode it.
^{BASE64}+$					;

# Throw away tumblers generated by MTAs
^\tid\ .*					;
SMTP\ id\ .*					;

# Toss various meaningless MIME cruft
boundary=.*					;
name=\"						;
filename=\"					;

# Keep IP addresses
{IPADDR}					{return(TOKEN);}

# Keep wordlike tokens of length at least three
[a-z$][a-z0-9$'.-]+[a-z0-9$]			{return(TOKEN);}

# Everything else, including all one-and-two-character tokens,
# gets tossed.
.						;

This small set of rules does a remarkably effective job of tossing
out everything that isn't a potential recognition feature.  Watching
the filtered token stream from a large spam corpus go by is actually
quite an enlightening experience.
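
For the Python readers, here is a rough regex approximation of just the last
two "keep" rules (IP addresses and wordlike tokens of length at least three).
It is only a sketch: it ignores the header and MIME rules, the lowercasing
step is my assumption, and it is not how bogofilter itself is implemented:

```python
import re

TOKEN_RE = re.compile(
    r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"      # IPADDR: keep dotted quads whole
    r"|[a-z$][a-z0-9$'.-]+[a-z0-9$]"       # wordlike, at least 3 characters
)


def tokenize(text):
    """Return the kept tokens; everything else is silently dropped."""
    return TOKEN_RE.findall(text.lower())
```

Anything the pattern does not match, including all one- and two-character
tokens, simply never appears in the result.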

It does a better job than Paul's original in one important respect; IP
addresses and hostnames are preserved whole for use as recognition
features.  I think I know why Paul didn't do this -- he's not a C
coder, and if lex/flex isn't part of one's toolkit one's brain doesn't
tend to wander down design paths that involve elaborate lexical
analysis, because it's just too hard.

This is actually the first new program I've coded in C (rather than
Python) in a good four years or so.  There was a reason for this;
I have painful experience with doing lexical analysis in Python that
tells me flex-generated C will be a major performance win here.  The
combination of flex and Judy made it a no-brainer.

>                                        I also *suspect* (based on a
> previous life in speech recognition) that experiments will show that a
> mixture of character n-grams and word bigrams is significantly more
> effective than a "naive" tokenizer that just looks for US ASCII alphanumeric
> runs.

I share part of your suspicions -- I'm thinking about going to bigram
analysis for header lines.  But I'm working on getting the framework
up first.  Feature extraction is orthogonal to both the Bayesian
analysis and (mostly) to the data-storage method, and can be a drop-in
change if the framework is done right.

> In fact, I'm going to say "Nigerian" and "penis enlargement" one more time
> each here, just to demonstrate that *this* message won't be a false positive
> when the smoke settles <wink>.  Human Growth Hormone too, while I'm at it.

I think I know why, too.  It's the top-15 selection -- the highest-variance
words don't blur into non-spam English the way statistics on *all* tokens
would.  It's like an edge filter.
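
A rough sketch of that selection step, in the style of the combining rule from Paul Graham's "A Plan for Spam".  The function name is invented, and the per-token probabilities are assumed to be clamped strictly between 0 and 1 already.

```python
from math import prod

def spam_probability(token_probs, n=15):
    """Combine the n token probabilities farthest from the neutral 0.5."""
    # Assumes every probability lies strictly inside (0, 1).
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(extreme)
    ham = prod(1.0 - p for p in extreme)
    return spam / (spam + ham)

# One strong indicator is not diluted by a thousand neutral tokens.
score = spam_probability([0.99] + [0.5] * 1000)
```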
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 01:56:05 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 20:56:05 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> Um, the notation is '|' and '&', not 'or' and 'and', and those are
> what I learned in school.  Seems pretty conventional to me (Greg
> Wilson actually tried this out on unsuspecting newbies and found that
> while '+' worked okay, '*' did not -- read the PEP).

+1 on preferring | and & to `or' and `and'.  To me, `or' and `and' say
that what's being composed are predicates, not sets.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 02:00:48 2002
From: (Greg Ewing)
Date: Wed, 21 Aug 2002 13:00:48 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

"Eric S. Raymond" <>:

> To me, `or' and `and' say
> that what's being composed are predicates, not sets.

Besides which, they can't be overridden in Python anyway.
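
A small illustration of the difference, using a toy class whose name and design are purely illustrative: `|` and `&` dispatch to methods a class can define, while `or` and `and` only consult truthiness and hand back one of their operands.

```python
class Tags:
    """Toy set-like class, for illustration only."""
    def __init__(self, items=()):
        self.items = frozenset(items)
    def __or__(self, other):       # invoked by  a | b
        return Tags(self.items | other.items)
    def __and__(self, other):      # invoked by  a & b
        return Tags(self.items & other.items)
    def __bool__(self):            # the only hook 'and'/'or' consult
        return bool(self.items)

a, b = Tags("xy"), Tags("yz")
union = a | b
common = a & b
picked = a or b   # no set arithmetic: evaluates to a itself, since a is truthy
```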

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Wed Aug 21 02:05:17 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 21:05:17 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <>
Message-ID: <>

Greg Ewing <>:
> > Um, the notation is '|' and '&', not 'or' and 'and', and those are
> > what I learned in school.
> Really? The notation I learned in school was big-rounded-U
> for union and big-upside-down-rounded-U for intersection.
> Not available in the ASCII character set, unfortunately.

For historical reasons, there are three different notations for Boolean
algebra in common use.  You're describing the one derived from set theory.
I personally favor the one derived from lattice algebra; the distinctive
feature of that one is the pointy and/or operators that look like /\ and
\/.  The third uses | and &.

The set-theoretic notation is the oldest.  I think Birkhoff & MacLane
invented the lattice-theory notation in the 1940s.  It is probably
*slightly* more popular than the set-theoretic notation.  The | & one
is distinctly less common than either, at least among mathematicians;
I think EEs and suchlike may use it more than we do.

> But I agree that | and & are fairly intuitive substitutes
> for these, and they agree with what you use for bit twiddling.

Not an insignificant point.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 02:14:29 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 21:14:29 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> [Eric S. Raymond]
> > It's a freaking *ideal* use for Judy arrays.  Platonically perfect.  They
> > couldn't fit better if they'd been designed for this application.
> > Bogofilter was actually born in the moment that I realized this.
> I believe that so long as it stays in memory. 

VM, dude, VM is your friend.  I thought this through carefully.  The
size of bogofilter's working set isn't limited by core.  And because
it's a B-tree variant, the access frequency will be proportional to
log2 of the wordlist size and the patterns will be spatially bursty.
This is a memory access pattern that plays nice with an LRU pager.

> But, as you mention in your manpage
>     startup is too slow for sites handling thousands of mails an hour
> That likely makes a Zope OOBTree stored under ZODB a better choice still, as
> that's designed for efficient update and access in a persistent database

I'm working on a simpler solution, one which might have a Pythonic spinoff.
Stay tuned.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 02:46:08 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 21:46:08 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 19:27:53 EDT."
References: <>
Message-ID: <>

> Is there any particular reason BaseSet and basestring need to raise
> different exceptions on an attempt at instantiation ?

Hm, I dunno.  NotImplementedError was intended for this kind of use,
but TypeError also matches.  I'll add an XXX for this.

--Guido van Rossum (home page:

From  Wed Aug 21 03:05:03 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 22:05:03 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 19:23:30 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <009801c24890$565b26e0$1bf8a4d8@othello> <>
Message-ID: <>

> 1. Rename .remove() to __del__().  Its usage is inconsistent with
> list.remove(element) which can leave other instances of element in
> the list.  It is more consistent with 'del adict[element]'.

(You mean __delitem__.)

-1.  Sets don't support the x[y] notation for accessing or setting
elements, so it would be weird to use that for deleting.  You're not
deleting the value corresponding to the key or index y (like you are
when using del x[y] on a list or dict), you're deleting y itself.
That's more like x.remove(y) for lists.
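
The distinction can be seen with the built-in dict and set types of later Python versions (the built-in set postdates this thread, so treat this as an illustration rather than the sets module itself):

```python
prices = {"apple": 3, "pear": 5}
del prices["apple"]        # deletes the value stored under the key "apple"

fruit = {"apple", "pear"}
fruit.remove("apple")      # deletes the element "apple" itself

try:
    del fruit["pear"]      # sets have no x[y] access, so this is rejected
except TypeError as exc:
    rejected = str(exc)
```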

> 2.  discard() looks like a useful standard API.  Perhaps it should
> be added to the dictionary API. 

Perhaps.  But the dict API already has so many things.  And why not to
the list API?  I'm -0 on this.
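
For comparison, a sketch of how the two spellings behave on the built-in set of later Python versions, alongside the nearest dict idiom (pop with a default).  This is illustrative only and says nothing about what the dict API should grow.

```python
s = {1, 2, 3}
s.discard(99)            # absent element: silently ignored

try:
    s.remove(99)         # absent element: raises KeyError
    raised = False
except KeyError:
    raised = True

d = {"a": 1}
d.pop("missing", None)   # pop with a default never raises
```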

> 3.  Should we add .as_temporarily_immutable  to dictionaries and
> lists so that they will also be potential elements of a set?

Hm, I think this is premature.  I'd like to see a use case for a set
of dicts and a set of lists first.

Then you can code list and dict subclasses in Python that implement
_as_temporarily_immutable (and _as_immutable too, I suppose).

Then we'll see how often this ends up being used.

For now, I'd say YAGNI.  (I've said YAGNI to sets for years, but the
realization that as a result lots of people independently invented
using the keys of a dict to represent a set made me change my mind.
Sort of like how I changed my mind on a Boolean type, too, after 12
years of thinking it wasn't needed. :-)

> 4. remove(), update(), add(), and __contains__() all work hard to
> check for .as_temporarily_immutable().  Should this be propagated to
> other methods that add set members (i.e. replace all instances of
> data[element] = value with self.add(element) or use self.update() in
> the code for __init__())?

I've been thinking the same thing.  I think that the only case where
this could apply is __init__(), by the way.

> The answer is tough because it causes an enormous slowdown in the
> common use cases of uniquifying a sequence.  OTOH, why check in some
> places but not others -- why is .add(aSetInstance) okay but not
> Set([aSetInstance]).

Really?  Why the slowdown?

I was thinking of simply changing __init__ into

    if seq is not None:
        self.update(seq)

If that's too slow, perhaps update() could be changed to the
following:

    it = iter(seq)
    try:
        for elt in it:
            data[elt] = value
    except TypeError:
        transform = getattr(elt, '_as_immutable', None)
        if transform is None:
            raise
        data[transform()] = value
        self.update(it)

That is, if there are no elements that require transformation, the
added cost is a single try/except setup (plus an extra call to
iter()).  If any element requires transformation, the rest of the
elements are dealt with as fast as update() can.

Hm, maybe this could be applied to update() too (except it shouldn't
call itself recursively but simply write the loop out a second time,
with a try/except around each element).
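
A self-contained sketch of that non-recursive variant follows.  MiniSet and Freezable are hypothetical stand-ins written for this illustration (not the sets module), and frozenset is used as the immutable form.

```python
class Freezable:
    """Hypothetical unhashable element that offers a hashable stand-in."""
    __hash__ = None                      # instances cannot be dict keys
    def __init__(self, items):
        self.items = list(items)
    def _as_immutable(self):
        return frozenset(self.items)

class MiniSet:
    def __init__(self, seq=None):
        self._data = {}
        if seq is not None:
            self.update(seq)

    def update(self, seq):
        data, value = self._data, True
        it = iter(seq)
        elt = None                       # defined even if iteration itself fails
        try:
            for elt in it:               # fast path: a single try/except setup
                data[elt] = value
            return
        except TypeError:
            transform = getattr(elt, '_as_immutable', None)
            if transform is None:
                raise
            data[transform()] = value
        for elt in it:                   # loop written out a second time
            try:
                data[elt] = value
            except TypeError:
                transform = getattr(elt, '_as_immutable', None)
                if transform is None:
                    raise
                data[transform()] = value

s = MiniSet([1, Freezable([2, 3]), 4, Freezable([5])])
```

When every element is hashable, the cost over a plain loop is the one try/except setup and the explicit iter() call.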

> If the answer is yes, then the code for update() should be
> super-optimized by moving the try/except outside the for-loop
> and wrapping the whole thing in a while 1.

That's similar to the idea I just sketched.  Can you email me a proposed
patch?  (Let's skip SF for this.)

> Also, we could bypass the slower .add() method when incoming source
> of elements is known to be an instance of BaseSet.

Huh?  Nobody calls add() internally AFAIK.

> 5. Add a quick pre-check to issubset() and issuperset() along the
> lines of:
>         def issubset(self, other):
>             """Report whether another set contains this set."""
>             self._binary_sanity_check(other)
>             if len(self) > len(other): return False   # Fast check for the obvious case
>             for elt in self:
>                 if elt not in other:
>                     return False
>             return True

Sure.  Check it in.
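
Written as a free function over any sized containers, the pre-check looks like this.  It is a sketch only: the function name mirrors the method, and the sanity check on the argument type is left out.

```python
def issubset(small, big):
    """Report whether every element of small is also in big."""
    if len(small) > len(big):      # fast check for the obvious case
        return False
    for elt in small:
        if elt not in big:
            return False
    return True
```

The length test is safe because a set with more elements than another cannot be contained in it.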

> 6.  For clarity and foolish consistency, replace all occurrences of
> 'elt' with 'element'.

Hm, no.  'element' for a loop control variable seems too long (I'd be
happy with 'x' but Greg Wilson used 'element').  However I like
'element' as the argument name because it can be used as a keyword
argument and then it's better spelled out in full.  I think Greg
Wilson used 'item' most of the time; I prefer to be consistent and say
'element' all the time since that's the accepted set terminology.

--Guido van Rossum (home page:

From  Wed Aug 21 03:25:17 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 22:25:17 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 19:17:38 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <>
Message-ID: <>

> It should have powerset and cartesian-product methods.  Shall I code
> them?

The Cartesian product of two sets is easy:

def product(s1, s2):
    cp = Set()
    for x in s1:
        for y in s2:
            cp.add((x, y))
    return cp

But I'm not sure if this is useful enough to add to the set class --
it seems a great way to waste a lot of memory, and typical uses are
probably better served by writing out the for-loops.  Perhaps this
could be coded as a generator though.

More fun would be the cartesian product of N sets; I guess it would
have to be recursive.  Here's a generator version:

def product(s, *sets):
    if not sets:
        for x in s:
            yield (x,)
    else:
        subproduct = list(product(*sets))
        for x in s:
            for t in subproduct:
                yield (x,) + t

Note that this doesn't require sets; its arguments can be arbitrary
iterables.  So maybe this belongs in a module of iterator utilities
rather than in the sets module.  API choice: product() with a single
argument yields a series of 1-tuples.  That's slightly awkward, but
works better for the recursion.  And specifically asking for the
Cartesian product of a single set is kind of pointless.
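
For a quick check of the recursive generator, the sketch below restates it in runnable form and compares it with itertools.product, which later Python versions ship in exactly the kind of iterator-utilities module suggested here.

```python
import itertools

def product(s, *sets):
    """Recursive generator yielding tuples, one item from each iterable."""
    if not sets:
        for x in s:
            yield (x,)
    else:
        subproduct = list(product(*sets))
        for x in s:
            for t in subproduct:
                yield (x,) + t

pairs = list(product([1, 2], "ab"))
same = list(product([1, 2], "ab", [True])) == list(
    itertools.product([1, 2], "ab", [True]))
```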

If we *do* add this to the Set class, it could be aliased to __mul__
(giving another reason why using * for set intersection is a bad
idea).

Here's a naive powerset implementation returning a set:

def power(s):
    # Seed with the empty set; starting from Set() would leave ps empty.
    ps = Set([ImmutableSet([])])
    for elt in s:
        s1 = Set([elt])
        ps1 = Set()
        for ss in ps:
            ps1.add(ss | s1)
        ps |= ps1
    return ps

This is even more of a memory hog; however the algorithm is slightly
more subtle so it's perhaps more valuable to have this in the library.
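
The same build-up can be sketched with the built-in set and frozenset types of later Python versions, which are an assumption here since they postdate this thread.  Each pass doubles the collection by adding the new element to every subset found so far.

```python
def power(s):
    """Return the set of all subsets of s, each as a frozenset."""
    ps = {frozenset()}                       # start with just the empty set
    for elt in s:
        ps |= {ss | {elt} for ss in ps}      # every old subset, plus elt
    return ps

subsets = power({1, 2})
```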

Here's a generator version:

def power(s):
    if len(s) == 0:
        yield Set()
    else:
        # Non-destructively choose a random element:
        x = Set([iter(s).next()])
        for ss in power(s - x):
            yield ss
            yield ss | x

I'm not totally happy with this -- it recurses for each element in s,
creating a new set at each level that is s minus one element.  I'd
prefer to build the set up from the other end, like the first
version.

IOW I'd love to see your version. :-)

The first power() example raises a point about the set API: the Set()
constructor can be called without an iterable argument, but
ImmutableSet() cannot.  Maybe ImmutableSet() should be allowed too?
It creates an immutable empty set.  (Hm, this could be a singleton.
__new__ could take care of that.)

While comparing the various versions of power(), I also ran into an
interesting bug in the code.  While Set([1]) == ImmutableSet([1]),
Set([Set([1])]) != Set([ImmutableSet([1])]).  I have to investigate
this.

--Guido van Rossum (home page:

From  Wed Aug 21 03:44:31 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 22:44:31 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <009801c24890$565b26e0$1bf8a4d8@othello> <> <011001c248a0$99575580$1bf8a4d8@othello> <>
Message-ID: <>

Guido van Rossum <>:
> Hm, no.  'element' for a loop control variable seems too long (I'd be
> happy with 'x' but Greg Wilson used 'element').  However I like
> 'element' as the argument name because it can be used as a keyword
> argument and then it's better spelled out in full.  I think Greg
> Wilson used 'item' most of the time; I prefer to be consistent and say
> 'element' all the time since that's the accepted set terminology.

Briefly reverting to type as a logician, Eric applauds.  Sometimes I
tell you not to sweat what my ex-colleagues will think, but this
is a case in which using mathematically-correct terminology will *not*
obscure the difference between stateless/mathematical reasoning and
stateful/programming reasoning, and is therefore a good idea.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 03:57:25 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 22:57:25 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> Here's a generator version:
> def power(s):
>     if len(s) == 0:
>         yield Set()
>     else:
>         # Non-destructively choose a random element:
>         x = Set([iter(s).next()])
>         for ss in power(s - x):
>             yield ss
>             yield ss | x
> I'm not totally happy with this -- it recurses for each element in s,
> creating a new set at each level that is s minus one element.  I'd
> prefer to build the set up from the other end, like the first
> version.

You're right, that is an ugly and opaque piece of code.  Guido me lad,
you have been led down the garden path by a dubious lust for recursive
elegance.  One might almost think you were a LISP hacker or something.

> IOW I'd love to see your version. :-)

Here's the pre-generator version I wrote using lists as the underlying
representation.  Should be trivially transformable into a generator
version.  I'd do it myself but I'm heads-down on bogofilter just now

def powerset(base):
    "Compute the set of all subsets of a set."
    powerset = []
    for n in xrange(2 ** len(base)):
        subset = []
        for e in xrange(len(base)):
            if n & 2 ** e:
                subset.append(base[e])
        powerset.append(subset)
    return powerset

Are you slapping your forehead yet? :-)
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 04:06:21 2002
From: (Skip Montanaro)
Date: Tue, 20 Aug 2002 22:06:21 -0500
Subject: [Python-Dev] Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <15715.941.27029.778363@gargle.gargle.HOWL>

    Eric> This is one of bogofilter's strengths.  It already does this stuff
    Eric> at the lexical level using a speed-tuned flex scanner (I spent a
    Eric> lot of the development time watching token strings go by and
    Eric> tweaking the scanner rules to throw out cruft).

This reminds me of something which tickled my interesting bone the other
day.  The SpamAssassin folks are starting to look at Flex for much faster
regular expression matching in situations where large numbers of static re's
must be matched.  I wonder if using something like SciPy's weave tool would
make it (relatively) painless to incorporate fairly high-speed scanners into
Python programs.  It seems like it would just be an extra layer of
compilation for something like weave.  Instead of inserting C code into a
string, wrapping it with module sticky stuff and compiling it, you'd insert
Flex rules into the string, call a slightly higher level function which
calls flex to generate the scanner code and use a slightly different bit of
module sticky stuff to make it callable from Python.


From  Wed Aug 21 04:20:18 2002
From: (Eric S. Raymond)
Date: Tue, 20 Aug 2002 23:20:18 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <15715.941.27029.778363@gargle.gargle.HOWL>
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL>
Message-ID: <>

Skip Montanaro <>:
> This reminds me of something which tickled my interesting bone the other
> day.  The SpamAssassin folks are starting to look at Flex for much faster
> regular expression matching in situations where large numbers of static re's
> must be matched.

*snort* Took'em long enough.  No, I shouldn't be snarky.  Flex is only
obvious to Unix old-timers -- the traditions that gave rise to it have
fallen into desuetude in the last decade.

> ...insert Flex rules into the string, call a slightly higher level
> function which calls flex to generate the scanner code and use a
> slightly different bit of module sticky stuff to make it callable
> from Python.

Lexers are painful in Python.  They hit the language in a weak spot
created by the immutability of strings.  I've found this an obstacle
more than once, but then I'm a battle-scarred old compiler jock who
attacks *everything* with lexers and parsers.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 04:19:54 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 23:19:54 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 21:46:08 EDT."
References: <>
Message-ID: <>

> > Is there any particular reason BaseSet and basestring need to raise
> > different exceptions on an attempt at instantiation ?
> Hm, I dunno.  NotImplementedError was intended for this kind of use,
> but TypeError also matches.  I'll add an XXX for this.

I found a good reason why it should be TypeError, so TypeError it is.

--Guido van Rossum (home page:

From  Wed Aug 21 04:25:52 2002
From: (Skip Montanaro)
Date: Tue, 20 Aug 2002 22:25:52 -0500
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <15715.2112.952920.736993@gargle.gargle.HOWL>

    >> ...insert Flex rules into the string, call a slightly higher level
    >> function which calls flex to generate the scanner code and use a
    >> slightly different bit of module sticky stuff to make it callable
    >> from Python.

    Eric> Lexers are painful in Python.  They hit the language in a weak
    Eric> spot created by the immutability of strings.

Yeah, that's why you inline what is essentially a .l file into your Python
code. ;-)

I'm actually here in Austin for a couple days visiting Eric Jones and the
SciPy gang.  Perhaps Eric and I can bat something out over lunch tomorrow...


From  Wed Aug 21 03:59:46 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 20 Aug 2002 22:59:46 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU>
Message-ID: <>

[Guido van Rossum]

> Um, the notation is '|' and '&', not 'or' and 'and', and those are
> what I learned in school.  Seems pretty conventional to me (Greg
> Wilson actually tried this out on unsuspecting newbies and found that
> while '+' worked okay, '*' did not -- read the PEP).

The very usual notation for me has been the big `U' for union and the same,
upside-down, for intersection, but even now that Python supports Unicode,
these are not Python operators _yet_. :-)

I never saw `|' nor `&' in literature, except `|' which means "such that"
in set comprehensions, as Pythoneers would be tempted to say!  On the
other hand, for programmers, `|' and `&' are rather natural and easy.

Eric has offered the idea of adding Cartesian product, and although the
usual notation is a tall thin `X', maybe it would be nice reserving `*'
for that?  It might not be explicit enough, and besides, there are other
circumstances in Algebra, not so far from sets, when one might need many
multiplicative operators, so `*' would easily get over-used.

François Pinard

From  Wed Aug 21 04:47:11 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 23:47:11 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 22:57:25 EDT."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <>
Message-ID: <>

> Here's the pre-generator version I wrote using lists as the underlying
> representation.  Should be trivially transformable into a generator
> version.  I'd do it myself but I'm heads-down on bogofilter just now
> def powerset(base):
>     "Compute the set of all subsets of a set."
>     powerset = []
>     for n in xrange(2 ** len(base)):
>         subset = []
>         for e in xrange(len(base)):
>              if n & 2 ** e:
>                 subset.append(base[e])
>         powerset.append(subset)
>     return powerset
> Are you slapping your forehead yet? :-)

Yes!  I didn't actually know that algorithm.

Here's the generator version for sets (still requires a real set as
input):

def powerset(base):
    size = len(base)
    for n in xrange(2**size):
        subset = []
        for e, x in enumerate(base):
            if n & 2**e:
                subset.append(x)
        yield Set(subset)

I would like to write n & (1<<e) instead of n & 2**e; but 1<<e drops
bits when e is > 31.  Now, for a set with that many elements, there's
no hope that this will ever complete in finite time, but does that
mean it shouldn't start?  I could write 1L<<e and avoid the issue, but
then I'd be paying for long ops that I'll only ever need in a case
that's only of theoretical importance.
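
In Python 3, where int and long are one unbounded type, the shift form no longer drops bits.  A minimal sketch under that assumption (the list comprehension is just a compact spelling of the inner loop):

```python
def powerset(base):
    """Yield each subset of base as a list, one per bitmask n."""
    items = list(base)
    for n in range(1 << len(items)):
        # Bit e of n decides whether items[e] is in this subset.
        yield [x for e, x in enumerate(items) if n & (1 << e)]

big_bit = 1 << 40    # no bits dropped: Python 3 ints grow as needed
subsets = list(powerset("ab"))
```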

A variation: rather than calling enumerate(base) 2**size times,
concretize it into a list.  We know it can't be very big or else
the result isn't representable:

def powerset(base):
    size = len(base)
    pairs = list(enumerate(base))
    for n in xrange(2**size):
        subset = []
        for e, x in pairs:
            if n & 2**e:
                subset.append(x)
        yield Set(subset)

Ah, and now it's trivial to write this so that base can be an
arbitrary iterable again, rather than a set (or sequence):

def powerset(base):
    pairs = list(enumerate(base))
    size = len(pairs)
    for n in xrange(2**size):
        subset = []
        for e, x in pairs:
            if n & 2**e:
                subset.append(x)
        yield subset

This is a generator that yields a series of lists whose values are the
items of base.  And again, like cartesian product, it's now more a
generator thing than a set thing.
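
As a point of comparison, and not the bitmask algorithm above, the "generator thing" can also be spelled with itertools in later Python versions.  This sketch follows the well-known combinations-based recipe; the function name is chosen here for illustration, and subsets come out grouped by size rather than in bitmask order.

```python
from itertools import chain, combinations

def powerset_by_size(iterable):
    """Lazily yield every subset of iterable as a tuple, smallest first."""
    items = list(iterable)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))

subsets = list(powerset_by_size("ab"))
```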

BTW, the correctness of all my versions trivially derives from the
correctness of your version -- each is a very simple transformation of
the previous one.  My mentor Lambert Meertens calls this process
Algorithmics (and has developed a mathematical notation and theory for
program transformations).


--Guido van Rossum (home page:

From  Wed Aug 21 04:49:21 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 23:49:21 -0400
Subject: [Python-Dev] Automatic flex interface for Python?
In-Reply-To: Your message of "Tue, 20 Aug 2002 22:06:21 CDT."
References: <> <> <>
Message-ID: <>

> I wonder if using something like SciPy's weave tool would make it
> (relatively) painless to incorporate fairly high-speed scanners into
> Python programs.

I haven't given up on the re module for fast scanners (see Tim's note
on the speed of tokenizing 20,000 messages in mere minutes).  Note
that the Bayes approach doesn't *need* a trick to apply many regexes
in parallel to the text.

--Guido van Rossum (home page:

From  Wed Aug 21 04:57:51 2002
From: (Guido van Rossum)
Date: Tue, 20 Aug 2002 23:57:51 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: Your message of "Tue, 20 Aug 2002 23:20:18 EDT."
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL>
Message-ID: <>

> Lexers are painful in Python.  They hit the language in a weak spot
> created by the immutability of strings.  I've found this an obstacle
> more than once, but then I'm a battle-scarred old compiler jock who
> attacks *everything* with lexers and parsers.

I think you're exaggerating the problem, or at least underestimating
the re module.  The re module is pretty fast!  Reading a file
line-by-line is very fast in Python 2.3 with the new "for line in
open(filename)" idiom.  I just scanned nearly a megabyte of ugly data
(a Linux kernel) in 0.6 seconds using the regex '\w+', finding 177,000
words.  The regex (?:\d+|[a-zA-Z_]+) took 1 second, finding 190,000
words.  I expect that the list creation (one hit at a
time) took more time than the matching.
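
A sketch of that kind of scan, for anyone who wants to repeat the measurement.  The helper name is made up, timings will differ by machine, and any iterable of lines (including an open file) can be passed in.

```python
import re

def count_tokens(lines, pattern):
    """Count regex matches across an iterable of lines."""
    regex = re.compile(pattern)
    total = 0
    for line in lines:
        total += len(regex.findall(line))
    return total

sample = ["abc123 x_1\n", "foo bar\n"]
words = count_tokens(sample, r"\w+")
pieces = count_tokens(sample, r"(?:\d+|[a-zA-Z_]+)")
```

The second pattern splits digit runs from letter runs, so "abc123" counts twice; that is consistent with it finding more words on the same input.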

--Guido van Rossum (home page:

From  Wed Aug 21 05:00:21 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 00:00:21 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Tue, Aug 20, 2002, Guido van Rossum wrote:
>>> Is there any particular reason BaseSet and basestring need to raise
>>> different exceptions on an attempt at instantiation ?
>> Hm, I dunno.  NotImplementedError was intended for this kind of use,
>> but TypeError also matches.  I'll add an XXX for this.
> I found a good reason why it should be TypeError, so TypeError it is.

Mind telling us?  (I've always used NotImplementedError, so I'm
curious.)
Aahz (           <*>

Project Vote Smart:

From  Wed Aug 21 05:02:09 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 00:02:09 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Wed, 21 Aug 2002 00:00:21 EDT."
References: <> <> <>
Message-ID: <>

> Mind telling us?  (I've always used NotImplementedError, so I'm
> curious.)

OK, from the checkins:

- Changed the NotImplementedError in BaseSet.__init__ to TypeError,
  both for consistency with basestring() and because we have to use
  TypeError when denying Set.__hash__.  Together those provide
  sufficient evidence that an unimplemented method needs to raise

--Guido van Rossum (home page:

From  Wed Aug 21 05:12:52 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 00:12:52 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> > Are you slapping your forehead yet? :-)
> Yes!  I didn't actually know that algorithm.

I thought it up myself.  Which is funny, since to get there you have to
think like a hardware engineer rather than a logician.  My brain was
definitely out of character that day.
> This is a generator that yields a series of lists whose values are the
> items of base.  And again, like cartesian product, it's now more a
> generator thing than a set thing.

I don't care where it lives, really.  I just like the concision of
being able to say foo.powerset().  Not that I've used this yet, but I
know one algorithm for which it would be helpful.  Another one I
invented, actually, back when I really was a mathematician -- a closed
form for sums of certain categories of probability distributions.  I
called it the Dungeon Dice Theorem.  Never published it.

> BTW, the correctness of all my versions trivially derives from the
> correctness of your version -- each is a very simple transformation of
> the previous one.  My mentor Lambert Meertens calls this process
> Algorithmics (and has developed a mathematical notation and theory for
> program transformations).

Web pointer?
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 05:13:29 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 00:13:29 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <>
Message-ID: <>

I'm mildly curious why nobody has suggested mxTextTools or anything like
it.
Aahz (           <*>

Project Vote Smart:

From  Wed Aug 21 05:16:47 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 00:16:47 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <>
Message-ID: <>

François Pinard <>:
> I never saw `|' nor `&' in literature, except `|' which means "such that"
> in set comprehensions, as Pythoneers would be tempted to say!  On the
> other hand, for programmers, `|' and `&' are rather natural and easy.

Now that I think of it, I've never seen a mathematician use this at all.
But I agree that it's a good choice for programmers.

> Eric has offered the idea of adding Cartesian product, and although the
> usual notation is a tall thin `X', maybe it would be nice reserving `*'
> for that?

Mildly in favor, but I wouldn't cry if it didn't happen.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 05:22:20 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 00:22:20 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

> ...
> I also believe that kjbuckets maintains its data in a sorted order,
> which is unnecessary for sets -- a hash table is much faster.

It's Zope's BTrees that maintain sorted order.  kjbuckets consists of 3
variations ("set", "dict", "graph") of a rather complicated hash table,
driven by the need for the graph flavor to associate multiple values with a
single key.  The kj hash table slots each contain a small contiguous vector,
and then it starts to get complicated <wink>.

> After all we use a very fast hash table implementation to represent sets.
> (The only improvement would be that we could save maybe 4 bytes per
> hash table entry because we don't need a value pointer.)

The set flavor of kjbucket does skip the value pointer.  I suppose it could
have supported multisets too ("bags" -- duplicate keys allowed), but it
doesn't (they can be faked via a kjgraph using dummy values, though -- much
like faking a set in Python via a dict with dummy values!).
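
For readers who have not seen the trick, here is a short sketch of a dict standing in for a set, next to a bag that keeps counts.  collections.Counter comes from later Python versions and is used only as an illustration of a multiset; it is not part of kjbuckets.

```python
from collections import Counter

words = ["spam", "eggs", "spam", "ham", "spam"]

fake_set = {}
for w in words:
    fake_set[w] = 1          # dummy value; only the keys matter

bag = Counter(words)         # a multiset keeps a count per key instead
```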

> ...
> The sets module does not implement analogous operations directly in
> Python.  Almost all the implementation work is done by the dict
> implementation.

kjbuckets has a rich set of operations coded in C, including intersection,
graph composition, and transitive closure.  If the sets module were to
sprout those operations too, they would have to look like the sets
intersection implementation (nests of Python loops and ifs), as Python dicts
don't support primitives able to polish off large chunks of the necessary
work at C speed.  Aaron's claim that kjbuckets can do those kinds of things
10x faster than Python code seems quite safe <wink>.
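
To show what "nests of Python loops and ifs" means in practice, here is a sketch of intersection over two dict-backed sets.  It is written for this illustration and is not the sets module source; every element costs an interpreted loop iteration and a membership test.

```python
def intersection(a, b):
    """Intersect two dict-backed sets (keys are elements, values are dummies)."""
    # Loop over the smaller operand and probe the larger one.
    little, big = (a, b) if len(a) <= len(b) else (b, a)
    result = {}
    for elt in little:
        if elt in big:
            result[elt] = True
    return result

common = intersection(dict.fromkeys([1, 2, 3], True),
                      dict.fromkeys([2, 3, 4, 5], True))
```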

> ...
> kjbuckets may be nice, but adding it to the core would add a serious
> new maintenance burden for the core developers.  I don't see anyone
> raising their hand to help out here.

Not me.  It's about 3500 lines of hairy code, and you wouldn't like some of
the interface decisions it made.  For example,

   a_kjset[3] = 5

adds 3 to the set and ignores 5.  Even Aaron blushes at that one <wink>, but
some of the others would be much harder to sort out.  It would be a major
undertaking to do so.

From  Wed Aug 21 05:26:01 2002
From: (Sean Reifschneider)
Date: Tue, 20 Aug 2002 22:26:01 -0600
Subject: [Python-Dev] Sort() returning sorted list
Message-ID: <>

It would be nice to have, at times, sort() return the sorted list,
but doing so could lead people to believe that it returns a sorted copy of
the list.  What about if we could do "list.sort(returnAfterSorting = 1)"
or something?  It'd be nice if it were short and sweet, but it should
be clear what it's doing too...
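
For reference, later versions of Python settled this with a separate built-in, sorted(), which returns a new list, while list.sort() keeps sorting in place and returning None.  A short sketch of the two behaviours:

```python
data = [3, 1, 2]
ordered_copy = sorted(data)   # new list; data is not modified
unchanged = list(data)        # snapshot taken before the in-place sort
result = data.sort()          # sorts in place and returns None
```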

 Passionate hatred can give meaning and purpose to an empty life.
                 -- Eric Hoffer
Sean Reifschneider, Inimitably Superfluous <> - Linux Consulting since 1995. Qmail, KRUD, Firewalls, Python

From  Wed Aug 21 05:50:39 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 00:50:39 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> VM, dude, VM is your friend.  I thought this through carefully.  The
> size of bogofilter's working set isn't limited by core.

Do you expect that to be an issue?  When I built a database from 20,000
messages, the whole thing fit in a Python dict consuming about 10MB.  That's
all.  As is always the case in this kind of thing, about half the tokens are
utterly useless since they appear only once in only one message (think of
things like misspellings and message ids here -- about half the tokens
generated will be "obviously useless", although the useless won't become
obvious until, e.g., a month passes and you never see the token again).  In
addition, this is a statistical-sampling method, and 20,000 messages is a
very large sample.  I concluded that, in practice, and since we do have ways
to identify and purge useless tokens, 5MB is a reasonable upper bound on the
size of this thing.  It doesn't fit in my L2 cache, but I'd need a
magnifying glass to find it in my RAM.

> And because it's a B-tree variant, the access frequency will be
> to log2 of the wordlist size

I don't believe Judy is faster than Python string-keyed dicts at the sizes
I'm expecting (I've corresponded with Douglas about that too, and his timing
data has a hard time disagreeing <wink>).

> and the patterns will be spatially bursty.

Why?  Do you sort tokens before looking them up?  Else I don't see a reason
to expect that, from one lookup to the next, the paths from root to leaf
will enjoy spatial overlap beyond the root node.

> This is a memory access pattern that plays nice with an LRU pager.

Well, as I said before, all the evidence I've seen says the scoring time for
a message is near-trivial (including the lookup times), even in pure Python.
It's only the parsing that I'm still worried about, and I may yet confess a
bad case of Flex envy.

> I'm working on a simpler solution, one which might have a Pythonic spinoff.
> Stay tuned.

I figure this means something simpler than a BTree under ZODB.  If so, you
should set yourself a tougher challenge <wink>.

From  Wed Aug 21 06:11:03 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 01:11:03 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

[Guido, eventually arrives at ...]
> def powerset(base):
>     pairs = list(enumerate(base))
>     size = len(pairs)
>     for n in xrange(2**size):
>         subset = []
>         for e, x in pairs:
>             if n & 2**e:
>                 subset.append(x)
>         yield subset

Now let's rewrite that in modern Python <wink>:

def powerset(base):
    pairs = [(2**i, x) for i, x in enumerate(base)]
    for i in xrange(2**len(pairs)):
        yield [x for (mask, x) in pairs if i & mask]

> This is a generator that yields a series of lists whose values are the
> items of base.  And again, like cartesian product, it's now more a
> generator thing than a set thing.

Generators are great for constructing families of combinatorial objects,
BTW.  They can be huge, and in algorithms that use them as inputs for
searches, you can often expect not to need more than a minuscule fraction of
the entire family, but can't predict how many you'll need.  Generators are
perfect then.
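
A sketch of that laziness in present-day Python 3 (range instead of xrange), assuming a search that stops after looking at a handful of subsets.  itertools.islice stands in for "the search gave up early"; the 40-item base is an arbitrary choice.

```python
from itertools import islice


def powerset(base):
    """Yield every subset of base as a list, one at a time."""
    # Pair each item with its own bit mask: 1, 2, 4, ...
    pairs = [(2 ** i, x) for i, x in enumerate(base)]
    for n in range(2 ** len(pairs)):
        yield [x for mask, x in pairs if n & mask]


# A 40-item base has 2**40 subsets, but only the first five are ever built.
first_five = list(islice(powerset(range(40)), 5))
print(first_five)
```

A list-returning version would have to build all 2**40 subsets before the caller could inspect the first one.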

> BTW, the correctness of all my versions trivially derives from the
> correctness of your version -- each is a very simple transformation of
> the previous one.  My mentor Lambert Meertens calls this process
> Algorithmics (and has developed a mathematical notation and theory for
> program transformations).

Was he interested in that before his sabbatical with the SETL folks?  The
SETL project produced lots of great research in that area, largely driven by
the desire to help ultra-high-level SETL programs finish in their lifetimes.

From  Wed Aug 21 06:14:27 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 01:14:27 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Tue, 20 Aug 2002 23:47:11 EDT."
Message-ID: <>

A few transformations down the road, here's a 4-line powerset() generator:

def powerset(base):
    pairs = [(2**i, x) for i, x in enumerate(base)]
    for n in xrange(2**len(pairs)):
        yield [x for m, x in pairs if m&n]

--Guido van Rossum (home page:

From  Wed Aug 21 06:23:15 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 01:23:15 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Wed, 21 Aug 2002 01:11:03 EDT."
References: <>
Message-ID: <>

> Now let's rewrite that in modern Python <wink>:
> def powerset(base):
>     pairs = [(2**i, x) for i, x in enumerate(base)]
>     for i in xrange(2**len(pairs)):
>         yield [x for (mask, x) in pairs if i & mask]

Honest, I didn't see this before I posted mine. :-)

> > This is a generator that yields a series of lists whose values are the
> > items of base.  And again, like cartesian product, it's now more a
> > generator thing than a set thing.
> Generators are great for constructing families of combinatorial
> objects, BTW.  They can be huge, and in algorithms that use them as
> inputs for searches, you can often expect not to need more than a
> minuscule fraction of the entire family, but can't predict how many
> you'll need.  Generators are perfect then.

Yup.  That's why I strove for these to be generators.

> > BTW, the correctness of all my versions trivially derives from the
> > correctness of your version -- each is a very simple
> > transformation of the previous one.  My mentor Lambert Meertens
> > calls this process Algorithmics (and has developed a mathematical
> > notation and theory for program transformations).
> Was he interested in that before his sabbatical with the SETL folks?

Dunno.  Maybe it sparked his interest.

> The SETL project produced lots of great research in that area,
> largely driven by the desire to help ultra-high-level SETL programs
> finish in their lifetimes.

--Guido van Rossum (home page:

From  Wed Aug 21 06:25:42 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 01:25:42 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

[Guido, retroactively channeling Tim]
> A few transformations down the road, here's a 4-line powerset() generator:
> def powerset(base):
>     pairs = [(2**i, x) for i, x in enumerate(base)]
>     for n in xrange(2**len(pairs)):
>         yield [x for m, x in pairs if m&n]

Mine used one less local variable, so is more cache-friendly <wink>.

From  Wed Aug 21 06:26:17 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 01:26:17 -0400
Subject: [Python-Dev] Sort() returning sorted list
In-Reply-To: Your message of "Tue, 20 Aug 2002 22:26:01 MDT."
References: <>
Message-ID: <>

> It would be nice to have, at times, sort() return the sorted list,
> but doing so could lead people to believe that it returns a sorted copy of
> the list.  What about if we could do "list.sort(returnAfterSorting = 1)"
> or something?  It'd be nice if it were short and sweet, but it should
> be clear what it's doing too...

Nah, this is not provided precisely for the reason you give.  You can
write your own utility easily enough:

def sort(L):
    L.sort()
    return L

--Guido van Rossum (home page:

From  Wed Aug 21 06:35:56 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 01:35:56 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> [Eric S. Raymond]
> > VM, dude, VM is your friend.  I thought this through carefully.  The
> > size of bogofilter's working set isn't limited by core.
> Do you expect that to be an issue?  When I built a database from 20,000
> messages, the whole thing fit in a Python dict consuming about 10MB.

Hm, that's a bit smaller than I would have thought, but the order of 
magnitude I was expecting.

>                                                                     That's
> all.  As is always the case in this kind of thing, about half the tokens are
> utterly useless since they appear only once in only one message (think of
> things like misspellings and message ids here -- about half the tokens
> generated will be "obviously useless", although the useless won't become
> obvious until, e.g., a month passes and you never see the token again).

Recognition features should age!  Wow!  That's a good point!  With the age
counter being reset when they're recognized.
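
One way the aging idea could look, as a rough Python 3 sketch.  The class name AgingTokenTable, the tick() method and the max_age threshold are all made up for illustration; nothing here is taken from bogofilter or the spambayes sandbox.

```python
class AgingTokenTable:
    """Token counts with an age counter that resets whenever a token is seen."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.counts = {}   # token -> number of times seen
        self.ages = {}     # token -> ticks since the token was last seen

    def see(self, token):
        self.counts[token] = self.counts.get(token, 0) + 1
        self.ages[token] = 0          # recognition resets the age

    def tick(self):
        """Advance time one step and purge tokens that have grown too old."""
        for token in list(self.ages):
            self.ages[token] += 1
            if self.ages[token] > self.max_age:
                del self.ages[token]
                del self.counts[token]


table = AgingTokenTable(max_age=2)
table.see("viagra")
table.see("mispeling")
table.tick()
table.see("viagra")      # seen again, so its age goes back to 0
table.tick()
table.tick()             # "mispeling" is now older than max_age and is purged
print(sorted(table.counts))
```

What a tick corresponds to (a day, a month, a batch of messages) is left open here.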

> > and the patterns will be spatially bursty.
> Why?  Do you sort tokens before looking them up? 

I thought part of the point of the method was that you get sorting for free
because of the way elements are inserted.

>                                             Else I don't see a reason
> to expect that, from one lookup to the next, the paths from root to leaf
> will enjoy spatial overlap beyond the root node.

No, but think about how the pointer in a binary search moves.  It's
spatially bursty.  Memory access frequencies for repeated binary
searches will be a sum of bursty signals, analogous to the way
network traffic volumes look in the time domain.  In fact the
graph of memory address vs. number of accesses is gonna wind up
looking an awful lot like 1/f noise, I think.  *Not* evenly
distributed; something there for LRU to work with.
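
A small Python 3 experiment along these lines, assuming a plain sorted array rather than any particular B-tree variant: it records which slots repeated binary searches touch.  It only shows that the touches are unevenly distributed; it makes no attempt to confirm the 1/f guess.  The array size, seed and lookup count are arbitrary.

```python
import random
from collections import Counter


def probes(sorted_keys, target):
    """Return the list of indices a binary search touches while seeking target."""
    touched = []
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        touched.append(mid)
        if sorted_keys[mid] < target:
            lo = mid + 1
        elif sorted_keys[mid] > target:
            hi = mid
        else:
            break
    return touched


keys = list(range(1023))
rng = random.Random(0)              # fixed seed so runs are repeatable
hits = Counter()
lookups = 10000
for _ in range(lookups):
    hits.update(probes(keys, rng.choice(keys)))

# The middle slot is touched by every lookup; most slots are touched rarely.
print(hits.most_common(3))
```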

> > I'm working on a simpler solution, one which might have a Pythonic spinoff.
> > Stay tuned.
> I figure this means something simpler than a BTree under ZODB.  If so, you
> should set yourself a tougher challenge <wink>.

What I'm starting to test now is a refactoring of the program where it
spawns a daemon version of itself the first time it's called.  The daemon
eats the wordlists and stays in core fielding requests from subsequent
program runs.  Basically an answer to the "how do you call bogofilter 1K
times a day from procmail without bringing your disks to their knees"
problem -- persistence on the cheap.

Thing is that the solution to this problem is very generic.  Might
turn into a Python framework.
		Eric S. Raymond

From  Wed Aug 21 06:36:26 2002
From: (Greg Ewing)
Date: Wed, 21 Aug 2002 17:36:26 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

"Eric S. Raymond" <>:

> The | & one is distinctly less common than either, at least among
> mathematicians; I think EEs and suchlike may use it more than we do.

I'm surprised that mathematicians use | and & at all. I had always
assumed that these were invented by the programming community, being
available ASCII characters used in programming languages, and that
mathematicians wouldn't ever use them if they had a choice. But maybe
I'm wrong!

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
				   +--------------------------------------+

From  Wed Aug 21 06:41:11 2002
From: (Greg Ewing)
Date: Wed, 21 Aug 2002 17:41:11 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

Guido van Rossum <>:

> def product(s1, s2):
>     cp = Set()
>     for x in s1:
>         for y in s2:
>             cp.add((x, y))
>     return cp

Oh, no. Someone is bound to want set comprehensions, now...
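
For what it's worth, later versions of Python did grow set comprehensions.  A sketch of the quoted product() in that style, assuming present-day Python 3 and the built-in set type rather than the Set class under discussion:

```python
def product(s1, s2):
    """Cartesian product of two iterables, returned as a set of pairs."""
    return {(x, y) for x in s1 for y in s2}


pairs = product({1, 2}, {"a", "b"})
print(sorted(pairs))
```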

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
				   +--------------------------------------+

From  Wed Aug 21 06:46:31 2002
From: (Greg Ewing)
Date: Wed, 21 Aug 2002 17:46:31 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

(François Pinard):

> Eric has offered the idea of adding Cartesian product, and despite the
> usual notation is a tall thin `X', maybe it would be nice reserving
> `*' for that?

Maybe '%'? It looks a bit X-ish.

Or if Python ever gets a dedicated matrix-multiplication
operator, maybe that could be reused.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |
				   +--------------------------------------+

From  Wed Aug 21 06:58:25 2002
From: (Zack Weinberg)
Date: Tue, 20 Aug 2002 22:58:25 -0700
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Wed, Aug 21, 2002 at 01:35:56AM -0400, Eric S. Raymond wrote:
> What I'm starting to test now is a refactoring of the program where it
> spawns a daemon version of itself the first time it's called.  The daemon
> eats the wordlists and stays in core fielding requests from subsequent
> program runs.  Basically an answer to the "how do you call bogofilter 1K
> times a day from procmail without bringing your disks to their knees"
> problem -- persistence on the cheap.

For use at ISPs, the daemon should be able to field requests from lots
of different users, maintaining one unified word list.  Without
needing any access whatsoever to user home directories.


From  Wed Aug 21 07:00:26 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 02:00:26 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <>
Message-ID: <>

Greg Ewing <>:
> > The | & one is distinctly less common than either, at least among
> > mathematicians; I think EEs and suchlike may use it more than we do.
> I'm surprised that mathematicians use | and & at all. I had always
> assumed that these were invented by the programming community, being
> available ASCII characters used in programming languages, and that
> mathematicians wouldn't ever use them if they had a choice. But maybe
> I'm wrong!

Your post crossed one of mine in which, on reflection, I said I'd never
seen a mathematician use these.  Not even me, not when I'm doing math
anyway.  I still think in Birkhoff's lattice-theory notation.

Nevertheless I'm quite comfortable with | & when programming.
		Eric S. Raymond

From  Wed Aug 21 07:03:33 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 02:03:33 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <> <>
Message-ID: <>

Greg Ewing <>:
> (François Pinard):
> > Eric has offered the idea of adding Cartesian product, and despite the
> > usual notation is a tall thin `X', maybe it would be nice reserving
> > `*' for that?
> Maybe '%'? It looks a bit X-ish.

Ugh.  *No.*  -1 on that.

I'd look at % and see a string format operator.   Suddenly I like Francois's
suggestion a lot better.
		Eric S. Raymond

From  Wed Aug 21 07:13:16 2002
From: (Sean Reifschneider)
Date: Wed, 21 Aug 2002 00:13:16 -0600
Subject: [Python-Dev] Sort() returning sorted list
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Wed, Aug 21, 2002 at 01:26:17AM -0400, Guido van Rossum wrote:
>def sort(L):
>    L.sort()
>    return L

Yeah, that's not QUITE as bad as having to write my own string-handling
routines.  ;-/

So, there's really no place in Python itself for this?  Like,
list.sort_inplace() and list.sort_copy(), both of which return something
because it's obvious what they do?
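
The two behaviours being asked for can be sketched as plain helper functions in Python 3.  sort_inplace and sort_copy are just the names floated above, written as module-level functions; they are not list methods in any Python release.

```python
def sort_inplace(items):
    """Sort the list itself and hand the same object back."""
    items.sort()
    return items


def sort_copy(items):
    """Leave the argument alone and return a new sorted list."""
    result = list(items)
    result.sort()
    return result


data = [3, 1, 2]
fresh = sort_copy(data)     # data is still [3, 1, 2] at this point
same = sort_inplace(data)   # data itself is now [1, 2, 3]
print(fresh, same is data)
```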

 A ship in port is safe, but that is not what ships are for.
                 -- Rear Admiral Grace Murray Hopper
Sean Reifschneider, Inimitably Superfluous <> - Linux Consulting since 1995. Qmail, KRUD, Firewalls, Python

From  Wed Aug 21 07:22:26 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 02:22:26 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

Zack Weinberg <>:
> On Wed, Aug 21, 2002 at 01:35:56AM -0400, Eric S. Raymond wrote:
> > 
> > What I'm starting to test now is a refactoring of the program where it
> > spawns a daemon version of itself the first time it's called.  The daemon
> > eats the wordlists and stays in core fielding requests from subsequent
> > program runs.  Basically an answer to the "how do you call bogofilter 1K
> > times a day from procmail without bringing your disks to their knees"
> > problem -- persistence on the cheap.
> For use at ISPs, the daemon should be able to field requests from lots
> of different users, maintaining one unified word list.  Without
> needing any access whatsoever to user home directories.

I'm on it.  The following is not yet working, but it's a straight road to get

There is a public spam-checker port.  Your client program sends it
packets consisting of a list of header token counts.  You
can send lots of these blocks; each one has to be under the maximum
atomic-message size for sockets (I think that's 32K).  

The server accumulates the frequency counts you ship it until you say
"OK, what is it?"  Does the Bayes test.  Ships you back a result.
		Eric S. Raymond
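
A rough in-process sketch of the accumulate-then-score flow described in this message, in Python 3 and with no sockets at all.  Everything in it is an assumption made for illustration: the class name ScoringSession, the per-token probability table, the 0.4 default for unseen tokens, and the naive-Bayes-style combining rule.  It is not bogofilter's wire format or scoring code.

```python
class ScoringSession:
    """Accumulates token counts sent in several blocks, then scores them."""

    def __init__(self, spam_prob, unknown=0.4):
        self.spam_prob = spam_prob   # token -> assumed P(spam | token)
        self.unknown = unknown       # assumed probability for unseen tokens
        self.counts = {}

    def add_block(self, block):
        """Merge one block of {token: count} into the running totals."""
        for token, count in block.items():
            self.counts[token] = self.counts.get(token, 0) + count

    def verdict(self):
        """Combine per-token probabilities into one spam probability."""
        spam, ham = 1.0, 1.0
        for token, count in self.counts.items():
            p = self.spam_prob.get(token, self.unknown)
            spam *= p ** count
            ham *= (1.0 - p) ** count
        return spam / (spam + ham)


session = ScoringSession({"free": 0.9, "meeting": 0.1})
session.add_block({"free": 2})                  # first packet
session.add_block({"free": 1, "meeting": 1})    # second packet
print(round(session.verdict(), 3))              # the "OK, what is it?" step
```

A real server would also need to guard against probabilities of exactly 0 and 1 meeting in the same message, which would make the denominator zero.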

From  Wed Aug 21 07:26:24 2002
From: (Delaney, Timothy)
Date: Wed, 21 Aug 2002 16:26:24 +1000
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
Message-ID: <>

> From: Guido van Rossum []
> OK, from the checkins:
> - Changed the NotImplementedError in BaseSet.__init__ to TypeError,
>   both for consistency with basestring() and because we have to use
>   TypeError when denying Set.__hash__.  Together those provide
>   sufficient evidence that an unimplemented method needs to raise
>   TypeError.

Hmm ... is there a case that NotImplementedError should be a subclass of
TypeError? Conceptually it would make sense (this *type* does not implement
this method).

Of course, it would probably also break code ...
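
A quick check of how the two exceptions relate in current CPython (Python 3), where NotImplementedError derives from RuntimeError.  The Base class and the classify helper are made up for the demonstration.

```python
# NotImplementedError derives from RuntimeError, not TypeError,
# so an "except TypeError" clause does not catch it.
print(issubclass(NotImplementedError, TypeError))
print(issubclass(NotImplementedError, RuntimeError))


class Base:
    def method(self):
        raise NotImplementedError("subclass must override method()")


def classify(call):
    """Report which of the two exception types a call raises."""
    try:
        call()
    except TypeError:
        return "TypeError"
    except NotImplementedError:
        return "NotImplementedError"
    return "no error"


print(classify(Base().method))         # an unimplemented method
print(classify(lambda: hash(set())))   # hashing a mutable set is a TypeError
```

Code with an `except TypeError` clause would start swallowing NotImplementedError if the hierarchy changed, which is one way the proposal could break things.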

Tim Delaney

From  Wed Aug 21 09:16:55 2002
From: (Martin Sjögren)
Date: 21 Aug 2002 10:16:55 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <1029917815.581.3.camel@winterfell>

On Wed 2002-08-21 at 03.05, Eric S. Raymond wrote:
> For historical reasons, there are three different notations for Boolean
> algebra in common use.  You're describing the one derived from set theory.
> I personally favor the one derived from lattice algebra; the distinctive
> feature of that one is the pointy and &/| operators that look like /\ and
> \/.  The third uses | and &.

Uhm, what about + and juxtaposition? They are quite common at least here
in Sweden, for boolean algebra.



From  Wed Aug 21 09:17:44 2002
From: (Fredrik Lundh)
Date: Wed, 21 Aug 2002 10:17:44 +0200
Subject: [Python-Dev] Re: Automatic flex interface for Python?
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <>
Message-ID: <015901c248eb$97772ec0$0900a8c0@spiff>

aahz wrote:

> I'm mildly curious why nobody has suggested mxTextTools or anything like
> that.

I'm mildly curious why mxTextTools proponents

From  Wed Aug 21 09:28:32 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 04:28:32 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <1029917815.581.3.camel@winterfell>
References: <> <> <> <1029917815.581.3.camel@winterfell>
Message-ID: <>

Martin Sjögren <>:
> Uhm, what about + and juxtaposition? They are quite common at least here
> in Sweden, for boolean algebra.

Is it + for disjunction and juxtaposition for conjunction, or the other
way around?  Not that I've ever seen either variant.
		Eric S. Raymond


From  Wed Aug 21 09:34:49 2002
From: (Fredrik Lundh)
Date: Wed, 21 Aug 2002 10:34:49 +0200
Subject: [Python-Dev] Re: Automatic flex interface for Python?
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff>
Message-ID: <018e01c248ed$9ffc0d20$0900a8c0@spiff>

> I'm mildly curious why mxTextTools proponents

eh?  why did my mailer send that mail?  what I was trying to say
is that I'm mildly curious why people tend to treat mxTextTools like
some kind of silver bullet, without actually comparing it to a well-
written regular expression.

I've heard from people who've spent days rewriting their application,
only to find that the resulting program was slower.

(as Guido noted, for problems like this, the overhead isn't so much
in the engine itself, as in the effort needed to create Python data
structures...)


From  Wed Aug 21 09:36:07 2002
From: (Martin Sjögren)
Date: 21 Aug 2002 10:36:07 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
 <> <1029917815.581.3.camel@winterfell>
Message-ID: <1029918967.582.13.camel@winterfell>

On Wed 2002-08-21 at 10.28, Eric S. Raymond wrote:
> Martin Sjögren <>:
> > Uhm, what about + and juxtaposition? They are quite common at least here
> > in Sweden, for boolean algebra.
> Is it + for disjunction and juxtaposition for conjunction, or the other
> way around?  Not that I've ever seen either variant.

I've often seen it in the context of electronics. a+1 = 1, a0 = 0 and so
on. That is, + is disjunction and juxtaposition (or a multiplication
dot) is conjunction.
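
Reading + as "or" and juxtaposition as "and", the two identities can be checked exhaustively with a few lines of Python 3.  The distributive law is thrown in as an extra check and is not something quoted from the thread.

```python
from itertools import product

for a, b, c in product((False, True), repeat=3):
    assert (a or True) is True               # a + 1 = 1
    assert (a and False) is False            # a0 = 0
    # Distributive law: a(b + c) = ab + ac
    assert (a and (b or c)) == ((a and b) or (a and c))

print("identities hold for all inputs")
```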

Hmm, I just realized that I've also seen it in an American book on
discrete maths, so it's not just us Swedes ;)



From  Wed Aug 21 10:05:27 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 05:05:27 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <1029918967.582.13.camel@winterfell>
References: <> <> <> <1029917815.581.3.camel@winterfell> <> <1029918967.582.13.camel@winterfell>
Message-ID: <>

Martin Sjögren <>:
> > Is it + for disjunction and juxtaposition for conjunction, or the other
> > way around?  Not that I've ever seen either variant.
> I've often seen it in the context of electronics. a+1 = 1, a0 = 0 and so
> on. That is, + is disjunction and juxtaposition (or a multiplication
> dot) is conjunction.

Makes sense.  Hardware designers care a lot about reduction to disjunctive
normal form.  Much more than logicians do, actually.

> Hmm, I just realized that I've also seen it in an American book on
> discrete maths, so it's not just us Swedes ;)

Odd that I haven't encountered it.
		Eric S. Raymond


From  Wed Aug 21 11:30:35 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 06:30:35 -0400
Subject: [Python-Dev] Embarassed in Malvern
Message-ID: <>

Apparently I was hallucinating when I thought these had been released in
open source.  Aaarrgghh.  Well, they had a butt-ugly API anyway. I'll
replace them with Damian Ivereigh's libredblack in 0.3.
		Eric S. Raymond

Government should be weak, amateurish and ridiculous. At present, it
fulfills only a third of the role.	-- Edward Abbey

From  Wed Aug 21 12:57:45 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 07:57:45 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
Message-ID: <>

>>>>> "ZW" == Zack Weinberg <> writes:

    ZW> For use at ISPs, the daemon should be able to field requests
    ZW> from lots of different users, maintaining one unified word
    ZW> list.  Without needing any access whatsoever to user home
    ZW> directories.

An approach like this certainly makes sense for a mailing list server,
especially when all the lists are roughly about the same topic.  Even
without that, I suspect that spam across lists all looks the same,
while non-spam will differ so there may be list/site organizations you
can exploit here.


From  Wed Aug 21 13:28:24 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 08:28:24 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <018e01c248ed$9ffc0d20$0900a8c0@spiff>
References: <> <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff>
Message-ID: <>

On Wed, Aug 21, 2002, Fredrik Lundh wrote:
> > I'm mildly curious why mxTextTools proponents 
> eh?  why did my mailer send that mail?  what I was trying to say
> is that I'm mildly curious why people tend to treat mxTextTools like
> some kind of silver bullet, without actually comparing it to a well-
> written regular expression.
> I've heard from people who've spent days rewriting their application,
> only to find that the resulting program was slower.

Okay, so that's one datapoint.  I've never actually used mxTextTools;
I'm mostly going by comments Tim Peters has made in the past suggesting
that regex tools are poor for parsing.  Since he's the one saying that
regex is fast enough this time, I figured it'd be an appropriate time to
throw up a question.

Aahz

From  Wed Aug 21 13:39:44 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 08:39:44 -0400
Subject: [Python-Dev] Sort() returning sorted list
In-Reply-To: Your message of "Wed, 21 Aug 2002 00:13:16 MDT."
References: <> <>
Message-ID: <>

> So, there's really no place in Python itself for this?  Like,
> list.sort_inplace() and list.sort_copy(), both of which return something
> because it's obvious what they do?


--Guido van Rossum (home page:

From  Wed Aug 21 13:41:05 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 08:41:05 -0400
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Wed, 21 Aug 2002 16:26:24 +1000."
References: <>
Message-ID: <>

> Hmm ... is there a case that NotImplementedError should be a
> subclass of TypeError? Conceptually it would make sense (this *type*
> does not implement this method).

I think you're overthinking this.  NotImplementedError is fine for
code that wants to send that particular message to the user.  We're
playing with TypeError here because we're trying to be close to the

--Guido van Rossum (home page:

From  Wed Aug 21 14:31:36 2002
From: (Gordon McMillan)
Date: Wed, 21 Aug 2002 09:31:36 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <018e01c248ed$9ffc0d20$0900a8c0@spiff>
Message-ID: <3D635DF8.9480.90CBCDDD@localhost>

On 21 Aug 2002 at 10:34, Fredrik Lundh wrote:

> eh?  why did my mailer send that mail?  what I was
> trying to say is that I'm mildly curious why people
> tend to treat mxTextTools like some kind of silver
> bullet, without actually comparing it to a well-
> written regular expression.
> I've heard from people who've spent days rewriting
> their application, only to find that the resulting
> program was slower.
> (as Guido noted, for problems like this, the
> overhead isn't so much in the engine itself, as in
> the effort needed to create Python data
> structures...) 

mxTextTools lets (encourages?) you to break all
the rules about lex -> parse. If you can (& want to)
put a good deal of the "parse" stuff into the scanning
rules, you can get a speed advantage. You're also
not constrained by the rules of BNF, if you choose
to see that as an advantage :-).

My one successful use of mxTextTools came after
using SPARK to figure out what I actually needed
in my AST, and realizing that the ambiguities in the
grammar didn't matter in practice, so I could produce
an almost-AST directly.

-- Gordon

From  Wed Aug 21 14:36:44 2002
From: (Gordon McMillan)
Date: Wed, 21 Aug 2002 09:36:44 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <018e01c248ed$9ffc0d20$0900a8c0@spiff>
Message-ID: <3D635F2C.4839.90D07DA3@localhost>

On 21 Aug 2002 at 8:28, Aahz wrote:

> ...  I've never actually used mxTextTools; I'm
> mostly going by comments Tim Peters has made in the
> past suggesting that regex tools are poor for
> parsing.  

They suck for parsing. They excel for lexing, however.

-- Gordon

From  Wed Aug 21 14:47:11 2002
From: (Skip Montanaro)
Date: Wed, 21 Aug 2002 08:47:11 -0500
Subject: [Python-Dev] Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <15715.39391.629762.256170@gargle.gargle.HOWL>

    Guido> I haven't given up on the re module for fast scanners (see Tim's
    Guido> note on the speed of tokenizing 20,000 messages in mere minutes).
    Guido> Note that the Bayes approach doesn't *need* a trick to apply many
    Guido> regexes in parallel to the text.

Right.  I'm thinking of it in situations where you do need such tricks.
SpamAssassin is one such place.  I think Eric has an application (quickly
tokenizing the data produced by an external program, where the data can run
into several hundreds of thousands of lines) where this might be beneficial
as well.


From  Wed Aug 21 16:00:45 2002
From: (Skip Montanaro)
Date: Wed, 21 Aug 2002 10:00:45 -0500
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <15715.43805.195606.442523@gargle.gargle.HOWL>

    aahz> I'm mostly going by comments Tim Peters has made in the past
    aahz> suggesting that regex tools are poor for parsing.  

parsing != tokenizing. ;-)

Regular expressions are great for tokenizing (most of the time).
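
A minimal sketch of the tokenizing half, using the standard re module in Python 3.  The token classes (NUMBER, NAME, OP, SKIP) are a toy grammar invented for the example; turning the token stream into a tree is the parsing half and is not attempted.

```python
import re

# One alternation with a named group per token class.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[-+*/=()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on an unrecognized character."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":      # whitespace is matched but dropped
            yield match.lastgroup, match.group()
        pos = match.end()


print(list(tokenize("x = 3 * (y + 42)")))
```

Nesting, such as the parenthesized group in the sample input, is exactly what the regex cannot check: the tokenizer would accept "((" without complaint.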


From  Wed Aug 21 16:15:54 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 11:15:54 -0400
Subject: [Python-Dev] Embarassed in Malvern
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> Apparently I was hallucinating when I thought these had been released in
> open source.

Are you talking about Judy?  If so, the LGPL'ed source isn't at HP, it's at

From  Wed Aug 21 16:31:54 2002
From: (Raymond Hettinger)
Date: Wed, 21 Aug 2002 11:31:54 -0400
Subject: [Python-Dev] Backwards compatiblity
References: <>  <>
Message-ID: <001701c24927$e162dca0$9feb7ad1@othello>

For 2.2.2, if we add False,True=0,1 to __builtins__, then code written for
2.3 will more likely run without modification.  For instance, that is all
the sets module needs to run under 2.2.

Since 2.2 may end up being the Py-in-a-Tie and because we want 2.3 book
examples to be likely to run in an older environment, I propose we add the
two builtins.  Further, since we don't want to encourage further propagation
of custom dictionary-based sets, we should consider adding the sets module
also.  In both cases, it can't hurt to add the extra functions and it can
certainly help some of the time.

Raymond Hettinger

From  Wed Aug 21 16:36:18 2002
From: (Alex Martelli)
Date: Wed, 21 Aug 2002 17:36:18 +0200
Subject: [Python-Dev] Backwards compatiblity
In-Reply-To: <001701c24927$e162dca0$9feb7ad1@othello>
References: <> <> <001701c24927$e162dca0$9feb7ad1@othello>
Message-ID: <>

On Wednesday 21 August 2002 05:31 pm, Raymond Hettinger wrote:
> For 2.2.2, if we add False,True=0,1 to __builtins__, then code written for
> 2.3 will more likely run without modification.  For instance, that is all
> the sets module needs to run under 2.2.

False and True with those values (as well as function bool) are already
in 2.2.1's __builtins__.


From  Wed Aug 21 16:44:09 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 11:44:09 -0400
Subject: [Python-Dev] Backwards compatiblity
In-Reply-To: <001701c24927$e162dca0$9feb7ad1@othello>
Message-ID: <>

[Raymond Hettinger]
> For 2.2.2, if we add False,True=0,1 to __builtins__, then code
> written for 2.3 will more likely run without modification.  For
> instance, that is all the sets module need to run under 2.2.

Good idea!  Before spending *too* much time on it, though <wink>, note that
Guido already did it for 2.2.1:

Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> True
1
>>> False
0

> ...
> Further, since we don't want to encourage further propagation of custom
> dictionary based sets, we should consider adding the sets module also.

Strongly doubt that one will happen.

> In both cases, it can't hurt to add the extra functions and it can
> help some of the time.

The sets module is still pre-alpha, and adding pre-alpha anything to a
"stability release" is highly dubious.  At best, it would create
artificial compatibility problems if 2.3 alpha and beta tests show a need to
change the sets API.  If people want new features, that's what new releases
are for.

From  Wed Aug 21 16:42:22 2002
From: (Raymond Hettinger)
Date: Wed, 21 Aug 2002 11:42:22 -0400
Subject: [Python-Dev] Backwards compatiblity
References: <> <> <001701c24927$e162dca0$9feb7ad1@othello> <>
Message-ID: <000701c24929$57d8ada0$9feb7ad1@othello>

> False and True with those values (as well as function bool) are already
> in 2.2.1's __builtins__.


From  Wed Aug 21 16:51:04 2002
From: (Raymond Hettinger)
Date: Wed, 21 Aug 2002 11:51:04 -0400
Subject: [Python-Dev] Backwards compatiblity
References: <>
Message-ID: <001a01c2492a$8f27cec0$9feb7ad1@othello>

> The sets module is still pre-alpha, and adding pre-alpha anything to a
> "stability release" is highly dubious.  At best, it would create
> artificial compatibility problems if 2.3 alpha and beta tests show a need to
> change the sets API.  If people want new features, that's what new releases
> are for.

Once, it is firmed-up a bit, how about putting on or in the Vaults of Parnassus?

From  Wed Aug 21 17:00:43 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 12:00:43 -0400
Subject: [Python-Dev] Backwards compatiblity
References: <>
Message-ID: <>

>>>>> "RH" == Raymond Hettinger <> writes:

    RH> Once, it is firmed-up a bit, how about putting on
    RH> or in the Vaults of Parnassus?

Or making a nice little distutils package available on SF?


From  Wed Aug 21 17:24:02 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 12:24:02 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <15715.43805.195606.442523@gargle.gargle.HOWL>
References: <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff> <> <15715.43805.195606.442523@gargle.gargle.HOWL>
Message-ID: <>

On Wed, Aug 21, 2002, Skip Montanaro wrote:
>     aahz> I'm mostly going by comments Tim Peters has made in the past
>     aahz> suggesting that regex tools are poor for parsing.  
> parsing != tokenizing. ;-)
> Regular expressions are great for tokenizing (most of the time).

Ah.  Here we see one of the little drawbacks of not finishing my CS
degree.  ;-)  Can someone suggest a good simple reference on the
distinctions between parsing / lexing / tokenizing, particularly in the
context of general string processing (e.g. XML) rather than the arcane
art of compiler technology?
Aahz

From  Wed Aug 21 17:46:17 2002
From: (Fredrik Lundh)
Date: Wed, 21 Aug 2002 18:46:17 +0200
Subject: [Python-Dev] Re: Automatic flex interface for Python?
References: <> <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff> <> <15715.43805.195606.442523@gargle.gargle.HOWL> <>
Message-ID: <01dd01c24932$476307a0$ced241d5@hagrid>

aahz wrote:

> Ah.  Here we see one of the little drawbacks of not finishing my CS
> degree.  ;-)  Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing

start here:

> particularly in the context of general string processing (e.g. XML)
> rather than the arcane art of compiler technology?

words tend to mean slightly different things in the XML universe,
so I'll leave that to the XML experts.


From  Wed Aug 21 17:54:10 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 12:54:10 -0400
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
References: <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff> <> <15715.43805.195606.442523@gargle.gargle.HOWL> <>
Message-ID: <>

Aahz <>:
> Ah.  Here we see one of the little drawbacks of not finishing my CS
> degree.  ;-)  Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing, particularly in the
> context of general string processing (e.g. XML) rather than the arcane
> art of compiler technology?

It's pretty simple, actually.  Lexing *is* tokenizing; it's breaking the 
input stream into appropriate lexical units.  When you say "lexing" it
implies that your tokenizer may be doing other things as well -- handling 
comment syntax, or gathering low-level semantic information like "this 
is a typedef".

Parsing, on the other hand, consists of attempting to match your input
to a grammar.  The result of a parse is typically either "this
matches" or to throw an error.  

There are two kinds of parsers -- event generators and structure builders.
Event generators call designated hook functions when they recognize a
piece of syntax you're interested in.  In XML-land, SAX is like this.
Structure builders return some data structure (typically a tree)
representing the syntax of your input.  In XML-land, DOM is like this.
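
A minimal sketch of the two styles, assuming a current Python 3 standard
library (xml.sax and xml.dom.minidom).  The sample document and the
ItemCounter handler are made up for illustration:

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this illustration.
DOC = "<list><item>spam</item><item>eggs</item></list>"


class ItemCounter(xml.sax.ContentHandler):
    """Event-generator style: the parser calls our hook per start tag."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


# SAX: nothing is kept except what the hooks choose to record.
counter = ItemCounter()
xml.sax.parseString(DOC.encode("utf-8"), counter)

# DOM: the parser hands back a whole tree that we can walk afterwards.
tree = minidom.parseString(DOC)
items = [node.firstChild.data for node in tree.getElementsByTagName("item")]
```

The SAX handler only ever sees one event at a time, so it suits large
inputs; the DOM tree costs memory but can be revisited in any order.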

There is a vast literature on parsing.  You don't need to know most of
it.  The key thing to remember is that, except for very simple cases,
writing parsers by hand is usually stupid.  Even when it's simple to
do, machine-generated parsers have better hooks for error recovery.
There are several `parser generator' tools that will compile a grammar
specification to a parser; the best-known one is Bison, an open-source
implementation of the classic Unix tool YACC (Yet Another Compiler
Compiler).

Python has its own parser generator, SPARK.  Unfortunately, while
SPARK is quite powerful (that is, good for handling ambiguities in the
spec), the Earley algorithm it uses gives O(n**3) performance in the
generated parser. It's not usable for production on larger than toy
grammars.

The Python standard library includes a lexer class suitable for a large
class of shell-like syntaxes. As Guido has pointed out, regexps provide
another attack on the problem.
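
Both attacks can be sketched briefly.  The token classes below are an
invented toy language, not anything from this thread; shlex is presumably
the standard-library lexer meant above, and the code assumes Python 3:

```python
import re
import shlex

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/()=]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on unrecognized input."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":  # whitespace separates tokens only
            yield match.lastgroup, match.group()
        pos = match.end()


# Regex-based tokenizing of an expression.
tokens = list(tokenize("x = 3 * (y + 42)"))

# shlex handles shell-like quoting without any hand-written patterns.
words = shlex.split('cp "my file.txt" /tmp')
```

Neither piece checks that the tokens form a sensible sentence; that is
the parser's job.
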
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 17:58:55 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 12:58:55 -0400
Subject: [Python-Dev] Embarassed in Malvern
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> [Eric S. Raymond]
> > Apparently I was hallucinating when I thought these has been released in
> > open source.
> Are you talking about Judy?  If so, the LGPL'ed source isn't at HP, it's at
> SourceForge:

It's been hiding its existence effectively.  I got three pieces of email
from people wondering what I was going to do about the lack of source,
then couldn't find any pointers to the source either on the HP site or
via Google.

Phew.  Well, the interface is still butt-ugly, but that performance...
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 18:03:53 2002
From: (Zack Weinberg)
Date: Wed, 21 Aug 2002 10:03:53 -0700
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

On Wed, Aug 21, 2002 at 02:22:26AM -0400, Eric S. Raymond wrote:
> I'm on it.  The following is not yet working, but it's a straight road to get
> there....
> There is a public spam-checker port.  Your client program sends it
> packets consisting of a list of header token counts.  You
> can send lots of these blocks; each one has to be under the maximum
> atomic-message size for sockets (I think that's 32K).  
> The server accumulates the frequency counts you ship it until you say
> "OK, what is it?"  Does the Bayes test.  Ships you back a result.

My ISP-postmaster friend's reaction to that:

| As far as it goes, yes.  How would it learn?
| On a more mundane note, I'd like to see decoding of base64 in it.
| (Oh, and on a blue-sky note, has anyone taken up Graham's suggestion
| of having one of these things that looks at word pairs instead of
| words?)
| It's neat that ESR saw immediately that the daemon should be
| self-contained, no access to home directories.  SpamAssassin doesn't
| have a simple way of doing that, and [ISP] is modifying it to have
| one -- and you wouldn't believe the resistance to the proposed
| changes from some of the SA developers.  Some of them really seem
| to think that it's better and simpler to store user configuration
| in a database than to have the client send its config file to the
| server along with each message.

I remember you said you didn't want to do base64 decode because it was
too slow?


From  Wed Aug 21 18:13:11 2002
From: (Eric S. Raymond)
Date: Wed, 21 Aug 2002 13:13:11 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <> <> <>
Message-ID: <>

Zack Weinberg <>:
> My ISP-postmaster friend's reaction to that:
> | As far as it goes, yes.  How would it learn?

Your users' mailers would have two delete buttons -- spam and nonspam.
On each delete the message would be shipped to bogofilter, which would
merge the content into its token lists.

> I remember you said you didn't want to do base64 decode because it was
> too slow?

And not necessary.  Base64 spam invariably has telltales that Bayesian
analysis will pick up in the headers and MIME cruft.  A rather large
percentage of it is either big5 or images.
		<a href="">Eric S. Raymond</a>

From  Wed Aug 21 18:15:22 2002
From: (David Abrahams)
Date: Wed, 21 Aug 2002 13:15:22 -0400
Subject: [Python-Dev] More pydoc questions
Message-ID: <0d6101c24936$57d6c5f0$>

I recently added an invocation to help(my_extension_module) to the
Boost.Python test suite, to prove that I can give reasonable help output.
Worked great for me, since I was always running the test from within emacs.
However, some other developer complained that the test required user
intervention to run, since it would prompt at each screenful. So, I changed
it to:

    print pydoc.TextDoc().docmodule(my_extension_module)

Now I get (well, I'm not sure how this will show up in your mailer, but for
me it's full of control characters):




    A simple test module for documentation strings
    Exercised by


    class XX(Boost.Python.instance)
     |  A simple class wrapper around a C++ int


So my question is, is there a way to dump the text help for a module
without prompting and without any extra control characters?


P.S. Another question: the docmodule() function takes two optional
arguments whose role is undocumented. What are they for?

           David Abrahams * Boost Consulting *

From  Wed Aug 21 18:27:26 2002
From: (Neil Schemenauer)
Date: Wed, 21 Aug 2002 10:27:26 -0700
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters wrote:
> the version of this we've got now does update during scoring

Are you planning to check this into the sandbox?


From  Wed Aug 21 18:31:57 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 13:31:57 -0400
Subject: [Python-Dev] PEP 277 (unicode filenames): please review
In-Reply-To: <>
References: <001e01c242e5$49697ff0$bd5d4540@Dell2> <> <> <>
Message-ID: <>

[doing an archeological dig through e-mail]

On Tue, Aug 13, 2002, M.-A. Lemburg wrote:
> At least is good :-) NFC is NFD + canonical composition. Decomposition
> isn't all that hard (using unicodedata.decomposition()). For
> composition the situation is different: not all information is
> available in the unicodedata database (the exclusion list) and
> the database also doesn't provide the reverse mapping from
> decomposed code points to composed one. See the Annexes to the
> tech report to get an impression of just how hard combining is...

In a message just prior to this one, you wrote:

    The recommended way of doing normalization is to go by
    Normalization Form C: Canonical Decomposition,
    followed by Canonical Composition.

So, um, which way is it?
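
The two statements can be checked against each other in later Python
versions, which provide unicodedata.normalize().  A small sketch, assuming
Python 3; the sample character is chosen arbitrarily:

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# NFD: canonical decomposition only.
nfd = unicodedata.normalize("NFD", composed)

# NFC: canonical decomposition followed by canonical composition.
nfc = unicodedata.normalize("NFC", decomposed)

# Composing an already-decomposed string gives the same NFC result.
nfc_of_nfd = unicodedata.normalize("NFC", nfd)
```

Here nfd matches the two-code-point form and both NFC results match the
single code point, which is consistent with "NFC is NFD + canonical
composition".
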
Aahz

From  Wed Aug 21 19:24:51 2002
From: (Stepan Koltsov)
Date: Wed, 21 Aug 2002 22:24:51 +0400
Subject: [Python-Dev] q about default args
Message-ID: <>

Hi, Guido, other Python developers and other subscribers.

First of all, if this question was discussed here or somewhere
else 8086 times, please direct me to discussion archives.
I couldn't guess the keywords to search for in the python-dev archives
as I haven't found the search page where to enter these keywords :-)

The question is: To be or^H^H^H^H^H^H^H^H^H Why not evaluate default
parameters of a function at THE function call, not at function def
(as is done currently)? For example, C++ (a nice language, isn't it? ;-)
) evaluates default parameters at function call.

Example below illustrates the point in mind:
In Python:

def func(l = []):

to make it clearer (though this is not exactly the same):


def func(l = list()):

or, in C++:


void func(list l = list()) {


Implementation details:

Add a flag to the code object, that means "evaluate default args".
Compile default args to small code objects and store them where values
for default args are stored in current Python (i.e. co_consts). When
a function is called, evaluate the default args (if the above flag
is set) in the context of that function. So, for inst., this code is
now possible:


class Tree:
    # this iter walks through a Tree level-wise (i.e. left to right, then down).
    def levelIter(self, nodes=[self]):
                            # ^^^^^^ look here
        # the following is not "mission critical"  :-)
        if len(nodes) == 0:
            return
        for node in nodes:
            yield node
        nodes = reduce(operator.add, [n.children() for n in nodes])
        for node in self.levelIter(nodes):
            yield node


About compatibility: compiled python files stay backward compatible as long as they do not define the mentioned flag.

An alternative way to go (a little example... LOOK ON; PERSONALLY, I LIKE IT A LOT):


def f(x=12+[]):


compiled into something like:

0: LOAD_CONST 1 (12)
3: STORE_FAST 0 (x)
4: # here code of stmts begin

If 'x' was specified, the code is executed from instruction 4 onward.
This should work perfectly, be ideologically correct and, I think, even
be faster than the current interpreter implementation.

Motivation (he-he, the most difficult part of this letter):

1. Try to import this module:

import math
def func(a = map(lambda x: math.sqrt(x)):
# there is no call to func


This code does nothing but define a single function,
but look at the execution time...

2. Currently, default arguments are like static function variables,
defined in the function parameter list! That is wrong.

4. Again: I dislike code like


def f(l=None):
    if l is None:
        l = []


5. I asked my friend (also a big Python fan): why is the current
behaviour correct?  His answer was: "the current behaviour is
correct, because that is the way it was done in the first place :-)
..." I don't see any advantages of the current style, and lack of
advantages is an advantage of the new style :-)

I hope that the current state of things is a result of laziness (or is
it "business"), not sabotage :-) , and not an ideological decision. It
isn't too late to fix Python yet :-) , as when Cpt. J. L. Picard once
again saves the galaxy (this time, not from the evil Borg), it will
be difficult to change self-modifying Python compilers, reconstruct
hardware Python bytecode interpreters and verify terabytes of source
code written in Python  (NOTE: I speak of the not so distant future
:-) )

mailto: Stepan Koltsov <>

From  Wed Aug 21 19:31:46 2002
From: (Jeff Epler)
Date: Wed, 21 Aug 2002 13:31:46 -0500
Subject: [Python-Dev] More pydoc questions
In-Reply-To: <0d6101c24936$57d6c5f0$>
References: <0d6101c24936$57d6c5f0$>
Message-ID: <>

On Wed, Aug 21, 2002 at 01:15:22PM -0400, David Abrahams wrote:
> I recently added an invocation to help(my_extension_module) to the
> Boost.Python test suite, to prove that I can give reasonable help output.
> Worked great for me, since I was always running the test from within emacs.
> However, some other developer complained that the test required user
> intervention to run, since it would prompt at each screenful. So, I changed
> it to:
>     print pydoc.TextDoc().docmodule(my_extension_module)
> Now I get (well, I'm not sure how this will show up in your mailer, but for
> me it's full of control characters):

In my mailer, X^HX is displayed as a bold X.  It's an old trick of
impact printers and interpreted by fine unix screen pagers such as
"less".

I'm not sure how to disable it.  However,
    re.sub("\10.", "", s)
should remove it from "s" without hurting anything else.  I don't know
if pydoc produces underlines.  If underlines
are expressed as X^H_, then it'll convert those to regular text too.
But if underlines are _^HX, you'll want to use
    re.sub(".\10", "", s)
instead.  That'll work for both bold and underline.
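
A quick sketch of that substitution on made-up sample strings, assuming
Python 3.  Note that "\10" (octal) and "\b" are the same backspace
character; as far as I know, current pydoc also ships a helper,
pydoc.plain(), built on the same pattern:

```python
import re


def strip_overstrike(text):
    """Drop every 'character + backspace' pair, keeping what was struck last."""
    # In a non-raw string literal, "\b" is the backspace character (chr(8)).
    return re.sub(".\b", "", text)


# Hypothetical samples of impact-printer style emphasis.
bold = "N\bNA\bAM\bME\bE"    # bold "NAME": each letter struck twice
underlined = "_\bf_\bo_\bo"  # underlined "foo": underscore struck first

plain_bold = strip_overstrike(bold)
plain_under = strip_overstrike(underlined)
```

Removing the character before each backspace (rather than the one after
it) is what lets one pattern cover both the X^HX and the _^HX forms.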


From  Wed Aug 21 19:45:35 2002
From: (David Abrahams)
Date: Wed, 21 Aug 2002 14:45:35 -0400
Subject: [Python-Dev] More pydoc questions
References: <0d6101c24936$57d6c5f0$> <>
Message-ID: <0f8501c24942$f0e67720$>

From: "Jeff Epler" <>

> On Wed, Aug 21, 2002 at 01:15:22PM -0400, David Abrahams wrote:
> > I recently added an invocation to help(my_extension_module) to the
> > Boost.Python test suite, to prove that I can give reasonable help output.
> > Worked great for me, since I was always running the test from within emacs.
> > However, some other developer complained that the test required user
> > intervention to run, since it would prompt at each screenful. So, I changed
> > it to:
> >
> >     print pydoc.TextDoc().docmodule(my_extension_module)
> >
> > Now I get (well, I'm not sure how this will show up in your mailer, but for
> > me it's full of control characters):
> In my mailer, X^HX is displayed as a bold X.  It's an old trick of
> impact printers and interpreted by fine unix screen pagers such as
> "less".

Figured it was something like that.

> I'm not sure how to disable it.  However,
>     re.sub("\10.", "", s)
> should remove it from "s" without hurting anything else.  I don't know
> if pydoc produces underlines.  If underlines
> are expressed as X^H_, then it'll convert those to regular text too.
> But if underlines are _^HX, you'll want to use
>     re.sub(".\10", "", s)
> instead.  That'll work for both bold and underline.

I didn't want to resort to that, but then I also thought it would be uglier
than it turned out to be.


           David Abrahams * Boost Consulting *

From  Wed Aug 21 19:59:37 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 21 Aug 2002 14:59:37 -0400
Subject: [Python-Dev] Re: More pydoc questions
In-Reply-To: <0d6101c24936$57d6c5f0$>
References: <0d6101c24936$57d6c5f0$>
Message-ID: <>


[David Abrahams]

> Now I get (well, I'm not sure how this will show up in your mailer, but for
> me it's full of control characters):

It shows nicely here, using Gnus for a mail reader.  See:

[inline image attachment: xx.png]

François Pinard


From  Wed Aug 21 20:03:50 2002
From: (Jeremy Hylton)
Date: Wed, 21 Aug 2002 15:03:50 -0400
Subject: [Python-Dev] q about default args
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "SK" == Stepan Koltsov <> writes:

  SK> 2. Currently, default arguments are like static function
  SK>    variables,
  SK> defined in the function parameter list! That is wrong.

No, it's not.  The Python language definition is completely clear on
the semantics of default arguments.  They are evaluated at function
definition time and stored like static function variables.
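
The semantics described here can be seen in a few lines.  This is only an
illustrative sketch with invented function names, written for Python 3
(where the stored defaults are visible as __defaults__):

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits 'bucket' sees the same list object.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # The None sentinel asks for a new list on each call instead.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)   # same object as 'first', now holding 1 and 2

fresh_first = append_fresh(1)
fresh_second = append_fresh(2)  # a separate list each time
```

The shared list lives on the function object itself, which is why it
behaves like a static variable.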

  SK> 4. Again: I dislike code like

  SK> ---

  SK> def f(l=None):
  SK>     if l is None:
  SK>         l = []
  SK>     ...

  SK> ===

I don't see anything wrong with this code.

  SK> 5. I asked my friend (also big Python fan): why the current
  SK> behaviour is correct?  his answer was: "the current behaviour is
  SK> correct, because that is the way it was done in the first place
  SK> :-) ..." I don't see any advantages of the current style, and
  SK> lack of advantages is advantage of new style :-)

Even if I liked the semantics you propose, it would create enormous
pain to change the language semantics here.

You'll have to work a lot harder on motivation if you want us to fix
something that isn't broken :-).


From  Wed Aug 21 20:19:27 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 15:19:27 -0400
Subject: [Python-Dev] Backwards compatiblity
In-Reply-To: Your message of "Wed, 21 Aug 2002 12:00:43 EDT."
References: <> <001a01c2492a$8f27cec0$9feb7ad1@othello>
Message-ID: <>

>     RH> Once, it is firmed-up a bit, how about putting on
>     RH> or in the Vaults of Parnassus?
> Or making a nice little distutils package available on SF?

Sorry, I'm not interested.  It's a standard library module for Python
2.3.  Everything else is a distraction from my POV.

--Guido van Rossum (home page:

From  Wed Aug 21 20:28:51 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 15:28:51 -0400
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: Your message of "Wed, 21 Aug 2002 12:54:10 EDT."
References: <> <15715.941.27029.778363@gargle.gargle.HOWL> <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff> <> <15715.43805.195606.442523@gargle.gargle.HOWL> <>
Message-ID: <>

> There are several `parser generator' tools that will compile a grammar
> specification to a parser; the best-known one is Bison, an open-source
> implementation of the classic Unix tool YACC (Yet Another Compiler
> Compiler).
> Python has its own parser generator, SPARK.  Unfortunately, while
> SPARK is quite powerful (that is, good for handling ambiguities in the
> spec), the Earley algorithm it uses gives O(n**3) performance in the
> generated parser. It's not usable for production on larger than toy
> grammars.
> The Python standard library includes a lexer class suitable for a large
> class of shell-like syntaxes. As Guido has pointed out, regexps provide
> another attack on the problem.

SPARK can hardly lay claim to the fame of being "Python's own parser
generator".  While it's a parser generator for Python programs and
itself written in Python, it is not distributed with Python.
"Python's own" would be pgen, which lives in the Parser subdirectory
of the Python source tree.  Pgen is used to parse the Python source
code and construct a parse tree out of it.  As parser generators go,
pgen is appropriately (and pythonically) stupid -- its power is
restricted to that of LL(1) languages, equivalent to recursive-descent
parsers.  Its only interesting feature may be that it uses a
regex-like notation to feed it the grammar for which to generate a
parser.  (That's what the *, ?, [], | and () meta-symbols in the file
Grammar/Grammar are for.)

I would note that for small languages (much smaller than Python),
writing a recursive-descent parser by hand is actually one of the most
effective ways of creating a parser.  I recently had the pleasure to
write a recursive-descent parser for a simple Boolean query language;
there was absolutely no need to involve a big gun like a parser
generator.  OTOH I would not consider writing a recursive-descent
parser by hand for Python's Grammar -- that's why I created pgen in
the first place. :-)
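
To give a feel for the size of such a parser, here is a sketch for an
invented Boolean query grammar (not the actual query language mentioned
above), assuming Python 3:

    query  := term ('OR' term)*
    term   := factor ('AND' factor)*
    factor := 'NOT' factor | '(' query ')' | WORD

```python
import re


def tokenize(text):
    # Parentheses are tokens of their own; anything else is a word.
    return re.findall(r"\(|\)|[^\s()]+", text)


class Parser:
    """One method per grammar rule; each returns a nested-tuple tree."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def query(self):
        node = self.term()
        while self.peek() == "OR":
            self.advance()
            node = ("OR", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == "AND":
            self.advance()
            node = ("AND", node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token == "NOT":
            return ("NOT", self.factor())
        if token == "(":
            node = self.query()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or token in (")", "AND", "OR"):
            raise SyntaxError("unexpected %r" % (token,))
        return ("WORD", token)


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.query()
    if parser.peek() is not None:
        raise SyntaxError("trailing input: %r" % (parser.peek(),))
    return tree


tree = parse("spam AND NOT (ham OR eggs)")
```

One token of lookahead (peek) is enough to pick the next rule, which is
what makes a grammar like this comfortable to parse by hand.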

Another note for Aahz: when it comes to scanning data that's not
really a programming language, e.g. email messages, the words parsing,
scanning, lexing and tokenizing are often used pretty much
interchangeably.

--Guido van Rossum (home page:

From  Wed Aug 21 20:32:08 2002
From: (Guido van Rossum)
Date: Wed, 21 Aug 2002 15:32:08 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: Your message of "Wed, 21 Aug 2002 13:13:11 EDT."
References: <> <> <> <> <> <>
Message-ID: <>

> > I remember you said you didn't want to do base64 decode because it was
> > too slow?
> And not necessary.  Base64 spam invariably has telltales that Bayesian
> analysis will pick up in the headers and MIME cruft.  A rather large
> percentage of it is either big5 or images.

I'd be curious to know if that will continue to be true in the future.
At least one of my non-tech friends sends email that's exclusively
HTML (even though the content is very lightly marked-up plain text),
from a hotmail account.  Spam could easily have the same origin, but
the HTML contents would be very different.

--Guido van Rossum (home page:

From  Wed Aug 21 20:50:29 2002
From: (Zack Weinberg)
Date: Wed, 21 Aug 2002 12:50:29 -0700
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
References: <> <> <> <015901c248eb$97772ec0$0900a8c0@spiff> <018e01c248ed$9ffc0d20$0900a8c0@spiff> <> <15715.43805.195606.442523@gargle.gargle.HOWL> <> <> <>
Message-ID: <>

On Wed, Aug 21, 2002 at 03:28:51PM -0400, Guido van Rossum wrote:
> I would note that for small languages (much smaller than Python),
> writing a recursive-descent parser by hand is actually one of the most
> effective ways of creating a parser.  I recently had the pleasure to
> write a recursive-descent parser for a simple Boolean query language;
> there was absolutely no need to involve a big gun like a parser
> generator.  OTOH I would not consider writing a recursive-descent
> parser by hand for Python's Grammar -- that's why I created pgen in
> the first place. :-)

You might be interested to know that over in GCC land we're changing
the C++ front end to use a hand-written recursive descent parser.
It's not done yet, but we expect it to be easier to maintain, faster,
and better at generating diagnostics than the existing yacc-based
parser.


From  Wed Aug 21 20:53:42 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 15:53:42 -0400
Subject: [Python-Dev] Backwards compatiblity
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    >> RH> Once, it is firmed-up a bit, how about putting on
    >> RH> or in the Vaults of Parnassus?
    >> Or making a nice little distutils package available on SF?

    GvR> Sorry, I'm not interested.  It's a standard library module
    GvR> for Python 2.3.  Everything else is a distraction from my
    GvR> POV.

I didn't mean to imply you should do it.  But it would be easy enough
to do for anybody who was sufficiently motivated.


From  Wed Aug 21 20:59:42 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 15:59:42 -0400
Subject: [Python-Dev] Parsing vs. lexing.
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    GvR> Another note for Aahz: when it comes to scanning data that's
    GvR> not really a programming language, e.g. email messages, the
    GvR> words parsing, scanning, lexing and tokenizing are often used
    GvR> pretty much interchangeably.

True, although even stuff like email messages are defined by a formal
grammar, i.e. RFC 2822.  email.Generator of course doesn't strictly
use that grammar because it's trying to allow a much greater leniency
in its input than a language compiler would.  But note that approaches
like Emacs's mail-extr.el package do in fact try to do more strict
parsing based on the grammar.


From  Wed Aug 21 21:23:04 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 16:23:04 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
Message-ID: <>

| As far as it goes, yes.  How would it learn?

I have some ideas about how you could hook this into Mailman to do
community/membership assisted learning.  Understanding that people
will be highly motivated to inform you about spam but not about good
messages, you essentially queue a copy of a random sampling of
messages for a few days.  Members can let the list admin know about
leaked spam (via a url or -spam address, or whatever) and after the
list admin verifies it, this trains the system on that spam.  If no
feedback on a message happens after a few days, you train the system
on that known good message.

You need list admin verification to avoid attack vectors (I get mad at
Guido so I -- a normal user -- label all his messages as spam).

| On a more mundane note, I'd like to see decoding of base64 in it.
| (Oh, and on a blue-sky note, has anyone taken up Graham's suggestion
| of having one of these things that looks at word pairs instead of
| words?)
| It's neat that ESR saw immediately that the daemon should be
| self-contained, no access to home directories.  SpamAssassin doesn't
| have a simple way of doing that, and [ISP] is modifying it to have
| one -- and you wouldn't believe the resistance to the proposed
| changes from some of the SA developers.  Some of them really seem
| to think that it's better and simpler to store user configuration
| in a database than to have the client send its config file to the
| server along with each message.

>>>>> "ZW" == Zack Weinberg <> writes:

    ZW> I remember you said you didn't want to do base64 decode
    ZW> because it was too slow?

But there might be some interesting, integrated ways around that.  Say
for example, you take a Python-enabled mail server, parse the message
into its decoded form early (but not before low level SMTP-based
rejections) and then pass that parsed and decoded message object tree
around to all the other subsystems that are interested, e.g. the Bayes
filter, and Mailman.  You can at least amortize the cost of parsing
and decoding once for the rest of the lifetime of that message on your

I think we have all the pieces in place to play with this approach on


From  Wed Aug 21 21:21:18 2002
From: (Jeremy Hylton)
Date: Wed, 21 Aug 2002 16:21:18 -0400
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "ZW" == Zack Weinberg <> writes:

  ZW> You might be interested to know that over in GCC land we're
  ZW> changing the C++ front end to use a hand-written recursive
  ZW> descent parser.  It's not done yet, but we expect it to be
  ZW> easier to maintain, faster, and better at generating diagnostics
  ZW> than the existing yacc-based parser.

LCC also uses a hand-written recursive descent parser, for exactly the
reasons you mention.

Thought I'd also mention a neat new paper about an old algorithm for
recursive descent parsers with backtracking and unlimited lookahead.

Packrat Parsing: Simple, Powerful, Lazy, Linear Time, Bryan Ford. ICFP 2002
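
The core idea, memoizing each (rule, position) result so that backtracking
never re-parses the same spot, can be sketched as below.  This is not code
from the paper; the toy arithmetic grammar and the use of
functools.lru_cache as the memo table are assumptions made for the
illustration (Python 3):

```python
from functools import lru_cache


def make_parser(text):
    """Return the start rule; rules map a position to (value, next_pos) or None."""

    @lru_cache(maxsize=None)
    def atom(pos):
        # Atom <- '(' Sum ')' / digit+
        if pos < len(text) and text[pos] == "(":
            inner = total(pos + 1)
            if inner and inner[1] < len(text) and text[inner[1]] == ")":
                return inner[0], inner[1] + 1
            return None
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (int(text[pos:end]), end) if end > pos else None

    @lru_cache(maxsize=None)
    def product(pos):
        # Product <- Atom '*' Product / Atom
        first = atom(pos)
        if first and first[1] < len(text) and text[first[1]] == "*":
            rest = product(first[1] + 1)
            if rest:
                return first[0] * rest[0], rest[1]
        return atom(pos)  # second alternative, answered from the memo table

    @lru_cache(maxsize=None)
    def total(pos):
        # Sum <- Product '+' Sum / Product
        first = product(pos)
        if first and first[1] < len(text) and text[first[1]] == "+":
            rest = total(first[1] + 1)
            if rest:
                return first[0] + rest[0], rest[1]
        return product(pos)  # second alternative, answered from the memo table

    return total


result = make_parser("2*(3+4)+5")(0)
```

Each rule is evaluated at most once per input position, which is the
source of the linear-time claim, at the cost of keeping the memo table in
memory.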


From  Wed Aug 21 21:24:34 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 16:24:34 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
Message-ID: <>

>>>>> "ESR" == Eric S Raymond <> writes:

    >> My ISP-postmaster friend's reaction to that: | As far as it
    >> goes, yes.  How would it learn?

    ESR> Your users' mailers would have two delete buttons -- spam and
    ESR> nonspam.  On each delete the message would be shipped to
    ESR> bogofilter, which would merge the content into its
    ESR> token lists.

You need some kind of list admin oversight or your system is open to
attack vectors on individual posters.


From  Wed Aug 21 21:25:27 2002
From: (Alex Martelli)
Date: Wed, 21 Aug 2002 22:25:27 +0200
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Wednesday 21 August 2002 09:50 pm, Zack Weinberg wrote:
> You might be interested to know that over in GCC land we're changing
> the C++ front end to use a hand-written recursive descent parser.
> It's not done yet, but we expect it to be easier to maintain, faster,
> and better at generating diagnostics than the existing yacc-based
> parser.

Interesting!  This reminds me of a long-ago interview with Borland's techies
about how they had managed to create Turbo Pascal, which ran well in
a 64K (K, not M-) one-floppy PC, when their competitor, Microsoft Pascal, 
forced one to do a lot of disc-jockeying even with 256K and 2 floppies.

Basically, their take was "we just did everything by the Dragon Book -- except
that the parser is a hand-written recursive descent parser [Aho &c being
adamant defenders of Yacc & the like], which buys us a lot" ...


From  Wed Aug 21 21:25:41 2002
From: (Barry A. Warsaw)
Date: Wed, 21 Aug 2002 16:25:41 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
References: <>
Message-ID: <>

>>>>> "NS" == Neil Schemenauer <> writes:

    NS> Are you planning to check this into the sandbox?


From  Wed Aug 21 21:42:10 2002
From: (Noah)
Date: Wed, 21 Aug 2002 13:42:10 -0700
Subject: [Python-Dev] RE: Python-Dev digest, Vol 1 #2574 - 14 msgs
In-Reply-To: <>
Message-ID: <>

On Wed, 21 Aug 2002 Aahz wrote:
> On Wed, Aug 21, 2002, Skip Montanaro wrote: 
> > parsing != tokenizing. ;-)
> > Regular expressions are great for tokenizing (most of the time).
> Ah.  Here we see one of the little drawbacks of not finishing my CS
> degree.  ;-)  Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing, particularly in the
> context of general string processing (e.g. XML) rather than the arcane
> art of compiler technology?
> Aahz (           <*>

It's been 8 or 9 years since I took a compiler design class, so this info
is probably WRONG, but as luck would have it I've been reviewing
some of this stuff lately so I can feel some of the old neuron paths
warming up. 

Basically the distinction between a lexer and a parser refers to 
the complexity of symbol inputs that they can recognize.
A lexer (AKA tokenizer) is modeled by a finite state machine (FSM).
These don't have a stack or memory, just a state. They are not
good for things that require nested structure.

A parser recognizes more complex structures. They are good for
things like XML and source code where you have NESTED tree structures
(familiarly known as SYNTAX). If you have to remember how many levels deep 
you are in something then it means you need a parser. Technically a parser is 
something that can be defined by a context free grammar and can 
be recognized by a Push Down Automaton (PDA). A PDA is a FSM with memory.
A PDA has at least one stack. The "context free" on the grammar means
that you can unambiguously recognize any section of the input stream 
no matter what came earlier in the stream. ... Sometimes real grammars
are a little dirty and context does matter, but only within a small window.
That means you might have to "look ahead" or behind a few symbols to
eliminate some ambiguity. The amount that you have to look ahead sets
the complexity of the grammar. This is called LALR (look ahead left reduce).
So a simple grammar with no look ahead is called LALR(0). A slightly more 
complex grammar that requires 1 symbol look-ahead is called LALR(1).
We like most parsing to be simple. I think languages like C and Python
are LALR(0). I think FORTRAN does require a look-ahead, so
it's LALR(1). I have no idea what it must require to parse Perl.
[Again, note: I'm sure I probably got some details wrong here.]

If you go further in complexity and you want to handle evaluating expressions 
then you need a Turing Machine (TM). These are problems where
a context-free grammar cannot be tweaked with a look-ahead.
Another way to think about it is if your input is so complex 
that it must be described algorithmically  then 
you need a TM. For example neither a FSM nor a PDA can 
recognize a non-rational number like SQRT(2) or pi. 
Nor can they recognize the truthfulness of expressions like "2*5=10" 
(although a PDA can recognize that it's a valid expression).

RegExs are hybrids of FSM and PDA and are fine for ad hoc lexing. 
They are not very good for parsing. I think older style RegExs started
life off as pure FSM, but newer flavors made popular by Perl added memory and
became PDAs... or something like that. But somehow they are limited and
not quite as powerful as a real PDA, so they can't parse.

Traditionally C programmers used a combination of LEX and YACC
(or GNU's versions of FLEX and Bison) to build parsers. 
You really only need YACC, but the problem is so much simpler if 
the input stream is tokenized before you try to parse it, 
which is why you also use LEX.

Hopefully that captures the essence if not the actual facts.
But then I'm trying to compress one year of computer science study
into this email :-)

If you are interested I just wrote a little PDA class for Python.
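
The PDA class mentioned above is not included in this message.  As a
stand-in, here is a minimal sketch of the "FSM plus a stack" idea,
assuming the task is only to check that brackets nest properly; the
names `PAIRS` and `balanced` are made up for the illustration.

```python
# Closing bracket -> the opening bracket it must match.
PAIRS = {')': '(', ']': '[', '}': '{'}


def balanced(text):
    """Return True if every bracket in `text` is properly nested.

    The stack is the "memory" that a plain finite state machine lacks:
    it records how deep we are and which bracket must close next.
    """
    stack = []
    for ch in text:
        if ch in '([{':
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers mean unbalanced input
```

A recognizer with a fixed number of states and no stack cannot do this
for arbitrary nesting depth, which is the distinction the post is
drawing.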


From  Wed Aug 21 22:05:58 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 17:05:58 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
Message-ID: <>

> the version of this we've got now does update during scoring

[Neil Schemenauer]
> Are you planning to check this into the sandbox?

Update-during-scoring was already in the initial version.  This works with a
Python dict, though (which Barry pickles and unpickles across runs), not
with a persistent database (like ZODB).  Changes to use a ZODB BTree would
be easy, but not yet most interesting to me.  There are many more basic open
questions, like which kinds of tokenization ("feature extraction") do and
don't work.  BTW, that's why the WordInfo records have a .killcount
attribute -- the data will tell us which ways work best.

From  Wed Aug 21 22:14:42 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 21 Aug 2002 17:14:42 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <15715.941.27029.778363@gargle.gargle.HOWL>
References: <>
Message-ID: <>

[Skip Montanaro]

> The SpamAssassin folks are starting to look at Flex for much faster
> regular expression matching in situations where large numbers of static
> re's must be matched.

This problem was also vivid in `procmail'-based SPAM filtering, as I observed
it many years ago, and I remembered having quite lurked on the side of Flex
at the time.  I finally solved my own problem by writing a Python program
able to sustain thousands of specialised rules rather speedily, proving once
again that algorithmic approaches are often much slower than languages :-).

> I wonder if using something like SciPy's weave tool would make it
> (relatively) painless to incorporate fairly high-speed scanners into
> Python programs.

For a pure Python solution, PLEX could also be an avenue.  It compiles a fast
automaton, similar to what Flex does, from a grammar describing all tokens.
I tried PLEX recently and was satisfied, even if not astonished by the speed.
Also wanting fast parsing, I first used various Python heuristics, but the
result was not solid enough to my taste.  So I finally opted for Bison,
and once on that road, it was just natural to rewrite the PLEX part in Flex.

[Guido van Rossum]

> I think you're exaggerating the problem, or at least underestimating
> the re module.  The re module is pretty fast!

There are limits to what a backtracking matcher could do, speed-wise,
when there are many hundreds of alternated patterns.

[Eric S. Raymond]

> It's pretty simple, actually.  Lexing *is* tokenizing; it's breaking the
> input stream into appropriate lexical units.  [...]  Parsing, on the
> other hand, consists of attempting to match your input to a grammar.
> The result of a parse is typically either "this matches" or to throw
> an error.  There are two kinds of parsers -- event generators and
> structure builders.

Maybe lexing matches a grammar to a point, generates an event according to
the match, advances the cursor and repeats until the end-of-file is met.
Typically, parsing matches a grammar at point and does not repeat.

In some less usual applications, there may be successive lexing stages,
each taking its input from the output of the previous one.  Parsing may
be driven in stages too.  I guess that we use the word `lexer' when the
output structure is more flat, and `parser' when the output structure is
more sexy!  There are cases where the distinction is almost fuzzy.

[Skip Montanaro]

>     Guido> I haven't given up on the re module for fast scanners (see
>     Guido> Tim's note on the speed of tokenizing 20,000 messages in
>     Guido> mere minutes).  Note that the Bayes approach doesn't *need*
>     Guido> a trick to apply many regexes in parallel to the text.

> Right.  I'm thinking of it in situations where you do need such tricks.
> SpamAssassin is one such place.  I think Eric has an application (quickly
> tokenizing the data produced by an external program, where the data can
> run into several hundreds of thousands of lines) where this might be
> beneficial as well.

Heap queues could be useful here as well.  Consider you have hundreds of
regexps to match in parallel on a big text.  When the regexp is not too
complex, it is more likely that a fast searcher exists for it.
Let's build one searcher per regexp.  Always beginning from the start of
text, find the spot where each regexp first matches.  Build a heap queue
using the position as key, and both the regexp and match data as value.

The top of the heap is the spot of your first match, process it, and while
removing it from the heap, search forward from that spot for the same regexp,
and add any result back to the heap.  Merely repeat until the heap is empty.

I'm not fully sure, but I have the intuition the above could be advantageous.
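
A minimal sketch of the scheme just described, assuming the standard
`re` and `heapq` modules stand in for the "fast searchers", and assuming
each regexp resumes from the end of its previous match (so matches of
one regexp do not overlap).  The function name `scan` is hypothetical.

```python
import heapq
import re


def scan(text, patterns):
    """Yield (position, name, match) for all patterns, in text order.

    `patterns` is a list of (name, regexp_source) pairs.  The heap holds
    at most one pending match per regexp, keyed by start position.
    """
    compiled = [(name, re.compile(source)) for name, source in patterns]
    heap = []
    for index, (name, rx) in enumerate(compiled):
        m = rx.search(text)
        if m:
            # `index` breaks ties so match objects are never compared.
            heap.append((m.start(), index, m))
    heapq.heapify(heap)

    while heap:
        start, index, m = heapq.heappop(heap)
        name, rx = compiled[index]
        yield start, name, m
        # Resume after this match; step past an empty match so the
        # search always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
        if pos <= len(text):
            nxt = rx.search(text, pos)
            if nxt:
                heapq.heappush(heap, (nxt.start(), index, nxt))
```

For example, `scan("ab 12 cd 3", [("num", r"\d+"), ("word", r"[a-z]+")])`
produces the word and number matches interleaved by position.  Whether
this beats one big alternation depends on how sparse the matches are.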

François Pinard

From  Wed Aug 21 22:29:33 2002
From: (Jonathan Riehl)
Date: Wed, 21 Aug 2002 16:29:33 -0500 (CDT)
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
Message-ID: <Pine.BSF.4.33.0208211601190.51337-100000@localhost>

On Wed, 21 Aug 2002, Guido van Rossum wrote:

> I would note that for small languages (much smaller than Python),
> writing a recursive-descent parser by hand is actually one of the most
> effective ways of creating a parser.  I recently had the pleasure to
> write a recursive-descent parser for a simple Boolean query language;
> there was absolutely no need to involve a big gun like a parser
> generator.  OTOH I would not consider writing a recursive-descent
> parser by hand for Python's Grammar -- that's why I created pgen in
> the first place. :-)

As per Zack's comments, I think this is pretty funny.  I have just spent
more time trying to expose pgen to the interpreter than I took to write a
R-D parser for Python 1.3 (granted, once Fred's parser module came around,
I felt a bit silly).  Considering the scope of my parser generator
integration wishlist, having GCC move to a hand coded recursive descent
parser is going to make my head explode.  Even TenDRA
( used a LL(n) parser generator, despite its highly
tweaked lookahead code.  So now I'm going to have to extract grammars from
call trees?  As if the 500 languages problem isn't already intractable,
there are going to be popular language implementations that don't even
bother with an abstract syntax specification!?  (Stop me from further
hyperbole if I am incorrect.)

No wonder there are no killer software engineering apps.  Maybe I should
just start writing toy languages for kids...


From  Wed Aug 21 23:56:41 2002
From: (Delaney, Timothy)
Date: Thu, 22 Aug 2002 08:56:41 +1000
Subject: [Python-Dev] PEP 218 (sets); moving to Lib
Message-ID: <>

> From: Guido van Rossum []
> > Hmm ... is there a case that NotImplementedError should be a
> > subclass of TypeError? Conceptually it would make sense (this *type*
> > does not implement this method).
> I think you're overthinking this.  NotImplementedError is fine for
> code that wants to send that particular message to the user.  We're
> playing with TypeError here because we're trying to be close to the
> metal.

I'd be willing to concede that I was overthinking, except that it took
surprisingly little thought for the connection to be made ;)

Tim Delaney

From  Thu Aug 22 00:20:06 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 11:20:06 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <1029917815.581.3.camel@winterfell>
Message-ID: <>

Martin Sjögren <>:

> Uhm, what about + and juxtaposition? They are quite common at least
> here in Sweden, for boolean algebra.

They're not normally used for sets, though, in my
experience (despite the fact that set theory is
a boolean algebra:-).

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 00:25:07 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 11:25:07 +1200 (NZST)
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

"Eric S. Raymond" <>
Subject: Re: [Python-Dev] Re: PEP 218 (sets):

> Is it + for disjunction and juxtaposition for conjunction, or the other
> way around?

+ is 'or' and juxtaposition (or sometimes a dot) is 'and'
(I prefer those words because they're shorter than
'disjunction' and 'conjunction', and I can remember which
is which:-).

These are probably where Wirth got the idea of using
+ and * from in Pascal -- 'or' is considered the
'addition' operator in boolean algebra, and 'and'
the 'multiplication' operator. (Although the two are
actually completely symmetrical in their algebraic
properties, so it's an arbitrary choice.)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 00:31:44 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 21 Aug 2002 19:31:44 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <>

[François Pinard]

> I finally solved my own problem by writing a Python program able to
> sustain thousands of specialised rules rather speedily, proving once
> again that algorithmic approaches are often much slower than languages :-).

Sorry, my English is so unclear!  I meant that people sometimes say Python is
slow, yet because it allows clearer algorithms, one ends up with more speed
in Python than other solutions in supposedly faster languages.  For these
other solutions, the slowness comes from the algorithms.  People are often
tempted to benchmark languages, while they should rather benchmark ideas. :-)

François Pinard

From  Thu Aug 22 00:40:44 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 11:40:44 +1200 (NZST)
Subject: [Python-Dev] More pydoc questions
In-Reply-To: <0d6101c24936$57d6c5f0$>
Message-ID: <>

David Abrahams <>:

>    docstring_ext

That looks like it's designed to produce bold characters
on a mechanical-impact printing device. Which is surely
an anachronism in this day and age -- it's not going to
work on most of the printers in use nowadays, is it?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 00:54:07 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 21 Aug 2002 19:54:07 -0400
Subject: [Python-Dev] Re: More pydoc questions
In-Reply-To: <>
References: <>
Message-ID: <>

[Greg Ewing]

> David Abrahams <>:

> >    docstring_ext

> That looks like it's designed to produce bold characters
> on a mechanical-impact printing device. Which is surely
> an anachronism in this day and age -- it's not going to
> work on most of the printers in use nowadays, is it?

It works pretty well, as many printer filters know how to interpret such
overstrike when meant to bold or underline.  Some of these filters even
know how to combine diacritics over/under letters.  For glass screens,
`less' also does the proper thing.

A few years ago, I had to write a tool that should underline and bold, and
after some looking around, found out that using overstrike was the most
versatile and supported way to proceed, however anachronic it may look.
Of course, we could resort to bigger hammers, like Docbook or XML, and
converters of all sorts, but there is also a place for simple things! :-)

François Pinard

From  Thu Aug 22 01:03:05 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 12:03:05 +1200 (NZST)
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
Message-ID: <>

Aahz <>:

> Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing

Lexical analysis, otherwise known as "lexing" or
"tokenising", is the process of splitting the input
up into a sequence of "tokens", such as (in the case
of a programming language) identifiers, operators, 
string literals, etc.

Parsing is the next higher level in the process,
which takes the sequence of tokens and recognises
language constructs -- statements, expressions, etc.

> particularly in the context of general string processing (e.g. XML)
> rather than the arcane art of compiler technology?

The lexing and parsing part of compiler technology isn't really any
more arcane than it is for XML or anything else -- exactly the same
principles apply.
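
To make the two levels concrete, here is a small sketch: a regexp-based
lexer feeding a hand-written recursive descent parser.  The Boolean
query grammar and all names below are assumptions chosen for the
example, not anything from the thread.

```python
import re

TOKEN = re.compile(
    r'(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WORD>\w+)|(?P<SPACE>\s+)|(?P<BAD>.)')


def tokenize(text):
    """Lexing: split the input into (kind, value) tokens."""
    for m in TOKEN.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == 'SPACE':
            continue
        if kind == 'BAD':
            raise SyntaxError('unexpected character %r' % value)
        if kind == 'WORD' and value.lower() in ('and', 'or', 'not'):
            kind = value.upper()  # keywords get their own token kind
        yield kind, value


def parse(text):
    """Parsing: build a tree from the token sequence.

    query  := term ('or' term)*
    term   := factor ('and' factor)*
    factor := 'not' factor | '(' query ')' | WORD
    """
    tokens = list(tokenize(text)) + [('END', '')]
    pos = 0

    def peek():
        return tokens[pos][0]

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(kind):
        if peek() != kind:
            raise SyntaxError('expected %s, got %s' % (kind, peek()))
        return advance()

    def query():
        node = term()
        while peek() == 'OR':
            advance()
            node = ('or', node, term())
        return node

    def term():
        node = factor()
        while peek() == 'AND':
            advance()
            node = ('and', node, factor())
        return node

    def factor():
        if peek() == 'NOT':
            advance()
            return ('not', factor())
        if peek() == 'LPAREN':
            advance()
            node = query()
            expect('RPAREN')
            return node
        return ('word', expect('WORD')[1])

    tree = query()
    expect('END')
    return tree
```

The lexer knows nothing about nesting; the parser never looks at
individual characters.  Each grammar rule becomes one function, which is
what makes this style easy to tweak by hand.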

It's more a matter of how deeply you want to get into the theory. The
standard text on this stuff around here seems to be Aho, Hopcroft and
Ullman, "The Theory of Parsing, Translation and Compiling", but you
might find that a bit much if all you want to do is parse XML. It
will, however, give you a good grounding in the theory of REs, various
classes of grammar, different parsing techniques, etc., after which
writing an XML parser will seem like quite a trivial task. :-)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 02:13:02 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 21:13:02 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> ...
> Lexers are painful in Python.

This is so.

> They hit the language in a weak spot created by the immutability of
> strings.

But you lost me here -- I don't see a connection between immutability and
either ease of writing lexers or speed of lexers.  Indeed, lexers are (IME)
more-or-less exactly as painful and slow written in Perl, where strings are

Seems more to me that lexing is convenient and fast only when expressed in a
language specifically designed for writing lexers, and backed by a
specialized execution engine that knows a great deal about fast
state-machine implementation.  Like, say, Flex.  Lexing is also clumsy and
slow in SNOBOL4 and Icon, despite that they excel at higher-level
pattern-matching tasks.  IOW, lexing is in a world by itself, almost nothing
is good at it, and the few things that shine at it don't do anything else.

    y'rs  - tim

From  Thu Aug 22 02:21:19 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 21:21:19 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <3D635DF8.9480.90CBCDDD@localhost>
Message-ID: <>

[Gordon McMillan]
> mxTextTools lets (encourages?) you to break all
> the rules about lex -> parse. If you can (& want to)
> put a good deal of the "parse" stuff into the scanning
> rules, you can get a speed advantage. You're also
> not constrained by the rules of BNF, if you choose
> to see that as an advantage :-).
> My one successful use of mxTextTools came after
> using SPARK to figure out what I actually needed
> in my AST, and realizing that the ambiguities in the
> grammar didn't matter in practice, so I could produce
> an almost-AST directly.

I don't expect anyone will have much luck writing a fast lexer using
mxTextTools *or* Python's regexp package unless they know quite a bit about
how each works under the covers, and about how fast lexing is accomplished
by DFAs.  If you know both, you can build a DFA by hand and painfully
instruct mxTextTools in the details of its construction, and get a very fast
tokenizer (compared to what's possible with re), regardless of the number of
token classes or the complexity of their definitions.  Writing to
mxTextTools directly is a lot like writing in an assembly language for a
character-matching machine, with all the pains and potential joys that
implies.  If I were Eric, I'd use Flex <wink>.

From  Thu Aug 22 02:35:23 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 21:35:23 -0400
Subject: [Python-Dev] More pydoc questions
In-Reply-To: <>
Message-ID: <>

[David Abrahams]
>    docstring_ext

Note that pydoc contains this handy <wink> function:

def plain(text):
    """Remove boldface formatting from text."""
    return re.sub('.\b', '', text)

Note that it's important that the regexp *not* be a raw-string there (it
will do something quite amazingly different if it is).
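
A short sketch of the difference, for the curious.  In a normal string
'\b' is a backspace character, so the pattern removes each "character
plus backspace" pair of an overstrike sequence.  In a raw string the
regexp engine sees \b as a word boundary instead.  `plain_raw` and the
sample strings are made up for the demonstration.

```python
import re


def plain(text):
    """Remove boldface formatting from text."""
    return re.sub('.\b', '', text)      # '\b' here is a backspace (0x08)


def plain_raw(text):
    """The raw-string variant: NOT what pydoc wants."""
    return re.sub(r'.\b', '', text)     # r'\b' is a regexp word boundary


# Overstrike bold: each character is printed, backspaced over, and
# printed again, e.g. 'N', backspace, 'N'.
bold = ''.join(ch + '\b' + ch for ch in 'NAME')
```

Here `plain(bold)` gives back `'NAME'`, while `plain_raw('some words')`
deletes every character that sits just before a word boundary and
returns `'somword'`.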

From  Thu Aug 22 02:50:29 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 21:50:29 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
Message-ID: <>

> ...
> Um, the notation is '|' and '&', not 'or' and 'and', and those are
> what I learned in school.  Seems pretty conventional to me (Greg
> Wilson actually tried this out on unsuspecting newbies and found that
> while '+' worked okay, '*' did not -- read the PEP).

FYI, kjbuckets uses '+' (union) and '&' (intersection).  '*' is used for
graph composition.  It so happens that graph composition applied to sets
views each set as a graph of self-loops

    {1, 2, 7} -> {(1, 1), (2, 2), (7, 7)}

and the composition of two such self-loop graphs is the self-loop graph of
the sets' intersection.  So you can view '*' as being a set intersection
operation there.  It's more useful to compose a graph with a set, in which
case you get the subgraph all of whose start-arc nodes are in the set (set *
graph), or all of whose end-arc nodes are in the set (graph * set).  This is
all very handy if you do a lot of it, but getting comfortable with this
higher-level view of things is at the other end of a learning curve.
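
The composition rules above can be checked with ordinary Python sets of
pairs.  This is only a sketch of the mathematics; it does not use or
imitate the kjbuckets API, and the helper names `compose` and `loops`
are assumptions.

```python
def compose(g, h):
    """Relational composition: (a, c) whenever (a, b) in g and (b, c) in h."""
    return {(a, c) for (a, b) in g for (b2, c) in h if b == b2}


def loops(s):
    """View a set as a graph of self-loops."""
    return {(x, x) for x in s}


s, t = {1, 2, 7}, {2, 7, 9}
graph = {(1, 2), (2, 3), (3, 1), (7, 9)}

# Composing two self-loop graphs yields the self-loops of the intersection.
both = compose(loops(s), loops(t))

# "set * graph": keep arcs whose start node is in the set.
from_1_or_7 = compose(loops({1, 7}), graph)

# "graph * set": keep arcs whose end node is in the set.
into_1_or_9 = compose(graph, loops({1, 9}))
```

With these inputs `both` equals `loops(s & t)`, `from_1_or_7` is
`{(1, 2), (7, 9)}`, and `into_1_or_9` is `{(3, 1), (7, 9)}`.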

From  Thu Aug 22 03:00:41 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 14:00:41 +1200 (NZST)
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
Message-ID: <>

Alex Martelli <>:

> This reminds me of a long-ago interview with Borland's techies about
> how they had managed to create Turbo Pascal, which ran well in a 64K
> (K, not M-) one-floppy PC

Even more impressive was the earlier version of Turbo Pascal which ran
on 64K Z80-based CP/M systems!

I have great respect for that one, because in a previous life I used
it to develop a cross-compiler for a Modula-2-like language targeting
the National 32000 architecture.

My compiler consisted of 3 overlays (for parsing, declaration analysis
and code generation), wasn't very fast, and had so little memory left
for a symbol table that it could only compile very small modules. :-(

In hindsight, my undoing was probably my insistence that the language
not require forward declarations (it was my language, so I could make
it how I wanted). If I had relaxed that, I could have used a single-
pass design that would have simplified things considerably.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 03:18:39 2002
From: (Greg Ewing)
Date: Thu, 22 Aug 2002 14:18:39 +1200 (NZST)
Subject: [Python-Dev] Re: More pydoc questions
In-Reply-To: <>
Message-ID: <>

(François Pinard):

> It works pretty well, as many printer filters know how to interpret such
> overstrike when meant to bold or underline.

Interesting - I hadn't known that. I guess it's not
quite so silly as it might look, then.

Still, there ought to be a way of getting plain
output unadorned by such tricks from the help

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Thu Aug 22 03:47:55 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 22:47:55 -0400
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <Pine.BSF.4.33.0208211601190.51337-100000@localhost>
Message-ID: <>

[Jonathan Riehl]
> As per Zack's comments, I think this is pretty funny.  I have just spent
> more time trying to expose pgen to the interpreter than I took to write
> a R-D parser for Python 1.3 (granted, once Fred's parser module came
> around, I felt a bit silly).

It seems a potential lesson went unlearned then <wink>.

> Considering the scope of my parser generator integration wishlist,
> having GCC move to a hand coded recursive descent parser is going to
> make my head explode.  Even TenDRA ( used a LL(n)
> parser generator, despite its highly tweaked lookahead code.  So now I'm
> going to have to extract grammars from call trees?  As if the 500
> languages problem isn't already intractable, there are going to be
> popular language implementations that don't even bother with an abstract
> syntax specification!?  (Stop me from further hyperbole if I am
> incorrect.)

Anyone writing an R-D parser by hand without a formal grammar to guide them
is insane.  The formal grammar likely won't capture everything, though --
but then they never do.

> No wonder there are no killer software engineering apps.  Maybe I should
> just start writing toy languages for kids...

Parser generators are great for little languages!  They're painful for real
languages, though, because syntax warts accumulate and then tool rigidity
gets harder to live with.  Hand-crafted R-D parsers are wonderfully
tweakable in intuitive ways (staring at a mountain of parse-table conflicts
and divining how to warp the grammar to shut the tool up is a black art
nobody should regret not learning ...).

15 years of my previous lives were spent as a compiler jockey, working for
HW vendors.  The only time we used a parser generator was the time we used
one written by a major customer, and for political reasons far more than
technical ones.  It worked OK in the end, but it didn't really save any
time.  It did save us from one class of error.  I vividly recall a bug
report against the previous Fortran compiler, where this program line (an


apparently never got executed.  It appeared to be an optimization bug at a
fundamental level, as there was simply no code generated for this statement.
After too much digging, we found that the guy who wrote the Fortran parser
had done the equivalent of

    if not statement.has_label() and statement.startswith('CONT'):
        pass   # an unlabelled CONTINUE statement can be ignored

It's just that nobody had started a variable name with those 4 letters
before.  Yikes!  I was afraid to fly for a year after <wink>.
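
To spell out the failure mode in runnable form: the prefix test also
swallows any unlabelled statement that merely begins with those four
letters.  The statement text `CONTRL = 1.0` and both function names are
hypothetical, and the "fixed" check is only one plausible repair, not
what the vendor actually did.

```python
def ignorable_buggy(statement, has_label=False):
    """The shortcut described above: a 4-letter prefix test."""
    return not has_label and statement.startswith('CONT')


def ignorable(statement, has_label=False):
    """Match the whole statement instead of a prefix.

    Blanks are stripped first on the assumption that they are not
    significant in the statement text.
    """
    return not has_label and statement.replace(' ', '').upper() == 'CONTINUE'


# A hypothetical assignment to a variable whose name starts with CONT.
assignment = 'CONTRL = 1.0'
```

`ignorable_buggy(assignment)` is true, so no code would be generated for
the assignment; `ignorable(assignment)` is false, while a bare
`CONTINUE` is still skipped.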

a-class-of-tool-most-appreciated-when-it's-least-needed-ly y'rs  - tim

From  Thu Aug 22 03:59:27 2002
From: (Aahz)
Date: Wed, 21 Aug 2002 22:59:27 -0400
Subject: [Python-Dev] Parsing vs. lexing.
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Thu, Aug 22, 2002, Greg Ewing wrote:
> Alex Martelli <>:
>> This reminds me of a long-ago interview with Borland's techies about
>> how they had managed to create Turbo Pascal, which ran well in a 64K
>> (K, not M-) one-floppy PC
> Even more impressive was the earlier version of Turbo Pascal which ran
> on 64K Z80-based CP/M systems!


I believe you could actually theoretically start it up with only 32K of
memory, but you couldn't do any real work.
Aahz (           <*>

Project Vote Smart:

From  Thu Aug 22 04:39:34 2002
From: (Tim Peters)
Date: Wed, 21 Aug 2002 23:39:34 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

>> Do you expect that to be an issue?  When I built a database from 2[...]
>> messages, the whole thing fit in a Python dict consuming about 10M[...]

[Eric S. Raymond]
> Hm, that's a bit smaller than I would have thought, but the order of
> magnitude I was expecting.

It's even smaller than that <wink>.  The dict here maps strings to
instances of a Python class (WordInfo).  The latter uses new-in-2.2
__slots__, and those give major memory efficiencies over old-style
classes, but there's still substantial memory overhead compared to
what's possible in C.  In addition, there are memory overheads for the
Python objects stored in WordInfo instances, including a Python float
object in each record recording the time.time() of last access by the
scoring method.

IOW, there are tons of memory overheads here, yet the size was still
[...]  So I have no hesitation leaving this part in Python, and coding
this part up was a trivial finger exercise.  You know all that, though!
It makes your decision to use C from the start hard to fathom.

> ...
> Recognition features should age!  Wow!  That's a good point!  With the
> age counter being reset when they're recognized.

For concreteness, here's the comment from the Python code, which I believe
is accurate:

    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spampr[...]
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.
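
A minimal sketch of what expiring by last access time could look like.
This is not the sandbox code: only the `WordInfo` name, `__slots__` and
`atime` come from the text above; the other fields, the `expire`
function and its parameters are assumptions for illustration.

```python
import time


class WordInfo:
    """Per-word record; __slots__ avoids a per-instance __dict__."""
    __slots__ = ('atime', 'spamcount', 'hamcount')

    def __init__(self, atime):
        self.atime = atime      # last access time, a time.time() value
        self.spamcount = 0      # hypothetical counters
        self.hamcount = 0


def expire(wordinfo, max_age_seconds, now=None):
    """Delete records not accessed within `max_age_seconds`.

    `wordinfo` is the dict mapping words to WordInfo records.  Returns
    the list of words that were dropped.
    """
    if now is None:
        now = time.time()
    stale = [word for word, record in wordinfo.items()
             if now - record.atime > max_age_seconds]
    for word in stale:
        del wordinfo[word]
    return stale
```

Collecting the stale keys first avoids mutating the dict while
iterating over it.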

Besides the space-saving gimmick, there may be practical value in
expiring older words that are getting used, but less frequently over
time.  That would be evidence that the nature of the world is changing,
and more aggressively expiring the model for how the world *used* to be
may speed adaptation to the new realities.  I'm not saving enough info
to do that, though, and it's unclear whether it would really help.
Against it, while I see new spam gimmicks pop up regularly, the old ones
never seem to go away (e.g., I don't do anything to try to block spam on
my email accounts, and the bulk of the spam I get is still easily
recognized from the subject line alone).  However, because it's all
written in Python <wink>, it will be very easy to set up experiments to
answer such questions.

BTW, the ifile FAQ gives a little info about the expiration scheme ifile
uses.  Rennie's paper gives more:

    Age is defined as the number of e-mail messages which have been
    added to the model since frequency statistics have been kept for
    the word.  Old, infrequent words are to be dropped while young words
    and old, frequent words should be kept.  One way to quantify this
    is to say that words which occur fewer than log2(age)-1 times
    should be discarded from the model.  For example, if "baseball"
    occurred in the 1st document and occurred 5 or fewer times in the
    next 63 documents, the word and its corresponding statistics would
    be eliminated from the model's database.  This feature selection
    cutoff is used in ifile and is found to significantly improve
    efficiency without noticeably affecting classification performance.
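
The cutoff in that quote can be sketched in a few lines of Python 3.  The
function name is made up, and the code follows the strict "fewer than
log2(age)-1" wording; note that the quote's own example says "5 or fewer",
so the exact boundary is ambiguous and this is only one reading of it.

```python
import math

def should_discard(count, age):
    """True if a word seen `count` times over `age` messages is below log2(age) - 1."""
    if age < 1:
        raise ValueError("age must be at least 1")
    return count < math.log2(age) - 1

# "baseball": seen in the 1st document, followed by 63 more documents,
# so age == 64 and the threshold is log2(64) - 1 == 5.
print(should_discard(4, 64))   # True
print(should_discard(5, 64))   # False under the strict reading
print(should_discard(6, 64))   # False
```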

I'm not sure how that policy would work with Graham's scheme (which has many
differences from the more conventional scheme ifile uses).  Our Python code
also saves a count of the number of times each word makes it into Graham's
"best 15" list, and I expect that to be a direct measure of the value we're
getting out of keeping a word ("word" means whatever the tokenizer passes to
the classifier -- it's really any hashable and (for now) pickleable Python
object).

[on Judy]
> I thought part of the point of the method was that you get
> sorting for free because of the way elements are inserted.

Sure, if range-search or final sorted order is important, it's a great
benefit.  I was only wondering about why you'd expect spatial locality in
the input as naturally ordered.

> ...
> No, but think about how the pointer in a binary search moves.  It's
> spatially bursty.  Memory access frequencies for repeated binary
> searches will be a sum of bursty signals, analogous to the way
> network traffic volumes look in the time domain.  In fact the
> graph of memory address vs. number of accesses is gonna wind up
> looking an awful lot like 1/f noise, I think.  *Not* evenly
> distributed; something there for LRU to work with.

Judy may or may not be able to exploit something here; I don't know, and I'd
need to know a lot more about Judy's implementation to even start to guess.
Plain binary search has horrid cache behavior, though.  Indeed, most
research papers on B-Trees suggest that binary search doesn't buy anything
in even rather large B-Tree nodes, because the cache misses swamp the
reduced instruction count over what a simple linear search does.  Worse is
that linear search can be significantly faster if the HW is smart enough to
detect the regular address access pattern of linear search and do some
helpful prefetching for you.  More recent research on B-Trees is on ways to
get away from bad binary search behavior; two current lines are using
explicit prefetch instructions to minimize the stalls, and using a more
cache-friendly data structure inside a B-Tree node.  Out-guessing modern VM
implementations is damned hard.

With a Python dict you're likely to get a cache miss per lookup.  If that's
a disaster, time.clock() isn't revealing it <wink>.

> ...
> What I'm starting to test now is a refactoring of the program where it'll
> spawn a daemon version of itself first time it's called.  The daemon
> eats the wordlists and stays in core fielding requests from subsequent
> program runs.  Basically an answer to "how you call bogofilter 1K
> times a day from procmail without bringing your disks to their knees
> problem" -- persistence on the cheap.
> Thing is that the solution to this problem is very generic.  Might
> turn into a Python framework.

Sounds good!  Just don't call it "compilerlike" <wink>.

From  Thu Aug 22 04:57:10 2002
From: (Raymond Hettinger)
Date: Wed, 21 Aug 2002 23:57:10 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <>
Message-ID: <008301c24990$12449d00$47b53bd0@othello>

> [Guido]
> > Um, the notation is '|' and '&', not 'or' and 'and', and those are
> > what I learned in school.  Seems pretty conventional to me (Greg
> > Wilson actually tried this out on unsuspecting newbies and found that
> > while '+' worked okay, '*' did not -- read the PEP).

> FYI, kjbuckets uses '+' (union) and '&' (intersection).  '*' is used for

FTI, ISETL uses '+' and '*' as synonyms for the spelled-out 'inter' and 'union' operators.

Playing with a sample session for possible inclusion in the tutorial, I've found that '|' is not nearly as clear in its intention
as '+'.

Raymond Hettinger


from sets import Set

engineers = Set(['John', 'Jane', 'Jack', 'Janice'])
programmers = Set(['Jack', 'Sam', 'Susan', 'Janice'])
management = Set(['Jane', 'Jack', 'Susan', 'Zack'])

employees = engineers | programmers | management    # more clear with '+'
engineering_management = engineers & management
fulltime_management = management - engineers - programmers

engineers.add('Marvin')
print engineers, 'Look, Marvin was added'
print employees.issuperset(engineers), 'There is a problem'
print employees, 'Hmm, employees needs an update'
employees.union_update(engineers)
print employees, 'Looks fine now'

for group in [engineers, programmers, management, employees]:
    print group

From  Thu Aug 22 05:30:09 2002
From: (Guido van Rossum)
Date: Thu, 22 Aug 2002 00:30:09 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Wed, 21 Aug 2002 23:57:10 EDT."
References: <>
Message-ID: <>

> Playing with a sample session for possible inclusion in the
> tutorial, I've found that '|' is not nearly as clear in its
> intention as '+'.

It's way too early to say that.  I actually like the fact that | and
& are a new vocabulary (for containers at least) so they provide an
additional hint that we're dealing with a different kind of

For sequences, a+b != b+a (unless a==b).  For sets, a|b == b|a.
That's a useful distinction.
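
A quick check of that distinction, as a sketch in a current Python 3.  It
uses the later built-in set type as a stand-in for sets.Set, which is an
assumption relative to the module under discussion.

```python
a, b = [1, 2], [3, 4]
print(a + b == b + a)        # False: list concatenation keeps order

s, t = {1, 2}, {3, 4}
print((s | t) == (t | s))    # True: set union does not depend on order
```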

+ is already used for two distinct purposes: for numbers (where it
is symmetric) and for sequences (where it is not).  But numbers and
sequences are unlikely to be confused because they are used so
differently.  Sets and lists are both containers, and I think it's
useful that their vocabularies don't overlap much.

--Guido van Rossum (home page:

From  Thu Aug 22 05:49:08 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 00:49:08 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> Your users' mailers would have two delete buttons -- spam and nonspam.
> On each delete the message would be shipped to bogofilter, which would
> merge the content into its token lists.

I want to raise a caution here.  Graham pulled his formulas out of thin air,
and one part of the scoring step is quite dubious.  This requires detail to
understand.  Where X means "word x is present" and similarly for Y, and S
means "it's spam" and "not-S" means "it's not spam", and sticking to just
the two-word case for simplicity:

P(S | X and Y) = [by Bayes]

P(X and Y | S) * P(S) / P(X and Y) = [by the usual expanded form of Bayes]

P(X and Y | S) * P(S) / (P(S)*P(X and Y | S) + P(not-S)*P(X and Y | not-S))

All that is rigorously correct so far.  Now we make the simplifying
assumption that puts the "naive" in "naive Bayes", that the probability of X
is independent of the probability of Y, so that the conjoined probabilities
can be replaced by multiplication of non-conjoined probabilities.  This
yields

                P(S)*P(X|S)*P(Y|S)
---------------------------------------------------
P(S)*P(X|S)*P(Y|S) + P(not-S)*P(X|not-S)*P(Y|not-S)

Then, unlike a "normal" formulation of Bayesian classification, Graham's
scheme simply doesn't know anything about P(X|S) and P(Y|S) etc.  It only
knows about probabilities in the other direction (P(S|X) etc).  It takes 3
more applications of Bayes to get what we want from what we know.  That is,

P(X|S) = [again by Bayes]

P(S|X) * P(X) / P(S)

Plug that in, mutatis mutandis, in six places, to get

   P(S)*P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S)
--------------------------------------------
P(S)*P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S) + ...

The factor P(X)*P(Y) cancels out of numerator and denominator, leaving

   P(S)*P(S|X)/P(S)*P(S|Y)/P(S)
----------------------------------
P(S)*P(S|X)/P(S)*P(S|Y)/P(S) + ...

and simplifying some P(whatever)/P(whatever) instances away gives

               P(S|X)*P(S|Y)/P(S)
---------------------------------------------------
P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

This isn't what Graham computes, though:  the P(S) and P(not-S) terms are
missing in his formulation.  Given that P(not-S) = 1-P(S), and
P(not-S|whatever) = 1-P(S|whatever), what he actually computes is

           P(S|X)*P(S|Y)
-------------------------------------
P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)

This is the same as the Bayesian result only if P(S) = 0.5 (in which case
all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
of the naive Bayesian result.
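
To make the difference concrete, here is a small Python 3 sketch of the two
scoring formulas for the two-word case.  The function names and the sample
probabilities (9/10 and 4/5) are made up for illustration; only the
formulas come from the derivation above.

```python
from fractions import Fraction

def naive_bayes_two_words(p_sx, p_sy, p_s):
    """P(S | X and Y) under independence, keeping the P(S) and P(not-S) terms."""
    spam = p_sx * p_sy / p_s
    ham = (1 - p_sx) * (1 - p_sy) / (1 - p_s)
    return spam / (spam + ham)

def graham_two_words(p_sx, p_sy):
    """The same combination with the P(S) and P(not-S) terms left out."""
    spam = p_sx * p_sy
    ham = (1 - p_sx) * (1 - p_sy)
    return spam / (spam + ham)

p_sx, p_sy = Fraction(9, 10), Fraction(4, 5)    # illustrative P(S|X), P(S|Y)

print(graham_two_words(p_sx, p_sy))                         # 36/37
print(naive_bayes_two_words(p_sx, p_sy, Fraction(1, 2)))    # 36/37, same when P(S) = 1/2
print(naive_bayes_two_words(p_sx, p_sy, Fraction(1, 3)))    # 72/73, differs when P(S) = 1/3
```

Exact fractions are used so the P(S) = 1/2 agreement shows up as equality
rather than as a floating-point near miss.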

For this reason, it's best that the number of spam msgs fed into your
database be approximately equal to the number of non-spam msgs fed into it:
that's the only way to make P(S) ~= P(not-S), so that the distortion doesn't
matter.  Indeed, it may be that Graham found he had to multiply his "good
counts" by 2 to make up for the fact that in real life he has twice as many
non-spam messages as spam messages in his inbox, while the final scoring
implicitly assumes they're of equal number (and so overly favors the "it's
spam" outcome unless the math is fudged elsewhere to make up for that).

It would likely be better still to train the database with a proportion of
spam to not-spam messages reflecting what you actually get in your inbox,
and change the scoring to use the real-life P(S) and P(not-S) estimates.  In
that case the "mystery bias" of 2 may actively hurt, overly favoring the
"not spam" outcome.

Note that Graham said:

    Here's a sketch of how I do statistical filtering.  I start with one
    corpus of spam and one of nonspam mail.  At the moment each one has
    about 4000 messages in it.

That's consistent with all the above, although it's unclear whether Graham
intended "about the same" to be a precondition for using this formulation,
or whether fudging elsewhere was introduced empirically to make up for the
scoring formula neglecting P(S) and P(not-S) by oversight.

From  Thu Aug 22 06:38:22 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 01:38:22 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <008301c24990$12449d00$47b53bd0@othello>
Message-ID: <>

[Raymond Hettinger]
> FTI, ISETL uses '+' and '*' as synonyms for the spelled-out
> 'inter' and 'union' operators.

You realize that reads as if they used '+' for 'inter' and '*' for 'union',

> Playing with a sample session for possible inclusion in the
> tutorial, I've found that '|' is not nearly as clear in its intention
> as '+'.
> engineers = Set(['John', 'Jane', 'Jack', 'Janice'])
> programmers = Set(['Jack', 'Sam', 'Susan', 'Janice'])
> management = Set(['Jane', 'Jack', 'Susan', 'Zack'])
> employees = engineers | programmers | management    # more clear with '+'

I haven't made time to play with the new sets module yet, but it was
instantly clear to me just as it was.  I think Guido makes a very good point
about "+" making it much more confusable with a sequence or numeric
operation too.  OTOH, I'm rarely a fan of overloaded operators, and suspect
I'll tend to use whatever .method() names the module supports (the set
modules I've written for my own use never overloaded operators, btw).

One thing did strike me as odd later!  If I were to ask this company's HR
director what kinds of employees they had, I bet the answer I'd hear is

    well, mostly we have engineers and programmers and management

It seems far less likely I'd hear

    well, mostly we have engineers or programmers or management

and I read "|" as "or".  If I heard

    well, mostly we have engineers vertical-bar programmers
                                   vertical-bar management

I'd beg to work there for free <wink>.

From  Thu Aug 22 09:44:24 2002
From: (Martin Sjögren)
Date: 22 Aug 2002 10:44:24 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <1030005864.561.1.camel@winterfell>


On Thu 2002-08-22 at 01.20, Greg Ewing wrote:
> Martin Sjögren <>:
> > Uhm, what about + and juxtaposition? They are quite common at least
> > here in Sweden, for boolean algebra.
> They're not normally used for sets, though, in my
> experience (despite the fact that set theory is
> a boolean algebra:-).

Nope, not for sets. I have rarely seen anything but \cup and \cap for
sets. But they are quite often used when working with boolean algebras
in general.





From  Thu Aug 22 09:55:37 2002
From: (Martin Sjögren)
Date: 22 Aug 2002 10:55:37 +0200
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <>
Message-ID: <1030006537.561.4.camel@winterfell>


On Thu 2002-08-22 at 03.13, Tim Peters wrote:
> Seems more to me that lexing is convenient and fast only when expressed in a
> language specifically designed for writing lexers, and backed by a
> specialized execution engine that knows a great deal about fast
> state-machine implementation.  Like, say, Flex.  Lexing is also clumsy and
> slow in SNOBOL4 and Icon, despite that they excel at higher-level
> pattern-matching tasks.  IOW, lexing is in a world by itself, almost nothing
> is good at it, and the few things that shine at it don't do anything else.

I've actually found that Haskell's pattern matching and lazy evaluation
make it pretty easy to write lexers. Too bad it's too hard to use
Haskell together with other languages :(

But then, writing an R-D parser in Haskell is a piece of cake too :-)





From  Thu Aug 22 10:15:26 2002
From: (Michael Hudson)
Date: 22 Aug 2002 10:15:26 +0100
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Greg Ewing's message of "Thu, 22 Aug 2002 11:25:07 +1200 (NZST)"
References: <>
Message-ID: <>

Greg Ewing <> writes:

> "Eric S. Raymond" <>
> Subject: Re: [Python-Dev] Re: PEP 218 (sets):
> > Is it + for disjunction and juxtaposition for conjunction, or the other
> > way around?
> + is 'or' and juxtaposition (or sometimes a dot) is 'and'
> (I prefer those words because they're shorter than
> 'disjunction' and 'conjunction', and I can remember which
> is which:-).

I've always thought "meet" and "join" to be quite cute terms for
lattice operations.  Guaranteed to confuse the average user of
-- let's go for it!


  The use of COBOL cripples the mind; its teaching should, therefore,
  be regarded as a criminal offence.
           -- Edsger W. Dijkstra, SIGPLAN Notices, Volume 17, Number 5

From  Thu Aug 22 10:31:07 2002
From: (Michael Hudson)
Date: 22 Aug 2002 10:31:07 +0100
Subject: [Python-Dev] q about default args
In-Reply-To: Stepan Koltsov's message of "Wed, 21 Aug 2002 22:24:51 +0400"
References: <>
Message-ID: <>

Stepan Koltsov <> writes:

> Hi, Guido, other Python developers and other subscribers.
> First of all, if this question was discussed here or somewhere
> else 8086 times, please direct me to discussion archives.

I doubt it's ever been discussed on python-dev.  Most people here know
a non-starter when they see one.

Hmm.  Well, they know this one's a non-starter :)

> I couldn't guess the keywords to search for in the python-dev archives
> as I haven't found the search page where to enter these keywords :-)

Just use google.  If python-dev is the most relevant place for the
discussion, the archives will be near the top of the results.

> The question is: To be or^H^H^H^H^H^H^H^H^H Why not evaluate default
> parameters of a function at THE function call, not at function def
> (as is done currently)? For example, C++ (a nice language, isn't it? ;-)


> ) evaluates default parameters at function call.
> Implementation details:
> Simple... 
> Add a flag to the code object, that means "evaluate default args".
> Compile default args to small code objects and store them where values
> for default args are stored in current Python (i.e. co_consts).

That's not where they're stored.

> When a function is called, evaluate the default args (if the above
> flag is set) in the context of that function. 

This could break code, you realise:

a = 1
def f(a, b=a):
    print a, b


I could go on, but I'm running out of steam...

> An alternative way to go (a little example... LOOK ON, PERSONALY, I

Fortunately or unfortunately, that makes little difference to the
direction of python development.

> ---
> def f(x=12+[]):
> 	stmts
> ===
> compiled into something like:
> 0: LOAD_CONST 1 (12)
> 3: STORE_FAST 0 (x)
> 4: # here code of stmts begin
> If 'x' was specified, the code is executed from instruction 4
> onward.  This should work perfectly, be ideologically correct and, I think,
> even faster than the current interpreter implementation.

You'd have fun with:

def f(a=1,b=2):
    print a, b


here, no?

> Motivation (he-he, the most difficult part of this letter):
> 1. Try to import this module:
> import math
> def func(a = map(lambda x: math.sqrt(x)):
> 	pass
> # there is no call to func
> ===
> This code does nothing but define a single function,
> but look at the execution time...

So don't do something that thick, then!

> 2. Currently, default arguments are like static function variables,
> defined in the function parameter list! That is wrong.

Says you.

> 4. Again: I dislike code like
> ---
> def f(l=None):
>     if l is None:
>         l = []
>     ...

Who elected you style guru of the universe?  

> 5. I asked my friend (also a big Python fan): why is the current
> behaviour correct?  His answer was: "the current behaviour is
> correct, because that is the way it was done in the first place :-)
> ..." I don't see any advantages of the current style, and lack of
> advantages is an advantage of the new style :-)

For better or for worse, people *do* write code that depends on
default function arguments being evaluated once, usually as a lazy way
of precomputing things, or as a cache.
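
For anyone following along, a short sketch of the behaviour in question,
in a current Python 3.  The function names are made up for the example; the
def-time evaluation of defaults is the behaviour being debated.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits `bucket` shares the same list.
    bucket.append(item)
    return bucket

def append_fresh(item, bucket=None):
    # The None idiom quoted above: build a new list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

def square(n, _cache={}):
    # A default used on purpose as a cache that persists across calls.
    if n not in _cache:
        _cache[n] = n * n
    return _cache[n]

shared = append_shared(1)
append_shared(2)
print(shared)                             # [1, 2]: both calls hit one list
print(append_fresh(1), append_fresh(2))   # [1] [2]: a new list per call
print(square(12), square(12))             # second call is served from the cache
```

Code like square() is the kind that would change meaning if defaults were
evaluated at call time instead.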

> I hope, that the current state of things is a result of laziness (or is
> it "business"), not sabotage :-) .  and not an ideological decision. It
> isn't late to fix Python yet :-) 

Two points: 

1) I'm unconvinced this is a "fix"
2) I think it probably is too late.


  You can lead an idiot to knowledge but you cannot make him 
  think.  You can, however, rectally insert the information, 
  printed on stone tablets, using a sharpened poker.        -- Nicolai

From  Thu Aug 22 12:03:44 2002
From: (Fredrik Lundh)
Date: Thu, 22 Aug 2002 13:03:44 +0200
Subject: [Python-Dev] Parsing vs. lexing.
References: <>
Message-ID: <001601c249cb$98830fb0$0900a8c0@spiff>

tim wrote:

> Parser generators are great for little languages!  They're painful for =
> languages, though, because syntax warts accumulate and then tool =
> gets harder to live with.  Hand-crafted R-D parsers are wonderfully
> tweakable in intuitive ways (staring at a mountain of parse-table =
> and divining how to warp the grammar to shut the tool up is a black =
> nobody should regret not learning ...).


"For me and C++, [using a parser generator] was a bad mistake."


From  Thu Aug 22 15:24:32 2002
From: (Magnus Lie Hetland)
Date: Thu, 22 Aug 2002 16:24:32 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Tue, Aug 20, 2002 at 11:47:11PM -0400
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> no hope that this will ever complete in finite time, but does that
> mean it shouldn't start?  I could write 1L<<e and avoid the issue, but
> then I'd be paying for long ops that I'll only ever need in a case
> that's only of theoretical importance.

How about lazy sets? E.g. a CartesianProduct could delegate to its two
underlying (concrete) sets when checking for membership, and a
PowerSet could perform individual member checks for each element in a
given subset... Etc.

I guess this might be too specific for the library -- subclassing
ImmutableSet and overriding the accessors shouldn't be too hard...

(The nice thing about including it in the library is that you could
produce these things as results from operations on Set and
ImmutableSet, e.g. 2**some_set could give a power set or whatever...)
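
Roughly what I have in mind, sketched in a current Python 3 with plain
classes over the built-in set type rather than subclasses of ImmutableSet
(that substitution, and the class layout, are assumptions for the sake of a
short example).  Only membership and size are supported; nothing is ever
materialized.

```python
class CartesianProduct:
    """Lazy product of two sets: membership delegates to the underlying sets."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __contains__(self, pair):
        try:
            x, y = pair
        except (TypeError, ValueError):
            return False          # not a 2-element iterable
        return x in self.left and y in self.right

    def __len__(self):
        return len(self.left) * len(self.right)


class PowerSet:
    """Lazy power set: a candidate is a member if all its elements are in base."""

    def __init__(self, base):
        self.base = base

    def __contains__(self, subset):
        try:
            return all(element in self.base for element in subset)
        except TypeError:
            return False          # candidate is not iterable

    def __len__(self):
        return 2 ** len(self.base)


product = CartesianProduct({1, 2}, {"a", "b"})
print((1, "a") in product, (3, "a") in product)

subsets = PowerSet({1, 2, 3})
print(frozenset({1, 3}) in subsets, frozenset({4}) in subsets, len(subsets))
```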

Magnus Lie Hetland                                  The Anygui Project                        

From  Thu Aug 22 15:25:01 2002
From: (Magnus Lie Hetland)
Date: Thu, 22 Aug 2002 16:25:01 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Wed, Aug 21, 2002 at 05:41:11PM +1200
References: <> <>
Message-ID: <>

Greg Ewing <>:
> Oh, no. Someone is bound to want set comprehensions, now...

That's in the PEP, isn't it?

Magnus Lie Hetland                                  The Anygui Project                        

From  Thu Aug 22 16:12:57 2002
From: (Guido van Rossum)
Date: Thu, 22 Aug 2002 11:12:57 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Thu, 22 Aug 2002 16:24:32 +0200."
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <> <> <>
Message-ID: <>

> [snip]
> > no hope that this will ever complete in finite time, but does that
> > mean it shouldn't start?  I could write 1L<<e and avoid the issue, but
> > then I'd be paying for long ops that I'll only ever need in a case
> > that's only of theoretical importance.
> How about lazy sets? E.g. a CartesianProduct could delegate to its two
> underlying (concrete) sets when checking for membership, and a
> PowerSet could perform individual member checks for each element in a
> given subset... Etc.

Have you got a use case for membership tests of a cartesian product?

> I guess this might be too specific for the library -- subclassing
> ImmutableSet and overriding the accessors shouldn't be too hard...
> (The nice thing about including it in the library is that you could
> produce these things as results from operations on Set and
> ImmutableSet, e.g. 2**some_set could give a power set or whatever...)

Use case?

--Guido van Rossum (home page:

From  Thu Aug 22 17:14:32 2002
From: (Charles Cazabon)
Date: Thu, 22 Aug 2002 10:14:32 -0600
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>; from on Thu, Aug 22, 2002 at 12:49:08AM -0400
References: <> <>
Message-ID: <>

This brings up a couple of questions, one related to the theory behind this
Bayesian spam filtering, and one about Python optimization ... apologies in
advance for the long post.

Tim Peters <> wrote:
> I want to raise a caution here.  Graham pulled his formulas out of thin air,
> and one part of the scoring step is quite dubious.  This requires detail to
> understand.

[detail deleted]
>                P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
> This isn't what Graham computes, though:  the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
>            P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.

Is there an easy fix to this problem?  I implemented this in Python after
reading about it on the weekend, and it might explain why my results are not
quite as fabulous as the author noted (I'm getting more false positives than
he claimed he was).  Note that I'm not so good with the above notation; I'm
more at home with plain algebraic stuff :).

But the more interesting Python question:  I'm running into some performance
problems with my implementation.  Details:

The analysis stage of my implementation (I'll refer to it as "spamalyzer" for
now) stores the "mail corpus" and term list on disk.  The mail corpus is two
dictionaries (one for spam, one for good mail), each of which contains two
further dictionaries -- one is the filenames of analyzed messages (one key per
filename, values ignored and stored as 0), and the other is a dictionary
mapping terms to the number of occurrences.  The terms list is a single
dictionary mapping terms to a pair of floats (probability of being spam and
distance from 0.5).

My first try at this used cPickle to store these items, but loading them back
in was excruciatingly slow.  From a lightly loaded P3-500/128MB running Linux
2.2.x, each of these is a separate run of a benchmarking Python script:

Loading corpus

pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 65.190000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 64.790000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 65.010000000000 seconds.

Loading terms

pickle method: got 12986 terms in 3.460000000000 seconds.
pickle method: got 12986 terms in 3.470000000000 seconds.
pickle method: got 12986 terms in 3.450000000000 seconds.

For a lark, I decided to try an alternative way of storing the data (and no, I
haven't tried the marshal module directly).  I wrote a function to write the
contents of the dictionary to a text file in the form of Python source, so
that you can re-load the data with a simple "import" command.  To my surprise,
this was significantly faster!  The first import, of course, takes a while, as
the interpreter compiles the .py file to .pyc format, but subsequent runs are
an order of magnitude faster than cPickle.load():

Loading corpus

[charlesc@charon spamalyzer]$ rm mail_corpus.pyc 

custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 194.210000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.500000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.260000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms)
in 3.260000000000 seconds.

Loading terms

[charlesc@charon spamalyzer]$ rm terms.pyc

custom method: got 12986 terms in 3.110000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.

So the big question is, why is my naive "o = __import__ (f, {}, {}, [])" so
much faster than the more obvious "o = cPickle.load (f)"?  And what can I do
to make it faster :).
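
For reference, the "write it out as Python source" trick can be sketched
like this in a current Python 3.  This is a reconstruction for
illustration, not my actual spamalyzer code: the helper names and the
file/module name are made up, and it only works for data whose repr() is
valid Python source (dicts of strings, numbers and tuples are fine).

```python
import importlib.util
import os
import tempfile

def dump_as_source(data, path, name="data"):
    """Write `data` to `path` as Python source of the form `data = {...}`."""
    with open(path, "w") as f:
        f.write("%s = %r\n" % (name, data))

def load_from_source(path, name="data"):
    """Import the file at `path` as a module and return its `name` attribute."""
    spec = importlib.util.spec_from_file_location("stored_terms", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, name)

# term -> (probability of being spam, distance from 0.5)
terms = {"free": (0.99, 0.49), "python": (0.01, 0.49)}

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "stored_terms.py")
    dump_as_source(terms, source_path)
    reloaded = load_from_source(source_path)

print(reloaded == terms)
```

The first load pays for compiling the source; later loads can reuse the
cached bytecode file, which matches the slow-then-fast pattern in the
timings above.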

Charles Cazabon                           <>
GPL'ed software available at:

From  Thu Aug 22 17:33:25 2002
From: (Skip Montanaro)
Date: Thu, 22 Aug 2002 11:33:25 -0500
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <>
Message-ID: <15717.4693.455504.793456@gargle.gargle.HOWL>

    Charles> So the big question is, why is my naive "o = __import__ (f, {},
    Charles> {}, [])" so much faster than the more obvious "o = cPickle.load
    Charles> (f)"?  And what can I do to make it faster :).

Try dumping in the binary format, e.g.:

    s = cPickle.dumps(obj, 1)
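
A slightly fuller sketch of the same suggestion, written for a current
Python 3, where the pickle module stands in for cPickle (that substitution
and the sample dictionary are assumptions for illustration).  The second
argument selects the protocol: 0 is the text format, 1 the binary one.

```python
import pickle

corpus = {"free": 12, "python": 340, "viagra": 57}

text_pickle = pickle.dumps(corpus, 0)     # protocol 0: text format
binary_pickle = pickle.dumps(corpus, 1)   # protocol 1: binary format

# Both formats round-trip to an equal dictionary.
print(pickle.loads(text_pickle) == corpus, pickle.loads(binary_pickle) == corpus)
```

Whether the binary format closes the gap with the import approach would
have to be measured on the real corpus.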


From  Thu Aug 22 20:37:26 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 15:37:26 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
Message-ID: <>

[Charles Cazabon]
> Is there an easy fix to this problem?

I don't know that there is "a problem".  The step is dubious, but other
steps are also dubious, and naive Bayes itself is dubious (the assumption
that word probabilities are independent is simply false in this application).
But outside of, perhaps, quantum chromodynamics, all models of reality are
more or less dubious, and it's darned hard to say whether the fudging needed
to make them appear to work is lucky or principled, robust or brittle.  The
more gross deviations there are from a model, though, the less one can
appeal to the model for guidance.  In the limit, you can end up with a pile
of arbitrary tricks, with no idea which gimmicks matter anymore (given
enough free parameters to fiddle, you can train even a horribly
inappropriate model to fit a specific blob of data exactly).

> I implemented this in Python after reading about it on the weekend, and it
> might explain why my results are not quite as fabulous as the author noted
> (I'm getting more false positives than he claimed he was).

How many lines of code do you have?  That's a gross lower bound on the
number of places it might have screwed up <wink>.

> Note that I'm not so good with the above notation; I'm more at home with
> plain algebraic stuff :).

It's all plain -- and simple -- algebra, it's just long-winded.  You may be
confusing yourself, e.g., by reading P(S|X) as if it's a complex expression
in its own right.  But it's not -- given an incarnation of the universe, it
denotes a fixed number.  Think of it as "w" instead <wink>.

Let's get concrete.  You have a spam corpus with 1000 messages.  100 of them
contain the word x, and 500 contain the word y.  Then

    P(X|S) = 100/1000 = 1/10
    P(Y|S) = 500/1000 = 1/2

You also have a non-spam corpus with 2000 messages.  100 of them contain x
too, and 500 contain y.  Then

    P(X|not-S) = 100/2000 = 1/20
    P(Y|not-S) = 500/2000 = 1/4

This is the entire universe, and it's all you know.  If you pick a message
at random, what's P(S) = the probability that it's from the spam corpus?
It's trivial:

    P(S) = 1000/(1000+2000) = 1/3
    P(not-S) = 2/3

Now *given that* you've picked a message at random, and *know* it contains
x, but don't know anything else, what's the probability it's spam (== what's
P(S|X)?).  Well, it has to be one of the 100 spam messages that contains x,
or one of the 100 non-spam messages that contains x.  They're all equally
likely, so

    P(S|X) = (100+100)/200 = 1/2

and

    P(S|Y) = (500+500)/500 = 1/2

too by the same reasoning.  P(not-S|X) and P(not-S|Y) are also 1/2 each.

So far, there's nothing a reasonable person can argue with.  Given that this
is our universe, these numbers fall directly out of what reasonable people
agree "probability" means.

When it comes to P(S|X and Y), life is more difficult.  If we *agree* to
assume that word probabilities are independent (which is itself dubious, but
has the virtue of appearing to work pretty well anyway), then the number of
messages in the spam corpus we can expect to contain both X and Y is

    P(X|S)*P(Y|S)*number_spams = (1/10)*(1/2)*1000 = 50

Similarly the # of non-spam messages we can expect to contain both X and Y
is

    (1/20)*(1/4)*2000 = 25

Since that's all the messages that contain both X and Y, the probability
that a message containing both X and Y is spam is

    P(S | X and Y) = 50/(50 + 25) = 2/3

Note that this agrees with the formula whose derivation I spelled out from
first principles:

                 P(S|X)*P(S|Y)/P(S)
 --------------------------------------------------- =
 P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

                (1/2)*(1/2)/(1/3)         2
 -------------------------------------- = -
 (1/2)*(1/2)/(1/3)  + (1/2)*(1/2)/(2/3)   3

It's educational to work through Graham's formulation on the same example.
To start with, P(S|X) is approximated by a different means, and fudging the
"good count" by a factor of 2, giving

    P'(S|X) = (100/1000) / (100/1000 + 2*100/2000) = 1/2

and similarly for P'(S|Y).  These are the same probabilities I gave above,
but the only reason they're the same is because I deliberately picked a spam
corpus size exactly half the size of the non-spam corpus, knowing in advance
that this factor-of-2 fudge would make the results the same in the end.  The
only difference in what's computed then is in the scoring step, where
Graham's formulation computes

    P'(S | X and Y) =  (1/2)*(1/2)/((1/2)*(1/2)+(1/2)*(1/2)) = 1/2

instead of the 2/3 that's actually true in this universe.  If the corpus
sizes diverge more, the discrepancies at the end grow too, and the way of
computing P(S|X) at the start also diverges.
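
The whole example can be replayed in a few lines of Python 3.  The variable
and function names are made up for the sketch; the counts and the
factor-of-2 fudge are the ones used above.

```python
from fractions import Fraction

n_spam, n_ham = 1000, 2000
spam_with = {"x": 100, "y": 500}    # spam messages containing each word
ham_with = {"x": 100, "y": 500}     # non-spam messages containing each word

# Expected number of messages containing both words, assuming independence.
spam_both = Fraction(spam_with["x"], n_spam) * Fraction(spam_with["y"], n_spam) * n_spam
ham_both = Fraction(ham_with["x"], n_ham) * Fraction(ham_with["y"], n_ham) * n_ham
true_score = spam_both / (spam_both + ham_both)

def graham_p(word):
    """Graham-style estimate of P(S|word), doubling the "good count"."""
    spam_ratio = Fraction(spam_with[word], n_spam)
    ham_ratio = Fraction(2 * ham_with[word], n_ham)
    return spam_ratio / (spam_ratio + ham_ratio)

gx, gy = graham_p("x"), graham_p("y")
graham_score = gx * gy / (gx * gy + (1 - gx) * (1 - gy))

print(spam_both, ham_both)          # 50 25
print(true_score, graham_score)     # 2/3 1/2
```

Changing n_ham to something other than twice n_spam is an easy way to watch
the two scores drift further apart.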

Is that good or bad?  I say no more now than that it's dubious <wink>.

> But the more interesting Python question:  I'm running into some
> performance problems with my implementation.  Details:

English never helps with these.  Whittle it down and post actual code to
comp.lang.python for help.  Or study the sandbox code in the CVS repository
(see the Subject line of this msg) -- it's not having any speed problems.

From  Thu Aug 22 22:55:55 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 17:55:55 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
Message-ID: <>

> ...
>     P(S|X) = (100+100)/200 = 1/2
> and
>     P(S|Y) = (500+500)/500 = 1/2
> too by the same reasoning.  P(not-S|X) and P(not-S|Y) are also 1/2 each.
> So far, there's nothing a reasonable person can argue with.

And note that only an unreasonable person would argue that (100+100)/200 =
1, so don't even think about it <wink>.  Of course those should have been

     P(S|X) = 100/(100+100) = 1/2
     P(S|Y) = 500/(500+500) = 1/2

There are probably other glitches like that.  Work out the example
yourself -- it's much easier to figure it out than to transcribe all the
tedious numbers into a msg.

From  Thu Aug 22 23:27:37 2002
From: (Eric S. Raymond)
Date: Thu, 22 Aug 2002 18:27:37 -0400
Subject: [Python-Dev] q about default args
In-Reply-To: <>
References: <>
Message-ID: <>

Stepan Koltsov <>:
> The question is: To be or^H^H^H^H^H^H^H^H^H Why not evaluate default
> parameters of a function at THE function call, not at function def
> (as is done currently)? For example, C++ (a nice language, isn't it? ;-)
> ) evaluates default parameters at function call.

Among other things, because that choice (what old LISP hackers like me
call `dynamic scoping') turns out to be far more difficult to model
mentally than Python's lexical scoping.   Forty years of LISP experience
says Python does the right thing.
		<a href="">Eric S. Raymond</a>

From  Thu Aug 22 23:46:56 2002
From: (Eric S. Raymond)
Date: Thu, 22 Aug 2002 18:46:56 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> > And not necessary.  Base64 spam invariably has telltales that Bayesian
> > analysis will pick up in the headers and MIME cruft.  A rather large
> > percentage of it is either big5 or images.
> I'd be curious to know if that will continue to be true in the future.
> At least one of my non-tech friends sends email that's exclusively
> HTML (even though the content is very lightly marked-up plain text),
> from a hotmail account.  Spam could easily have the same origin, but
> the HTML contents would be very different.

Well, consider.  If your friend were to send you base64 mail, it 
probably would *not* come from one of the spamhaus addresses in 
bogofilter's wordlists.

The presence of base64 content is neutral.  That means that about the only
way not decoding it could lead to a false positive is if the headers 
contained spam-correlated tokens which decoding the body would have 
countered with words having a higher non-spam loading.
		<a href="">Eric S. Raymond</a>

From  Thu Aug 22 23:38:11 2002
From: (Eric S. Raymond)
Date: Thu, 22 Aug 2002 18:38:11 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> > They hit the language in a weak spot created by the immutability of
> > strings.
> But you lost me here -- I don't see a connection between immutability and
> either ease of writing lexers or speed of lexers.

It's an implementation problem.  You find yourself doing a lot of 
string accessing and pasting, creating several new objects per
input char.
		<a href="">Eric S. Raymond</a>

From  Thu Aug 22 23:33:34 2002
From: (Eric S. Raymond)
Date: Thu, 22 Aug 2002 18:33:34 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <> <> <> <> <>
Message-ID: <>

Barry A. Warsaw <>:
> You need some kind of list admin oversight or your system is open to
> attack vectors on individual posters.

Interesting point!
		<a href="">Eric S. Raymond</a>

From  Fri Aug 23 03:38:32 2002
From: (Greg Ewing)
Date: Fri, 23 Aug 2002 14:38:32 +1200 (NZST)
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
Message-ID: <>

"Eric S. Raymond" <>:
> Tim Peters <>:
> > But you lost me here -- I don't see a connection between immutability and
> > either ease of writing lexers or speed of lexers.
> It's an implementation problem.  You find yourself doing a lot of 
> string accessing and pasting, creating several new objects per
> input char.

Not necessarily! Plex manages to do it without any
of that.

The trick is to leave all the characters in the input
buffer and just *count* how many characters make up
the next token. Once you've decided where the token
ends, one slice gives it to you.
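
A toy illustration of that idea follows.  It is not Plex's implementation,
just a hand-written scanner with an invented name (scan_identifiers) that
keeps two indices into the untouched input and takes a single slice per
token.

```python
def scan_identifiers(text):
    """Return the identifier-like tokens in text, one slice per token."""
    tokens = []
    pos, n = 0, len(text)
    while pos < n:
        if text[pos].isalpha() or text[pos] == "_":
            # Advance an end index; nothing is concatenated along the way.
            end = pos + 1
            while end < n and (text[end].isalnum() or text[end] == "_"):
                end += 1
            tokens.append(text[pos:end])   # the only slice for this token
            pos = end
        else:
            pos += 1                        # skip anything that is not a token start
    return tokens
```

The scanner still has to look at every character; what it avoids is building
intermediate strings while deciding where the token ends.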

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 23 04:17:08 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 23:17:08 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
Message-ID: <>

[Greg Ewing]
> Not necessarily! Plex manages to do it without any
> of that.
> The trick is to leave all the characters in the input
> buffer and just *count* how many characters make up
> the next token. Once you've decided where the token
> ends, one slice gives it to you.

Plex is very nice!  It doesn't pass my "convenient and fast" test only because
the DFA at the end still runs at Python speed, and one character at a time
is still mounds slower than it could be in C.  Hmm.  But you can also
generate pretty reasonable C code from Python source now too!  You're going
to solve this yet, Greg.

Note that mxTextTools also computes slice indices for "tagging", rather than
build up new string objects.  Heck, that's also why Guido (from the start)
gave the regexp and string match+search gimmicks optional start-index and
end-index arguments too, and why one of the "where did this group match?"
flavors returns slice indices.  I think Eric has spent too much time
debugging C lately <wink>.
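
As a small illustration of those start-index and end-index arguments, here
is how they look with the re module in current Python.  The pattern and the
sample text are invented for the example.

```python
import re

NUMBER = re.compile(r"\d+")
text = "abc 123 def"

# Match starting at index 4 without slicing the input first.
m = NUMBER.match(text, 4)
full_span = m.span()                 # slice indices of the match
token = text[m.start():m.end()]      # one slice, taken only at the end

# An end index limits how far the match may extend.
clipped = NUMBER.match(text, 4, 6)
clipped_span = clipped.span()
```

Here full_span is (4, 7), token is "123", and clipped_span is (4, 6).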

From  Fri Aug 23 04:21:55 2002
From: (Greg Ewing)
Date: Fri, 23 Aug 2002 15:21:55 +1200 (NZST)
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <>
Message-ID: <>

> Hmm.  But you can also generate pretty reasonable C code from Python
> source now too!  You're going to solve this yet, Greg.

Yes, probably the first serious use I make of Pyrex
will be to re-implement the inner loop of Plex so it
runs at C speed.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 23 04:25:25 2002
From: (Andrew P. Lentvorski)
Date: Thu, 22 Aug 2002 20:25:25 -0700 (PDT)
Subject: [Python-Dev] q about default args
In-Reply-To: <>
Message-ID: <>

On Thu, 22 Aug 2002, Eric S. Raymond wrote:

> Stepan Koltsov <>:
> > The question is: To be or^H^H^H^H^H^H^H^H^H Why not evaluate default
> > parameters of a function at THE function call, not at function def
> Among other things, because that choice (what old LISP hackers like me
> call `dynamic scoping') turns out to be far more difficult to model
> mentally than Python's lexical scoping.

That statement sounds like someone spent a lot of time doing research on
it.  Is there a reference I could go look up?


From  Fri Aug 23 04:29:59 2002
From: (Tim Peters)
Date: Thu, 22 Aug 2002 23:29:59 -0400
Subject: [Python-Dev] RE: [Python-checkins] python/dist/src/Lib,1.26,1.27
In-Reply-To: <>
Message-ID: <>

> Update of /cvsroot/python/python/dist/src/Lib
> In directory usw-pr-cvs1:/tmp/cvs-serv15469
> Modified Files:
> Log Message:
> Rewritten using the tokenize module, which gives us a real tokenizer
> rather than a number of approximating regular expressions.
> Alas, it is 3-4 times slower.  Let that be a challenge for the
> tokenize module.

Was this just for purity, or did it fix a bug?  The regexps there were close
to being heroically careful, and even so it was sometimes uncomfortably slow
using the class browser in IDLE (based on pyclbr), and even on a fast
machine.  A factor of 3 or 4 might make that unbearable.

If it was for purity, note that tokenize is also based on mounds of regexp
tricks <wink>.

From  Fri Aug 23 06:06:45 2002
From: (Eric S. Raymond)
Date: Fri, 23 Aug 2002 01:06:45 -0400
Subject: [Python-Dev] q about default args
In-Reply-To: <>
References: <> <>
Message-ID: <>

Andrew P. Lentvorski <>:
> > Among other things, because that choice (what old LISP hackers like me
> > call `dynamic scoping') turns out to be far more difficult to model
> > mentally than Python's lexical scoping.
> That statement sounds like someone spent a lot of time doing research on
> it.  Is there a reference I could go look up?

It's sort of a folk theorem derived from painful experience.  Nobody 
has proposed a new LISP dialect with dynamic scoping since the mid-1980s.
Scheme and Common LISP, both lexically scoped, pretty much settled the
controversy.
		<a href="">Eric S. Raymond</a>

From  Fri Aug 23 07:37:56 2002
From: (Eric Tiedemann)
Date: Thu, 22 Aug 2002 23:37:56 -0700 (PDT)
Subject: [Python-Dev] q about default args
In-Reply-To: <>
Message-ID: <>

Eric S. Raymond discourseth:
> Andrew P. Lentvorski <>:
> > > Among other things, because that choice (what old LISP hackers like me
> > > call `dynamic scoping') turns out to be far more difficult to model
> > > mentally than Python's lexical scoping.
> > 
> > That statement sounds like someone spent a lot of time doing research on
> > it.  Is there a reference I could go look up?
> It's sort of a folk theorem derived from painful experience.  Nobody 
> has proposed a new LISP dialect with dynamic scoping since the mid-1980s.
> Scheme and Common LISP, both lexically scoped, pretty much settled the
> controversy.

 has some good
coverage of this.

When it comes to the original topic (the handling of default
arguments), I think it's possible to separate time of evaluation and
scope of evaluation.  Call-time and static-scoping seem like good
choices to me.  Being able to refer to parameters to the left of the
one you're defaulting can be especially handy.
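
To make the distinction concrete, here is a short sketch of how evaluation
time shows up in Python as it actually behaves, together with the usual
None-sentinel idiom for getting call-time behaviour.  The function names are
invented for the example.

```python
def append_def_time(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits bucket shares the same list.
    bucket.append(item)
    return bucket

def append_call_time(item, bucket=None):
    # A None sentinel moves the evaluation into the call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

def tail(seq, start=0, stop=None):
    # The same idiom lets a default depend on a parameter to its left.
    if stop is None:
        stop = len(seq)
    return seq[start:stop]
```

Calling append_def_time(1) and then append_def_time(2) returns [1] and then
[1, 2], while append_call_time returns a fresh one-element list each time.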


From  Fri Aug 23 08:56:43 2002
From: (Fredrik Lundh)
Date: Fri, 23 Aug 2002 09:56:43 +0200
Subject: [Python-Dev] Re: Automatic flex interface for Python?
References: <>
Message-ID: <007b01c24a7a$a081c580$0900a8c0@spiff>

greg wrote:

> > It's an implementation problem.  You find yourself doing a lot of
> > string accessing and pasting, creating several new objects per
> > input char.
> Not necessarily! Plex manages to do it without any
> of that.
> The trick is to leave all the characters in the input
> buffer and just *count* how many characters make up
> the next token.

you can do that without even looking at the characters?


From  Fri Aug 23 14:28:52 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 09:28:52 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: Your message of "Thu, 22 Aug 2002 18:46:56 EDT."
References: <> <> <> <> <> <> <> <>
Message-ID: <>

> Well, consider.  If your friend were to send you base64 mail, it 
> probably would *not* come from one of the spamhaus addresses in 
> bogofilter's wordlists.

Yeah, but not every spammer sends from a well-known spammer's address.

> The presence of base64 content is neutral.  That means that about the only
> way not decoding it could lead to a false positive is if the headers 
> contained spam-correlated tokens which decoding the body would have 
> countered with words having a higher non-spam loading.

Graham mentions the possibility that spammers can develop ways to make
their headers look neutral.  When I receive a base64-encoded HTML
message from Korea whose subject is "Hi", it could be from a Korean
Python hacker (there were 700 of those at a conference Christian
Tismer attended in Korea last year, so this is a realistic example),
or it could be Korean spam.  Decoding the base64 would make it
obvious.  The headers usually give some clues, but based on what makes
it through SpamAssassin (which we've been running for all
mail since February or so), base64 encoding scores high on the list of
false negatives.
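
A small sketch of why decoding matters to a token-based filter, using the
standard email package from current Python.  The message below is
fabricated for the example and is not taken from any corpus.

```python
import base64
from email import message_from_string

# A fabricated message whose body is base64-encoded.
raw = (
    "Subject: Hi\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"Buy cheap pills now").decode("ascii")
    + "\n"
)

msg = message_from_string(raw)
encoded_body = msg.get_payload()                             # what a naive tokenizer sees
decoded_body = msg.get_payload(decode=True).decode("utf-8")  # after undoing the encoding
tokens = decoded_body.split()
```

The undecoded payload contains none of the words in the body; the decoded
one yields ordinary tokens that a classifier can score.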

--Guido van Rossum (home page:

From  Fri Aug 23 14:29:55 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 09:29:55 -0400
Subject: [Python-Dev] q about default args
In-Reply-To: Your message of "Thu, 22 Aug 2002 18:27:37 EDT."
References: <>
Message-ID: <>

> Stepan Koltsov <>:
> > The question is: To be or^H^H^H^H^H^H^H^H^H Why not evaluate default
> > parameters of a function at THE function call, not at function def
> > (as is done currently)? For example, C++ (a nice language, isn't it? ;-)
> > ) evaluates default parameters at function call.
> Among other things, because that choice (what old LISP hackers like me
> call `dynamic scoping') turns out to be far more difficult to model
> mentally than Python's lexical scoping.   Forty years of LISP experience
> says Python does the right thing.

Dynamic scoping has nothing to do with it.

Nevertheless, there's no chance in hell this will ever change, so
let's drop the subject.

--Guido van Rossum (home page:

From  Fri Aug 23 14:40:47 2002
From: (Eric S. Raymond)
Date: Fri, 23 Aug 2002 09:40:47 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> The headers usually give some clues, but based on what makes
> it through SpamAssassin (which we've been running for all
> mail since February or so), base64 encoding scores high on the list of
> false negatives.

Noted.  I'll take account of that in my planning.
		<a href="">Eric S. Raymond</a>

From  Fri Aug 23 15:40:20 2002
From: (Ka-Ping Yee)
Date: Fri, 23 Aug 2002 09:40:20 -0500 (CDT)
Subject: [Python-Dev] More pydoc questions
In-Reply-To: <0d6101c24936$57d6c5f0$>
Message-ID: <>

On Wed, 21 Aug 2002, David Abrahams wrote:
> Now I get (well, I'm not sure how this will show up in your mailer, but for
> me it's full of control characters):
>     docstring_ext
> So my question is, is there a way to dump the text help for a module
> without prompting and without any extra control characters?

Hi -- sorry it took a couple of days to reply (i'm out of town).
The pydoc module contains a function for precisely this purpose --
just run the string through pydoc.plain().

    % pydoc pydoc.plain
    Python Library Documentation: function plain in pydoc

        Remove boldface formatting from text.
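
A short sketch of that in use, written against current Python, where
pydoc.render_doc() returns the help text as a string.  Using render_doc
here is an assumption about what is convenient today, not something the
original question mentions.

```python
import pydoc

# render_doc() returns the help text, with bold spelled as
# character-backspace-character overstrikes.
raw = pydoc.render_doc(len)

# plain() strips the overstrikes, leaving ordinary text.
clean = pydoc.plain(raw)
```

After this, clean contains no backspace characters and can be written to a
file or compared in a test.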

-- ?!ng

From  Fri Aug 23 15:39:09 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 10:39:09 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib,1.26,1.27
In-Reply-To: Your message of "Thu, 22 Aug 2002 23:29:59 EDT."
References: <>
Message-ID: <>

> > Rewritten using the tokenize module, which gives us a real tokenizer
> > rather than a number of approximating regular expressions.
> > Alas, it is 3-4 times slower.  Let that be a challenge for the
> > tokenize module.
> Was this just for purity, or did it fix a bug?  The regexps there
> were close to being heroically careful, and even so it was sometimes
> uncomfortably slow using the class browser in IDLE (based on
> pyclbr), and even on a fast machine.  A factor of 3 or 4 might make
> that unbearable.
> If it was for purity, note that tokenize is also based on mounds of
> regexp tricks <wink>.

It was for purity, with an eye towards future improvements (I want to
teach it more about packages and import-aliasing).  While tokenize
uses regexp tricks, they are much closer to 100% correct than those in
pyclbr.  E.g. the pyclbr regexps don't cope with continuation
backslashes (which often occur in long import statements), or comments
or expressions inside the list of superclasses.  It also didn't cope
well with 'import M as N' which is showing up more and more
frequently.  I think there are still bugs in that area, but they will
be much simpler to fix now.
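
As an illustration of what a real tokenizer buys, here is a rough sketch
(not the pyclbr code) that finds class names by walking tokenize output in
current Python.  The function name and the sample source are invented.

```python
import io
import tokenize

def class_names(source):
    """Return the names that directly follow a 'class' keyword token."""
    names = []
    prev_was_class = False
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if prev_was_class and tok.type == tokenize.NAME:
            names.append(tok.string)
        prev_was_class = tok.type == tokenize.NAME and tok.string == "class"
    return names

# A class header split over two lines, with a comment inside the
# superclass list that itself mentions the word "class".
SOURCE = (
    "class Spam(Base,  # class Decoy(\n"
    "           Other):\n"
    "    pass\n"
)
found = class_names(SOURCE)
```

The comment arrives as a single COMMENT token, so only Spam is reported.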

I was going to use this as an excuse to learn how to use the hotshot
profiler to find out if there are any bottlenecks in the tokenize

pyclbr.readmodule_ex('Tkinter') takes under 1.2 seconds on my home
machine now.  I find that acceptable (it's a lot quicker than IDLE
takes to colorize :-).

--Guido van Rossum (home page:

From  Fri Aug 23 17:33:13 2002
From: (Nathan Clegg)
Date: Fri, 23 Aug 2002 09:33:13 -0700
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

This discussion appears about over, but I haven't seen any solutions
via inheritance.  Other languages that lack true interfaces use
abstract base classes.  Python supports multiple inheritance.  Isn't
this enough?

If the basic types are turned into abstract base classes and inserted
into the builtin name space, and library and user-defined classes are
reparented to the appropriate base class, then isinstance becomes the
test for category inclusion.

A partial example:

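# AbstractClassError is not a built-in; it is defined here so the
# example runs.
class AbstractClassError(TypeError):
    """Raised when an abstract class is instantiated directly."""

class FileType:
    def __init__(self, *args, **kwargs):
        raise AbstractClassError(
            "You cannot instantiate abstract classes")

    def readline(self, *args, **kwargs):
        raise NotImplementedError(
            "Methods must be overridden by their children")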

All "file-like" objects, beginning with file itself and StringIO, can
extend FileType (or AbstractFile or File or whatever).  A function
expecting a file-like object or a filename can test the parameter to
see if it is an instance of FileType rather than seeing if it has a
readline method.
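
For readers on current Python, the same idea can be sketched with the
standard abc module, which did not exist when this was written.  The names
AbstractFile, Lines and first_line are invented for the example; this is one
possible spelling of the proposal, not an agreed design.

```python
import io
from abc import ABC, abstractmethod

class AbstractFile(ABC):
    """Category marker for file-like objects."""

    @abstractmethod
    def readline(self, size=-1):
        """Return the next line, or '' at end of input."""

class Lines(AbstractFile):
    """A tiny file-like object that inherits from the category."""

    def __init__(self, lines):
        self._lines = list(lines)

    def readline(self, size=-1):
        return self._lines.pop(0) if self._lines else ""

# Existing classes can be declared members without being reparented.
AbstractFile.register(io.StringIO)

def first_line(source):
    """Accept either a file-like object or a filename."""
    if isinstance(source, AbstractFile):
        return source.readline()
    with open(source) as f:
        return f.readline()
```

Here isinstance is the test for category inclusion, and instantiating
AbstractFile directly raises TypeError.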

Type hierarchies could be obvious (or endlessly debated):

Object --> Collection --> Sequence --> List
      \              \            \--> Tuple
      \              \            \--> String
      \              \--> Set
      \              \--> Mapping  --> Dict
      \--> FileLike   --> File
                     \--> StringIO
      \--> Number     --> Complex
                     \--> Real     --> Integer --> Long
                                  \--> Float
      \--> Iterator
      \--> Iterable


The hierarchy could be further complicated with mutability
(MutableSequence (e.g. list), ImmutableSequence (e.g. tuple, string)),
or perhaps mutability could be a property of classes or even objects
(allowing runtime marking of objects read-only? by contract? enforced?).

This seems to be a library (not language) solution to the problem
posed.  Can the low level types implemented completely in C still
descend from a python parent class without any performance hit? Can
someone please point out the inferiority or infeasibility of this
method?  Or is it just "ugly"?

Nathan Clegg

From  Fri Aug 23 17:53:02 2002
From: (Tim Peters)
Date: Fri, 23 Aug 2002 12:53:02 -0400
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <007b01c24a7a$a081c580$0900a8c0@spiff>
Message-ID: <>

[attribution lost]
>>> It's an implementation problem.  You find yourself doing a lot of
>>> string accessing and pasting, creating several new objects per
>>> input char.

[Greg Ewing]
>> Not necessarily! Plex manages to do it without any
>> of that.
>> The trick is to leave all the characters in the input
>> buffer and just *count* how many characters make up
>> the next token.

> you can do that without even looking at the characters?

1-character strings are shared; string[i] doesn't create a new object except
for the first time that character is seen.  string_item() in particular uses
the character found to index into an array of 1-character string objects.

From  Fri Aug 23 18:15:09 2002
From: (Jonathan Riehl)
Date: Fri, 23 Aug 2002 12:15:09 -0500 (CDT)
Subject: [Python-Dev] PEP 269 Implementation, rev.0
Message-ID: <Pine.BSF.4.33.0208231203560.44028-100000@localhost>

As per earlier discussions, I am going to take a whopping huge
intermission that will run August right out, and end my yearly yearnings to
expose pgen to the Python public.  Therefore, I've submitted a provisional
patch for parser people to play with until next August (*smirk*).  Get it
while it's hot (ID 599331), and still in sync with CVS (not that there
are any radical changes):

Comments are requested.


From  Fri Aug 23 18:15:27 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 13:15:27 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Fri, 23 Aug 2002 09:33:13 PDT."
References: <>
Message-ID: <>

> This discussion appears about over, but I haven't seen any solutions
> via inheritance.  Other languages that lack true interfaces use
> abstract base classes.  Python supports multiple inheritance.  Isn't
> this enough?

I haven't given up the hope that inheritance and interfaces could use
the same mechanisms.  But Jim Fulton, based on years of experience in
Zope, claims they really should be different.  I wish I understood why
he thinks so.

> If the basic types are turned into abstract base classes and inserted
> into the builtin name space, and library and user-defined classes are
> reparented to the appropriate base class, then isinstance becomes the
> test for category inclusion.
> A partial example:
> class FileType():
>       def __init__(*args, **kwargs):
> 	  raise AbstractClassError, \
> 		"You cannot instantiate abstract classes"
>       def readline(*args, **kwargs):
> 	  raise NotImplementedError, \
> 		"Methods must be overridden by their children"

Except that the readline signature should be shown here.

> All "file-like" objects, beginning with file itself and StringIO, can
> extend FileType (or AbstractFile or File or whatever).  A function
> expecting a file-like object or a filename can test the parameter to
> see if it is an instance of FileType rather than seeing if it has a
> readline method.
> Type hierarchies could be obvious (or endlessly debated):

Endlessly debated is more like it.  Do you need separate types for
readable files and writable files?  For seekable files?  For text
files?  Etc.

> Object --> Collection --> Sequence --> List
>       \              \            \--> Tuple
>       \              \            \--> String

Is a string really a collection?

>       \              \--> Set
>       \              \--> Mapping  --> Dict

How about readonly mappings?  Should every mapping support keys()?
values()?  items()?  iterkeys(), itervalues(), iteritems()?  

>       \--> FileLike   --> File
>                      \--> StringIO
>       \--> Number     --> Complex
>                      \--> Real     --> Integer --> Long

Where does short int go?

>                                   \--> Float
>       \--> Iterator
>       \--> Iterable
> etc.
> The hierarchy could be further complicated with mutability
> (MutableSequence (e.g. list), ImmutableSequence (e.g. tuple, string)),
> or perhaps mutability could be a property of classes or even objects
> (allowing runtime marking of objects read-only? by contract? enforced?).

Exactly.  Endless debate will be yours. :-)

> This seems to be a library (not language) solution to the problem
> posed.  Can the low level types implemented completely in C still
> descend from a python parent class without any performance hit?

Not easily, no, but it would be possible to put most of the abstract
hierarchy in C.

> Can someone please point out the inferiority or infeasibility of
> this method?  Or is it just "ugly"?

Agreeing on an ontology seems the hardest part to me.

--Guido van Rossum (home page:

From  Fri Aug 23 21:39:04 2002
From: (Jack Jansen)
Date: Fri, 23 Aug 2002 22:39:04 +0200
Subject: [Python-Dev] [development doc updates]
In-Reply-To: <>
Message-ID: <>

On vrijdag, augustus 23, 2002, at 07:24 , Fred L. Drake wrote:

> The development version of the documentation has been updated:
> Add documentation for the new "sets" module (thanks, Raymond!).
> Various minor additions and clarifications.

how much work would it be to make at least the html tarfile 
available too under

I'm looking at making the documentation friendly to the Mac help 
viewer (actually, Bill Fancher donated the code), and it would 
help the build process if there was a fixed URL based on the 
version number where I could always find the latest docs for the 
current version.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Fri Aug 23 21:54:46 2002
From: (Tim Peters)
Date: Fri, 23 Aug 2002 16:54:46 -0400
Subject: [Python-Dev] Questions about
Message-ID: <>

1. BaseSet contains 3 blobs like this:

    def __or__(self, other):
        """Return the union of two sets as a new set.

        (I.e. all elements that are in either set.)
        """
        if not isinstance(other, BaseSet):
            return NotImplemented
        result = self.__class__(self._data)
        result._data.update(other._data)
        return result

    def union(self, other):
        """Return the union of two sets as a new set.

        (I.e. all elements that are in either set.)
        """
        return self | other

Is there a good reason not to write the latter as just

    union = __or__


2. Is there a particular reason for not coding issuperset as

    def issuperset(self, other):
        """Report whether this set contains another set."""
        return other.issubset(self)

?  Given that issubset exists, issuperset is of marginal value anyway.

3. BaseSet._update is a darned cute example of exploiting that the iterator
returned by iter() isn't restartable!.  That isn't a question, it's just a
giggle <wink>.

4. Who believes that the __le__, __lt__, __ge__, and __gt__ methods are a
good idea?  If anything, I'd expect s <= t to mean "is subset", and "s < t"
to mean "is proper subset".  Getting the lexicographic ordering of the
underlying dicts instead doesn't seem to be of any use, except perhaps to
prevent sorting lists of sets from blowing up.  Fine by me if that blows up,

5. It's curious enough that we avoid dict.copy() in

    def copy(self):
        """Return a shallow copy of a set."""
        result = self.__class__([])
        result._data.update(self._data)
        return result

that if there's a reason to avoid it a comment would help.

6. It seems that doing

    self.__class__([])

in various places instead of

    self.__class__()

wastes time without reason (it builds a unique empty list each time, the
__init__ function then does a useless iteration dance over that, and finally
the list object is torn apart again).  If the intent is to communicate that
we're creating an empty set, IMO the latter spelling is even a bit clearer
about that (I see "[]" and keep wondering what it's trying to accomplish).

7. union_update, intersection_update, symmetric_difference_update, and
difference_update return self despite mutating in-place.  That makes them
unique among mutating container methods (e.g., list.append, list.insert,
list.remove, dict.update, list.sort, ..., return None).  Is the
inconsistency worth it?  Chaining mutating set operations isn't common, and
with names like "symmetric_difference_update()" it's a challenge to fit more
than one on a line anyway <wink>.

If it's thought that chaining mutating operations is somehow a good idea for
sets when it's not for other containers, then we really have to be
consistent about it in the sets module.  For example, then Set.add() should
return self too; indeed, set.add(elt1).add(elt2) may even be pleasant at

Or if the point was merely to create "nice names" for __ior__ etc, then,
e.g., the existing union_update should be renamed to __ior__, and
union_update defined as

    def union_update(self, other):
        self |= other

and let it return None.  In a sense this is the opposite of question #1,
where the extra code block *is* supplied but without an apparent need.

8. If there's something still valuable in _test(), I think it ought to be
moved into  "Self-testing modules" can be convenient when
developing, but after modules are deployed in the std library the embedded
tests are never run again (with the exception of module doctests, which can
easily be run via a regrtest-flavor test_xyz test, and which are so

From Samuele Pedroni" <  Fri Aug 23 22:00:10 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Fri, 23 Aug 2002 23:00:10 +0200
Subject: [Python-Dev] type categories
Message-ID: <010801c24ae8$11d16980$6d94fea9@newmexico>

some thoughts of mine

> Agreeing on an ontology seems the hardest part to me.

Why does it seem such a daunting task?

i) much of python code depends concretely on interfaces with the granularity
from one to a bunch of methods, especially code expecting base-type-like
ii) people appreciate being able to implement just the minimal subset of
methods that makes things work

[Obviously here I'm not talking about large frameworks like Zope]

We are not in a vacuum. There is Python code out there, and programmers with ideas
about what programming in Python is like.

[Maybe I just restate the obvious and repeat myself but it seems that for some
people not only type checking but explicitness about types in general is taboo
for Python code.
OTOH there _exists_ Python code that as input requires subclasses of
some _specific_ abstract classes. And even Smalltalk has - when "reasonable"
and "necessary" - a kind of interface notion, in the form of isFoo method
defined on Object and overridden to return "yes sir" down the hierarchy]

It seems to me that there is no route to the advantages of type categories
without some explicitness.

Now to the point.
[Here my target is more dispatching and "distinguishing" by type categories
than type checking, PEP 246 issue is far from orthogonal but here I will ignore
it because I don't want my head to explode <wink>]

Maybe it is obvious but anyway type categories' "problem" should be framed wrt:
- what kind of coding style we want to enable (or maybe respectively
- what problems are we solving?
- what kind of code will not work anymore?
- what "migration path" for such code, or to the new style?

Let's consider an exemplar fragment:

if hasattr(f,'write'):
 ... # needs just f.write
else: # f is not file-like

possible future styles:


if doesimplement(f,FileLike):

problems: the "exponential" ontology problem, or the problem, if we limit
ourselves only to large granularity interfaces and we interpret them
strictly[*], that the programmer must implement more methods than strictly

[*] the point is whether an interface should be interpreted as "all the
signatures implemented". Whether this is checked or enforced at class
definition time is not relevant here.


if doesimplement(f,Category('write')):

if we allow for such on-the-fly constructed interfaces (I hope one can
extrapolate what I mean with that) we maybe solve the "exponential" ontology
problem, but this code is not really an improvement over the code using
hasattr; what we are interested in is not whether f has some kind of write
method but whether f has a file-like write method. Through interfaces one wants
to check and convey commitment more than at the signature level.

Can we do better?

It would be nice to find some kind of middle-ground between hasattr(.,'write')
and large granularity strict interfaces (LGSI).

My humble ideas: these are just two points picked in that entire range
(hasattr - LGSI), other variations are maybe useful, necessary, or reasonable.

a) As a "workaround" it should be possible to declare that a class implements
an interface partially, that means some subset of it.
Then it is an open issue whether there should be ways to check both for strict
and non-strict implementation, or some general control to tweak all checkings
and/or enable warnings. [Do Zope interfaces already allow for this?]

b) OK with a) but if I want potentially to be strict and still deal with:

"i) much of python code concretely depends on interfaces with the granularity
from one to a bunch of methods"

we should let people be precise about what subset of an interface they are
implementing, like

- I'm implementing a subset of FileLike so consider the corresponding matching
- or, with finer control, I'm implementing FileLike 'write' and - no  - my
'tell' has nothing to do with file-like

and more importantly it should be possible to check for such subsets:

if doesimplement(f,PartCategory(FileLike,['write'])):
   # Yup, I know, this is ugly and begs for sugar

[And mildly interestingly such code can degrade to just check for
hasattr(.,'write') for "migration" and possibly emit warnings]
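
A minimal sketch of how such a subset check might look. Everything here is
hypothetical: FileLike, PartCategory and doesimplement are just the names used
in this message, and the __implements__ class attribute is an assumed way of
recording the declaration, not an existing API.

```python
# Hypothetical sketch: none of these names exist in the standard library.

class FileLike:
    """A large-granularity interface, reduced here to a set of method names."""
    methods = frozenset(['read', 'write', 'seek', 'tell', 'close'])


class PartCategory:
    """A named subset of an interface, e.g. FileLike restricted to 'write'."""

    def __init__(self, interface, names):
        names = frozenset(names)
        unknown = names - interface.methods
        if unknown:
            raise ValueError('not part of %s: %s'
                             % (interface.__name__, sorted(unknown)))
        self.interface = interface
        self.names = names


def doesimplement(obj, part):
    """True if obj's class declares every method in `part` for that interface."""
    declared = getattr(type(obj), '__implements__', {})
    return part.names <= declared.get(part.interface, frozenset())


class Logger:
    # "I'm implementing FileLike 'write', and my 'tell' is unrelated."
    __implements__ = {FileLike: frozenset(['write'])}

    def __init__(self):
        self.lines = []

    def write(self, text):
        self.lines.append(text)

    def tell(self):
        return 'a secret'   # has nothing to do with file positions
```

Here hasattr(Logger(), 'tell') is true, but the check against
PartCategory(FileLike, ['tell']) fails, which is the distinction between a
signature and a commitment that the message is after.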

regards, Samuele Pedroni.

PS: I know, here I have not dealt with implementation or performance problems
and the killing details.

From  Fri Aug 23 22:05:27 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 17:05:27 -0400
Subject: [Python-Dev] utf8 issue
Message-ID: <>

This might belong on SF, except it's already been solved in Python
2.3, and I need guidance about what to do for Python 2.2.2.

In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that
cannot be decoded back.  In 2.3, this is fixed.  Should this be fixed
in 2.2.2 as well?

I'm asking because it caused problems with reading .pyc files: if
there's a Unicode literal containing a lone surrogate, reading the
.pyc file causes an exception:

UnicodeError: UTF-8 decoding error: unexpected code byte

It looks like revision 2.128 fixed this for 2.3, but that patch
doesn't cleanly apply to the 2.2 maintenance branch.  Can someone

--Guido van Rossum (home page:

From  Fri Aug 23 22:21:16 2002
From: (Guido van Rossum)
Date: Fri, 23 Aug 2002 17:21:16 -0400
Subject: [Python-Dev] Questions about
In-Reply-To: Your message of "Fri, 23 Aug 2002 16:54:46 EDT."
References: <>
Message-ID: <>

> 1. BaseSet contains 3 blobs like this:
>     def __or__(self, other):
>         """Return the union of two sets as a new set.
>         (I.e. all elements that are in either set.)
>         """
>         if not isinstance(other, BaseSet):
>             return NotImplemented
>         result = self.__class__(self._data)
>         result._data.update(other._data)
>         return result
>     def union(self, other):
>         """Return the union of two sets as a new set.
>         (I.e. all elements that are in either set.)
>         """
>         return self | other
> Is there a good reason not to write the latter as just
>     union = __or__
> ?

Yes.  It was written like that before!  But in order to be a good
citizen in the world of binary operators, __or__ should not raise
TypeError; it should return NotImplemented if the other argument is
not a set (since the other argument might implement __ror__ and know
how to or itself with a set).

But union(), which is a normal function, should not return
NotImplemented, which would confuse the user.  So they have to be
different.  I thought it would be best for union() to use the |
operator so that if the other argument implements __ror__, union()
will acquire this ability.
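
A small sketch of the difference, assuming a cut-down BaseSet and a
hypothetical foreign type Bag that only exists for this illustration:

```python
class BaseSet:
    """Cut-down stand-in for the sets module class discussed above."""

    def __init__(self, items=()):
        self._data = dict.fromkeys(items, True)

    def __or__(self, other):
        if not isinstance(other, BaseSet):
            # Give the other operand a chance via its __ror__.
            return NotImplemented
        result = self.__class__(self._data)
        result._data.update(other._data)
        return result

    def union(self, other):
        # Goes through the operator, so a foreign __ror__ is honoured and
        # an unsupported operand raises TypeError instead of leaking
        # NotImplemented to the caller.
        return self | other


class Bag:
    """Hypothetical type that knows how to 'or' itself with a set."""

    def __init__(self, items):
        self.items = list(items)

    def __ror__(self, other):
        return other.__class__(list(other._data) + self.items)
```

With "union = __or__", BaseSet([1]).union(42) would hand NotImplemented back
to the caller; spelled as above it raises TypeError, and
BaseSet([1, 2]).union(Bag([2, 3])) works because Bag supplies __ror__.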

> 2. Is there a particular reason for not coding issuperset as
>     def issuperset(self, other):
>         """Report whether this set contains another set."""
>         self._binary_sanity_check(other)
>         return other.issubset(self)
> ?  Given that issubset exists, issuperset is of marginal value anyway.

The original code didn't have issuperset(), I added it for symmetry.
Spelling it out saves two calls: one to _binary_sanity_check(), one to

> 3. BaseSet._update is a darned cute example of exploiting that the iterator
> returned by iter() isn't restartable!.  That isn't a question, it's
> just a giggle <wink>.

Yes, I like it. :-)

> 4. Who believes that the __le__, __lt__, __ge__, and __gt__ methods are a
> good idea?  If anything, I'd expect s <= t to mean "is subset", and
> "s < t" to mean "is proper subset".  Getting the lexicographic
> ordering of the underlying dicts instead doesn't seem to be of any
> use, except perhaps to prevent sorting lists of sets from blowing
> up.  Fine by me if that blows up, though.

Greg Wilson added these when he made the class inherit from dict,
presumably because without any further measures, sets would be
comparable to dicts using the default dict comparison.

That design choice was later undone by Alex (at my suggestion), but he
fixed the comparisons rather than removing them.

I think using <=, < etc. to spell issubset and isstrictsubset would be

> 5. It's curious enough that we avoid dict.copy() in
>     def copy(self):
>         """Return a shallow copy of a set."""
>         result = self.__class__([])
>         result._data.update(self._data)
>         return result
> that if there's a reason to avoid it a comment would help.

Raymond asked me whether to use copy() or update().  I looked at the
code and found that they execute almost the same code, with about one
instruction more per item for update().  But for small sets, copy()
allocates a new dict (and the old one is thrown away).  I thought that
that might be more important than saving an instruction per item.

> 6. It seems that doing
>     self.__class__([])
> in various places instead of
>     self.__class__()
> wastes time without reason (it builds a unique empty list each time,
> the __init__ function then does a useless iteration dance over that,
> and finally the list object is torn apart again).  If the intent is
> to communicate that we're creating an empty set, IMO the latter
> spelling is even a bit clearer about that (I see "[]" and keep
> wondering what it's trying to accomplish).

That was when ImmutableSet() required an argument.  It can be left out

> 7. union_update, intersection_update, symmetric_difference_update,
> and difference_update return self despite mutating in-place.  That
> makes them unique among mutating container methods (e.g.,
> list.append, list.insert, list.remove, dict.update, list.sort, ...,
> return None).  Is the inconsistency worth it?  Chaining mutating set
> operations isn't common, and with names like
> "symmetric_difference_update()" it's a challenge to fit more than
> one on a line anyway <wink>.
> If it's thought that chaining mutating operations is somehow a good
> idea for sets when it's not for other containers, then we really
> have to be consistent about it in the sets module.  For example,
> then Set.add() should return self too; indeed,
> set.add(elt1).add(elt2) may even be pleasant at times.
> Or if the point was merely to create "nice names" for __ior__ etc,
> then, e.g., the existing union_update should be renamed to __ior__,
> and union_update defined as
>     def union_update(self, other):
>         """yadda"""
>         self |= other
> and let it return None.  In a sense this is the opposite of question #1,
> where the extra code block *is* supplied but without an apparent need.

You guessed right.  That's the best solution IMO.
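
A sketch of that arrangement, assuming a simplified Set class (not the actual
sets module source):

```python
class Set:
    """Simplified stand-in; only the two methods under discussion."""

    def __init__(self, items=()):
        self._data = dict.fromkeys(items, True)

    def __ior__(self, other):
        if not isinstance(other, Set):
            return NotImplemented
        self._data.update(other._data)
        return self          # augmented assignment must return the result

    def union_update(self, other):
        """Update this set in place with the elements of other."""
        self |= other        # rebinds only the local name; same object
        # no return statement: the method returns None, like list.append
```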

> 8. If there's something still valuable in _test(), I think it ought
> to be moved into  "Self-testing modules" can be
> convenient when developing, but after modules are deployed in the
> std library the embedded tests are never run again (with the
> exception of module doctests, which can easily be run via a
> regrtest-flavor test_xyz test, and which are so invoked).

Please toss it.

--Guido van Rossum (home page:

From  Fri Aug 23 22:31:47 2002
From: (Tim Peters)
Date: Fri, 23 Aug 2002 17:31:47 -0400
Subject: [Python-Dev] Questions about
In-Reply-To: <>
Message-ID: <>

[Guido, answers Tim's set questions]

Thanks!  It was enlightening, I believe I understood it all without the urge
to fight back <wink>, and I'll make changes accordingly (maybe not today,
but before Monday).

I forgot to say this the first time:  it's a very nice module!  Kudos to
Greg, Alex, you and Raymond.

From  Fri Aug 23 23:31:23 2002
From: (Fred L. Drake, Jr.)
Date: Fri, 23 Aug 2002 18:31:23 -0400
Subject: [Python-Dev] [development doc updates]
In-Reply-To: <>
References: <>
Message-ID: <>

Jack Jansen writes:
 > how much work would it be to make at least the html tarfile 
 > available too under

Are you looking for the tarfile or for an online documentation set?

 > I'm looking at making the documentation friendly to the Mac help 
 > viewer (actually, Bill Fancher donated the code), and it would 
 > help the build process if there was a fixed URL based on the
 > version number where I could always find the latest docs for the 
 > current version.

Is there online documentation for the Mac OS help viewer?  I don't
know anything about it.


Fred L. Drake, Jr.  <fdrake at>
PythonLabs at Zope Corporation

From  Sat Aug 24 03:52:53 2002
From: (Jeremy Hylton)
Date: Fri, 23 Aug 2002 22:52:53 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

  >> This discussion appears about over, but I haven't seen any
  >> solutions via inheritance.  Other languages that lack true
  >> interfaces use abstract base classes.  Python supports multiple
  >> inheritance.  Isn't this enough?

  GvR> I haven't given up the hope that inheritance and interfaces
  GvR> could use the same mechanisms.  But Jim Fulton, based on years
  GvR> of experience in Zope, claims they really should be different.
  GvR> I wish I understood why he thinks so.

Here's a go at explaining why the mechanisms need to be separate.  I'm
loathe to channel Jim, but I think he'd agree.

We'd like to use interfaces to make fairly strong claims.  If a class
A implements an interface I, then we should be able to use an instance
of A anywhere that an I is needed.  This is just the straightforward
notion of substitutability.  I'm saying this is a strong claim because 
we want an A to behave like an I.  By behave, I mean that the
interface I can describe behavior beyond just a method name or

Why can't we use the current inheritance mechanism to implement the
interface concept?  Because the inheritance mechanism is too general.
If we take the class A, anyone can create a subclass of it, regardless
of whether that subclass implements I.

Say you wanted to write LBYL code that tests whether an object
implements an interface.  If you use a marker class and isinstance()
for the test, the inheritance rules make it impossible to express
some relationships.  In particular, it is impossible to write a class
B that inherits from A, but does not implement I.  Since our test is
isinstance(), any subclass of A will appear to implement I.  This is
unfortunate, because inheritance is a great implementation trick that
shouldn't have anything to do with the interface.

Let's think about it briefly in terms of types.  (Python doesn't have
the explicit types, but sometimes we reason about programs as if they
did.)  Strongly typed OO languages have to deal in some way with
subclasses that are not subtypes.  Some type systems require
covariance or contravariance or invariance.  In some cases, you can
write a class that is a subclass but is not a subtype.  The latter is
what we're hoping to achieve with interfaces.

If we imagined an interface statement that was explicit and not
inherited, then we'd be okay.

class A(SomeBase):
    implements I

class B(A):
    implements J

Now we've got a class A that implements I and a subclass B that
implements J.  The test isinstance(B(), A) is true, but the test
implements(B(), I) is not.  It's quite helpful to have the
implements() predicate which uses a rule different from isinstance().

If we don't have the separate interface concept, the language just
isn't as expressive.  We would have to establish a convention to
sacrifice one of -- a) being able to inherit from a class just for
implementation purposes or b) being able to reason about interfaces
using isinstance().  a) is error prone, because the language wouldn't
prevent anyone from making the mistake.  b) is unfortunate, because
we'd have interfaces but no formal way to reason about them.
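
The "implements" statement above is not real syntax. As a rough sketch of the
same rule in today's Python, assume the declaration is stored in a class
attribute named __implements__ (a hypothetical convention) and that the
predicate looks only at the class's own namespace, never at its bases:

```python
class I:
    """Marker for interface I (hypothetical)."""


class J:
    """Marker for interface J (hypothetical)."""


def implements(obj, interface):
    """True only if obj's own class declares the interface explicitly."""
    # vars() sees the class's own namespace, so declarations made by
    # base classes are deliberately not inherited.
    return interface in vars(type(obj)).get('__implements__', ())


class A:
    __implements__ = (I,)

    def helper(self):
        return 'shared implementation'


class B(A):                 # inherits A's code, but not its claim about I
    __implements__ = (J,)


class C(A):                 # declares nothing, so it implements nothing
    pass
```

isinstance(B(), A) is true while implements(B(), I) is false, which is the
separation argued for above.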


From  Sat Aug 24 05:44:16 2002
From: (Eric S. Raymond)
Date: Sat, 24 Aug 2002 00:44:16 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
>                P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
> This isn't what Graham computes, though:  the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
>            P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)

> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.

OK.  So, maybe I'm just being stupid, but this seems easy to solve.
We already *have* estimates of P(S) and P(not-S) -- we have a message
count associated with both wordlists.  So why not use the running
ratios between 'em?

As long as we initialize with "good" and "bad" corpora that are approximately
the same size, this should work no worse than the equiprobability assumption.
The ratios will correct in time based on incoming traffic.

Oh, and do you mind if I use your algebra as part of bogofilter's
		Eric S. Raymond

From  Sat Aug 24 06:26:12 2002
From: (Tim Peters)
Date: Sat, 24 Aug 2002 01:26:12 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

>                P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

[Eric S. Raymond]
> ...
> OK.  So, maybe I'm just being stupid, but this seems easy to solve.
> We already *have* estimates of P(S) and P(not-S) -- we have a message
> count associated with both wordlists.  So why not use the running
> ratios between 'em?

a. There are other fudges in the code that may rely on this fudge
   to cancel out, intentionally or unintentionally.  I'm loathe to
   type more about this instead of working on the code, because I've
   already typed about it.  See a later msg for a concrete example of
   how the factor-of-2 "good count" bias acts in part to counter the
   distortion here.  Take one away, and the other(s) may well become
   "a problem".

b. Unless the proportion of spam to not-spam in the training sets
   is a good approximation to the real-life ratio of spam to not-
   spam, it's also dubious to train the system with bogus P(S) and
   P(not-S) values.

c. I'll get back to this when our testing infrastructure is trustworthy.
   At the moment I'm hosed because the spam corpus I pulled off the
   web turns out to be trivial to recognize in contrast to Barry's
   corpus of good msgs from mailing lists:  every msg in the
   spam corpus has stuff about the fellow who collected the spam in the
   headers, while nothing in the corpus does; contrarily,
   every msg in the corpus has header info not in
   the spam corpus headers.  This is an easy way to get 100% precision
   and 100% recall, but not particularly realistic -- the rules it's
   learning are of the form "it's spam if and only if it's addressed to
   bruceg"; "it's not spam if and only if the headers contain
   'List-Unsubscribe'"; etc.  The learning can't be faulted, but the
   teacher can <wink>.

d. I only exposed the math for the two-word case above, and the
   generalization to n words may not be clear from the final result
   (although it's clear enough if you back off a few steps).  If there
   are n words, w[0] thru w[n-1]:

   prod1 <- product for i in range(n) of P(S|w[i])/P(S)
   prod2 <- product for i in range(n) of (1-P(S|w[i]))/(1-P(S))
   result <- prod1*P(S) / (prod1*P(S) + prod2*(1-P(S)))

   That's if you're better set up to experiment now.  If you do this,
   the most interesting thing to see is whether results get better or
   worse if you *also* get rid of the artificial "good count" boost by
   the factor of 2.
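
   The three lines in (d) transcribe directly into Python. This is only a
   sketch: the function name combine and its argument names are made up here,
   and it assumes every probability is strictly between 0 and 1 so that no
   division by zero occurs.

```python
def combine(word_probs, prior=0.5):
    """Combine per-word spam probabilities into one score.

    word_probs -- the values P(S|w[i]) for the n words
    prior      -- P(S); with 0.5 this reduces to Graham's formulation
    """
    prod1 = 1.0
    prod2 = 1.0
    for p in word_probs:
        prod1 *= p / prior
        prod2 *= (1.0 - p) / (1.0 - prior)
    numerator = prod1 * prior
    return numerator / (numerator + prod2 * (1.0 - prior))
```

   With a single word the prior cancels and the result is just P(S|w[0]),
   which is a handy sanity check when experimenting with other priors.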

> As long as we initialize with "good" and "bad" corpora that are
> approximately the same size, this should work no worse than the
> equiprobability assumption.

"not spam" is already being given an artificial boost in a couple of ways.
Given that in real life most people still get more not-spam than spam,
removing the counter-bias in the scoring math may boost the false negative

> The ratios will correct in time based on incoming traffic.

Depends on how training is done.

> Oh, and do you mind if I use your algebra as part of bogofilter's
> documentation?

Not at all, although if you wait until we get our version of this ironed
out, you'll almost certainly be able to pull an anally-proofread version out
of a plain-text doc file I'll feel compelled to write <wink>.

From  Sat Aug 24 07:44:27 2002
From: (Guido van Rossum)
Date: Sat, 24 Aug 2002 02:44:27 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Fri, 23 Aug 2002 22:52:53 EDT."
References: <> <> <>
Message-ID: <>

> If we don't have the separate interface concept, the language just
> isn't as expressive.  We would have to establish a convention to
> sacrifice one of -- a) being able to inherit from a class just for
> implementation purposes or b) being able to reason about interfaces
> using isinstance().  a) is error prone, because the language
> wouldn't prevent anyone from making the mistake.  b) is unfortunate,
> because we'd have interfaces but no formal way to reason about them.

So the point is that it's possible to have a class D that picks up
interface I somewhere in its inheritance chain, by inheriting from a
class C that implements I, where D doesn't actually satisfy the
invariants of I (or of C, probably).

I can see that that is a useful feature.  But it shouldn't have to
preclude us from using inheritance for interfaces, if there was a way
to "shut off" inheritance as far as isinstance() (or issubclass())
testing is concerned.  C++ does this using private inheritance.  Maybe
we can add a similar convention to Python for denying inheritance from
a given class or interface.

Why do I keep arguing for inheritance?  (a) the need to deny inheritance
from an interface, while essential, is relatively rare IMO, and in
*most* cases the inheritance rules work just fine; (b) having two
separate but similar mechanisms makes the language larger.

For example, if we ever are going to add argument type declarations to
Python, it will probably look like this:

    def foo(a: classA, b: classB):

It would be convenient if this could be *defined* as

    assert isinstance(a, classA) and isinstance(b, classB)

so that programs that have a simple class hierarchy can use their
classes directly as argument types, without having to go through the
trouble of declaring a parallel set of interfaces.
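
As an illustration only, the "defined as an assert" reading can be sketched
with a decorator in a Python that supports function annotations. The
decorator name checked and the classes below are hypothetical, and the sketch
assumes every annotation is a plain class:

```python
import functools
import inspect


def checked(func):
    """Assert isinstance(argument, annotation) for each annotated parameter."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            if expected is not inspect.Parameter.empty:
                assert isinstance(value, expected), (
                    '%s must be a %s' % (name, expected.__name__))
        return func(*args, **kwargs)

    return wrapper


class ClassA:
    pass


class ClassB:
    pass


class SubA(ClassA):         # plain subclassing is enough to be accepted
    pass


@checked
def foo(a: ClassA, b: ClassB):
    return 'ok'
```

foo(SubA(), ClassB()) passes because the test is isinstance(), so a simple
class hierarchy needs no parallel set of interfaces.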

I also think that it should be possible to come up with a set of
standard "abstract" classes representing concepts like number,
sequence, etc., in which the standard built-in types are nicely

--Guido van Rossum (home page:

From  Sat Aug 24 10:03:51 2002
From: (Eric S. Raymond)
Date: Sat, 24 Aug 2002 05:03:51 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> a. There are other fudges in the code that may rely on this fudge
>    to cancel out, intentionally or unintentionally.  I'm loathe to
>    type more about this instead of working on the code, because I've
>    already typed about it.  See a later msg for a concrete example of
>    how the factor-of-2 "good count" bias acts in part to counter the
>    distortion here.  Take one away, and the other(s) may well become
>    "a problem".

I was thinking of shooting that "goodness bias" through the head and seeing
what happens, actually. I've been unhappy with that fudge in Paul's original
formula from the beginning.

> b. Unless the proportion of spam to not-spam in the training sets
>    is a good approximation to the real-life ratio of spam to not-
>    spam, it's also dubious to train the system with bogus P(S) and
>    P(not-S) values.

Right -- which is why I want to experiment with actually *using* the
real life running ratio.

> c. I'll get back to this when our testing infrastructure is trustworthy.
>    At the moment I'm hosed because the spam corpus I pulled off the
>    web turns out to be trivial to recognize in contrast to Barry's
>    corpus of good msgs from mailing lists: 

Ouch.  That's a trap I'll have to watch out for in handling other
peoples' corpora.
		Eric S. Raymond

From  Sat Aug 24 10:31:45 2002
From: (Alex Martelli)
Date: Sat, 24 Aug 2002 11:31:45 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Saturday 24 August 2002 08:44 am, Guido van Rossum wrote:
> For example, if we ever are going to add argument type declarations to
> Python, it will probably look like this:
>     def foo(a: classA, b: classB):
>         ...body...
> It would be convenient if this could be *defined* as
>     assert isinstance(a, classA) and isinstance(b, classB)

I was hoping this could be defined as:

    a = adapt(a, classA)
    b = adapt(b, classB)

but I fully agree that, if we have an elegant way to say "I'm inheriting JUST
for implementation, don't let that be generally known" (C++'s private
inheritance is not a perfect mechanism, because in C++ 'private' affects
_accessibility_, not _visibility_, sigh), it can indeed be handier and more
productive to have interfaces and classes merge into each other rather
than be completely separate.  Substantial experience with C++ (merged)
and Java (separate) suggests that to me.  From the point of view of
the hypothetical 'adapt', that suggests the 'protocol' argument should
be allowed to be a type (class), rather than an 'interface' that is a
different entity than a type, and also the useful implication:

    isinstance(i, T) ==> adapt(i, T) is i
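
A heavily simplified sketch of such an adapt(), assuming the protocol is a
class and that objects may offer a __conform__ hook. It leaves out most of
what a full adaptation proposal would need, and the Celsius class is made up
for the example:

```python
def adapt(obj, protocol):
    """Return obj itself if it already is a protocol, else try to convert."""
    if isinstance(obj, protocol):
        return obj                      # isinstance(i, T) ==> adapt(i, T) is i
    conform = getattr(type(obj), '__conform__', None)
    if conform is not None:
        result = conform(obj, protocol)
        if result is not None:
            return result
    raise TypeError('cannot adapt %r to %s' % (obj, protocol.__name__))


class Celsius:
    """Hypothetical type that can present itself as a float."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __conform__(self, protocol):
        if protocol is float:
            return float(self.degrees)
        return None
```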

> so that programs that have a simple class hierarchy can use their
> classes directly as argument types, without having to go through the
> trouble of declaring a parallel set of interfaces.

The "parallel set of interfaces" (which I had to do in Java) *was*
indeed somewhat of a bother.  Any time you need to develop and
maintain two separate but strongly parallel trees (here, one of
interfaces, and a separate parallel one of typical/suggested partial
or total implementations to be used e.g. in inner classes that
supply those interfaces), you're in for a spot of trouble.  I even did
some of that with a hand-kludged "code generator" which read a
single description file and generated both the interface AND the class
from it (but then of course I ended up editing the generated code
and back in trouble again when maintenance was needed -- seems
to happen regularly to me with code generators).  Surely making the target 
language directly able to digest a unified description would be nicer.

> I also think that it should be possible to come up with a set of
> standard "abstract" classes representing concepts like number,
> sequence, etc., in which the standard built-in types are nicely
> embedded.

If you manage to pull that off, it will be a WONDERFUL trick indeed.


From David Abrahams" <  Sat Aug 24 12:33:41 2002
From: David Abrahams" < (David Abrahams)
Date: Sat, 24 Aug 2002 07:33:41 -0400
Subject: [Python-Dev] type categories
References: <> <> <>              <>  <>
Message-ID: <007501c24b62$1ae61640$>

From: "Guido van Rossum" <>

> > If we don't have the separate interface concept, the language just
> > isn't as expressive.  We would have to establish a convention to
> > sacrifice one of -- a) being able to inherit from a class just for
> > implementation purposes or b) being able to reason about interfaces
> > using isinstance().  a) is error prone, because the language
> > wouldn't prevent anyone from making the mistake.  b) is unfortunate,
> > because we'd have interfaces but no formal way to reason about them.
> So the point is that it's possible to have a class D that picks up
> interface I somewhere in its inheritance chain, by inheriting from a
> class C that implements I, where D doesn't actually satisfy the
> invariants of I (or of C, probably).


>     def foo(a: classA, b: classB):
>         ...body...
> It would be convenient if this could be *defined* as
>     assert isinstance(a, classA) and isinstance(b, classB)
> so that programs that have a simple class hierarchy can use their
> classes directly as argument types, without having to go through the
> trouble of declaring a parallel set of interfaces.
> I also think that it should be possible to come up with a set of
> standard "abstract" classes representing concepts like number,
> sequence, etc., in which the standard built-in types are nicely
> embedded.

Ah, but not all numbers are created equal! Can I write:

    x << 1

? Not if x is a float. Somebody will eventually want to categorize numeric
types more-finely, e.g. Monoid, Euclidean Ring, ...

It sounds to my C++ ear like you're trying to make this analogous to
runtime polymorphism in C++. I think Python's polymorphism is a lot closer
to what we do at compile-time in C++, and it should stay that way: no
inheritance relationship needed... at least, not on the surface. Here's
why: people inevitably discover new type categories in the objects and
types they're already using. In C++ this happened when Stepanov et al
discovered that built-in pointers matched his mental model of random-access
iterators. A similar thing will happen in Python when you make all numbers
inherit from Number but someone wants to impose the real mathematical
categories (or heck: Integer vs. Fractional) on them.

What Stepanov's crew did was to invent iterator traits, which decouple
the type's category from the type itself. Each category is represented by a
class, and those classes do have an inheritance relationship (i.e. every
random_access_iterator IS-A bidirectional_iterator). Actually, I have no
problem with collecting type category info from an object's MRO: as Guido
implies, that will often be the simplest way to do it. However, I think
there ought to be a parallel mechanism which allows additional
categorization non-intrusively, and it was my understanding that the PEP
Alex has been promoting does that.
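
To make the idea concrete, here is a rough Python sketch of a parallel,
non-intrusive mechanism. The category classes and the declare/in_category
helpers are invented for this illustration; they are not taken from any PEP:

```python
class Number:
    """Root category."""


class Integral(Number):
    """Numbers that support shifting, exact division, etc."""


class Fractional(Number):
    """Numbers with a fractional part."""


_categories = {}    # maps an existing type to the categories declared for it


def declare(tp, category):
    """Attach a category to an existing type without modifying the type."""
    _categories.setdefault(tp, set()).add(category)


def in_category(obj, category):
    """Check the MRO first, then the external declarations."""
    for klass in type(obj).__mro__:
        if issubclass(klass, category):
            return True
        if any(issubclass(c, category) for c in _categories.get(klass, ())):
            return True
    return False


# The built-in types know nothing about these categories.
declare(int, Integral)
declare(float, Fractional)


def shift_left(x):
    if not in_category(x, Integral):
        raise TypeError('x << 1 needs an Integral')
    return x << 1
```

Because the categories themselves form a hierarchy, declaring int as Integral
also makes it a Number, much as every random_access_iterator is a
bidirectional_iterator.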


           David Abrahams * Boost Consulting *

From Samuele Pedroni" <  Sat Aug 24 14:00:57 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Sat, 24 Aug 2002 15:00:57 +0200
Subject: [Python-Dev] it seems we need types-sig back:)
Message-ID: <007c01c24b6e$49bffd80$6d94fea9@newmexico>


From  Sat Aug 24 14:14:22 2002
From: (David Abrahams)
Date: Sat, 24 Aug 2002 09:14:22 -0400
Subject: [Python-Dev] convertibility and "Pythonicity"
Message-ID: <00ba01c24b70$304035d0$>

[Is "Pythonicity" the right word?]

I'm interested in getting some qualitative feedback about something I'm
doing in Boost.Python. The questions are,
    1. How well does this behavior match up with what Python users have
probably come to expect?
    2. (related, I hope!) How close is it to the intended design of Python?

When wrapping a C++ function that expects a float argument, I thought it
would be bizarre if people couldn't pass a Python int. Well, Python ints
have a lovely __float__ function which can be used to convert them to
floats. Following that idea to its "logical" conclusion led me to where I
am today: when matching a formal argument corresponding to one of the
built-in Python types, first use the corresponding conversion slot.

That could lead to some surprising behaviors:

    char index(const char* s, int n); // wrapped using Boost.Python

    >>> index('foobar', 2)    # ok

    >>> index(3.14, 1.2)      # Weird (floats have __str__)
    >>> index([1, 3, 5], 0.0) # Super weird (everything has __str__)

So I went back and tried some "obvious" tests in Python 2.2.1:

    >>> 'foobar'[3.0]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: sequence index must be integer

Well, I had expected this to work, so I'm beginning to re-think my "liberal
conversion" policy. It seems like Python itself isn't using these slots to
do "implicit conversion". But then:

    >>> 'foobar'[3L]

[The int/long unification I've heard about hasn't happened yet, has it?]


    >>> range(3.3, 10.3)
    [3, 4, 5, 6, 7, 8, 9]


    >>> range('1', '5')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: an integer is required

Now I note that strings don't have __int__, so I guess the int type handles
int('42') itself using special knowledge about strings. I suppose that's to
keep strings from seeming to be numbers, since the nb_int slot fills in the


    >>> class zero(object):
    ...     def __int__(self): return 0
    >>> range(zero(), 5)
    [0, 1, 2, 3, 4]

So, is there any general practice, (even if it's not universal)? Do Python
functions usually tend to coerce their arguments into the types they're
expecting? I'm guessing the answer is no...
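
For comparison only, and assuming a current Python 3 interpreter rather than
the 2.2.1 used in the session above: later versions grew a separate __index__
slot, reachable through operator.index(), for objects that really are
integers, as opposed to objects that merely can be converted with __int__.
The class Zero and the helper as_index are made up for this sketch:

```python
import operator


class Zero:
    """An integer-like object: it answers __index__, not just __int__."""

    def __index__(self):
        return 0


def as_index(x):
    """Accept only objects that are integer-like, not merely convertible."""
    return operator.index(x)
```

as_index(Zero()) returns 0 and as_index(3.0) raises TypeError, so a wrapper
layer that wants conservative matching could apply the same test to arguments
bound to integer parameters.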

           David Abrahams * Boost Consulting *

From Samuele Pedroni" <  Sat Aug 24 15:00:28 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Sat, 24 Aug 2002 16:00:28 +0200
Subject: [Python-Dev] convertibility
Message-ID: <00b901c24b76$9aa4a220$6d94fea9@newmexico>

FYI, in _Jython_ given the Java class

public class A {

public void fi(int i) {}

public void fd(double d) {}

public void fs(String s) {}

}


import A


a.fi one can call with a Python integer or long

a.fd with a Python integer or long or float

a.fs only with a Python string

[yes with type categories or adapt we could do better,
but the design prefers to minimize unexpected behaviour,
and in practice it is not too constraining]


From  Sat Aug 24 15:30:26 2002
From: (Magnus Lie Hetland)
Date: Sat, 24 Aug 2002 16:30:26 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Wed, Aug 21, 2002 at 05:05:27AM -0400
References: <> <> <> <1029917815.581.3.camel@winterfell> <> <1029918967.582.13.camel@winterfell> <>
Message-ID: <>

Eric S. Raymond <>:
> Makes sense.  Hardware designers care a lot about reduction to disjunctive
> normal form. Much more than logicians do, actually.
> > Hmm, I just realized that I've also seen it in an American book on
> > discrete maths, so it's not just us Swedes ;)
> Odd that I haven't encountered it.

Indeed. I thought this was quite standard when working with digital
circuits etc...

And -- I don't quite see why we're talking about Boolean algebra in
general here, when we're specifically looking for set operators... Oh,

Magnus Lie Hetland                                  The Anygui Project                        

From  Sat Aug 24 15:33:08 2002
From: (Magnus Lie Hetland)
Date: Sat, 24 Aug 2002 16:33:08 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Tue, Aug 20, 2002 at 08:56:05PM -0400
References: <Pine.SOL.4.44.0208191321290.27171-100000@death.OCF.Berkeley.EDU> <> <> <> <> <> <>
Message-ID: <>

Eric S. Raymond <>:
> Guido van Rossum <>:
> > Um, the notation is '|' and '&', not 'or' and 'and', and those are
> > what I learned in school.  Seems pretty conventional to me (Greg
> > Wilson actually tried this out on unsuspecting newbies and found that
> > while '+' worked okay, '*' did not -- read the PEP).
> +1 on preferring | and & to `or' and `and'.  To me, `or' and `and' say
> that what's being composed are predicates, not sets.

I concur completely. Using 'or' and 'and' seems close to overriding
'is' (although that's impossible, of course) to me. To me, the

  set1 and set2

should return the first set, if it is empty, or the second set, if the
first one is not empty. Suddenly having their intersection would be very
surprising, I think. For

  set1 & set2

to return their intersection, however, is very consistent with

  int1 & int2
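
A minimal sketch of that distinction, assuming the built-in set type of
later Python versions as a stand-in for the Set class under discussion
(the variable names are illustrative only):

```python
set1 = {1, 2, 3}
set2 = {2, 3, 4}
empty = set()

# 'and' never looks inside the sets; it only tests truthiness.
picked_when_first_nonempty = set1 and set2   # set1 is truthy, so this is set2
picked_when_first_empty = empty and set2     # empty is falsy, so this is empty

# '&' is a real operator, defined by each type.
intersection = set1 & set2    # elements common to both sets
bitwise = 0b1100 & 0b1010     # bits common to both integers
```

Because 'and' cannot be overridden, only '&' can be given set semantics
without changing what the boolean operators already mean.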

Magnus Lie Hetland                                  The Anygui Project                        

From  Sat Aug 24 15:38:48 2002
From: (Magnus Lie Hetland)
Date: Sat, 24 Aug 2002 16:38:48 +0200
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>; from on Thu, Aug 22, 2002 at 11:12:57AM -0400
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

Guido van Rossum <>:
> Have you got a use case for membership tests of a cartesian product?

Not that I can think of at the moment, no :-)

I guess the idea was to use lazy sets for some such operations. Then
you could build complex expressions through cartesian products,
unions, intersections, set differences, set comprehensions etc.
without actually constructing the full set. Checking for membership or
iterating over (or even constructing, after all the operations have
been applied) such a set might be useful, I'm sure... You could
implement joins with cartesian products without terrible performance
penalties etc...

But I guess this sort of thing might as well go into some other module
somewhere (probably outside the libs). It was just a thought.

Magnus Lie Hetland                                  The Anygui Project                        

From  Sat Aug 24 15:44:36 2002
From: (Oren Tirosh)
Date: Sat, 24 Aug 2002 10:44:36 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <>
Message-ID: <>

On Sat, Aug 24, 2002 at 02:44:27AM -0400, Guido van Rossum wrote:
> Why do I keep arguing for inheritance?  (a) the need to deny inheritance
> from an interface, while essential, is relatively rare IMO, and in
> *most* cases the inheritance rules work just fine; (b) having two
> separate but similar mechanisms makes the language larger.

Inheriting the implementation without implementing the same interfaces
is only one reason to want an interface mechanism that is not 100% 
tied to inheritance.  Objects written by different authors on two sides 
of the globe often implement the same protocol without actually inheriting 
the definition from a common module. I can pass these objects to a method 
expecting this protocol and it will work just fine (most of the time...)

I would like to be able to declare that I need an object with a specific 
interface even if the object was written long before and I don't want to 
modify an existing library just to make it conform to my interface names.

Strictly defined named interfaces like Zope interfaces are also important
but most of the interfaces I use in everyday programming are more ad-hoc
in nature and are often defined retroactively.

> For example, if we ever are going to add argument type declarations to
> Python, it will probably look like this:
>     def foo(a: classA, b: classB):
>         ...body...
> It would be convenient if this could be *defined* as
>     assert isinstance(a, classA) and isinstance(b, classB)

In your Optional Static Typing presentation slides you define "type
expressions".  If the isinstance method accepted a type expression 
object as its second argument this assertion would work for interfaces 
that are not defined by strict hierarchical inheritance.

> so that programs that have a simple class hierarchy can use their
> classes directly as argument types, without having to go through the
> trouble of declaring a parallel set of interfaces.

...and classes could be used too. They are just type expressions that
match a single class.

BTW, isinstance already supports a simple form of this: a tuple is
interpreted as an "OR" type expression. You can say that isinstance 
returns True if the object is an instance of one of the types matched 
by the type expression.
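
A short sketch of the tuple form mentioned above; the describe() helper
and its category names are made up for illustration:

```python
def describe(value):
    """Classify a value using tuples of types as 'OR' type expressions."""
    # isinstance returns True if value is an instance of ANY type in the tuple.
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, (str, bytes)):
        return "text"
    return "other"
```

For example, describe(3) and describe(2.5) both give "number", while
describe([]) gives "other".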


From  Sat Aug 24 16:00:52 2002
From: (Magnus Lie Hetland)
Date: Sat, 24 Aug 2002 17:00:52 +0200
Subject: [Python-Dev] Set naming
Message-ID: <>

By naming the new set module sets and the class Set the parallel to
the array module is broken. I guess that's not a problem -- I just thought
I'd mention it. Naming the module "set" would be more analogous to
"array", and having "set" as an alias for "Set" would let people
switch to a possible future type with the same name by commenting out
their import statements...

But then again, I guess my little mind is infested with hobgoblins ;)

Magnus Lie Hetland                                  The Anygui Project                        

From  Sat Aug 24 16:14:12 2002
From: (Eric S. Raymond)
Date: Sat, 24 Aug 2002 11:14:12 -0400
Subject: [Python-Dev] Set naming
In-Reply-To: <>
References: <>
Message-ID: <>

Magnus Lie Hetland <>:
> By naming the new set module sets and the class Set the parallel to
> the array module is broken. I guess that's not a problem -- I just thought
> I'd mention it. Naming the module "set" would be more analogous to
> "array", and having "set" as an alias for "Set" would let people
> switch to a possible future type with the same name by commenting out
> their import statements...

Hmmm...I think I agree with this objection, and I have another.  It's
not consistently so, but usually the classes that are simpler and
closer to the system core aren't capitalized.  The name "Set" has
a misleading hint in it.
		Eric S. Raymond

From  Sat Aug 24 16:15:56 2002
From: (Jeremy Hylton)
Date: Sat, 24 Aug 2002 11:15:56 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Good point, Oren.  We now have two requirements for interfaces that
are different than the standard inheritance mechanism.  It should be
possible to:     

  - inherit from a class without implementing that class's interfaces

  - declare that a class implements an interface outside the class
    statement

It's harder to support the second requirement using the current
inheritance mechanism.


From  Sat Aug 24 16:32:09 2002
From: (Alex Martelli)
Date: Sat, 24 Aug 2002 17:32:09 +0200
Subject: [Python-Dev] convertibility
In-Reply-To: <00b901c24b76$9aa4a220$6d94fea9@newmexico>
References: <00b901c24b76$9aa4a220$6d94fea9@newmexico>
Message-ID: <>

On Saturday 24 August 2002 04:00 pm, Samuele Pedroni wrote:
> [yes with type categories or adapt we could do better,
> but the design prefers to minimize unexpected behaviour,
> and in practice is not too much constraining]

As a happy user of Jython (albeit, so far, in modest amounts,
and not yet in production-code), I want to add an unsolicited 
testimonial -- most of the time, the rules Jython applies "do 
what feels right" and prove (to me) unsurprising and unconstraining.

After studying the rules in detail, particularly with overload resolution in 
mind, I was afraid of many possible mishaps.  In practice, I find that
it seems the rules don't get in my way and don't trip me up either.
Whatever it is, there IS something right in those rules (perhaps just
in conjunction with typical Java libraries, or perhaps more generally).


From  Sat Aug 24 16:37:36 2002
From: (Alex Martelli)
Date: Sat, 24 Aug 2002 17:37:36 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Saturday 24 August 2002 05:15 pm, Jeremy Hylton wrote:
> Good point, Oren.  We now have two requirements for interfaces that
> are different than the standard inheritance mechanism.  It should be
> possible to:
>   - inherit from a class without implementing that class's interfaces
>   - declare that a class implements an interface outside the class
>     statement
> It's harder to support the second requirement using the current
> inheritance mechanism.

The second requirement is a good part of what adaptation is meant
to do.  As I understand, that's exactly what Zope3 already provides
for its interfaces.  You don't just "declare" the fact -- you register
an adapter that can provide whatever is needed to make it so.  I.e.,
if object X does already implement interface Y without ANY need
for tweaking/renaming/whatever, I guess the registered adapter
can just return the object X it receives as an argument.  More often,
the adapter will return some (hopefully thin) wrapper over X that
deals with renaming, signature-adaptation, and the like.

That's how it works in Zope3 (at least as I understood from several
discussions with Jim Fulton and Guido -- haven't studied Zope3 yet), and I 
think that such "external adaptation" functionality, however dressed up, 
should definitely be a part of whatever Python ends up with.
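
A minimal sketch of what such external adaptation could look like.  This
is not Zope3's API; register_adapter, adapt, the "writer" protocol name
and the two example classes are all hypothetical:

```python
import io

# Registry mapping (class, protocol name) to an adapter callable.
_adapters = {}

def register_adapter(cls, protocol, adapter):
    """Declare, outside the class statement, how cls provides protocol."""
    _adapters[(cls, protocol)] = adapter

def adapt(obj, protocol):
    """Return an object providing protocol, built from obj."""
    for klass in type(obj).__mro__:
        adapter = _adapters.get((klass, protocol))
        if adapter is not None:
            return adapter(obj)
    raise TypeError("cannot adapt %r to %s" % (obj, protocol))

class Logger:
    """An existing class whose method name does not match the protocol."""
    def __init__(self):
        self.lines = []
    def emit(self, text):
        self.lines.append(text)

class WriterOverLogger:
    """Thin wrapper that renames emit() to write()."""
    def __init__(self, logger):
        self._logger = logger
    def write(self, text):
        self._logger.emit(text)

# StringIO already has write(), so its adapter returns the object unchanged.
register_adapter(io.StringIO, "writer", lambda obj: obj)
# Logger needs a wrapper that deals with the renaming.
register_adapter(Logger, "writer", WriterOverLogger)
```

Callers then write adapt(x, "writer").write(...) without caring which of
the two cases applies.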


From  Sat Aug 24 16:53:44 2002
From: (Andrew Koenig)
Date: 24 Aug 2002 11:53:44 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

Guido> For example, if we ever are going to add argument type declarations to
Guido> Python, it will probably look like this:

Guido>     def foo(a: classA, b: classB):
Guido>         ...body...

Guido> It would be convenient if this could be *defined* as

Guido>     assert isinstance(a, classA) and isinstance(b, classB)

Guido> so that programs that have a simple class hierarchy can use their
Guido> classes directly as argument types, without having to go through the
Guido> trouble of declaring a parallel set of interfaces.

Guido> I also think that it should be possible to come up with a set of
Guido> standard "abstract" classes representing concepts like number,
Guido> sequence, etc., in which the standard built-in types are nicely
Guido> embedded.

I agree completely.  Any use of inheritance that satisfies Liskov
substitutability will satisfy interface inheritance too, and although
it is possible to think of uses of inheritance that aren't substitutable,
they're unusual enough that they should probably require (syntactic)
special pleading, if only to alert the reader.

Andrew Koenig

From  Sat Aug 24 16:53:46 2002
From: (Samuele Pedroni)
Date: Sat, 24 Aug 2002 17:53:46 +0200
Subject: [Python-Dev] type categories
Message-ID: <00f901c24b86$6e58a800$6d94fea9@newmexico>

[Jeremy Hylton]
>   - inherit from a class without implementing that class's interfaces
>   - declare that a class implements an interface outside the class
>     statement

I would like to add and restate my proposal to also allow referring to
anonymous super-interfaces of an interface in terms of the interface plus a
subset of its signatures, e.g. FileLike and just 'write'.

[that means an interface can be thought to correspond to a set of
(tag,signature) tuples, where tag identifies the interface, and one can also
just consider subsets of it]

I really think that such a feature would allow interfaces to better mix and
match with how currently Python code is written. Or at least ease the
transition from an interfaces-less world.

This may seem YAGNI, but I clearly remember people stating (on types-sig) the
need to refer to an interface at the granularity of just file-like 'read'
or just __getitem__. Having to name them is overkill, as is having to
implement all the methods of an interface corresponding to a base Python type.

It is a burden to implement and may seem complex, but I feel, it matches how we
code in Python - implementing e.g. just subsets of  interfaces corresponding to
a base Python type - and still allowing to have interface checking precision.
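
A rough sketch of the idea, with an interface modelled as a (tag, set of
method names) pair.  The FILE_LIKE value, the provides() helper and the
WriteOnly class are hypothetical, and signatures are reduced to bare
method names to keep the example short:

```python
# An interface: a tag plus the names of the methods it requires.
FILE_LIKE = ("FileLike", frozenset({"read", "write", "close"}))

def provides(obj, interface, only=None):
    """Check obj against an interface, or against a subset of its names."""
    tag, names = interface
    wanted = names if only is None else frozenset(only)
    unknown = wanted - names
    if unknown:
        # The anonymous super-interface must be carved out of the named one.
        raise ValueError("%s has no %s" % (tag, sorted(unknown)))
    return all(callable(getattr(obj, name, None)) for name in wanted)

class WriteOnly:
    """Implements only a subset of the file-like protocol."""
    def write(self, data):
        return len(data)
```

Here provides(WriteOnly(), FILE_LIKE, only={"write"}) holds even though
the full FileLike check fails, and no name had to be invented for the
write-only subset.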


From  Sat Aug 24 16:59:07 2002
From: (Andrew Koenig)
Date: 24 Aug 2002 11:59:07 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <007501c24b62$1ae61640$>
References: <>
Message-ID: <>

David> It sounds to my C++ ear like you're trying to make this
David> analogous to runtime polymorphism in C++. I think Python's
David> polymorphism is a lot closer to what we do at compile-time in
David> C++, and it should stay that way: no inheritance relationship
David> needed... at least, not on the surface. Here's why: people
David> inevitably discover new type categories in the objects and
David> types they're already using. In C++ this happened when Stepanov
David> et al discovered that built-in pointers matched his mental
David> model of random-access iterators. A similar thing will happen
David> in Python when you make all numbers inherit from Number but
David> someone wants to impose the real mathematical categories (or
David> heck: Integer vs. Fractional) on them.

David> What Stepanov's crew did was to invent iterator traits,
David> which decouple the type's category from the type itself. Each
David> category is represented by a class, and those classes do have
David> an inheritance relationship (i.e. every random_access_iterator
David> IS-A bidirectional_iterator).

In other words, there *is* an inheritance relationship in C++'s
compile-time polymorphism, and iterator traits are one way of
expressing that polymorphism.

So we have two desirable properties:

        1) Guido's suggestion that interface specifications are
           close enough to classes that they should be classes,
           and should be inherited like classes, possibly with
           a way of hiding that inheritance for special cases;

        2) Dave's suggestion that people other than a class
           author might wish to make claims about the interface
           that the class supports.

I now remember that in one of my earlier messages, I said something
related to (2) as well.

Is there a way of merging these two ideas?

Andrew Koenig

From  Sat Aug 24 17:09:28 2002
From: (David Abrahams)
Date: Sat, 24 Aug 2002 12:09:28 -0400
Subject: [Python-Dev] convertibility
References: <00b901c24b76$9aa4a220$6d94fea9@newmexico> <>
Message-ID: <013b01c24b89$4c1d8140$>

From: "Alex Martelli" <>

> On Saturday 24 August 2002 04:00 pm, Samuele Pedroni wrote:
> ...
> > [yes with type categories or adapt we could do better,
> > but the design prefers to minimize unexpected behaviour,
> > and in practice is not too much constraining]
> As a happy user of Jython (albeit, so far, in modest amounts,
> and not yet in production-code), I want to add an unsolicited
> testimonial -- most of the time, the rules Jython applies "do
> what feels right" and prove (to me) unsurprising and unconstraining.
> After studying the rules in detail, particularly with overload resolution in
> mind, I was afraid of many possible mishaps.  In practice, I find that
> it seems the rules don't get in my way and don't trip me up either.
> Whatever it is, there IS something right in those rules (perhaps just
> in conjunction with typical Java libraries, or perhaps more generally).

Hmm. When did Java acquire overload resolution?

I was surprised to see it here:

I was thinking of taking advantage of these rules for Boost.Python (and
Python itself), but I'm a little worried about the applicability of the
final part of the rules:

    if any method still under consideration has parameter
    types that are assignable to another method that's also still in
    play, then the other method is removed from consideration. This
    process is repeated until no other method can be eliminated. If the
    result is a single "most specific" method, then that method is
    called. If there's more than one method left, the call is ambiguous.

This rule is similar to the one used in C++ for partial ordering of
function templates. The problem is that my convertibility criteria examine
the actual objects involved in a conversion, not just their types. This
allows us to overload on sequence-of-float vs. sequence-of-string, for
example. Substitutability of argument types can't be tested without
exemplars of those types to work with.
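
A sketch of the quoted elimination rule, working on types only.  Each
candidate is modelled as a tuple of parameter types and "assignable" is
approximated with issubclass; both choices are assumptions made for
illustration, not Jython's actual implementation:

```python
def assignable(params_a, params_b):
    """True if each parameter type in a can be passed where b expects one."""
    return len(params_a) == len(params_b) and all(
        issubclass(a, b) for a, b in zip(params_a, params_b)
    )

def most_specific(candidates):
    """Repeatedly drop any candidate that a more specific one is assignable to."""
    remaining = list(candidates)
    while True:
        loser = next(
            (
                other
                for method in remaining
                for other in remaining
                # Strictly more specific: assignable one way but not the other.
                if assignable(method, other) and not assignable(other, method)
            ),
            None,
        )
        if loser is None:
            return remaining  # one entry: resolved; several: ambiguous
        remaining.remove(loser)
```

With candidates [(object,), (int,), (bool,)] only (bool,) survives, while
[(int, object), (object, int)] leaves both in play, i.e. an ambiguous
call.  Nothing here looks at actual argument objects, which is exactly
the limitation described above.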


           David Abrahams * Boost Consulting *

From  Sat Aug 24 17:30:39 2002
From: (Nathan Clegg)
Date: Sat, 24 Aug 2002 09:30:39 -0700
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <>

>>>>> "Oren" == Oren Tirosh <> writes:

    Oren> I would like to be able to declare that I need an object
    Oren> with a specific interface even if the object was written
    Oren> long before and I don't want to modify an existing library
    Oren> just to make it conform to my interface names.

class InterfaceWrapper(ExistingClass, AbstractInterfaceClass):
      pass

I'm not saying this is a good idea :), but I believe this problem is
already solvable in the current language.  The wrapper class should
pass the test of isinstance for the interface class, but the existing
class as the first parent should implement all of the calls.

Note that most other languages that actually support proper interfaces
(e.g. Java) would have similar trouble adding an interface to a prior
existing class without modifying its definition.  Python actually
provides a much simpler solution than others might, it seems to me.
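
A runnable sketch of the wrapper approach; the speak() method and the
class bodies are invented for the example:

```python
class AbstractInterfaceClass:
    """Stands in for an interface definition."""
    def speak(self):
        raise NotImplementedError

class ExistingClass:
    """Written earlier, with no knowledge of the interface."""
    def speak(self):
        return "hello"

class InterfaceWrapper(ExistingClass, AbstractInterfaceClass):
    # ExistingClass comes first in the MRO, so its methods win.
    pass

wrapped = InterfaceWrapper()
is_interface = isinstance(wrapped, AbstractInterfaceClass)
greeting = wrapped.speak()

# Plain ExistingClass instances are unaffected by the new subclass.
plain_is_interface = isinstance(ExistingClass(), AbstractInterfaceClass)
```

Only objects created as InterfaceWrapper pass the isinstance test;
instances of ExistingClass produced elsewhere still do not.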

Nathan Clegg

From  Sat Aug 24 17:45:01 2002
From: (David Abrahams)
Date: Sat, 24 Aug 2002 12:45:01 -0400
Subject: [Python-Dev] type categories
References: <>
Message-ID: <016001c24b8d$984536e0$>

From: "Nathan Clegg" <>

> >>>>> "Oren" == Oren Tirosh <> writes:
>     Oren> I would like to be able to declare that I need an object
>     Oren> with a specific interface even if the object was written
>     Oren> long before and I don't want to modify an existing library
>     Oren> just to make it conform to my interface names.
> class InterfaceWrapper(ExistingClass, AbstractInterfaceClass):
>       pass
> I'm not saying this is a good idea :), but I believe this problem is
> already solvable in the current language.  The wrapper class should
> pass the test of isinstance for the interface class, but the existing
> class as the first parent should implement all of the calls.
> Note that most other languages that actually support proper interfaces
> (e.g. Java) would have similar trouble adding an interface to a prior
> existing class without modifying its definition.  Python actually
> provides a much simpler solution than others might, it seems to me.

The problem is that we want to use ExistingClass *objects* where
AbstractInterfaceClass is required.

If someone else has written a module containing:

    def some_fantastic_function(x: AbstractInterfaceClass):
        ...

And I have written a function:

    def my_func(generator):
        for x in generator:
            some_fantastic_function(x)

If there's a generator lying about which produces ExistingClass, I ought to
be able to pass it to my_func.

           David Abrahams * Boost Consulting *

From  Sat Aug 24 17:59:07 2002
From: (Oren Tirosh)
Date: Sat, 24 Aug 2002 19:59:07 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Sat, Aug 24, 2002 at 11:15:56AM -0400
References: <> <> <> <> <> <> <>
Message-ID: <>

On Sat, Aug 24, 2002 at 11:15:56AM -0400, Jeremy Hylton wrote:
> Good point, Oren.  We now have two requirements for interfaces that
> are different than the standard inheritance mechanism.  It should be
> possible to:     
>   - inherit from a class without implementing that class's interfaces
>   - declare that a class implements an interface outside the class
>     statement
> It's harder to support the second requirement using the current
> inheritance mechanism.

I want to go a step further.  I don't want to declare that a class 
implements an interface outside the class statement. I don't want to 
declare *anything* about classes.

My approach centers on the user of the class rather than the provider.
The user can declare what he *expects* from the class and the interface 
checking will verify that the class meets these requirements. In a way 
this is what you already do in Python - you use the object and if it 
doesn't meet your expectations it raises an exception. Exceptions are
raised for both bad form and bad content. Bad content will still trigger
an exception when you try to use it but bad form can be detected much earlier.


I originally developed this for rulebases in security applications. I am
now porting it to Python and cleaning it up. I think it should be an
effective way to write assertions about the form of class objects based on
methods, call signatures, etc.  If/when type checking is added to Python
it should also be possible to specify specific types for arguments and
return values.
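
A small sketch of checking form from the user's side, using the
inspect module of later Python versions.  The accepts_call() helper and
the Rule class are hypothetical and are not the rulebase code mentioned
above:

```python
import inspect

def accepts_call(obj, method_name, *args, **kwargs):
    """True if obj has the method and it could be called with these arguments."""
    method = getattr(obj, method_name, None)
    if not callable(method):
        return False
    try:
        # bind() checks the call signature without running the method.
        inspect.signature(method).bind(*args, **kwargs)
    except (TypeError, ValueError):
        # TypeError: arguments do not fit; ValueError: no signature available.
        return False
    return True

class Rule:
    def match(self, packet):
        return packet is not None
```

The caller states what it expects, e.g. accepts_call(rule, "match",
packet), and a mismatch in form shows up before the object is used.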


From  Sat Aug 24 18:06:31 2002
From: (Oren Tirosh)
Date: Sat, 24 Aug 2002 20:06:31 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Sat, Aug 24, 2002 at 05:37:36PM +0200
References: <> <> <> <>
Message-ID: <>

On Sat, Aug 24, 2002 at 05:37:36PM +0200, Alex Martelli wrote:
> On Saturday 24 August 2002 05:15 pm, Jeremy Hylton wrote:
> > Good point, Oren.  We now have two requirements for interfaces that
> > are different than the standard inheritance mechanism.  It should be
> > possible to:
> >
> >   - inherit from a class without implementing that class's interfaces
> >
> >   - declare that a class implements an interface outside the class
> >     statement
> >
> > It's harder to support the second requirement using the current
> > inheritance mechanism.
> The second requirement is a good part of what adaptation is meant
> to do.  

I am not talking about situations where the object does not meet your
expectations and needs to be adapted - I'm talking about situations where 
it actually does and the only problem is how to describe that fact.
Adaptation is cool, but I don't see it as a replacement for anything that
interfaces are supposed to achieve. Effective adaptation requires some
kind of interface definition mechanism to work on top of.


From  Sat Aug 24 18:33:01 2002
From: (Andrew Koenig)
Date: 24 Aug 2002 13:33:01 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>>> "Oren" == Oren Tirosh <> writes:
Oren> I would like to be able to declare that I need an object
Oren> with a specific interface even if the object was written
Oren> long before and I don't want to modify an existing library
Oren> just to make it conform to my interface names.

Nathan> class InterfaceWrapper(ExistingClass, AbstractInterfaceClass):
Nathan>       pass

Nathan> I'm not saying this is a good idea :), but I believe this problem is
Nathan> already solvable in the current language.

Not quite.  You are creating a new class with the desired property,
but it can sometimes be desirable to assert properties about
types that already exist.

For example, suppose I invent a GroupUnderPlus property for
types for which the + operator has group properties.  I would
like to be able to say that int has that property, and not
have to derive a new class from int in order to do so.
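
For what it's worth, later Python versions gained an abc module whose
register() method permits this kind of after-the-fact claim.  A sketch,
with GroupUnderPlus as an empty marker class (nothing here verifies the
group axioms):

```python
from abc import ABC

class GroupUnderPlus(ABC):
    """Marker for types whose + operator has group properties."""

# Claim the property for an existing built-in type, without subclassing it.
GroupUnderPlus.register(int)

int_has_property = isinstance(3, GroupUnderPlus)
str_has_property = isinstance("abc", GroupUnderPlus)
int_was_modified = GroupUnderPlus in int.__mro__
```

The claim is made by someone other than the author of int, and int
itself is left untouched.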

Andrew Koenig

From  Sat Aug 24 19:30:56 2002
From: (Gerhard Häring)
Date: Sat, 24 Aug 2002 20:30:56 +0200
Subject: [Python-Dev] Why no math.fac?
Message-ID: <20020824183056.GA1859@lilith.ghaering.test>

Any reason why there isn't any factorial function in the math module? I
could easily implement one in C (for ints and longs only, right?)

This sig powered by Python!
Outside temperature in Munich: 22.3 °C      Wind: 1.2 m/s

From  Sat Aug 24 19:41:16 2002
From: (François Pinard)
Date: 24 Aug 2002 14:41:16 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <>

[Magnus Lie Hetland]

> I guess the idea was to use lazy sets for some such operations. Then
> you could build complex expressions through cartesian products, unions,
> intersections, set differences, set comprehensions etc. without actually
> constructing the full set.

Allow me some random thoughts.  (Aren't they always random, anyway? :-)

When I saw some of the suggestions on this list for "generating" elements of
a cartesian product, though sometimes elegant, I thought "Too much done,
too soon.".  But the truth is that I did not give the thing a serious try,
and I'm not sure I would be able to offer anything better.

One nice thing, with a dict or a set, is that we can quickly access how many
entries there are in there.  Is there some internal way to efficiently fetch
the N'th element, from the order in which the keys would be naturally listed?
If not, one could always pay some extra temporary memory to build a list
of these keys first.  If you have to "generate" a cartesian product for
N sets, you could set up a compound counter as a list of N indices, the
K'th meant to run from 0 up to the cardinality C[K] of the K'th set, and
devise simple recipes to yield the element of the product represented by
the counter, and to bump it.  Moreover, it would be trivial to equip this
generator with a `__len__' function able to predict the cardinality CCC of
the whole result, and quite easy to transform any KKK between
0 and CCC into an equivalent compound counter, and from there, access any
member of the cartesian product at constant speed, without generating it all.
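
A minimal sketch of such a lazy product, assuming each input is first
copied into a list.  The LazyProduct name is made up; the index is
decoded as a mixed-radix number, one digit per input set:

```python
class LazyProduct:
    """Cartesian product with len() and indexing, never built in full."""

    def __init__(self, *collections):
        # Pay some extra memory once so that elements can be fetched by position.
        self._pools = [list(c) for c in collections]

    def __len__(self):
        total = 1
        for pool in self._pools:
            total *= len(pool)
        return total

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        picks = []
        # Decode the index into a compound counter, last set varying fastest.
        for pool in reversed(self._pools):
            index, digit = divmod(index, len(pool))
            picks.append(pool[digit])
        return tuple(reversed(picks))

    def __iter__(self):
        return (self[i] for i in range(len(self)))
```

Fetching one member costs time proportional to the number of sets, not
to the size of the product.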

All the above is pretty simple, and meant to introduce a few suggestions
that might solve once and for all, if we could do it well enough, a
recurring request on the Python list about how to produce permutations
et al.  We might try to rewrite the recipes behind a "generating" cartesian
product of many sets, illustrated above, into a similar generating function
able to produce all permutations of a single set.  So let's say:

   Set([1, 2, 3]).permutations()

would lazily produce the equivalent of:

   Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])

That generator could offer a `__len__' function predicting the cardinality
CCC of the result, and some trickery could be used to map integers from 0 to
CCC into various permutations.  Once on the road, it would be worth offering
combinations and arrangements just as well, and maybe also offering the
"power" set, I mean here the set of all subsets of a given set, all with
predictable cardinality and constant speed access to any member by index.
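
One possible form of that trickery, sketched with the factorial number
system.  The nth_permutation() name is hypothetical and the input is
taken as an ordered list rather than a set:

```python
from math import factorial

def nth_permutation(items, index):
    """Return permutation number index (0-based) in lexicographic position order."""
    pool = list(items)
    if not 0 <= index < factorial(len(pool)):
        raise IndexError(index)
    result = []
    for remaining in range(len(pool), 0, -1):
        # Each choice of next element heads a block of (remaining-1)! permutations.
        position, index = divmod(index, factorial(remaining - 1))
        result.append(pool.pop(position))
    return tuple(result)
```

For [1, 2, 3], indices 0 through 5 give the six tuples in the order
listed above.  Note that pool.pop() makes each lookup cost grow with the
size of the set, so this is not strictly constant-time access.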

Yet, many questions or objections may arise.  Using tuples to represent
each element of a cartesian product of many sets is pretty natural, but
it is slightly less clear that a tuple is the best fit for representing
an ordered set as in permutations and arrangements, as tuples may allow
elements to be repeated, while an ordered set does not.  I think that sets
are to be preferred over tuples for returning combinations or subsets.

While it is natural to speak and think of subsets of a set, or permutations,
arrangements and combinations of a set, some people might prefer to stay
closer to an underlying implementation with lists (sublists of a list,
permutations, arrangements or combinations of a list), and would feel that
going through sets is an unwelcome detour for their applications.  Indeed,
what's the real use and justification for hashing keys and such things,
when one wants nothing else than arrangements from a list?

Another aspect worth some thinking is that permutations, in particular, are
mathematical objects in themselves: we can notably multiply permutations or
take the inverse of a permutation.  Arrangements are in fact permutations
over the elements of combinations.  Some thought is surely needed for properly
reflecting mathematical elegance into how the set API is extended for the
above, and not merely burying that elegance under practical concerns.

Some people may think that these are all problems which are orthogonal
to the design of a basic set feature, and which should be addressed in
separate Python modules.  On the other hand, I think that a better and
nicer integration might result if all these things were thought together,
and thought sooner than later.  Moreover, my intuition tells me that with
some care and luck (both are needed), these extra set features could be
small enough additions to the `sets' module to not be worth another one.
Besides, if appropriate, such facilities would surely add a lot of zest and
punch into the interest raised by the `sets' module when it gets published.

François Pinard

From  Sat Aug 24 20:29:03 2002
From: (Tim Peters)
Date: Sat, 24 Aug 2002 15:29:03 -0400
Subject: [Python-Dev] Re: [Python-checkins]
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> (Copied to Paul Graham.  Paul, this is the mailing list of the Python
> maintainers.  I thought you'd find the bits about lexical analysis in
> bogofilter interesting.  Pythonistas, Paul is one of the smartest and
> sanest people in the LISP community, as evidenced partly by the fact
> that he hasn't been too proud to learn some lessons from Python :-).
> It would be a good thing for some bridges to be built here.)

Hi, Paul!  I believe Eric copied you on some concerns I had about the
underpinnings of the algorithm, specifically about the final "is it spam?"

Looking at your links, I bet you got the formula from here:

If so, the cause of the difficulty is that you inherited a subtle (because
unstated) assumption from that writeup:

    I would suggest that we assume symmetry between "y" and "n".  In
    other words, assume that probability of predicting correctly is the
    same regardless of whether the correct answer is "y" or "n".

That part's fine.

    This implies p0=p7, p1=p6, p2=p5, and p3=p4,

But that part doesn't follow from *just* the stated assumptions:  note that
those four equalities imply that

    p0+p2+p4+p6 = p7+p5+p3+p1

But the left-hand side of that is the probability that event X does not
occur (it's all the rows with 'n' in the 'R' column), and the right-hand
side is the probability that event X does occur (it's all the rows with 'y'
in the 'R' column).  In other words, this derivation also makes the
stronger-- and unstated --assumption that X occurs with probability 1/2.
The ultimate formula given on that page is correct if P(X)=0.5, but turns
out it's wrong if P(X) isn't 0.5.

Reality doesn't care how accurate Smith and Jones are, X occurs with its own
probability regardless of what they think.  Picture an extreme:  suppose
reality is such that X *always* occurs.  Then p0 must be 0, and so must p2,
p4 and p6 (the rows with 'n' in the R column can never happen if R is always
'y').  But then p0+p2+p4+p6 is 0 too, and the equality above implies
p7+p5+p3+p1 is also 0.  We reach the absurd conclusion that if X always
occurs, the probability that X occurs is 0.  As approximations to 1 go, 0
could stand some improvement <wink>.
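
A sketch of the repaired calculation.  It assumes that Smith and Jones
are right with probabilities a and b, independently of each other and
regardless of the true answer (the symmetry assumption quoted above),
and that the page's formula is the familiar ab/(ab + (1-a)(1-b)); the
function and parameter names are made up here:

```python
def both_say_yes(a, b, prior=0.5):
    """P(X occurs | Smith and Jones both predict it), by Bayes' rule.

    a, b  -- accuracy of Smith and of Jones
    prior -- P(X), the probability of X before anyone predicts anything
    """
    yes = prior * a * b                          # X occurs, both are right
    no = (1.0 - prior) * (1.0 - a) * (1.0 - b)   # X does not occur, both are wrong
    # Undefined (division by zero) if both products vanish, e.g. a=1, b=0.
    return yes / (yes + no)
```

With prior=0.5 the prior cancels and this reduces to ab/(ab + (1-a)(1-b)).
With prior=1.0 it returns 1.0 whatever a and b are, as it should when X
always occurs.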

The math is easy enough to repair, but it may percolate into other parts of
your algorithm.  Chiefly, I *suspect* you found you needed to boost the
"good count" by a factor of 2 because you actually have substantially more
non-spam than spam in your inbox, and the scoring step was favoring "spam"
more than it should have by virtue of neglecting to correct for the fact that
your real-life P(spam) is significantly less than 0.5 (although your training
corpora had about the same number of spams as non-spams, so that P(spam)=0.5
was approximately true across your *training* data -- that's another

Makes sense?  Once our testing setup is trustworthy, I'll try it both ways
and report on results.  In the meantime, it's something worth pondering.

From  Sat Aug 24 20:29:24 2002
From: (Samuele Pedroni)
Date: Sat, 24 Aug 2002 21:29:24 +0200
Subject: [Python-Dev] Fw: Security hole in rexec?
Message-ID: <038e01c24ba4$8e78db00$6d94fea9@newmexico>

----- Original Message ----- 
From: Troels Therkelsen <>
Newsgroups: comp.lang.python
Sent: Saturday, August 24, 2002 6:42 PM
Subject: Security hole in rexec?

> Hello everybody,
> I have managed to stumble onto something with the rexec module that I
> do not quite understand.  As I understand it, the rexec framework is
> meant to create a sandbox area within the Python interpreter,
> technically with an instance of the rexec.RExec class.  It is supposed
> to be impossible to break out of this sandbox unless you do something
> careless like inserting non-rexec objects into the rexec namespace.
> Let me demonstrate with some code:
>   Python 2.2.1 (#1, Jun 27 2002, 10:29:04) 
>   [GCC 2.95.3 20010315 (release)] on linux2
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> import rexec
>   >>> r = rexec.RExec()
>   >>> r.r_exec("import sys; print sys.stdout")
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>     File "/usr/local/lib/python2.2/", line 254, in r_exec
>       exec code in m.__dict__
>     File "<string>", line 1, in ?
>   AttributeError: 'module' object has no attribute 'stdout'
> This is as you'd expect, 'stdout' is not in the default ok_sys_names
> attribute of the rexec.RExec class, so you are not supposed to be able
> to see it from within the 'sandbox'.  But observe:
>   >>> r.r_exec("del __builtins__")
>   >>> r.r_exec("import sys; print sys.stdout")
>   <open file '<stdout>', mode 'w' at 0x80fe2a0>
> If __builtins__ is so critical to the operation of the 'sandbox' how
> is it possible to break it from within the 'sandbox'?  Have I stumbled
> across a bug in rexec?  Have I misunderstood something important?
> I've used the id() function to get the 'address' of the __builtins__
> object and I have verified that the new __builtins__ which gets
> re-added has a different id so it is definitely a different
> __builtins__ than the one I used del on.  It would appear that exec
> and family adds __builtins__ to the namespace it runs in if it doesn't
> exist.  But where does it get it from?  Why doesn't rexec deal with
> this quirk of exec?  Maybe it's a new feature/bug of exec?
> I'll stop with the questions now.  Suffice to say, I really need rexec
> :-)
> Best regards,
> Troels Therkelsen
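
The exec behaviour described above can be seen without rexec at all.  The
following is a small sketch for a current Python 3 interpreter (rexec itself
no longer exists there), so it only illustrates the underlying mechanism and
is not a reproduction of the session above: when the globals dict passed to
exec has no `__builtins__` key, exec inserts a reference to the real
builtins.

```python
# Start with a namespace whose builtins have been emptied out.
ns = {"__builtins__": {}}

try:
    exec("n = len('abc')", ns)
    blocked = False
except NameError:
    blocked = True  # len is not visible inside the restricted namespace

# Code running inside the namespace removes the restricted entry...
exec("del __builtins__", ns)

# ...and the next exec inserts the real builtins again.
exec("n = len('abc')", ns)
restored = ns["n"] == 3
```

Here `blocked` and `restored` both end up true: the restriction holds only
as long as the `__builtins__` key survives in the namespace.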

From  Sat Aug 24 21:03:13 2002
From: (Tim Peters)
Date: Sat, 24 Aug 2002 16:03:13 -0400
Subject: [Python-Dev] Why no math.fac?
In-Reply-To: <20020824183056.GA1859@lilith.ghaering.test>
Message-ID: <>

[Gerhard Häring]
> Any reason why there isn't any factorial function in the math module?

math has traditionally just wrapped functions from the platform libm,
although it's gotten a *little* smarter than that in a few ways.

> I could easily implement one in C (for ints and longs only, right?)

A Python function is more suitable.  If n is small, fac goes fast no matter
how it's written.  If n is large, the time spent in long-int multiplication
will overwhelmingly swamp whatever little savings you got from writing it in
C.  More, an intelligent unbounded-int fac function written in Python will
likely run much faster than anything you could bear to code in C, because an
"intelligent" function for this would strive to balance the sizes of the
multiplicands along the way, and that requires bookkeeping that's painful in
C.

For example, try these under current CVS Python:

from heapq import heapreplace, heappop

def fac1(n):
    if n == 0:
        return 1
    if n <= 2:
        return n
    partials = range(2, n+1)
    while len(partials) > 1:
        n1 = heappop(partials)
        n2 = partials[0]
        heapreplace(partials, n1*n2)
    return partials[0]

def fac2(n):
    if n == 0:
        return 1
    if n <= 2:
        return n
    result = 2
    for i in xrange(3, n+1):
        result *= i
    return result

fac1 implements a simple balancing scheme that eventually manages to get
Karatsuba multiplication (new in 2.3; the heapq module is also new) into
play.  For n=100000, fac2 takes 10 times longer to run on my box, and
wouldn't be significantly faster than that if coded in C.
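
For readers on a current interpreter, here is a sketch of the same two
functions ported to Python 3, where range() no longer returns a list and
xrange is gone.  The names fac_balanced and fac_simple are mine, and no
timing claim is made for modern builds; the only thing checked is that both
agree with math.factorial.

```python
import math
from heapq import heappop, heapreplace

def fac_balanced(n):
    """Factorial that always multiplies the two smallest partial products."""
    if n < 2:
        return 1
    partials = list(range(2, n + 1))  # sorted, so already a valid heap
    while len(partials) > 1:
        n1 = heappop(partials)          # smallest partial product
        n2 = partials[0]                # next smallest, still on the heap
        heapreplace(partials, n1 * n2)  # swap it for the combined product
    return partials[0]

def fac_simple(n):
    """Plain left-to-right product, for comparison."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

agree = all(fac_balanced(n) == fac_simple(n) == math.factorial(n)
            for n in range(200))
```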

From  Sat Aug 24 21:55:04 2002
From: (Andrew P. Lentvorski)
Date: Sat, 24 Aug 2002 13:55:04 -0700 (PDT)
Subject: [Python-Dev] Why no math.fac?
In-Reply-To: <>
Message-ID: <>

On Sat, 24 Aug 2002, Tim Peters wrote:

> [Gerhard Häring]
> > Any reason why there isn't any factorial function in the math module?

Since factorial is really a special case of the gamma function, wouldn't
it be better to put it in a separate module that handles such complex
mathematical functions? (orthonormal polynomials, implicitly defined
functions, etc.)


From  Sat Aug 24 21:59:16 2002
From: (François Pinard)
Date: 24 Aug 2002 16:59:16 -0400
Subject: [Python-Dev] Re: Why no math.fac?
In-Reply-To: <>
References: <>
Message-ID: <>

[Tim Peters]

> [...] an "intelligent" function for this would strive to balance the
> sizes of the multiplicands along the way [...]

> from heapq import heapreplace, heappop
> def fac1(n):

Simple and clever.  A real pleasure to read! :-)

Wouldn't it make a wonderful example of `heapq' in the Library Reference?

François Pinard

From  Sat Aug 24 22:33:38 2002
From: (Alex Martelli)
Date: Sat, 24 Aug 2002 23:33:38 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Saturday 24 August 2002 07:06 pm, Oren Tirosh wrote:
> On Sat, Aug 24, 2002 at 05:37:36PM +0200, Alex Martelli wrote:
> > On Saturday 24 August 2002 05:15 pm, Jeremy Hylton wrote:
> > > Good point, Oren.  We now have two requirements for interfaces that
> > > are different than the standard inheritance mechanism.  It should be
> > > possible to:
> > >
> > >   - inherit from a class without implementing that class's interfaces
> > >
> > >   - declare that a class implements an interface outside the class
> > >     statement
> > >
> > > It's harder to support the second requirement using the current
> > > inheritance mechanism.
> >
> > The second requirement is a good part of what adaptation is meant
> > to do.
> I am not talking about situations where the object does not meet your
> expectations and needs to be adapted - I'm talking about situations where
> it actually does and the only problem is how to describe that fact
> properly.

Adaptation IS one way to "describe that fact properly", given that checks
are anyway constrained to happen at runtime.  You just install an
adapter from objects x of class X to protocol Y that receives x as
an argument and whose body is just "return x" -- that's all.

You may consider the adaptation mechanism too general to bend it to
this purpose, but I look at it differently, namely: what's the gain that
would justify a further, special-purpose mechanism that's usable only
(e.g.) when all of X's methods already have the right name and order
of parameters, but then we'd have to switch to another if there is
renaming or reordering to be done?  Unless some huge gain can be
shown to come from having multiple mechanisms, I'd rather have
just one -- "entities must not be multiplied without necessity".

Should some caller, for some weird reason, need to distinguish
whether object x was adapted to Y through an actual wrapper, or
without the need for one, the caller can test "if x is adapt(x, Y):" --
I can't easily think of actual use cases, but, if there are any,
they are covered anyway.
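
A minimal sketch of what such a registry could look like, in Python 3.
Everything here is hypothetical: install_adapter and adapt are illustrative
stand-ins and not an existing API, and FileLike and X are made-up names for
a protocol and a class that already satisfies it.

```python
_adapters = {}  # (class, protocol) -> adapter function

def install_adapter(adapter, cls, protocol):
    """Register adapter as the way to view instances of cls as protocol."""
    _adapters[(cls, protocol)] = adapter

def adapt(obj, protocol):
    """Return obj viewed as protocol, using the first adapter found on the MRO."""
    for cls in type(obj).__mro__:
        adapter = _adapters.get((cls, protocol))
        if adapter is not None:
            return adapter(obj, protocol)
    raise TypeError("cannot adapt %r to %r" % (obj, protocol))

def noadapt(obj, protocol):
    # The object already conforms, so the "adapter" hands it back unchanged.
    return obj

class FileLike:  # hypothetical protocol marker
    pass

class X:
    def read(self):
        return "data"

install_adapter(noadapt, X, FileLike)
x = X()
same_object = adapt(x, FileLike) is x
```

With the identity adapter installed, `same_object` is true, which is exactly
the "if x is adapt(x, Y):" test mentioned above.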

Incidentally, I consider the best compile-time equivalent of adaptation
I know to be Haskell's "instance" statement.  Don't let the name
mislead you -- Haskell is FP, not OO, and doesn't use "instance"
to talk about what in Python we'd call instances of a type.  Rather,
Haskell uses statement "instance" to assert that a type T is an
instance of a typeclass C.  A typeclass is Haskell's equivalent of
an interface (actually of a stateless abstract class, and then some,
but that's another issue), and "type T instances typeclass C" is
Haskell's way to say "type T implements interface C".

Renaming IS generally necessary.  If you have an installation of
the Haskell interpreter HUGS (comes with many Linux distros,
for example -- can be downloaded from also
for Windows), have a look at demos/Lattice.hs -- you may find
it readable even without knowing Haskell, since Haskell uses
significant whitespace much like Python and has much notation
in common with maths and other FP languages.  Lattice.hs
defines a typeclass "Lattice", and asserts that Bool instances
Lattice (then goes on from there, of course, but let's stop at
this part).  But of course the key functions (would be methods
for us, in an OO language) in Lattice are called meet and join
(standard math terms, after all), while in Bool the corresponding
functionality is given by functions named && and ||.  No problem,
of course: the instance statement is (MUST be!) able to
"rename" -- to assert that, when using Bool as a Lattice,
meet means && and join means || .

Since instance is a compile-time thing, it doesn't need any
'wrapper' -- just some appendix to the compiler's symbol tables,
of course.  But if we want to remain OO, dynamic, and do name
dispatching of methods, we WOULD need a wrapper of some
kind to perform the same renaming.
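
As a rough illustration of such a renaming wrapper in Python 3 (the class
name BoolAsLattice is invented here, and this is only a sketch of the idea,
not a proposal for the actual mechanism): the wrapper exposes the Lattice
names meet and join and forwards them to the "and" / "or" behaviour that
bool already has.

```python
class BoolAsLattice:
    """Wrapper presenting a bool through the Lattice names meet and join."""

    def __init__(self, value):
        self.value = bool(value)

    def meet(self, other):
        # meet is spelled "and" for booleans
        return BoolAsLattice(self.value and other.value)

    def join(self, other):
        # join is spelled "or" for booleans
        return BoolAsLattice(self.value or other.value)

t, f = BoolAsLattice(True), BoolAsLattice(False)
meet_result = t.meet(f).value
join_result = t.join(f).value
```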

A facility that is SO special-purpose that it doesn't let me say
"I have conceived this new interface Lattice, and existing class bool 
is an example of it" -- or forces me to distort Lattice's method names
away from standards such as meet and join in order to fit them to
the preexisting names of bool's methods/operators (and then how
will I go about asserting that OTHER classes are also lattices...?),
does not seem a good idea to me.

> Adaptation is cool, but I don't see it as a replacement for anything that
> interfaces are supposed to achieve. Effective adaptation requires some
> kind of interface definition mechanism to work on top of.

The latter is a widespread opinion, but one with which I disagree.

Using types as the "protocols" that adaptation works with is, IMHO,
quite workable.  And some of adaptation's aspects provide facilities,
such as "third-party adapters" also working for renaming and
similar issues, without which you could not achieve all "that interfaces
are supposed to achieve" -- and I don't think those aspects should
ALSO be duplicated by adding other mechanisms AS WELL AS
adaptation.  It seems to me Zope3 has it right in this respect (even
though I think I disagree on other design choices -- I won't know
for sure until I get a chance to try it out in production code), by
making adaptation a key part of the interfaces' mechanisms.


From  Sat Aug 24 23:00:40 2002
From: (François Pinard)
Date: 24 Aug 2002 18:00:40 -0400
Subject: [Python-Dev] heapq method names
Message-ID: <>

Hi, people.

In the `heapq' module, I'm a little bothered by the fact that method names have
`heap' as a prefix in their name.  If the methods had been installed as
standard list methods, it would be quite understandable, but it has been
decided otherwise.

The most usual way of using a module is:

    import MODULE

rather than:

    from MODULE import METHOD

and we should name METHODs accordingly, not repeating the MODULE as prefix.
This is a rather common usage, almost everywhere in the Python library.

So my suggestion of changing now, before `heapq' gets released for real:

    heappush -> push
    heappop -> pop
    heapreplace -> replace

I guess that `heapify' is OK as it stands.

The example should be changed accordingly, that is, using `import heapq'
instead of `from heapq import such-and-such', using `heapq.push' instead of
`heappush' and `heapq.pop' instead of `heappop'.

François Pinard

From  Sat Aug 24 23:14:31 2002
From: (Oren Tirosh)
Date: Sat, 24 Aug 2002 18:14:31 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <>
Message-ID: <>

On Sat, Aug 24, 2002 at 11:33:38PM +0200, Alex Martelli wrote:
> > I am not talking about situations where the object does not meet your
> > expectations and needs to be adapted - I'm talking about situations where
> > it actually does and the only problem is how to describe that fact
> > properly.
> Adaptation IS one way to "describe that fact properly", given that checks
> are anyway constrained to happen at runtime.  You just install an
> adapter from objects x of class X to protocol Y that receives x as
> an argument and whose body is just "return x" -- that's all.

I don't take it as given that "checks are anyway constrained to happen at
runtime". I prefer a system that is future-proof enough to evolve into
something that the compiler can use to do type inference. That is one
of the reasons I don't want a typeclass / type category / interface /
type expression / whateveryouwannacallit to call any user-written
Python code. (I don't want Python to become one of those languages where user
code can execute at compile time :-)

> ... what's the gain that would justify a further, special-purpose 
> mechanism that's usable only (e.g.) when all of X's methods already have 
> the right name and order of parameters, but then we'd have to switch to 
> another if there is renaming or reordering to be done?  

Being able to eventually perform many type checks earlier - at compile 
time or at module load time. Renaming and reordering really does have to 
be done at runtime in a dynamically typed language. 


From  Sun Aug 25 00:55:00 2002
From: (Guido van Rossum)
Date: Sat, 24 Aug 2002 19:55:00 -0400
Subject: [Python-Dev] Fw: Security hole in rexec?
In-Reply-To: Your message of "Sat, 24 Aug 2002 21:29:24 +0200."
References: <038e01c24ba4$8e78db00$6d94fea9@newmexico>
Message-ID: <>

[rexec compromised by deleting __builtins__]

This has been known for a while, see

My recommendation is the same as always: don't trust rexec.

--Guido van Rossum (home page:

From  Sun Aug 25 01:03:59 2002
From: (Guido van Rossum)
Date: Sat, 24 Aug 2002 20:03:59 -0400
Subject: [Python-Dev] heapq method names
In-Reply-To: Your message of "Sat, 24 Aug 2002 18:00:40 EDT."
References: <>
Message-ID: <>

> In the `heapq' module, I'm a little bothered by the fact that method names have
> `heap' as a prefix in their name.  If the methods had been installed as
> standard list methods, it would be quite understandable, but it has been
> decided otherwise.
> The most usual way of using a module is:
>     import MODULE
>      ...
> rather than:
>     from MODULE import METHOD
>      ...
> and we should name METHODs accordingly, not repeating the MODULE as prefix.
> This is a rather common usage, almost everywhere in the Python library.
> So my suggestion of changing now, before `heapq' gets released for real:
>     heappush -> push
>     heappop -> pop
>     heapreplace -> replace
> I guess that `heapify' is OK as it stands.
> The example should be changed accordingly, that is, using `import heapq'
> instead of `from heapq import such-and-such', using `heapq.push' instead of
> `heappush' and `heapq.pop' instead of `heappop'.

-1.  The names 'push', 'pop' and 'replace' are too generic.  The module
seems to "invite" the ``from heapq import heappush, heappop'' syntax,
and I'd like to honor that.

--Guido van Rossum (home page:

From  Sun Aug 25 08:33:55 2002
From: (Alex Martelli)
Date: Sun, 25 Aug 2002 09:33:55 +0200
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <>
Message-ID: <02082509335500.13476@arthur>

On Sunday 25 August 2002 00:14, Oren Tirosh wrote:
> On Sat, Aug 24, 2002 at 11:33:38PM +0200, Alex Martelli wrote:
> > > I am not talking about situations where the object does not meet your
> > > expectations and needs to be adapted - I'm talking about situations
> > > where it actually does and the only problem is how to describe that
> > > fact properly.
> >
> > Adaptation IS one way to "describe that fact properly", given that
> > checks are anyway constrained to happen at runtime.  You just install
> > an adapter from objects x of class X to protocol Y that receives x as
> > an argument and whose body is just "return x" -- that's all.
> I don't take it as given that "checks are anyway constrained to happen at
> runtime". I prefer a system that is future-proof enough to evolve into
> something that the compiler can use to do type inference. That is one

A compiler able to do type inference had better be smart enough to
recognize the special-case pattern:

def noadapt(obj, proto): return obj
install_adapter(noadapt, someclass, someprotocol)

If that hypothetical compiler is unable to recognize this pattern (with
whatever change of names except for a built-in install_adapter), its
hypothetical type inference is FAR too puny for me to be happy to
pay any substantial price for it.

In particular, defining multiple mechanisms that partially overlap for
the same tasks, for the sole purpose of making it hypothetically and
marginally easier to draw the sole distinction of compile time versus
runtime, IS a substantial price to pay in terms of language complication.

Conceptual distinctions between compile time and runtime are already
"a price".  One that may be worth paying, in general, for performance and 
in order to get error messages earlier.  But, I think, one we should be
quite wary to _extend_ -- particularly to extend to areas where we might
well get away WITHOUT paying it.

> time or at module load time. Renaming and reordering really does have to
> be done at runtime in a dynamically typed language.

Not necessarily, given _decent_ (hypothetical) type inference.  The
hypothetical decent type-inferring compiler would know about the
install_adapter builtin.  It could then hypothetically special-case
method renaming by recognizing in the adapter pure-renaming
patterns such as:

    def interfacemethod(self, *args): return self.obj.objmethod(*args)

and generate code suitably when it recognizes that a given adapter
does nothing but renaming.  Blue-sky to some extent, but that sort
of thing IS a good part of what type *inference* is about.


From  Sun Aug 25 13:00:16 2002
From: (Skip Montanaro)
Date: Sun, 25 Aug 2002 07:00:16 -0500
Subject: [Python-Dev] Weekly Python Bug/Patch Summary
Message-ID: <>

Bug/Patch Summary

273 open / 2793 total bugs (+4)
109 open / 1663 total patches (+3)

New Bugs

Empty genindex.html pages (2002-07-26)
import cycle in distutils (2002-08-19)
spawn*() doesn't handle errors well (2002-08-20)
exec*() doesn't handle errors well (2002-08-20)
compiler package and SET_LINENO (2002-08-20)
Core dump when using mmap. (2002-08-20)
import _tkinter python dumps core. (2002-08-21)
The KeyError message doesn't use repr on the key value reported (2002-08-21)
HTTPConnection memory leak (2002-08-22)
Python not handling cText (2002-08-22)
execfile() not show filename when IOErro (2002-08-23)
ext module generation problem (2002-08-23)
re searches don't work with 4-byte unico (2002-08-23)
Method resolution order in Py 2.2 - 2.3 (2002-08-23)
CRAM-MD5 module (2002-08-24)
SocketServer wrong about allow_reuse_add (2002-08-24)
sub[n] not working as expected. (2002-08-24)
httplib.connect broken in 2.1 branch (2002-08-25)
NameError value is not the name error (2002-08-25)

New Patches

Pure Python strptime() (PEP 42) (2001-10-23)
"simplification" to ceval.c (2002-08-19)
Oren Tirosh's fastnames patch (2002-08-20)
textwrap.dedent, inspect.getdoc-ish (2002-08-21)
Failure building the documentation (2002-08-22)
PEP 269 Implementation (2002-08-23)
Bugfix for (2002-08-25)

Closed Bugs

Summary: "BuildApplet can destory the source file on Mac OS X" (2002-01-18)
** in doc/current/lib/operator-map.html (2002-07-04)
bug in splituser(host) in urllib (2002-07-14)
imaplib: prefix-quoted strings (2002-07-30)
add main to py_pycompile (2002-07-30)
Mixin broken for new-style classes (2002-08-05)
comments taken as values in ConfigParser (2002-08-08)
string method bugs w/ 8bit, unicode args (2002-08-14)
pythonw has a console on Win98 (2002-08-15)
IDLE/Command Line Output Differ (2002-08-15)
pickle_complex in (2002-08-15)
popenN return only text mode pipes (2002-08-16)
textwrap has problems wrapping hyphens (2002-08-17)

Closed Patches

Alternative implementation of interning (2002-07-01)
new version of Set class (2002-07-13)
Update environ for (2002-08-15)
urllib.splituser(): '@' in usrname (2002-08-17)

From  Sun Aug 25 13:04:20 2002
From: (François Pinard)
Date: 25 Aug 2002 08:04:20 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> > So my suggestion of changing now, before `heapq' gets released for real:
> > 
> >     heappush -> push
> >     heappop -> pop
> >     heapreplace -> replace

> -1.  The names 'push', 'pop' and 'replace' are too generic.  The module
> seems to "invite" the ``from heapq import heappush, heappop'' syntax,
> and I'd like to honor that.

May I invite you to reconsider?  We are going to live with that one for
a loong time, you know...

Quite granted, as it stands, the module invites the long form of import
(from MODULE import LIST-OF-NAMES).  This _is_ what I question.

Writing `heapq.heapXXX' is kind of ugly, people are going to spontaneously
avoid it, especially given that the documentation says to do so.  Yet,
the long import line is uselessly tedious to write.  I would not think the
author really wrote `heappush' and `heappop' with the intent that they
could sit in a module and be imported with the long form, but rather as
inlinable `def', or maybe rather as built-in methods for `list' objects.
That intent changing, the method names are then asking to be revised.

There are not many cases in the Python library where the `from ... import'
is forced upon users in practice.  The `BaseHTTPServer' module and friends
are the only examples that come to mind, and I find these import lines
especially cumbersome to write: hopefully, these are not to be used often.

The `heapq' module is different, as for some programmers, it might be used
often, and I do not see a real reason for making it tedious or different.

As for `push' etc. being too generic, they are used in the context of a
specialised module, which gives these words their specialised meaning, so
genericity is not a real argument.  Other modules already qualify simple
words.  Has it been a problem?  Even, would it be that some people really
want to write `from MODULE import *' or `from MODULE import SUCH-AND-SUCH'
for a lot of modules at global scope, something which is not to be encouraged
anyway, these users still have `from ... import ... as' to help them.
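
For instance, a user who prefers the short names can already get them
today with the `as' form.  A small sketch in Python 3 (the variable names
are made up):

```python
from heapq import heappush as push, heappop as pop

queue = []
for item in (5, 1, 4):
    push(queue, item)   # same function as heapq.heappush
smallest = pop(queue)   # same function as heapq.heappop
```

Here `smallest` is 1, since the heap always hands back its least element.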

Please consider altering the current `heapq' module so as to _not_ invite a
different importing style.  Make it more similar to the rest of the library,
there is probably no real need for a difference.  Let it be nicer to use!

François Pinard

From Samuele Pedroni" <  Sun Aug 25 14:51:09 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Sun, 25 Aug 2002 15:51:09 +0200
Subject: [Python-Dev] type categories
References: <> <> <> <>
Message-ID: <003401c24c3e$95df06e0$6d94fea9@newmexico>

This is straight from the library (xml.sax.saxutils)
[chosen more because it is at least real code than because nobody can argue
that it is irrelevant. I find it fascinating how powerful the argument
"nobody does that often" is in discussions about language design]

def prepare_input_source(source, base = ""):
    """This function takes an InputSource and an optional base URL and
    returns a fully resolved InputSource object ready for reading."""

    if type(source) in _StringTypes:
        source = xmlreader.InputSource(source)
    elif hasattr(source, "read"):
        f = source
        source = xmlreader.InputSource()
        if hasattr(f, "name"):

the first problem is the "ontology" problem and how much we want to support
someone who wants to strictly check
  (a)  "source has intentionally a file-like read" vs. just
  (b)  "source has some read method"...

This is independent of whether we have adapt or declarative interfaces or both.

The above could be written as:

  source = adapt(source,xmlreader.InputSource)

moving the code inside xmlreader.InputSource.__adapt__ .
But does this address (a)?

I would say no.

Then one could simply not implement __adapt__ but leave the burden to the users
to define my-type-with-a-good-read
to xmlreader.InputSource adaptations. Or put the code inside
and substitute

     elif hasattr(source, "read"):


   f = adapt(source,???)

so the "ontology" problem is back.

My point is not against adaptation, but that adaptation does not automagically
solve all our problems without further thinking.

I repeat, with both adaptation and interfaces, if one cares about contracts vs.
just signatures, the ontology problem is with us.

Adaptation is probably expressive enough. But the choice between

 - mechanisms to ask and declare whether object implements protocol
 - mechanisms to register adapter factories between protocol A and B
 [I think this is Zope3 model]


 - just adaptation

should be a choice also about convenience, readability, ...
both ways the ontology problem is there.


From  Sun Aug 25 15:47:43 2002
From: (Aahz)
Date: Sun, 25 Aug 2002 10:47:43 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes,1.7,1.8
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Sat, Aug 24, 2002, Tim Peters wrote:
> Given that in real life most people still get more not-spam than spam,
> removing the counter-bias in the scoring math may boost the false negative
> rate.

as of six months ago, i no longer believe this to be necessarily true
Aahz (           <*>

Project Vote Smart:

From  Sun Aug 25 15:50:14 2002
From: (Martin Sjögren)
Date: 25 Aug 2002 16:50:14 +0200
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
References: <>
Message-ID: <1030287015.559.9.camel@winterfell>

On Sun 2002-08-25 at 14.04, François Pinard wrote:
> As for `push' etc. being too generic, they are used in the context of a
> specialised module, which gives these words their specialised meaning, so
> genericity is not a real argument.  Other modules already qualify simple
> words.  Has it been a problem?  Even, would it be that some people really
> want to write `from MODULE import *' or `from MODULE import SUCH-AND-SUCH'
> for a lot of modules at global scope, something which is not to be encouraged
> anyway, these users still have `from ... import ... as' to help them.
> Please consider altering the current `heapq' module so as to _not_ invite a
> different importing style.  Make it more similar to the rest of the library,
> there is probably no real need for a difference.  Let it be nicer to use!

I agree completely. And as for the names being too generic. Well, gee,
they *are* usually used for putting an element in the data structure,
removing an element from the data structure, etc. Often regardless of
the data structure/ADT. I've always used push, pop (and peek etc.) for
heaps, as well as stacks. If anything, I think that the pop method of
lists is confusing :-)


From  Sun Aug 25 17:53:24 2002
From: (Tim Peters)
Date: Sun, 25 Aug 2002 12:53:24 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
Message-ID: <>

Note that it's common to use the bisect module in the

    from bisect import bisect_right, bisect, insort

way too, rather than spell out bisect.bisect (etc) each time.  That's "the
other" module that (conceptually) adds new methods to lists.

If you want simpler names, I'm finding this little module quite pleasant to
use:

import heapq

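class Heap(list):

    def __init__(self, iterable=[]):
        # Assumed body: just copy the items in; call heapify() afterwards
        # to establish heap order.
        self.extend(iterable)

    push    = heapq.heappush
    popmin  = heapq.heappop
    replace = heapq.heapreplace
    heapify = heapq.heapify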

That is, it creates a Heap type that's just a list with some extra methods.
Note that the "pop" method can't be named "pop"!  If you try, you'll soon
get unbounded recursion because the heapq functions need list.pop to access
the list meaning of "pop".
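
A sketch of the same idea for a current Python 3, under two assumptions of
mine: the constructor heapifies its input, and the methods are written out
explicitly.  The explicit form is used because on CPython builds where the
heapq functions come from the C accelerator, assigning them as class
attributes does not turn them into bound methods.

```python
import heapq

class Heap(list):
    """A list with heap-flavoured method names."""

    def __init__(self, iterable=()):
        super().__init__(iterable)
        heapq.heapify(self)  # assumption: establish heap order up front

    def push(self, item):
        heapq.heappush(self, item)

    def popmin(self):
        return heapq.heappop(self)

    def replace(self, item):
        return heapq.heapreplace(self, item)

h = Heap([5, 1, 4])
h.push(0)
drained = [h.popmin() for _ in range(len(h))]
```

Draining the heap this way yields the items in ascending order, and the
method is still called popmin rather than pop, so list.pop keeps its usual
meaning.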

Guido suggested a long time ago that such a class could be added to heapq,
and I like it a lot in real life.

From  Sun Aug 25 22:14:07 2002
From: (Guido van Rossum)
Date: Sun, 25 Aug 2002 17:14:07 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: Your message of "Sun, 25 Aug 2002 08:04:20 EDT."
References: <> <>
Message-ID: <>

> May I invite you to reconsider?  We are going to live with that one for
> a loong time, you know...

I know.  I have read and re-read your arguments, but I see nothing to
change my mind.  Somehow the short names you suggest just seem wrong
to me.  We can agree to disagree, but I feel strongly that the names
should not be changed.

--Guido van Rossum (home page:

From  Sun Aug 25 22:56:46 2002
From: (Guido van Rossum)
Date: Sun, 25 Aug 2002 17:56:46 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Sat, 24 Aug 2002 14:41:16 EDT."
References: <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Allow me some random thoughts.  (Aren't they always random, anyway? :-)

Maybe you could contribute some not-so-random code?  Words we've got
enough. :-)

> When I saw some of the suggestions on this list for "generating"
> elements of a cartesian product, despite sometimes elegant, I
> thought "Too much done, too soon.".  But the truth is that I did not
> give the thing a serious try, I'm not sure I would be able to offer
> anything better.
> One nice thing, with a dict or a set, is that we can quickly access
> how many entries there are in there.  Is there some internal way to
> efficiently fetch the N'th element, from the order in which the keys
> would be naturally listed?  If not, one could always pay some extra
> temporary memory to build a list of these keys first.  If you have
> to "generate" a cartesian product for N sets, you could set up a
> compound counter as a list of N indices, the K'th meant to run from
> 0 up to the cardinality C[K] of the K'th set, and devise simple
> recipes to yield the element of the product represented by the
> counter, and to bump it.  Moreover, it would be trivial to equip
> this generator with a `__len__' function able to predict the
> cardinality CCC of the whole result, and quite easy being able to
> transform any KKK between 0 and NNN into an equivalent compound
> counter, and from there, access any member of the cartesian product
> at constant speed, without generating it all.
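
For reference, the compound-counter idea quoted above can be sketched in a
few lines of Python 3.  The function names are invented for illustration,
and the inputs are assumed to be sequences (a set would first have its
elements listed, as the quoted text suggests).  The index k is decoded as a
mixed-radix number whose digits select one element from each input.

```python
def product_len(pools):
    """Cardinality of the cartesian product of the given sequences."""
    total = 1
    for pool in pools:
        total *= len(pool)
    return total

def product_at(pools, k):
    """Return the k'th tuple of the product, last pool varying fastest."""
    if not 0 <= k < product_len(pools):
        raise IndexError(k)
    picked = []
    for pool in reversed(pools):
        k, i = divmod(k, len(pool))  # peel off one mixed-radix digit
        picked.append(pool[i])
    return tuple(reversed(picked))

pools = [[1, 2], ["a", "b", "c"]]
all_tuples = [product_at(pools, k) for k in range(product_len(pools))]
```

Each lookup costs one divmod per input sequence, so access time depends on
the number of sets but not on the size of the product.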

Since the user can easily multiply the length of the input sets
together, what's the importance of the __len__?

And what's the use case for randomly accessing the members of a
cartesian product?  IMO, the Cartesian product is mostly useful for
abstract mathematical thought, not for solving actual programming
problems.

> All the above is pretty simple, and meant to introduce a few
> suggestions that might solve once and for all, if we could do it
> well enough, a re-occurring request on the Python list about how to
> produce permutations et al.  We might try to rewrite the recipes
> behind a "generating" cartesian product of many sets, illustrated
> above, into a similar generating function able to produce all
> permutations of a single set.  So let's say:
>    Set([1, 2, 3]).permutations()
> would lazily produce the equivalent of:
>    Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])
> That generator could offer a `__len__' function predicting the
> cardinality CCC of the result, and some trickery could be used to
> map integers from 0 to CCC into various permutations.  Once on the
> road, it would be worth offering combinations and arrangements just
> as well, and maybe also offering the "power" set, I mean here the
> set of all subsets of a given set, all with predictable cardinality
> and constant speed access to any member by index.
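
The "trickery" for mapping an integer to a permutation can be sketched with
the factorial number system.  This is a Python 3 illustration with an
invented function name, not code from the sets module; it returns the k'th
permutation in the order of the input sequence.

```python
from math import factorial

def permutation_at(items, k):
    """Return the k'th permutation of items (as a list), in input order."""
    pool = list(items)
    n = len(pool)
    if not 0 <= k < factorial(n):
        raise IndexError(k)
    result = []
    for remaining in range(n, 0, -1):
        block = factorial(remaining - 1)  # permutations per leading choice
        i, k = divmod(k, block)
        result.append(pool.pop(i))
    return result

perms = [permutation_at([1, 2, 3], k) for k in range(factorial(3))]
```

The cardinality is simply factorial(len(items)), so a `__len__' for such a
generator would be cheap to provide.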

Obviously you were inspired here by Eric Raymond's implementation of
the powerset generator...

But again I ask, what's the practical use of random access to
permutations?  (I know there are plenty of uses of permutations, but I
doubt the need for random access.)

> Yet, many questions or objections may arise.  Using tuples to
> represent each element of a cartesian product of many sets is pretty
> natural, but it is slightly less clear that a tuple is the best fit
> for representing an ordered set as in permutations and arrangements,
> as tuples may allow elements to be repeated, while an ordered set
> does not.  I think that sets are to be preferred over tuples for
> returning combinations or subsets.

You must have temporarily stopped thinking clearly there.  Seen as
sets, all permutations of a set's elements are the same!  The proper
output for a generator of permutations is a list; for generality, its
input should be any iterable (which includes sets).  If the input
contains duplicates, well, that's the caller's problem.

For combinations, sets are suitable as output, but again, I think it
would be just as suitable to take a list and generate lists -- after
all the lists are trivially turned into sets.

> While it is natural to speak and think of subsets of a set, or
> permutations, arrangements and combinations of a set, some people
> might prefer to stay closer to an underlying implementation with
> lists (sublists of a list, permutations, arrangements or
> combinations of a list), and would feel that going through sets is
> an unwelcome detour for their applications.  Indeed, what's the real
> use and justification for hashing keys and such things, when one
> wants nothing else than arrangements from a list?


> Another aspect worth some thinking is that permutations, in
> particular, are mathematical objects in themselves: we can notably
> multiply permutations or take the inverse of a permutation.

That would be a neat class indeed.  How useful it would be in practice
remains to be seen.  Do you do many ad-hoc permutation calculations?

> Arrangements are in fact permutations over combinations elements.
> Some thought is surely needed for properly reflecting mathematical
> elegance into how the set API is extended for the above, and not
> merely burying that elegance under practical concerns.

And, on the other hand, practicality beats purity.

> Some people may think that these are all problems which are
> orthogonal to the design of a basic set feature, and which should be
> addressed in separate Python modules.  On the other hand, I think
> that a better and nicer integration might result if all these things
> were thought together, and thought sooner than later.  Moreover, my
> intuition tells me that with some care and luck (both are needed),
> these extra set features could be small enough additions to the
> `sets' module to not be worth another one.  Besides, if appropriate,
> such facilities would surely add a lot of zest and punch into the
> interest raised by the `sets' module when it gets published.

I'd rather see the zest added to Python as a whole -- sets are a tiny
part, and if you read PEP 218, you'll see that the sets module is only
a modest first step of that PEP's program.

--Guido van Rossum (home page:

From  Mon Aug 26 01:08:03 2002
From: (Greg Ewing)
Date: Mon, 26 Aug 2002 12:08:03 +1200 (NZST)
Subject: [Python-Dev] Re: Automatic flex interface for Python?
In-Reply-To: <007b01c24a7a$a081c580$0900a8c0@spiff>
Message-ID: <>

> you can do that without even looking at the characters?

No, but the original complaint was that immutability
of strings made lexing difficult. I was pointing
out that it's possible to do it without mutating
anything per-character.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 26 01:37:22 2002
From: (Greg Ewing)
Date: Mon, 26 Aug 2002 12:37:22 +1200 (NZST)
Subject: [Python-Dev] Questions about
In-Reply-To: <>
Message-ID: <>

Guido van Rossum <>:

> But in order to be a good
> citizen in the world of binary operators, __or__ should not raise
> TypeError;
> if the other argument implements __ror__, union()
> will acquire this ability.

Another possible reason is so that if a subclass
overrides __or__, union() will get the new behaviour

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 26 01:58:10 2002
From: (Greg Ewing)
Date: Mon, 26 Aug 2002 12:58:10 +1200 (NZST)
Subject: [Python-Dev] Why no math.fac?
In-Reply-To: <20020824183056.GA1859@lilith.ghaering.test>
Message-ID: <>

Gerhard Häring <>:

> Any reason why there isn't any factorial function in the math
> module?

Probably because there isn't one in the C math library. :-)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Mon Aug 26 08:57:30 2002
From: (Ka-Ping Yee)
Date: Mon, 26 Aug 2002 00:57:30 -0700 (PDT)
Subject: [Python-Dev] type categories
In-Reply-To: <>
Message-ID: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>

On Fri, 23 Aug 2002, Guido van Rossum wrote:
> I haven't given up the hope that inheritance and interfaces could use
> the same mechanisms.  But Jim Fulton, based on years of experience in
> Zope, claims they really should be different.  I wish I understood why
> he thinks so.

If i may hazard a guess, i'd imagine that Jim's answer would simply be
that inheritance (of implementation) doesn't imply subtyping, and
subtyping doesn't imply inheritance.

That is, you may often want to re-use the implementation of one class
in another class, but this doesn't mean the new class will meet all of
the commitments of the old.  Conversely, you may often want to declare
that different classes adhere to the same set of commitments (i.e.
provide the same interface) even if they have different implementations.
(A common situation where the latter occurs is when the implementations
are written by different people.)

> Agreeing on an ontology seems the hardest part to me.

Indeed.  One of the advantages of separating inheritance and subtyping
is that this can give you a bit more flexibility in setting up the
ontology, which may make it easier to settle on something good.

-- ?!ng

From  Mon Aug 26 10:42:29 2002
From: (Jack Jansen)
Date: Mon, 26 Aug 2002 11:42:29 +0200
Subject: [Python-Dev] [development doc updates]
In-Reply-To: <>
Message-ID: <>

On Saturday, August 24, 2002, at 12:31 , Fred L. Drake, Jr. wrote:

> Jack Jansen writes:
>> how much work would it be to make at least the html tarfile
>> available too under
> Are you looking for the tarfile or for an online documentation set?

I'm looking for the tarfile. The documentation builder downloads it, 
the HTML files with some stuff and then feeds it through the Help 

The result can be searched and browsed by Apple Help Viewer (except for the
minor detail that we can't yet convince AHV that the Python documentation
actually exists :-)
>> I'm looking at making the documentation friendly to the Mac help
>> viewer (actually, Bill Fancher donated the code), and it would
>> help the build process if there was a fixed URL based on the
>> version number where I could always find the latest docs for the
>> current version.
> Is there online documentation for the Mac OS help viewer?  I don't
> know anything about it.

It's all a bit fragmented, but "providing user assistance with Apple
Help" has most of the highlevel info. "Apple Help Reference" has the API
a program can use to interface to the help manager (at least, some of
it:-). On OSX these are online if you've installed the developer tools.
Otherwise you can find them on the Apple website too.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -

From  Mon Aug 26 11:31:53 2002
From: (Laszlo Asboth)
Date: Mon, 26 Aug 2002 11:31:53 +0100
Subject: [Python-Dev] Introduction of new project
Message-ID: <>

Dear all,

I would like to inform you that I started a new project.
Please see attached file.

Many thanks and best regards,

Laszlo Asboth

[Attached file: "Takas Introduction.txt"]

"TAKAS" Project Introduction

The idea of this project was arisen already 10 years ago. At that time
in Hungary mostly was used computers with Intel 386 processor and
Novell 3.1 operation system. Due to informatics development there was
made a lot of new programs. I found a problem with them. For each task
we used other standalone program, in most cases made by totally
different supplier/programmer. The bottleneck of such kind of way of
working is that each program needs to be updated with the same data.
This means waste of energy. Then came up an idea, why we do not use
only one system for all tasks, which can be done by computer within a
company?
At my next workplace found the same situation. While this company was
multinational company the decision about each important issue came from
abroad. Here we used more different operating systems than before
(Windows 3.1, Novell, AS/400).
In my existing workplace (started at 2nd of February 1998) the so
called "diffusion" had been grown, against that we use more ERP & CRM
systems like Baan, SAP, etc. Fortunately between these systems a lot of
interfaces were built to share data (via automatic ftp jobs). After
introducing the structure of the informatics's system I found that we
use these systems only partly. This shows for me that these systems can
not fulfil fundamental (all) requirements of a normal company.

After more than 4 yearly preparation and collection of experiences I
decided to start to build an own ERP & CRM system which applies the
existing knowledge of mine and volunteer's who joins to this idea.

My main philosophy is that the computer should serve people and not
opposite (as works nowadays).

Meaning the name of the project:
The name has no definite meaning. At my workplace I work together with
a colleague. When I discussed with him about this idea, he thought
there is something in it too. We started to plan it. While we are
already on way we need to call the project somehow. After some thinking
and intuition the name was found: TAKAS. This is coming from two
slices: "TAK" and "AS". Both are coming from our names (TAKacs,
ASboth).

Purpose of the project:
The project has the following purposes:
* Build only one new - homogeneous - computer system, which helps
  operate to maintain data of the company.
* Create a framework where possible to find and realise a solution on
  an easier, faster way.

* Robust, fast, easy to use
* Clasps all tasks of informatics of an arbitrary big organisation
* Decreases time of maladministration
* Increases the level of services within and between companies
* Much less bug possibilities (no any interfaces necessary)
* Each data needs to enter once and at the point of origin only
  (efficiency, more disengagedness)
* The user works in unitary surface (less learning time)
* Economical solution, while it can be modified easily

The project would like to reach: all data, which can be handled via
computers, should be handled in this system only. That means integrate
all tasks in computer into this program (beside normal direction of the
company the system should handle the e-mails, fax, personal tasks,
contacts, etc. too).

Philosophy of the project:
From company's management point of view the purpose to use this system
can be to build an effective, economical organisation with it. A system
is good when the user can faster and easier use it.

Other important issue is the cost of the programs. There are big
differences between people about. Most of the managers think that
programs bought from software companies shares the responsibility of
use. The consequence of this kind of thinking is higher cost.
There are a lot of programs which costs very little or nothing (called
shareware, freeware or open source programs). My experience shows that
program of both categories work fine.
At the beginning of the project should be decided which category this
system will join. We already discussed about. It can be evident that
experts can be found with charge only. My opinion (was and) is now that
it can be found good people who join to the project in behalf of
creation and psychological appreciation only. On other hand I am sure
that later on we will get back our efforts in material form too.

This means that this project will be on the open source category.
Certainly before the real implementation we have to find the concord
how can the project continue to exist and fluent grow.

This philosophy brought already big result according to many projects
on the Internet where a lot of experts are participated.

There is other point of view. The project has a purpose to help people
to grow and develop themselves too. New ideas, proposals change the
world into a better way as we go now. I think this issue will be much
more concrete later on.

To be successful with this project it has to have very good background.
Should be defined which operating system, graphical interface,
programming language and database is the best for this project.

Operating system:
Beside Windows there are a lot of other operating system. Although
there are a lot of good things in it, I think for us better to choose
FreeBSD, because it is organised by a "core-team", works robustly,
grows dynamically and well documented.

Programming language:
In this area we can find a lot of them starting from the low level to
high level languages. My opinion is that we need a high level
programming language for this project. After a long search on Internet
I found Python programming language. This language takes into account
best the object-oriented philosophy. The advantage of Python is that
can be used as low level and high level operation too. There are a lot
of packages to it, which are very useful to create arbitrary program.
The possibility to use it on mostly used operating systems is a very
good aspect too.

Graphical interface:
We need a graphical interface too, while to my opinion a reliable
system in our world should work with it. There are more interfaces to
Python too, I think for us the wxPython is the best choice.

Database manager:
This point is the most important for such a big system. There is 2
different type of database managers: relational and object-oriented. In
the past I introduced relational databases only. After knowing Python I
found Zope Object DataBase (ZODB). The main advantage of it that it was
written in Python. There is an "extension" for it Zope Enterprise
Objects (ZEO) which provides to connect to ZODB via network. My
existing experience about ZEO that it is not ready for handling more
than 1 database in same time (I hope maybe this project gives the
chance to go further in this way).

The above mentioned backgrounds are the fundamentals of the project.
Certainly each idea, which can help to make better solutions, is
welcome. I know that there are a lot of experts who knows much more
about these and other components than I. This is the reason why I would
like to ask anybody to join to the project.

Introduction of the actor:
I would like to introduce myself too. My name is László Asbóth. I am 38
years old. I have 2 children (my daughter Dóra 13, my son Ádám 7). I
live in Hungary, at 5 Kiss J. str, in Sárvár. My professions are land
surveyor, computer programmer and preventive parapsychologist.
My existing workplace is in Szombathely, called Phycomp Hungary Kft. I
work as a program engineer.
My first impression about computers started when the first IBM AT
computers came to Matav Rt (the biggest telecommunication's company of
Hungary). I did not know the command "dir" yet. I started to play with
it, because it was very interesting. After some months I learned a lot
about DOS, and started to make a program in Basic language. This was a
"TOTO" program. Although we won some money on the second week we could
not make a fortune. My boss showed my interest for computers he
assigned me to a new post. Then I started to learn programming, which I
ended in 1993. I learned more computer languages (Clipper, Pascal, C).
On my existing workplace in 1998 met with the Unix operating system and
Baan ERP system. I learned to program in Baan as a
Last year I started seriously search for components to the project.

In my spare time I am interested in self-development. I use and teach
some methods in this area.

When you think this project has something for you, or you have interest
to join, please contact me. When you can give some advises only they
are welcome too.

Thank you very much for reading this introduction (excuse me for my English

In Sárvár, on 19th of August, 2002.

László Asbóth

From  Mon Aug 26 12:16:43 2002
From: (Michael Hudson)
Date: 26 Aug 2002 12:16:43 +0100
Subject: [Python-Dev] utf8 issue
In-Reply-To: Guido van Rossum's message of "Fri, 23 Aug 2002 17:05:27 -0400"
References: <>
Message-ID: <>

Guido van Rossum <> writes:

> This might belong on SF, except it's already been solved in Python
> 2.3, and I need guidance about what to do for Python 2.2.2.
> In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that
> cannot be decoded back.  In 2.3, this is fixed.  Should this be fixed
> in 2.2.2 as well?

I think this was discussed really quite a long time ago, like six
months or so.

> I'm asking because it caused problems with reading .pyc files: if
> there's a Unicode literal containing a lone surrogate, reading the
> .pyc file causes an exception:
> UnicodeError: UTF-8 decoding error: unexpected code byte
> It looks like revision 2.128 fixed this for 2.3, but that patch
> doesn't cleanly apply to the 2.2 maintenance branch.  Can someone
> help?

I think the reason this didn't get fixed in 2.2.1 is that it
necessitates bumping MAGIC.

I can probably dig up more references if you want.


34. The string is a stark data structure and everywhere it is
    passed there is much duplication of process.  It is a perfect
    vehicle for hiding information.
  -- Alan Perlis,

From  Mon Aug 26 13:24:49 2002
From: (François Pinard)
Date: 26 Aug 2002 08:24:49 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> > May I invite you to reconsider?  We are going to live with that one for
> > a loong time, you know...

> I know.  I have read and re-read your arguments, but I see nothing to
> change my mind.  Somehow the short names you suggest just seem wrong
> to me.  We can agree to disagree, but I feel strongly that the names
> should not be changed.

As you did read my arguments (from your first reply, I thought you missed
them, and this is why I tried explaining them better), and that I have
nothing substantially new to offer, that closes this discussion...  We are
solidly set on the current documentation.  To be or not to be, ... etc. :-)

                                Have a good day, and keep happy!

François Pinard

From  Mon Aug 26 13:32:02 2002
From: (François Pinard)
Date: 26 Aug 2002 08:32:02 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
References: <>
Message-ID: <>

[Tim Peters]

> Note that it's common to use the bisect module in the

>     from bisect import bisect_right, bisect, insort

> way too, rather than spell out bisect.bisect (etc) each time.  That's "the
> other" module that (conceptually) adds new methods to lists.

Wow!  You just put the finger on it...  I wondered a few times why this
module never attracted me! :-) :-)

> If you want simpler names, I'm finding this little module quite pleasant
> to use: [...]  That is, it creates a Heap type that's just a list with
> some extra methods.

Very elegant indeed.  Something like this was discussed earlier, but faded
out of my memory.  Thanks for the tip, Tim!

> Note that the "pop" method can't be named "pop"!  If you try, you'll soon
> get unbounded recursion because the heapq functions need list.pop to access
> the list meaning of "pop".

Sold!  `popmin' is adequate and clear.

François Pinard

From  Mon Aug 26 14:08:05 2002
From: (Jeremy Hylton)
Date: Mon, 26 Aug 2002 09:08:05 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

  TP> That is, it creates a Heap type that's just a list with some
  TP> extra methods.  Note that the "pop" method can't be named "pop"!
  TP> If you try, you'll soon get unbounded recursion because the
  TP> heapq functions need list.pop to access the list meaning of
  TP> "pop".

  TP> Guido suggested a long time ago that such a class could be added
  TP> to heapq, and I like it a lot in real life.

You can't use all of the regular list methods, right?  If I'd called
append() on a Heap(), it wouldn't maintain the heap invariant.  I
would think the same is true of insert() and lots of other methods.
If we add a Heap class, which seems quite handy, maybe we should
disable methods that don't work.

Interesting to note that if you disable the invalid methods of list,
then you've got a subclass of list that is not a subtype.


From  Mon Aug 26 15:05:20 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 10:05:20 -0400
Subject: [Python-Dev] utf8 issue
In-Reply-To: Your message of "Mon, 26 Aug 2002 12:16:43 BST."
References: <>
Message-ID: <>

> Guido van Rossum <> writes:
> > This might belong on SF, except it's already been solved in Python
> > 2.3, and I need guidance about what to do for Python 2.2.2.
> > 
> > In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that
> > cannot be decoded back.  In 2.3, this is fixed.  Should this be fixed
> > in 2.2.2 as well?
> I think this was discussed really quite a long time ago, like six
> months or so.
> > I'm asking because it caused problems with reading .pyc files: if
> > there's a Unicode literal containing a lone surrogate, reading the
> > .pyc file causes an exception:
> > 
> > UnicodeError: UTF-8 decoding error: unexpected code byte
> > 
> > It looks like revision 2.128 fixed this for 2.3, but that patch
> > doesn't cleanly apply to the 2.2 maintenance branch.  Can someone
> > help?
> I think the reason this didn't get fixed in 2.2.1 is that it
> necessitates bumping MAGIC.
> I can probably dig up more references if you want.

Please do.  Bumping MAGIC is a no-no between dot releases.  But I
don't understand why that is necessary?

--Guido van Rossum (home page:

From  Mon Aug 26 15:09:45 2002
From: (François Pinard)
Date: 26 Aug 2002 10:09:45 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> Since the user can easily multiply the length of the input sets together,
> what's the importance of the __len__?  And what's the use case for
> randomly accessing the members of a cartesian product?

For cartesian product, permutations, combinations, arrangements and power
sets, I do not see real use to __len__ or random accessing of members
besides explicit looping (or maybe saving an index for later resumption).
In my own case, I guess iterators (or generators) would leave me happy enough
that I do not really need more.

> IMO, the Cartesian product is mostly useful for abstract mathematical
> thought, not for solving actual programming problems.

One practical application pops to mind.  People might progressively use
and abuse the paradigm of looping over the members of a cartesian product,
instead of relying on nests of embedded loops over each of the set members.
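
As a small illustration of that paradigm: current Python offers it as
`itertools.product`, which postdates this thread.  The sketch below uses
made-up data and only shows that one flat loop visits the same tuples, in
the same order, as the nested loops it replaces.

```python
from itertools import product

# Illustrative inputs only; any iterables would do.
sizes = ["S", "M"]
colours = ["red", "blue"]
flags = [True, False]

# The traditional nest of embedded loops.
nested = []
for size in sizes:
    for colour in colours:
        for flag in flags:
            nested.append((size, colour, flag))

# The same enumeration as a single loop over the cartesian product.
flat = []
for size, colour, flag in product(sizes, colours, flags):
    flat.append((size, colour, flag))
```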

> > So let's say:
> >    Set([1, 2, 3]).permutations()
> > would lazily produce the equivalent of:
> >    Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])
> > 
> > That generator could offer a `__len__' function predicting the
> > cardinality CCC of the result, and some trickery could be used to map
> > integers from 0 to CCC into various permutations.  Once on the road,
> > it would be worth offering combinations and arrangements just as well,
> > and maybe also offering the "power" set, I mean here the set of all
> > subsets of a given set, all with predictable cardinality and constant
> > speed access to any member by index.

> Obviously you were inspired here by Eric Raymond's implementation of
> the powerset generator...

I do not think so.  I was inspired by what I remember of the SSP
package (something like Social Science Package - a FORTRAN library - this
was before SPSS).  It was especially clever about enumerating permutations.

> But again I ask, what's the practical use of random access to permutations?

The only one I see is for resuming an interrupted enumeration.
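
A minimal sketch of how such resumption could work, assuming permutations
are numbered in lexicographic order of positions (the factorial number
system).  The names `nth_permutation` and `permutations_from` are
hypothetical, and this version costs O(n^2) per permutation rather than
constant time.

```python
from math import factorial


def nth_permutation(items, index):
    """Return the permutation of `items` with the given 0-based index."""
    pool = list(items)
    n = len(pool)
    if not 0 <= index < factorial(n):
        raise IndexError("permutation index out of range")
    result = []
    for remaining in range(n, 0, -1):
        # Each choice for the next slot heads a block of (remaining-1)!
        # consecutive permutations.
        block = factorial(remaining - 1)
        position, index = divmod(index, block)
        result.append(pool.pop(position))
    return tuple(result)


def permutations_from(items, start=0):
    """Lazily yield permutations, beginning at index `start`."""
    pool = list(items)
    for index in range(start, factorial(len(pool))):
        yield nth_permutation(pool, index)


# Resume an enumeration that was interrupted after four permutations.
resumed = list(permutations_from([1, 2, 3], start=4))
```

Saving only the integer index is enough to pick the enumeration up again
later.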

> > Using tuples to represent each element of a cartesian product of many
> > sets is pretty natural, but it is slightly less clear that a tuple
> > is the best fit for representing an ordered set as in permutations
> > and arrangements, as tuples may allow elements to be repeated, while
> > an ordered set does not.  I think that sets are to be preferred over
> > tuples for returning combinations or subsets.

> You must have temporarily stopped thinking clearly there.

Maybe not, but I do not always express myself clearly.  Sorry.

> Seen as sets, all permutations of a set's elements are the same!

It's no use seeing each individual permutation as a set, of course.  But the
_set_ of all permutations, each of which is a tuple, is meaningful,
at least from the fact that permutations are conceptually unordered.
However, it might be (sometimes, maybe not that often) useful to enumerate
permutations in some canonical order.

> For combinations, sets are suitable as output, but again, I think it
> would be just as suitable to take a list and generate lists -- after
> all the lists are trivially turned into sets.

Quite agreed.

> > Another aspect worth some thinking is that permutations, in
> > particular, are mathematical objects in themselves: we can notably
> > multiply permutations or take the inverse of a permutation.

> That would be a neat class indeed.  How useful it would be in practice
> remains to be seen.  Do you do many ad-hoc permutation calculations?

A few common algorithms also make use of permutations without naming them.
`sort' applies a precise permutation, and it is often convenient to invert
that permutation to recover the original order after having enriched the
resulting structure, say.  I quite often resorted to the above trick.
Notice that `string.translate' applies a permutation.
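
The sort-and-restore trick can be sketched as follows.  The data and the
"enrichment" step (attaching a rank) are illustrative assumptions; only
the permutation handling is the point.

```python
words = ["pear", "apple", "fig"]

# order[i] is the original index of the item that lands at sorted position i.
order = sorted(range(len(words)), key=words.__getitem__)
sorted_words = [words[i] for i in order]

# "Enrich" the sorted structure, here by attaching each item's rank.
ranked = [(rank, word) for rank, word in enumerate(sorted_words)]

# Invert the permutation: inverse[j] is the sorted position of original item j.
inverse = [0] * len(order)
for sorted_pos, original_pos in enumerate(order):
    inverse[original_pos] = sorted_pos

# Apply the inverse to recover the original order, keeping the enrichment.
restored = [ranked[inverse[j]] for j in range(len(words))]
```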

In another life, I wrote a program (unknown outside CDC Cyber space) that
was efficiently comparing possibly big files, and I remember I had to work
a lot with permutation arithmetic.  This was a bit specialised, however.

I once wrote a C application named `recode' that mainly does charset
conversions.  It does some arithmetic on permutations at a few places,
either for optimisation or while seeking reversibility, and this is really
nothing far-fetched or unnatural.  I sometimes plan to parallel `recode'
with a Python implementation, because from experience, prototyping ideas in
C is rather painful, while I foresee it would be far easier in Python.
Undoubtedly then, I would formalise permutations.

> > Some thought is surely needed for properly reflecting mathematical
> > elegance into how the set API is extended for the above, and not
> > merely burying that elegance under practical concerns.

> And, on the other hand, practicality beats purity.

Only when both conflict without any hope of resolution.  But when
practicality and purity can coexist, that's even better.  So much better
in fact, that it's always worth very seriously trying to seek coexistence.

François Pinard

From  Mon Aug 26 15:20:35 2002
From: (Armin Rigo)
Date: Mon, 26 Aug 2002 16:20:35 +0200 (CEST)
Subject: [Python-Dev] SET_LINENO removal bugs
Message-ID: <>

Hello everybody,

A few core bugs with the line tracing in the new SET_LINENO-free world:


From  Mon Aug 26 15:27:53 2002
From: (Tim Peters)
Date: Mon, 26 Aug 2002 10:27:53 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <>
Message-ID: <>

[Jeremy Hylton]
> You can't use all of the regular list methods, right?  If I'd called
> append() on a Heap(), it wouldn't maintain the heap invariant.  I
> would think the same is true of insert() and lots of other methods.

Well, you *can* use all list methods, just as you can already call
heapq.heappop and pass any old list.  heapify exists so that you can repair
the heap invariant if you have reason to believe you may have broken it.
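
A short sketch of that repair step, using only the documented `heapq`
functions on a plain list with made-up values:

```python
import heapq

heap = [5, 3, 8, 1]
heapq.heapify(heap)

# Plain list methods know nothing about the heap invariant.
heap.append(0)
heap.insert(0, 9)

# Re-establish the invariant in one pass before using the heap again.
heapq.heapify(heap)

drained = [heapq.heappop(heap) for _ in range(len(heap))]
```

After the second heapify() call, heappop() again returns the items in
ascending order.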

> If we add a Heap class, which seems quite handy, maybe we should
> disable methods that don't work.

If it were called PriorityQueue, definitely, but "a heap" is a specific
implementation and heap users can exploit the list representation in lots of
interesting ways.

> Interesting to note that if you disable the invalid methods of list,
> then you've got a subclass of list that is not a subtype.

It is an interesting case this way!  There's a related case that's perhaps
of more pressing interest:  Raymond Hettinger has pointed out that, e.g.,

    3 in Set

is much slower than

    3 in dict

This is because Set.__contains__ is a Python-level call.  I sped up almost
all the binary set operations yesterday by factors of 2 to 5, mostly via
just using the underlying dict.__contains__ under the covers instead of
appealing to Set.__contains__.

For "simple sets" (sets that don't magically try to convert mutable
objects-- like sets --into immutable objects), the speed of

    3 in Set

could be restored by subclassing from dict and inheriting its __contains__.
But Set is trying not to make any promises about representation, so this is
a clearer case for some form of "(only) implementation inheritance".  The
desire is driven only by speed, but that can be a legit concern too (I'm not
sure it's a killer concern in this particular case).
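
The two representations being contrasted can be sketched like this.  Both
class names are hypothetical and neither is the real Set implementation;
no timings are claimed here.

```python
class WrappedSet:
    """Hides its dict; every membership test goes through a Python method."""

    def __init__(self, items=()):
        self._data = dict.fromkeys(items, True)

    def __contains__(self, item):
        return item in self._data


class DictBackedSet(dict):
    """Inherits dict.__contains__ directly, but exposes the whole dict API."""

    def __init__(self, items=()):
        super().__init__(dict.fromkeys(items, True))


wrapped = WrappedSet([1, 2, 3])
backed = DictBackedSet([1, 2, 3])
```

The second class gets its membership test by pure implementation
inheritance, which is exactly what makes a promise about representation.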

From  Mon Aug 26 15:45:28 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 10:45:28 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Mon, 26 Aug 2002 00:57:30 PDT."
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>
Message-ID: <>

> > I haven't given up the hope that inheritance and interfaces could use
> > the same mechanisms.  But Jim Fulton, based on years of experience in
> > Zope, claims they really should be different.  I wish I understood why
> > he thinks so.

> If i may hazard a guess, i'd imagine that Jim's answer would simply be
> that inheritance (of implementation) doesn't imply subtyping, and
> subtyping doesn't imply inheritance.

Well, yes, of course.  But I strongly believe that in *most* cases,
inheritance and subtyping go hand in hand.  I'd rather invent a
mechanism to deal with the exceptions rather than invent two parallel
mechanisms that must both be deployed separately to get the full
benefit out of them.

> That is, you may often want to re-use the implementation of one class
> in another class, but this doesn't mean the new class will meet all of
> the commitments of the old.  Conversely, you may often want to declare
> that different classes adhere to the same set of commitments (i.e.
> provide the same interface) even if they have different implementations.
> (A common situation where the latter occurs is when the implementations
> are written by different people.)

Nevertheless, these are exceptions to the general rule.

> > Agreeing on an ontology seems the hardest part to me.
> Indeed.  One of the advantages of separating inheritance and subtyping
> is that this can give you a bit more flexibility in setting up the
> ontology, which may make it easier to settle on something good.

Really?  Given that there are no inheritance relationships between the
existing built-in types, I would think that you could define an
ontology consisting entirely of abstract types, and then graft the
concrete types on it.  I don't see what having separate interfaces
would buy you.  But perhaps you can give an example that shows your
point?

--Guido van Rossum (home page:

From  Mon Aug 26 15:55:15 2002
From: (Raymond Hettinger)
Date: Mon, 26 Aug 2002 10:55:15 -0400
Subject: [Python-Dev] Re: heapq method names
References: <>
Message-ID: <002901c24d10$97144a20$0d61accf@othello>

From: "Tim Peters" <>
> If you want simpler names, I'm finding this little module quite pleasant to
> use:
> """
> import heapq
> class Heap(list):
>     def __init__(self, iterable=[]):
>         self.extend(iterable)
>     push    = heapq.heappush
>     popmin  = heapq.heappop
>     replace = heapq.heapreplace
>     heapify = heapq.heapify
> """

And perhaps:
   def __iter__(self):
       while True:
            yield self.popmin()

Raymond Hettinger

From Samuele Pedroni" <  Mon Aug 26 16:14:11 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Mon, 26 Aug 2002 17:14:11 +0200
Subject: [Python-Dev] type categories
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>  <>
Message-ID: <016b01c24d13$3bc6d900$6d94fea9@newmexico>

> [Ping]
> > If i may hazard a guess, i'd imagine that Jim's answer would simply be
> > that inheritance (of implementation) doesn't imply subtyping, and
> > subtyping doesn't imply inheritance.
> Well, yes, of course.  But I strongly believe that in *most* cases,
> inheritance and subtyping go hand in hand.  I'd rather invent a
> mechanism to deal with the exceptions rather than invent two parallel
> mechanisms that must both be deployed separately to get the full
> benefit out of them.

One exception being to be able to declare conformance to an interface
after-the-fact in some sweet way.

> > > Agreeing on an ontology seems the hardest part to me.
> >
> > Indeed.  One of the advantages of separating inheritance and subtyping
> > is that this can give you a bit more flexibility in setting up the
> > ontology, which may make it easier to settle on something good.
> Really?  Given that there are no inheritance relationships between the
> existing built-in types, I would think that you could define an
> ontology consisting entirely of abstract types, and then graft the
> concrete types on it.  I don't see what having separate interfaces
> would buy you.  But perhaps you can give an example that shows your
> point?

E.g.
my ideas of declaring partial conformance and of super-interfaces identified as
a base-interface plus a subset of signatures do not fit so well in a
just-abstract-classes model. But OTOH I insist, IMO, given how python code is
written now, they would be handy although complex.


From  Mon Aug 26 16:29:55 2002
From: (Barry A. Warsaw)
Date: Mon, 26 Aug 2002 11:29:55 -0400
Subject: [Python-Dev] type categories
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>
Message-ID: <>

>>>>> "SP" == Samuele Pedroni <> writes:

    SP> One exception being to be able to declare conformance to an
    SP> interface after-the-fact in some sweet way.

This is a very important use case, IMO.

I'm leery of trying to weave some interface taxonomy into the standard
library and types without having a lot of experience in using this for
real world applications.  Even then, it's possible <wink> that there
will be a lot of disagreement on the shape of the type hierarchy.

So one strategy would be to not classify the existing types and
classes ahead of time, but to provide a way for an application to
declare conformance to existing types in a way that makes sense for
the application (or library).  The downside of this is that it may
lead to a raft of incompatible interface declarations, but I also
think that eventually we'd see convergence as we gain more experience.

My guess would be that of all the interfaces that get defined and used
in the Python community, only a few that are commonly agreed on or
become ubiquitous idioms will migrate into the core.  I don't think we
need to solve this "problem" for the core types right away.  Let's
start by providing mechanism and not policy.


From  Mon Aug 26 16:31:34 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 11:31:34 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Mon, 26 Aug 2002 17:14:11 +0200."
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <>
Message-ID: <>

> [GvR]
> > [Ping]
> > > If i may hazard a guess, i'd imagine that Jim's answer would simply be
> > > that inheritance (of implementation) doesn't imply subtyping, and
> > > subtyping doesn't imply inheritance.
> >
> > Well, yes, of course.  But I strongly believe that in *most* cases,
> > inheritance and subtyping go hand in hand.  I'd rather invent a
> > mechanism to deal with the exceptions rather than invent two parallel
> > mechanisms that must both be deployed separately to get the full
> > benefit out of them.

> One exception being to be able to declare conformance to an interface
> after-the-fact in some sweet way.

I've heard of people who add mix-in base classes after the fact by
using assignment to __bases__.  (This is currently not supported by
new-style classes, but it's on my list of things to fix.)

If that's not acceptable (it certainly looks questionable to me :-), I
guess a separate registry may have to be created; ditto for deviations
in the other direction (implementation inheritance without interface

> E.g.
> my ideas of declaring partial conformance and of super-interfaces
> identified as a base-interface plus a subset of signatures do not
> fit so well in a just-abstract-classes model. But OTOH I insist,
> IMO, given how python code is written now, they would be handy
> although complex.

Yes, I'll have to think about that idea some more.  It's appealing
because it matches current Pythonic practice better than anything
else.

OTOH I want a solution that can be verified at compile time.

--Guido van Rossum (home page:

From  Mon Aug 26 16:41:11 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 11:41:11 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Mon, 26 Aug 2002 11:29:55 EDT."
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <016b01c24d13$3bc6d900$6d94fea9@newmexico>
Message-ID: <>

> I'm leery of trying to weave some interface taxonomy into the standard
> library and types without having a lot of experience in using this for
> real world applications.  Even then, it's possible <wink> that there
> will be a lot of disagreement on the shape of the type hierarchy.

That's what I said when I predicted it would be hard to come up with
an ontology. :-)

> So one strategy would be to not classify the existing types and
> classes ahead of time, but to provide a way for an application to
> declare conformance to existing types in a way that makes sense for
> the application (or library).  The downside of this is that it may
> lead to a raft of incompatible interface declarations, but I also
> think that eventually we'd see convergence as we gain more experience.

This is what Zope does.  One problem is that it has its own notion of
what makes a "sequence", a "mapping", etc. that don't always match the
Pythonic convention.

> My guess would be that of all the interfaces that get defined and used
> in the Python community, only a few that are commonly agreed on or
> become ubiquitous idioms will migrate into the core.  I don't think we
> need to solve this "problem" for the core types right away.  Let's
> start by providing mechanism and not policy.

Sure.  But does that mean the mechanism needs to be necessarily
separate from the inheritance mechanism?

--Guido van Rossum (home page:

From  Mon Aug 26 16:42:25 2002
From: (Amici Alessandro)
Date: Mon, 26 Aug 2002 17:42:25 +0200
Subject: [Python-Dev] Large file support for the mmap module?
Message-ID: <A183DF60AC72D5119B990002A5749CB302604A14@ROMADG-MAIL01>


while looking for efficient ways to manipulate large files (>2Gb) with
python i noted an artificial limitation in the mmap module present in the
standard library. right now mmap objects behave like a hybrid between a
file and a string, but their size is limited to 2Gb files on 32bit
architectures (the offset argument in the mmap call is always set to 0 and
several members of the structure have type size_t).

adding a rough implementation for 64bit offset in the mmap call is trivial
(i have done it, cutting and pasting from fileobject.c), but it is not
obvious how the file-like soul of the mmap object should be affected by the
offset. actually, it is not clear to me why the file-like behavior is
present at all.

is there any plan to add LFS to the mmap module?
are there any known workarounds?
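
A sketch of the windowed approach the question points toward, assuming a
later Python release in which the mmap constructor accepts an offset
parameter.  The helper name read_window and the demonstration file are
hypothetical.  Only a small window is mapped, so the size of the mapping no
longer has to match the size of the file:

```python
import mmap
import os
import tempfile

# Offsets passed to mmap must be multiples of this platform constant.
GRANULARITY = mmap.ALLOCATIONGRANULARITY


def read_window(path, offset, length):
    """Return `length` bytes starting at `offset`, mapping only a window."""
    # Round the offset down to the nearest permitted boundary.
    aligned = (offset // GRANULARITY) * GRANULARITY
    delta = offset - aligned
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            return window[delta:delta + length]


# Small demonstration file: padding followed by a marker.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * GRANULARITY + 5) + b"marker")
    demo_path = tmp.name

data = read_window(demo_path, 2 * GRANULARITY + 5, 6)
os.remove(demo_path)
print(data)
```

Indexes into the mapped window are relative to the aligned offset, which is
one possible answer to how the file-like side should treat the offset.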


From  Mon Aug 26 16:42:33 2002
From: (Michael McLay)
Date: Mon, 26 Aug 2002 11:42:33 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <>
Message-ID: <>

On Monday 26 August 2002 10:45 am, Guido van Rossum wrote:

> > Indeed.  One of the advantages of separating inheritance and subtyping
> > is that this can give you a bit more flexibility in setting up the
> > ontology, which may make it easier to settle on something good.
> Really?  Given that there are no inheritance relationships between the
> existing built-in types, I would think that you could define an
> ontology consisting entirely of abstract types, and then graft the
> concrete types on it.  I don't see what having separate interfaces
> would buy you.  But perhaps you can give an example that shows your
> point?

Several posts have expressed a need to support multiple ontologies for a given 
set of classes. This doesn't preclude using the ontology that is defined by 
the class hierarchy as one method for defining an ontology. That could be the 
default ontology. What is missing is the ability to also place a class into 
an alternate ontology that may be specific to an application.  The problem 
could be solved if applications had the ability to add and delete references 
to the type interface definition that apply to a class.


Perhaps the interface definitions should also be able to add themselves to 
class definitions. That way common interface patterns that apply to standard 
libraries could be defined in the standard library. This would eliminate the 
repeated addition of interfaces to classes in each application.

Interface I:


Removing an interface from a class might not be possible, or it may require a 
second class implementation to be created at compile time, because usage of 
that interface may be required in some other module. I suspect having two 
implementations of the same class might be somewhat confusing to the user. 
Perhaps removal of a required interface would trigger an exception.

From  Mon Aug 26 16:50:36 2002
From: (Barry A. Warsaw)
Date: Mon, 26 Aug 2002 11:50:36 -0400
Subject: [Python-Dev] type categories
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>
Message-ID: <>

>>>>> "GvR" == Guido van Rossum <> writes:

    >> I'm leery of trying to weave some interface taxonomy into the
    >> standard library and types without having a lot of experience
    >> in using this for real world applications.  Even then, it's
    >> possible <wink> that there will be a lot of disagreement on the
    >> shape of the type hierarchy.

    GvR> That's what I said when I predicted it would be hard to come
    GvR> up with an ontology. :-)

Actually, it'll be easy, that's why we'll do it a hundred times. :)

    >> So one strategy would be to not classify the existing types and
    >> classes ahead of time, but to provide a way for an application
    >> to declare conformance to existing types in a way that makes
    >> sense for the application (or library).  The downside of this
    >> is that it may lead to a raft of incompatible interface
    >> declarations, but I also think that eventually we'd see
    >> convergence as we gain more experience.

    GvR> This is what Zope does.  One problem is that it has its own
    GvR> notion of what makes a "sequence", a "mapping", etc. that
    GvR> don't always match the Pythonic convention.

Yep, that's a problem.  One approach might be to provide some blessed
or common interfaces in a module, but don't weave them into the
types.  OTOH, I suspect that big apps and frameworks like Zope may
want their own notion anyway, and hopefully it'll be fairly easy for
components that want to play with Zope to add the proper interface
conformance assertions.

    >> My guess would be that of all the interfaces that get defined
    >> and used in the Python community, only a few that are commonly
    >> agreed on or become ubiquitous idioms will migrate into the
    >> core.  I don't think we need to solve this "problem" for the
    >> core types right away.  Let's start by providing mechanism and
    >> not policy.

    GvR> Sure.  But does that mean the mechanism needs to be
    GvR> necessarily separate from the inheritance mechanism?

It definitely means that there has to be a way to separate them that
is largely transparent to all the code that checks, uses, asserts,
etc. interfaces.  IOW, if we allow all of inheritance, __implements__,
and a registry to assert conformance to an interface, the built-in
conformsto() -- or whatever we call it -- has to know about all these
accepted variants and should return True for any match.
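
A rough sketch of what such a conformsto() might look like, assuming the
three mechanisms named above.  Everything here is hypothetical: the
_registry dict, the declare() helper, and the example classes are invented
for illustration, and __implements__ is treated simply as a tuple of
interfaces on the class.

```python
_registry = {}   # interface -> set of classes registered after the fact


def declare(cls, interface):
    """Record, from outside the class, that cls conforms to interface."""
    _registry.setdefault(interface, set()).add(cls)


def conformsto(obj, interface):
    """Return True if any of the three accepted mechanisms matches."""
    cls = type(obj)
    # 1. Plain inheritance.
    if isinstance(obj, interface):
        return True
    # 2. An explicit __implements__ declaration on the class.
    if interface in getattr(cls, "__implements__", ()):
        return True
    # 3. A registry entry for the class or any of its bases.
    registered = _registry.get(interface, ())
    return any(base in registered for base in cls.__mro__)


class Sequence:            # hypothetical abstract interface
    pass

class Stack(Sequence):     # conforms by inheritance
    pass

class Queue:               # conforms by declaration in the class body
    __implements__ = (Sequence,)

class Ring:                # conforms only through the registry
    pass

declare(Ring, Sequence)
print(conformsto(Stack(), Sequence), conformsto(Queue(), Sequence),
      conformsto(Ring(), Sequence), conformsto(object(), Sequence))
```

Code that checks conformance calls only conformsto(), so it does not need to
know which of the three routes a given class took.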


From  Mon Aug 26 16:57:31 2002
From: (Andrew Koenig)
Date: 26 Aug 2002 11:57:31 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy>
Message-ID: <>

Guido> Well, yes, of course.  But I strongly believe that in *most* cases,
Guido> inheritance and subtyping go hand in hand.  I'd rather invent a
Guido> mechanism to deal with the exceptions rather than invent two parallel
Guido> mechanisms that must both be deployed separately to get the full
Guido> benefit out of them.

I think there are (at least) three important kinds of problems, not just two.

        1) You want to define a new class that inherits from a class
           that already exists, and you intend your class to be in all
           of the categories of its base class(es).  This case is
           probably the most common, so whatever mechanism we adopt
           should make it easy to write.

        2) You want to define a new class that inherits from a class
           that already exists, but you do *not* want your class to be
           in all of the categories of its base classes.  Perhaps you
           are inheriting for implementation purposes only.  I think
           we all agree that this case is relatively rare--but it is
           not completely unheard of, so there should be a way of
           coping with it.

        3) You want to define a new type category, and assert that
           some classes that already exist are members of this category.
           You do not want to modify the definitions of these classes
           in order to do so.  Defining new classes that inherit from
           the existing ones does not solve the problem, because you
           would then have to change code all over the place to make
           it use the new classes.

Here is an example of (3).  I might want to define a TotallyOrdered
category, together with a sort function that accepts only a container
with elements that are TotallyOrdered.  For example:

        def sort(x):
                if __debug__:
                    for i in x:
                        assert i in TotallyOrdered      # or whatever
                # continue with the sort here.

If someone uses my sort function to sort a container of objects that are
not of built-in types, I don't mind imposing the requirement on the
user to affirm that those types do indeed meet the TotallyOrdered
requirement.  What I do not want to do, however, is require that the
person making the claim is also the author of the class about which
the claim is being made.
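
One way the sort example could be made to run, as a sketch only.  The
Category class, its register() method, and the Version class are
hypothetical names invented here; the point is just that the claim
"Version is TotallyOrdered" is made by the user of sort, outside the
definition of Version.

```python
class Category:
    """A type category whose members are declared from outside."""

    def __init__(self, name, members=()):
        self.name = name
        self._members = set(members)

    def register(self, cls):
        # The caller, not the class author, makes the claim.
        self._members.add(cls)

    def __contains__(self, obj):
        # Supports the `assert i in TotallyOrdered` spelling.
        return any(isinstance(obj, cls) for cls in self._members)


TotallyOrdered = Category("TotallyOrdered", members=(int, str))


def sort(x):
    if __debug__:
        for i in x:
            assert i in TotallyOrdered
    return sorted(x)


class Version:
    """Stand-in for a third-party class that cannot be edited."""
    def __init__(self, n):
        self.n = n

    def __lt__(self, other):
        return self.n < other.n


TotallyOrdered.register(Version)   # claim made without touching Version
result = [v.n for v in sort([Version(3), Version(1), Version(2)])]
print(result)
```

Nothing in this sketch verifies the claim; register() only records that
someone was willing to make it.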

Incidentally, it just occurred to me that if we regard categories
as claims about types (or, if you like, predicate functions with type
arguments), then it makes sense to include (Cartesian) product types.

What I mean by this is that the TotallyOrdered category is really a
category of pairs of types.  Note that the comparison operators generally
work on arguments of different type, so to make a really appropriate
claim about total ordering, I really need a way to say not just that
a single type (such as int) is totally ordered, but that all combinations
of types from a particular set are totally ordered (or not -- recall that
on most implementations it is possible to find three numbers x, y, z
such that total ordering fails, as long as you mix int and float
with sufficiently evil intent).

Andrew Koenig,,

From Samuele Pedroni" <  Mon Aug 26 16:59:59 2002
From: Samuele Pedroni" < (Samuele Pedroni)
Date: Mon, 26 Aug 2002 17:59:59 +0200
Subject: [Python-Dev] type categories
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <>             <016b01c24d13$3bc6d900$6d94fea9@newmexico>  <>
Message-ID: <01e301c24d19$a1aa2a00$6d94fea9@newmexico>

> [me]
> > E.g.
> > my ideas of declaring partial conformance and of super-interfaces
> > identified as a base-interface plus a subset of signatures do not
> > fit so well in a just-abstract-classes model. But OTOH I insist,
> > IMO, given how python code is written now, they would be handy
> > although complex.
> Yes, I'll have to think about that idea some more.  It's appealing
> because it matches current Pythonic practice better than anything
> else.

Thanks, I was under the impression nobody cared. In the end you could discard
the notion, the semantics are maybe too complex, but I think it is really worth
some thinking.

> OTOH I want a solution that can be verified at compile time.

Here I don't get what you are referring to. I have indicated some possible
sloppy interpretations but just in order to care for transitioning code. But
under the precise interpretation they are checkable
(maybe it is costly and complex to do so and that's your point?):

class Source:
  def read(self):
   ...

 # other methods

e.g. could declare to implement partially FileLike (that means
the matching subset of signatures), or be very precise
and declare that it implements   FileLike{read}

and FileLike{read} given FileLike has a very precise
interpretation even at compile-time.
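
A small sketch of how FileLike{read} could be given a precise reading.  It
is a runtime check over class objects rather than a compile-time one, it
compares method names only (not argument lists), and the FileLike methods
and the helper names signatures() and implements() are assumptions made for
illustration.

```python
class FileLike:
    """Hypothetical base interface: only the method names matter here."""
    def read(self): ...
    def write(self, data): ...
    def close(self): ...


def signatures(interface):
    """Names of the public methods an interface declares."""
    return {name for name, value in vars(interface).items()
            if callable(value) and not name.startswith("_")}


def implements(cls, interface, subset=None):
    """Check cls against interface, or against interface{subset}."""
    declared = signatures(interface)
    wanted = declared if subset is None else set(subset)
    if not wanted <= declared:
        raise ValueError("subset names methods the interface lacks")
    return all(callable(getattr(cls, name, None)) for name in wanted)


class Source:
    def read(self):
        return ""


print(implements(Source, FileLike, {"read"}))   # FileLike{read}
print(implements(Source, FileLike))             # full FileLike
```

Given FileLike, the subset {read} is fully determined, so the check needs
no information beyond the two class definitions.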


From  Mon Aug 26 17:05:26 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 12:05:26 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Mon, 26 Aug 2002 17:59:59 +0200."
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <016b01c24d13$3bc6d900$6d94fea9@newmexico> <>
Message-ID: <>

> > OTOH I want a solution that can be verified at compile time.
> Here I don't get what you are referring to.

Not specifically to your proposal.

> I have indicated some possible
> sloppy interpretations but just in order to care for transitioning code. But
> under the precise interpretation they are checkable
> (maybe it is costly and complex to do so and that's your point?):
> class Source:
>   def read(self):
>    ...
>  # other methods
> e.g. could declare to implement partially FileLike (that means
> the matching subset of signatures), or be very precise
> and declare that it implements   FileLike{read}
> and FileLike{read} given FileLike has a very precise
> interpretation even at compile-time.

That's great.

--Guido van Rossum (home page:

From  Mon Aug 26 17:47:35 2002
From: (Tim Peters)
Date: Mon, 26 Aug 2002 12:47:35 -0400
Subject: [Python-Dev] Re: heapq method names
In-Reply-To: <002901c24d10$97144a20$0d61accf@othello>
Message-ID: <>

[Raymond Hettinger, on
 import heapq

 class Heap(list):
     def __init__(self, iterable=[]):
         self.extend(iterable)
     push    = heapq.heappush
     popmin  = heapq.heappop
     replace = heapq.heapreplace
     heapify = heapq.heapify

> And perhaps:
>    def __iter__(self):
>        while True:
>             yield self.popmin()

If we were trying to hide the list nature, yes, but I don't think so if the
intent is that a heapq heap *is* a list with a few extra methods.  For
example, I know I've already done


at least once for a Heap of this type, with the list meaning in mind, and it
would have been at best irritating if that had emptied the heap as a side
effect.  Renaming __iter__ to heapiter would be cool, though!
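
A sketch of the heapiter variant for a present-day CPython, with three
deliberate departures that are assumptions of this sketch.  The heapq
functions are now implemented in C and do not bind as methods when assigned
in a class body, so explicit wrappers are used.  The constructor heapifies
its input.  The loop stops when the heap is empty instead of running
`while True`, which would end with an IndexError.

```python
import heapq


class Heap(list):
    def __init__(self, iterable=()):
        super().__init__(iterable)
        heapq.heapify(self)

    def push(self, item):
        heapq.heappush(self, item)

    def popmin(self):
        return heapq.heappop(self)

    def heapiter(self):
        """Yield items smallest-first, emptying the heap as it goes."""
        while self:
            yield self.popmin()


h = Heap([5, 1, 4])
h.push(2)
as_list = list(h)              # ordinary list iteration; nothing is removed
drained = list(h.heapiter())   # destructive, sorted order
print(as_list, drained, h)
```

Plain iteration keeps its list meaning, and the destructive traversal has to
be asked for by name.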

From  Mon Aug 26 19:40:52 2002
From: (Oren Tirosh)
Date: Mon, 26 Aug 2002 14:40:52 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <016b01c24d13$3bc6d900$6d94fea9@newmexico> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 11:29:55AM -0400, Barry A. Warsaw wrote:
> I'm leery of trying to weave some interface taxonomy into the standard
> library and types without having a lot of experience in using this for
> real world applications.  Even then, it's possible <wink> that there
> will be a lot of disagreement on the shape of the type hierarchy.
> So one strategy would be to not classify the existing types and
> classes ahead of time, but to provide a way for an application to
> declare conformance to existing types in a way that makes sense for
> the application (or library).  The downside of this is that it may
> lead to a raft of incompatible interface declarations, but I also
> think that eventually we'd see convergence as we gain more experience.
> My guess would be that of all the interfaces that get defined and used
> in the Python community, only a few that are commonly agreed on or
> become ubiquitous idioms will migrate into the core.  I don't think we
> need to solve this "problem" for the core types right away.  Let's
> start by providing mechanism and not policy.

+1 for a non-creationist approach to type categories.


From  Mon Aug 26 19:45:22 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Mon, 26 Aug 2002 20:45:22 +0200
Subject: [Python-Dev] To commit or not to commit
Message-ID: <>

I'm ready to commit the PEP293 implementation. I've added
LaTeX documentation in libcodecs.tex and libexcs.tex.

There are only a few minor open issues (reflecting exception
attribute modifications in args, PyString_DecodeEscape), but
I guess we'll fix/document those in time.

Any objections against committing the patch?

    Walter Dörwald

From  Mon Aug 26 19:47:18 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 14:47:18 -0400
Subject: [Python-Dev] To commit or not to commit
In-Reply-To: Your message of "Mon, 26 Aug 2002 20:45:22 +0200."
References: <>
Message-ID: <>

> I'm ready to commit the PEP293 implementation. I've added
> LaTeX documentation in libcodecs.tex and libexcs.tex.
> There are only a few minor open issues (reflecting exception
> attribute modifications in args, PyString_DecodeEscape), but
> I guess we'll fix/document those in time.
> Any objections against committing the patch?

What do MvL and MAL say?

--Guido van Rossum (home page:

From  Mon Aug 26 20:03:22 2002
From: (Oren Tirosh)
Date: Mon, 26 Aug 2002 15:03:22 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 11:57:31AM -0400, Andrew Koenig wrote:
> Incidentally, it just occurred to me that if we regard categories
> as claims about types (or, if you like, predicate functions with type
> arguments), then it makes sense to include (Cartesian) product types.

Would such a product type be anything more than a predicate about 
tuples?  Something like the (T1, T2, T3) case in Guido's static typing
presentation[1] where T1, T2 and T3 are type predicates rather than just 

[1] )


From  Mon Aug 26 20:08:26 2002
From: (=?ISO-8859-15?Q?Walter_D=F6rwald?=)
Date: Mon, 26 Aug 2002 21:08:26 +0200
Subject: [Python-Dev] To commit or not to commit
References: <> <>
Message-ID: <>

Guido van Rossum wrote:

>>I'm ready to commit the PEP293 implementation. I've added
>>LaTeX documentation in libcodecs.tex and libexcs.tex.
>>There are only a few minor open issues (reflecting exception
>>attribute modifications in args, PyString_DecodeEscape), but
>>I guess we'll fix/document those in time.
>>Any objections against committing the patch?
> What do MvL and MAL say?

AFAIK they're both on vacation.

    Walter Dörwald

From  Mon Aug 26 20:05:47 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 15:05:47 -0400
Subject: [Python-Dev] To commit or not to commit
In-Reply-To: Your message of "Mon, 26 Aug 2002 21:08:26 +0200."
References: <> <>
Message-ID: <>

> > What do MvL and MAL say?
> AFAIK they're both on vacation.

Then wait until they're back.

--Guido van Rossum (home page:

From  Mon Aug 26 20:10:59 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 15:10:59 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Oren Tirosh on
 Mon, 26 Aug 2002 15:03:22 -0400)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <>
Message-ID: <>

Oren> On Mon, Aug 26, 2002 at 11:57:31AM -0400, Andrew Koenig wrote:
>> Incidentally, it just occurred to me that if we regard categories
>> as claims about types (or, if you like, predicate functions with type
>> arguments), then it makes sense to include (Cartesian) product types.

Oren> Would such a product type be anything more than a
Oren> predicate about tuples?

No, I don't think it would.

Indeed, ML completely unifies Cartesian product types and tuples in a
very, very cool way:

      Every function takes exactly one argument
      and yields exactly one result.

      However, the argument or result can be a tuple.

So in ML, when I write

      f(x,y)

that really means to bundle x and y into a tuple, and call f with that
tuple as its argument.  So, for example, if I write

      val xy = (x,y)

which defines a variable named xy and binds it to the tuple (x,y), then

      f xy

means exactly the same thing as

      f(x,y)

The parentheses are really tuple constructors, and ML doesn't require
parentheses for function calls at all.

However, if you're going to define predicates over tuples of (Python)
types, then you had better not try to define those predicates as part
of the tuples' class definitions, because they don't have one.

From  Mon Aug 26 20:26:34 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 15:26:34 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Mon, 26 Aug 2002 15:10:59 EDT."
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <>
Message-ID: <>

> Indeed, ML completely unifies Cartesian product types and tuples in a
> very, very cool way:
>       Every function takes exactly one argument
>       and yields exactly one result.
>       However, the argument or result can be a tuple.
> So in ML, when I write
>       f(x,y)
> that really means to bundle x and y into a tuple, and call f with that
> tuple as its argument.  So, for example, if I write
>       val xy = (x,y)
> which defines a variable named xy and binds it to the tuple (x,y), then
>       f xy
> means exactly the same thing as
>       f(x,y)
> The parentheses are really tuple constructors, and ML doesn't require
> parentheses for function calls at all.

ABC did this, and very early Python did this, too (but Python always
required parentheses for calls).  However, adding optional arguments
caused trouble: after

  def f(a, b=1):
      print a*b

  t = (1, 2)

what should

  f(t)

mean?  It could mean either f((1, 2), 1) or f(1, 2).  So we had to get
rid of that.  I suppose ML doesn't have optional arguments (in the
sense of Python), so the problem doesn't occur there; that's why it
wasn't a problem in ABC.
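
For comparison, a short sketch of how the two readings are spelled in
present-day Python, where a tuple passed as f(t) is always a single
argument and unpacking has to be requested with a star.  The function
returns its result instead of printing it so that the values can be
inspected.

```python
def f(a, b=1):
    return a * b


t = (1, 2)

whole = f(t)      # t is the single argument a, and b defaults to 1
spread = f(*t)    # t is unpacked into a=1, b=2

print(whole)
print(spread)
```

The first call corresponds to f((1, 2), 1) and the second to f(1, 2).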

--Guido van Rossum (home page:

From  Mon Aug 26 20:36:56 2002
From: (Oren Tirosh)
Date: Mon, 26 Aug 2002 15:36:56 -0400
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 03:10:59PM -0400, Andrew Koenig wrote:
> Oren> Would such a product type be anything more than a
> Oren> predicate about tuples?
> No, I don't think it would.
(explanation about ML functions deleted)

Can you give a more concrete example of what could a cartesian product 
of type predicates actually stand for in Python?

> However, if you're going to define predicates over tuples of (Python)
> types, then you had better not try to define those predicates as part
> of the tuples' class definitions, because they don't have one.


Andrew, in reply to your "scribble in the margin" question about two 
weeks ago see


From  Mon Aug 26 20:46:46 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 15:46:46 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <>
 (message from Guido van Rossum on Mon, 26 Aug 2002 15:26:34 -0400)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <>
 <> <>
Message-ID: <>

Guido> ABC did this, and very early Python did this, too (but Python always
Guido> required parentheses for calls).  However, adding optional arguments
Guido> caused trouble: after

Guido>   def f(a, b=1):
Guido>       print a*b

Guido>   t = (1, 2)

Guido> what should

Guido>   f(t)

Guido> mean?  It could mean either f((1, 2), 1) or f(1, 2).  So we had to get
Guido> rid of that.  I suppose ML doesn't have optional arguments (in the
Guido> sense of Python), so the problem doesn't occur there; that's why it
Guido> wasn't a problem in ABC.

Right -- ML doesn't have optional arguments.
It does, however, have clausal definitions, which can serve a
similar purpose:

	fun f[a, b] = a*b
	  | f[a] = a

Here, the square brackets denote lists, much as they do in Python.
So you can call this function with a list that has one or two elements.
The list's elements must be integers, because if you don't say what
type the operands of * are, it assumes int.  If you were to call this
function with a list with other than one or two elements, it would
raise an exception.

You can't do the analogous thing with tuples in ML:

       fun f(a, b) = a*b
         | f(a) = a

for a rather surprising reason:  The ML type inference mechanism sees
from the first clause (f(a, b) = a*b) that the argument to f must
be a 2-element tuple, which means that in the *second* clause,
`a' must also be a 2-element tuple.  Otherwise the argument of f
would not have a single, well-defined type.

But if `a' is a 2-element tuple, that means that the type of the
result of f is also a 2-element tuple.  That type is inconsistent with
the type of a*b, which is int.

So the compiler will complain about this definition because the
function f cannot return both an int and a tuple at the same time.

If we were to define it this way:

       fun f(a, b) = a*b
         | f(a) = 42

the compiler would now accept it.  However, it would give a warning
that the second clause is irrelevant, because there is no argument you
can possibly give to f that would cause the second clause to match
without first causing the first clause to match.
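
Python has no clausal definitions, but a loose analog of the list version
can be sketched by dispatching on the length of the argument.  This is an
illustration only: Python does no type inference here, so nothing forces
the elements to be integers, and the choice of ValueError for the
no-matching-clause case is an assumption of the sketch.

```python
def f(args):
    # Plays the role of the clause  f[a, b] = a*b
    if len(args) == 2:
        a, b = args
        return a * b
    # Plays the role of the clause  f[a] = a
    if len(args) == 1:
        (a,) = args
        return a
    # No clause matches: the ML version would raise an exception here too.
    raise ValueError("f expects a list of one or two elements")


print(f([3, 4]))
print(f([7]))
```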

From  Mon Aug 26 20:51:13 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 15:51:13 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Oren Tirosh on
 Mon, 26 Aug 2002 15:36:56 -0400)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <>
Message-ID: <>

Oren> Can you give a more concrete example of what could a cartesian
Oren> product of type predicates actually stand for in Python?

Consider my TotallyOrdered suggestion from before.  I would like to
have a way of saying that for any two types T1 and T2 (where T1 might
equal T2) chosen from the set {int, long, float}, < imposes a total
ordering on values of those types.

Come to think of it, that's not really a Cartesian product.  Rather,
it's a claim about the members of the set union(int,union(long, float)).

From  Mon Aug 26 21:25:16 2002
From: (Oren Tirosh)
Date: Mon, 26 Aug 2002 23:25:16 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Mon, Aug 26, 2002 at 03:51:13PM -0400
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 03:51:13PM -0400, Andrew Koenig wrote:
> Oren> Can you give a more concrete example of what could a cartesian
> Oren> product of type predicates actually stand for in Python?
> Consider my TotallyOrdered suggestion from before.  I would like to
> have a way of saying that for any two types T1 and T2 (where T1 might
> equal T2) chosen from the set {int, long, float}, < imposes a total
> ordering on values of those types.
> Come to think of it, that's not really a Cartesian product.  Rather,
> it's a claim about the members of the set union(int,union(long, float)).

Isn't it easier to just spell it union(int, long, float)?

Your example helped me make the distinction between two very different
types of type categories:

1. Type categories based on form: presence of methods, call signatures, etc.
2. Type categories based on semantics.

Semantic categories only live within a single form category. A method call
cannot possibly be semantically correct if it isn't well-formed: it will
cause a runtime error. But a method call that is well-formed may or may not
be semantically correct.

A language *can* verify well-formedness. It cannot verify semantical 
correctness but it can provide tools to help developers communicate their
semantic expectations.

Form-based categories may be used to convey semantic categories: just add
a dummy method or member to serve as a marker. It can force an interface
with an otherwise identical form to be intentionally incompatible to help 
you detect semantic categorization errors.

The opposite is not true: semantic categories cannot be used to enforce 
well-formedness. You can mark a class as implementing the "TotallyOrdered"
interface when it doesn't even have a comparison method. 

A similar case can happen when using inheritance for categorization: a 
subclass may modify the call signatures, making the class form-incompatible 
but it still retains its ancestry which may be interpreted in some cases as 
a marker of a semantic category.

From  Mon Aug 26 21:28:58 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 16:28:58 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Oren Tirosh on
 Mon, 26 Aug 2002 23:25:16 +0300)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <> <> <>
Message-ID: <>

From  Mon Aug 26 21:33:53 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 16:33:53 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Oren Tirosh on
 Mon, 26 Aug 2002 23:25:16 +0300)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <> <> <>
Message-ID: <>

Oren> On Mon, Aug 26, 2002 at 03:51:13PM -0400, Andrew Koenig wrote:
Oren> Can you give a more concrete example of what could a cartesian
Oren> product of type predicates actually stand for in Python?

>> Consider my TotallyOrdered suggestion from before.  I would like to
>> have a way of saying that for any two types T1 and T2 (where T1
>> might equal T2) chosen from the set {int, long, float}, < imposes a
>> total ordering on values of those types.

>> Come to think of it, that's not really a Cartesian product.
>> Rather, it's a claim about the members of the set
>> union(int,union(long, float)).

Oren> Isn't it easier to just spell it union(int, long, float)?

Yes but I have a cold today so I'm not thinking clearly.

Oren> Your example helped me make the distinction between two very
Oren> types of type categories:

Oren> 1. Type categories based on form: presence of methods, call signatures, etc.
Oren> 2. Type categories based on semantics.

Oren> Semantic categories only live within a single form category. A
Oren> method call cannot possibly be semantically correct if it isn't
Oren> well-formed: it will cause a runtime error. But a method call
Oren> that is well-formed may or may not be semantically correct.


Oren> A language *can* verify well-formedness. It cannot verify
Oren> semantical correctness but it can provide tools to help
Oren> developers communicate their semantic expectations.


Oren> Form-based categories may be used to convey semantic categories:
Oren> just add a dummy method or member to serve as a marker. It can
Oren> force an interface with an otherwise identical form to be
Oren> intentionally incompatible to help you detect semantic
Oren> categorization errors.

Remember that one thing I consider important is the ability to claim
that classes written by others belong to a category defined by me.  I do not
want to have to modify those classes in order to do so.

So, for example, if I want to say that int is TotallyOrdered, I do not
want to have to modify the definition of int to do so.

Oren> The opposite is not true: semantic categories cannot be used to
Oren> enforce well-formedness. You can mark a class as implementing
Oren> the "TotallyOrdered" interface when it doesn't even have a
Oren> comparison method.

Yes.  But semantic categories are useful anyway.

Oren> A similar case can happen when using inheritance for
Oren> categorization: a subclass may modify the call signatures,
Oren> making the class form-incompatible but it still retains its
Oren> ancestry which may be interpreted in some cases as a marker of a
Oren> semantic category.
Right.  And several people have noted that it can be desirable for
subclasses sometimes not to be members of all of their base classes'
categories.

From  Mon Aug 26 22:13:02 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 00:13:02 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Mon, Aug 26, 2002 at 04:33:53PM -0400
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 04:33:53PM -0400, Andrew Koenig wrote:
> Oren> Form-based categories may be used to convey semantic categories:
> Oren> just add a dummy method or member to serve as a marker. It can
> Oren> force an interface with an otherwise identical form to be
> Oren> intentionally incompatible to help you detect semantic
> Oren> categorization errors.
> Remember that one thing I consider important is the ability to claim
> that classes written by others belong to a category defined by me.  I do not
> want to have to modify those classes in order to do so.

How about union(int, long, float, has_marker("TotallyOrdered")) ?

This basically means "I know that int, long and float are totally ordered
and I'm willing to take your word for it if you claim that your type is 
also totally ordered".

If the set of types that match a predicate is cached it should be at least 
as efficient as any other form of runtime interface checking.
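
A minimal sketch of how such a predicate might look in present-day Python.
The names `union` and `has_marker` are hypothetical helpers taken from the
expression above, not an existing API; the marker is assumed to be a class
attribute called `__categories__`, and `long` is left out because modern
Python folds it into `int`. Results are cached per type, as suggested in the
previous paragraph.

```python
def has_marker(name):
    """Predicate: the type lists `name` in a (hypothetical) __categories__ attribute."""
    def check(tp):
        return name in getattr(tp, "__categories__", ())
    return check


def union(*members):
    """Predicate matching any listed type, or any type accepted by a listed predicate."""
    cache = {}  # type -> bool, so each type is examined only once

    def matches(obj):
        tp = type(obj)
        if tp not in cache:
            cache[tp] = any(
                issubclass(tp, m) if isinstance(m, type) else m(tp)
                for m in members
            )
        return cache[tp]

    return matches


totally_ordered = union(int, float, has_marker("TotallyOrdered"))


class Version:
    # The author's own claim, taken on trust by the predicate.
    __categories__ = ("TotallyOrdered",)


print(totally_ordered(3), totally_ordered(Version()), totally_ordered("abc"))
```

Note that nothing here verifies that `Version` actually defines a comparison
method; the marker is only a claim.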

> Oren> The opposite is not true: semantic categories cannot be used to
> Oren> enforce well-formedness. You can mark a class as implementing
> Oren> the "TotallyOrdered" interface when it doesn't even have a
> Oren> comparison method.
> Yes.  But semantic categories are useful anyway.

Sure they are, but the fact that form-based categories can be used to define
semantic categories, but not the other way around, is a point in favor of
using form-based categories as the basic form for categories implemented by
the language.

Inheritance of implementation also inherits the form (methods and call
signatures). If you don't go out of your way to modify it a subclass will 
usually also be a subcategory so this should be pretty transparent most of 
the time.

Form-based categories are a tool for making claims about code: "under 
condition X the method Y should not raise NameError or TypeError". If you 
want, you can also use this tool to make semantic claims about your data types.
With compile-time type inference these claims can be upgraded to the level 
of formal proofs.


From  Mon Aug 26 22:17:28 2002
From: (Andrew Koenig)
Date: Mon, 26 Aug 2002 17:17:28 -0400 (EDT)
Subject: [Python-Dev] type categories
In-Reply-To: <> (message from Oren Tirosh on
 Tue, 27 Aug 2002 00:13:02 +0300)
References: <Pine.LNX.4.44.0208260051020.18425-100000@ziggy> <> <> <> <> <> <> <> <> <>
Message-ID: <>

>> Remember that one thing I consider important is the ability to claim
>> that classes written by others belong to a category defined by me.  I do not
>> want to have to modify those classes in order to do so.

Oren> How about union(int, long, float, has_marker("TotallyOrdered")) ?

How about it?  There is still the question of how to make such claims.

Oren> Inheritance of implementation also inherits the form (methods
Oren> and call signatures). If you don't go out of your way to modify
Oren> it a subclass will usually also be a subcategory so this should
Oren> be pretty transparent most of the time.

Right.  So how do you define a subclass that you do not want to be
a subcategory?

From  Mon Aug 26 22:51:51 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 00:51:51 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Mon, Aug 26, 2002 at 05:17:28PM -0400
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 05:17:28PM -0400, Andrew Koenig wrote:
> >> Remember that one thing I consider important is the ability to claim
> >> that classes written by others belong to a category defined by me.  I do not
> >> want to have to modify those classes in order to do so.
> Oren> How about union(int, long, float, has_marker("TotallyOrdered")) ?
> How about it?  There is still the question of how to make such claims.
> Oren> Inheritance of implementation also inherits the form (methods
> Oren> and call signatures). If you don't go out of your way to modify
> Oren> it a subclass will usually also be a subcategory so this should
> Oren> be pretty transparent most of the time.
> Right.  So how do you define a subclass that you do not want to be
> a subcategory?

If you make an incompatible change to a method call signature it will just
happen by itself. If that's what you meant - good. If that wasn't what you
meant this serves as a form of error checking.  

It gets harder if you want to remove a method or marker. The problem is 
that there is currently no way to mask inherited attributes. This will
require either a language extension that will allow you to del them or 
using some other convention for this purpose.


From  Mon Aug 26 22:56:22 2002
From: (François Pinard)
Date: 26 Aug 2002 17:56:22 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>


[François Pinard]

> Allow me some random thoughts.  [...]  Some people may think that these
> are all problems which are orthogonal to the design of a basic set feature,
> and which should be addressed in separate Python modules.

From the received comments, I wrote a simple module reading sequences and
generating lists, instead of reading and producing sets, and taking care
of generating cartesian products, power sets, combinations, arrangements
and permutations.  I took various ideas here and there, like from previously
published messages on the Python list, and made them look a bit alike.

The module could be called `cogen', abbreviation for COmbinatorial
GENerators.  Here is a first throw, to be criticised and improved.


#!/usr/bin/env python
# Copyright © 2002 Progiciels Bourbeau-Pinard inc.
# François Pinard <>, 2002-08.

"""Combinatorial generators.

All generators below have the property of yielding successive results
in sorted order, given that input sequences were already sorted.
"""

from __future__ import generators

def cartesian(*sequences):
    """Generate the `cartesian product' of all SEQUENCES.  Each member of the
    product is a list containing an element taken from each original sequence.
    """
    if len(sequences) == 0:
        yield []
    else:
        first, remainder = sequences[0], sequences[1:]
        for element in first:
            for result in cartesian(*remainder):
                result.insert(0, element)
                yield result

def subsets(sequence):
    """Generate all subsets of a given SEQUENCE.  Each subset is delivered
    as a list holding zero or more elements of the original sequence.
    """
    yield []
    if len(sequence) > 0:
        first, remainder = sequence[0], sequence[1:]
        # Some subsets retain FIRST.
        for result in subsets(remainder):
            result.insert(0, first)
            yield result
        # Some subsets do not retain FIRST.
        for result in subsets(remainder):
            if len(result) > 0:
                yield result

def combinations(sequence, number):
    """Generate all combinations of NUMBER elements from list SEQUENCE."""
    # Adapted from Python 2.2 `test/'.
    if number > len(sequence):
        return
    if number == 0:
        yield []
    else:
        first, remainder = sequence[0], sequence[1:]
        # Some combinations retain FIRST.
        for result in combinations(remainder, number-1):
            result.insert(0, first)
            yield result
        # Some combinations do not retain FIRST.
        for result in combinations(remainder, number):
            yield result

def arrangements(sequence, number):
    """Generate all arrangements of NUMBER elements from list SEQUENCE."""
    # Adapted from PERMUTATIONS below.
    if number > len(sequence):
        return
    if number == 0:
        yield []
    else:
        cut = 0
        for element in sequence:
            for result in arrangements(sequence[:cut] + sequence[cut+1:],
                                       number-1):
                result.insert(0, element)
                yield result
            cut += 1

def permutations(sequence):
    """Generate all permutations from list SEQUENCE."""
    # Adapted from Gerhard Häring <>, 2002-08-24.
    if len(sequence) == 0:
        yield []
    else:
        cut = 0
        for element in sequence:
            for result in permutations(sequence[:cut] + sequence[cut+1:]):
                result.insert(0, element)
                yield result
            cut += 1

def test():
    if True:
	print '\nTesting CARTESIAN.'
	for result in cartesian((5, 7), [8, 9], 'abc'):
	    print result
    if True:
        print '\nTesting SUBSETS.'
        for result in subsets(range(1, 5)):
            print result
    if True:
	print '\nTesting COMBINATIONS.'
	sequence = range(1, 5)
	for counter in range(len(sequence) + 2):
	    print "%d-combs of %s:" % (counter, sequence)
	    for combination in combinations(sequence, counter):
		print "   ", combination
    if True:
	print '\nTesting ARRANGEMENTS.'
	sequence = range(1, 5)
	for counter in range(len(sequence) + 2):
	    print "%d-arrs of %s:" % (counter, sequence)
	    for combination in arrangements(sequence, counter):
		print "   ", combination
    if True:
	print '\nTesting PERMUTATIONS.'
	for permutation in permutations(range(1, 5)):
	    print permutation

if __name__ == '__main__':
    test()


François Pinard


From  Mon Aug 26 22:56:52 2002
From: (Guido van Rossum)
Date: Mon, 26 Aug 2002 17:56:52 -0400
Subject: [Python-Dev] type categories
In-Reply-To: Your message of "Tue, 27 Aug 2002 00:51:51 +0300."
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> It gets harder if you want to remove a method or marker. The problem is 
> that there is currently no way to mask inherited attributes. This will
> require either a language extension that will allow you to del them or 
> using some other convention for this purpose.

Can't you use this?

class B:
   def foo(self): pass

class C(B):
   foo = None # Don't implement foo

--Guido van Rossum (home page:

From  Tue Aug 27 00:19:13 2002
From: (Michael McLay)
Date: Mon, 26 Aug 2002 19:19:13 -0400
Subject: [Python-Dev] Move sets and `cogen' into a math module
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Monday 26 August 2002 05:56 pm, François Pinard wrote:

> The module could be called `cogen', abbreviation for COmbinatorial
> GENerators.  Here is a first throw, to be criticised and improved.

With sets and possibly cogen being added to the standard library is it likely
that additional interesting math capabilities will creep into the standard
Python libraries? Reducing the clutter of the top level namespace is hard to
do if code depends on it, so better to do it right from the start. Do these
modules belong at the top level?  Would it make sense to change the math
module into a package and move the new module inside the math package?

From  Tue Aug 27 01:38:57 2002
From: (François Pinard)
Date: 26 Aug 2002 20:38:57 -0400
Subject: [Python-Dev] Re: Move sets and `cogen' into a math module
In-Reply-To: <>
References: <>
Message-ID: <>

[Michael McLay]

> On Monday 26 August 2002 05:56 pm, François Pinard wrote:

> > The module could be called `cogen', abbreviation for COmbinatorial
> > GENerators.  Here is a first throw, to be criticised and improved.

> With [...] possibly cogen being added to the standard library [...]
> Would it make sense to change the math module into a package and move
> the new module inside that the math package?

For one, I do not feel that `cogen' is especially mathematical, or geared
especially towards mathematical or numerical problems.

Algorithmic maybe...  But a good part of the Python library already shares
that property, doesn't it? :-)

François Pinard

From  Tue Aug 27 02:39:44 2002
From: (
Date: Mon, 26 Aug 2002 20:39:44 -0500
Subject: [Python-Dev] type categories
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 05:56:52PM -0400, Guido van Rossum wrote:
> > It gets harder if you want to remove a method or marker. The problem is 
> > that there is currently no way to mask inherited attributes. This will
> > require either a language extension that will allow you to del them or 
> > using some other convention for this purpose.
> Can't you use this?
> def B:
>    def foo(self): pass
> def C:
>    foo = None # Don't implement foo

This comes closer:

    def raise_attributeerror(self):
        raise AttributeError

    RemoveAttribute = property(raise_attributeerror)

    class A(object):
        def f(self): print "method A.f"
        def g(self): print "method A.g"

    class B(A):
        f = RemoveAttribute

    a = A()
    b = B()
    print hasattr(b, "f"), hasattr(B, "f"), hasattr(b, "g"), hasattr(B, "g")
    try:
        b.f
    except AttributeError: print "b.f does not exist (correctly)"
    else: print "Expected AttributeError not raised"

Writing 'b.f' will raise AttributeError, but unfortunately hasattr(B, 'f')
will still return True.
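
For readers on a current interpreter, here is a sketch of the same property
trick in Python 3 syntax. It is an illustration of the behaviour described
above, not a proposal; the class names are the ones used in this message, and
the methods return strings instead of printing so the result is easy to
inspect.

```python
def _raise_attribute_error(self):
    raise AttributeError("attribute removed in this subclass")


# Reusable marker: assign it to a name in a subclass to mask the inherited attribute.
RemoveAttribute = property(_raise_attribute_error)


class A:
    def f(self):
        return "method A.f"

    def g(self):
        return "method A.g"


class B(A):
    f = RemoveAttribute  # mask A.f on instances of B


b = B()
print(hasattr(b, "f"))  # instance lookup runs the property getter, which raises
print(hasattr(B, "f"))  # class lookup simply finds the property object
print(hasattr(b, "g"))  # g is inherited untouched
```

The first line prints False and the other two print True, which is exactly
the asymmetry between the instance and the class noted above.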


From  Tue Aug 27 06:18:28 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 08:18:28 +0300
Subject: [Python-Dev] type categories
In-Reply-To: <>; from on Mon, Aug 26, 2002 at 08:39:44PM -0500
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Mon, Aug 26, 2002 at 08:39:44PM -0500, wrote:
> On Mon, Aug 26, 2002 at 05:56:52PM -0400, Guido van Rossum wrote:
> > > It gets harder if you want to remove a method or marker. The problem is 
> > > that there is currently no way to mask inherited attributes. This will
> > > require either a language extension that will allow you to del them or 
> > > using some other convention for this purpose.
> > 
> > Can't you use this?
> > 
> > def B:
> >    def foo(self): pass
> > 
> > def C:
> >    foo = None # Don't implement foo
> This comes closer:
>     def raise_attributeerror(self):
>         raise AttributeError
>     RemoveAttribute = property(raise_attributeerror)
>     class A:
>         def f(self): print "method A.f"
>         def g(self): print "method A.g"
>     class B:
>         f = RemoveAttribute

Yes, that's a good solution. But it should be some special builtin 
out-of-band value, not a user-defined property.

> writing 'b.f' will raise AttributeError, but unfortunately hasattr(B, 'f')
> will still return True.

This isn't necessarily a problem but hasattr could be taught about this 
out-of-band value.


Proposed hierarchy for categories, types and interfaces:


Both types and interfaces define a set. The set membership test is the 
'isinstance' function (implemented by a new slot). For types the set 
membership is defined by inheritance - the isinstance handler will get the 
first argument's type and crawl up the __bases__ DAG to see if it finds the 
type itself.  Interfaces check the object's form instead of its ancestry.

An Iattribute interface checks for the presence of a single attribute and
applies another interface check to its value. An Icallsignature interface 
checks if the argument is a callable object with a specified number of 
arguments, default arguments, etc. An Iintersection interface checks that 
the argument matches a set of categories. 


    interface readable:
        def read(bytes: int): str
        def readline(): str
        def readlines(): [str]

is just a more convenient way to write:

    readable = Iintersection(
        Iattribute('read', Icallsignature(str, ('bytes', int) )),
        Iattribute('readline', Icallsignature(str)),
        Iattribute('readlines', Icallsignature(Ilistof(str)))
    )

The name 'readable' is simply bound to the resulting object; interfaces are
defined by their value, not their name.  The types of arguments and return 
values will not be checked at first and only serve as documentation. Note
that they don't necessarily have to be types - they can be interfaces, too.
For example, 'str|int' in an interface declaration will be converted to 
Iunion(str, int).

    >>>isinstance(file('/dev/null'), readable)
    >>>isinstance(MyFileLikeClass(), readable)

The MyFileLikeClass or file classes do not have to be explicitly declared 
as implementing the readable interface.  The benefit of explicit interface
declarations is that you will get an error if you write a method that does
not match the declaration. If you try to implement two conflicting
interfaces this can also be detected immediately - the intersection of the 
two interfaces will reduce to the empty interface.  For now this will only
catch the same method name with different number of arguments but in the 
future it may detect conflicting argument or return value types.
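
A rough sketch of how these building blocks could be prototyped in
present-day Python. Everything here is an assumption made for illustration:
the classes reuse the names from the proposal, membership is tested with a
hypothetical `check` method rather than through `isinstance`, only the number
of arguments is compared, and `list` stands in for `Ilistof(str)` because
types are documentation only at this stage.

```python
import inspect


class Icallsignature:
    """Matches callables taking the given number of arguments (types are not checked)."""

    def __init__(self, returns=None, *params):
        self.returns = returns
        self.names = [name for name, _type in params]

    def check(self, value):
        if not callable(value):
            return False
        try:
            sig = inspect.signature(value)
        except (TypeError, ValueError):
            # Some builtins expose no signature; assumption: accept them.
            return True
        return len(sig.parameters) == len(self.names)


class Iattribute:
    """Matches objects that have attribute `name` whose value satisfies `interface`."""

    def __init__(self, name, interface):
        self.name, self.interface = name, interface

    def check(self, obj):
        try:
            value = getattr(obj, self.name)
        except AttributeError:
            return False
        return self.interface.check(value)


class Iintersection:
    """Matches objects that satisfy every listed interface."""

    def __init__(self, *interfaces):
        self.interfaces = interfaces

    def check(self, obj):
        return all(i.check(obj) for i in self.interfaces)


readable = Iintersection(
    Iattribute("read", Icallsignature(str, ("bytes", int))),
    Iattribute("readline", Icallsignature(str)),
    Iattribute("readlines", Icallsignature(list)),
)


class MyFileLikeClass:
    def read(self, bytes): return ""
    def readline(self): return ""
    def readlines(self): return []


class NotAFile:
    def read(self): return ""  # wrong number of arguments, and no readline


print(readable.check(MyFileLikeClass()), readable.check(NotAFile()))
```

Neither class declares anything; membership follows purely from the form of
the object.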

  doesn't-have-anything-better-to-do-at-6-am-ly yours,


From  Tue Aug 27 06:29:11 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 01:29:11 -0400
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

[f(x, y) for x in X for y in Y] 

  is equivalent to:

[f(x, y) for x, y in cartesian(X, Y)] 

I never found any real use for list comprehensions with more than one 
dimension. When I use nested loops they are usually for something that 
cannot be expressed as a list comprehension.
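
The equivalence can be checked directly in present-day Python, where the
standard library's `itertools.product` plays the role of `cartesian` (it
yields tuples rather than lists). The function `f` and the inputs `X` and `Y`
are placeholders chosen for illustration.

```python
from itertools import product


def f(x, y):
    return "%s%s" % (x, y)


X = [1, 2]
Y = ["a", "b"]

nested = [f(x, y) for x in X for y in Y]
flat = [f(x, y) for x, y in product(X, Y)]

print(nested)
print(nested == flat)
```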


From  Tue Aug 27 06:56:36 2002
From: (Raymond Hettinger)
Date: Tue, 27 Aug 2002 01:56:36 -0400
Subject: [Python-Dev] A `cogen' module - an observation
References: <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <00a701c24d8e$8222aac0$6cb63bd0@othello>

From: "Oren Tirosh" <>

> [f(x, y) for x in X for y in Y] 
>   is equivalent to:
> [f(x, y) for x, y in cartesian(X, Y)] 

Is the order guaranteed to be the same?

Will each work the same for a non-restartable
iterator, say a file object (equivalently put,
does the second one read Y once or many times)?

Would Descartes object to his name being used thusly?

Raymond Hettinger

From  Tue Aug 27 06:59:52 2002
From: (Raymond Hettinger)
Date: Tue, 27 Aug 2002 01:59:52 -0400
Subject: [Python-Dev] Move sets and `cogen' into a math module
References: <> <> <> <>
Message-ID: <00b201c24d8e$f6e9afc0$6cb63bd0@othello>

I think of sets as being more closely affiliated with heapq, UserDict, and array
and being less affiliated with math, cmath, random, etc.

Raymond Hettinger

----- Original Message ----- 
From: "Michael McLay" <

With sets and possibly cogen being added to the standard library is it likely 
that additional interesting math capabilities will creep into the standard 
Python libraries? Reducing the clutter of the top level namespace is hard to 
do if code depends on it, so better to do it right from the start. Do these 
modules belong at the top level?  Would it make sense to change the math 
module into a package and move the new module inside that the math package?


From  Tue Aug 27 07:14:51 2002
From: (Greg Ewing)
Date: Tue, 27 Aug 2002 18:14:51 +1200 (NZST)
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <>
Message-ID: <>

> [f(x, y) for x in X for y in Y] 
>   is equivalent to:
> [f(x, y) for x, y in cartesian(X, Y)] 

Hmmm, in other words, cartesian() is a lazy version
of zip().

(Given its intended use, I always thought zip()
should have been lazy from the beginning, but
the BDFL thought otherwise.)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Tue Aug 27 07:15:54 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 02:15:54 -0400
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Tue, Aug 27, 2002 at 06:14:51PM +1200, Greg Ewing wrote:
> > [f(x, y) for x in X for y in Y] 
> > 
> >   is equivalent to:
> > 
> > [f(x, y) for x, y in cartesian(X, Y)] 
> Hmmm, in other words, cartesian() is a lazy version
> of zip().


>>> zip([1, 2], ['a', 'b'])
[(1, 'a'), (2, 'b')]

>>> list(cartesian([1, 2], ['a', 'b']))
[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]


From  Tue Aug 27 07:27:40 2002
From: (Oren Tirosh)
Date: Tue, 27 Aug 2002 02:27:40 -0400
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <00a701c24d8e$8222aac0$6cb63bd0@othello>
References: <> <> <> <> <> <> <> <> <> <00a701c24d8e$8222aac0$6cb63bd0@othello>
Message-ID: <>

On Tue, Aug 27, 2002 at 01:56:36AM -0400, Raymond Hettinger wrote:
> From: "Oren Tirosh" <>
> > [f(x, y) for x in X for y in Y] 
> > 
> >   is equivalent to:
> > 
> > [f(x, y) for x, y in cartesian(X, Y)] 
> Is the order guaranteed to be the same?


Combinatorial generators.

All generators below have the property of yielding successive results
in sorted order, given that input sequences were already sorted.

> Will each work the same for a non-restartable
> iterator, say a file object (equivalently put,
> does the second one read Y once or many times)?

They have exactly the same re-iterability wart as nested loops or list 
comprehensions - an exhausted iterator is indistinguishable from an empty 
one.

> Would Descartes object to his name being used thusly?

The cartesian product is a set operation and therefore has no defined 
order. When generating it you need some specific order and this one makes
the most sense. If you use it with a 'nested loop' mindset instead of a
'set theory' mindset Rene would have had some grounds for objection :-)
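
A small sketch illustrating the re-iterability point made above. The
generator is a Python 3 port of the recursive `cartesian` posted earlier in
this thread; the sample inputs are made up for illustration.

```python
def cartesian(*sequences):
    """Python 3 port of the recursive generator from the `cogen' posting."""
    if not sequences:
        yield []
    else:
        first, remainder = sequences[0], sequences[1:]
        for element in first:
            for result in cartesian(*remainder):
                result.insert(0, element)
                yield result


X = [1, 2]

# With two real lists, every pairing is produced.
from_lists = list(cartesian(X, ["a", "b"]))

# With a one-shot iterator as second argument, it is exhausted after the
# first element of X, so the pairings for 2 silently disappear.
from_iterator = list(cartesian(X, iter(["a", "b"])))

# The nested-loop spelling loses the same pairings, for the same reason.
ys = iter(["a", "b"])
nested = [[x, y] for x in X for y in ys]

print(len(from_lists), len(from_iterator), from_iterator == nested)
```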


From  Tue Aug 27 13:58:16 2002
From: (Michael Chermside)
Date: Tue, 27 Aug 2002 08:58:16 -0400
Subject: [Python-Dev] Fw: Security hole in rexec?
Message-ID: <>

> [rexec compromised by deleting __builtins__]
> This has been known for a while, see
> My recommendation is the same as always: don't trust rexec.
> --Guido van Rossum (home page:

I think it is a VERY BAD idea to advertise publicly that rexec can be 
used to "safely" restrict execution, while admitting only privately (i.e., 
in the above postings to a developers-only list and to SourceForge) that it 
cannot be trusted.

Therefore I propose that the official documentation to the Python 
Library Reference for the module rexec be modified to add a note saying 
that rexec is not completely reliable and can be undermined by a 
knowledgeable hacker. The current documentation STRONGLY implies this is 
NOT the case by explaining in detail the more minor susceptibility to 
DOS attacks (memory or CPU time) and raising SystemExit.

Why not add something like the following to the beginning of the module 
documentation:

"""
Warning: While the rexec module is designed to perform as described 
below, it does have a few known vulnerabilities which could be exploited 
by carefully written code. Thus it should not be relied upon in 
situations requiring "production ready" security. In such situations, 
execution via sub-processes (a separate Python executable) or very 
careful "cleansing" of data to be processed may be necessary. 
Alternatively, help in patching known rexec vulnerabilities would be 
welcomed.
"""
Admitting to library weaknesses (especially in the area of security) 
doesn't make great PR, but at least it's honest!

-- Michael Chermside

From  Tue Aug 27 16:30:07 2002
From: (Michael Chermside)
Date: Tue, 27 Aug 2002 11:30:07 -0400
Subject: [Python-Dev] Re: Fw: Security hole in rexec?
References: <> <>
Message-ID: <>

 >>> [rexec is broke]
 >> [let's document that]
> Yes.  This should be done.
> --Guido van Rossum (home page:

OK. I'll submit a patch.

-- Michael Chermside

From  Tue Aug 27 16:02:24 2002
From: (Guido van Rossum)
Date: Tue, 27 Aug 2002 11:02:24 -0400
Subject: [Python-Dev] Fw: Security hole in rexec?
In-Reply-To: Your message of "Tue, 27 Aug 2002 08:58:16 EDT."
References: <>
Message-ID: <>

> > [rexec compromised by deleting __builtins__]
> > 
> > This has been known for a while, see
> > 
> > My recommendation is the same as always: don't trust rexec.
> > 
> > --Guido van Rossum (home page:
> I think it is a VERY BAD idea to advertise publicly that rexec can be 
> used to "safely" restrict execution, while privately (ie, the above 
> postings to a developers-only list and to sourceforge).
> Therefore I propose that the official documentation to the Python 
> Library Reference for the module rexec be modified to add a note saying 
> that rexec is not completely reliable and can be undermined by a 
> knowledgable hacker. The current documentation STRONGLY implies this is 
> NOT the case by explaining in detail the more minor susceptibility to 
> DOS attacks (memory or CPU time) and raising SystemExit.
> Why not add something like the following to the beginning of the module 
> documentation:
> """
> Warning: While the rexec module is designed to perform as described 
> below, it does have a few known vulnerabilities which could be exploited 
> by carefully written code. Thus it should not be relied upon in 
> situations requiring "production ready" security. In such situations, 
> execution via sub-processes (a separate Python executable) or very 
> careful "cleansing" of data to be processed may be necessary. 
> Alternatively, help in patching known rexec vulnerabilities would be 
> welcomed.
> """
> Admitting to library weaknesses (especially in the area of security) 
> doesn't make great PR, but at least it's honest!

Yes.  This should be done.

--Guido van Rossum (home page:

From  Tue Aug 27 20:37:55 2002
From: (Marcelo Matus)
Date: Tue, 27 Aug 2002 12:37:55 -0700
Subject: [Python-Dev] valgrind and python?
Message-ID: <>

Did somebody already test python 2.2.1 using valgrind 1.0.1?

I ask because, when testing one of my own modules, I get the following
recurring error report whenever I import something:

==19951== Conditional jump or move depends on uninitialised value(s)
==19951==    at 0x8094B85: find_module (in 
==19951==    by 0x8095DE2: import_submodule (Python/import.c:1887)
==19951==    by 0x80959B8: load_next (Python/import.c:1752)
==19951==    by 0x8097608: import_module_ex (Python/import.c:1603)

and  I don't know if I can ignore it or if this is a real python error.


PS: the error appears whenever you import a module, so,
after installing valgrind, try doing:

    echo import math >
    valgrind /usr/local/bin/python

if you have python installed in the /usr/local/bin directory, of course.

From  Tue Aug 27 20:57:57 2002
From: (Steve M. Robbins)
Date: Tue, 27 Aug 2002 15:57:57 -0400
Subject: [Python-Dev] Re: Weird error handling in os._execvpe
Message-ID: <>


I think the patch associated with this thread has an unintended
consequence.

Zack pointed out three flaws in the original code:

    Third, if an error other than the expected one comes back, the
    loop clobbers the saved exception info and keeps going.  Consider
    the situation where PATH=/bin:/usr/bin, /bin/foobar exists but is
    not executable by the invoking user, and /usr/bin/foobar does not
    exist.  The exception thrown will be 'No such file or directory',
    not the expected 'Permission denied'.

The patch, as I understand it, changes the behaviour so as to raise
the exception "Permission denied" in this case.

Consider a similar situation in which both /bin/foobar (not executable
by the user) and /usr/bin/foobar (executable by the user) exist.
Given the command "foobar", the shell will execute /usr/bin/foobar.
If I understand the patch correctly, python will give up when it
encounters /bin/foobar and raise the "Permission denied" exception.

I believe this just happened to me today.  I had a shell script named
"gcc" in ~/bin (first on my path) some months back.  When I was
finished with it, I just did "chmod -x ~/bin/gcc" and forgot about it.
Today was the first time since this patch went in that I ran gcc via
python (using scipy's weave).  Boy was I surprised at the message
"unable to execute gcc: Permission denied"!

I guess the fix is to save the EPERM exception and keep going
in case there is an executable later in the path.
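
A minimal sketch of that fix, for illustration only: it is not the code in
os.py.  The function name search_path_and_exec and the injectable do_exec
parameter are made up here so the search logic can be tried without replacing
the current process.  Note that "Permission denied" is errno.EACCES; EPERM is
"Operation not permitted".

```python
import errno
import os


def search_path_and_exec(file, args, env, do_exec=os.execve):
    """Try each PATH directory in turn, the way a shell would.

    Hypothetical helper: do_exec defaults to os.execve, but a stand-in can
    be passed to exercise the search logic.
    """
    if os.sep in file:
        return do_exec(file, args, env)

    saved = None   # first error that is not simply "file not found"
    last = None    # most recent error of any kind
    for directory in env.get("PATH", os.defpath).split(os.pathsep):
        fullname = os.path.join(directory, file)
        try:
            return do_exec(fullname, args, env)
        except OSError as exc:
            last = exc
            # A missing file is expected.  Remember anything else (such as
            # EACCES), but keep searching: a later directory may still hold
            # a usable executable.
            if exc.errno not in (errno.ENOENT, errno.ENOTDIR) and saved is None:
                saved = exc
    if saved is not None:
        raise saved
    if last is not None:
        raise last
    raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), file)
```

With PATH=/bin:/usr/bin, a non-executable /bin/foobar no longer hides an
executable /usr/bin/foobar; the saved "Permission denied" error is raised only
if nothing later on the path could be run.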


From  Wed Aug 28 00:17:59 2002
From: (Neal Norwitz)
Date: Tue, 27 Aug 2002 19:17:59 -0400
Subject: [Python-Dev] valgrind and python?
References: <>
Message-ID: <>

Marcelo Matus wrote:
> Did somebody already test python 2.2.1 using valgrind 1.0.1?

I have used valgrind on python, but mostly the latest CVS version,
not 2.2.1.  valgrind and python were built with gcc 2.96 (redhat 7.2).
I also use purify from time to time when it works (pretty rare).

> because testing one of my own modules, I get the following
> recurrent error report whenever I import something:
> ==19951== Conditional jump or move depends on uninitialised value(s)
> ==19951==    at 0x8094B85: find_module (in
> /home/mmatus/oss2/gcc3/bin/python)
> ==19951==    by 0x8095DE2: import_submodule (Python/import.c:1887)
> ==19951==    by 0x80959B8: load_next (Python/import.c:1752)
> ==19951==    by 0x8097608: import_module_ex (Python/import.c:1603)
> and  I don't know if I can ignore it or if this is a real python error.

Unlikely a python problem.  I just tried valgrind with the current
CVS version of 2.2.1+ (what will eventually become 2.2.2).
There were no problems reported doing import math.

You can try several things to fix the warning:
 * run without your module
 * use a different compiler (version)
 * run the CVS version: cvs upd -r release22-maint
 * pass --workaround-gcc296-bugs=yes option to valgrind


From  Wed Aug 28 01:25:53 2002
From: (Marcelo Matus)
Date: Tue, 27 Aug 2002 17:25:53 -0700
Subject: [Python-Dev] valgrind and python?
References: <> <>
Message-ID: <>

hmmm... I tried what you said:

- I used the last cvs 2.2 version (cvs upd -r release22-maint)

- I never load my module

- I used gcc 3.1.1

- I even passed the option --workaround-gcc296-bugs=yes

but still I've got the same kind of error report.....
maybe the compiler is too new, but I can go back to
gcc 2.96 (because of the c++ support).

So, in the meantime, I will just ignore the reports......



Neal Norwitz wrote:

>Marcelo Matus wrote:
>>Did somebody already test python 2.2.1 using valgrind 1.0.1?
>I have used valgrind on python, but mostly the latest CVS version,
>not 2.2.1.  valgrind and python were built with gcc 2.96 (redhat 7.2).
>I also use purify from time to time when it works (pretty rare).
>>because testing one of my own modules, I get the following
>>recurrent error report whenever I import something:
>>==19951== Conditional jump or move depends on uninitialised value(s)
>>==19951==    at 0x8094B85: find_module (in
>>==19951==    by 0x8095DE2: import_submodule (Python/import.c:1887)
>>==19951==    by 0x80959B8: load_next (Python/import.c:1752)
>>==19951==    by 0x8097608: import_module_ex (Python/import.c:1603)
>>and  I don't know if I can ignore it or if this is a real python error.
>Unlikely a python problem.  I just tried valgrind with the current
>CVS version of 2.2.1+ (what will eventually become 2.2.2).
>There were no problems reported doing import math.
>You can try several things to fix the warning:
> * run without your module
> * use a different compiler (version)
> * run the CVS version: cvs upd -r release22-maint
> * pass --workaround-gcc296-bugs=yes option to valgrind

From  Wed Aug 28 01:40:30 2002
From: (Greg Ewing)
Date: Wed, 28 Aug 2002 12:40:30 +1200 (NZST)
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <>
Message-ID: <>

Oren Tirosh <>:

> > Hmmm, in other words, cartesian() is a lazy version
> > of zip().
> Nope.
> >>> zip([1, 2], ['a', 'b'])
> [(1, 'a'), (2, 'b')]
> >>> list(cartesian([1, 2], ['a', 'b']))
> [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Sorry, BrainError. In that case, it's probably
faster to use the nested loops -- unless
cartesian() were implemented in C.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Wed Aug 28 02:08:43 2002
From: (
Date: Tue, 27 Aug 2002 20:08:43 -0500
Subject: [Python-Dev] valgrind and python?
In-Reply-To: <>
References: <> <> <>
Message-ID: <>

On Tue, Aug 27, 2002 at 05:25:53PM -0700, Marcelo Matus wrote:
> So, in the meantime, I will just ignore the reports......

I hope you know that you can write a "suppressions" file to make valgrind
never print this message for this location in Python.  If not, read the
documentation.


From  Wed Aug 28 02:02:23 2002
From: (Guido van Rossum)
Date: Tue, 27 Aug 2002 21:02:23 -0400
Subject: [Python-Dev] Re: Weird error handling in os._execvpe
In-Reply-To: Your message of "Tue, 27 Aug 2002 15:57:57 EDT."
References: <>
Message-ID: <>

> I think the patch associated with this thread has an unintended
> consequence.
> In
> Zack pointed out three flaws in the original code:
>     [...]
>     Third, if an error other than the expected one comes back, the
>     loop clobbers the saved exception info and keeps going.  Consider
>     the situation where PATH=/bin:/usr/bin, /bin/foobar exists but is
>     not executable by the invoking user, and /usr/bin/foobar does not
>     exist.  The exception thrown will be 'No such file or directory',
>     not the expected 'Permission denied'.
> The patch, as I understand it, changes the behaviour so as to raise
> the exception "Permission denied" in this case.
> Consider a similar situation in which both /bin/foobar (not executable
> by the user) and /usr/bin/foobar (executable by the user) exist.
> Given the command "foobar", the shell will execute /usr/bin/foobar.
> If I understand the patch correctly, python will give up when it
> encounters /bin/foobar and raise the "Permission denied" exception.
> I believe this just happened to me today.  I had a shell script named
> "gcc" in ~/bin (first on my path) some months back.  When I was
> finished with it, I just did "chmod -x ~/bin/gcc" and forgot about it.
> Today was the first time since this patch went in that I ran gcc via
> python (using scipy's weave).  Boy was I surprised at the message
> "unable to execute gcc: Permission denied"!
> I guess the fix is to save the EPERM exception and keep going
> in case there is an executable later in the path.

This is definitely a bug.  Can you or Zack provide a patch?

I've opened a bug report:

--Guido van Rossum (home page:

From  Wed Aug 28 02:11:48 2002
From: (Marcelo Matus)
Date: Tue, 27 Aug 2002 18:11:48 -0700
Subject: [Python-Dev] valgrind and python?
References: <> <> <> <>
Message-ID: <>

Thanks, now that I know that there is nothing wrong with python,
I'll check how to ignore the reports automatically.

Thanks again

Marcelo wrote:

>On Tue, Aug 27, 2002 at 05:25:53PM -0700, Marcelo Matus wrote:
>>So, in the meantime, I will just ignore the reports......
>I hope you know that you can write a "supressions" file to make valgrind
>never print this message for this location in Python.  If not, read the
>documentation ..

From  Wed Aug 28 02:26:00 2002
From: (Guido van Rossum)
Date: Tue, 27 Aug 2002 21:26:00 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Mon, 26 Aug 2002 17:56:22 EDT."
References: <> <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> def cartesian(*sequences):
>     """\
> Generate the `cartesian product' of all SEQUENCES.  Each member of the
> product is a list containing an element taken from each original sequence.
> """
>     if len(sequences) == 0:
> 	yield []
>     else:
> 	first, remainder = sequences[0], sequences[1:]
> 	for element in first:
> 	    for result in cartesian(*remainder):
> 		result.insert(0, element)
> 		yield result

It occurred to me that this is rather inefficient because it invokes
itself recursively many times (once for each element in the first
sequence).  This version is much faster, because iterating over a
built-in sequence (like a list) is much faster than iterating over a
generator:

def cartesian(*sequences):
    if len(sequences) == 0:
        yield []
    else:
        head, tail = sequences[:-1], sequences[-1]
        for x in cartesian(*head):
            for y in tail:
                yield x + [y]

I also wonder if perhaps ``tail = list(tail)'' should be inserted
just before the for loop, so that the arguments may be iterators as
well.

I would have more suggestions (I expect that Eric Raymond's powerset
is much faster than your recursive subsets()) but my family is calling

--Guido van Rossum (home page:

From  Wed Aug 28 03:41:18 2002
From: (Fred L. Drake, Jr.)
Date: Tue, 27 Aug 2002 22:41:18 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Doc/lib libexcs.tex,1.43,
In-Reply-To: <>
References: <>
Message-ID: <>

[Following up to a message that went to the checkins list.]

Raymond sez:
 > Note change in behavior from 1.5.2.  The new argument to
 > NameError is an error message and not just the missing name.

Skip Montanaro writes:
 > It seems to me that somewhere in the docs it would be worthwhile to state
 >     Messages to exceptions are not part of the Python API.  Their contents
 >     may change from one version of Python to the next without warning and
 >     should not be relied on for code which will be run with multiple
 >     versions of the interpreter.


The catch, of course, is that it's not clear (perhaps only to me?)
that what changed was a message.  I'd interpret the original behavior
(if documented, which I won't bother to check) as an API requirement.
AttributeError used to have a similar behavior; I don't know how
rigorously that's been maintained either.

In either case, I think the ideal solution to the problem of figuring
out what went wrong, from within the executing program, is for these
errors to have an attribute that identifies the missing name ("name"
would be a good name for it).  KeyError could similarly have an
attribute "key".  To deal with existing code, the attributes would not
be set.  Additional C functions could be provided for use in code that
is modified to provide the information.
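
A rough sketch of the idea, not an existing API: MissingNameError and lookup()
are hypothetical names, written for a current Python 3.  The attribute defaults
to None, matching the point above that code which has not been updated would
simply leave it unset.

```python
class MissingNameError(NameError):
    """Hypothetical NameError variant that records the missing name."""

    def __init__(self, message, name=None):
        NameError.__init__(self, message)
        # None means "raised by code that does not supply the name".
        self.name = name


def lookup(namespace, name):
    """Look up name in a dict, reporting failures with a .name attribute."""
    try:
        return namespace[name]
    except KeyError:
        raise MissingNameError("name %r is not defined" % name, name)
```

A handler can then read exc.name directly instead of parsing the message text,
which stays free to change between versions.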


Fred L. Drake, Jr.  <fdrake at>
PythonLabs at Zope Corporation

From  Wed Aug 28 03:43:06 2002
From: (Tim Peters)
Date: Tue, 27 Aug 2002 22:43:06 -0400
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator approval
Message-ID: <>

Well, if SpamAssassin wasn't so stupid, I suppose you could have read this
msg <wink>.

-----Original Message-----
From: []
Sent: Tuesday, August 27, 2002 10:38 PM
Subject: Your message to Python-Dev awaits moderator approval

Your mail to 'Python-Dev' with the subject

    The first trustworthy <wink> GBayes results

Is being held until the list moderator can review it for approval.

The reason it is being held:

    Message has a suspicious header

Either the message will get posted to the list, or you will receive
notification of the moderator's decision.

From  Wed Aug 28 03:36:17 2002
From: (Tim Peters)
Date: Tue, 27 Aug 2002 22:36:17 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
Message-ID: <>

Setting this up has been a bitch.  All early attempts floundered because it
turned out there was *some* systematic difference between the ham and spam
archives that made the job trivial.

The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
4,000 each, at random, from a python-list archive Barry put together,
containing msgs only after SpamAssassin was put into play on python.org.
It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.  As
will be seen below, it's not clean enough.

The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
collection, at <>.  It was broken at random
into 5 sets of 2,750 spams each.

Problems included:

+ Mailman added distinctive headers to every message in the ham
  archive, which appear nowhere in the spam archive.  A Bayesian
  classifier picks up on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing
  addresses that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive
  plain Unix \n.  This turned out to be a killer clue(!) in the simplest
  character n-gram attempts.  (Note:  I can't use text mode to read
  msgs, because there are binary characters in the archives that
  Windows treats as EOF in text mode -- indeed, 400MB of the ham
  archive vanishes when read in text mode!)

What I'm reporting on here is after normalizing all line-ends to \n, and
ignoring the headers *completely*.  There are obviously good clues in the
headers, the problem is that they're killer-good clues for accidental
reasons in this test data.  I don't want to write code to suppress these
clues either, as then I'd be testing some mix of my insights (or lack
thereof) with what a blind classifier would do.  But I don't care how good I
am, I only care about how well the algorithm does.

Since it's ignoring the headers, I think it's safe to view this as a lower
bound on what can be achieved.  There's another way this should be a lower
bound:

def tokenize_split(string):
    for w in string.split():
        yield w

tokenize = tokenize_split

class Msg(object):
    def __init__(self, dir, name):
        path = dir + "/" + name
        self.path = path
        f = file(path, 'rb')
        guts = f.read()
        f.close()
        # Skip the headers.
        i = guts.find('\n\n')
        if i >= 0:
            guts = guts[i+2:]
        self.guts = guts

    def __iter__(self):
        return tokenize(self.guts)

This is about the stupidest tokenizer imaginable, merely splitting the body
on whitespace.  Here's the output from the first run, training against one
pair of spam+ham groups, then seeing how its predictions stack up against
each of the four other pairs of spam+ham groups:

Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams
    testing against Data/Spam/Set2 and Data/Ham/Set2
    tested 4000 hams and 2750 spams
    false positive: 0.00725 (i.e., under 1%)
    false negative: 0.0530909090909 (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
    tested 4000 hams and 2750 spams
    false positive: 0.007
    false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
    tested 4000 hams and 2750 spams
    false positive: 0.0065
    false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
    tested 4000 hams and 2750 spams
    false positive: 0.00675
    false negative: 0.0516363636364
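
For readers checking the arithmetic, here is a small sketch of how such rates
are formed.  The function name error_rates is made up, and the counts 29 and
146 are not printed in the output above; they are inferred from the Set2 rates
(0.00725 * 4000 and 0.0530909... * 2750).

```python
def error_rates(n_ham, n_spam, ham_called_spam, spam_called_ham):
    """Return (false positive rate, false negative rate).

    A false positive is a ham classified as spam; a false negative is a
    spam classified as ham.
    """
    false_positive = ham_called_spam / float(n_ham)
    false_negative = spam_called_ham / float(n_spam)
    return false_positive, false_negative


# Counts inferred from the Set2 figures, not taken from the original output.
fp, fn = error_rates(4000, 2750, 29, 146)
```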

It's a Good Sign that the false positive/negative rates are very close
across the four test runs.  It's possible to quantify just how good a sign
that is, but they're so close by eyeball that there's no point in bothering.

This is using the new in the sandbox, and that class automatically
remembers the false positives and negatives.  Here's the start of the first
false positive from the first run:

It's not really hard!!
Turn $6.00 into $1,000 or this to find out how!! READING
THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board
to try it. A little while back, while chatting on the internet, I came
across an article
similar to this that said you could make thousands of dollars in cash
within weeks
with only an initial investment of $6.00! So I thought, "Yeah right,
this must be a scam", but like most of us, I was curious, so I kept
reading. Anyway,
it said that you send $1.00 to each of the six names and address
statedin the
article. You then place your own name and address in the bottom of the
list at #6, and
post the article in at least 200 newsgroups (There are thousands) or
e-mail them. No

Call me forgiving, but I think it's vaguely possible that this should have
been in the spam corpus instead <wink>.

Here's the start of the second false positive:

Please forward this message to anyone you know who is active in the s[...]

See Below for Press Release

Dear Friends,

I am a normal investor same as you.  I am not a finance  professional nor am
I connected to FDNI in any way.

I recently stumbled onto this OTC stock (FDNI) while searching through yahoo
for small float, big potential stocks. At the time, the company had r[...]
a press release which stated they were doing a stock buyback.  Intrigued, I
bought 5,000 shares at $.75 each.  The stock went to $1.50 and I sold [...]
shares.  I then bought them back at $1.15.  The company then circulated
another press release about a foreign acquisition (see below).  The stock
jumped to $2.75 (I sold @ $2.50 for a massive profit).  I then bought [...]
in at $1.25 where I am holding until the next major piece of news.

Here's the start of the third:

Grand Treasure Industrial Limited

Contact Information

We are a manufacturer and exporter in Hong Kong for all kinds of plas[...]
We export to worldwide markets. Recently , we join-ventured with a ba[...]
factory in China produce all kinds of shopping , lady's , traveller's
bags.... visit our page and send us your enquiry by email now.
Contact Address :
Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong.
Telephone : ( 852 ) 2408 9382

That is, all the "false positives" there are blatant spam.  It will take a
long time to sort this all out, but I want to make a point here now:  the
classifier works so well that it can *help* clean the ham corpus!  I haven't
found a non-spam among the "false positives" yet.  Another lesson reinforces
one from my previous life in speech recognition:  rigorous data collection,
cleaning, tagging and maintenance is crucial when working with statistical
approaches, and is damned expensive to do.

Here's the start of the first "false negative" (including the headers):

Return-Path: <911@911.COM>
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.) (
  by with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <>
To: "NEW020515" <911@911.COM>
From: "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdata= =A3=A9" <911@911.COM>
Subject: =D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdata= =A3=A9
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977


Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier.  Since it's never seen these "words" before, it has no reason to
believe they're either spam or ham indicators, and favors calling it ham.
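
A small illustration of that effect, written for a current Python 3.  The
sample bytes are the start of the Subject header quoted above, standing in for
a body line of the same message; the plain-text line and the variable names
are made up for comparison.

```python
import quopri

# Encoded text with no literal whitespace: one opaque "word" to str.split().
qp_line = b"=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE"
# An ordinary text line, for contrast.
plain_line = b"What is the key to break the script execution?"

qp_tokens = qp_line.split()
plain_tokens = plain_line.split()

# Decoding quoted-printable first recovers the underlying bytes.
decoded = quopri.decodestring(qp_line)
```

The whitespace split sees the whole encoded line as one never-before-seen
token, so it carries no evidence either way.  Decoding before tokenizing is
one option, though that is exactly the kind of "helping" the message goes on
to argue against.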

One more mondo cool thing and that's it for now.  The GrahamBayes class
keeps track of how many times each word makes it into the list of the
strongest indicators.  These are the "killer clues" the classifier gets the
most value from.  The most valuable spam indicator turned out to be
"<br>" -- there's simply almost no HTML mail in the ham archive (but note
that this clue would be missed if you stripped HTML!).  You're never going
to guess what the most valuable non-spam indicator was, but it's quite
plausible after you see it.  Go ahead, guess.  Chicken <wink>.

Here are the 15 most-used killer clues across the runs shown above:  the
repr of the word, followed by the # of times it made into the 15-best list,
and the estimated probability that a msg is spam if it contains this word:

    testing against Data/Spam/Set2 and Data/Ham/Set2
    best discrimators:
        'Helvetica,' 243 0.99
        'object' 245 0.01
        'language' 258 0.01
        '<BR>' 292 0.99
        '>' 339 0.179104
        'def' 397 0.01
        'article' 423 0.01
        'module' 436 0.01
        'import' 499 0.01
        '<br>' 652 0.99
        '>>>' 667 0.01
        'wrote' 677 0.01
        'python' 755 0.01
        'Python' 1947 0.01
        'wrote:' 1988 0.01

    testing against Data/Spam/Set3 and Data/Ham/Set3
    best discrimators:
        'string' 494 0.01
        'Helvetica,' 496 0.99
        'language' 524 0.01
        '<BR>' 553 0.99
        '>' 687 0.179104
        'article' 851 0.01
        'module' 857 0.01
        'def' 875 0.01
        'import' 1019 0.01
        '<br>' 1288 0.99
        '>>>' 1344 0.01
        'wrote' 1355 0.01
        'python' 1461 0.01
        'Python' 3858 0.01
        'wrote:' 3984 0.01

    testing against Data/Spam/Set4 and Data/Ham/Set4
    best discrimators:
        'object' 749 0.01
        'Helvetica,' 757 0.99
        'language' 763 0.01
        '<BR>' 877 0.99
        '>' 954 0.179104
        'article' 1240 0.01
        'module' 1260 0.01
        'def' 1364 0.01
        'import' 1517 0.01
        '<br>' 1765 0.99
        '>>>' 1999 0.01
        'wrote' 2071 0.01
        'python' 2160 0.01
        'Python' 5848 0.01
        'wrote:' 6021 0.01

    testing against Data/Spam/Set5 and Data/Ham/Set5
    best discrimators:
        'object' 980 0.01
        'language' 992 0.01
        'Helvetica,' 1005 0.99
        '<BR>' 1139 0.99
        '>' 1257 0.179104
        'article' 1678 0.01
        'module' 1702 0.01
        'def' 1846 0.01
        'import' 2003 0.01
        '<br>' 2387 0.99
        '>>>' 2624 0.01
        'wrote' 2743 0.01
        'python' 2864 0.01
        'Python' 7830 0.01
        'wrote:' 8060 0.01
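
The 0.99 and 0.01 values in the last column look like clamped estimates.  The
sketch below assumes the GrahamBayes class follows the formulas in Paul
Graham's essay "A Plan for Spam"; that is an assumption, not something the
message states, and the function names and the word counts used with them are
made up.

```python
def word_spam_probability(good_count, bad_count, ngood, nbad):
    """Per-word spam probability in the style of Graham's essay.

    good_count / bad_count: occurrences of the word in ham / spam.
    ngood / nbad: number of ham / spam messages trained on.
    Returns None for words seen too rarely to judge.
    """
    g = 2.0 * good_count          # ham counts are doubled to bias against false positives
    b = float(bad_count)
    if g + b < 5:
        return None
    g_ratio = min(1.0, g / ngood)
    b_ratio = min(1.0, b / nbad)
    p = b_ratio / (g_ratio + b_ratio)
    return max(0.01, min(0.99, p))  # clamp, so no single word is conclusive


def combined_probability(probs):
    """Combine per-word probabilities, treating the words as independent."""
    prod = 1.0
    inverse_prod = 1.0
    for p in probs:
        prod *= p
        inverse_prod *= 1.0 - p
    return prod / (prod + inverse_prod)
```

Under this scheme a word seen only in spam (as '<br>' nearly is here) clamps
to 0.99, and one seen only in ham (such as 'wrote:') clamps to 0.01.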

Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list.  I've had this
argument with some of you before <wink>, but the best way to let this kind
of thing be as intelligent as it can be is not to try to help it too much:
it will learn things you'll never dream of, provided only you don't filter
clues out in an attempt to be clever.

everything's-a-clue-ly y'rs  - tim

From  Wed Aug 28 04:09:49 2002
From: (Skip Montanaro)
Date: Tue, 27 Aug 2002 22:09:49 -0500
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator approval
In-Reply-To: <>
References: <>
Message-ID: <>

    Tim> Well, if SpamAssassin wasn't so stupid, I suppose you could have
    Tim> read this msg <wink>.

I think that response was generated by Mailman.  SpamAssassin does nothing
more than tag messages...


From  Wed Aug 28 04:20:06 2002
From: (Tim Peters)
Date: Tue, 27 Aug 2002 23:20:06 -0400
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator
In-Reply-To: <>
Message-ID: <>

[Skip Montanaro]
> I think that response was generated by Mailman.  SpamAssassin does
> nothing more than tag messages...

Like I care who the stupid party is -- the smart party was Martijn Pieters,
who now stays up all night waiting to approve my msgs <wink>.

From  Wed Aug 28 04:51:09 2002
From: (Tim Peters)
Date: Tue, 27 Aug 2002 23:51:09 -0400
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator
In-Reply-To: <>
Message-ID: <>

FYI, here's the closest thing to a real false positive I've seen so far:

What is the key for to break the script execution in pythonwin?


Follow the White Rabbit...
Knock Knock Neo...

ICQ #42292922
Webmaster di
La prima community di web developer Ad Free
The first web developer's community  Ad Free

Other "false positives" included a strangely quoted copy of the Nigerian
scam, "STOP PAYING $19.95 or more TODAY for your web site, WHEN YOU CAN
GET ONE FOR ONLY $11.95 PER MONTH!", something entirely composed of high-bit
characters except for a URL pointing to a Russian hosting site, "I am
looking for young models (prefer early teen 14-16 year old female) for nude
and semi-nude photography", "Everyone likes making easy money.  This place
pays you 50 cents for every hour you browse the web!!!", "We want to invited
you for a new adult TOP50 on", and my favorite:

    These girls are not phonesex workers.  They are horny girls who are
    on the line cuz it's free phonesex for them.

Unfortunately, I don't think I can fudge this system to let msgs thru from
my sisters <wink>.

From  Wed Aug 28 05:04:41 2002
From: (Skip Montanaro)
Date: Tue, 27 Aug 2002 23:04:41 -0500
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator
In-Reply-To: <>
References: <>
Message-ID: <>

    Tim> FYI, here's the closest thing to a real false positive I've seen so
    Tim> far:

I have much smaller spam and ham corpora (currently about 400 msgs each),
but both consist only of messages sent to me in the past couple weeks
(though not all messages sent during that interval), so some of the header
clues which skewed Tim's tests shouldn't be present.  Using my currently
undeleted Python mail as "unknown" (but which doesn't actually contain any
spam), I saw two false positives.  One had an attached gif image.  The other
was a one-line text+html message whose "words" were thus dominated by the
HTML tags in the second part.

Once my spam and ham grow to something more like 2000 each I will try Tim's
technique of splitting them into smaller chunks, training on one chunk, then
testing against the remaining chunks.


From  Wed Aug 28 05:06:52 2002
From: (Barry A. Warsaw)
Date: Wed, 28 Aug 2002 00:06:52 -0400
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator
References: <>
Message-ID: <>

>>>>> "TP" == Tim Peters <> writes:

    TP> [Skip Montanaro]
    >> I think that response was generated by Mailman.  SpamAssassin
    >> does nothing more than tag messages...

    TP> Like I care who the stupid party is -- the smart party was
    TP> Martijn Pieters, who now stays up all night waiting to approve
    TP> my msgs <wink>.

Tim's of course joking, because he cares intimately and deeply about
all things.  But actually it was a combination of brilliant <wink>
software.  SA tagged the message with a X-Spam-Level: header value
that the python-dev's Mailman filter was set up to catch as


From  Wed Aug 28 05:17:25 2002
From: (Martijn Pieters)
Date: Wed, 28 Aug 2002 00:17:25 -0400
Subject: [Python-Dev] FW: Your message to Python-Dev awaits moderator approval
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Tue, Aug 27, 2002 at 11:20:06PM -0400, Tim Peters wrote:
> Like I care who the stupid party is -- the smart party was Martijn Pieters,
> who now stays up all night waiting to approve my msgs <wink>.

Studying or waiting on Tim Peters, makes no difference ;)

Martijn Pieters
| Software Engineer
| Zope Corporation
| Creators of Zope

From  Wed Aug 28 05:44:50 2002
From: (Oren Tirosh)
Date: Wed, 28 Aug 2002 00:44:50 -0400
Subject: [Python-Dev] A `cogen' module - an observation
In-Reply-To: <>
References: <> <>
Message-ID: <>

On Wed, Aug 28, 2002 at 12:40:30PM +1200, Greg Ewing wrote:
> Oren Tirosh <>:
> > > Hmmm, in other words, cartesian() is a lazy version
> > > of zip().
> > 
> > Nope.
> > 
> > >>> zip([1, 2], ['a', 'b'])
> > [(1, 'a'), (2, 'b')]
> > 
> > >>> list(cartesian([1, 2], ['a', 'b']))
> > [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
> Sorry, BrainError. In that case, it's probably
> faster to use the nested loops -- unless
> cartesian() were implemented in C.

Yes, but a nested loop cannot be easily passed as an argument to a 
function. Generator functions are pretty efficient, too - yield does 
not incur the relatively high overhead of Python function calls.


From  Wed Aug 28 06:13:55 2002
From: (Oren Tirosh)
Date: Wed, 28 Aug 2002 01:13:55 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Tue, Aug 27, 2002 at 09:26:00PM -0400, Guido van Rossum wrote:
> It occurred to me that this is rather ineffecient because it invokes
> itself recursively many time (once for each element in the first
> sequence).  This version is much faster, because iterating over a
> built-in sequence (like a list) is much faster than iterating over a
> generator:
> def cartesian(*sequences):
>     if len(sequences) == 0:
>         yield []
>     else:
>         head, tail = sequences[:-1], sequences[-1]
>         for x in cartesian(*head):
>             for y in tail:
>                     yield x + [y]

My implementation from

def xcartprod(arg1, *args):
    if not args:
        for x in arg1:
            yield (x,)
    elif len(args) == 1:
        arg2 = args[0]
        for x in arg1:
            for y in arg2:
                yield x, y
    else:
        for x in arg1:
            for y in xcartprod(args[0], *args[1:]):
                yield (x,) + y

Special-casing the 2 argument case helps a lot. It brings the performance
within 50% of nested loops which means that if you actually do something
inside the loop the overhead is quite negligible.

The 'x' prefix is shared with other functions in this module: a lazy xmap,
xzip and xfilter.

> I also wonder if perhaps ``tail = list(tail)'' should be inserted
> just before the for loop, so that the arguments may be iterators as
> well.

Ahh... re-iterability again...

This is a good example of a function that *fails silently* for non 
re-iterable arguments.

Slurping the tail into a list loses the lazy efficiency of this function. 
One of the ways I've used this function is to scan combinations until a 
condition is satisfied. The iteration is always terminated before reaching 
the end. Reading ahead may waste computation and memory.
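
To make the silent failure concrete, here is a short demonstration using the
cartesian() quoted above.  The variable names are made up; nothing else is
assumed.

```python
def cartesian(*sequences):
    if len(sequences) == 0:
        yield []
    else:
        head, tail = sequences[:-1], sequences[-1]
        for x in cartesian(*head):
            for y in tail:
                yield x + [y]


# Lists can be iterated repeatedly, so the full product comes out.
full = list(cartesian([1, 2], ['a', 'b']))

# An iterator is exhausted after the first pass of the inner loop, so every
# later pass yields nothing.  No exception is raised; results just go missing.
truncated = list(cartesian([1, 2], iter(['a', 'b'])))

# Slurping the iterator into a list first restores the full product, at the
# cost of reading it all up front.
slurped = list(cartesian([1, 2], list(iter(['a', 'b']))))
```

Here truncated contains only the combinations for the first element of the
first argument, which is the behaviour an explicit re-iterability check would
turn into an error.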

All I want is something that will raise an exception if any argument but 
the first is not re-iterable (e.g. my reiter() proposal). I'll add list()
to the argument myself if I really want to. Don't try to guess what I


From  Wed Aug 28 11:45:11 2002
From: (Vinay Sajip)
Date: Wed, 28 Aug 2002 11:45:11 +0100
Subject: [Python-Dev] PEP 282 Implementation
References: <00e001c2261d$19bfc320$652b6992@alpha>  <>
Message-ID: <006601c24e7f$fcff5440$652b6992@alpha>

> In general the code looks good.  Only one style nit: I prefer
> docstrings that have a one-line summary, then a blank line, and then a
> longer description.

I will update the docstrings as per your feedback.

> There's a lot of code there!  Should it perhaps be broken up into
> different modules?  Perhaps it should become a logging *package* with
> submodules that define the various filters and handlers.

How strongly do you feel about this? I did think about doing this and in
fact the first implementation of the module was as a package. I found this a
little more cumbersome than the single-file solution, and reimplemented it as
a single file. The module is a little on the large side but the single-file
organization makes it a little easier to use.

> - Why does the FileHandler open the file with mode "a+" (and later
>   with "w+")?  The "+" makes the file readable, but I see no reason to
>   read it.  Am I missing something?

No, you're right - using "a" and "w" should work. I'll change the code to
lose the "+".

> - setRollover(): the explanation isn't 100% clear.  I *think* that you
>   always write to "app.log", and when that's full, you rename it to
>   app.log.1, and app.log.1 gets renamed to app.log.2, and so on, and
>   then you start writing to a new app.log, right?

Yes. The original implementation was different - it just closed the current
file and opened a new file app.log.n. The current implementation is slightly
slower due to the need to rename several files, but the user can tell more
easily which the latest log file is. I will update the setRollover()
docstring to indicate more clearly how it works; I'm assuming that the
current algorithm is deemed good enough.
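
A minimal sketch of the renaming scheme described above, for illustration
only. The function name `rollover` and the `backup_count` parameter are
assumptions made for this example, not the module's actual interface:

```python
import os

def rollover(base, backup_count):
    """Shift base.1 -> base.2, ..., then base -> base.1.

    The oldest backup (base.<backup_count>) is overwritten.
    The caller is expected to reopen `base` for writing afterwards.
    """
    # Work from the oldest backup down so nothing is clobbered too early.
    for i in range(backup_count - 1, 0, -1):
        src = "%s.%d" % (base, i)
        dst = "%s.%d" % (base, i + 1)
        if os.path.exists(src):
            os.replace(src, dst)
    if os.path.exists(base):
        os.replace(base, base + ".1")
```

With this layout app.log.1 is always the most recent backup, at the cost
of one rename per existing backup file on every rollover.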

> - class SocketHandler: why set yourself up for buffer overflow by
>   using only 2 bytes for the packet size?  You can use the struct
>   module to encode/decode this, BTW.  I also wonder what the
>   application for this is, BTW.

I agree about the 2-byte limit. I can change it to use struct and an integer
length. The application for encoding the length is simply to allow a
socket-based server to handle multiple events sent by SocketHandler, in the
event that the connection is kept open as long as possible and not shut down
after every event.
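
A rough sketch of such length-prefixed framing using struct, assuming a
4-byte big-endian unsigned length. The format string and the helper names
`frame` and `unframe` are illustrative choices, not the handler's actual
code:

```python
import struct

HEADER = struct.Struct(">L")  # assumed: 4-byte big-endian unsigned length

def frame(payload):
    """Prefix a bytes payload with its length."""
    return HEADER.pack(len(payload)) + payload

def unframe(buffer):
    """Split a byte buffer into complete payloads plus leftover bytes."""
    payloads = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete event; wait for more data
        payloads.append(buffer[offset + HEADER.size:end])
        offset = end
    return payloads, buffer[offset:]
```

On the sending side this would be used as sock.sendall(frame(data)); the
receiver keeps appending received bytes to a buffer and calls unframe()
on it, so several events can share one open connection.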

>   - method send(): in Python 2.2 and later, you can use the sendall()
>     socket method which takes care of this loop for you.

OK. I can update the code to use this in the case of 2.2 and later.

> - class DatagramHandler, method send(): I don't think UDP handles
>   fragmented packets very well -- if you have to break the packet up,
>   there's no guarantee that the receiver will see the parts in order
>   (or even all of them).

You're absolutely right - I wasn't thinking clearly enough about how UDP
actually works. I will replace the loop with a single sendto() call.

> - fileConfig(): Is there documentation for the configuration file?

There is some documentation in the python_logging.html file which is part of
the distribution and also on the Web at - it's in the form of comments
in an annotated logconf.ini. I have not polished the documentation in this
area as I'm not sure how much of the configuration stuff should be in the
logging module itself. Feedback I've had indicates that at least some people
object moderately strongly to having a particular configuration design
forced on them. I'd appreciate views on this.

Many thanks for the feedback,

Vinay Sajip

From  Wed Aug 28 14:27:16 2002
From: (Greg Ward)
Date: Wed, 28 Aug 2002 09:27:16 -0400
Subject: [Python-Dev] Re: The first trustworthy <wink> GBayes results
In-Reply-To: <>
References: <>
Message-ID: <>

On 27 August 2002, Tim Peters said:
> Setting this up has been a bitch.  All early attempts floundered because it
> turned out there was *some* systematic difference between the ham and spam
> archives that made the job trivial.
> The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
> 4,000 each, at random, from a python-list archive Barry put together,
> containing msgs only after SpamAssassin was put into play on
> It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.  As
> will be seen below, it's not clean enough.

One of the other perennial-seeming topics on spamassassin-devel (a list
that I follow only sporadically) is that careful manual cleaning of your
corpus is *essential*.  The concern of the main SA developers is that
spam in your non-spam folder (and vice-versa) will prejudice the genetic
algorithm that evolves SA's scores in the wrong direction.  Gut instinct
tells me the Bayesian approach ought to be more robust against this sort
of thing, but even it must have a breaking point at which misclassified
messages throw off the probabilities.

But that's entirely consistent with your statement:

> Another lesson reinforces
> one from my previous life in speech recognition:  rigorous data collection,
> cleaning, tagging and maintenance is crucial when working with statistical
> approaches, and is damned expensive to do.

On corpus collection...

> The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
> collection, at <>.  It was broken at random
> into 5 sets of 2,750 spams each.

One possibility occurs to me: we could build our own corpus by
collecting spam on for a few weeks.  Here's a rough breakdown
of mail rejected by over the last 10 days,
eyeball-estimated messages per day:

  bad RCPT                       150 - 300 [1]
  bad sender                      50 - 190 [2]
  relay denied                    20 - 180 [3]
  known spammer addr/domain       15 -  60
  8-bit chars in subject         130 - 200
  8-bit chars in header addrs     10 -  60
  banned charset in subject        5 -  50 [4]
  "ADV" in subject                 0 -   5
  no Message-Id header           100 - 400 [5]
  invalid header address syntax    5 -  50 [6]
  no valid senders in header      10 -  15 [7]
  rejected by SpamAssassin        20 -  50 [8]
  quarantined by SpamAssassin      5 -  50 [8]

[1] this includes mail accidentally sent to eg.,
    but based on scanning the reject logs, I'd say the vast majority
    is spam.  However, such messages are rejected after RCPT TO,
    so we never see the message itself.  Most of the bad recipient
    addrs are either ancient (, or fictitious (,

[2] sender verification failed, eg. someone tried to claim an
    envelope sender like foo@bogus.domain.  Usually spam, but innocent
    bystanders can be hit by DNS servers suddenly exploding (hello,  This only includes hard failures (DNS "no such
    domain"), not soft failures (DNS timeout).    

[3] I'd be leery of accepting mail that's trying to hijack as an open relay, even though that would
    be a goldmine of spam.  (OTOH, we could reject after the
    DATA command, and save the message anyways.)

[4] rejects any message with a properly MIME-encoded
    subject using any of the following charsets:
      big5, euc-kr, gb2312, ks_c_5601-1987

[5] includes viruses as well as spam (and no doubt some innocent
    false positives, although I have added exemptions for the MUA/MTA
    combinations that most commonly result in legit mail reaching without a Message-Id header, eg. KMail/qmail)

[6] eg. "To: all my friends" or "From: <>"
[7] no valid sender address in any header line -- eg. someone gives a
    valid MAIL FROM address, but then puts "From: blah@bogus.domain"
    in the headers.  Easily defeated with a "Sender" or "Reply-to"
    header.

[8] any message scoring >= 10.0 is rejected at SMTP time; any
    message scoring >= 5.0 but < 10 is saved in /var/mail/spam
    for later review

Executive summary:

  * it's a good thing we do all those easy checks before involving
    SA, or the load on the server would be a lot higher

  * give me 10 days of spam-harvesting, and I can equal Bruce
    Guenter's spam archive for 2002.  (Of course, it'll take a couple
    of days to set the mail server up for the harvesting, and a couple
    more days to clean through the ~2000 caught messages, but you get
    the idea.)

> + Mailman added distinctive headers to every message in the ham
>   archive, which appear nowhere in the spam archive.  A Bayesian
>   classifier picks up on that immediately.
> + Mailman also adds "[name-of-list]" to every Subject line.

Perhaps that spam-harvesting run should also set aside a random
selection of apparently-non-spam messages received at the same time.
Then you'd have a corpus of mail sent to the same server, more-or-less
to the same addresses, over the same period of time.

Oh, any custom corpus should also include the ~300 false positives and
~600 false negatives gathered since SA started running on in April.


From  Wed Aug 28 15:10:54 2002
From: (Paul Graham)
Date: 28 Aug 2002 14:10:54 -0000
Subject: [Python-Dev] Re: The first trustworthy <wink> GBayes results
Message-ID: <>

Bayesian filters are pretty robust in the face of corpus
contamination, if you have a threshold for the number of
occurrences of a word that you'll consider.  If you don't
do that, then yes, a single legit email in your spam 
corpus could cause your filters to reject every similar
email.

A single email could easily contain five to eight words
that never occur in any other email.  (Username, domain
name, server name, street address, etc.)  If this got
into your spam corpus by mistake, then every succeeding
email from the same person would be classified as spam.

What this means is that you may want to use slightly
different thresholds for occurrences depending on how 
much you trust the (human) classifier.  For an app to be 
used by end users, you might want to have a high threshold,
like 20 occurrences.
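
To illustrate the effect of an occurrence threshold, here is a simplified
sketch. The plain ratio used as the per-word estimate and the names
`word_spam_probs` and `min_count` are assumptions for this example, not
the exact formula of any particular filter:

```python
from collections import Counter

def word_spam_probs(spam_msgs, ham_msgs, min_count=5):
    """Estimate P(spam | word) for words seen at least min_count times.

    Each message is a list of tokens. Words below the threshold are
    left out, so one misfiled message cannot dominate a word's score.
    """
    spam_counts = Counter(w for msg in spam_msgs for w in msg)
    ham_counts = Counter(w for msg in ham_msgs for w in msg)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[word], ham_counts[word]
        if s + h < min_count:
            continue  # too rare to trust
        probs[word] = s / (s + h)
    return probs
```

A username that appears once in a misfiled message never reaches the
threshold, so it gets no score at all rather than an extreme one.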

I find from my own experience that I often misclassify
mail.  I seem to be more likely to put spam in a legit
mail folder than the reverse.  But, as you guys found,
the first result of testing your filters tends to be to
clean up such mistakes.



From  Wed Aug 28 15:27:30 2002
From: (Guido van Rossum)
Date: Wed, 28 Aug 2002 10:27:30 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Wed, 28 Aug 2002 01:13:55 EDT."
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Special-casing the 2 argument case helps a lot. It brings the performance
> within 50% of nested loops which means that if you actually do something
> inside the loop the overhead is quite negligible.

Hm, I tried that and found no difference.  Maybe I didn't benchmark right.

> Ahh... re-iterability again...
> This is a good example of a function that *fails silently* for non 
> re-iterable arguments.

This failure is hardly silent IMO: the results are totally bogus,
which is a pretty good clue that something's wrong.
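
To make the failure mode concrete, here is a small sketch using the
recursive cartesian() under discussion, written for current Python. The
variable names `full` and `short` are only for illustration:

```python
def cartesian(*sequences):
    """Recursive Cartesian product, yielding lists."""
    if len(sequences) == 0:
        yield []
    else:
        head, tail = sequences[:-1], sequences[-1]
        for x in cartesian(*head):
            for y in tail:
                yield x + [y]

# A re-iterable second argument gives all four combinations.
full = list(cartesian([1, 2], ["a", "b"]))

# A one-shot iterator is exhausted after the first pass over it.
short = list(cartesian([1, 2], iter(["a", "b"])))
```

No exception is raised in the second call; it returns two rows instead of
four, and whether that counts as "silent" is the point in dispute.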

> Slurping the tail into a list loses the lazy efficiency of this function. 
> One of the ways I've used this function is to scan combinations until a 
> condition is satisfied. The iteration is always terminated before reaching 
> the end. Reading ahead may waste computation and memory.

I don't understand.  The Cartesian product calculation has to iterate
over the second argument many times (unless you have it iterate over
the first argument many times).  So a lazy argument won't work.  Am I
missing something?

> All I want is something that will raise an exception if any argument but 
> the first is not re-iterable (e.g. my reiter() proposal). I'll add list()
> to the argument myself if I really want to. Don't try to guess what I
> meant.

Actually, I don't want to reiterate this debate.

--Guido van Rossum (home page:

From  Wed Aug 28 15:49:20 2002
From: (Oren Tirosh)
Date: Wed, 28 Aug 2002 10:49:20 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Wed, Aug 28, 2002 at 10:27:30AM -0400, Guido van Rossum wrote:
> > Ahh... re-iterability again...
> > 
> > This is a good example of a function that *fails silently* for non 
> > re-iterable arguments.
> This failure is hardly silent IMO: the results are totally bogus,
> which is a pretty good clue that something's wrong.

Sure, at the interactive prompt or very shallow code it is obvious. 

Exceptions are noisy. Anything else is silent.

> > Slurping the tail into a list loses the lazy efficiency of this function. 
> > One of the ways I've used this function is to scan combinations until a 
> > condition is satisfied. The iteration is always terminated before reaching 
> > the end. Reading ahead may waste computation and memory.
> I don't understand.  The Cartesian product calculation has to iterate
> over the second argument many times (unless you have it iterate over
> the first argument many times).  So a lazy argument won't work.  Am I
> missing something?

Even if all the arguments are re-iterable containers the recursive call
produces a lazy generator object - the cartesian product of the tail. I 
don't want to read it eagerly into a list.


From  Wed Aug 28 16:02:25 2002
From: (Guido van Rossum)
Date: Wed, 28 Aug 2002 11:02:25 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Wed, 28 Aug 2002 10:49:20 EDT."
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Even if all the arguments are re-iterable containers the recursive call
> produces a lazy generator object - the cartesian product of the tail. I 
> don't want to read it eagerly into a list.

And I wasn't proposing that.

def cartesian(*sequences):
    if len(sequences) == 0:
        yield []
    else:
        head, tail = sequences[:-1], sequences[-1]
        tail = list(tail) # <--- This is what I was proposing
        for x in cartesian(*head):
            for y in tail:
                yield x + [y]

--Guido van Rossum (home page:

From  Wed Aug 28 16:50:29 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 28 Aug 2002 11:50:29 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>

[Oren Tirosh]

> My implementation from
> [...]  Special-casing the 2 argument case helps a lot.

Good idea!  I added such speed-up code for cartesians, subsets and
permutations, and am now seeking how to do it (nicely) for combinations
and arrangements as well.

> Slurping the tail into a list loses the lazy efficiency of this function.

Generators are an elegant way to be lazy.  I agree that we are likely to
lose something if we attempt to do too much, too soon.

François Pinard

From  Wed Aug 28 17:10:36 2002
From: (Oren Tirosh)
Date: Wed, 28 Aug 2002 12:10:36 -0400
Subject: [Python-Dev] A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Wed, Aug 28, 2002 at 11:02:25AM -0400, Guido van Rossum wrote:
> > Even if all the arguments are re-iterable containers the recursive call
> > produces a lazy generator object - the cartesian product of the tail. I 
> > don't want to read it eagerly into a list.
> And I wasn't proposing that.
> def cartesian(*sequences):
>     if len(sequences) == 0:
>         yield []
>     else:
>         head, tail = sequences[:-1], sequences[-1]
>         tail = list(tail) # <--- This is what I was proposing
>         for x in cartesian(*head):
>             for y in tail:
>                     yield x + [y]

Silly me. Too much LISPthink made me automatically see "head" as the 
first item and "tail" as the rest. Now I see that the head is all but the 
last item and the tail is the last.  It was really funny - like seeing the
cup change into two faces right in front of your eyes...


From  Wed Aug 28 17:12:41 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 28 Aug 2002 12:12:41 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> > def cartesian(*sequences): [...]

> It occurred to me that this is rather inefficient because it invokes
> itself recursively many times (once for each element in the first
> sequence).  This version is much faster, because iterating over a
> built-in sequence (like a list) is much faster than iterating over a
> generator:

Granted, thanks!  I postponed optimisations for the first draft of `cogen',
but if it looks acceptable overall, we can try to get some speed out of it now.

> def cartesian(*sequences):
>     if len(sequences) == 0:
>         yield []
>     else:
>         head, tail = sequences[:-1], sequences[-1]
>         for x in cartesian(*head):
>             for y in tail:
>                     yield x + [y]

> I also wonder if perhaps ``tail = list(tail)'' should be inserted
> just before the for loop, so that the arguments may be iterators as
> well.

`cogen' does not make any special effort for protecting iterators given as
input sequences.  `cartesian' is surely not the only place where iterators
would create a problem.  Solving it as a special case for `cartesian'
only is not very nice.  Of course, we might transform all sequences into
lists everywhere in `cogen', but `list'-ing a list copies it, and I'm not
sure this would be such a good idea in the end.

Best may be to let the user explicitly transform input iterators into
lists by calling `list' explicitly on those arguments.  That might be an
acceptable compromise.

> I expect that Eric Raymond's powerset is much faster than your recursive
> subsets() [...]

Very likely, yes.  I started `cogen' with algorithms looking a bit all
alike, and did not look at speed.  OK, I'll switch.  I do not have Eric's
algorithm handy, but if I remember well, it merely mapped successive
integers to subsets by associating each bit with an element.

François Pinard

From  Wed Aug 28 17:49:51 2002
From: (Michael Chermside)
Date: Wed, 28 Aug 2002 12:49:51 -0400
Subject: [Python-Dev] Re: PEP 282 Implementation
Message-ID: <>

>> - setRollover(): the explanation isn't 100% clear.  I *think* that you
>>   always write to "app.log", and when that's full, you rename it to
>>   app.log.1, and app.log.1 gets renamed to app.log.2, and so on, and
>>   then you start writing to a new app.log, right?
> Yes. The original implementation was different - it just closed the current
> file and opened a new file app.log.n. The current implementation is slightly
> slower due to the need to rename several files, but the user can tell more
> easily which the latest log file is. I will update the setRollover()
> docstring to indicate more clearly how it works; I'm assuming that the
> current algorithm is deemed good enough.

Why not have the current logfile named "app.log", and when it's full 
rename it as "app.log.n" (for the appropriate value of n)? It's still 
easy to find the current log file, there's only one file to rename, and 
the obvious sort order will put the files in chronological order not 
reverse chronological order.

The only downside that's obvious to me is that if you just clear out old 
log files by deleting all but the latest 3, then your numbers will keep 
increasing over time. But I hardly see that as a problem... if you find 
that a filename of "app.log.143" really drives you crazy you can just 
rename the remaining logfiles the next time you clean out the log 
directory and everything's fine.
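
A sketch of this alternative, assuming "the appropriate value of n" means
one more than the highest numeric suffix currently present. The function
name `archive_current` is made up for this example:

```python
import os
import re

def archive_current(base):
    """Rename base to base.<n>, n one more than the highest suffix found."""
    directory, name = os.path.split(base)
    pattern = re.compile(re.escape(name) + r"\.(\d+)$")
    highest = 0
    for entry in os.listdir(directory or "."):
        match = pattern.match(entry)
        if match:
            highest = max(highest, int(match.group(1)))
    target = "%s.%d" % (base, highest + 1)
    os.rename(base, target)  # the only rename needed per rollover
    return target
```

One caveat: a plain lexical sort places app.log.10 before app.log.2, so
listing the files in chronological order means sorting on the numeric
suffix once n passes 9.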

-- Michael Chermside

From  Wed Aug 28 18:58:18 2002
From: (Guido van Rossum)
Date: Wed, 28 Aug 2002 13:58:18 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Wed, 28 Aug 2002 12:12:41 EDT."
References: <> <> <> <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Granted, thanks!  I postponed optimisations for the first draft of `cogen',
> but if it looks acceptable overall, we can try to get some speed out of it now.

I'm not saying that it looks good overall -- I'd like to defer to Tim,
who has used and written this kind of utility for real, and who
probably has a lot of useful feedback.  Right now, he's felled by some
kind of illness though.

> > def cartesian(*sequences):
> >     if len(sequences) == 0:
> >         yield []
> >     else:
> >         head, tail = sequences[:-1], sequences[-1]
> >         for x in cartesian(*head):
> >             for y in tail:
> >                     yield x + [y]
> > I also wonder if perhaps ``tail = list(tail)'' should be inserted
> > just before the for loop, so that the arguments may be iterators as
> > well.
> `cogen' does not make any special effort for protecting iterators
> given as input sequences.  `cartesian' is surely not the only place
> where iterators would create a problem.  Solving it as a special
> case for `cartesian' only is not very nice.  Of course, we might
> transform all sequences into lists everywhere in `cogen', but
> `list'-ing a list copies it, and I'm not sure this would be such a
> good idea in the end.

Hm.  All but the first will be iterated over many times.  In practice
the inputs will be relatively small (I can't imagine using this for
sequences with 100s of 1000s of elements).  Or you might sniff the type
and avoid the copy if you know it's a list or tuple.  Or you might use
Oren's favorite rule of thumb and listify it when iter(x) is iter(x)
(or iter(x) is x).
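
That rule of thumb can be sketched as a small helper. The name
`listify_if_iterator` is hypothetical and not part of `cogen':

```python
def listify_if_iterator(x):
    """Return list(x) for one-shot iterators, x itself otherwise."""
    # An iterator returns itself from iter(); a container returns a
    # fresh iterator each time, so the identity test tells them apart.
    if iter(x) is x:
        return list(x)
    return x
```

Lists and tuples pass through without a copy, while generators and other
iterators are read into a list once so they can be traversed repeatedly.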

> Best may be to let the user explicitly transform input iterators into
> lists by calling `list' explicitly on those arguments.  That might be an
> acceptable compromise.


> > I expect that Eric Raymond's powerset is much faster than your recursive
> > subsets() [...]
> Very likely, yes.  I started `cogen' with algorithms looking a bit all
> alike, and did not look at speed.  OK, I'll switch.  I do not have Eric's
> algorithm handy, but if I remember well, it merely mapped successive
> integers to subsets by associating each bit with an element.

def powerset(base):
    """Powerset of an iterable, yielding lists."""
    pairs = [(2**i, x) for i, x in enumerate(base)]
    for n in xrange(2**len(pairs)):
        yield [x for m, x in pairs if m&n]
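
For readers on current Python, where xrange no longer exists, the same
bit-mapping idea can be sketched as follows. This is a straightforward
port and assumes nothing beyond the code above:

```python
def powerset(base):
    """Powerset of an iterable, yielding lists."""
    # Pair each element with a distinct bit: 1, 2, 4, ...
    pairs = [(1 << i, x) for i, x in enumerate(base)]
    # Each integer n selects the elements whose bit is set in n.
    for n in range(1 << len(pairs)):
        yield [x for m, x in pairs if m & n]
```

Because the input is read into `pairs` exactly once, a one-shot iterator
works here as well as a list.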

--Guido van Rossum (home page:

From  Wed Aug 28 20:12:53 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 15:12:53 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

FYI.  After cleaning the blatant spam identified by the classifier out of my
ham corpus, and replacing it with new random msgs from Barry's corpus, the
reported false positive rate fell to about 0.2% (averaging 8 per each batch
of 4000 ham test messages).  This seems remarkable given that it's ignoring
headers, and just splitting the raw text on whitespace in total ignorance of
HTML & MIME etc.

'FREE' (all caps) moved into the ranks of best spam indicators.  The false
negative rate got reduced by a small amount, but I doubt it's a
statistically significant reduction (I'll compute that stuff later; I'm
looking for Big Things now).

Some of these false positives are almost certainly spam, and at least one is
almost certainly a virus:  these are msgs that are 100% base64-encoded, or
maximally obfuscated quoted-printable.  That could almost certainly be fixed
by, e.g., decoding encoded text.

The other false positives seem harder to deal with:

+ Brief HTML msgs from newbies.  I doubt the headers will help these
  get through, as they're generally first-time posters, and aren't
  replies to earlier msgs.  There's little positive content, while
  all elements of raw HTML have high "it's spam" probability.


Content-Description: filename="text1.txt"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Is there a version of Python with Prolog Extension??
Where can I find it if there is?


P.S. Could you please reply to the sender too.

Content-Description: filename="text1.html"
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable

        <TITLE>Prolog Extension</TITLE>
        <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
        <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
        <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
        <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
<PRE>Is there a version of Python with Prolog Extension??
Where can I find it if there is?


P.S. Could you please reply to the sender too.</PRE>


Here's how it got scored:

prob = 0.999958816093
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<BODY>') = 0.979284
prob('Prolog') = 0.01
prob('<HEAD>') = 0.97989
prob('Thanks,') = 0.0337316
prob('Prolog') = 0.01
prob('Python') = 0.01
prob('NAME=3D"GENERATOR"') = 0.99
prob('<HTML>') = 0.99
prob('</HTML>') = 0.989494
prob('</BODY>') = 0.987429
prob('Thanks,') = 0.0337316
prob('Python') = 0.01

Note that '<META' gets penalized 3 times.  More on that later.

+ Msgs talking *about* HTML, and including HTML in examples.  This one
  may be troublesome, but there are mercifully few of them.

+ Brief msgs with obnoxious employer-generated signatures.  Example:

Hi there,

I am looking for you recommendations on training courses available in the UK
on Python.  Can you help?


Vickie Mills
IS Training Analyst

Tel:    0131 245 1127
Fax:    0131 245 1550

For more information on Standard Life, visit our website   The Standard Life Assurance Company, Standard
Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
(No SZ4) and regulated by the Personal Investment Authority.  Tel: 0131 225
2552 - calls may be recorded or monitored.  This confidential e-mail is for
the addressee only.  If received in error, do not retain/copy/disclose it
without our consent and please return it to us.  We virus scan all e-mails
but are not responsible for any damage caused by a virus or alteration by a
third party after it is sent.

The scoring:

prob = 0.98654879055
prob('our') = 0.928936
prob('sent.') = 0.939891
prob('Tel:') = 0.0620155
prob('Thanks,') = 0.0337316
prob('received') = 0.940256
prob('Tel:') = 0.0620155
prob('Hi') = 0.0533333
prob('help?') = 0.01
prob('Personal') = 0.970976
prob('regulated') = 0.99
prob('Road,') = 0.01
prob('Training') = 0.99
prob('e-mails') = 0.987542
prob('Python.') = 0.01
prob('Investment') = 0.99

The brief human-written part is fine, but the longer boilerplate sig is
indistinguishable from spam.

+ The occasional non-Python conference announcement(!).  These are
  long, so I'll skip an example.  In effect, it's automated bulk email
  trying to sell you a conference, so is prone to use the language and
  artifacts of advertising.  Here's typical scoring, for the TOOLS
  Europe '99 conference announcement:

prob = 0.983583974285
prob('THE') = 0.983584
prob('Object') = 0.01
prob('Bell') = 0.01
prob('Object-Oriented') = 0.01
prob('**************************************************************') =
prob('Bertrand') = 0.01
prob('Rational') = 0.01
prob('object-oriented') = 0.01
prob('CONTACT') = 0.99
prob('**************************************************************') =
prob('innovative') = 0.99
prob('**************************************************************') =
prob('Olivier') = 0.01
prob('VISIT') = 0.99
prob('OUR') = 0.99

Note the repeated penalty for the lines of asterisks.  That segues into the
next one:

+ Artifacts of the fact that the algorithm counts multiple instances of "a word"
  multiple times.  These are baffling at first sight!  The two clearest

> > Can you create and use new files with
> Yes. But if I run db_dump on these files, it says "unexpected file type
> or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> 3.1.17)

It may be that db_dump isn't compatible with version 1.85 databse files.  I
can't remember.  I seem to recall that there was an option to build 1.85
versions of db_dump and db_load.  Check the configure options for
BerkeleyDB to find out.  (Also, while you are there, make sure that
BerkeleyDB was built the same on both of your platforms...)

> >  Try running db_verify (one of the utilities built
> > when you compiled DB) on the file and see what it tells you.
> There is no db_verify among my Berkeley DB utilities.

There should have been a bunch of them built when you compiled DB.  I've got
these:

-r-xr-xr-x  1 rd       users     343108 Dec 11 12:11 db_archive
-r-xr-xr-x  1 rd       users     342580 Dec 11 12:11 db_checkpoint
-r-xr-xr-x  1 rd       users     342388 Dec 11 12:11 db_deadlock
-r-xr-xr-x  1 rd       users     342964 Dec 11 12:11 db_dump
-r-xr-xr-x  1 rd       users     349348 Dec 11 12:11 db_load
-r-xr-xr-x  1 rd       users     340372 Dec 11 12:11 db_printlog
-r-xr-xr-x  1 rd       users     341076 Dec 11 12:11 db_recover
-r-xr-xr-x  1 rd       users     353284 Dec 11 12:11 db_stat
-r-xr-xr-x  1 rd       users     340340 Dec 11 12:11 db_upgrade
-r-xr-xr-x  1 rd       users     340532 Dec 11 12:11 db_verify

Robin Dunn
Software Craftsman     Java give you jitters?        Relax with wxPython!

Looks utterly on-topic!  So why did Robin's msg get flagged?  It's solely
due to his Unix name in the ls output(!):

prob = 0.999999999895
prob('Berkeley') = 0.01
prob('configure') = 0.01
prob('remember.') = 0.01
prob('these:') = 0.01
prob('recall') = 0.01
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99

Spammers often generate random "word-like" gibberish at the ends of msgs,
and "rd" is one of the random two-letter combos that appears in the spam
corpus.  Perhaps it would be good to ignore "words" with fewer than W
characters (to be determined by experiment).

The other example is long, an off-topic but delightful exchange between
Peter Hansen and Alex Martelli.  Here's a "typical" paragraph:

    Since it's important to use very abundant amounts of water when
    cooking pasta, the price of what is still a very cheap dish would
    skyrocket if that abundant water had to be costly bottled mineral

The scoring:

prob = 0.99
prob('"Peter') = 0.01
prob(':-)') = 0.01
prob('<>') = 0.01
prob('tasks') = 0.01
prob('drinks') = 0.01
prob('wrote') = 0.01
prob('Hansen"') = 0.01
prob('water') = 0.99
prob('water') = 0.99
prob('skyrocket') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99

Alex is drowning in his aquatic excess <wink>.

I expect that including the headers would have given these much better
chances of getting through, given Robin and Alex's posting histories.
Still, the idea of counting words multiple times is open to question, and
experiments both ways are in order.
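
To make the effect of repeated words concrete, here is a minimal sketch in
modern Python 3 (it is not the code that produced the scores above).  It
assumes a Graham-style combining rule, prod(p) / (prod(p) + prod(1-p)),
which does reproduce the 0.999999999895 reported for Robin's msg and the
0.99 reported for Alex's.  The names combine, score, unique and min_len are
invented for this illustration.

```python
def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def score(clues, unique=False, min_len=0):
    """Score (word, prob) clues, optionally counting each word only once
    and/or ignoring words shorter than min_len characters."""
    if unique:
        clues = list(dict(clues).items())
    return combine(p for word, p in clues if len(word) >= min_len)


# The 15 clues listed for Robin's msg: five hammy words, and "rd" ten times.
robin = [("Berkeley", 0.01), ("configure", 0.01), ("remember.", 0.01),
         ("these:", 0.01), ("recall", 0.01)] + [("rd", 0.99)] * 10

print(score(robin))                # duplicates counted, as in the msg above
print(score(robin, unique=True))   # each distinct word counted once
print(score(robin, min_len=3))     # "rd" ignored as too short (W = 3 here)
```

With duplicates counted the score is essentially 1; with either change it
drops far below 0.5.  A real classifier would presumably pull in other
clues to replace the ones dropped, so this only shows the direction of the
effect, not the score such an experiment would actually produce.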

+ Brief put-ons, like


It's not actually things like WAREZ that hurt here, it's more the mere fact

prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

OTOH, a lot of the Python community considered the whole autocoding thread
to be spam, and I personally could have lived without this contribution to
its legacy (alas, the autocoding thread wasn't spam, just badly off-topic).

+ Msgs top-quoting an earlier spam in its entirety.  For example,
  one msg quoted an entire Nigerian scam msg, and added just

    Aw jeez, another one of these Nigerian wire scams.  This one has
    been around for 20 years.

What's an acceptable false positive rate?  What do we get from SpamAssassin?
I expect we can end up below 0.1% here, and with a generous meaning for "not
spam", but I think *some* of these examples show that the only way to get a
0% false-positive rate is to recode spamprob like so:

    def spamprob(self, wordstream, evidence=False):
        return 0.0

That would also allow other simplifications <wink>.

From  Wed Aug 28 20:34:11 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 28 Aug 2002 15:34:11 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> Right now, [Tim] is felled by some kind of illness though.

For the mere crumbs falling off of his project table, he seems to be working
like hell, so no surprise his body asks him to slow down once in a while.

I hope he will soon be healthy and happy again!  Life is not the same when
he is away...

> All but the last will be iterated over many times.  In practice the inputs
> will be relatively small (I can't imagine using this for sequences with
> 100s of 1000s elements).

Do not under-estimate statisticians!  They are well known for burning oodles
and oodles of computer cycles, contemplating or combining data sets which
are not always small, in all ways imaginable.

Of course, even statisticians cannot afford all subsets of sets having
100 elements, or will not scan permutations of 1000 elements.  But 100s
or 1000s of elements are well within the bounds of cartesian products.

> Or you might use Oren's favorite rule of thumb and listify it when iter(x)
> is iter(x) (or iter(x) is x).

I'm a bit annoyed by the idea that `iter(x)' might require some computation
for producing an iterator, and that we immediately throw away the result.
Granted that `__iter__(self): return self' is efficient when an object is
an iterator, but nowhere is it said that `__iter__' has to be efficient
when the object is a container, and it does not shock me that some complex
containers require time to produce their iterator.  I much prefer limiting
the use of `__iter__' for when one intends to use the iterator...
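
For readers unfamiliar with the rule of thumb being quoted, a minimal
Python 3 sketch follows.  The helper name reiterable is invented here; it
is not part of `cogen' or of any proposal in this thread.

```python
def reiterable(x):
    """Return x if it can be iterated repeatedly, else a list of its items."""
    if iter(x) is x:      # only an iterator returns itself from iter()
        return list(x)    # one-shot: capture the items before they are gone
    return x              # container: the iterator just built is thrown away


data = [1, 2, 3]
print(reiterable(data) is data)               # a list is passed through
print(reiterable(n * n for n in range(4)))    # a generator is listified
```

The last comment in the function is exactly the cost objected to above: for
a container, iter(x) builds an iterator that is discarded at once.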

> def powerset(base):
>     """Powerset of an iterable, yielding lists."""
>     pairs = [(2**i, x) for i, x in enumerate(base)]
>     for n in xrange(2**len(pairs)):
>         yield [x for m, x in pairs if m&n]

Thanks!  Hmph!  This does not yield the subsets in "sorted order", like
the other `cogen' methods do, and I would prefer to keep that promise.
Hopefully, the optimisation I added this morning will make both algorithms
more comparable, speed-wise.  I should benchmark them to see.
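
The ordering difference can be seen in a short Python 3 sketch.
powerset_bits is the quoted function with xrange replaced by range.
powerset_lex is only a guess at what "sorted order" means here
(lexicographic by position in the input); the actual `cogen' code is not
shown in this message, and both names are invented for the illustration.

```python
def powerset_bits(base):
    """The quoted bit-mask powerset, ported to Python 3."""
    pairs = [(2**i, x) for i, x in enumerate(base)]
    for n in range(2**len(pairs)):
        yield [x for m, x in pairs if m & n]


def powerset_lex(base):
    """Powerset in lexicographic order of element positions (an assumption)."""
    items = list(base)

    def extend(prefix, start):
        yield list(prefix)
        for i in range(start, len(items)):
            prefix.append(items[i])
            yield from extend(prefix, i + 1)
            prefix.pop()

    return extend([], 0)


print(list(powerset_bits("abc")))
print(list(powerset_lex("abc")))
```

The first ordering visits ['b'] before ['a', 'b']; the second keeps every
subset that starts with 'a' together before moving on to 'b'.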

François Pinard

From  Wed Aug 28 20:36:57 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 15:36:57 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
Message-ID: <>

> I'm not saying that it looks good overall -- I'd like to defer to Tim,
> who has used and written this kind of utilities for real, and who
> probably has a lot of useful feedback.  Right now, he's felled by some
> kind of illness though.

I think that's over!  I'm very tired, though (couldn't get to sleep until
11, and woke up at 2 with the last, umm, episode <wink>).

This is a Big Project if done right.  I volunteered time for it a few years
ago, but there wasn't enough interest then to keep it going.  I'll attach
the last publicly-distributed module I had then, solely devoted to
combinations.  It was meant to be the first in a series, all following some
basic design decisions:

+ A Basic class that doesn't compromise on speed, typically by working
  on canonical representatives in Python list-of-int form.

+ A more general class that deals with arbitrary sequences, perhaps
  at great loss of efficiency.

+ Multiple iterators are important:  lex order is needed sometimes;
  Gray code order is an enormous help sometimes; random generation is
  vital sometimes.

+ State-of-the-art algorithms.  That's a burden for anything that
  goes into the core -- if it's a toy algorithm, users can
  do just as well on their own, and then people submit patch after
  patch that the original author isn't really qualified to judge
  (else they would have done a state-of-the-art thing to begin with).

+ The ability to override the random number generator.  Python's
  default WH generator is showing its age as machines get faster;
  it's simply not adequate anymore for long-running programs making
  heavy use of it on a fast box.  Combinatorial algorithms in
  particular do tend to make heavy use of it.  (Speaking of which,
  "someone" should look into grabbing one of the Mersenne Twister
  extensions for Python -- that's the current state of *that* art).

Ideas not worth taking:

+ Leave the chi-square algorithm out of it.  A better implementation
  would be nice to have in a statistics package, but it doesn't
  belong here regardless.

me-i'm-going-back-to-sleep-ly y'rs  - tim

Content-type: text/plain;
Content-transfer-encoding: 7BIT
Content-disposition: attachment;

# Module combgen version 0.9.1
# Released to the public domain 18-Dec-1999,
# by Tim Peters (

# Provided as-is; use at your own risk; no warranty; no promises; enjoy!

CombGen(s, k) supplies methods for generating k-combinations from s.

CombGenBasic(n, k) acts like CombGen(range(n), k) but is more

s is of any sequence type such that s supports catenation (s1 + s2)
and slicing (s[i:j]).  For example, s can be a list, tuple or string.

k is an integer such that 0 <= k <= len(s).

A k-combination of s is a subsequence C of s where len(C) = k, and for
some k integers i_0, i_1, ..., i_km1 (km1 = k-1) with
0 <= i_0 < i_1 < ... < i_km1 < len(s),

    C[0] is s[i_0]
    C[1] is s[i_1]
    C[k-1] is s[i_km1]

Note that each k-combination is a sequence of the same type as s.

Different methods generate k-combinations in lexicographic index
order, a particular "Gray code" order, or at random.

The .reset() method can be used to start over.

The .set_start(ivector) method can be used to force generation to
begin at a particular combination.

Module function comb(n, k) returns the number of combinations of n
things taken k at a time; n >= k >= 0 required.
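
As a rough Python 3 rendering of the same idea (a sketch, not the module's
code): the running product below is always an exact binomial coefficient,
so integer division never loses anything.  Python 3 has a single int type,
so no int/long chopping is needed, and math.comb (Python 3.8 and later)
serves as a cross-check.

```python
import math


def comb(n, k):
    """Number of combinations of n things taken k at a time."""
    if not n >= k >= 0:
        raise ValueError("n >= k >= 0 required: %r" % ((n, k),))
    k = min(k, n - k)          # C(n, k) == C(n, n-k); use the smaller k
    if k == 0:
        return 1
    result = n                 # C(n, 1)
    for i in range(2, k + 1):
        n -= 1
        # Before this step result == C(n0, i-1); afterwards it is C(n0, i),
        # where n0 is the original n.  The division is therefore exact.
        result = result * n // i
    return result


print(comb(52, 5), comb(52, 13))
print(all(comb(n, k) == math.comb(n, k)
          for n in range(30) for k in range(n + 1)))
```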


+ The CombGen constructor saves a reference to (not a copy of) s, so
  don't mutate s after calling CombGen.

+ For efficiency, CombGenBasic getlex and getgray return the *same*
  list each time, mutating it in place.  You must not mutate this
  list; and, if you want to save a combination's value across calls,
  copy the list.  For example,

>>> g = CombGenBasic(2, 1)
>>> x = g.getlex(); y = g.getlex()
>>> x is y  # the same!
>>> x, y    # so these print the same thing
([1], [1])
>>> g.reset()
>>> x = g.getlex()[:]; y = g.getlex()[:]
>>> x, y    # copies work as expected
([0], [1])

In contrast, CombGen methods return a new sequence each time -- but
they're slower.


Each invocation of .getlex() returns a new k-combination of s.  The
combinations are generated in lexicographic index order (for
CombGenBasic, the k-combinations themselves are in lexicographic
order).  That is, the first k-combination consists of

    s[0], s[1], ..., s[k-1]

in that order; the next of

    s[0], s[1], ..., s[k-2], s[k]

and so on until reaching

    s[len(s)-k], s[len(s)-k+1], ..., s[len(s)-1]

After all k-combinations have been generated, .getlex() returns None.


>>> g = CombGen("abc", 0).getlex
>>> g(), g()
('', None)

>>> g = CombGen("abc", 1).getlex
>>> g(), g(), g(), g()
('a', 'b', 'c', None)

>>> g = CombGenBasic(3, 2).getlex
>>> print g(), g(), g(), g()
[0, 1] [0, 2] [1, 2] None

>>> g = CombGen((0, 1, 2), 3).getlex
>>> print g(), g(), g()
(0, 1, 2) None None

>>> p = CombGenBasic(4, 2)
>>> g = p.getlex
>>> print g(), g(), g(), g(), g(), g(), g(), g()
[0, 1] [0, 2] [0, 3] [1, 2] [1, 3] [2, 3] None None
>>> p.reset()
>>> print g(), g(), g(), g(), g(), g(), g(), g()
[0, 1] [0, 2] [0, 3] [1, 2] [1, 3] [2, 3] None None
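
The lexicographic stepping rule can be sketched as a Python 3 generator.
This is an illustration rather than the module's implementation: it
rescans for the rightmost slot with "breathing room" on every step, where
the module keeps that position cached between calls.  The name
lex_combinations is invented here.

```python
from itertools import combinations


def lex_combinations(n, k):
    """Yield the k-combinations of range(n) as lists, in lexicographic order."""
    if not n >= k >= 0:
        raise ValueError("n >= k >= 0 required")
    indices = list(range(k))
    while True:
        yield list(indices)          # a copy, so callers may keep it
        # Find the rightmost slot not yet at its largest value, n-k+j.
        j = k - 1
        while j >= 0 and indices[j] == n - k + j:
            j -= 1
        if j < 0:
            return                   # every slot is maxed out: done
        indices[j] += 1
        for i in range(j + 1, k):    # reset the tail to the smallest values
            indices[i] = indices[i - 1] + 1


print(list(lex_combinations(4, 2)))
print(list(lex_combinations(4, 2)) ==
      [list(c) for c in combinations(range(4), 2)])
```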


Each invocation of .getgray() returns a triple

    C, tossed, added

    C is the next k-combination of s
    tossed is the element of s removed from the last k-combination
    added is the element of s added to the last k-combination

tossed and added are None for the first call.

Consecutive combinations returned by .getgray() differ by two elements
(one removed, one added).  If you invoke getgray() more than comb(n,k)
times, it "wraps around" and generates the same sequence again.  Note
that the last combination in the return sequence also differs by two
elements from the first combination in the return sequence.

Gray code ordering can be very useful when you're computing an
expensive function on each combination:  that exactly one element is
added and exactly one removed can often be exploited to save
recomputation for the k-1 common elements.

>>> o = CombGen("abcd", 2)
>>> for i in range(7):  # note that this wraps around
...     print o.getgray()
('ab', None, None)
('bd', 'a', 'd')
('bc', 'd', 'c')
('cd', 'b', 'd')
('ad', 'c', 'a')
('ac', 'd', 'c')
('ab', 'c', 'b')
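
The comments in the getgray source further down describe the ordering
behind this output: "alternating lexicographic" order, in which the
comparison is reversed at every other position starting with the second.
A brute-force Python 3 sketch of that definition follows; it sorts all the
combinations, which the module's O(1)-per-call method never does, and the
names altlex_key and gray_order are invented here.

```python
from itertools import combinations


def altlex_key(c):
    """Sort key for alternating lexicographic order: positions 0, 2, 4, ...
    compare normally; positions 1, 3, 5, ... compare reversed."""
    return tuple(x if i % 2 == 0 else -x for i, x in enumerate(c))


def gray_order(n, k):
    """All k-combinations of range(n), sorted in alternating lex order."""
    return sorted(combinations(range(n), k), key=altlex_key)


order = gray_order(4, 2)
start = order.index((0, 1))              # the module starts at s[0:k]
cycle = order[start:] + order[:start]
print(["".join("abcd"[i] for i in c) for c in cycle])

# Neighbours (wrapping around) differ by one element out and one in.
for a, b in zip(cycle, cycle[1:] + cycle[:1]):
    assert len(set(a) ^ set(b)) == 2
```

Rotated to start at "ab", the sorted list visits the combinations in the
same cyclic order as the getgray output shown above.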


Each invocation of .getrand() returns a random k-combination.

>>> o = CombGenBasic(1000, 6)
>>> import random
>>> random.seed(87654)
>>> o.getrand()
[69, 223, 437, 573, 722, 778]
>>> o.getrand()
[409, 542, 666, 703, 732, 847]
>>> CombGenBasic(1000000, 4).getrand()
[199449, 439831, 606885, 874530]
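
A Python 3 sketch of the approach the getrand source takes (rejection
sampling, drawing the complement when k is more than half of n), with the
random source passed in as a parameter.  The name random_combination and
the parameter name rand are assumptions of this sketch, and it will not
reproduce the numbers above, which came from an older generator.

```python
import random


def random_combination(n, k, rand=random.random):
    """Return a sorted list of k distinct ints drawn uniformly from range(n).

    rand is a no-argument function returning a float in [0.0, 1.0).
    """
    if not n >= k >= 0:
        raise ValueError("n >= k >= 0 required")
    complement = k > n // 2
    m = n - k if complement else k       # draw at most n/2 values
    chosen = set()
    while len(chosen) < m:
        chosen.add(int(rand() * n))      # a collision simply retries
    if complement:
        # We drew the values to leave out; keep everything else.
        return [i for i in range(n) if i not in chosen]
    return sorted(chosen)


rng = random.Random(87654)               # a private generator, not the global one
print(random_combination(1000, 6, rng.random))
print(random_combination(10, 8, rng.random))
```

Passing rng.random rather than relying on the module-level function is one
way to get the "ability to override the random number generator" asked for
in the message above.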

# 0,0,1    09-Dec-1999
#    initial version
# 0,0,2    10-Dec-1999
#    Sped CombGenBasic.{getlex, getgray} substantially by no longer
#    making copies of the indices; getgray is truly O(1) now.
#    A bad aspect is that they return the same list object each time
#    now, which can be confusing; e.g., had to change some examples.
#    Use CombGen instead if this bothers you -- CombGenBasic's
#    purpose in life is to be lean & mean.
#    Removed the restriction on mixing calls to CombGenBasic's
#    getlex and getgray; not sure it's useful, but it was irksome.
#    Changed __findj to return a simpler result.  This is less useful
#    for getgray, but now getlex can exploit it too (there are no
#    longer any Python-level loops in CombGenBasic's getlex; there's
#    an implied C-level loop (via "range"), and it's in the nature of
#    lex order that this can't be removed).
#    Added some exhaustive tests for getlex, and finger verification.
# 0,9,1    18-Dec-1999
#    Changed _testrand to compute and print chi-square statistics,
#    and probabilities, because one of _testrand's outputs didn't
#    "look random" to me.  Indeed, it's got a poor chi-square value!
#    But sometimes that *should* happen, and it does not appear to
#    be happening more often than expected.

__version__ = 0, 9, 1

def _chop(n):
    """n -> int if it fits, else long."""

    try:
        return int(n)
    except OverflowError:
        return n

def comb(n, k):
    """n, k -> number of combinations of n items, k at a time.

    n >= k >= 0 required.

    >>> for i in range(7):
    ...     print "comb(6, %d) ==" % i, comb(6, i)
    comb(6, 0) == 1
    comb(6, 1) == 6
    comb(6, 2) == 15
    comb(6, 3) == 20
    comb(6, 4) == 15
    comb(6, 5) == 6
    comb(6, 6) == 1
    >>> comb(52, 5)   # number of poker hands
    2598960
    >>> comb(52, 13)  # number of bridge hands
    635013559600L
    """

    if not n >= k >= 0:
        raise ValueError("n >= k >= 0 required: " + `n, k`)
    if k > (n >> 1):
        k = n-k
    if k == 0:
        return 1
    result = long(n)
    i = 2
    n, k = n-1, k-1
    while k:
        # assert (result * n) % i == 0
        result = result * n / i
        i = i+1
        k = k-1
        n = n-1
    return _chop(result)

import random

class CombGenBasic:

    def __init__(self, n, k):
        self.n, self.k = n, k
        if not n >= k >= 0:
            raise ValueError("n >= k >= 0 required:" + `n, k`)

    def reset(self):
        """Restore state to that immediately after construction."""
        # The first result is the same for either lexicographic or
        # Gray code generation.

    # __findj is used only to initialize self.j for getlex and
    # getgray.  It returns the largest j such that slot j has
    # "breathing room"; that is, such that slot j isn't at its largest
    # possible value (n-k+j).  j is -1 if no such index exists.
    # After initialization, getlex and getgray incrementally update
    # this more efficiently.

    def __findj(self, v):
        n, k = self.n, self.k
        assert len(v) == k
        j = k-1
        while j >= 0 and v[j] == n-k+j:
            # v[j] is at its largest possible value
            j = j-1
        return j

    def getlex(self):
        """Return next (in lexicographic order) k-combination.

        Return None if all possibilities have been generated.

        Caution:  getlex returns the *same* list each time, mutated
        in place.  Don't mutate it yourself, or save a reference to it
        (the next call will mutate its contents; make a copy if you
        need to save the value across calls).
        """

        indices, n, k, j = self.indices, self.n, self.k, self.j
        if self.firstcall:
            self.firstcall = 0
            return indices
        if j < 0:
            return None
        new = indices[j] = indices[j] + 1
        if j+1 == k:
            if new + 1 == n:
                j = j-1
        else:
            if new + 1 < indices[j+1]:
                indices[j:] = range(new, new + k - j)
                j = k-1
            else:
                j = j-1

        self.j = j
        # assert j == self.__findj(indices)
        return indices

    def getgray(self):
        """Return next (c, tossed, added) triple.

        c is the next k-combination in a particular Gray code order.
        tossed is the element of range(n) removed from the last
        combination.
        added is the element of range(n) added to the last
        combination.

        tossed and added are None if this is the first call, or on
        every call if there is only one k-combination.  Else tossed !=
        added, and neither is None.

        Caution:  getgray wraps around if you invoke it more than
        comb(n, k) times.

        Caution:  getgray returns the *same* list each time, mutated
        in place.  Don't mutate it yourself, or save a reference to it
        (the next call will mutate its contents; make a copy if you
        need to save the value across calls).
        """

        # The popular routine in Nijenhuis & Wilf's "Combinatorial
        # Algorithms" is exceedingly complicated (although trivial
        # to program with recursive generators!).
        # Instead I'm using a variation of Algorithm A3 from the paper
        # "Loopless Gray Code Algorithms", by T.A. Jenkyns (Brock
        # University, Ontario).  The code is much simpler, and,
        # because it's loop-free, takes O(1) time on each call (not
        # just amortized over the whole sequence).
        # Because the paper doesn't yet seem to be well known, here's
        # the idea:  Modify the definition of lexicographic ordering
        # in a funky way:  in the element comparisons, replace "<" by
        # ">" in every other element position starting at the 2nd.
        # IOW, and skipping end cases, sequence s is "less than"
        # sequence t iff their elements are equal up until index
        # position i, and then s[i] < t[i] if i is even, or s[i] >
        # t[i] if i is odd.  Jenkyns calls this "alternating
        # lexicographic" order.  It's clear that this defines a total
        # ordering.  What isn't obvious is that it's also a Gray code
        # ordering!  Very pretty.
        # Modifications made here to A3 are minor, and include
        # switching from 1-based to 0-based; allowing for trivial
        # sequences; allowing for wrap-around; returning the "tossed"
        # and "added" elements; starting the generation at an
        # arbitrary k-combination; and sharing a finger (self.j) with
        # the getlex method.

        indices, n, k, j = self.indices, self.n, self.k, self.j
        if self.firstcall:
            self.firstcall = 0
            return indices, None, None

        # Slide over to first slot that *may* be able to move down.
        # Note that this leaves odd j alone (including -1!), and may
        # make j equal to k.
        j = j | 1

        if j == k:
            # k is odd and so indices[-1] "wants to move up", and
            # indices[-1] < n-1 so it *can* move up.
            tossed = indices[-1]
            added = indices[-1] = tossed + 1
            j = j-1
            if added == n-1:
                j = j-1
        elif j < 0:
            # indices has the last value in alt-lex order, e.g.
            # [4, 5, 6, 7]; wrap around to the first value, e.g.
            # [0, 5, 6, 7].
            assert indices == range(n-k, n)
            if k and indices[0]:
                tossed = indices[0]
                added = indices[0] = 0
                j = 0
            else:
                # comb(n, k) is 1 -- this is a trivial sequence.
                tossed = added = None
        else:
            # 0 < j < k (note that 0 < j because j is odd).
            # Want to move this slot down (again because j is odd).
            atj = indices[j]
            if indices[j-1] + 1 == atj:
                # can't move it down; move preceding up
                tossed = atj - 1    # the value in indices[j-1]
                indices[j-1] = atj
                added = indices[j] = n-k+j
                j = j-1
                if atj + 1 == added:
                    j = j-1
            else:
                # can move it down
                tossed = atj
                added = indices[j] = atj - 1
                if j+1 < k:
                    tossed = indices[j+1]
                    indices[j+1] = atj
                    j = j+1

        self.j = j
        # assert j == self.__findj(indices)
        return indices, tossed, added

    def set_start(self, start):
        """Force .getlex() or .getgray() to start at given value.

        start is a vector of k unique integers in range(n), where
        k and n were passed to the CombGenBasic constructor.

        The vector is sorted in increasing order, and is used as
        the next k-combination to be returned by .getlex() or
        .getgray().

        >>> gen = CombGenBasic(3, 2)
        >>> for i in range(4):
        ...     print gen.getgray()
        ([0, 1], None, None)
        ([1, 2], 0, 2)
        ([0, 2], 1, 0)
        ([0, 1], 2, 1)

        >>> gen.set_start([0, 2])
        >>> for i in range(4):
        ...     print gen.getgray()
        ([0, 2], None, None)
        ([0, 1], 2, 1)
        ([1, 2], 0, 2)
        ([0, 2], 1, 0)
        """

        if len(start) != self.k:
            raise ValueError("start vector not of length " + `self.k`)
        indices = start[:]
        seen = {}
        # Verify the vector makes sense.
        for i in indices:
            if not 0 <= i < self.n:
                raise ValueError("start vector contains element "
                                 "not in 0.." + `self.n-1` +
                                 ": " + `i`)
            if seen.has_key(i):
                raise ValueError("start vector contains duplicate "
                                 "element: " + `i`)
            seen[i] = 1
        self.indices = indices
        self.j = self.__findj(indices)
        self.firstcall = 1

    def getrand(self, random=random.random):
        """Return a k-combination at random.

        Optional arg random specifies a no-argument function that
        returns a random float in [0., 1.).  By default, random.random
        is used.
        """

        # The trap to avoid is doing O(n) work when k is much less
        # than n.  Letting m = min(k, n-k), we actually do Python work
        # of O(m), and C-level work of O(m log m) for a sort.  In
        # addition, O(k) work is required to build the final result,
        # but at worst O(m) of that work is done at Python speed.

        n, k = self.n, self.k
        complement = 0
        if k > n/2:
            # Generate the values *not* in the combination.
            complement = 1
            k = n-k

        # Generate k distinct random values.
        result = {}
        for i in xrange(k):
            # The expected # of times thru the next loop is n/(n-i).
            # Since i < k <= n/2, n-i > n/2, so n/(n-i) < 2 and is
            # usually closer to 1:  on average, this succeeds very
            # quickly!
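            while 1:
                candidate = int(random() * n)
                if not result.has_key(candidate):
                    result[candidate] = 1
                    break
        result = result.keys()
        # The complement scan below needs the values in increasing order.
        result.sort()
        if complement:
            # We want everything in range(n) that's *not* in result.
            avoid = result
            result = []
            start = 0
            for limit in avoid:
                result.extend(range(start, limit))
                start = limit + 1
            # Everything after the last avoided value.
            result.extend(range(start, n))
        return result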

class CombGen:

    def __init__(self, seq, k):
        n = len(seq)
        if not 0 <= k <= n:
            raise ValueError("k must be in 0.." + `n` + ": " + `k`)
        self.seq = seq
        self.base = CombGenBasic(n, k)

    def reset(self):
        """Restore state to that immediately after construction."""

    def getlex(self):
        """Return next (in lexicographic index order) k-combination.

        Return None if all possibilities have been generated.
        """

        indices = self.base.getlex()
        if indices is None:
            return None
        else:
            return self.__indices2seq(indices)

    def getgray(self):
        """Return next (c, tossed, added) triple.

        c is the next k-combination in a particular Gray code order.
        tossed is the element of s removed from the last combination.
        added is the element of s added to the last combination.

        Caution:  getgray wraps around if you invoke it more than
        comb(len(s), k) times.
        """

        indices, tossed, added = self.base.getgray()
        if tossed is None:
            return (self.__indices2seq(indices), None, None)
        else:
            return (self.__indices2seq(indices),
                    self.seq[tossed],
                    self.seq[added])

    def set_start(self, start):
        """Force .getlex() or .getgray() to start at given value.

        start is a vector of k unique integers in range(len(s)), where
        k and s were passed to the CombGen constructor.

        The vector is sorted in increasing order, and is used as a
        vector of indices (into s) for the next k-combination to be
        returned by .getlex() or .getgray().

        >>> gen = CombGen("abc", 2)
        >>> for i in range(4):
        ...     print gen.getgray()
        ('ab', None, None)
        ('bc', 'a', 'c')
        ('ac', 'b', 'a')
        ('ab', 'c', 'b')

        >>> gen.set_start([0, 2]) # start with "ac"
        >>> for i in range(4):
        ...     print gen.getgray()
        ('ac', None, None)
        ('ab', 'c', 'b')
        ('bc', 'a', 'c')
        ('ac', 'b', 'a')

        >>> gen.set_start([0, 2]) # ditto
        >>> print gen.getlex(), gen.getlex(), gen.getlex()
        ac bc None
        """


    def getrand(self, random=random.random):
        """Return a k-combination at random.

        Optional arg random specifies a no-argument function that
        returns a random float in [0., 1.).  By default, random.random
        is used.
        """

        return self.__indices2seq(self.base.getrand(random))

    def __indices2seq(self, ivec):
        assert len(ivec) == self.base.k, "else internal error"
        seq = self.seq
        result = seq[0:0]   # an empty sequence of the proper type
        for i in ivec:
            result = result + seq[i:i+1]
        return result

del random

# Testing.

def _verifycomb(n, k, comb, inbase, baseobj=None):
    if len(comb) != k:
        print "OUCH!", comb, "should have length", k

    # verify it's an increasing sequence of baseseq elements
    lastelt = None
    for elt in comb:
        if not inbase(elt):
            print "OUCH!", elt, "not in base sequence", n, k, comb
        if not lastelt < elt:
            print "OUCH!", elt, ">=", lastelt, n, k, comb
        lastelt = elt

    if baseobj:
        # verify search finger is correct
        cachedj = baseobj.j
        truej = baseobj._CombGenBasic__findj(baseobj.indices)
        if cachedj != truej:
            print "OUCH! cached j", cachedj, "!= true j", truej, \
                  n, k, comb

def _testnk_gray(n, k):
    start = "abcdefghijklmnopqrstuvwxyz"[:n]
    def inbase(elt, start=start):
        return elt in start
    o = CombGen(start, k)
    c = comb(n, k)
    seen = {}
    last, lastlist = None, None
    for i in xrange(c+1):
        this, tossed, added = o.getgray()
        _verifycomb(n, k, this, inbase, o.base)

        if seen.has_key(this) and i < c:
            print "OUCH!", this, "seen before at", seen[this], n, k
        seen[this] = i

        thislist = list(this)
        if (tossed is None) != (added is None):
            print "OUCH! tossed and added None clash", tossed, \
                  added, n, k, last, this
        if last is None:
            last, lastlist = this, thislist
        if tossed is not None:
            if tossed == added:
                print "OUCH! tossed == added", tossed, added, \
                      n, k, last, this
        elif c != 1:
            print "OUCH! tossed None but comb(n, k) not 1", \
                  c, tossed, added, n, k, last, this
        if lastlist != thislist:
            print "OUCH! does not compute", n, k, tossed, added, \
                  last, this
        last, lastlist = this, thislist
    if last != start[:k]:
        print "OUCH! didn't wrap around", n, k, last, this

# getgray is especially delicate, so hammer on it.

def _testgray():
    """
    >>> _testgray()
    testing getgray 0
    testing getgray 1
    testing getgray 2
    testing getgray 3
    testing getgray 4
    testing getgray 5
    testing getgray 6
    testing getgray 7
    testing getgray 8
    testing getgray 9
    testing getgray 10
    testing getgray 11
    testing getgray 12
    """
    for n in range(13):
        print "testing getgray", n
        for k in range(n+1):
            _testnk_gray(n, k)

# getlex is easier.

def _testnk_lex(n, k):
    start = "abcdefghijklmnopqrstuvwxyz"[:n]
    def inbase(elt, start=start):
        return elt in start
    o = CombGen(start, k)
    c = comb(n, k)
    last = None
    for i in xrange(c):
        this = o.getlex()
        _verifycomb(n, k, this, inbase, o.base)
        if not last < this:
            print "OUCH! not lexicographic", last, this, n, k
        last = this
    this = o.getlex()
    if this is not None:
        print "OUCH! should have returned None", n, k, this

def _testlex():
    """
    >>> _testlex()
    testing getlex 0
    testing getlex 1
    testing getlex 2
    testing getlex 3
    testing getlex 4
    testing getlex 5
    testing getlex 6
    testing getlex 7
    testing getlex 8
    """
    for n in range(9):
        print "testing getlex", n
        for k in range(n+1):
            _testnk_lex(n, k)

import math
_math = math
del math

# This is a half-assed implementation, prone to overflow and/or
# underflow given "large" x or v.  If they're both <= a few hundred,
# though, it's quite accurate.  The main advantage is that it's
# self-contained.

def _chi_square_distrib(x, v):
    """x, v -> return probability that chi-square statistic <= x.

    v is the number of degrees of freedom, an integer >= 1.

    x is a non-negative float or int.
    """

    if x < 0:
        raise ValueError("x must be >= 0: " + `x`)
    if v < 1:
        raise ValueError("v must be >= 1: " + `v`)
    if v != int(v):
        raise TypeError("v must be an integer: " + `v`)
    if x == 0:
        return 0.0

    # (x/2)**(v/2) / gamma((v+2)/2) * exp(-x/2) *
    # (1 + sum(i=1 to inf, x**i/prod(j=1 to i, v+2*j)))

    # Alas, for even moderately large x or v, this is numerically
    # intractable.  But the mean of the distribution is v, so in
    # practice v will likely be "close to" x.  Rewrite the first
    # line as
    # (x/2/e)**(v/2) / gamma((v+2)/2) * exp(v/2-x/2)
    # Now exp is much less likely to over or underflow.  The power is
    # still a problem, though, so we compute
    #    (x/2/e)**(v/2) / gamma((v+2)/2)
    # via repeated multiplication.
    x = float(x)
    a = x / 2 / _math.exp(1)
    v = float(v)
    v2 = v/2
    if int(v2) * 2 == v:
        # v is even
        base = 1.0
        i = 1.0
    else:
        # v is odd, so the gamma bottoms out at gamma(.5) = sqrt(pi),
        # and we need to get a sqrt(a) factor into the numerator
        # (since v2 "ends with" .5).
        base = 1.0 / _math.sqrt(a *  _math.pi)
        i = 0.5
    while i <= v2:
        base = base * (a / i)
        i = i + 1.0
    base = base * _math.exp(v2 - x/2)
    # Now do the infinite sum.
    oldsum = None
    sum = base
    while oldsum != sum:
        oldsum = sum
        v = v + 2.0
        base = base * (x / v)
        sum = sum + base
    return sum

def _chisq(observed, expected):
    n = len(observed)
    assert n == len(expected)
    sum = 0.0
    for i in range(n):
        e = float(expected[i])
        sum = sum + (observed[i] - e)**2 / e
    return sum, _chi_square_distrib(sum, n-1)

def _testrand():
    """
    >>> _testrand()
    random 0 combs of abcde
    random 1 combs of abcde
    a 99
    b 106
    c 98
    d 99
    e 98
    probability[chisq <= 0.46] = 0.0227
    random 2 combs of abcde
    ab 100
    ac 115
    ad 111
    ae 98
    bc 98
    bd 103
    be 95
    cd 84
    ce 100
    de 96
    probability[chisq <= 6.6] = 0.321
    random 3 combs of abcde
    abc 83
    abd 119
    abe 86
    acd 88
    ace 103
    ade 94
    bcd 107
    bce 101
    bde 112
    cde 107
    probability[chisq <= 12.78] = 0.827
    random 4 combs of abcde
    abcd 86
    abce 99
    abde 113
    acde 101
    bcde 101
    probability[chisq <= 3.68] = 0.549
    random 5 combs of abcde
    abcde 100
    """

    def drive(s, k):
        print "random", k, "combs of", s
        o = CombGen(s, k)
        g = o.getrand
        n = len(s)
        def inbase(elt, s=s):
            return elt in s
        count = {}
        c = comb(len(s), k)
        for i in xrange(100 * c):
            x = g()
            _verifycomb(n, k, x, inbase)
            count[x] = count.get(x, 0) + 1
        items = count.items()
        for x, i in items:
            print x, i
        if c > 1:
            observed = count.values()
            if len(observed) < c:
                observed.extend([0] * (c - len(observed)))
            x, p = _chisq(observed, [100]*c)
            print "probability[chisq <= %g] = %.3g" % (x, p)

    for k in range(6):
        drive("abcde", k)

__test__ = {"_testgray": _testgray,
            "_testlex":  _testlex,
            "_testrand": _testrand}

def _test():
    import doctest, combgen
    return doctest.testmod(combgen)

if __name__ == "__main__":
    _test()

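As a sanity check on the chi-square helper in the module above, here is a
minimal sketch that re-states the same series in modern Python 3 and compares
it with two closed forms (v = 2 gives 1 - exp(-x/2); v = 1 gives
erf(sqrt(x/2))).  The name chi_square_cdf and the choice of cross-checks are
assumptions made for illustration; they are not part of the posted module.

```python
import math

def chi_square_cdf(x, v):
    """Probability that a chi-square statistic with v degrees of freedom is <= x."""
    if x < 0:
        raise ValueError("x must be >= 0: %r" % (x,))
    if v < 1 or v != int(v):
        raise ValueError("v must be an integer >= 1: %r" % (v,))
    if x == 0:
        return 0.0
    x = float(x)
    v = float(v)
    a = x / 2 / math.e
    v2 = v / 2
    if int(v2) * 2 == v:
        # v is even: the gamma term is a plain factorial
        base, i = 1.0, 1.0
    else:
        # v is odd: gamma bottoms out at gamma(.5) = sqrt(pi)
        base, i = 1.0 / math.sqrt(a * math.pi), 0.5
    # (x/2/e)**(v/2) / gamma((v+2)/2) via repeated multiplication
    while i <= v2:
        base *= a / i
        i += 1.0
    base *= math.exp(v2 - x / 2)
    # Sum the series until adding another term no longer changes the total
    total, previous = base, None
    while previous != total:
        previous = total
        v += 2.0
        base *= x / v
        total += base
    return total

# Closed forms, used only as cross-checks
print(chi_square_cdf(2.0, 2), 1.0 - math.exp(-1.0))
print(chi_square_cdf(1.0, 1), math.erf(math.sqrt(0.5)))
# Five equally likely outcomes give 4 degrees of freedom, as in the doctest
print("%.3g" % chi_square_cdf(0.46, 4))
```

The last line corresponds to the "probability[chisq <= 0.46]" entry in the
doctest output above.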

From  Wed Aug 28 20:41:00 2002
From: (Guido van Rossum)
Date: Wed, 28 Aug 2002 15:41:00 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Wed, 28 Aug 2002 15:34:11 EDT."
References: <> <> <> <> <> <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Of course, even statisticians cannot afford all subsets of sets having
> 100 elements, or will not scan permutations of 1000 elements.  But 100s
> or 1000s elements are well within the bounds of cartesian products.

Yes, and there the cost of an extra list() copy is negligible
(allocate and copy 4K bytes).

> > Or you might use Oren's favorite rule of thumb and listify it when iter(x)
> > is iter(x) (or iter(x) is x).
> I'm a bit annoyed by the idea that `iter(x)' might require some computation
> for producing an iterator, and that we immediately throw away the result.
> Granted that `__iter__(self): return self' is efficient when an object is
> an iterator, but nowhere it is said that `__iter__' has to be efficient
> when the object is a container, and it does not shock me that some complex
> containers require time to produce their iterator.  I much prefer limiting
> the use of `__iter__' for when one intends to use the iterator...

Yes, that's why I prefer to just make a list() copy.

> > def powerset(base):
> >     """Powerset of an iterable, yielding lists."""
> >     pairs = [(2**i, x) for i, x in enumerate(base)]
> >     for n in xrange(2**len(pairs)):
> >         yield [x for m, x in pairs if m&n]
> Thanks!  Hmph!  This does not yield the subsets in "sorted order", like
> the other `cogen' methods do, and I would prefer to keep that promise.

That may be a matter of permuting the bits?
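
For comparison, here is a rough sketch of the two orderings under discussion,
written in modern Python 3.  It assumes "sorted order" means lexicographic
order of the subsets by position in the base sequence; the function names are
made up for illustration, and the recursive version sidesteps the
bit-permutation question rather than answering it.

```python
def powerset_bitmask(base):
    """Subsets in counting order of the bitmask, as in the quoted generator."""
    items = list(base)
    for n in range(2 ** len(items)):
        yield [x for i, x in enumerate(items) if n & (1 << i)]

def powerset_lex(base):
    """Subsets in lexicographic order of element positions."""
    items = list(base)

    def extend(prefix, start):
        # Emit the prefix itself, then every extension of it, in order.
        yield prefix
        for i in range(start, len(items)):
            yield from extend(prefix + [items[i]], i + 1)

    return extend([], 0)

print(list(powerset_bitmask("abc")))
print(list(powerset_lex("abc")))
```

Both produce the same 2**n subsets; only the order differs.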

> Hopefully, the optimisation I added this morning will make both algorithms
> more comparable, speed-wise.  I should benchmark them to see.

Yes!  Nothing beats a benchmark as an eye-opener.

--Guido van Rossum (home page:

From  Wed Aug 28 20:42:48 2002
From: (Greg Ward)
Date: Wed, 28 Aug 2002 15:42:48 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
References: <> <>
Message-ID: <>

On 28 August 2002, Tim Peters said:
> What's an acceptable false positive rate?

Speaking as one of the people who reviews suspected spam for
and rescues false positives, I would say that the more relevant figure
is: how much suspected spam do I have to review every morning?  < 10
messages would be peachy; right now it's around 5-20 messages per day.

Currently there are probably 1-3 FPs per day, although on a bad day
there can be 5-10.  (Eg. on 2002-08-21, six mailman-users posts from the
same guy were all caught, mainly because his ISP added X-AntiAbuse, and
his messages were multipart/alternative with unwrapped plain text.  This
is a perfect example of SpamAssassin screwing up royally.)  1-3 FPs/day
I can live with, but the real burden is the manual review: I'd much
rather have 5 FPs in a pool of 10 suspects than 1 FP out of 100

> What do we get from SpamAssassin?

Recall the stats I posted this morning; the bulk of spam is in Chinese
or Korean, and I have things set up so SpamAssassin never even sees it.
I think the only way to meaningfully answer this question is to stash
*everything* receives for a day or 10, spam and
otherwise, and run it all through SA.


From  Wed Aug 28 20:47:44 2002
From: (Paul Graham)
Date: 28 Aug 2002 19:47:44 -0000
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
Message-ID: <>

Don't count words multiple times, and you'll probably
get fewer false positives.  That's the main reason I
don't do it-- because it magnifies the effect of some 
random word like water happening to have a big spam
probability. (Incidentally, why so high?  In my db it's 
only 0.3930784.)  --pg
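
To make the repeated-token effect concrete, the sketch below recombines the
fifteen clue probabilities listed for the "Prolog Extension" message in the
quoted text that follows.  It assumes the combining rule from "A Plan for
Spam" (product of p, divided by product of p plus product of 1-p), which
appears to reproduce the overall figure reported for that message.  The helper
name combine and the de-duplication step are illustrative only, and the code is
modern Python 3, not the classifier under discussion.

```python
def combine(probs):
    """Combine per-token spam probabilities into one score."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

# The fifteen clues listed for the quoted message, repeats included.
scored = [
    ("<META", 0.957529), ("<META", 0.957529), ("<META", 0.957529),
    ("<BODY>", 0.979284), ("Prolog", 0.01), ("<HEAD>", 0.97989),
    ("Thanks,", 0.0337316), ("Prolog", 0.01), ("Python", 0.01),
    ('NAME=3D"GENERATOR"', 0.99), ("<HTML>", 0.99), ("</HTML>", 0.989494),
    ("</BODY>", 0.987429), ("Thanks,", 0.0337316), ("Python", 0.01),
]

with_repeats = combine([p for _, p in scored])

# Count each distinct token once (hypothetical variant of the scorer).
unique = dict(scored)
once_each = combine(unique.values())

print(len(scored), with_repeats)
print(len(unique), once_each)
```

Dropping repeats here also drops the repeated low-probability clues, and a
real scorer would refill the freed slots with the next most extreme tokens, so
this shows only the mechanics, not the outcome of the experiment.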

--Tim Peters wrote:
> FYI.  After cleaning the blatant spam identified by the classifier out of my
> ham corpus, and replacing it with new random msgs from Barry's corpus, the
> reported false positive rate fell to about 0.2% (averaging 8 per each batch
> of 4000 ham test messages).  This seems remarkable given that it's ignoring
> headers, and just splitting the raw text on whitespace in total ignorance of
> HTML & MIME etc.
> 'FREE' (all caps) moved into the ranks of best spam indicators.  The false
> negative rate got reduced by a small amount, but I doubt it's a
> statistically significant reduction (I'll compute that stuff later; I'm
> looking for Big Things now).
> Some of these false positives are almost certainly spam, and at least one is
> almost certainly a virus:  these are msgs that are 100% base64-encoded, or
> maximally obfuscated quoted-printable.  That could almost certainly be fixed
> by, e.g., decoding encoded text.
> The other false positives seem harder to deal with:
> + Brief HTML msgs from newbies.  I doubt the headers will help these
>   get through, as they're generally first-time posters, and aren't
>   replies to earlier msgs.  There's little positive content, while
>   all elements of raw HTML have high "it's spam" probability.
> Example:
> """
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.txt"
> Content-Type: text/plain; charset=ISO-8859-1
> Content-Transfer-Encoding: quoted-printable
> Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
> Thanks,
> Luis.
> P.S. Could you please reply to the sender too.
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.html"
> Content-Type: text/html
> Content-Transfer-Encoding: quoted-printable
> <HTML>
> <HEAD>
>         <TITLE>Prolog Extension</TITLE>
>         <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
>         <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
>         <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
>         <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
> </HEAD>
> <BODY>
> <PRE>Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
> Thanks,
> Luis.
> P.S. Could you please reply to the sender too.</PRE>
> </BODY>
> </HTML>
> --------------=_4D4800B7C99C4331D7B8--"""
> """
> Here's how it got scored:
> prob = 0.999958816093
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<BODY>') = 0.979284
> prob('Prolog') = 0.01
> prob('<HEAD>') = 0.97989
> prob('Thanks,') = 0.0337316
> prob('Prolog') = 0.01
> prob('Python') = 0.01
> prob('NAME=3D"GENERATOR"') = 0.99
> prob('<HTML>') = 0.99
> prob('</HTML>') = 0.989494
> prob('</BODY>') = 0.987429
> prob('Thanks,') = 0.0337316
> prob('Python') = 0.01
> Note that '<META' gets penalized 3 times.  More on that later.
> + Msgs talking *about* HTML, and including HTML in examples.  This one
>   may be troublesome, but there are mercifully few of them.
> + Brief msgs with obnoxious employer-generated signatures.  Example:
> """
> Hi there,
> I am looking for you recommendations on training courses available in the UK
> on Python.  Can you help?
> Thanks,
> Vickie Mills
> IS Training Analyst
> Tel:    0131 245 1127
> Fax:    0131 245 1550
> E-mail:
> For more information on Standard Life, visit our website
>   The Standard Life Assurance Company, Standard
> Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
> (No SZ4) and regulated by the Personal Investment Authority.  Tel: 0131 225
> 2552 - calls may be recorded or monitored.  This confidential e-mail is for
> the addressee only.  If received in error, do not retain/copy/disclose it
> without our consent and please return it to us.  We virus scan all e-mails
> but are not responsible for any damage caused by a virus or alteration by a
> third party after it is sent.
> """
> The scoring:
> prob = 0.98654879055
> prob('our') = 0.928936
> prob('sent.') = 0.939891
> prob('Tel:') = 0.0620155
> prob('Thanks,') = 0.0337316
> prob('received') = 0.940256
> prob('Tel:') = 0.0620155
> prob('Hi') = 0.0533333
> prob('help?') = 0.01
> prob('Personal') = 0.970976
> prob('regulated') = 0.99
> prob('Road,') = 0.01
> prob('Training') = 0.99
> prob('e-mails') = 0.987542
> prob('Python.') = 0.01
> prob('Investment') = 0.99
> The brief human-written part is fine, but the longer boilerplate sig is
> indistinguishable from spam.
> + The occasional non-Python conference announcement(!).  These are
>   long, so I'll skip an example.  In effect, it's automated bulk email
>   trying to sell you a conference, so is prone to use the language and
>   artifacts of advertising.  Here's typical scoring, for the TOOLS
>   Europe '99 conference announcement:
> prob = 0.983583974285
> prob('THE') = 0.983584
> prob('Object') = 0.01
> prob('Bell') = 0.01
> prob('Object-Oriented') = 0.01
> prob('**************************************************************') =
> 0.99
> prob('Bertrand') = 0.01
> prob('Rational') = 0.01
> prob('object-oriented') = 0.01
> prob('CONTACT') = 0.99
> prob('**************************************************************') =
> 0.99
> prob('innovative') = 0.99
> prob('**************************************************************') =
> 0.99
> prob('Olivier') = 0.01
> prob('VISIT') = 0.99
> prob('OUR') = 0.99
> Note the repeated penalty for the lines of asterisks.  That segues into the
> next one:
> + Artifacts of the algorithm counting multiple instances of "a word"
>   multiple times.  These are baffling at first sight!  The two clearest
>   examples:
> """
> > > Can you create and use new files with
> >
> > Yes. But if I run db_dump on these files, it says "unexpected file type
> > or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> > 3.1.17)
> >
> It may be that db_dump isn't compatible with version 1.85 databse files.  I
> can't remember.  I seem to recall that there was an option to build 1.85
> versions of db_dump and db_load.  Check the configure options for
> BerkeleyDB to find out.  (Also, while you are there, make sure that
> BerkeleyDB was built the same on both of your platforms...)
> >
> > >  Try running db_verify (one of the utilities built
> > > when you compiled DB) on the file and see what it tells you.
> >
> > There is no db_verify among my Berkeley DB utilities.
> There should have been a bunch of them built when you compiled DB.  I've got
> these:

From  Wed Aug 28 21:11:39 2002
From: (Oren Tirosh)
Date: Wed, 28 Aug 2002 16:11:39 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

On Wed, Aug 28, 2002 at 03:41:00PM -0400, Guido van Rossum wrote:
> > > Or you might use Oren's favorite rule of thumb and listify it when iter(x)
> > > is iter(x) (or iter(x) is x).
> > 
> > I'm a bit annoyed by the idea that `iter(x)' might require some computation
> > for producing an iterator, and that we immediately throw away the result.
> > Granted that `__iter__(self): return self' is efficient when an object is
> > an iterator, but nowhere it is said that `__iter__' has to be efficient
> > when the object is a container, and it does not shock me that some complex
> > containers require time to produce their iterator.  I much prefer limiting
> > the use of `__iter__' for when one intends to use the iterator...
> Yes, that's why I prefer to just make a list() copy.

Oh, come on you two... stop beating up the poor strawman. The reiter()
function doesn't make a single redundant call to __iter__.  It's just 
like iter() but ensures in passing that the result is really a fresh
iterator.


From  Wed Aug 28 21:19:11 2002
From: (Guido van Rossum)
Date: Wed, 28 Aug 2002 16:19:11 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Wed, 28 Aug 2002 16:11:39 EDT."
References: <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> Oh, come on you two... stop beating up the poor strawman. The reiter()
> function doesn't make a single redundant call to __iter__.  It's just 
> like iter() but ensures in passing that the result is really a fresh 
> iterator.

I'm really sorry.  I had forgotten what exactly your proposal was.  It
is actually very reasonable:

    Proposal: new built-in function reiter()

    def reiter(obj):
        """reiter(obj) -> iterator

    Get an iterator from an object. If the object is already an iterator a
    TypeError exception will be raised. For all Python built-in types it is
    guaranteed that if this function succeeds the next call to reiter() will
    return a new iterator that produces the same items unless the object is
    modified. Non-builtin iterable objects which are not iterators SHOULD
    support multiple iteration returning the same items."""

        it = iter(obj)
        if it is obj:
            raise TypeError('Object is not re-iterable')
        return it
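
A quick sketch of how the proposed function would behave, in modern Python 3.
The function body is the one quoted above; the sample list and the printed
message are illustrative only.

```python
def reiter(obj):
    """Return a fresh iterator, refusing objects that are already iterators."""
    it = iter(obj)
    if it is obj:
        raise TypeError('Object is not re-iterable')
    return it

items = [1, 2, 3]

# A container hands out an independent iterator on every call.
first_pass = list(reiter(items))
second_pass = list(reiter(items))
print(first_pass, second_pass)

# An iterator returns itself from iter(), so it is rejected.
try:
    reiter(iter(items))
except TypeError as exc:
    print("rejected:", exc)
```

Generators are rejected the same way, because iter() on a generator returns
the generator itself.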

--Guido van Rossum (home page:

From  Wed Aug 28 21:30:11 2002
From: (Jeremy Hylton)
Date: Wed, 28 Aug 2002 16:30:11 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>

  TP> + The ability to override the random number generator.  Python's
  TP>   default WH generator is showing its age as machines get
  TP>   faster; it's simply not adequate anymore for long-running
  TP>   programs making heavy use of it on a fast box.  Combinatorial
  TP>   algorithms in particular do tend to make heavy use of it.
  TP>   (Speaking of which, "someone" should look into grabbing one of
  TP>   the Mersenne Twister extensions for Python -- that's the
  TP>   current state of *that* art).

The last time we talked about random number generation, I remember
finding a tiny algorithm by Pierre L'Ecuyer based on a recommendation
from Luc Devroye.  (That's a good pedigree!)  Here's an almost equally
tiny C extension that wraps up the algorithm.

We should do a real test of it.  Last time I checked, it wasn't
obvious how to actually run the DIEHARD tests.


Content-Type: application/octet-stream
Content-Disposition: attachment;
Content-Transfer-Encoding: base64


From  Wed Aug 28 21:45:14 2002
From: (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 28 Aug 2002 16:45:14 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
References: <>
Message-ID: <>

[Guido van Rossum]

> > > def powerset(base):

> > This does not yield the subsets in "sorted order", like the other
> > `cogen' methods do, and I would prefer to keep that promise.

> That may be a matter of permuting the bits?

I looked at them: nothing evident pops up.  I tried inverting, inversing,
and both, in the hope of spotting some magic property, to no avail.

François Pinard

From  Wed Aug 28 21:59:39 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 16:59:39 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Paul Graham]
> Don't count words multiple times, and you'll probably
> get fewer false positives.  That's the main reason I
> don't do it-- because it magnifies the effect of some
> random word like water happening to have a big spam
> probability.

Yes, that makes sense, but I'm trained not to think <wink>.  Experiment will
decide it (although I *expect* it's a good change, and counting multiple
occurrences was obviously a factor in several of the rare false positives).
If spam really is different, it should be different in several distinct
ways.
> (Incidentally, why so high?  In my db it's  only 0.3930784.)  --pg

I expect it's because this tokenizer *only* split on whitespace.
Punctuation was left intact.  So, e.g., on the Python discussion list stuff
like

    The new approach blows it out of the water:

and

    This is very deep water;

and

    Then you'll take to Python like a duck takes to water!

are counted as "water:" and "water;" and "water!", not as "water".
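
The effect is easy to reproduce.  The sketch below (modern Python 3) splits
the three example lines on whitespace only, then shows one possible
normalization, stripping leading and trailing punctuation, purely as an
illustration; it is not the tokenizer used in these experiments.

```python
import string

lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

# Whitespace-only splitting keeps trailing punctuation attached.
whitespace_tokens = [tok for line in lines for tok in line.split()]
water_like = [tok for tok in whitespace_tokens if tok.startswith("water")]

# Hypothetical alternative: trim punctuation from both ends of each token.
stripped = [tok.strip(string.punctuation) for tok in water_like]

print(water_like)
print(stripped)
```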

The spam corpus is chock full o' "water", though:

+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like
  "no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really
  lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce
  water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream
  works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water
  bills.
+ Ditto this miracle new laundry tablet lets you use a fraction of
  the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as
  air and medical care.

I got tired then <wink>.

From  Wed Aug 28 22:04:46 2002
From: (Paul Graham)
Date: 28 Aug 2002 21:04:46 -0000
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
Message-ID: <>

I see, if you count the punctuation as part of the
token, you end up with undersized-corpus effects.
Esp if you are case-sensitive too.  If I were you
I'd map your input down into a narrower set of tokens,
or you'll get too many errors.  --pg

--Tim Peters wrote:
> [Paul Graham]
> > Don't count words multiple times, and you'll probably
> > get fewer false positives.  That's the main reason I
> > don't do it-- because it magnifies the effect of some
> > random word like water happening to have a big spam
> > probability.
> Yes, that makes sense, but I'm trained not to think <wink>.  Experiment will
> decide it (although I *expect* it's a good change, and counting multiple
> occurrences was obviously a factor in several of the rare false positives).
> If spam really is different, it should be different in several distinct
> ways.
> > (Incidentally, why so high?  In my db it's  only 0.3930784.)  --pg
> I expect it's because this tokenizer *only* split on whitespace.
> Punctuation was left intact.  So, e.g., on the Python discussion list stuff
> like
>     The new approach blows it out of the water:
> and
>     This is very deep water;
> and
>     Then you'll take to Python like a duck takes to water!
> are counted as "water:" and "water;" and "water!", not as "water".
> The spam corpus is chock full o' "water", though:
> + Porn sites advertising water sports.
> + Assorted bottled water pitches.
> + Assorted "oxygenated water" pitches.
> + Claims of environmental friendliness explicated via stuff like
>   "no harmful chlorine to pollute the water or air!".
> + Pitches for weight-loss gimmicks emphasizing that you'll really
>   lose fat, not just reduce water retention.
> + Pitches for weight-loss gimmicks emphasizing that you'll reduce
>   water retention as well as lose fat.
> + One repeated bizarre analogy for how a breast enlargement cream
>   works in the way "a sponge absorbs water".
> + This revolutionary new flat garden hose will really cut your water
>   bills.
> + Ditto this miracle new laundry tablet lets you use a fraction of
>   the water needed by old-fashioned detergents.
> + Survivalist pitches often mention water in the same sentence as
>   air and medical care.
> I got tired then <wink>.

From  Wed Aug 28 22:20:02 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 17:20:02 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <>
Message-ID: <>

[Jeremy Hylton]
> The last time we talked about random number generation, I remember
> finding a tiny algorithm by Pierre L'Ecuyer based on a recommendation
> from Luc Devroye.  (That's a good pedigree!)  Here's an almost equally
> tiny C extension that wraps up the algorithm.
> We should do a real test of it.  Last time I checked, it wasn't
> obvious how to actually run the DIEHARD tests.

It still isn't, but DIEHARD is likely obsolete now.  Testing for randomness
has become a full-blown science in its own right.  Your government is happy
to give you a bunch of more modern randomness tests developed on a Sun,
complete with a multi-hundred page testing manual every word of which is
vitally important <0.8 wink>:

Note that the Mersenne Twister is likely substantially faster than the
little C program (e.g., it doesn't need division, and on some platforms is
reported to be faster than the uselessly simple-minded C rand()), is
provably equi-distributed through 623 dimensions (linear congruential
generators are damned lucky to get 6), has a period of nearly 2**20000, and
is probably the most widely tested generator in existence now.  Knuth was
reported as saying "well, I guess that about wraps it up for random number
generation!", although I'd be more likely to believe L'Ecuyer or Marsaglia
on this particular topic <wink>.
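
For readers on a current Python: the standard random module has used the
Mersenne Twister as its core generator since Python 2.3, and random.Random
instances make the "override the generator" idea from earlier in this thread
easy to sketch.  The function name random_combination and its rng parameter
are hypothetical; this is not the CombGen.getrand interface.

```python
import random

def random_combination(seq, k, rng=random):
    """Pick k items of seq, in their original order, using the supplied rng."""
    pool = list(seq)
    # Any object with a compatible sample() method can be passed as rng.
    picked = sorted(rng.sample(range(len(pool)), k))
    return [pool[i] for i in picked]

# A privately seeded generator gives repeatable results.
first = random_combination("abcde", 3, random.Random(2002))
replay = random_combination("abcde", 3, random.Random(2002))
print(first, replay)
```

Passing the generator in, rather than calling the module-level functions
directly, is what lets a caller swap in a different or separately seeded
source of randomness.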

From  Wed Aug 28 22:41:43 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 17:41:43 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Paul Graham]
> I see, if you count the punctuation as part of the
> token, you end up with undersized-corpus effects.
> Esp if you are case-sensitive too.  If I were you
> I'd map your input down into a narrower set of tokens,
> or you'll get too many errors.  --pg

Possibly, but that's for experiment to decide (along with many other
variations).  The initial tokenization method was chosen merely for speed.
Still, I looked at every false positive across 80,000 presumed non-spam test
inputs, and posted the results earlier:  it's hard to imagine that ignoring
punctuation and/or case would have stopped any of them except for this one
(which is darned hard to care about <wink>):


prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

I also noted earlier that FREE (all caps) is now one of the 15 words that
most often makes it into the scorer's best-15 list, and cutting the legs off
a clue like that is unattractive on the face of it.  So I'm loath to fold
case unless experiment proves that's an improvement, and it just doesn't
look likely to do so.

For smaller corpora, some other conclusion may well be justified; but
experimenting on smaller corpora isn't on my near-term agenda, so that will
have to wait (we've got a specific application in mind right now for which
the corpora size I'm using is actually tiny -- hosts some very
high-volume mailing lists).

From  Wed Aug 28 23:13:18 2002
From: (Jack Jansen)
Date: Thu, 29 Aug 2002 00:13:18 +0200
Subject: [Python-Dev] Sourceforge CVS repository behaving strange?
Message-ID: <>

Either my brain needs to be pushed into gear or the CVS 
repository is behaving very strangely.

About half an hour ago I added and checked in a file 
Mac/OSX/, but after I did another cvs update the 
file immediately disappeared with the message that it was no 
longer in the repository.

And if I check the repository over the web (at
bin/viewcvs.cgi/python/python/dist/src/Mac/OSX/) the file is 
indeed gone. Moreover, if I look there the other files (such as 
the Makefile) *also* seem to be gone. But cvs update doesn't 
think they're gone, and doing a fresh checkout also makes them 
appear as they should (but it does not revive

Hmm, together with checking in I also added an 
entry to Misc/ACKS, lemme check... No Misc/ACKS in viewcvs... 
Actually, there are no files at all in viewcvs, just directories!

I am now utterly confused, can someone shed light on this?
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Wed Aug 28 23:22:57 2002
From: (Jack Jansen)
Date: Thu, 29 Aug 2002 00:22:57 +0200
Subject: [Python-Dev] Sourceforge CVS repository behaving strange?
In-Reply-To: <>
Message-ID: <>

On donderdag, augustus 29, 2002, at 12:13 , Jack Jansen wrote:

> Either my brain needs to be pushed into gear or the CVS 
> repository is behaving very strangely.
> About half an hour ago I added and checked in a file 
> Mac/OSX/, but after I did another cvs update the 
> file immediately disappeared with the message that it was no 
> longer in the repository.

When I re-checked my console output I noticed what seems to have 
actually happened, but I'm still at a loss for an explanation. It 
seems cvs added the file in the wrong location.

In Mac/OSX, I did
cvs add
cvs commit ../../Misc/ACKS

The commit message produced the following output:
Checking in ../../Misc/ACKS;
/cvsroot/python/python/dist/src/Misc/ACKS,v  <--  ACKS
new revision: 1.199; previous revision: 1.198
RCS file: /cvsroot/python/python/,v
Checking in;
/cvsroot/python/python/,v  <--
initial revision: 1.1

So, it somehow added my file at the root of the repository!

Anyway, I've fixed it, so the only thing that remains is a 
baffled look on my face as to why it did this, and also as to 
why viewcvs acts funny.
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Thu Aug 29 01:39:36 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 20:39:36 -0400
Subject: [Python-Dev] Sourceforge CVS repository behaving strange?
In-Reply-To: <>
Message-ID: <>

[Jack Jansen]
> ...
> Anyway, I've fixed it, so the only thing that remains is a
> baffled look on my face as to why it did this, and also as to
> why viewcvs acts funny.

Check the state of the

    Show only files with tag

dropdown box at the bottom of the ViewCVS page?  If that's set to a funny
value, you'll see funny things <wink>.

From  Thu Aug 29 02:19:38 2002
From: (Tim Peters)
Date: Wed, 28 Aug 2002 21:19:38 -0400
Subject: [Python-Dev] RE: The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Greg Ward]
> One of the other perennial-seeming topics on spamassassin-devel (a list
> that I follow only sporadically) is that careful manual cleaning of your
> corpus is *essential*.  The concern of the main SA developers is that
> spam in your non-spam folder (and vice-versa) will prejudice the genetic
> algorithm that evolves SA's scores in the wrong direction.  Gut instinct
> tells me the Bayesian approach ought to be more robust against this sort
> of thing, but even it must have a breaking point at which misclassified
> messages throw off the probabilities.

Like all other questions <wink>, this can be quantified if someone is
willing to do the grunt work of setting up, running, and analyzing
appropriate experiments.  This kind of algorithm is generally quite robust
against disaster, but note that even tiny changes in accuracy rates can have
a large effect on *you*:  say that 99% of the time the system says a thing
is spam, it really is.  Then say that degrades by a measly 1%:  99% falls to
98%.  From *your* POV this is huge, because the error rate has actually
doubled (from 1% wrong to 2% wrong:  you've got twice as many false
positives to deal with).
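
Spelled out with counts, in modern Python 3 (the batch of 1000 flagged
messages and the helper name are made-up figures for illustration):

```python
def wrongly_flagged(flagged, percent_correct):
    """Number of flagged messages that were not actually spam."""
    return flagged * (100 - percent_correct) // 100

flagged = 1000  # hypothetical batch of messages the system called spam
before = wrongly_flagged(flagged, 99)
after = wrongly_flagged(flagged, 98)

print(before, after, after / before)
```

A one-point drop in accuracy doubles the number of false positives a human
has to rescue.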

So the scheme has an ongoing need for accurate human training (spam changes,
list topics change, list members change, etc; the system needs an ongoing
random sample of both new spam and new non-spam to adapt).

> ...
> One possibility occurs to me: we could build our own corpus by
> collecting spam on for a few weeks.

Simpler is better:  as you suggested later, capture everything for a while,
and without injecting Mailman or SpamAssassin headers.  That won't be a
particularly good corpus for the lists in general, because over any brief
period a small number of topics and posters dominate.  But it will be a fair
test for how systems do over exactly that brief period <wink>.

> Here's a rough breakdown of mail rejected by over the
> last 10 days, eyeball-estimated messages per day:
>   bad RCPT                       150 - 300 [1]
>   bad sender                      50 - 190 [2]
>   relay denied                    20 - 180 [3]
>   known spammer addr/domain       15 -  60
>   8-bit chars in subject         130 - 200
>   8-bit chars in header addrs     10 -  60
>   banned charset in subject        5 -  50 [4]
>   "ADV" in subject                 0 -   5
>   no Message-Id header           100 - 400 [5]
>   invalid header address syntax    5 -  50 [6]
>   no valid senders in header      10 -  15 [7]
>   rejected by SpamAssassin        20 -  50 [8]
>   quarantined by SpamAssassin      5 -  50 [8]

We should start another category, "Messages from Tim rejected for bogus
reasons" <wink>.

> [1] this includes mail accidentally sent to eg.,
>     but based on scanning the reject logs, I'd say the vast majority
>     is spam.  However, such messages are rejected after RCPT TO,
>     so we never see the message itself.  Most of the bad recipient
>     addrs are either ancient (,
> or fictitious (,
> [2] sender verification failed, eg. someone tried to claim an
>     envelope sender like foo@bogus.domain.  Usually spam, but innocent
>     bystanders can be hit by DNS servers suddenly exploding (hello,
>  This only includes hard failures (DNS "no such
>     domain"), not soft failures (DNS timeout).
> [3] I'd be leery of accepting mail that's trying to hijack
> as an open relay, even though that would
>     be a goldmine of spam.  (OTOH, we could reject after the
>     DATA command, and save the message anyways.)
> [4] rejects any message with a properly MIME-encoded
>     subject using any of the following charsets:
>       big5, euc-kr, gb2312, ks_c_5601-1987
> [5] includes viruses as well as spam (and no doubt some innocent
>     false positives, although I have added exemptions for the MUA/MTA
>     combinations that most commonly result in legit mail reaching
> without a Message-Id header, eg. KMail/qmail)
> [6] eg. "To: all my friends" or "From: <>"
> [7] no valid sender address in any header line -- eg. someone gives a
>     valid MAIL FROM address, but then puts "From: blah@bogus.domain"
>     in the headers.  Easily defeated with a "Sender" or "Reply-to"
>     header.
> [8] any message scoring >= 10.0 is rejected at SMTP time; any
>     message scoring >= 5.0 but < 10 is saved in /var/mail/spam
>     for later review
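
The two thresholds in note [8] amount to a three-way decision on the
SpamAssassin score.  A minimal sketch, in which the function name and the
returned labels are made up for illustration and only the 5.0 and 10.0
cutoffs come from the note above:

```python
def disposition(score, quarantine_at=5.0, reject_at=10.0):
    """Map a SpamAssassin score to an action (labels are illustrative)."""
    if score >= reject_at:
        return "reject"        # refused at SMTP time
    if score >= quarantine_at:
        return "quarantine"    # saved for later review
    return "deliver"
```

Checking the higher threshold first keeps the two ranges from overlapping.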

Greg, you show signs of enjoying this job too much <wink>.

> Executive summary:
>   * it's a good thing we do all those easy checks before involving
>     SA, or the load on the server would be a lot higher

So long as easy checks don't block legitimate email, I can't complain about

>   * give me 10 days of spam-harvesting, and I can equal Bruce
>     Guenter's spam archive for 2002.  (Of course, it'll take a couple
>     of days to set the mail server up for the harvesting, and a couple
>     more days to clean through the ~2000 caught messages, but you get
>     the idea.)

If it would be helpful for me to do research on corpora that include the
headers, then the point would be to collect both spam and non-spam messages,
so that they can be compared directly to each other.  Those should be as
close to the bytes coming off the pipe as possible (e.g., before injecting
new headers of our own).  As is, I've had to throw the headers away in both
corpora, so am, in effect, working with a crippled version of the algorithm.

Or if someone else is doing research on how best to tokenize and tag
headers, I'm not terribly concerned about merging the approaches untested.
If the approach is valuable enough to deploy, we'll eventually see exactly
how well it works in real life.

> ...
> Perhaps that spam-harvesting run should also set aside a random
> selection of apparently-non-spam messages received at the same time.
> Then you'd have a corpus of mail sent to the same server, more-or-less
> to the same addresses, over the same period of time.

Yes, it wants something as close to a slice of real life as possible, in all
conceivable respects, including ratio of spam to not spam, arrival times,
and so on.

> Oh, any custom corpus should also include the ~300 false positives and
> ~600 false negatives gathered since SA started running on
> in April.

Definitely not.  That's not a slice of real life, it's a distortion based on
how some *other* system screwed up.  Train it systematically on that, and
you're not training it for real life.  The urge to be clever is strong, but
must be resisted <0.3 wink>.

What would be perfectly reasonable is to run (not train) the system against
those corpora to see how it does.

BTW, Barry said the good-message archives he put together were composed of
msgs archived after SpamAssassin was enabled.  Since about 80% of the 1%
"false positive" rate I first saw turned out to be blatant spam in the ham
corpus, this suggests SpamAssassin let about 160000 * 1% * 80% = 1280 spams
through to the python-list archive alone.  That doesn't jibe with "600 false
negatives" at all.  I don't want to argue about it, it's just fair warning
that I don't believe much that I hear <wink>.  In particular, in *this* case
I don't believe python-list actually got 160000 messages since April, unless
we're talking about April of 2000.

From  Thu Aug 29 04:15:11 2002
From: (Skip Montanaro)
Date: Wed, 28 Aug 2002 22:15:11 -0500
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
References: <>
Message-ID: <>

[ lots of interesting stuff elided ]

    Tim> What's an acceptable false positive rate?  What do we get from
    Tim> SpamAssassin?  I expect we can end up below 0.1% here, and with a
    Tim> generous meaning for "not spam", but I think *some* of these
    Tim> examples show that the only way to get a 0% false-positive rate is
    Tim> to recode spamprob like so:

I don't know what an acceptable false positive rate is.  I guess it depends
on how important those falsies are. ;-)

One thing I think would be worthwhile would be to run GBayes first, then
only run stuff it thought was spam through SpamAssassin.  Only messages that
both systems categorized as spam would drop into the spam folder.  This has
a couple benefits over running one or the other in isolation:

    * The training set for GBayes probably doesn't need to be as big

    * The two systems use substantially different approaches to identifying
      spam, so I suspect your false positive rate would go way down.  False
      negatives would go up, but only testing can suggest by how much.

    * Since SA is dog slow most of the time, SA users get a big speedup,
      since a substantially smaller fraction of your messages get run
      through it.

This sort of chaining is pretty trivial to set up with procmail.  Dunno what
the Windows set will do though.
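
The chaining idea can be sketched in a few lines.  This is only an
illustration: the two classifier arguments are hypothetical callables
standing in for GBayes and SpamAssassin, not real APIs of either tool.

```python
def classify_chained(msg, cheap_is_spam, expensive_is_spam):
    """Return True only if both classifiers call msg spam.

    The expensive check runs only on messages the cheap one flagged,
    so most mail never reaches it.
    """
    if not cheap_is_spam(msg):
        return False
    return expensive_is_spam(msg)


# Toy stand-ins for the two systems (purely illustrative rules).
expensive_calls = []

def toy_cheap(msg):
    return "free" in msg

def toy_expensive(msg):
    expensive_calls.append(msg)
    return "click here" in msg

results = [classify_chained(m, toy_cheap, toy_expensive)
           for m in ["hello list", "free beer at the sprint",
                     "free money, click here"]]
```

Only the two messages flagged by the cheap check are ever passed to the
expensive one, which is where the speedup would come from.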


From  Thu Aug 29 05:18:04 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 00:18:04 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

FYI, about counting multiple instances of a word multiple times, or only
once, when scoring.  Changing it to count words only once did fix the
specific false positive examples I mentioned.  However, across 20 test runs
(training on one of five pairs of corpora, and then for each such training
pair running predictions across the remaining four pairs), it was a mixed
bag.  On some runs it appeared to be a real improvement, on others a real
regression.  Overall, the results didn't support concluding it made a
significant difference to the false positive rate, but weakly supported
concluding that it increased the false negative rate.
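
For anyone who wants to try the comparison, the two variants differ only
in whether repeated words are collapsed before scoring.  A rough sketch,
assuming a Graham-style combination of per-word spam probabilities; the
spamprob table, the 0.5 default for unknown words, and the function names
are placeholders, and the restriction to the most extreme words is left
out:

```python
from functools import reduce
from operator import mul

def combine(probs):
    """Graham-style combination: P / (P + Q), where P = prod(p) and
    Q = prod(1 - p).  Assumes every p is strictly between 0 and 1."""
    p = reduce(mul, probs, 1.0)
    q = reduce(mul, (1.0 - x for x in probs), 1.0)
    return p / (p + q)

def score(words, spamprob, count_once=True, unknown=0.5):
    """Score a tokenized message, optionally collapsing duplicate words."""
    if count_once:
        words = set(words)
    return combine([spamprob.get(w, unknown) for w in words])

# Hypothetical probabilities, chosen only to show the effect.
spamprob = {"free": 0.9, "python": 0.1}
words = ["free", "free", "free", "python"]
once = score(words, spamprob, count_once=True)
many = score(words, spamprob, count_once=False)
```

With these made-up numbers the repeated word dominates the score when
counted every time, and is balanced by the other word when counted once.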

That's very tentative -- I didn't stare at the actual misclassifications, I
just ran it while sleeping off a flu, then woke up and crunched the numbers.
This ignorant-of-MIME tokenization scheme is ridiculously bad for the false
negative rate anyway (an entire line of base64 or obfuscated
quoted-printable looks like a ham-favoring single "unknown word" to it), so
there are bigger fish to fry first.

From  Thu Aug 29 06:23:07 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 01:23:07 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <><><><><><><><><><><> <>
Message-ID: <007501c24f1c$29565ec0$2261accf@othello>

Ideas for the day:

1. Optimize BaseSet._update(iterable) by checking for two special cases where a C-speed update method is already available and the
entries are known in advance to be immutable:

            . . .
            if isinstance(iterable, BaseSet):
                self._data.update(iterable._data)
                return
            if isinstance(iterable, dict):
                self._data.update(iterable)
                return
            . . .

2.  Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using
it, directly or by coercing to a set, as an iterable:

>>> Set('abracadabra') | 'alacazam'
Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])

This improves usability because the second argument did not have to be pre-wrapped with Set.  It improves speed, for some
operations, by using the iterable directly and not having to build an equivalent dictionary.

3.  Have ImmutableSet keep a reference to the original iterable.  Add an ImmutableSet.refresh() method that rebuilds ._data from
the iterable.  Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible.  The goal is to improve the
usability of sets of sets where the inner sets have been updated after the outer set was created.

>>> inner = Set('abracadabra')
>>> outer = Set([inner])
>>> inner.add('z')                 # now the outer set is out-of-date
>>> outer.refresh()               # now it is current
>>> outer
Set(['a', 'c', 'r', 'z', 'b', 'd'])

This would only work for restartable iterables -- a file object would not be so easily refreshed.

Raymond Hettinger

From  Thu Aug 29 06:45:52 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 01:45:52 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <><><><><><><><><><><> <> <007501c24f1c$29565ec0$2261accf@othello>
Message-ID: <007d01c24f1f$5713efa0$2261accf@othello>

> 3.  Have ImmutableSet keep a reference to the original iterable.  Add an ImmutableSet.refresh() method that rebuilds ._data from
> the iterable.  Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible.  The goal is to improve the
> usability of sets of sets where the inner sets have been updated after the outer set was created.
> >>> inner = Set('abracadabra')
> >>> outer = Set([inner])
> >>> inner.add('z')                 # now the outer set is out-of-date
> >>> outer.refresh()               # now it is current
> >>> outer
> Set(['a', 'c', 'r', 'z', 'b', 'd'])

Make that:

Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])

From  Thu Aug 29 06:49:56 2002
From: (Skip Montanaro)
Date: Thu, 29 Aug 2002 00:49:56 -0500
Subject: [Python-Dev] lots of test failures...
Message-ID: <>

Just cvs up'd and got a bunch of test failures on my Linux box:

    12 tests failed:
        test___all__ test_cookie test_descrtut test_difflib test_doctest
        test_doctest2 test_generators test_grammar test_inspect
        test_pyclbr test_sundry test_tokenize

Several failed looking for a missing attribute "testmod", e.g.:

    test test_generators crashed -- exceptions.AttributeError: 'module'
    object has no attribute 'testmod'

Here's the test_tokenize output:

    test test_tokenize produced unexpected output:
    *** mismatch between lines 127-138 of expected output and lines 127-138 of actual output:
    - 42,17-42,29:      NUMBER  '020000000000'
    ?          ^^
    + 42,17-42,30:      NUMBER  '020000000000L'
    ?          ^^                      +
    - 42,29-42,30:      NEWLINE '\n'
    ?    ^^     ^
    + 42,30-42,31:      NEWLINE '\n'
    ?    ^^     ^
    - 43,0-43,12:       NUMBER  '037777777777'
    ?          ^
    + 43,0-43,13:       NUMBER  '037777777777L'
    ?          ^                      +
    - 43,13-43,15:      OP      '!='
    ?     ^     ^
    + 43,14-43,16:      OP      '!='
    ?     ^     ^
    - 43,16-43,17:      OP      '-'
    ?     ^     ^
    + 43,17-43,18:      OP      '-'
    ?     ^     ^
    - 43,17-43,18:      NUMBER  '1'
    ?     ^     ^
    + 43,18-43,19:      NUMBER  '1'
    ?     ^     ^
    - 43,18-43,19:      NEWLINE '\n'
    ? ------
    + 43,19-43,20:      NEWLINE '\n'
    ?      ++++++
    - 44,0-44,10:       NUMBER  '0xffffffff'
    ?          ^
    + 44,0-44,11:       NUMBER  '0xffffffffL'
    ?          ^                    +
    - 44,11-44,13:      OP      '!='
    ?     ^     ^
    + 44,12-44,14:      OP      '!='
    ?     ^     ^
    - 44,14-44,15:      OP      '-'
    ?     ^     ^
    + 44,15-44,16:      OP      '-'
    ?     ^     ^
    - 44,15-44,16:      NUMBER  '1'
    ?     ^     ^
    + 44,16-44,17:      NUMBER  '1'
    ?     ^     ^
    - 44,16-44,17:      NEWLINE '\n'
    ?     ^     ^
    + 44,17-44,18:      NEWLINE '\n'
    ?     ^     ^

I'll look more closely in the morning.  It's too late to investigate now.


From David Abrahams" <  Thu Aug 29 07:03:24 2002
From: David Abrahams" < (David Abrahams)
Date: Thu, 29 Aug 2002 02:03:24 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
References: <><> <>
Message-ID: <02c401c24f21$ceed79e0$>

>   TP> + The ability to override the random number generator.  Python's
>   TP>   default WH generator is showing its age as machines get
>   TP>   faster; it's simply not adequate anymore for long-running
>   TP>   programs making heavy use of it on a fast box.  Combinatorial
>   TP>   algorithms in particular do tend to make heavy use of it.
>   TP>   (Speaking of which, "someone" should look into grabbing one of
>   TP>   the Mersenne Twister extensions for Python -- that's the
>   TP>   current state of *that* art).

FWIW, in case "someone" cares:
It's a nice library architecture, designed and implemented by people who
know the domain, and I think it should be applicable to Python.

           David Abrahams * Boost Consulting *

From Anthony Baxter <>  Thu Aug 29 08:04:32 2002
From: Anthony Baxter <> (Anthony Baxter)
Date: Thu, 29 Aug 2002 17:04:32 +1000
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <200208290704.g7T74WA22198@localhost.localdomain>


For what it's worth, the attached (simple) script will 'de-spamassassin'
an email message. I use it on my 'spam' folder to get test messages of 
various ugly MIME things that spam and viruses let through...

It's not pretty, but it does the job (for me, anyway)

Anthony Baxter     <>   
It's never too late to have a happy childhood.

Content-Type: text/plain ; name=""; charset=us-ascii
Content-Disposition: attachment; filename=""

def deSA(fp):
    import email, re
    # Parse the whole message from the open file object.
    m = email.message_from_string(fp.read())
    if m['X-Spam-Status']:
        if m['X-Spam-Status'].startswith('No'):
            # Not flagged: only the status headers were added.
            del m['X-Spam-Status']
            del m['X-Spam-Level']
        else:
            del m['X-Spam-Status']
            del m['X-Spam-Level']
            del m['X-Spam-Flag']
            del m['X-Spam-Checker-Version']

            # Put back the original Content-Type, if one was saved.
            pct = m['X-Spam-Prev-Content-Type']
            if pct:
                del m['X-Spam-Prev-Content-Type']
                del m['Content-Type']
                m['Content-Type'] = pct

            pcte = m['X-Spam-Prev-Content-Transfer-Encoding']
            if pcte:
                del m['Content-Transfer-Encoding']
                m['Content-Transfer-Encoding'] = pcte
                del m['X-Spam-Prev-Content-Transfer-Encoding']

            # Assumes a non-multipart message (payload is a string).
            body = m.get_payload()

            subj = m['Subject']
            if subj is not None:
                del m['Subject']
                m['Subject'] = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '', subj)

            # Drop the leading block of 'SPAM: ' report lines.
            newbody = []
            at_start = 1
            for line in body.splitlines():
                if at_start and line.startswith('SPAM: '):
                    continue
                elif at_start:
                    at_start = 0
                newbody.append(line)
            m.set_payload('\n'.join(newbody))

    return m

if __name__ == "__main__":
    import sys
    print(deSA(open(sys.argv[1])))


From  Thu Aug 29 10:00:58 2002
From: (Michael Hudson)
Date: 29 Aug 2002 10:00:58 +0100
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test,1.41,1.42,1.3,1.4
In-Reply-To:'s message of "Wed, 28 Aug 2002 09:36:13 -0700"
References: <>
Message-ID: <> writes:

> Update of /cvsroot/python/python/dist/src/Lib/test
> In directory usw-pr-cvs1:/tmp/cvs-serv31010/Lib/test
> Modified Files:
> Log Message:
> Quite down some FutureWarnings.

Barry, is this why these tests have started to fail?


  I love the way Microsoft follows standards.  In much the same
  manner that fish follow migrating caribou.           -- Paul Tomblin

From  Thu Aug 29 13:38:02 2002
From: (Barry A. Warsaw)
Date: Thu, 29 Aug 2002 08:38:02 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test,1.41,1.42,1.3,1.4
References: <>
Message-ID: <>

>>>>> "SM" == Skip Montanaro <> writes:

    SM>     12 tests failed: test___all__ test_cookie test_descrtut
    SM> test_difflib test_doctest test_doctest2 test_generators
    SM> test_grammar test_inspect test_pyclbr test_sundry
    SM> test_tokenize

    SM> Several failed looking for a missing attribute "testmod",
    SM> e.g.:

>>>>> "MH" == Michael Hudson <> writes:

    >> Update of /cvsroot/python/python/dist/src/Lib/test In directory
    >> usw-pr-cvs1:/tmp/cvs-serv31010/Lib/test Modified Files:
    >> Log Message: Quite down some
    >> FutureWarnings.

    MH> Barry, is this why these tests have started to fail?

Anything's possible.  They weren't failing for me yesterday before I
checked them in, but on my home machine now I see failures for
test_grammar, test_strptime, and test_tokenize (and test_linuxaudiodev
but that's always failed for me).  I definitely don't see the other
failures that Skip reports.

I'll investigate.

From  Thu Aug 29 14:18:11 2002
From: (Barry A. Warsaw)
Date: Thu, 29 Aug 2002 09:18:11 -0400
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test,1.41,1.42,1.3,1.4
References: <>
Message-ID: <>

>>>>> "BAW" == Barry A Warsaw <> writes:

    BAW> before I checked them in, but on my home machine now I see
    BAW> failures for test_grammar, test_strptime, and test_tokenize
    BAW> (and test_linuxaudiodev but that's always failed for me).  I
    BAW> definitely don't see the other failures that Skip reports.

The test_tokenize failure was easy, the output file had changed.
The test_grammar failures make sense given the changing semantics of
those constants.  I've checked in a change that basically commented out
the hex -1 and oct -1 tests since those seem to be testing something
that won't be true.  Someone should double check that this is the
right fix.

The test_strptime failure has "gone away".  It was:

test test_strptime failed -- Traceback (most recent call last):
  File "/home/barry/projects/python/Lib/test/", line 176, in test_hour
    self.failUnless(strp_output[3] == self.time_tuple[3], "testing of '%%I %%p' directive failed; '%s' -> %s != %s" % (strf_output, strp_output[3], self.time_tuple[3]))
  File "/home/barry/projects/python/Lib/", line 268, in failUnless
    if not expr: raise self.failureException, msg
AssertionError: testing of '%I %p' directive failed; '12 PM' -> 24 != 12

I don't get why this was failing and now is not, but don't have time
right now to look at it.
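
For whoever does look at it: the assertion shows '12 PM' coming back as
hour 24 instead of 12.  The usual rule for combining a 12-hour clock value
with an AM/PM marker is sketched below.  This is not the _strptime code,
just an illustration with a made-up function name; blindly adding 12 for
PM would give exactly the 24 seen above, though the actual cause has not
been identified in this thread.

```python
def to_24_hour(hour12, meridiem):
    """Convert a 12-hour clock value (1-12) plus 'AM'/'PM' to 0-23."""
    if not 1 <= hour12 <= 12:
        raise ValueError("hour must be in 1..12")
    hour = hour12 % 12          # 12 maps to 0 before the PM offset
    if meridiem.upper() == "PM":
        hour += 12              # 12 PM -> 12, 1 PM -> 13, ...
    return hour
```

Reducing the hour modulo 12 first handles both 12 AM (midnight, 0) and
12 PM (noon, 12) without special cases.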

I now have no unexpected skips or failures.

From  Thu Aug 29 14:20:00 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 09:20:00 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
References: <><> <> <02c401c24f21$ceed79e0$>
Message-ID: <007e01c24f5e$c84e73e0$5fb63bd0@othello>

From: "David Abrahams" <>

> >   TP> + The ability to override the random number generator.  Python's
> >   TP>   default WH generator is showing its age as machines get
> >   TP>   faster; it's simply not adequate anymore for long-running
> >   TP>   programs making heavy use of it on a fast box.  Combinatorial
> >   TP>   algorithms in particular do tend to make heavy use of it.
> >   TP>   (Speaking of which, "someone" should look into grabbing one of
> >   TP>   the Mersenne Twister extensions for Python -- that's the
> >   TP>   current state of *that* art).
> FWIW, in case "someone" cares:
> It's a nice library architecture, designed and implemented by people who
> know the domain, and I think it should be applicable to Python.

I'm willing to implement this one.


From  Thu Aug 29 15:06:53 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 10:06:53 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Thu, 29 Aug 2002 01:23:07 EDT."
References: <> <> <> <> <> <> <> <> <> <> <> <>
Message-ID: <>

> 1. Optimize BaseSet._update(iterable) by checking for two special cases where a C-speed update method is already available and the
> entries are known in advance to be immutable:
>             . . .
>             if isinstance(iterable, BaseSet):
>                 self._data.update(iterable._data)
>                 return
>             if isinstance(iterable, dict):
>                 self._data.update(iterable)
>                 return
>             . . .


> 2.  Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using
> it, directly or by coercing to a set, as an iterable:
> >>> Set('abracadabra') | 'alacazam'
> Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
> This improves usability because the second argument did not have to be pre-wrapped with Set.  It improves speed, for some
> operations, by using the iterable directly and not having to build an equivalent dictionary.

No.  This has been proposed before.  I think it's a bad idea, just as

   [1,2,3] + "abc"

is a bad idea.

If you want this, it's easy enough to do

 s = Set('abracadabra')
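
For comparison, the built-in set type that later grew out of this module
keeps the same split: the | operator insists on a set operand, while the
named method accepts any iterable.  A quick illustration, using today's
built-in set rather than the sets.Set class under discussion:

```python
s = set('abracadabra')

# The named method accepts an arbitrary iterable.
combined = s.union('alacazam')

# The operator refuses a non-set operand.
try:
    s | 'alacazam'
    operator_accepts_str = True
except TypeError:
    operator_accepts_str = False
```

The explicit method call makes the coercion visible at the call site,
which is the point of rejecting mixed-type operators.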

> 3.  Have ImmutableSet keep a reference to the original iterable.  Add an ImmutableSet.refresh() method that rebuilds ._data from
> the iterable.  Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible.  The goal is to improve the
> usability of sets of sets where the inner sets have been updated after the outer set was created.
> >>> inner = Set('abracadabra')
> >>> outer = Set([inner])
> >>> inner.add('z')                 # now the outer set is out-of-date
> >>> outer.refresh()               # now it is current
> >>> outer
> Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
> This would only work for restartable iterables -- a file object would not be so easily refreshed.

This *appears* to be messing with the immutability.  If I wrote:

  a = range(3)
  s1 = ImmutableSet(a)
  s2 = Set([s1])
  a.append(4)
  s2.refresh()

What would the value of s1 be?

I think I understand your use case (the example in the docs, where an
employee is added), but I think we should think harder about what to
do about that.  Possibly it's not a good example of how sets are used
(even if it's a good example of how sets work).

--Guido van Rossum (home page:

From  Thu Aug 29 15:25:50 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 10:25:50 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Thu, 29 Aug 2002 09:20:00 EDT."
References: <> <> <> <02c401c24f21$ceed79e0$>
Message-ID: <>

> > FWIW, in case "someone" cares:
> > It's a nice library architecture, designed and implemented by people who
> > know the domain, and I think it should be applicable to Python.
> I'm willing to implement this one.

Please do!  (Have you got much experience with random number
generation?)

--Guido van Rossum (home page:

From  Thu Aug 29 15:36:18 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 10:36:18 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
References: <> <> <> <02c401c24f21$ceed79e0$>              <007e01c24f5e$c84e73e0$5fb63bd0@othello>  <>
Message-ID: <00a301c24f69$709a0e60$5fb63bd0@othello>

> > > FWIW, in case "someone" cares:
> > > It's a nice library architecture, designed and implemented by people who
> > > know the domain, and I think it should be applicable to Python.
> >
> > I'm willing to implement this one.
> Please do!  (Have you got much experience with random number
> generation?)

Yes, but my experience is out-of-date.  I've read Knuth (esp the part on testing generators), done numerical analysis, written
simulations and high-end crypto, etc.   The Mersenne Twister algorithm is new to me -- studying it is part of my motivation to
volunteer to implement it.

From  Thu Aug 29 15:42:49 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 10:42:49 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: Your message of "Thu, 29 Aug 2002 10:36:18 EDT."
References: <> <> <> <02c401c24f21$ceed79e0$> <007e01c24f5e$c84e73e0$5fb63bd0@othello> <>
Message-ID: <>

> > > > FWIW, in case "someone" cares:
> > > > It's a nice
> > > > library architecture, designed and implemented by people who
> > > > know the domain, and I think it should be applicable to
> > > > Python.
> > >
> > > I'm willing to implement this one.
> >
> > Please do!  (Have you got much experience with random number
> > generation?)
> Yes, but my experience is out-of-date.  I've read Knuth (esp the
> part on testing generators), done numerical analysis, written
> simulations and high-end crypto, etc.  The Mersenne Twister
> algorithm is new to me -- studying it is part of my motivation to
> volunteer to implement it.

Cool!  You & Tim will have something to talk about.

--Guido van Rossum (home page:

From  Thu Aug 29 15:43:09 2002
From: (Skip Montanaro)
Date: Thu, 29 Aug 2002 09:43:09 -0500
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <200208290704.g7T74WA22198@localhost.localdomain>
References: <>
Message-ID: <>

(trimming the cc list a bit, since this is drifting a bit away from strictly
discussing the current algorithm.)

    Anthony> For what it's worth, the attached (simple) script will
    Anthony> 'de-spamassassin' an email message. I use it on my 'spam'
    Anthony> folder to get test messages of various ugly MIME things that
    Anthony> spam and viruses let through...

Thanks, that helps me as well, as I need to delete the X-VM-* headers
Emacs's VM mail package inserts.  While spamassassin -d does what you are
doing, your script can be easily extended to elide other headers as well.

One thing worth noting before everybody starts using it to massage their
mailboxes is that the email package contains a bug which causes it to
occasionally delete whitespace when reformatting headers.  For example, in
one message, the header went from

    Received: from ([])
              (InterMail vM. 201-253-122-126-106-20020509) with ESMTP
              id <>;
              Tue, 20 Aug 2002 16:54:24 -0400

to

    Received: from ([])
            (InterMail vM. 201-253-122-126-106-20020509) with ESMTPid
            Tue, 20 Aug 2002 16:54:24 -0400

Note that in the second version there is no space between "ESMTP" and "id",
which had previously been separated by a newline and several spaces.
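
For reference, RFC 2822 unfolding removes only the line break that
precedes the continuation whitespace; the whitespace itself stays, so
"ESMTP" and "id" should remain separated.  A small sketch of that rule
(the function name is made up, and this is not the email package's code):

```python
import re

def unfold(header):
    """Unfold an RFC 2822 header: drop each CRLF or LF that is followed
    by a space or tab, keeping the whitespace itself."""
    return re.sub(r"\r?\n(?=[ \t])", "", header)

# Hypothetical header text, used only to exercise the function.
folded = "Received: by example with ESMTP\n    id 12345;\n    Tue, 20 Aug 2002"
unfolded = unfold(folded)
```

A line break that is not followed by whitespace is left alone, since it
ends the header rather than continuing it.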

I filed a bug report about it a few days ago:


From  Thu Aug 29 15:53:07 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 10:53:07 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
References: <> <> <> <> <> <> <> <> <> <> <> <>              <007501c24f1c$29565ec0$2261accf@othello>  <>
Message-ID: <00b101c24f6b$c9e7c5a0$5fb63bd0@othello>

> > 2.  Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using
> > it, directly or by coercing to a set, as an iterable:
> >
> > >>> Set('abracadabra') | 'alacazam'
> > Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
> >
> > This improves usability because the second argument did not have to be pre-wrapped with Set.  It improves speed, for some
> > operations, by using the iterable directly and not having to build an equivalent dictionary.
> No.  This has been proposed before.  I think it's a bad idea, just as
>    [1,2,3] + "abc"
> is a bad idea.

I see the wisdom in preventing weirdness.  The real motivation was to get to play nicely with other set implementations.
Right now, it can only interact with instances of BaseClass.  And, even if someone subclasses BaseClass, they currently *must*
have a self._data attribute that is a dictionary.  This prevents non-dictionary based extensions.

> > 3.  Have ImmutableSet keep a reference to the original iterable.  Add an ImmutableSet.refresh() method that rebuilds ._data from
> > the iterable.  Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible.  The goal is to improve the
> > usability of sets of sets where the inner sets have been updated after the outer set was created.
> >
> > >>> inner = Set('abracadabra')
> > >>> outer = Set([inner])
> > >>> inner.add('z')                 # now the outer set is out-of-date
> > >>> outer.refresh()               # now it is current
> > >>> outer
> > Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
> >
> > This would only work for restartable iterables -- a file object would not be so easily refreshed.
> This *appears* to be messing with the immutability.  If I wrote:
>   a = range(3)
>   s1 = ImmutableSet(a)
>   s2 = Set([s1])
>   a.append(4)
>   s2.refresh()
> What would the value of s1 be?

Hmm, I intended to have s1.refresh() return a new object for use in s2 while leaving s1 alone (being immutable and all).  Now, I
wonder if that was the right thing to do.  The answer lies in use cases for algorithms that need sets of sets.  If anyone knows
off the top of their head that would be great; otherwise, I seem to remember that some of that business was found in compiler
algorithms and graph packages.

From  Thu Aug 29 16:29:40 2002
From: (Skip Montanaro)
Date: Thu, 29 Aug 2002 10:29:40 -0500
Subject: [Python-Dev] Re: [Python-checkins] python/dist/src/Lib/test,1.41,1.42,1.3,1.4
In-Reply-To: <>
References: <>
Message-ID: <>

    >> Modified Files:
    >> Log Message:
    >> Quite down some FutureWarnings.

    Michael> Barry, is this why these tests have started to fail?

Whatever Barry and Guido did fixed the problems with those files.  The other
failures were all caused by a cvs conflict in my locally modified version of  Oddly enough, "cvs up" didn't report a "C", just an "M", so I
didn't even think to look there for problems.


From  Thu Aug 29 16:26:51 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 11:26:51 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: Your message of "Thu, 29 Aug 2002 10:53:07 EDT."
References: <> <> <> <> <> <> <> <> <> <> <> <> <007501c24f1c$29565ec0$2261accf@othello> <>
Message-ID: <>

(How about limiting our lines to 72 characters?)

> > > Eliminate the binary sanity checks which verify for operators
> > > that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using
> > > it, directly or by coercing to a set, as an iterable:
> > >
> > > >>> Set('abracadabra') | 'alacazam'
> > > Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
> > >
> > > This improves usability because the second argument did not have
> > > to be pre-wrapped with Set.  It improves speed, for some
> > > operations, by using the iterable directly and not having to
> > > build an equivalent dictionary.
> >
> > No.  This has been proposed before.  I think it's a bad idea, just as
> >
> >    [1,2,3] + "abc"
> >
> > is a bad idea.
> I see the wisdom in preventing weirdness.  The real motivation was
> to get to play nicely with other set implementations.  Right
> now, it can only interact with instances of BaseClass.  And, even if
> someone subclasses BaseClass, they currently *must* have a
> self._data attribute that is a dictionary.  This prevents
> non-dictionary based extensions.

I've thought of (and I think I even posted) a different way to
accomplish the latter, *if* *and* *when* it becomes necessary.  Here
it is again:

- BaseSet becomes a true abstract class.  I don't care if it has dummy
  methods that raise NotImplementedError, but the set of operations it
  stands for should be documented.  I propose that it should stand for
  only the published operations of ImmutableSet.  Other set
  implementations can then derive from BaseSet.

- The implementation currently in BaseSet is moved to a new internal
  class, e.g. _CoreSet, which derives from BaseSet.

- Set and ImmutableSet derive from _CoreSet.

- The binary operators (and sundry other places as needed) make *two*
  checks for the 'other' argument:

  - If it is a _CoreSet instance, do what's currently done, taking a
    shortcut knowing the implementation.

  - Otherwise, if it is a BaseSet instance, implement the operation
    using only the published set API.


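  def __or__(self, other):
      if isinstance(other, _CoreSet):
          # Shortcut: we know the implementation of 'other'.
          result = self.__class__(self._data)
          result._data.update(other._data)
          return result
      if isinstance(other, BaseSet):
          # Generic path: iterate 'other' via the published set API.
          result = self.__class__(self._data)
          result._update(other)
          return result
      return NotImplemented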

This effectively makes BaseSet a protocol.  I realize that there is
some resistance to using inheritance from a designated abstract base
class as a way to indicate that a class implements a given protocol;
but since we don't have other solutions in place, I think this is a
reasonable solution.  Trying to sniff whether the other argument
implements a set protocol by testing the presence of specific APIs
seems awkward, especially since most set APIs (__or__ etc.) are
heavily overloaded by types that aren't sets at all.
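
A minimal sketch of the layering described above, written in present-day Python rather than against the 2002 sets module. The names BaseSet and _CoreSet come from the proposal; ListSet, the dict-of-True storage, and every method body are assumptions made purely for illustration.

```python
class BaseSet:
    """Abstract protocol: only iteration, len and membership are assumed."""

    def __iter__(self):
        raise NotImplementedError

    def __len__(self):
        return sum(1 for _ in self)

    def __contains__(self, elt):
        return any(elt == x for x in self)


class _CoreSet(BaseSet):
    """Dict-backed implementation (hypothetical stand-in for the real one)."""

    def __init__(self, iterable=()):
        self._data = dict.fromkeys(iterable, True)

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, elt):
        return elt in self._data

    def __or__(self, other):
        if isinstance(other, _CoreSet):
            # Shortcut: we know the other operand's representation.
            result = self.__class__(self._data)
            result._data.update(other._data)
            return result
        if isinstance(other, BaseSet):
            # Fall back to the published protocol: plain iteration.
            result = self.__class__(self._data)
            for elt in other:
                result._data[elt] = True
            return result
        return NotImplemented


class Set(_CoreSet):
    pass


class ListSet(BaseSet):
    """A non-dictionary set, so that the protocol path gets exercised."""

    def __init__(self, iterable=()):
        self._items = []
        for elt in iterable:
            if elt not in self._items:
                self._items.append(elt)

    def __iter__(self):
        return iter(self._items)


a = Set([1, 2])
print(sorted(a | Set([2, 3])))        # fast path
print(sorted(a | ListSet([3, 4])))    # protocol path
```

With this arrangement a third-party class only has to derive from BaseSet and support iteration to take part in the binary operators; anything else falls through to NotImplemented.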

> > > Have ImmutableSet keep a reference to the original iterable.
> > > Add an ImmutableSet.refresh() method that rebuilds ._data from
> > > the iterable.  Add a Set.refresh() method that triggers
> > > ImmutableSet.refresh() where possible.  The goal is to improve
> > > the usability of sets of sets where the inner sets have been
> > > updated after the outer set was created.
> > >
> > > >>> inner = Set('abracadabra')
> > > >>> outer = Set([inner])
> > > >>> inner.add('z')                 # now the outer set is out-of-date
> > > >>> outer.refresh()               # now it is current
> > > >>> outer
> > > Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
> > >
> > > This would only work for restartable iterables -- a file object would not be so easily refreshed.
> >
> > This *appears* to be messing with the immutability.  If I wrote:
> >
> >   a = range(3)
> >   s1 = ImmutableSet(a)
> >   s2 = Set([s1])
> >   a.append(4)
> >   s2.refresh()
> >
> > What would the value of s1 be?
> Hmm, I intended to have s1.refresh() return a new object for use in
> s2 while leaving s1 alone (being immutable and all).  Now, I wonder
> if that was the right thing to do.  The answer lies in use cases for
> algorithms that need sets of sets.  If anyone knows off the top of
> their head that would be great; otherwise, I seem to remember that
> some of that business was found in compiler algorithms and graph
> packages.

Let's call YAGNI on this one.

--Guido van Rossum (home page:

From  Thu Aug 29 17:21:38 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 12:21:38 -0400
Subject: [Python-Dev] Re: PEP 218 (sets); moving to Lib
In-Reply-To: <00b101c24f6b$c9e7c5a0$5fb63bd0@othello>
Message-ID: <>

[Raymond Hettinger]
> ...
> Hmm, I intended to have s1.refresh() return a new object for use
> in s2 while leaving s1 alone (being immutable and all).  Now, I
> wonder if that was the right thing to do.  The answer lies in use
> cases for algorithms that need sets of sets.  If anyone knows
> off the top of their head that would be great; otherwise, I seem
> to remember that some of that business was found in compiler
> algorithms and graph packages.

There's no real use case I know of for having a mutation of a set element
propagate to the set containing it.  Sets in Python are collections of
values, not collections of object ids (sets in Icon are collections of
object ids, and, e.g., Set([[], []]) in Icon is a set with two elements).
Value semantics darned near require copying, or fancier copy on write, under
the covers, and value semantics are most useful for sets of sets.  Once the
value has been established, you want to guarantee it never changes, not make
it easy to change it by accident <wink>.
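
A small illustration of those value semantics, using the built-in set and frozenset types of later Python versions as a stand-in for Set and ImmutableSet (an assumption; the 2002 module is not used here). Freezing takes a copy, so a later mutation of the original never reaches the outer set.

```python
inner = set("abracadabra")
outer = {frozenset(inner)}           # the outer set stores a frozen copy

inner.add("z")                       # mutate the original afterwards

snapshot = next(iter(outer))
print("z" in inner)                  # True
print("z" in snapshot)               # False: the stored value did not change
print(frozenset("abcdr") in outer)   # True: lookup is by value, not identity
```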

From Anthony Baxter <>  Thu Aug 29 17:31:42 2002
From: Anthony Baxter <> (Anthony Baxter)
Date: Fri, 30 Aug 2002 02:31:42 +1000
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <200208291631.g7TGVgd28718@localhost.localdomain>

>>> Skip Montanaro wrote
> One thing worth noting before everybody starts using it to massage their
> mailboxes is that the email package contains a bug which causes it to
> occasionally delete whitespace when reformatting headers. 

There's one other known problem - seriously misformatted MIME (as
seen in spam, and email from Microsoft Entourage) causes the email
package to barf out. I plan, at some point, to try and make an "if
it fails, just leave the body as one chunk of text" mode, but it's
a long long way down my list of priorities. 

Anthony Baxter     <>   
It's never too late to have a happy childhood.

From  Thu Aug 29 18:13:07 2002
From: (Eric S. Raymond)
Date: Thu, 29 Aug 2002 13:13:07 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
References: <> <>
Message-ID: <>

Tim Peters <>:
> Spammers often generate random "word-like" gibberish at the ends of msgs,
> and "rd" is one of the random two-letter combos that appears in the spam
> corpus.  Perhaps it would be good to ignore "words" with fewer than W
> characters (to be determined by experiment).

Bogofilter throws out words of length one and two.
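
As a rough sketch of that rule (the function name tokenize, the regular expression, and the min_len parameter are illustrative assumptions, not bogofilter's or GBayes's actual tokenizer):

```python
import re


def tokenize(text, min_len=3):
    """Yield lowercased words, skipping those shorter than min_len."""
    for word in re.findall(r"[A-Za-z0-9'$-]+", text):
        if len(word) >= min_len:
            yield word.lower()


print(list(tokenize("Buy it at rd 3rd St now")))   # ['buy', '3rd', 'now']
```

Making min_len a parameter keeps the cutoff W open to the kind of experiment suggested in the quoted text.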

> I expect that including the headers would have given these much better
> chances of getting through, given Robin and Alex's posting histories.
> Still, the idea of counting words multiple times is open to question, and
> experiments both ways are in order.

And bogofilter includes the headers.  This is important, since
otherwise you don't rate things like spamhaus addresses and sender
names.

Eric S. Raymond

From  Thu Aug 29 18:54:30 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 13:54:30 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Eric S. Raymond]
> Bogofilter throws out words of length one and two.

Right, I saw that.  It's something I'll run experiments against later.  I'm
running a 5x5 test grid (skipping the diagonal), and as was also true in
speech recognition, if I had been running against just one spam+ham training
corpus and just one spam+ham prediction set, I would have erroneously
concluded that various things either are improvements, are regressions, or
don't matter.  But some ideas obtained from staring at mistakes from one
test run turn out to be irrelevant, or even counter-productive, if applied
to other test runs.  The idea that some notion of "word" is important seems
highly defensible <wink>, but beyond that I discount claims that aren't
derived from a similarly paranoid testing setup.
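
For concreteness, a sketch of how such a grid can be enumerated. It assumes five disjoint spam+ham sets identified only by index; the function name grid_runs is hypothetical, and the training and prediction steps themselves are left out.

```python
from itertools import permutations


def grid_runs(n_sets=5):
    """Return (train, predict) index pairs, skipping the diagonal."""
    return list(permutations(range(n_sets), 2))


runs = grid_runs()
print(len(runs))     # 20 runs for a 5x5 grid without the diagonal
print(runs[:4])      # [(0, 1), (0, 2), (0, 3), (0, 4)]
```

A change is then judged across all 20 runs rather than on whichever single pair happened to be tried first.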

> ...
> And bogofilter includes the headers.  This is important, since
> otherwise you don't rate things like spamhaus addresses and sender
> names.

Of course -- the reasons I'm not using headers in these particular tests
have been spelled out several times.  They'll get added later, but for now I
don't have a large enough test set where doing so doesn't render the
classifier's job trivial.

From  Thu Aug 29 19:26:59 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 14:26:59 -0400
Subject: [Python-Dev] Mersenne Twister
References: <> <> <> <02c401c24f21$ceed79e0$> <007e01c24f5e$c84e73e0$5fb63bd0@othello> <>              <00a301c24f69$709a0e60$5fb63bd0@othello>  <>
Message-ID: <003b01c24f89$aa62a2e0$3ad8accf@othello>

I'm sketching out an approach to the Mersenne Twister and 
wanted to make sure it is in line with what you want.

-- Write it in pure python as a drop-in replacement for Wichmann-Hill.

-- Add a version number argument to Random() which defaults to two.
    If set to one, use the old generator so that it is possible to recreate
    sequences from earlier versions of Python.  Note, the code is much
    shorter if we drop this requirement.  On the plus side, it gives more
    than backwards compatibility, it gives the ability to re-run a
    simulation with another generator to assure that the result isn't
    a fluke related to a generator design flaw.

-- Document David Abrahams's link to as the
    reference implementation and as a place for
    more information.  Key-off of the MT19937 version as the most
    recent stable evolution.

-- Move the existing in-module test-suite into a unittest.  Add a new,
   separate unittest suite with tests specific to MT (recreating a few 
   sequences produced by reference implementations) and with a battery
   of Knuth style tests.  The validation results are at:

-- When we're done, have a python link put on the Mersenne Twister
    Home Page (the second link above).

-- Write, test and document the generator first.  Afterwards, explore
    techniques for creating multiple independent streams:


----- Original Message ----- 
From: "Guido van Rossum" <>
To: "Raymond Hettinger" <>
Cc: <>
Sent: Thursday, August 29, 2002 10:42 AM
Subject: Re: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]

> > > > > FWIW, in case "someone" cares:
> > > > > It's a nice
> > > > > library architecture, designed and implemented by people who
> > > > > know the domain, and I think it should be applicable to
> > > > > Python.
> > > >
> > > > I'm willing to implement this one.
> > >
> > > Please do!  (Have you got much experience with random number
> > > generation?)
> > 
> > Yes, but my experience is out-of-date.  I've read Knuth (esp the
> > part on testing generators), done numerical analysis, written
> > simulations and high-end crypto, etc.  The Mersenne Twister
> > algorithm is new to me -- studying it is part of my motivation to
> > volunteer to implement it.
> Cool!  You & Tim will have something to talk about.
> --Guido van Rossum (home page:

From  Thu Aug 29 19:32:16 2002
From: (Jeremy Hylton)
Date: Thu, 29 Aug 2002 14:32:16 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <003b01c24f89$aa62a2e0$3ad8accf@othello>
References: <>
Message-ID: <>

Why not wrap the existing C implementation?  I think a wrapper has two
advantages.  We get to reuse the existing implementation, without
worry for transliteration errors.  We also get better performance.


From  Thu Aug 29 19:46:42 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 14:46:42 -0400
Subject: [Python-Dev] Re: Mersenne Twister
In-Reply-To: Your message of "Thu, 29 Aug 2002 14:26:59 EDT."
References: <> <> <> <02c401c24f21$ceed79e0$> <007e01c24f5e$c84e73e0$5fb63bd0@othello> <> <00a301c24f69$709a0e60$5fb63bd0@othello> <>
Message-ID: <>

> -- Write it in pure python as a drop-in replacement for Wichmann-Hill.

Yup.  I think the seed arguments are different though -- MT takes a
single int, while whrandom takes three ints in range(256).

> -- Add a version number argument to Random() which defaults to two.
>     If set to one, use the old generator so that it is possible to recreate
>     sequences from earlier versions of Python.  Note, the code is much
>     shorter if we drop this requirement.  On the plus side, it gives more
>     than backwards compatibility, it gives the ability to re-run a
>     simulation with another generator to assure that the result isn't
>     a fluke related to a generator design flaw.

I think this is useful.  But I'd like to hear what Tim has to say.

> -- Document David Abrahams's link to 
> as the
>     reference implementation and

Hm.  What part of that file contains the actual algorithm?  I guess the
function void mersenne_twister<DataType,n,m,r,a,u,s,b,t,c,l,val>::twist()

> as a place for
>     more information.  Key-off of the MT19937 version as the most
>     recent stable evolution.

Sure.  It would be nice to have at least *some* documentation in-line
in case those links disappear.  Maybe you can quote the relevant C++
code from the Boost version (with attribution) in a comment.

> -- Move the existing in-module test-suite into a unittest.  Add a new,
>    separate unittest suite with tests specific to MT (recreating a few 
>    sequences produced by reference implementations) and with a battery
>    of Knuth style tests.  The validation results are at:  

It might be fun to have some heavy duty tests (which take hours or
days to run) checked in but not run by default.  We usually do this by
not naming the test file; it can then be run manually.

> -- When we're done, have a python link put on the Mersenne Twister
>     Home Page (the second link above).

Sounds like they would be only too eager to comply. :-)

> -- Write, test and document the generator first.  Afterwards, explore
>     techniques for creating multiple independent streams:

Isn't that trivial if you follow the WH implementation strategy which
stores all the state in a class instance?

--Guido van Rossum (home page:

From  Thu Aug 29 19:59:25 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 14:59:25 -0400
Subject: [Python-Dev] Re: A `cogen' module [was: Re: PEP 218 (sets); moving to Lib]
In-Reply-To: <00a301c24f69$709a0e60$5fb63bd0@othello>
Message-ID: <>

[Raymond Hettinger]
> ... but my experience is out-of-date.  I've read Knuth (esp the
> part on testing generators),

Knuth is behind the times here.  Better:


But if you're folding in the Twister, you don't have to test ab initio --
you just have to make sure the test vector they supply produces the results
they say it should.
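
A compact pure-Python sketch of MT19937 with that kind of known-answer check. The class name MT19937 and its method names are assumptions for illustration; the constants follow the authors' published 2002 reference code (the version with the revised seed initialization), and the expected value is the widely documented first 32-bit output for seed 5489.

```python
class MT19937:
    N, M = 624, 397

    def __init__(self, seed=5489):
        self.mt = [0] * self.N
        self.mt[0] = seed & 0xFFFFFFFF
        for i in range(1, self.N):
            prev = self.mt[i - 1]
            self.mt[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF
        self.index = self.N          # force a twist before the first output

    def _twist(self):
        """Regenerate the whole state array in place."""
        mt, N, M = self.mt, self.N, self.M
        for i in range(N):
            y = (mt[i] & 0x80000000) | (mt[(i + 1) % N] & 0x7FFFFFFF)
            mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)
        self.index = 0

    def next_u32(self):
        """Return the next tempered 32-bit output."""
        if self.index >= self.N:
            self._twist()
        y = self.mt[self.index]
        self.index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF


gen = MT19937(5489)
assert gen.next_u32() == 3499211612   # published first output for seed 5489
```

A real test would compare a longer run against the authors' own output file rather than a single value.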

> done numerical analysis, written simulations and high-end crypto, etc.
> The Mersenne Twister algorithm is new to me -- studying it is part of my
> motivation to volunteer to implement it.

It's been implemented for Python several times already (in Python code, and
as a C extension).  This is more of an integration and API task than a
write-code-from-scratch task.  Do visit the authors' home page:

Indeed, the authors have been reduced <wink> to announcing "yet another
Python module ...".

Note that a subtle weakness was discovered in the seed initialization early
this year, so that implementations older than that are suspect on this
count.  BTW, Knuth's lagged Fibonacci generator turned out to have the same
kind of initialization weakness, and was corrected in the ninth printing of
Vol 2:

If this stuff interests you <wink>, Ivan Frohne (a statistician) wrote a
wonderful pure-Python random-number package several years ago, including a
pure Python implementation of the Twister, and several other
stronger-than-WH (0, 1) base generators.  It's hard to keep track of that
package -- and of Ivan.  This may be the most recent version:

From  Thu Aug 29 20:00:03 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 15:00:03 -0400
Subject: [Python-Dev] Mersenne Twister
References: <><><><02c401c24f21$ceed79e0$><007e01c24f5e$c84e73e0$5fb63bd0@othello><><00a301c24f69$709a0e60$5fb63bd0@othello><><003b01c24f89$aa62a2e0$3ad8accf@othello> <>
Message-ID: <002001c24f8e$48d27e60$83ec7ad1@othello>

> Why not wrap the existing C implementation?  I think a wrapper has two
> advantages.  We get to reuse the existing implementation, without
> worry for transliteration errors.  We also get better performance.

On the plus side, it gives a chance to write a pure C helper 
function for creating many random numbers at a time.

On the minus side, random number generation is a much disputed
topic, occasionally requiring full disclosure of seeds and source.
Having the code in makes it more visible than
burying it in the C code.

The C code I saw is covered by a BSD license -- I don't 
know if that's an issue or not.

As for implementation difficulty or accuracy, the code is so short 
and clear that there isn't a savings from re-using the C code.


From  Thu Aug 29 20:14:30 2002
From: (Andrew P. Lentvorski)
Date: Thu, 29 Aug 2002 12:14:30 -0700 (PDT)
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <003b01c24f89$aa62a2e0$3ad8accf@othello>
Message-ID: <>

On Thu, 29 Aug 2002, Raymond Hettinger wrote:

> -- Add a version number argument to Random() which defaults to two.

Why not have a WHRandom() and a MersenneRandom() instance inside module
random?  That way you can even give a future behavior warning that
Random() is about to change and people can either choose the particular
generator they want or accept the default.

To my mind, this is a case of explicit (actually naming the generator
types) is better than implicit (version number?  Where's my documentation?
Which generator is which version?)  Maybe this isn't a big deal now, but I
can believe that we might accumulate another RNG or two (there are some
good reasons to want *weaker* or correlated RNGs) and having a weaker
generator with a *later* version number is just bound to cause havoc.


From  Thu Aug 29 20:21:46 2002
From: (Raymond Hettinger)
Date: Thu, 29 Aug 2002 15:21:46 -0400
Subject: [Python-Dev] Rehashing in PyDict_Copy
Message-ID: <004701c24f91$51ee0200$83ec7ad1@othello>

Is there a reason that dict.copy() runs like an update()?
It creates a new dict object, then re-hashes and inserts
every element one-by-one, complete with collisions.

I would have expected a single pass to update refcounts,
an allocation for identical size, and a memcpy to polish
it off.


From  Thu Aug 29 20:29:21 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 15:29:21 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: Your message of "Thu, 29 Aug 2002 12:14:30 PDT."
References: <>
Message-ID: <>

> > -- Add a version number argument to Random() which defaults to two.
> Why not have a WHRandom() and a MersenneRandom() instance inside module
> random?  That way you can even give a future behavior warning that
> Random() is about to change and people can either choose the particular
> generator they want or accept the default.
> To my mind, this is a case of explicit (actually naming the generator
> types) is better than implicit (version number?  Where's my documentation?
> Which generator is which version?)  Maybe this isn't a big deal now, but I
> can believe that we might accumulate another RNG or two (there are some
> good reasons to want *weaker* or correlated RNGs) and having a weaker
> generator with a *later* version number is just bound to cause havoc.

Hm, I hadn't realized that the random.Random class doesn't import the
whrandom module but simply reimplements it.

Here's an idea.

class BaseRandom: implements the end user methods: randrange(),
choice(), normalvariate(), etc., except random(), which is an abstract
method, raising NotImplementedError.

class WHRandom and class MersenneRandom: add the specific random
number generator implementation, as random().

Random is an alias for the random generator class of the day,
currently MersenneRandom.

Details: can MersenneRandom support jumpahead()?  Should it support
whseed(), which is provided only for backwards compatibility?

If someone pickles a Random instance with Python 2.2 and tries to
unpickle it with Python 2.3, this will fail, because (presumably) the
state for MersenneRandom is different from the state for WHRandom.
Perhaps there should be a module-level call to make Random an alias
for WHRandom rather than for MersenneRandom.
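
A rough sketch of that layout, under stated assumptions: only two end-user methods are shown, the WHRandom default seed values are arbitrary, and MersenneRandom simply delegates to today's standard library generator (MT19937 in CPython) instead of carrying its own implementation.

```python
import random as _stdlib_random


class BaseRandom:
    """End-user methods built on top of an abstract random()."""

    def random(self):
        raise NotImplementedError("subclasses supply the core generator")

    def uniform(self, a, b):
        return a + (b - a) * self.random()

    def choice(self, seq):
        return seq[int(self.random() * len(seq))]


class WHRandom(BaseRandom):
    """Wichmann-Hill core generator."""

    def __init__(self, x=1, y=2, z=3):
        self._seed = (x, y, z)

    def random(self):
        x, y, z = self._seed
        x = (171 * x) % 30269
        y = (172 * y) % 30307
        z = (170 * z) % 30323
        self._seed = (x, y, z)
        return (x / 30269.0 + y / 30307.0 + z / 30323.0) % 1.0


class MersenneRandom(BaseRandom):
    """Delegates to the standard library generator (an assumption)."""

    def __init__(self, seed=None):
        self._core = _stdlib_random.Random(seed)

    def random(self):
        return self._core.random()


Random = MersenneRandom      # the "generator class of the day"

print(WHRandom().choice("abc"), Random(42).uniform(0, 10))
```

Code that needs a particular generator names WHRandom or MersenneRandom explicitly; code that does not care uses the Random alias.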

--Guido van Rossum (home page:

From  Thu Aug 29 20:30:52 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 15:30:52 -0400
Subject: [Python-Dev] Rehashing in PyDict_Copy
In-Reply-To: Your message of "Thu, 29 Aug 2002 15:21:46 EDT."
References: <004701c24f91$51ee0200$83ec7ad1@othello>
Message-ID: <>

> Is there a reason that dict.copy() runs like an update()?
> It creates a new dict object, then re-hashes and inserts
> every element one-by-one, complete with collisions.
> I would have expected a single pass to update refcounts,
> an allocation for identical size, and a memcpy to polish
> it off.

After you've inserted and removed many elements into a dict, the
elements may not be in the best order, and there may be many "deleted"
markers.  The update() strategy avoids copying such cruft.
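
To make the trade-off concrete, here is a toy open-addressing table. It is not CPython's dict; the class ToyTable, its linear probing, and the fixed size are all simplifying assumptions. It compares a memcpy-style copy with a copy made by re-inserting the live entries.

```python
DELETED = object()   # tombstone marking a removed entry


class ToyTable:
    """Tiny open-addressing table (linear probing, no resizing)."""

    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        """Yield slot indices in probe order, at most one full cycle."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None or (slot is not DELETED and slot[0] == key):
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def remove(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                break
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED
                return
        raise KeyError(key)

    def items(self):
        return [s for s in self.slots if s is not None and s is not DELETED]

    def copy_raw(self):
        """memcpy-style copy: tombstones and probe order come along."""
        new = ToyTable(len(self.slots))
        new.slots = list(self.slots)
        return new

    def copy_by_reinsert(self):
        """update()-style copy: live entries only, re-hashed one by one."""
        new = ToyTable(len(self.slots))
        for key, value in self.items():
            new.insert(key, value)
        return new


t = ToyTable()
for k in range(6):
    t.insert(k, str(k))
for k in range(4):
    t.remove(k)

print(t.copy_raw().slots.count(DELETED))          # 4 tombstones copied
print(t.copy_by_reinsert().slots.count(DELETED))  # 0 tombstones
```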

--Guido van Rossum (home page:

From  Thu Aug 29 20:45:41 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 15:45:41 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <003b01c24f89$aa62a2e0$3ad8accf@othello>
Message-ID: <>

[Raymond Hettinger]
> I'm sketching out an approach to the Mersenne Twister and
> wanted to make sure it is in line with what you want.
> -- Write it in pure python as a drop-in replacement for Wichmann-Hill.

I'd rather the Twister were in C -- it's low-level bit-fiddling, and Python
isn't well-suited to high-throughput bit fiddling.  IIRC, Ivan Frohne
eventually got his pure-Python Twister implementation (see earlier msg) to
within 15% of whrandom's speed, but it *could* be 10x faster w/o even

> -- Add a version number argument to Random() which defaults to two.
>    If set to one, use the old generator so that it is possible
>    to recreate sequences from earlier versions of Python.  Note, the code
>    is much shorter if we drop this requirement.  On the plus side, it gives
>    more than backwards compatibility,

Backwards compatibility is essential, at least in the sense that there's
*some* way to exactly reproduce results from earlier Python releases.  The
seed function in was very weak (for reasons explained in, and I did make that incompatible when improving it, but also
introduced whseed so that there was *some* way to reproduce older results
bit-for-bit.  I've gotten exactly one complaint about the incompatibility
since then.

Note that new generators are intended to be introduced via subclassing of

    # Specific to Wichmann-Hill generator.  Subclasses wishing to use a
    # different core generator should override the seed(), random(),
    # getstate(), setstate() and jumpahead() methods.

I don't believe the jumpahead() method can be usefully implemented for the

>    it gives the ability to re-run a simulation with another generator to
>    assure that the result isn't a fluke related to a generator design flaw.

That's very important in real life.  Ivan Frohne's design (and alternative
generators) should be considered here.

> -- Document David Abrahams's link to
> as the
>     reference implementation and
> as a place for
>     more information.  Key-off of the MT19937 version as the most
>     recent stable evolution.

I'd simply use the authors' source code.  You don't get bonus points for
ripping out layers of C++ templates someone else gilded the lily under

> -- Move the existing in-module test-suite into a unittest.

Easier said than done.  The test suite doesn't do anything except print
results -- it has no intelligence about whether the results are good or bad.
It was expected that a bona fide expert would stare at the output.

>    Add a new, separate unittest suite with tests specific to MT (recreating
>    a few sequences produced by reference implementations)

I doubt you need a distinct test file for that.

>    and with a battery of Knuth style tests.

They're far behind the current state of the art, and, as Knuth mentions in
an exercise, it's "term project" level effort to implement them even so.

>    The validation results are at:
> -- When we're done, have a python link put on the Mersenne Twister
>     Home Page (the second link above).

Yes!  Cool idea.

> -- Write, test and document the generator first.

As opposed to what, eating dinner first <wink>?

>     Afterwards, explore
>     techniques for creating multiple independent streams:

Agreed that should be delayed.

From  Thu Aug 29 22:23:37 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 17:23:37 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <>
Message-ID: <>

> Here's an idea.
> class BaseRandom: implements the end user methods: randrange(),
> choice(), normalvariate(), etc., except random(), which is an abstract
> method, raising NotImplementedError.

That's fine, and close to the intended <wink> way to extend this.
BaseRandom should also leave as abstract seed(), getstate(), setstate() (but
should implement __getstate__ and __setstate__ -- Random already does this
correctly), and jumpahead().

> class WHRandom and class MersenneRandom: add the specific random
> number generator implementation, as random().

seed, getstate, setstate and jumpahead are also intimately connected to the
details of the base generator, so need also to be supplied by subclasses.

> Random is an alias for the random generator class of the day,
> currently MersenneRandom.

I like that better than what we've got now -- I would like to say that
Random may vary across releases, as the state of the art advances, so that
naive users are far less likely to fool themselves.  But users also need to
force use of a specific generator at times, and this scheme caters to both.

> Details: can MersenneRandom support jumpahead()?

Yes, but I don't think *usefully*.  That is, I doubt it could do it faster
than calling the base random() N times.  jumpahead() is easy to implement
efficiently for linear congruential generators, and possible to implement
efficiently for pure lagged Fibonacci generators, but that's about it.
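
To show why the linear congruential case is easy: one step is the affine map x -> a*x + c (mod m), and n steps are that map composed with itself n times, which repeated squaring computes in O(log n). The sketch below assumes a generic LCG; the function names and the example constants are illustrative only.

```python
def lcg_step(x, a, c, m):
    """One ordinary step of the generator."""
    return (a * x + c) % m


def lcg_jumpahead(x, n, a, c, m):
    """Advance x by n steps in O(log n) by squaring the affine map."""
    # A pair (mul, add) represents the map x -> mul*x + add (mod m).
    acc_mul, acc_add = 1, 0              # identity map
    cur_mul, cur_add = a % m, c % m      # a single step
    while n > 0:
        if n & 1:
            # Compose: apply acc first, then cur.
            acc_mul, acc_add = (cur_mul * acc_mul) % m, (cur_mul * acc_add + cur_add) % m
        # Square the current map: cur applied twice.
        cur_mul, cur_add = (cur_mul * cur_mul) % m, (cur_mul * cur_add + cur_add) % m
        n >>= 1
    return (acc_mul * x + acc_add) % m


a, c, m = 1103515245, 12345, 2 ** 31     # example constants only
x = 2002
for _ in range(1000):
    x = lcg_step(x, a, c, m)
print(x == lcg_jumpahead(2002, 1000, a, c, m))   # True
```

Nothing comparably cheap is shown here for the Twister, which is the point made above.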

> Should it support whseed(), which is provided only for backwards
> compatibility?

It should not.  whseed() shouldn't even be used by WHRandom users!  It's
solely for bit-for-bit reproducibility of an older and weaker scheme.

> If someone pickles a Random instance with Python 2.2 and tries to
> unpickle it with Python 2.3, this will fail, because (presumably) the
> state for MersenneRandom is different from the state for WHRandom.

That's in part why the concrete classes have to implement getstate() and
setstate() appropriately.  Note that every 2.2 Random pickle starts with the
integer 1 (the then-- and now --value of Random.VERSION).  That's enough
clue so that all future Python versions can know which flavor of random
pickle they're looking at.  A Twister pickle should start with some none-1
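
A minimal sketch of a state tuple that leads with a version tag. The class name VersionedState, the tag value 2, and the placeholder payload are assumptions for illustration; only the idea of a leading version integer comes from the text above.

```python
class VersionedState:
    """Hypothetical generator whose saved state starts with a version tag."""

    VERSION = 2                      # 1 is taken by the 2.2-era state

    def __init__(self):
        self._state = (0,)           # placeholder payload

    def getstate(self):
        return (self.VERSION,) + self._state

    def setstate(self, state):
        version = state[0]
        if version == self.VERSION:
            self._state = tuple(state[1:])
        elif version == 1:
            raise ValueError("state was saved by the version-1 generator")
        else:
            raise ValueError("unknown state version %r" % (version,))

    # The pickle hooks simply delegate to the public state methods.
    def __getstate__(self):
        return self.getstate()

    def __setstate__(self, state):
        self.setstate(state)


g = VersionedState()
g.setstate((2, 17, 42))
print(g.getstate())                  # (2, 17, 42)
```

A fuller version could route a version-1 state to the older generator instead of rejecting it.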

> Perhaps there should be a module-level call to make Random an alias
> for WHRandom rather than for MersenneRandom.

I suppose it would also have to replace all the other module globals
appropriately.  I'm thinking of the

_inst = Random()
seed = _inst.seed
random = _inst.random
uniform = _inst.uniform
randint = _inst.randint
choice = _inst.choice
randrange = _inst.randrange
shuffle = _inst.shuffle
normalvariate = _inst.normalvariate
lognormvariate = _inst.lognormvariate
cunifvariate = _inst.cunifvariate
expovariate = _inst.expovariate
vonmisesvariate = _inst.vonmisesvariate
gammavariate = _inst.gammavariate
stdgamma = _inst.stdgamma
gauss = _inst.gauss
betavariate = _inst.betavariate
paretovariate = _inst.paretovariate
weibullvariate = _inst.weibullvariate
getstate = _inst.getstate
setstate = _inst.setstate
jumpahead = _inst.jumpahead
whseed = _inst.whseed

cruft at the end here.

From  Thu Aug 29 22:39:52 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 17:39:52 -0400
Subject: [Python-Dev] Re: Mersenne Twister
In-Reply-To: <>
Message-ID: <>

> -- Write, test and document the generator first.  Afterwards, explore
>     techniques for creating multiple independent streams:

> Isn't that trivial if you follow the WH implementation strategy which
> stores all the state in a class instance?

"Independent" has more than one meaning.  The implementation meanings you
have in mind (two instances of the generator don't share state, and nothing
one does affects what the other does) are indeed trivially achieved via
attaching all state to an instance.

A different meaning of "independent" is "statistically uncorrelated", and
that's more what the link is aiming at.  It's never easy to get multiple,
statistically independent streams.  For example, using WH's jumpahead, it's
possible that you pick a large number N, jumpahead(N) in one instance of a
WH generator, and then the two streams turn out to be perfectly correlated.
That's trivially so if you pick N to be a multiple of WH's period, but there
are smaller values of N that also suffer.  One reason to run a simulation
multiple times with distinct generators is that it's pragmatically
impossible to outguess all this stuff.  Two generators are a sanity check;
three can break the impasse when the first two deliver significantly
different results; four is clearly excessive <wink>.
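
The trivial case is easy to demonstrate. Wichmann-Hill combines three multiplicative generators with prime moduli, so jumping ahead multiplies each component by a modular power; any N that is a multiple of every component period brings the state straight back. The function names below are assumptions made for this sketch.

```python
from math import gcd

MODULI = (30269, 30307, 30323)       # the three Wichmann-Hill prime moduli
MULTIPLIERS = (171, 172, 170)


def wh_jumpahead(state, n):
    """Advance each multiplicative component n steps via modular powers."""
    return tuple(
        (pow(a, n, m) * s) % m
        for a, m, s in zip(MULTIPLIERS, MODULI, state)
    )


def lcm(a, b):
    return a * b // gcd(a, b)


# Every component's period divides m - 1, so their lcm is a "bad" N.
bad_n = 1
for m in MODULI:
    bad_n = lcm(bad_n, m - 1)

state = (1, 2, 3)
print(wh_jumpahead(state, bad_n) == state)   # True: identical streams
```

The smaller problematic values of N mentioned above are not found by this sketch; it only shows the multiple-of-the-period case.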

From  Fri Aug 30 00:28:36 2002
From: (Jeremy Hylton)
Date: Thu, 29 Aug 2002 19:28:36 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
Message-ID: <>

I noticed that one frequently executed line in the mainloop is testing
whether either the ticker has dropped to 0 or if there are
things_to_do.  Would it be kosher to just drop the ticker to 0 whenever
things_to_do is set to true?  Then you'd only need to check the ticker
each time through.


Index: ceval.c
RCS file: /cvsroot/python/python/dist/src/Python/ceval.c,v
retrieving revision 2.332
diff -c -c -r2.332 ceval.c
*** ceval.c	23 Aug 2002 14:11:35 -0000	2.332
--- ceval.c	29 Aug 2002 23:19:18 -0000
*** 378,383 ****
--- 378,384 ----
  Py_AddPendingCall(int (*func)(void *), void *arg)
+ 	PyThreadState *tstate;
  	static int busy = 0;
  	int i, j;
  	/* XXX Begin critical section */
*** 395,400 ****
--- 396,404 ----
  	pendingcalls[i].func = func;
  	pendingcalls[i].arg = arg;
  	pendinglast = j;
+ 	tstate = PyThreadState_GET();
+ 	tstate->ticker = 0;
  	things_to_do = 1; /* Signal main loop */
  	busy = 0;
  	/* XXX End critical section */
*** 669,675 ****
  		   async I/O handler); see Py_AddPendingCall() and
  		   Py_MakePendingCalls() above. */
! 		if (things_to_do || --tstate->ticker < 0) {
  			tstate->ticker = tstate->interp->checkinterval;
  			if (things_to_do) {
  				if (Py_MakePendingCalls() < 0) {
--- 673,679 ----
  		   async I/O handler); see Py_AddPendingCall() and
  		   Py_MakePendingCalls() above. */
! 		if (--tstate->ticker < 0) {
  			tstate->ticker = tstate->interp->checkinterval;
  			if (things_to_do) {
  				if (Py_MakePendingCalls() < 0) {

From  Fri Aug 30 01:06:09 2002
From: (Guido van Rossum)
Date: Thu, 29 Aug 2002 20:06:09 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Thu, 29 Aug 2002 19:28:36 EDT."
References: <>
Message-ID: <>

> I noticed that one frequently executed line in the mainloop is testing
> whether either the ticker has dropped to 0 or if there are
> things_to_do.  Would it be kosher to just drop the ticker to 0 whenever
> things_to_do is set to true?  Then you'd only need to check the ticker
> each time through.

I think not -- Py_AddPendingCall() is supposed to be called from
interrupts and other low-level stuff, where you don't know whose
thread state you would get.  Too bad, it was a nice idea.

--Guido van Rossum (home page:

From  Fri Aug 30 02:23:54 2002
From: (Greg Ewing)
Date: Fri, 30 Aug 2002 13:23:54 +1200 (NZST)
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

> > Would it be kosher to just drop the ticker to 0 whenever
> > things_to_do is set to true?
> I think not -- Py_AddPendingCall() is supposed to be called from
> interrupts and other low-level stuff, where you don't know whose
> thread state you would get.

Could you have just one ticker, instead of one
per thread?

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | A citizen of NewZealandCorp, a	  |
Christchurch, New Zealand	   | wholly-owned subsidiary of USA Inc.  |	   +--------------------------------------+

From  Fri Aug 30 02:37:29 2002
From: (Skip Montanaro)
Date: Thu, 29 Aug 2002 20:37:29 -0500
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <>
Message-ID: <15726.52313.734491.272985@gargle.gargle.HOWL>

    Greg> Could you have just one ticker, instead of one per thread?

That would make ticker really count down checkinterval ticks.  Also, of
possible interest is this declaration and comment in longobject.c:

    static int ticker;  /* XXX Could be shared with ceval? */

Any time C code would want to read or update ticker, it would have the GIL,
right?  Sounds like you could get away with a single ticker.  The long int
implementation appears to do just fine with only one...


From  Fri Aug 30 02:51:20 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 21:51:20 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

> I noticed that one frequently executed line in the mainloop is testing
> whether either the ticker has dropped to 0 or if there are
> things_to_do.  Would it be kosher to just drop the ticker to 0 whenever
> things_to_do is set to true?  Then you'd only need to check the ticker
> each time through.

> I think not -- Py_AddPendingCall() is supposed to be called from
> interrupts and other low-level stuff, where you don't know whose
> thread state you would get.  Too bad, it was a nice idea.

Well ... does each tstate really need its own ticker?  If that were a
property of the interpreter instead ("number of ticks until it's time for
this interpreter to switch threads"), shared across all threads running in
that interpreter, then I bet the visible semantics would be much the same.
The GIL is always held when the ticker is decremented or reset, so there's
nothing non-deterministic in sharing it.
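
A toy model of that idea in Python (not the C code in ceval.c; the class Interp and its methods are hypothetical, and threads and the GIL are left out). With a single interpreter-level ticker, registering a pending call can zero the ticker, so the per-opcode fast path tests only one thing.

```python
class Interp:
    """Toy model of one interpreter with a single shared ticker."""

    def __init__(self, checkinterval=10):
        self.checkinterval = checkinterval
        self.ticker = checkinterval
        self.pending = []
        self.log = []

    def add_pending_call(self, func):
        self.pending.append(func)
        self.ticker = 0              # force the check on the next opcode

    def run(self, n_opcodes):
        for op in range(n_opcodes):
            self.ticker -= 1
            if self.ticker < 0:      # the only test on the fast path
                self.ticker = self.checkinterval
                while self.pending:
                    self.pending.pop(0)()
                self.log.append(("periodic", op))
            # ... the opcode itself would execute here ...


interp = Interp()
interp.add_pending_call(lambda: interp.log.append("called"))
interp.run(3)
print(interp.log)                    # ['called', ('periodic', 0)]
```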

From  Fri Aug 30 03:02:43 2002
From: (Skip Montanaro)
Date: Thu, 29 Aug 2002 21:02:43 -0500
Subject: [Python-Dev] single ticker patch
Message-ID: <15726.53827.402861.875268@gargle.gargle.HOWL>

Here's a single ticker patch:

It also gets rid of the private ticker in longobject.c.


From  Fri Aug 30 03:43:28 2002
From: (Tim Peters)
Date: Thu, 29 Aug 2002 22:43:28 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <002001c24f8e$48d27e60$83ec7ad1@othello>
Message-ID: <>

[Raymond Hettinger]
> ...
> The C code I saw is covered by a BSD license -- I don't
> know if that's an issue or not.

That's fine, provided it doesn't have the dreaded "advertising clause".  I
personally don't care whether it does -- it's the FSF that has bug up their
butt about that one.  I expect we'd have to reproduce their copyright notice
in the docs somewhere; yup:

     2. Redistributions in binary form must reproduce the above copyright
        notice, this list of conditions and the following disclaimer in
        the documentation and/or other materials provided with the
        distribution.

I think we *ought* to perform a similar courtesy for, e.g., the Tcl/Tk and
zlib components shipped with the Python Windows installer too.

> As for implementation difficulty or accuracy, the code is so short
> and clear that there isn't a savings from re-using the C code.

That isn't the point here.  If you use Nishimura and Matsumoto's code as
close to verbatim as possible, then that's the perfect answer to your
earlier point:

> On the minus side, random number generation is a much disputed
> topic, occasionally requiring full disclosure of seeds and source.

Nothing *could* be more fully disclosed than their source code:  it's
extremely well known to every worker in the field, and has gotten critical
review from the smartest eyeballs in the world.

From  Fri Aug 30 05:23:15 2002
From: (David Goodger)
Date: Fri, 30 Aug 2002 00:23:15 -0400
Subject: [Python-Dev] ANN: New PEP Format: reStructuredText
Message-ID: <>

With many thanks to Barry Warsaw for his help and patience, I am
pleased to announce that a new format for PEPs (Python Enhancement
Proposals) has been deployed.  The new format is reStructuredText, a
lightweight what-you-see-is-what-you-get plaintext markup syntax and
parser component of the Docutils project.  From the new PEP 12:

    ReStructuredText is offered as an alternative to plaintext PEPs,
    to allow PEP authors more functionality and expressivity, while
    maintaining easy readability in the source text.  The processed
    HTML form makes the functionality accessible to readers: live
    hyperlinks, styled text, tables, images, and automatic tables of
    contents, among other advantages.

The following PEPs have been marked up with reStructuredText:

- PEP 12 -- Sample reStructuredText PEP Template

- PEP 256 -- Docstring Processing System Framework

- PEP 257 -- Docstring Conventions

- PEP 258 -- Docutils Design Specification

- PEP 287 -- reStructuredText Docstring Format

- PEP 290 -- Code Migration and Modernization

In addition, the text of PEP 1 and PEP 9 has been revised.

Authors of new PEPs are invited to consider using the new format, and
authors of existing PEPs are invited to convert their PEPs to
reStructuredText to take advantage of the many enhancements over the
plaintext format.  I, along with the other Docutils developers and
users, will be happy to assist.  Please send questions to:

The latest project snapshot can always be downloaded from:

(This is required to process the PEP source into HTML.  It requires
at least Python 2.0; Python 2.1 or later is recommended.)

Docutils and reStructuredText are under active development.  Input is
very welcome, especially HTML rendering/stylesheet issues with
different browsers.  We welcome new contributors.  If you'd like to
get involved, please visit:

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug 30 05:23:42 2002
From: (David Goodger)
Date: Fri, 30 Aug 2002 00:23:42 -0400
Subject: [Python-Dev] PEP 12 -- Sample reStructuredText PEP Template
Message-ID: <>

This PEP presents an alternative format to that of PEP 9.
Feedback is welcome.

-- David Goodger

PEP: 12
Title: Sample reStructuredText PEP Template
Version: $Revision: 1.3 $
Last-Modified: $Date: 2002/08/30 04:11:20 $
Author: David Goodger <>,
        Barry A. Warsaw <>
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 05-Aug-2002
Post-History: 30-Aug-2002


This PEP provides a boilerplate or sample template for creating your
own reStructuredText PEPs.  In conjunction with the content guidelines
in PEP 1 [1]_, this should make it easy for you to conform your own
PEPs to the format outlined below.

Note: if you are reading this PEP via the web, you should first grab
the text (reStructuredText) source of this PEP in order to complete
the steps below.

To get the source of this (or any) PEP, look at the top of the HTML
page and click on the link titled "PEP Source".

If you would prefer not to use markup in your PEP, please see PEP 9,
"Sample Plaintext PEP Template" [2]_.


PEP submissions come in a wide variety of forms, not all adhering
to the format guidelines set forth below.  Use this template, in
conjunction with the format guidelines below, to ensure that your
PEP submission won't get automatically rejected because of form.

ReStructuredText is offered as an alternative to plaintext PEPs, to
allow PEP authors more functionality and expressivity, while
maintaining easy readability in the source text.  The processed HTML
form makes the functionality accessible to readers: live hyperlinks,
styled text, tables, images, and automatic tables of contents, among
other advantages.  For an example of a PEP marked up with
reStructuredText, see PEP 287.

How to Use This Template

To use this template you must first decide whether your PEP is going
to be an Informational or Standards Track PEP.  Most PEPs are
Standards Track because they propose a new feature for the Python
language or standard library.  When in doubt, read PEP 1 for details
or contact the PEP editors <>.

Once you've decided which type of PEP yours is going to be, follow the
directions below.

- Make a copy of this file (``.txt`` file, **not** HTML!) and perform
  the following edits.

- Replace the "PEP: 12" header with "PEP: XXX" since you don't yet have
  a PEP number assignment.

- Change the Title header to the title of your PEP.

- Leave the Version and Last-Modified headers alone; we'll take care
  of those when we check your PEP into CVS.

- Change the Author header to include your name, and optionally your
  email address.  Be sure to follow the format carefully: your name
  must appear first, and it must not be contained in parentheses.
  Your email address may appear second (or it can be omitted) and if
  it appears, it must appear in angle brackets.  It is okay to
  obfuscate your email address.

- If there is a mailing list for discussion of your new feature, add a
  Discussions-To header right after the Author header.  You should not
  add a Discussions-To header if the mailing list to be used is either or, or if discussions
  should be sent to you directly.  Most Informational PEPs don't have
  a Discussions-To header.

- Change the Status header to "Draft".

- For Standards Track PEPs, change the Type header to "Standards
  Track".

- For Informational PEPs, change the Type header to "Informational".

- For Standards Track PEPs, if your feature depends on the acceptance
  of some other currently in-development PEP, add a Requires header
  right after the Type header.  The value should be the PEP number of
  the PEP yours depends on.  Don't add this header if your dependent
  feature is described in a Final PEP.

- Change the Created header to today's date.  Be sure to follow the
  format carefully: it must be in ``dd-mmm-yyyy`` format, where the
  ``mmm`` is the 3 English letter month abbreviation, i.e. one of Jan,
  Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.

- For Standards Track PEPs, after the Created header, add a
  Python-Version header and set the value to the next planned version
  of Python, i.e. the one your new feature will hopefully make its
  first appearance in.  Do not use an alpha or beta release
  designation here.  Thus, if the last version of Python was 2.2 alpha
  1 and you're hoping to get your new feature into Python 2.2, set the
  header to::

      Python-Version: 2.2

- Leave Post-History alone for now; you'll add dates to this header
  each time you post your PEP to or  If you posted your PEP to the lists on
  August 14, 2001 and September 3, 2001, the Post-History header would
  look like::

      Post-History: 14-Aug-2001, 03-Sep-2001

  You must manually add new dates and check them in.  If you don't
  have check-in privileges, send your changes to the PEP editors.

- Add a Replaces header if your PEP obsoletes an earlier PEP.  The
  value of this header is the number of the PEP that your new PEP is
  replacing.  Only add this header if the older PEP is in "final"
  form, i.e. is either Accepted, Final, or Rejected.  You aren't
  replacing an older open PEP if you're submitting a competing idea.

- Now write your Abstract, Rationale, and other content for your PEP,
  replacing all this gobbledygook with your own text. Be sure to
  adhere to the format guidelines below, specifically on the
  prohibition of tab characters and the indentation requirements.

- Update your References and Copyright section.  Usually you'll place
  your PEP into the public domain, in which case just leave the
  Copyright section alone.  Alternatively, you can use the `Open
  Publication License`__, but public domain is still strongly
  preferred.


- Leave the Emacs stanza at the end of this file alone, including the
  formfeed character ("^L", or ``\f``).

- Send your PEP submission to the PEP editors at
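
The ``dd-mmm-yyyy`` format required for the Created and Post-History
headers can be produced mechanically.  The following is a minimal
sketch in present-day Python; the names ``MONTHS`` and ``pep_date``
are made up for this illustration.  It uses an explicit month table
rather than a locale-dependent abbreviation.

```python
import datetime

# The twelve abbreviations allowed in the header, in calendar order.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def pep_date(day):
    """Format a datetime.date as dd-mmm-yyyy for a PEP header."""
    return "%02d-%s-%04d" % (day.day, MONTHS[day.month - 1], day.year)


created = pep_date(datetime.date(2002, 8, 5))
```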

ReStructuredText PEP Formatting Requirements

The following is a PEP-specific summary of reStructuredText syntax.
For the sake of simplicity and brevity, much detail is omitted.  For
more detail, see `Resources`_ below.  `Literal blocks`_ (in which no
markup processing is done) are used for examples throughout, to
illustrate the plaintext markup.


You must adhere to the Emacs convention of adding two spaces at the
end of every sentence.  You should fill your paragraphs to column 70,
but under no circumstances should your lines extend past column 79.
If your code samples spill over column 79, you should rewrite them.

Tab characters must never appear in the document at all.  A PEP should
include the standard Emacs stanza included by example at the bottom of
this PEP.
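
The two hard rules above (no tab characters, nothing past column 79)
are easy to check before submitting.  The following is a minimal
sketch in present-day Python; ``check_pep_lines`` is a hypothetical
helper written for this illustration, not part of any PEP tooling.

```python
def check_pep_lines(text, max_col=79):
    """Return a list of (line_number, problem) tuples for a PEP source."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        if len(line) > max_col:
            problems.append((number, "line extends past column %d" % max_col))
    return problems


sample = "A good line.\n\tAn indented-with-tab line.\n" + "x" * 80
report = check_pep_lines(sample)
```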

Section Headings

PEP headings must begin in column zero and the initial letter of each
word must be capitalized as in book titles.  Acronyms should be in all
capitals.  Section titles must be adorned with an underline, a single
repeated punctuation character, which begins in column zero and must
extend at least as far as the right edge of the title text (4
characters minimum).  First-level section titles are underlined with
"=" (equals signs), second-level section titles with "-" (hyphens),
and third-level section titles with "'" (single quotes or
apostrophes).  For example::

    First-Level Title
    =================

    Second-Level Title
    ------------------

    Third-Level Title
    '''''''''''''''''

If there are more than three levels of sections in your PEP, you may
insert overline/underline-adorned titles for the first and second
levels as follows::

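    ============================
    First-Level Title (optional)
    ============================

    -----------------------------
    Second-Level Title (optional)
    -----------------------------

    Third-Level Title
    =================

    Fourth-Level Title
    ------------------

    Fifth-Level Title
    '''''''''''''''''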

You shouldn't have more than five levels of sections in your PEP.  If
you do, you should consider rewriting it.
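
For the common three-level case, the underline rule (one repeated
character, at least as long as the title and never shorter than 4
characters) can be sketched in present-day Python as follows.  The
names ``ADORNMENTS`` and ``section_title`` are made up for this
illustration.

```python
# Underline character for each of the three ordinary section levels.
ADORNMENTS = {1: "=", 2: "-", 3: "'"}


def section_title(title, level):
    """Return the title followed by an underline of the proper length."""
    underline = ADORNMENTS[level] * max(len(title), 4)
    return title + "\n" + underline


heading = section_title("Abstract", 1)
```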

You must use two blank lines between the last line of a section's body
and the next section heading.  If a subsection heading immediately
follows a section heading, a single blank line in-between is
sufficient.

The body of each section is not normally indented, although some
constructs do use indentation, as described below.  Blank lines are
used to separate constructs.


Paragraphs are left-aligned text blocks separated by blank lines.
Paragraphs are not indented unless they are part of an indented
construct (such as a block quote or a list item).

Inline Markup

Portions of text within paragraphs and other text blocks may be
styled.  For example::

    Text may be marked as *emphasized* (single asterisk markup,
    typically shown in italics) or **strongly emphasized** (double
    asterisks, typically boldface).  ``Inline literals`` (using double
    backquotes) are typically rendered in a monospaced typeface.  No
    further markup recognition is done within the double backquotes,
    so they're safe for any kind of code snippets.

Block Quotes

Block quotes consist of indented body elements.  For example::

    This is a paragraph.

        This is a block quote.

        A block quote may contain many paragraphs.

Block quotes are used to quote extended passages from other sources.
Block quotes may be nested inside other body elements.  Use 4 spaces
per indent level.

Literal Blocks

..
    In the text below, double backquotes are used to denote inline
    literals.  "``::``" is written so that the colons will appear in a
    monospaced font; the backquotes (``) are markup, not part of the
    text.  See "Inline Markup" above.

    By the way, this is a comment, described in "Comments" below.

Literal blocks are used for code samples or preformatted ASCII art. To
indicate a literal block, preface the indented text block with
"``::``" (two colons).  The literal block continues until the end of
the indentation.  Indent the text block by 4 spaces.  For example::

    This is a typical paragraph.  A literal block follows.

    ::

        for a in [5,4,3,2,1]:   # this is program code, shown as-is
            print a
        print "it's..."
        # a literal block continues until the indentation ends

The paragraph containing only "``::``" will be completely removed from
the output; no empty paragraph will remain.  "``::``" is also
recognized at the end of any paragraph.  If immediately preceded by
whitespace, both colons will be removed from the output.  When text
immediately precedes the "``::``", *one* colon will be removed from
the output, leaving only one colon visible (i.e., "``::``" will be
replaced by "``:``").  For example, one colon will remain visible
here::


        Literal block


Bullet list items begin with one of "-", "*", or "+" (hyphen,
asterisk, or plus sign), followed by whitespace and the list item
body.  List item bodies must be left-aligned and indented relative to
the bullet; the text immediately after the bullet determines the
indentation.  For example::

    This paragraph is followed by a list.

    * This is the first bullet list item.  The blank line above the
      first list item is required; blank lines between list items
      (such as below this paragraph) are optional.

    * This is the first paragraph in the second item in the list.

      This is the second paragraph in the second item in the list.
      The blank line above this paragraph is required.  The left edge
      of this paragraph lines up with the paragraph above, both
      indented relative to the bullet.

      - This is a sublist.  The bullet lines up with the left edge of
        the text blocks above.  A sublist is a new list so requires a
        blank line above and below.

    * This is the third item of the main list.

    This paragraph is not part of the list.

Enumerated (numbered) list items are similar, but use an enumerator
instead of a bullet.  Enumerators are numbers (1, 2, 3, ...), letters
(A, B, C, ...; uppercase or lowercase), or Roman numerals (i, ii, iii,
iv, ...; uppercase or lowercase), formatted with a period suffix
("1.", "2."), parentheses ("(1)", "(2)"), or a right-parenthesis
suffix ("1)", "2)").  For example::

    1. As with bullet list items, the left edge of paragraphs must
       align.

    2. Each list item may contain multiple paragraphs, sublists, etc.

       This is the second paragraph of the second list item.

       a) Enumerated lists may be nested.
       b) Blank lines may be omitted between list items.

Definition lists are written like this::

        Definition lists associate a term with a definition.

        The term is a one-line phrase, and the definition is one
        or more paragraphs or body elements, indented relative to
        the term.


Simple tables are easy and compact::

    =====  =====  =======
      A      B    A and B
    =====  =====  =======
    False  False  False
    True   False  False
    False  True   False
    True   True   True
    =====  =====  =======

There must be at least two columns in a table (to differentiate from
section titles).  Column spans use underlines of hyphens ("Inputs"
spans the first two columns)::

    =====  =====  ======
       Inputs     Output
    ------------  ------
      A      B    A or B
    =====  =====  ======
    False  False  False
    True   False  True
    False  True   True
    True   True   True
    =====  =====  ======

Text in a first-column cell starts a new row.  No text in the first
column indicates a continuation line; the rest of the cells may
consist of multiple lines.  For example::

    =====  =========================
    col 1  col 2
    =====  =========================
    1      Second column of row 1.
    2      Second column of row 2.
           Second line of paragraph.
    3      - Second column of row 3.

           - Second item in bullet
             list (row 3, column 2).
    =====  =========================


When referencing an external web page in the body of a PEP, you should
include the title of the page in the text, with either an inline
hyperlink reference to the URL or a footnote reference (see
`Footnotes`_ below).  Do not include the URL in the body text of the
PEP.

Hyperlink references use backquotes and a trailing underscore to mark
up the reference text; backquotes are optional if the reference text
is a single word.  For example::

    In this paragraph, we refer to the `Python web site`_.

An explicit target provides the URL.  Put targets in a References
section at the end of the PEP, or immediately after the reference.
Hyperlink targets begin with two periods and a space (the "explicit
markup start"), followed by a leading underscore, the reference text,
a colon, and the URL (absolute or relative)::

    .. _Python web site:

The reference text and the target text must match (although the match
is case-insensitive and ignores differences in whitespace).  Note that
the underscore trails the reference text but precedes the target text.
If you think of the underscore as a right-pointing arrow, it points
*away* from the reference and *toward* the target.
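
The matching rule (case-insensitive, ignoring differences in
whitespace) can be illustrated with a small normalization function.
This is a sketch in present-day Python of the rule as stated here;
``normalize_reference_name`` is a hypothetical name, and the Docutils
parser may normalize names differently in detail.

```python
def normalize_reference_name(text):
    """Lowercase the text and collapse each run of whitespace to one space."""
    return " ".join(text.lower().split())


reference = normalize_reference_name("Python web site")
target = normalize_reference_name("python   Web\n    site")
```

Here ``reference`` and ``target`` compare equal, so the reference
would find its target.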

The same mechanism can be used for internal references.  Every unique
section title implicitly defines an internal hyperlink target.  We can
make a link to the Abstract section like this::

    Here is a hyperlink reference to the `Abstract`_ section.  The
    backquotes are optional since the reference text is a single word;
    we can also just write: Abstract_.

Footnotes containing the URLs from external targets will be generated
automatically at the end of the References section of the PEP, along
with footnote references linking the reference text to the footnotes.

Text of the form "PEP x" or "RFC x" (where "x" is a number) will be
linked automatically to the appropriate URLs.


Footnote references consist of a left square bracket, a number, a
right square bracket, and a trailing underscore::

    This sentence ends with a footnote reference [1]_.

Whitespace must precede the footnote reference.  Leave a space between
the footnote reference and the preceding word.

When referring to another PEP, include the PEP number in the body
text, such as "PEP 1".  The title may optionally appear.  Add a
footnote reference following the title.  For example::

    Refer to PEP 1 [2]_ for more information.

Add a footnote that includes the PEP's title and author.  It may
optionally include the explicit URL on a separate line, but only in
the References section.  Footnotes begin with ".. " (the explicit
markup start), followed by the footnote marker (no underscores),
followed by the footnote body.  For example::


    .. [2] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton

If you decide to provide an explicit URL for a PEP, please use this as
the URL template::

PEP numbers in URLs must be padded with zeros from the left, so as to
be exactly 4 characters wide; however, PEP numbers in the text are
never padded.
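
A minimal sketch of the padding rule in present-day Python.
``pep_number_for_url`` is a hypothetical helper that produces only
the padded number, not a complete URL.

```python
def pep_number_for_url(number):
    """Zero-pad a PEP number to exactly 4 characters for use in a URL."""
    if not 0 <= number <= 9999:
        raise ValueError("PEP number does not fit in 4 digits: %r" % (number,))
    return "%04d" % number


padded = pep_number_for_url(1)      # form used inside a URL
in_text = "PEP %d" % 1              # form used in body text, never padded
```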

During the course of developing your PEP, you may have to add, remove,
and rearrange footnote references, possibly resulting in mismatched
references, obsolete footnotes, and confusion.  Auto-numbered
footnotes allow more freedom.  Instead of a number, use a label of the
form "#word", where "word" is a mnemonic consisting of alphanumerics
plus internal hyphens, underscores, and periods (no whitespace or
other characters are allowed).  For example::

    Refer to PEP 1 [#PEP-1]_ for more information.


    .. [#PEP-1] PEP 1, "PEP Purpose and Guidelines", Warsaw, Hylton

Footnotes and footnote references will be numbered automatically, and
the numbers will always match.  Once a PEP is finalized, auto-numbered
labels should be replaced by numbers for simplicity.
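
The label rule above ("#" followed by alphanumerics with internal
hyphens, underscores, and periods) can be approximated with a regular
expression.  This is a sketch in present-day Python;
``is_autonumber_label`` is a made-up name, and the pattern is one
reading of the rule, allowing a single separator between runs of
alphanumerics.  It is not the parser's own definition.

```python
import re

# "#", then alphanumeric runs joined by single "-", "_" or "." characters.
LABEL = re.compile(r"#[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*\Z")


def is_autonumber_label(label):
    """Return True if label looks like a valid "#word" footnote label."""
    return LABEL.match(label) is not None


ok = is_autonumber_label("#PEP-1")
```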


If your PEP contains a diagram, you may include it in the processed
output using the "image" directive::

    .. image:: diagram.png

Any browser-friendly graphics format is possible: .png, .jpeg, .gif,
.tiff, etc.

Since this image will not be visible to readers of the PEP in source
text form, you should consider including a description or ASCII art
alternative, using a comment (below).


A comment block is an indented block of arbitrary text immediately
following an explicit markup start: two periods and whitespace.  Leave
the ".." on a line by itself to ensure that the comment is not
misinterpreted as another explicit markup construct.  Comments are not
visible in the processed document.  For the benefit of those reading
your PEP in source form, please consider including descriptions of,
or ASCII art alternatives to, any images you include.  For example::

     .. image:: dataflow.png

     ..

        Data flows from the input module, through the "black box"
        module, and finally into (and through) the output module.

The Emacs stanza at the bottom of this document is inside a comment.

Escaping Mechanism

reStructuredText uses backslashes ("``\``") to override the special
meaning given to markup characters and get the literal characters
themselves.  To get a literal backslash, use an escaped backslash
("``\\``").  There are two contexts in which backslashes have no
special meaning: `literal blocks`_ and inline literals (see `Inline
Markup`_ above).  In these contexts, no markup recognition is done,
and a single backslash represents a literal backslash, without having
to double up.

If you find that you need to use a backslash in your text, consider
using inline literals or a literal block instead.

Habits to Avoid

Programmers who are familiar with TeX often write quotation marks
like this::

    `single-quoted' or ``double-quoted''

Backquotes are significant in reStructuredText, so this practice
should be avoided.  For ordinary text, use ordinary 'single-quotes' or
"double-quotes".  For inline literal text (see `Inline Markup`_
above), use double-backquotes::

    ``literal text: in here, anything goes!``


Many other constructs and variations are possible.  For more details
about the reStructuredText markup, in increasing order of
thoroughness, please see:

* `A ReStructuredText Primer`__, a gentle introduction.


* `Quick reStructuredText`__, a users' quick reference.


* `reStructuredText Markup Specification`__, the final authority.


The processing of reStructuredText PEPs is done using Docutils_.  If
you have a question or require assistance with reStructuredText or
Docutils, please `post a message`_ to the `Docutils-Users mailing
list`_.  The `Docutils project web site`_ has more information.

.. _Docutils:
.. _post a message:
.. _Docutils-Users mailing list:
.. _Docutils project web site:


.. [1] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton

.. [2] PEP 9, Sample Plaintext PEP Template, Warsaw


This document has been placed in the public domain.

   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70

From  Fri Aug 30 05:58:34 2002
From: (David Goodger)
Date: Fri, 30 Aug 2002 00:58:34 -0400
Subject: [Python-Dev] Re: ANN: New PEP Format: reStructuredText
In-Reply-To: <>
Message-ID: <>

Almost forgot.  The Docutils package must be installed in order to
process the new-format PEPs.  There are
instructions for getting and installing the Docutils package for this
purpose here:

This is the processed form of python/nondist/peps/README.txt in CVS.

David Goodger  <>  Open-source projects:
  - Python Docutils:
    (includes reStructuredText:
  - The Go Tools Project:

From  Fri Aug 30 06:54:14 2002
From: (Oren Tirosh)
Date: Fri, 30 Aug 2002 01:54:14 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <>
References: <002001c24f8e$48d27e60$83ec7ad1@othello> <>
Message-ID: <>

On Thu, Aug 29, 2002 at 10:43:28PM -0400, Tim Peters wrote:
> [Raymond Hettinger]
> > ...
> > The C code I saw is covered by a BSD license -- I don't
> > know if that's an issue or not.
> That's fine, provided it doesn't have the dreaded "advertising clause".  I
> personally don't care whether it does -- it's the FSF that has bug up their
> butt about that one.  I expect we'd have to reproduce their copyright notice

The Mersenne Twister distribution is now under the Artistic License.


From  Fri Aug 30 07:35:55 2002
From: (Delaney, Timothy)
Date: Fri, 30 Aug 2002 16:35:55 +1000
Subject: [Python-Dev] SF Bug #602245: os.popen() negative error code IOError
Message-ID: <>

I just submitted a bug at SF entitled 'os.popen() negative error code
IOError'. However, not knowing SF too well, I've messed up the formatting of
the test code, so here it is.

When a negative return code is received by the os.popen() family, an IOError
is raised when the last pipe from the process is closed.

The following code demonstrates the problem: 

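import sys
import os
import traceback

if __name__ == '__main__':

    if len(sys.argv) == 1:

        try:
            r = os.popen('%s %s %s' % (sys.executable, sys.argv[0], -1,))
            r.close()  # closing the last pipe raises the IOError
        except IOError:
            traceback.print_exc()

        try:
            w, r = os.popen2('%s %s %s' % (sys.executable, sys.argv[0], -1,))
            w.close()
            r.close()
        except IOError:
            traceback.print_exc()

        try:
            w, r, e = os.popen3('%s %s %s' % (sys.executable, sys.argv[0], -1,))
            w.close()
            r.close()
            e.close()
        except IOError:
            traceback.print_exc()

    else:
        # Child process: exit with the (negative) code given on the command line.
        sys.exit(int(sys.argv[1]))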

---------- Run ----------
Traceback (most recent call last):
  File "Q:\Viper\src\webvis\tests\", line 11, in ?
IOError: (0, 'Error')
Traceback (most recent call last):
  File "Q:\Viper\src\webvis\tests\", line 18, in ?
IOError: (0, 'Error')
Traceback (most recent call last):
  File "Q:\Viper\src\webvis\tests\", line 26, in ?
IOError: (0, 'Error')

Tim Delaney

From  Fri Aug 30 09:18:26 2002
From: (Jack Jansen)
Date: Fri, 30 Aug 2002 10:18:26 +0200
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <15726.52313.734491.272985@gargle.gargle.HOWL>
Message-ID: <>

On Friday, August 30, 2002, at 03:37 , Skip Montanaro wrote:

>     Greg> Could you have just one ticker, instead of one per thread?
> That would make ticker really count down checkinterval ticks.  Also, of
> possible interest is this declaration and comment in longobject.c:
>     static int ticker;  /* XXX Could be shared with ceval? */
> Any time C code would want to read or update ticker, it would have the
> GIL, right?

Not if the idea that led to this thread (clearing ticker if something
is put in things_to_do) is implemented, because we may be in an
interrupt routine at the time we fiddle things_to_do.

And I don't think we can be sure that even clearing is guaranteed to
work (if another thread is halfway a load-decrement-store sequence
the clear could be lost).

- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -

From  Fri Aug 30 13:33:29 2002
From: (Guido van Rossum)
Date: Fri, 30 Aug 2002 08:33:29 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Fri, 30 Aug 2002 10:18:26 +0200."
References: <>
Message-ID: <>

> And I don't think we can be sure that even clearing is guaranteed to
> work (if another thread is halfway a load-decrement-store sequence
> the clear could be lost).

I think that pretty much kills the idea.  Has anybody checked whether
it causes a measurable speedup?  If not, I propose not to bother.
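
For the record, the lost clear can be spelled out step by step.  This
is only a sketch in present-day Python of one possible interleaving;
the function name is made up, and the real code is a C decrement in
ceval.c, with the clear coming from an interrupt or another thread.

```python
def decrement_with_clear_in_the_middle(ticker):
    """Replay one interleaving of 'ticker -= 1' with an asynchronous clear."""
    loaded = ticker      # thread: load the current value
    ticker = 0           # interrupt routine: clear the ticker
    loaded -= 1          # thread: decrement its private copy
    ticker = loaded      # thread: store, overwriting the clear
    return ticker


after = decrement_with_clear_in_the_middle(57)
```

The ticker ends up at 56 instead of 0, so the mainloop would not notice
the pending call until the countdown finishes on its own.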

--Guido van Rossum (home page:

From  Fri Aug 30 13:38:52 2002
From: (Skip Montanaro)
Date: Fri, 30 Aug 2002 07:38:52 -0500
Subject: [Python-Dev] Re: SF Bug #602245: os.popen() negative error code IOError
In-Reply-To: <>
References: <>
Message-ID: <15727.26460.515083.862490@gargle.gargle.HOWL>

    Tim> I just submitted a bug at SF entitled 'os.popen() negative error
    Tim> code IOError'. However, not knowing SF too well, I've messed up the
    Tim> formatting of the test code, so here it is.

It's actually formatted okay, if you save or view source you can cut out
your test pretty easily.  Still, anytime you want to include more than two
or three lines of code, you're going to be much better off attaching the
code (which I just did).

Skip Montanaro

From  Fri Aug 30 14:59:04 2002
From: (Skip Montanaro)
Date: Fri, 30 Aug 2002 08:59:04 -0500
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <15726.52313.734491.272985@gargle.gargle.HOWL>
Message-ID: <15727.31272.80804.453415@gargle.gargle.HOWL>

    Skip> Any time C code would want to read or update ticker, it would have
    Skip> the GIL, right?

    Jack> Not if the idea that led to this thread (clearing ticker if
    Jack> something is put in things_to_do) is implemented, because we may
    Jack> be in an interrupt routine at the time we fiddle things_to_do.

    Jack> And I don't think we can be sure that even clearing is guaranteed
    Jack> to work (if another thread is halfway a load-decrement-store
    Jack> sequence the clear could be lost).

Hmm...  I guess you lost me.  The code that fiddles the ticker in ceval.c
clearly operates while the GIL is held.  I think the code in sysmodule.c
that updates the checkinterval works under that assumption as well.  The
other ticker in longobject.c I'm not so sure about.

The patch I submitted doesn't implement the ticker clear that Jeremy
originally suggested.  It just pulls the ticker and the checkinterval out of
the thread state and makes them two globals.  They are both manipulated in
otherwise the same way.


From  Fri Aug 30 15:13:52 2002
From: (Guido van Rossum)
Date: Fri, 30 Aug 2002 10:13:52 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Fri, 30 Aug 2002 08:59:04 CDT."
References: <15726.52313.734491.272985@gargle.gargle.HOWL> <>
Message-ID: <>

>     Skip> Any time C code would want to read or update ticker, it would have
>     Skip> the GIL, right?
>     Jack> Not if the idea that led to this thread (clearing ticker if
>     Jack> something is put in things_to_do) is implemented, because we may
>     Jack> be in an interrupt routine at the time we fiddle things_to_do.
>     Jack> And I don't think we can be sure that even clearing is guaranteed
>     Jack> to work (if another thread is halfway a load-decrement-store
>     Jack> sequence the clear could be lost).
> Hmm...  I guess you lost me.  The code that fiddles the ticker in ceval.c
> clearly operates while the GIL is held.  I think the code in sysmodule.c
> that updates the checkinterval works under that assumption as well.  The
> other ticker in longobject.c I'm not so sure about.
> The patch I submitted doesn't implement the ticker clear that Jeremy
> originally suggested.  It just pulls the ticker and the checkinterval out of
> the thread state and makes them two globals.  They are both manipulated in
> otherwise the same way.
> Skip

Yeah, but the whole *point* would be to save an extra test and
(rarely-taken jump) by allowing Jeremy's suggestion to be implemented.
Otherwise I don't see much advantage to the patch (or do you see a
speed-up?).

--Guido van Rossum (home page:

From  Fri Aug 30 15:29:06 2002
From: (Skip Montanaro)
Date: Fri, 30 Aug 2002 09:29:06 -0500
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <15726.52313.734491.272985@gargle.gargle.HOWL>
Message-ID: <15727.33074.324120.988215@gargle.gargle.HOWL>

    Guido> Yeah, but the whole *point* would be to save an extra test and
    Guido> (rarely-taken jump) by allowing Jeremy's suggestion to be
    Guido> implemented.  Otherwise I don't see much advantage to the patch
    Guido> (or do you see a speed-up?).

Haven't tested it for speed.  You do save 8 bytes per thread state instance
though...


From  Fri Aug 30 15:29:52 2002
From: (Guido van Rossum)
Date: Fri, 30 Aug 2002 10:29:52 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Fri, 30 Aug 2002 09:29:06 CDT."
References: <15726.52313.734491.272985@gargle.gargle.HOWL> <> <15727.31272.80804.453415@gargle.gargle.HOWL> <>
Message-ID: <>

> Haven't tested it for speed.  You do save 8 bytes per thread state
> instance though...

I don't care about the thread state size.  If it could speed Python up
by 5% I'd gladly add a kilobyte to each thread state...

--Guido van Rossum (home page:

From  Fri Aug 30 15:35:23 2002
From: (Jeremy Hylton)
Date: Fri, 30 Aug 2002 10:35:23 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <15726.52313.734491.272985@gargle.gargle.HOWL>
Message-ID: <>

The difference I saw with only the ticker check in ceval was only
about 1% for pystone.  Python was always faster with the change, but
never by much.


From  Fri Aug 30 15:47:36 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 10:47:36 -0400
Subject: [Python-Dev] Mersenne Twister
In-Reply-To: <>
Message-ID: <>

[Oren Tirosh]
> The Mersenne Twister distribution is now under the Artistic License.

That's out of date.  When they updated the code earlier this year to fix the
initialization weakness, they also adopted a new license:

    MT19937 with initialization improved 2002/1/26
    The initialization scheme of older versions of MT has a (small)
    problem, that MSBs are not well reflected to the state vector. Here
    is the latest version of initialization scheme, which we consider
    the newest standard. An initialization routine using an array of
    seeds is also available.
    We adopted BSD-license which we think most flexible, so this code
    is freely usable.

That's much less problematic than the Artistic License (which Stallman holds
in contempt; his objection to BSD+advertising_clause is so technical it's
hard to give a damn).

From  Fri Aug 30 16:06:06 2002
From: (Michael Hudson)
Date: 30 Aug 2002 16:06:06 +0100
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Jeremy Hylton's message of "Fri, 30 Aug 2002 10:35:23 -0400"
References: <15726.52313.734491.272985@gargle.gargle.HOWL> <> <15727.31272.80804.453415@gargle.gargle.HOWL> <> <15727.33074.324120.988215@gargle.gargle.HOWL> <> <>
Message-ID: <>

Jeremy Hylton <> writes:

> The difference I saw with only the ticker check in ceval was only
> about 1% for pystone.  Python was always faster with the change, but
> never by much.

A bunch of 0.5% improvements add up.  If there's not much cost in
complexity, why not go for it?


6. Symmetry is a complexity-reducing concept (co-routines include
   subroutines); seek it everywhere.
  -- Alan Perlis,

From  Fri Aug 30 16:20:17 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 11:20:17 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

[Jack Jansen]
> Not if the idea that lead to this thread (clearing ticker if something
> is put in things_to_do) is implemented, because we may be in an
> interrupt routine at the time we fiddle things_to_do.
> And I don't think we can be sure that even clearing is guaranteed to
> work (if another thread is halfway a load-decrement-store sequence the
> clear could be lost).

So long as the ticker is declared volatile, the odds of setting ticker to 0
in Py_AddPendingCall during a "bad time" for --ticker are small, a window of
a couple machine instructions.  Ticker will eventually go to 0 regardless.
It's not like things_to_do isn't ignored for "long" stretches of time now
either:  Py_MakePendingCalls returns immediately unless the thread with the
GIL just happens to be the main thread.  Even if it is the main thread,
there's another race there with some non-main thread happening to call
Py_AddPendingCall at the same time.

Opening another hole of a couple machine instructions shouldn't make much
difference, although Py_MakePendingCalls should also be changed then to
reset ticker to 0 in its "early exit because the coincidences I'm relying on
haven't happened yet" cases.
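
The lost-clear window described above can be modelled deterministically.
What follows is a minimal Python sketch, not the C code in ceval.c: it
splits --ticker into its load, decrement and store steps, and a
hypothetical interrupt_at argument chooses where the clear to 0 lands.
The function and argument names are invented for illustration.

```python
def decrement_with_interrupt(ticker, interrupt_at=None):
    """Model `--ticker` as load / decrement / store.

    interrupt_at (hypothetical) says when another party clears the
    shared ticker: "after_load", "after_store", or None for no clear.
    """
    shared = {"ticker": ticker}

    loaded = shared["ticker"]          # load
    if interrupt_at == "after_load":
        shared["ticker"] = 0           # the clear happens mid-sequence
    result = loaded - 1                # decrement, in a "register"
    shared["ticker"] = result          # store: overwrites that clear
    if interrupt_at == "after_store":
        shared["ticker"] = 0           # the clear happens afterwards
    return shared["ticker"]


lost = decrement_with_interrupt(5, "after_load")    # clear is overwritten
kept = decrement_with_interrupt(5, "after_store")   # clear survives

# Even when the clear is lost, ordinary decrements still reach 0.
remaining = lost
while remaining > 0:
    remaining = decrement_with_interrupt(remaining)
```

In this model the clear is only lost when it lands between the load and
the store, and the ticker still counts down to 0 afterwards, which is
the "eventually" in the argument above.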

From  Fri Aug 30 16:22:01 2002
From: (Guido van Rossum)
Date: Fri, 30 Aug 2002 11:22:01 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Fri, 30 Aug 2002 11:20:17 EDT."
References: <>
Message-ID: <>

> So long as the ticker is declared volatile, the odds of setting ticker to 0
> in Py_AddPendingCall during a "bad time" for --ticker are small, a window of
> a couple machine instructions.  Ticker will eventually go to 0 regardless.
> It's not like things_to_do isn't ignored for "long" stretches of time now
> either:  Py_MakePendingCalls returns immediately unless the thread with the
> GIL just happens to be the main thread.  Even if it is the main thread,
> there's another race there with some non-main thread happening to call
> Py_AddPendingCall at the same time.

Good point.  (Though some apps set the check interval to 1000; well,
that would still be fast enough.)

> Opening another hole of a couple machine instructions shouldn't make much
> difference, although Py_MakePendingCalls should also be changed then to
> reset ticker to 0 in its "early exit because the coincidences I'm relying on
> haven't happened yet" cases.

OK, let's try it then.

--Guido van Rossum (home page:

From  Fri Aug 30 16:24:20 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 11:24:20 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

[Jeremy Hylton]
> The difference I saw with only the ticker check in ceval was only
> about 1% for pystone.  Python was always faster with the change, but
> never by much.

[Michael Hudson]
> A bunch of 0.5% improvements add up.  If there's not much cost in
> complexity, why not go for it?

There isn't, and we should <wink>.  I'd do it even if it slowed things by
1%:  reducing the test+branch count on the critical path will *eventually*
pay off.  The SET_LINENO removal worked in the other direction, and that
proved a timing mini-disaster under MSVC6.  Doing even random things in the
right direction may very well nudge MSVC6 back into the local minimum it got
knocked out of.

From  Fri Aug 30 16:35:32 2002
From: (Steve Holden)
Date: Fri, 30 Aug 2002 11:35:32 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
References: <15726.52313.734491.272985@gargle.gargle.HOWL> <> <15727.31272.80804.453415@gargle.gargle.HOWL> <> <15727.33074.324120.988215@gargle.gargle.HOWL> <> <> <>
Message-ID: <055301c2503a$e1cfea60$>

> Jeremy Hylton <> writes:
> > The difference I saw with only the ticker check in ceval was only
> > about 1% for pystone.  Python was always faster with the change, but
> > never by much.
> A bunch of 0.5% improvements add up.  If there's not much cost in
> complexity, why not go for it?

Yeah, right, we just need 200 of them and we're laughing. Computation in
infinitesimal time.

suddenly-dwim-mode-is-possib-ly y'rs  - steve
Steve Holden                        
Python Web Programming              
Previous .sig file retired to          

From  Fri Aug 30 16:41:49 2002
From: (Skip Montanaro)
Date: Fri, 30 Aug 2002 10:41:49 -0500
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <>
Message-ID: <>

    >> Opening another hole of a couple machine instructions shouldn't make
    >> much difference, although Py_MakePendingCalls should also be changed
    >> then to reset ticker to 0 in its "early exit because the coincidences
    >> I'm relying on haven't happened yet" cases.

    Guido> OK, let's try it then.

You mean I just wasted my time running pystones? ;-)

Just the same, here's the output, after and before.  For each setting, I ran
pystones twice manually, then the three reported times:

    with patch:

    Pystone(1.1) time for 50000 passes = 7.52
    This machine benchmarks at 6648.94 pystones/second
    Pystone(1.1) time for 50000 passes = 7.51
    This machine benchmarks at 6657.79 pystones/second
    Pystone(1.1) time for 50000 passes = 7.5
    This machine benchmarks at 6666.67 pystones/second

    without patch:

    Pystone(1.1) time for 50000 passes = 7.69
    This machine benchmarks at 6501.95 pystones/second
    Pystone(1.1) time for 50000 passes = 7.68
    This machine benchmarks at 6510.42 pystones/second
    Pystone(1.1) time for 50000 passes = 7.67
    This machine benchmarks at 6518.9 pystones/second

I was quite surprised at the difference.  Someone definitely should check
this.  The patch is at

My guess is that the code is avoiding a lot of pointer dereferences.  Oh,
wait a minute.  I muffed a bit.  I initialized the ticker and checkinterval
variables to 100.  Should have been 10.

... a short time passes while Skip thanks God he's not rebuilding VTK ...

With _Py_CheckInterval set to 10 it's still not too shabby:

    Pystone(1.1) time for 50000 passes = 7.57
    This machine benchmarks at 6605.02 pystones/second
    Pystone(1.1) time for 50000 passes = 7.56
    This machine benchmarks at 6613.76 pystones/second
    Pystone(1.1) time for 50000 passes = 7.55
    This machine benchmarks at 6622.52 pystones/second

This is still without Jeremy's suggested change.
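
The size of the difference can be worked out from the figures above.
This is only a sketch of the arithmetic: it averages the three reported
pystones/second values per configuration, and the variable and function
names are made up here. Note that the first "with patch" run also used a
check interval of 100, so it mixes two changes.

```python
def mean(values):
    return sum(values) / len(values)


def speedup_percent(new, old):
    """Relative gain in pystones/second of `new` over `old`, in percent."""
    return (mean(new) / mean(old) - 1.0) * 100.0


# pystones/second, copied from the runs reported above
patched_interval_100 = [6648.94, 6657.79, 6666.67]
patched_interval_10 = [6605.02, 6613.76, 6622.52]
unpatched = [6501.95, 6510.42, 6518.9]

gain_100 = speedup_percent(patched_interval_100, unpatched)
gain_10 = speedup_percent(patched_interval_10, unpatched)

print("patch + interval 100: %.1f%% faster" % gain_100)
print("patch + interval 10:  %.1f%% faster" % gain_10)
```

On these numbers the patch alone is worth about 1.6%, and the larger
check interval accounts for most of the rest of the roughly 2.3% total.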

apples-and-oranges-ly, y'rs,


From  Fri Aug 30 17:13:14 2002
From: (Michael Hudson)
Date: 30 Aug 2002 17:13:14 +0100
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: "Steve Holden"'s message of "Fri, 30 Aug 2002 11:35:32 -0400"
References: <15726.52313.734491.272985@gargle.gargle.HOWL> <> <15727.31272.80804.453415@gargle.gargle.HOWL> <> <15727.33074.324120.988215@gargle.gargle.HOWL> <> <> <> <055301c2503a$e1cfea60$>
Message-ID: <>

"Steve Holden" <> writes:

> > A bunch of 0.5% improvements add up.  If there's not much cost in
> > complexity, why not go for it?
> >
> Yeah, right, we just need 200 of them and we're laughing. Computation in
> infinitesimal time.

Multiply up doesn't have the same ring to it, does it?
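
Taken literally, the difference between adding up and multiplying up
looks like this. It is a sketch of the arithmetic only, assuming each
improvement removes 0.5% of whatever running time is left; the names
are illustrative.

```python
improvements = 200
gain = 0.005  # each change removes 0.5% of the remaining run time

# "Adding up": 200 * 0.5% = 100%, i.e. no run time left at all.
additive_remaining = 1.0 - improvements * gain

# "Multiplying up": each change scales what is left by 0.995.
compound_remaining = (1.0 - gain) ** improvements

print("additive:  %.3f of the original run time" % additive_remaining)
print("compound:  %.3f of the original run time" % compound_remaining)
```

Compounded, 200 such changes leave a bit over a third of the original
running time rather than none of it.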


  I don't have any special knowledge of all this. In fact, I made all
  the above up, in the hope that it corresponds to reality.
                                            -- Mark Carroll,

From  Fri Aug 30 17:12:24 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 12:12:24 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

[Skip Montanaro]
> ...
> My guess is that the code is avoiding a lot of pointer dereferences.
> Oh, wait a minute.  I muffed a bit.  I initialized the ticker and
> checkinterval variables to 100.  Should have been 10.

Someone <wink> may wish to question the historical 10 too.  A few weeks ago
on, a number of programs were posted showing that, on Linux, the
thread scheduling is such that the *offer* to switch threads every 10
bytecodes was usually declined:  the thread that got the GIL was
overwhelmingly most often the thread that released it, so that the whole
dance was overwhelmingly most often pure overhead.  This may be different
under 2.3, where the pthreads GIL is implemented via a semaphore rather than
a condvar.  But in that case, actually switching threads every 10 bytecodes
is an awful lot of thread switching (10 bytecodes don't take as long as they
used to <wink>).

I don't know how to pick a good "one size fits all" value, but suspect 10 is
"clearly too small".  In app after app, people who discover
sys.setcheckinterval() discover soon after that performance improves if they
increase it.
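
For reference, raising the interval from application code is a
one-liner. The sketch below is hedged in two ways: the helper name and
the 0.05 second figure are illustrative assumptions, and it has to
allow for the fact that later CPython releases replaced the
bytecode-count sys.setcheckinterval() with the time-based
sys.setswitchinterval().

```python
import sys


def offer_switches_less_often(bytecodes=100, seconds=0.05):
    """Make the interpreter offer thread switches less frequently.

    Uses sys.setswitchinterval() where it exists and falls back to the
    sys.setcheckinterval() call discussed in this thread otherwise.
    Returns the name of the function that was used.
    """
    if hasattr(sys, "setswitchinterval"):
        sys.setswitchinterval(seconds)      # time-based interval
        return "setswitchinterval"
    sys.setcheckinterval(bytecodes)         # count of bytecodes
    return "setcheckinterval"
```

Calling offer_switches_less_often() once at start-up is enough; the
setting is process-wide.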

From  Fri Aug 30 17:16:07 2002
From: (Guido van Rossum)
Date: Fri, 30 Aug 2002 12:16:07 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: Your message of "Fri, 30 Aug 2002 12:12:24 EDT."
References: <>
Message-ID: <>

> Someone <wink> may wish to question the historical 10 too.  A few weeks ago
> on, a number of programs were posted showing that, on Linux, the
> thread scheduling is such the the *offer* to switch threads every 10
> bytecodes was usually declined:  the thread that got the GIL was
> overwhelmingly most often the thread that released it, so that the whole
> dance was overwhelmingly most often pure overhead.  This may be different
> under 2.3, where the pthreads GIL is implemented via a semaphore rather than
> a condvar.  But in that case, actually switching threads every 10 bytecodes
> is an awful lot of thread switching (10 bytecodes don't take as long as they
> used to <wink>).
> I don't know how to pick a good "one size fits all" value, but suspect 10 is
> "clearly too small".  In app after app, people who discover
> sys.setcheckinterval() discover soon after that performance improves if they
> increase it.

Let's try 100 and see how that works.

--Guido van Rossum (home page:

From  Fri Aug 30 16:50:52 2002
From: (Jeremy Hylton)
Date: Fri, 30 Aug 2002 11:50:52 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <055301c2503a$e1cfea60$>
References: <15726.52313.734491.272985@gargle.gargle.HOWL>
Message-ID: <>

>>>>> "SH" == Steve Holden <> writes:

  >> Jeremy Hylton <> writes:
  >> > The difference I saw with only the ticker check in ceval was
  >> > only about 1% for pystone.  Python was always faster with the
  >> > change, but never by much.
  >> A bunch of 0.5% improvements add up.  If there's not much cost in
  >> complexity, why not go for it?

  SH> Yeah, right, we just need 200 of them and we're
  SH> laughing. Computation in infinitesimal time.

I think Xeno's laughing already.


From  Fri Aug 30 17:30:55 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 12:30:55 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

> What's an acceptable false positive rate?

[Greg Ward]
> Speaking as one of the people who reviews suspected spam for
> and rescues false positives, I would say that the more relevant figure
> is: how much suspected spam do I have to review every morning?  < 10
> messages would be peachy; right now it's around 5-20 messages per day.

I must be missing something.  I would *hope* that you review *all* messages
claimed to be spam, in which case the number of msgs to be reviewed would,
in a perfectly accurate system, be equal to the number of spams received.

OTOH, the false positive rate doesn't have anything to do with the number of
spams received, it has to do with the number of non-spams received.

> Currently there are probably 1-3 FPs per day, although on a bad day
> there can be 5-10.  (Eg. on 2002-08-21, six mailman-users posts from the
> same guy were all caught, mainly because his ISP added X-AntiAbuse, and
> his messages were multipart/alternative with unwrapped plain text.  This
> is a perfect example of SpamAssassin screwing up royally.)  1-3 FPs/day
> I can live with, but the real burden is the manual review: I'd much
> rather have 5 FPs in a pool of 10 suspects than 1 FP out of 100
> suspects.

Maybe you don't want this kind of approach at all.  The classifier doesn't
have "gray areas" in practice:  it tends to give probabilities near 1, or
near 0, and there's very little in between -- a msg either has a
preponderance of spam indicators, or a preponderance of non-spam indicators.
You're simply not going to get a batch of "hmm, I'm not really sure about
these" out of it.  You would in a conventional Bayesian classifier, but
Graham's ignores almost all of the words, judging on only the most extreme
words present; when only extremes are fed in, the final result also tends to
be extreme (the only cases where that doesn't obtain are those where the
most extreme words it finds aren't extreme at all; e.g., a msg consisting
entirely of "the", "and" and "it" would get rated as 0.5).
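
Why extremes in give extremes out can be seen from the combining rule
in Graham's published scheme. The sketch below assumes that rule (the
product of the word probabilities, normalised against the product of
their complements) and a cutoff of the 15 most extreme words; the
function names are hypothetical and this is not the GBayes code itself.

```python
def combine(probs):
    """Graham-style combination of per-word spam probabilities.

    Assumes every p is strictly between 0 and 1 (Graham clamps them).
    """
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


def score(word_probs, keep=15):
    """Score a message using only the `keep` most extreme words."""
    by_extremity = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)
    return combine(by_extremity[:keep])


bland = score([0.5, 0.5, 0.5])                     # "the", "and", "it"
spammy = score([0.99, 0.99, 0.5, 0.5, 0.4], keep=2)
```

Words near 0.5 barely move the result and are dropped first, so bland
comes out as exactly 0.5 while two 0.99 words alone push spammy above
0.999.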

>> What do we get from SpamAssassin?

> Recall the stats I posted this morning; the bulk of spam is in Chinese
> or Korean, and I have things setup so SpamAssassin never even sees it.
> I think the only way to meaningfully answer this question is to stash
> *everything* receives for a day or 10, spam and
> otherwise, and run it all through SA.

It would be good to have such a corpus regardless.

From  Fri Aug 30 17:45:44 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 12:45:44 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Skip Montanaro]
> ...
> One thing I think would be worthwhile would be to run GBayes first, then
> only run stuff it thought was spam through SpamAssassin.  Only
> messages that both systems categorized as spam would drop into the spam
> folder.  This has a couple benefits over running one or the other in
> isolation:
>     * The training set for GBayes probably doesn't need to be as big

Training GBayes is cheap, and the more you feed it the less need to do
information-destroying transformations (like folding case or ignoring

>     * The two systems use substantially different approaches to
>       identifying spam,

Which could indeed be a killer-strong benefit.

>       so I suspect your false positive rate would go way down.

I'm already having a real problem with this just looking at content:  the
false positive rate is already so low that I can't make statistically
significant conclusions about things that may improve it (e.g., if I do
something that removes just *one* false positive in a test run on 4000 hams,
the false-positive rate falls by 12.5% -- I don't have enough false
positives to make fine-grained judgments.  And, indeed, every time I test a
change to the algorithm, the most *significant* thing I find is that it
turns up another class of blatant spam hiding in the ham corpus:  my
training data is still too dirty, and cleaning it up is labor-intensive).
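
The 12.5% figure pins down how few errors are involved. As a sketch of
the arithmetic: the count of 8 false positives is not stated above, it
is inferred from "one fewer is a 12.5% drop", and the names are
illustrative.

```python
hams = 4000
relative_drop = 0.125            # removing one FP lowers the rate by 12.5%

# One false positive is 12.5% of the total, so the total must be 8.
false_positives = round(1 / relative_drop)

rate_before = false_positives / float(hams)
rate_after = (false_positives - 1) / float(hams)

print("false positives: %d of %d hams" % (false_positives, hams))
print("rate before: %.3f%%" % (rate_before * 100))
print("rate after:  %.3f%%" % (rate_after * 100))
```

With only about 8 errors in 4000 messages, a single message changing
sides dominates any comparison between two variants of the algorithm.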

>       False negatives would go up, but only testing can suggest by how
>       much.
>     * Since SA is dog slow most of the time, SA users get a big speedup,
>       since a substantially smaller fraction of your messages get run
>       through it.
> This sort of chaining is pretty trivial to setup with procmail.
> Dunno what the Windows set will do though.

There are different audiences here.  Greg is keen to have a better approach
for as a whole, while Barry is keen about that and about doing
something more generic for Mailman.  Windows isn't an issue for either of
those.  Everyone else can eat cake <wink>.

From  Fri Aug 30 18:10:26 2002
From: (John Williams)
Date: Fri, 30 Aug 2002 10:10:26 -0700 (PDT)
Subject: [Python-Dev] alternate reiter proposal
Message-ID: <>

Hi, this is my first post, so go easy on me! :-)

I got this idea from the "cogen" discussion, seeing how the lack of a
reliable re-iterability test makes it hard to write lazy multi-pass
algorithms.  Rather than (1) assuming iterators are re-iterable, (2)
"forcing" iterators to be re-iterable by eagerly converting them to
lists, or (3) trying to heuristically guess whether an iterator is
re-iterable, why not combine the best of (1) and (2) while avoiding (3)?

This class will lazily convert an iterator to a list on the first pass
and then iterate over the saved list on all subsequent passes.

class reiter(object):
    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.cache = []
    def __iter__(self):
        if self.iterator is None:
            return iter(self.cache)
        else:
            return self
    def next(self):
        try:
            element = self.iterator.next()
            self.cache.append(element)
            return element
        except StopIteration:
            self.iterator = None
            raise


From  Fri Aug 30 18:29:50 2002
From: (Andrew Koenig)
Date: 30 Aug 2002 13:29:50 -0400
Subject: [Python-Dev] alternate reiter proposal
In-Reply-To: <>
References: <>
Message-ID: <>

John> This class will lazily convert an iterator to a list on the first pass
John> and then iterate over the saved list on all subsequent passes.

John> class reiter(object):
John>     def __init__(self, iterable):
John>         self.iterator = iter(iterable)
John>         self.cache = []
John>     def __iter__(self):
John>         if self.iterator is None:
John>             return iter(self.cache)
John>         else:
John>             return self
John>     def next(self):
John>         try:
John>             element = self.iterator.next()
John>             self.cache.append(element)
John>             return element
John>         except StopIteration:
John>             self.iterator = None
John>             raise

Maybe I'm missing something here, but doesn't this fail if you
try to restart it before it has entirely consumed the input?
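
To make the objection concrete: with the class above, calling iter() a
second time before the first pass has finished returns the same
half-consumed object, so the second pass silently skips the elements
already seen. One way around that, offered as a sketch rather than as
what John intended, is to hand out a fresh generator per pass that
replays the shared cache and only then pulls from the source. It is
written for current Python, and CachingIterable is a hypothetical name.

```python
class CachingIterable:
    """Wrap a one-shot iterator so it can be iterated repeatedly,
    including restarting before an earlier pass has finished."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._cache = []

    def __iter__(self):
        position = 0
        while True:
            if position < len(self._cache):
                # Already seen by some pass: replay it from the cache.
                yield self._cache[position]
            else:
                # New ground: pull one item from the source and remember it.
                try:
                    item = next(self._source)
                except StopIteration:
                    return
                self._cache.append(item)
                yield item
            position += 1


letters = CachingIterable(iter("abc"))
first_pass = iter(letters)
head = next(first_pass)          # consume one element, then restart
second_pass = list(letters)      # a full pass, despite the restart
rest_of_first = list(first_pass)
```

The cost is the same as in the original proposal: everything consumed
so far stays in memory for as long as the wrapper lives.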

Andrew Koenig,,

From  Fri Aug 30 18:41:24 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 13:41:24 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

>> I don't know how to pick a good "one size fits all" value, but 
>> suspect 10 is "clearly too small".  In app after app, people who
>> discover sys.setcheckinterval() discover soon after that performance
>> improves if they increase it.

> Let's try 100 and see how that works.

+1 here.  Skip, you want to fold that into your patch?

From  Fri Aug 30 19:26:46 2002
From: (Skip Montanaro)
Date: Fri, 30 Aug 2002 13:26:46 -0500
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
References: <>
Message-ID: <>

    Guido> Let's try 100 and see how that works.

    Tim> +1 here.  Skip, you want to fold that into your patch?



From  Fri Aug 30 20:07:26 2002
From: (John Williams)
Date: Fri, 30 Aug 2002 12:07:26 -0700 (PDT)
Subject: [Python-Dev] alternate reiter proposal
In-Reply-To: <>
Message-ID: <>

--- "Andrew Koenig  -" wrote:

> Maybe I'm missing something here, but doesn't this fail if you
> try to restart it before it has entirely consumed the input?

I guess that depends on what your expectations are--calling it "reiter"
was probably not a good idea in that respect. I was really just trying
to solve the problem of implementing the cartesian product function
lazily and got a little ahead of myself proposing a solution to a
special case as a general-purpose solution.

I'd like to develop the basic idea into something with wider
applicability, since I'm fond of anything that makes it easier to
implement lazy algorthms. Somebody speak up if you think it's
worthwhile, otherwise I think I'll just let it drop since I'm fairly
certain *I* won't be needing it anytime soon.


From  Fri Aug 30 20:41:51 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 15:41:51 -0400
Subject: [Python-Dev] Mining URLs for spam detection
In-Reply-To: <>
Message-ID: <>

I've gotten interesting results from this gimmick:

import re
url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
urlfield_re = re.compile(r"[;?:@&=+,$.]")

def tokenize_url(string):
    for url in url_re.findall(string):
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk
    ... (and then do other tokenization) ...

So it splits a case-normalized http thingie via /, tags the first piece
"url0:", the second "url1:", and so on.  Within each piece, it splits on
separators, like '=' and '.'.
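
As a usage sketch, here is the same generator run over a made-up line
of text (the sample URL is an invention for illustration, and the
"other tokenization" step is left out):

```python
import re

url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
urlfield_re = re.compile(r"[;?:@&=+,$.]")


def tokenize_url(string):
    for url in url_re.findall(string):
        # Split the case-normalized URL on '/', tagging each piece
        # with its position: url0:, url1:, ...
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            # Within a piece, split again on separators like '.' and '='.
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk


sample = "Visit http://WWW.Python.org/doc/remove?id=5 today"
tokens = list(tokenize_url(sample))
# tokens == ['url0:www', 'url0:python', 'url0:org',
#            'url1:doc', 'url2:remove', 'url2:id', 'url2:5']
```

The host name lands in the url0: tokens and each later path segment
gets its own numbered prefix, which is how "python" and "remove" end up
as distinct tokens depending on where they appear.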

Two particular tokens generated this way then made it into the list of 15
words that most often survived to the end of the scoring step:

    url0:python   as a strong non-spam indicator
    url1:remove   as a strong spam indicator

The rest of the tokenization was unchanged, still doing MIME-ignorant
splitting on whitespace.  Just the http gimmick was added, and that alone
cut the false negative rate in half.  IOW, there's a *lot* of valuable info
in the http thingies!  Not being a Web Guy, I'm not sure how to extract the
most info from it.  If you've got suggestions for a better URL tagging
strategy, I'd love to hear them.

Cute:  If I tokenize *only* the http thingies, ignoring all other parts of
the text, the false positive rate is about 1%.  This is because most legit
msgs don't have any http thingies, so they get classified correctly as ham
(no tokens at all are generated for them).   This caught at least one spam
in the ham corpus (a bogus "false positive"):

prob = 0.999997392672
prob('url0:240') = 0.2
prob('url1:') = 0.612567
prob('url0:250') = 0.99
prob('url0:225') = 0.99
prob('url0:207') = 0.99
Sweet XXX!
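
The overall score in that listing can be reproduced from the five word
probabilities, assuming the combining rule is the one from Graham's
published scheme (product of the probabilities, normalised against the
product of their complements). This is a check of the arithmetic, not
the GBayes source; the names are illustrative.

```python
def combine(probs):
    """Graham-style combination of per-word spam probabilities."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Word probabilities as printed above, in the same order.
listed = [0.2, 0.612567, 0.99, 0.99, 0.99]
reported = 0.999997392672

recomputed = combine(listed)
print("recomputed %.12f, reported %.12f" % (recomputed, reported))
```

The two agree to within the rounding of the printed inputs: three 0.99
tokens swamp the single 0.2 token.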

An example of a real false positive was due to /F including this URL:

Oddly enough,

    prob('url0:132') = 0.99
    prob('url0:telia') = 0.99

so there was significant spam with "132" and "telia" in the first field of
an http thingie.

The false negative rate when tokenizing only http thingies zoomed to over
30%.  Curiously, the best way for a spam to evade this check is *not* to
disguise itself with numeric IPs.  Numbers end up looking suspicious.  But,
e.g., this looks neutral:

    prob('url0:com') = 0.658328

and it never saw "shocking-incest" before.

From  Fri Aug 30 21:21:27 2002
From: (Jack Jansen)
Date: Fri, 30 Aug 2002 22:21:27 +0200
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

On vrijdag, augustus 30, 2002, at 06:12 , Tim Peters wrote:

> [Skip Montanaro]
>> ...
>> My guess is that the code is avoiding a lot of pointer dereferences.
>> Oh, wait a minute.  I muffed a bit.  I initialized the ticker and
>> checkinterval variables to 100.  Should have been 10.
> Someone <wink> may wish to question the historical 10 too.  A 
> few weeks ago
> on, a number of programs were posted showing that, on Linux, the
> thread scheduling is such the the *offer* to switch threads every 10
> bytecodes was usually declined:  the thread that got the GIL was
> overwhelmingly most often the thread that released it, so that 
> the whole
> dance was overwhelmingly most often pure overhead.

And it costs!

Running pystone without another thread active I get 5500 
pystones out of my machine. Running it with another thread 
active (in a sleep(1000)) I get 4200.
After setcheckinterval(100) I'm back up to 5200.

For completeness' sake: with no other thread active raising 
setcheckinterval() doesn't make a difference (it's in the noise, 
in my measurement it was actually 0.5% slower).

We could get a serious speedup for multithreaded programs if we 
could raise the check interval. Some wild ideas:
- Use an exponential (or linear?) backoff. If you attempt to 
switch and nothing happens you double the check interval, up to 
a maximum. If you do switch you reset to the minimum.
- Combine the above with resetting (to zero? to minimum value if 
currently >= minimum?) the check interval on anything we know 
could influence thread schedule (releasing a lock, etc).
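
The backoff idea could be sketched as below. This is a model in
Python, not a patch to the interpreter: the class and method names, and
the minimum and maximum values, are assumptions made up for
illustration.

```python
class BackoffInterval:
    """Check interval that doubles while switch offers are declined."""

    def __init__(self, minimum=10, maximum=1000):
        self.minimum = minimum
        self.maximum = maximum
        self.current = minimum

    def after_switch_attempt(self, switched):
        """Update the interval after offering a thread switch."""
        if switched:
            # Another thread ran: go back to checking frequently.
            self.current = self.minimum
        else:
            # Offer declined: back off, up to the maximum.
            self.current = min(self.current * 2, self.maximum)
        return self.current

    def reset(self):
        """Second idea: call this on events that may change scheduling,
        such as releasing a lock."""
        self.current = self.minimum


interval = BackoffInterval(minimum=10, maximum=100)
declined = [interval.after_switch_attempt(False) for _ in range(5)]
after_switch = interval.after_switch_attempt(True)
```

With a minimum of 10 and a maximum of 100, five declined offers in a
row take the interval through 20, 40, 80, 100, 100, and the first real
switch drops it back to 10.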
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Fri Aug 30 21:35:09 2002
From: (Tim Peters)
Date: Fri, 30 Aug 2002 16:35:09 -0400
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

[Jack Jansen]
> On vrijdag, augustus 30, 2002, at 06:12 , Tim Peters wrote:

Jack, I'm never on vrijdag -- vrijdag is illegal in the US <wink>.

> ...
> And it costs!
> Running pystone without another thread active I get 5500
> pystones out of my machine. Running it with another thread
> active (in a sleep(1000)) I get 4200.
> After setcheckinterval(100) I'm back up to 5200.
> For completeness' sake: with no other thread active raising
> setcheckinterval() doesn't make a difference (it's in the noise,
> in my measurement it was actually 0.5% slower).
> We could get a serious speedup for multithreaded programs if we
> could raise the check interval.

Guido already agreed to try boosting it to 100.

> Some wild ideas:
> - Use an exponential (or linear?) backoff. If you attempt to
> switch and nothing happens you double the check interval, up to
> a maximum. If you do switch you reset to the minimum.

On a pthreads system under 2.3, using semaphores, chances are good it will
always switch.  But unless you're trying to fake soft realtime, it's a real
drag on performance to switch so often.  We can't out-guess this, because it
depends on what the *app* wants.  Most apps aren't trying to fake soft
realtime, so favoring less frequent switches is a good default.

> - Combine the above with resetting (to zero? to minimum value if
> currently >= minimum?) the check interval on anything we know
> could influence thread schedule (releasing a lock, etc).

You need a model for what it is you're trying to optimize here.  I'm just
trying to cut useless overheads <wink>.

From  Sat Aug 31 02:39:09 2002
From: (Jack Jansen)
Date: Sat, 31 Aug 2002 03:39:09 +0200
Subject: [Python-Dev] tiny optimization in ceval mainloop
In-Reply-To: <>
Message-ID: <>

On vrijdag, augustus 30, 2002, at 10:35 , Tim Peters wrote:

> [Jack Jansen]
>> On vrijdag, augustus 30, 2002, at 06:12 , Tim Peters wrote:
> Jack, I'm never on vrijdag -- vrijdag is illegal in the US <wink>.

Oh? Didn't realize that, thought they hadn't gotten any further than
outlawing rookdag and drinkdag yet.

>> Some wild ideas:
>> - Use an exponential (or linear?) backoff. If you attempt to
>> switch and nothing happens you double the check interval, up to
>> a maximum. If you do switch you reset to the minimum.
> On a pthreads system under 2.3, using semaphores, chances are 
> good it will
> always switch.  But unless you're trying to fake soft realtime, 
> it's a real
> drag on performance to switch so often  We can't out-guess 
> this, because it
> depends on what the *app* wants.  Most apps aren't trying to fake soft
> realtime, so favoring less frequent switches is a good default.

Agreed. But the main applications I was thinking of are along the
lines of one thread doing real computational work and the others doing
GUI stuff or serving web-requests or some such. I.e. while busy you
care about response for other threads, but you don't want to spend too
many cycles on it.

I remember seeing bad artefacts of having a large value for the check
interval, such as bad responsiveness to control-C, but it could well
be that this was MacPython-OS9

> You need a model for what it is you're trying [...]

I thought you said you didn't have vrijdag?
- Jack Jansen        <> -
- If I can't dance I don't want to be part of your revolution -- 
Emma Goldman -

From  Sat Aug 31 17:44:06 2002
From: (Raymond Hettinger)
Date: Sat, 31 Aug 2002 12:44:06 -0400
Subject: [Python-Dev] Proposed Mixins for Wide Interfaces
Message-ID: <001101c2510d$9fce0920$5f66accf@othello>

How about adding some mixins to simplify the
implementation of some of the fatter interfaces?

class CompareMixin:
    """
    Given an __eq__ method in a subclass, adds a __ne__ method.
    Given __eq__ and __lt__, adds !=, <=, >, >=.
    """

class MappingMixin:
    """
    Given __setitem__, __getitem__,  and keys,
    implements values, items, update, get, setdefault, len,
    iterkeys, iteritems, itervalues, has_key, and __contains__.

    If __delitem__ is also supplied, implements clear, pop,
    and popitem.

    Takes advantage of __iter__ if supplied (recommended).
    Takes advantage of __contains__ or has_key if supplied
    (recommended).
    """

The idea is to make it easier to implement these interfaces.
Also, if the interfaces get expanded, the clients automatically
updated.

Raymond Hettinger
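
One way the two mixins described in this message might look, as a sketch
rather than the proposed implementation. It is written for a current
Python, so the 2.x-only names (has_key, iterkeys, and friends) are left
out, as are pop and popitem; the client classes Version and LowerDict are
hypothetical and exist only to exercise the mixins. CompareMixin assumes a
total ordering.

```python
class CompareMixin:
    """Derive !=, <=, >, >= from a subclass's __eq__ and __lt__."""

    def __ne__(self, other):
        return not self == other

    def __le__(self, other):
        return self < other or self == other

    def __gt__(self, other):
        return not (self < other or self == other)

    def __ge__(self, other):
        return not self < other


class MappingMixin:
    """Derive the wider mapping API from __getitem__, __setitem__, keys."""

    def __iter__(self):
        # Default; a subclass may supply a faster __iter__ of its own.
        return iter(self.keys())

    def __len__(self):
        return len(self.keys())

    def __contains__(self, key):
        try:
            self[key]
        except KeyError:
            return False
        return True

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def setdefault(self, key, default=None):
        if key not in self:
            self[key] = default
        return self[key]

    def values(self):
        return [self[key] for key in self]

    def items(self):
        return [(key, self[key]) for key in self]

    def update(self, other):
        for key in other.keys():
            self[key] = other[key]

    def clear(self):
        # Only works if the subclass also supplies __delitem__.
        for key in list(self):
            del self[key]


class Version(CompareMixin):
    """Hypothetical client: supplies only __eq__ and __lt__."""

    def __init__(self, major, minor):
        self.key = (major, minor)

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key


class LowerDict(MappingMixin):
    """Hypothetical client: a case-insensitive mapping, minimal methods only."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key.lower()]

    def __setitem__(self, key, value):
        self._data[key.lower()] = value

    def __delitem__(self, key):
        del self._data[key.lower()]

    def keys(self):
        return list(self._data)
```

Because the derived methods go through iter(self) and "key in self" rather
than calling keys() directly, a subclass that supplies its own __iter__ or
__contains__ is picked up automatically.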

From  Sat Aug 31 17:48:29 2002
From: (David Abrahams)
Date: Sat, 31 Aug 2002 12:48:29 -0400
Subject: [Python-Dev] Proposed Mixins for Wide Interfaces
References: <001101c2510d$9fce0920$5f66accf@othello>
Message-ID: <0cd001c2510e$3c933eb0$>

From: "Raymond Hettinger" <>

> How about adding some mixins to simplify the
> implementation of some of the fatter interfaces?
> class CompareMixin:
>     """
>     Given an __eq__ method in a subclass, adds a __ne__ method
>     Given __eq__ and __lt__, adds !=, <=, >, >=.
>     """
> class MappingMixin:
>     """
>     Given __setitem__, __getitem__,  and keys,
>     implements values, items, update, get, setdefault, len,
>     iterkeys, iteritems, itervalues, has_key, and __contains__.
>     If __delitem__ is also supplied, implements clear, pop,
>     and popitem.
>     Takes advantage of __iter__ if supplied (recommended).
>     Takes advantage of __contains__ or has_key if supplied
>     (recommended).
>     """
> The idea is to make it easier to implement these interfaces.
> Also, if the interfaces get expanded, the clients automatically
> updated.

I think these are a great idea, *in the context of* an understanding of
what we want interfaces to be, say, and do. Are we there yet?

           David Abrahams * Boost Consulting *

From  Sat Aug 31 07:45:31 2002
From: (Tim Peters)
Date: Sat, 31 Aug 2002 02:45:31 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Tim, predicting a false-positive rate]
> I expect we can end up below 0.1% here, and with a generous
> meaning for "not spam",

We're there now, and still ignoring the headers.

> but I think *some* of these examples show that the only way to get
> a 0% false-positive rate is to recode spamprob like so:
>     def spamprob(self, wordstream, evidence=False):
>         return 0.0


I'll check in what I've got after this.  Changes included:

+ Using the email pkg to decode (only) text parts of msgs, and,
  given multipart/alternative with both text/plain and text/html
  branches, ignoring the HTML part (else a newbie will never get
  a msg thru:  all HTML decorations have monster-high spam
  probabilities).

+ Boosting MAX_DISCRIMINATORS, from 15 to 16.

+ Ignoring very short and very long "words" (this is Eurocentric).

+ Neither counting unique words once nor an unbounded number of times
  in the scoring.  A word is counted at most twice now.  This helps
  otherwise spamish msgs that have *some* highly relevant content, but
  doesn't, e.g., let spam through just because it says "Python" 80
  times at the start.  It helps the false negative rate more, although
  that may really be because UNKNOWN_SPAMPROB is too low
  (UNKNOWN_SPAMPROB is irrelevant to any of the false positives
  remaining, so I haven't run any tests varying that yet).
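
The "counted at most twice" rule in the last item can be sketched as a
filter over the word stream. This is an illustration, not the checked-in
code: the names cap_word_counts and MAX_COUNT_PER_WORD are made up here,
and the real scorer may apply the cap at a different stage.

```python
from collections import Counter

MAX_COUNT_PER_WORD = 2  # "A word is counted at most twice now."


def cap_word_counts(wordstream, cap=MAX_COUNT_PER_WORD):
    """Yield the words of wordstream, each one at most `cap` times."""
    seen = Counter()
    for word in wordstream:
        if seen[word] < cap:
            seen[word] += 1
            yield word


# A msg that says "Python" 80 times contributes that clue only twice.
words = ["Python"] * 80 + ["free", "free", "free", "offer"]
print(list(cap_word_counts(words)))
# ['Python', 'Python', 'free', 'free', 'offer']
```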

I'll attach a complete listing of all false positives across the 20,000 ham msgs I've
been using.  People using as an HTML clinic are out of luck.  I'd personally
call at least 5 of them spam, but I've been very reluctant to throw msgs out of the
"good" archive -- nobody would question the ones I did throw out and replace.

The false negative rate is still relatively high.  In part that comes from getting the
false positive rate so low (this is very much a tradeoff when both get low!), and in
part because the spam corpus has a surprising number of msgs with absolutely nothing in
the bodies.  The latter generate no tokens, so end up with "probability" 0.5.  The only
thing I tried that cut the false negative rate in a major way was the special
parsing+tagging of URLs in the body (see earlier msg), and that was a highly
significant aid (it cut the false negative rate in half).  There's good reason to hope
that adding headers into the scoring would slash the false negative rate.
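
Why an empty body ends up at 0.5 can be seen from a Graham-style
combining rule. The sketch below assumes that rule (a product of the word
probabilities against a product of their complements); the function name
combine is illustrative, and the real scorer also clamps probabilities
away from 0 and 1, which is omitted here.

```python
def combine(probs):
    """Combine per-word spam probabilities into a single score."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# No tokens: both products stay at 1.0, so the score is exactly 0.5.
print(combine([]))  # 0.5
```

With no evidence either way the score sits on the fence, so such a msg
is not flagged as spam and shows up as a false negative.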

Full results across all 20 runs; floats are percentages:

Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    false positive: 0.025
    false negative: 2.10909090909
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    false positive: 0.05
    false negative: 2.47272727273
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    false positive: 0.1
    false negative: 2.50909090909
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    false positive: 0.0500125031258
    false negative: 2.8

Training on Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    false positive: 0.05
    false negative: 2.8
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    false positive: 0.075
    false negative: 2.47272727273
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    false positive: 0.15
    false negative: 2.36363636364
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    false positive: 0.0500125031258
    false negative: 2.43636363636

Training on Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    false positive: 0.075
    false negative: 3.16363636364
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    false positive: 0.075
    false negative: 2.43636363636
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    false positive: 0.15
    false negative: 2.90909090909
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    false positive: 0.0750187546887
    false negative: 2.61818181818

Training on Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    false positive: 0.1
    false negative: 2.65454545455
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    false positive: 0.1
    false negative: 1.81818181818
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    false positive: 0.1
    false negative: 2.25454545455
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    false positive: 0.0750187546887
    false negative: 2.50909090909

Training on Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    false positive: 0.075
    false negative: 2.94545454545
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    false positive: 0.05
    false negative: 2.07272727273
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    false positive: 0.1
    false negative: 2.58181818182
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    false positive: 0.15
    false negative: 2.83636363636

The false positive rates vary by a factor of 6.  This isn't significant, because the
absolute numbers are so small; 0.025% is a single message, and it never gets higher
than 0.150%.  At these rates, I'd need test corpora about 10x larger to draw any fine
distinction among false positive rates with high confidence.
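
The arithmetic behind those percentages, as a small sketch. Only the
"single message" figure is stated above; the other error counts are
inferred from the listed rates and the set sizes, and rate_percent is a
name made up for the example.

```python
def rate_percent(errors, total):
    """Error rate as a percentage of the msgs tested."""
    return 100.0 * errors / total


# Lowest false positive rate: one ham out of 4000.
print(rate_percent(1, 4000))  # 0.025
# Highest false positive rate: six hams out of 4000 (inferred count).
print(rate_percent(6, 4000))  # 0.15
# Set5 has 3999 hams, which is where the odd-looking 0.0500125... comes from.
set5_rate = rate_percent(2, 3999)
```

So the "factor of 6" is the difference between 1 and 6 misclassified
msgs per 4000-msg test set, which is why a much larger corpus is needed
before such differences mean anything.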


From  Sat Aug 31 22:43:39 2002
From: (Tim Peters)
Date: Sat, 31 Aug 2002 17:43:39 -0400
Subject: [Python-Dev] The first trustworthy <wink> GBayes results
In-Reply-To: <>
Message-ID: <>

[Tim, to Paul Graham]
> ...
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it.  So I'm
> loathe to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.

Those experiments have been run now.  Folding case gave a slight but
significant improvement in the false negative rate.  It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives:  conference announcements are no longer flagged (for their VISIT
highly off-topic messages do (e.g., talking about money is now
indistinguishable from screaming about MONEY).  So, overall, I'm leaving
case-folding in.  It does (of course) reduce the database size, and reduce
the amount of training data needed.  I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).
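
What folding case does to the token stream, as a minimal sketch. The
whitespace-splitting tokenize function here is a stand-in made up for the
example, not the tokenizer used in the experiments.

```python
def tokenize(text, fold_case=True):
    """Split text on whitespace, optionally lowercasing it first."""
    if fold_case:
        text = text.lower()
    return text.split()


print(tokenize("Get MONEY now"))         # ['get', 'money', 'now']
print(tokenize("Get MONEY now", False))  # ['Get', 'MONEY', 'now']
```

With folding on, "money" and "MONEY" share one database entry, which is
the source of both the smaller database and the lost clue.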

Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win:  it had no significant effect on the false positive
rate, but cut the false negative rate by a third.  The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.
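
A toy illustration of why the "unknown word" probability matters for
false negatives. It assumes a Graham-style combining rule and a made-up
one-word probability table; the names combine and score are illustrative,
and the real scorer picks only the strongest clues, which this sketch
ignores.

```python
def combine(probs):
    """Graham-style combination of per-word spam probabilities."""
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def score(words, table, unknown_spamprob):
    """Score a msg, using unknown_spamprob for words not in the table."""
    return combine([table.get(w, unknown_spamprob) for w in words])


table = {"free": 0.9}        # made-up training result
msg = ["free", "zyzzyva"]    # one known spammy word, one never seen before

low = score(msg, table, 0.2)      # unknown words lean toward ham
neutral = score(msg, table, 0.5)  # unknown words carry no evidence
print(round(low, 2), round(neutral, 2))  # 0.69 0.9
```

At 0.2 every never-seen word drags the score toward "not spam"; at 0.5 it
leaves the score where the known words put it.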