From vinay_sajip at yahoo.co.uk  Thu Nov  1 01:24:37 2012
From: vinay_sajip at yahoo.co.uk (Vinay Sajip)
Date: Thu, 1 Nov 2012 00:24:37 +0000 (UTC)
Subject: [Python-ideas] Function in shutil to clear contents of a directory
Message-ID: <loom.20121101T011302-450@post.gmane.org>

A couple of times recently on different projects, I've had the need to clear out
the contents of an existing directory. There doesn't appear to be any function
in shutil to do this. I've used

import os
import shutil

def clear_directory(path):
    for fn in os.listdir(path):
        fn = os.path.join(path, fn)
        if os.path.islink(fn) or os.path.isfile(fn):
            os.remove(fn)
        else:
            shutil.rmtree(fn)

One could just do shutil.rmtree(path) followed by os.mkdir(path), but that fails
if path is a symlink to a directory (disallowed by rmtree).
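
For instance, assuming /tmp/link is a symlink to an existing directory, a
quick way to see the failure is:

    import shutil
    shutil.rmtree('/tmp/link')  # raises OSError: Cannot call rmtree on a symbolic link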

Is there some other obvious way to do this that I've missed? If not, I'd like to
see something like this added to shutil. What say?

Regards,

Vinay Sajip



From raymond.hettinger at gmail.com  Thu Nov  1 02:47:20 2012
From: raymond.hettinger at gmail.com (Raymond Hettinger)
Date: Wed, 31 Oct 2012 18:47:20 -0700
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121031113853.66fb0514@resist>
References: <20121031113853.66fb0514@resist>
Message-ID: <2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>


On Oct 31, 2012, at 3:38 AM, Barry Warsaw <barry at python.org> wrote:

>  IWBNI you could write it like this:
> 
>    with (open('/etc/passwd') as p1,
>          open('/etc/passwd') as p2):
>          pass
> 
> This seems analogous to using parens to wrap long if-statements, but maybe
> there's some subtle corner of the grammar that makes this problematic (like
> 'with' treating the whole thing as a single context manager).
> 
> Of course, you can wrap with backslashes, but ick!

I would rather live with a backslash than have yet another change to the grammar
or have context-manager semantics added to tuples.

ISTM that this isn't a problem worth solving.  


Raymond
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121031/6382b55c/attachment.html>

From grosser.meister.morti at gmx.net  Thu Nov  1 03:43:35 2012
From: grosser.meister.morti at gmx.net (Mathias Panzenböck)
Date: Thu, 01 Nov 2012 03:43:35 +0100
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <509153E6.8020107@online.de>
References: <20121031113853.66fb0514@resist>
	<CAP7+vJKuAcvQ9U2xkv5tsPrfNr+HqmPKgpGxPOghhiH1E9Rasg@mail.gmail.com>
	<CADiSq7fJKPNMjqpxL1hEMuE5iUjo=3zOGWAt66qtKeW4h8zSXQ@mail.gmail.com>
	<CAP7+vJLoqBjQ9yUKM7bD3mucYzQU9c-nai7tMK7QHOc24y-6KQ@mail.gmail.com>
	<509153E6.8020107@online.de>
Message-ID: <5091E1D7.5060004@gmx.net>

On 10/31/2012 05:37 PM, Joachim König wrote:
> On 31/10/2012 16:42, Guido van Rossum wrote:
>> Yeah, the problem is that when you see a '(' immediately after 'with',
>> you don't know whether that's just the start of a parenthesized
>> expression or the start of a (foo as bar, blah as blabla) syntactic
>> construct.
>
> but couldn't "with" be interpreted as an additional kind of opening parantheses (and "if", "for",
> "while",
> "elif", "else" too) and the ":" as the closing one?
>

What about "def", "class", "lambda" and "except"?

> I'm sure this has been asked a number of times but I couldn't find an answer.
>
> Joachim



From grosser.meister.morti at gmx.net  Thu Nov  1 03:46:36 2012
From: grosser.meister.morti at gmx.net (Mathias Panzenböck)
Date: Thu, 01 Nov 2012 03:46:36 +0100
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <CAJ6cK1bRVYQB5_bOkgrgbCpnMi=dmwK-DR3p7Dd9XQApxdxJsw@mail.gmail.com>
References: <20121031113853.66fb0514@resist>
	<CAJ6cK1bRVYQB5_bOkgrgbCpnMi=dmwK-DR3p7Dd9XQApxdxJsw@mail.gmail.com>
Message-ID: <5091E28C.5040700@gmx.net>

On 10/31/2012 10:03 PM, Arnaud Delobelle wrote:
> On 31 October 2012 10:38, Barry Warsaw <barry at python.org> wrote:
>> with-statements have a syntactic quirk, which I think would be useful to fix.
>> This is true in Python 2.7 through 3.3, but it's likely not fixable until 3.4,
>> unless of course it's a bug <wink>.
>>
>> Legal:
>>
>>>>> with open('/etc/passwd') as p1, open('/etc/passwd') as p2: pass
>>
>> Not legal:
>>
>>>>> with (open('/etc/passwd') as p1, open('/etc/passwd') as p2): pass
>>
>> Why is this useful?  If you need to wrap this onto multiple lines, say to fit
>> it within line length limits.  IWBNI you could write it like this:
>>
>>      with (open('/etc/passwd') as p1,
>>            open('/etc/passwd') as p2):
>>            pass
>>
>> This seems analogous to using parens to wrap long if-statements, but maybe
>> there's some subtle corner of the grammar that makes this problematic (like
>> 'with' treating the whole thing as a single context manager).
>>
>> Of course, you can wrap with backslashes, but ick!
>
> No need for backslashes, just put the brackets in the right place:
>
>      with (
>              open('/etc/passwd')) as p1, (
>              open('/etc/passwd')) as p2:
>         pass
>
> ;)
>

Because that's not confusing. Why not write:

	with open('/etc/passwd'
			) as p1, open(
			'/etc/passwd') as p2:
		pass


From phd at phdru.name  Thu Nov  1 04:08:08 2012
From: phd at phdru.name (Oleg Broytman)
Date: Thu, 1 Nov 2012 07:08:08 +0400
Subject: [Python-ideas] Function in shutil to clear contents of a
 directory
In-Reply-To: <loom.20121101T011302-450@post.gmane.org>
References: <loom.20121101T011302-450@post.gmane.org>
Message-ID: <20121101030808.GA31190@iskra.aviel.ru>

On Thu, Nov 01, 2012 at 12:24:37AM +0000, Vinay Sajip <vinay_sajip at yahoo.co.uk> wrote:
> A couple of times recently on different projects, I've had the need to clear out
> the contents of an existing directory. There doesn't appear to be any function
> in shutil to do this. I've used
> 
> def clear_directory(path):
>     for fn in os.listdir(path):
>         fn = os.path.join(path, fn)
>         if os.path.islink(fn) or os.path.isfile(fn):
>             os.remove(fn)
>         else:
>             shutil.rmtree(fn)
> 
> One could just do shutil.rmtree(path) followed by os.mkdir(path), but that fails
> if path is a symlink to a directory (disallowed by rmtree).
> 
> Is there some other obvious way to do this that I've missed? If not, I'd like to
> see something like this added to shutil. What say?

1. Perhaps the best way to achieve that is to add a parameter (a flag)
to shutil.rmtree telling it whether to remove the top-level path itself;
by default it must remove it, to stay backward-compatible.

2.

> def clear_directory(path):
>     for fn in os.listdir(path):
>         fn = os.path.join(path, fn)
>         if os.path.islink(fn) or os.path.isfile(fn):

   There are other filesystem objects besides files and links. I think
it would be better to test for directory:

          if not os.path.isdir(fn):
>             os.remove(fn)
>         else:
>             shutil.rmtree(fn)

   rmtree, BTW, does the same.
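
   Combining that with the original islink check, the helper might look
like this (an untested sketch):

    import os
    import shutil

    def clear_directory(path):
        for fn in os.listdir(path):
            fn = os.path.join(path, fn)
            # keep the islink test so a symlink to a directory is unlinked
            # rather than passed to rmtree (which refuses symlinks)
            if os.path.islink(fn) or not os.path.isdir(fn):
                os.remove(fn)
            else:
                shutil.rmtree(fn)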

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.


From barry at python.org  Thu Nov  1 07:33:16 2012
From: barry at python.org (Barry Warsaw)
Date: Thu, 1 Nov 2012 07:33:16 +0100
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>
References: <20121031113853.66fb0514@resist>
	<2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>
Message-ID: <20121101073316.378bb28e@resist>

On Oct 31, 2012, at 06:47 PM, Raymond Hettinger wrote:

>I would rather live with a backslash than have yet another change to the grammar
>or have context-manager semantics added to tuples.

Yeesh.  Despite the wink I guess people took me seriously about the tuples
thing.  I mean, really, c'mon! :)

>ISTM that this isn't a problem worth solving.  

There are certainly other ways to refactor away the long lines.  It's an odd
corner of syntactic quirkery that really only came about as I began removing
nested() calls, and nested context managers are already probably pretty rare.
It doesn't bother me that it's not worth "fixing" but I'm glad the discussion
is now captured for eternity in a mail archive.

-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121101/a4f9ca70/attachment.pgp>

From solipsis at pitrou.net  Thu Nov  1 11:48:06 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 1 Nov 2012 11:48:06 +0100
Subject: [Python-ideas] with-statement syntactic quirk
References: <20121031113853.66fb0514@resist>
	<2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>
	<20121101073316.378bb28e@resist>
Message-ID: <20121101114806.11e70ee8@pitrou.net>

On Thu, 1 Nov 2012 07:33:16 +0100
Barry Warsaw <barry at python.org> wrote:
> On Oct 31, 2012, at 06:47 PM, Raymond Hettinger wrote:
> 
> >I would rather live with a backslash than have yet another change to the grammar
> >or have context-manager semantics added to tuples.
> 
> Yeesh.  Despite the wink I guess people took me seriously about the tuples
> thing.  I mean, really, c'mon! :)
> 
> >ISTM that this isn't a problem worth solving.  
> 
> There are certainly other ways to refactor away the long lines.  It's an odd
> corner of syntactic quirkery that really only came about as I began removing
> nested() calls, and nested context managers are already probably pretty rare.
> It doesn't bother me that it's not worth "fixing" but I'm glad the discussion
> is now captured for eternity in a mail archive.

Uh, what people seem to miss is that it's not only about nested context
managers. It can happen with a single context manager:

with (some_context_manager(many_arguments...)
      as my_variable):
    ...

# SyntaxError!


Regards

Antoine.




From steve at pearwood.info  Thu Nov  1 12:26:59 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Thu, 01 Nov 2012 22:26:59 +1100
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121101114806.11e70ee8@pitrou.net>
References: <20121031113853.66fb0514@resist>
	<2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>
	<20121101073316.378bb28e@resist>
	<20121101114806.11e70ee8@pitrou.net>
Message-ID: <50925C83.9050507@pearwood.info>

On 01/11/12 21:48, Antoine Pitrou wrote:

> Uh, what people seem to miss is that it's not only about nested context
> managers. It can happen with a single context manager:
>
> with (some_context_manager(many_arguments...)
>        as my_variable):
>      ...
>
> # SyntaxError!

Have I missed something?

with some_context_manager(
         many_arguments,
         and_more_arguments,
         and_still_more_arguments
         ) as my_variable:
     ...


I'm still not seeing a problem that needs fixing.


-- 
Steven


From ncoghlan at gmail.com  Thu Nov  1 12:40:41 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 1 Nov 2012 21:40:41 +1000
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121031113853.66fb0514@resist>
References: <20121031113853.66fb0514@resist>
Message-ID: <CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>

On Wed, Oct 31, 2012 at 8:38 PM, Barry Warsaw <barry at python.org> wrote:
> with-statements have a syntactic quirk, which I think would be useful to fix.
> This is true in Python 2.7 through 3.3, but it's likely not fixable until 3.4,
> unless of course it's a bug <wink>.
>
> Legal:
>
>>>> with open('/etc/passwd') as p1, open('/etc/passwd') as p2: pass
>
> Not legal:
>
>>>> with (open('/etc/passwd') as p1, open('/etc/passwd') as p2): pass
>
> Why is this useful?  If you need to wrap this onto multiple lines, say to fit
> it within line length limits.  IWBNI you could write it like this:
>
>     with (open('/etc/passwd') as p1,
>           open('/etc/passwd') as p2):
>           pass
>
> This seems analogous to using parens to wrap long if-statements, but maybe
> there's some subtle corner of the grammar that makes this problematic (like
> 'with' treating the whole thing as a single context manager).
>
> Of course, you can wrap with backslashes, but ick!

I've been remiss in not mentioning the new alternative in 3.3 for
handling nesting of complex context management stacks:

    with contextlib.ExitStack() as cm:
        p1 = cm.enter_context(open('/etc/passwd'))
        p2 = cm.enter_context(open('/etc/passwd'))

(Note: ExitStack is really intended for cases where the number of
context managers involved varies dynamically, such as when you want to
make a CM optional, but you *can* use it for static cases if it seems
appropriate)
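
For example, the "optional CM" case it's aimed at looks something like this
(a rough sketch, with made-up names):

    import contextlib

    def read_passwd(log_path=None):
        with contextlib.ExitStack() as cm:
            src = cm.enter_context(open('/etc/passwd'))
            if log_path is not None:
                # the log file is only opened (and closed) when requested
                log = cm.enter_context(open(log_path, 'w'))
                log.write('reading /etc/passwd\n')
            return src.read()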

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From donspauldingii at gmail.com  Thu Nov  1 16:31:11 2012
From: donspauldingii at gmail.com (Don Spaulding)
Date: Thu, 1 Nov 2012 10:31:11 -0500
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<k6lvdf$i5n$1@ger.gmane.org>
	<A7269F03D11BC245BD52843B195AC4F0019B1647@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJLpq05bTfQWGvQpMWj0tz=vP29iSPpsdNMy+rQwpFWSKw@mail.gmail.com>
	<k6m9bt$g3r$1@ger.gmane.org>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
Message-ID: <CAMaNpgVy-zrEKrLvenVM=234hqX80CfhWGsQS63W=33V+efv+g@mail.gmail.com>

On Wed, Oct 31, 2012 at 5:36 PM, Steve Dower <Steve.Dower at microsoft.com>wrote:
>
> Despite this intended application, I have tried to approach this design
> task independently to produce an API that will work for many cases,
> especially given the narrow focus on sockets. If people decide to get hung
> up on "the Microsoft way" or similar rubbish then I will feel vindicated
> for not mentioning it earlier :-) - it has not had any more influence on
> wattle than any of my other past experience has.
>

Oh, what sad times are these when passing ruffians can say 'The Microsoft
Way' at will to old developers. There is a pestilence upon this land!
Nothing is sacred. Even those who arrange and design async APIs are under
considerable hegemonic stress at this point in time.

/me crawls back under his rock.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121101/75cd0b15/attachment.html>

From guido at python.org  Thu Nov  1 16:44:48 2012
From: guido at python.org (Guido van Rossum)
Date: Thu, 1 Nov 2012 08:44:48 -0700
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<k6lvdf$i5n$1@ger.gmane.org>
	<A7269F03D11BC245BD52843B195AC4F0019B1647@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJLpq05bTfQWGvQpMWj0tz=vP29iSPpsdNMy+rQwpFWSKw@mail.gmail.com>
	<k6m9bt$g3r$1@ger.gmane.org>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
Message-ID: <CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>

On Wed, Oct 31, 2012 at 3:36 PM, Steve Dower <Steve.Dower at microsoft.com> wrote:
> Guido van Rossum wrote:
> There is only one reason to use 'yield from' and that is for the performance optimisation, which I do acknowledge and did observe in my own benchmarks.

Actually, it is not just optimization. The logic of the scheduler also
becomes much simpler.

> I know I've been vague about our intended application (deliberately so, to try and keep the discussion neutral), but I'll lay out some details.

Actually I wish you'd written this sooner. I don't know about you, but
my brain has a hard time understanding abstractions that are presented
without concrete use cases and implementations alongside; OTOH I
delight in taking a concrete mess and extracting abstractions from it.
(The Twisted guys are also masters at this.)

So far I didn't really "get" the reasons you brought up for some of the
complications you introduced (like multiple Future implementations).
Now I think I'm glimpsing your reasons.

> We're working on adding support for Windows 8 apps (formerly known as Metro) written in Python. These will use the new API (WinRT) which is highly asynchronous - even operations such as opening a file are only* available as an asynchronous function. The intention is to never block on the UI thread.

Interesting. The lack of synchronous wrappers does seem a step back,
but is probably useful as a forcing function given the desire to keep
the UI responsive at all times.

> (* Some synchronous Win32 APIs are still available from C++, but these are actively discouraged and restricted in many ways. Most of Win32 is not usable.)
>
> The model used for these async APIs is future-based: every *Async() function returns a future for a task that is already running. The caller is not allowed to wait on this future - the only option is to attach a callback. C# and VB use their async/await keywords (good 8 min intro video on those: http://www.visualstudiolaunch.com/vs2012vle/Theater?sid=1778) while JavaScript and C++ have multi-line lambda support.

Erik Meijer introduced me to async/await on Elba two months ago. I was
very excited to recognize exactly what I'd done for NDB with @tasklet
and yield, supported by the type checking.

> For Python, we are aiming for closer to the async/await model (which is also how we chose the names).

If we weren't so reluctant to introduce new keywords in Python we
might introduce await as an alias for yield from in the future.

> Incidentally, our early designs used yield from exclusively. It was only when we started discovering edge-cases where things broke, as well as the impact on code 'cleanliness', that we switched to yield.

Very interesting. I'd love to see a much longer narrative on this.
(You can send it to me directly if you feel it would distract the list
or if you feel it's inappropriate to share widely. I'll keep it under
my hat as long as you say so.)

> There are three aspects of this that work better and result in cleaner code with wattle than with tulip:
>
>  - event handlers can be "async-void", such that when the event is raised by the OS/GUI/device/whatever the handler can use asynchronous tasks without blocking the main thread.

I think this is "fire-and-forget"? I.e. you initiate an action and
then just let it run until completion without ever checking the
result? In tulip you currently do that by wrapping it in a Task and
calling its start() method. (BTW I think I'm going to get rid of
start() -- creating a Task should just start it.)

> In this case, the caller receives a future but ignores it because it does not care about the final result. (We could achieve this under 'yield from' by requiring a decorator, which would then probably prevent other Python code from calling the handler directly. There is very limited opportunity for us to reliably intercept this case.)

Are you saying that this property (you don't wait for the result) is
required by the operation rather than an option for the user? I'm only
familiar with the latter -- e.g. I can imagine firing off an operation
that writes a log entry somewhere but not caring about whether it
succeeded -- but I would still make it *possible* to check on the
operation if the caller cares (what if it's a very important log
message?).

If there's no option for the caller, the API should present itself as
a regular function/method and the task-spawning part should be hidden
inside it -- I see no need for the caller to know about this.

What exactly do you mean by "reliably intercept this case" ? A
concrete example would help.

>  - the event loop is implemented by the OS. Our Scheduler implementation does not need to provide an event loop, since we can submit() calls to the OS-level loop. This pattern also allows wattle to 'sit on top of' any other event loop, probably including Twisted and 0MQ, though I have not tried it (except with Tcl).

Ok, so what is the API offered by the OS event loop? I really want to
make sure that tulip can interface with strange event loops, and this
may be the most concrete example so far -- and it may be an important
one.

>  - Future objects can be marshalled directly from Python into Windows, completing the interop story.

What do you mean by marshalled here? Surely not the stdlib marshal
module. Do you just mean that Future objects can be recognized by the
foreign-function interface and wrapped by / copied into native Windows
8 datatypes?

I understand your event loop understands Futures? All of them? Or only
the ones of the specific type that it also returns?

> Even with tulip, we would probably still require a decorator for this case so that we can marshal regular generators as iterables (for which there is a specific type).

I can't quite follow you here, probably due to lack of imagination on
my part. Can you help me with a (somewhat) concrete example?

> Without a decorator, we would probably have to ban both cases to prevent subtly misbehaving programs.

Concrete example?

> At least with wattle, the user does not have to do anything different from any of their other @async functions.

This is because you can put type checks inside @async, which sees the
function object before it's called, rather than the scheduler, which
only sees what it returned, right? That's a trick I use in NDB as well
and I think tulip will end up requiring a decorator too -- but it will
just "mark" the function rather than wrap it in another one, unless
the function is not a generator (in which case it will probably have
to wrap it in something that is a generator). I could imagine a debug
version of the decorator that added wrappers in all cases though.
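
Such a marking decorator might look roughly like this (a hypothetical
sketch to illustrate the idea, not actual tulip code):

    import inspect

    def coroutine(func):
        # Generator functions just get marked; no wrapper is needed.
        if inspect.isgeneratorfunction(func):
            func._is_coroutine = True
            return func
        # Plain functions get wrapped so the scheduler always sees a generator.
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
            yield  # unreachable; its presence makes wrapper a generator function
        wrapper._is_coroutine = True
        return wrapper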

> Despite this intended application, I have tried to approach this design task independently to produce an API that will work for many cases, especially given the narrow focus on sockets. If people decide to get hung up on "the Microsoft way" or similar rubbish then I will feel vindicated for not mentioning it earlier :-) - it has not had any more influence on wattle than any of my other past experience has.

No worries about that. I agree that we need concrete examples that
take us beyond the world of sockets; it's just that sockets are where
most of the interest lies (Tornado is a webserver, Twisted is often
admired because of its implementations of many internet protocols,
people benchmark async frameworks on how many HTTP requests per second
they can serve) and I haven't worked with any type of GUI framework in
a very long time. (Kudos for trying your way with Tk!)

-- 
--Guido van Rossum (python.org/~guido)


From tjreedy at udel.edu  Thu Nov  1 17:33:07 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 01 Nov 2012 12:33:07 -0400
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121101114806.11e70ee8@pitrou.net>
References: <20121031113853.66fb0514@resist>
	<2BCA60E7-86D0-4C35-8B7F-0B40DA30B003@gmail.com>
	<20121101073316.378bb28e@resist>
	<20121101114806.11e70ee8@pitrou.net>
Message-ID: <k6u88h$dj8$1@ger.gmane.org>

On 11/1/2012 6:48 AM, Antoine Pitrou wrote:

> Uh, what people seem to miss is that it's not only about nested context
> managers. It can happen with a single context manager:
>
> with (some_context_manager(many_arguments...)
>        as my_variable):
>      ...
>
> # SyntaxError!

As it should be. With is not a(n overt) function and with statements 
should not look like function calls. With clauses are not expressions, 
so they cannot (should not) be arbitrarily surrounded by parentheses as 
expressions can. Multiple with clauses do not form a tuple and do not 
need parentheses to set off the tuple and modify expression precedence.

I understand better now why I did not like adding optional \-avoidance 
parentheses to import statements. Sprinkling statements with optional, 
non-expression parentheses does not fit the nature of statement syntax.

-- 
Terry Jan Reedy



From Steve.Dower at microsoft.com  Thu Nov  1 17:44:45 2012
From: Steve.Dower at microsoft.com (Steve Dower)
Date: Thu, 1 Nov 2012 16:44:45 +0000
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<k6lvdf$i5n$1@ger.gmane.org>
	<A7269F03D11BC245BD52843B195AC4F0019B1647@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJLpq05bTfQWGvQpMWj0tz=vP29iSPpsdNMy+rQwpFWSKw@mail.gmail.com>
	<k6m9bt$g3r$1@ger.gmane.org>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
Message-ID: <A7269F03D11BC245BD52843B195AC4F0019B2815@TK5EX14MBXC292.redmond.corp.microsoft.com>

Guido van Rossum wrote:
> On Wed, Oct 31, 2012 at 3:36 PM, Steve Dower <Steve.Dower at microsoft.com> wrote:
>> Guido van Rossum wrote:
>> There is only one reason to use 'yield from' and that is for the performance
> optimisation, which I do acknowledge and did observe in my own benchmarks.
> 
> Actually, it is not just optimization. The logic of the scheduler also becomes
> much simpler.

I'd argue that it doesn't; it just happens that the implementation of 'yield from' in the interpreter matches the most common case. In any case, the affected area of code (which I haven't been calling 'scheduler', which seems to have caused some confusion elsewhere) only has to be written once and never touched again. It could even be migrated into C, which should significantly improve the performance. (In wattle, this is the _Awaiter class.)
 
>> I know I've been vague about our intended application (deliberately so, to try
>> and keep the discussion neutral), but I'll lay out some details.
> 
> Actually I wish you'd written this sooner. I don't know about you, but my brain
> has a hard time understanding abstractions that are presented without concrete
> use cases and implementations alongside; OTOH I delight in taking a concrete
> mess and extract abstractions from it.
> (The Twisted guys are also masters at this.)
> 
> So far I didn't really "get" the reasons you brought up for some of
> complications you introduced (like multiple Future implementations).
> Now I think I'm glimpsing your reasons.

Part of the art of conversation is figuring out how the other participants need to hear something. My apologies for not figuring this out sooner :)

>> We're working on adding support for Windows 8 apps (formerly known as Metro)
>> written in Python. These will use the new API (WinRT) which is highly
>> asynchronous - even operations such as opening a file are only* available as an
>> asynchronous function. The intention is to never block on the UI thread.
> 
> Interesting. The lack of synchronous wrappers does seem a step back, but is
> probably useful as a forcing function given the desire to keep the UI responsive
> at all times.

Indeed. Based on the Win 8 apps I regularly use, it's worked well. On the other hand, updating CPython to avoid the synchronous ones (which I've done, and will be submitting for consideration soon, once I've been able to test on an ARM device) is less fun.

>> (* Some synchronous Win32 APIs are still available from C++, but these
>> are actively discouraged and restricted in many ways. Most of Win32 is
>> not usable.)
>>
>> The model used for these async APIs is future-based: every *Async() function
>> returns a future for a task that is already running. The caller is not allowed
>> to wait on this future - the only option is to attach a callback. C# and VB use
>> their async/await keywords (good 8 min intro video on those:
>> http://www.visualstudiolaunch.com/vs2012vle/Theater?sid=1778) while JavaScript
>> and C++ have multi-line lambda support.
> 
> Erik Meijer introduced me to async/await on Elba two months ago. I was very
> excited to recognize exactly what I'd done for NDB with @tasklet and yield,
> supported by the type checking.
>
>> For Python, we are aiming for closer to the async/await model (which is also
>> how we chose the names).
> 
> If we weren't so reluctant to introduce new keywords in Python we might
> introduce await as an alias for yield from in the future.

We discussed that internally and decided that it was unnecessary, or at least that it should be a proper keyword rather than an alias (as in, you can't use 'await' to delegate to a subgenerator). I'd rather see codef added first, since that (could) remove the need for the decorators.

>> Incidentally, our early designs used yield from exclusively. It was only when
> we started discovering edge-cases where things broke, as well as the impact on
> code 'cleanliness', that we switched to yield.
> 
> Very interesting. I'd love to see a much longer narrative on this.
> (You can send it to me directly if you feel it would distract the list or if you
> feel it's inappropriate to share widely. I'll keep it under my hat as long as
> you say so.)

If I get a chance to write something up then I will do that. I'll quite happily post it publicly, though it may go on my blog rather than here - this email is going to be long enough already. There is very little already written up since we discussed most of it at a whiteboard, though I do still have some early code iterations.

>> There are three aspects of this that work better and result in cleaner code
>> with wattle than with tulip:
>>
>> - event handlers can be "async-void", such that when the event is raised by
>> the OS/GUI/device/whatever the handler can use asynchronous tasks without
>> blocking the main thread.
> 
> I think this is "fire-and-forget"? I.e. you initiate an action and then just let
> it run until completion without ever checking the result? In tulip you currently
> do that by wrapping it in a Task and calling its start() method. (BTW I think
> I'm going to get rid of
> start() -- creating a Task should just start it.)

Yes, exactly. The only thing I dislike about tulip's current approach is that it requires two functions. If/when we support it, we'd provide a decorator that does the wrapping.

>> In this case, the caller receives a future but ignores it because it
>> does not care about the final result. (We could achieve this under
>> 'yield from' by requiring a decorator, which would then probably
>> prevent other Python code from calling the handler directly. There is
>> very limited opportunity for us to reliably intercept this case.)
> 
> Are you saying that this property (you don't wait for the result) is required by
> the operation rather than an option for the user? I'm only familiar with the
> latter -- e.g. I can imagine firing off an operation that writes a log entry
> somewhere but not caring about whether it succeeded -- but I would still make it
> *possible* to check on the operation if the caller cares (what if it's a very
> important log message?).
> 
> If there's no option for the caller, the API should present itself as a regular
> function/method and the task-spawning part should be hidden inside it -- I see
> no need for the caller to know about this.
>
> What exactly do you mean by "reliably intercept this case" ? A concrete example
> would help.

You're exactly right, there is no need for the original caller (for example, Windows itself) to know about the task. However, every incoming call initially comes through a COM interface that we provide (written in C) that will then invoke the Python function. This is our opportunity to intercept by looking at the returned value from the Python function before returning to the original caller.

Under wattle, we can type check here for a Future (or compatible interface), which is only ever used for async functions. On the other hand, we cannot reliably type-check for a generator to determine whether it is supposed to be async or supposed to be an iterator.

If the interface we implement expects an iterator then we can assume that we should treat the generator like that. However, if the user intended their code to be async and used 'yield from' with no decorator, we cannot provide any useful feedback: they will simply return a sequence of null pointers that is executed as quickly as the caller wants to - there is no scheduler involved in this case.

>> - the event loop is implemented by the OS. Our Scheduler implementation does
>> not need to provide an event loop, since we can submit() calls to the OS-level
>> loop. This pattern also allows wattle to 'sit on top of' any other event loop,
>> probably including Twisted and 0MQ, though I have not tried it (except with
>> Tcl).
> 
> Ok, so what is the API offered by the OS event loop? I really want to make sure
> that tulip can interface with strange event loops, and this may be the most
> concrete example so far -- and it may be an important one.

There are three main APIs involved:

* Windows.UI.Core.CoreDispatcher.run_async() (and run_idle_async(), which uses a low priority)
* Windows.System.Threading.ThreadPool.run_async()
* any API that returns a future (==an object implementing IAsyncInfo)

Strictly, the third category covers the first two, since they both return a future, but they are also the APIs that allow the user/developer to schedule work on or off the UI thread (respectively).

For wattle, they equate directly to Scheduler.submit, Scheduler.thread_pool.submit (which wasn't in the code, but was suggested in the write-up) and Future. 

>> - Future objects can be marshalled directly from Python into Windows,
>> completing the interop story.
> 
> What do you mean by marshalled here? Surely not the stdlib marshal module.

No.

>Do you just mean that Future objects can be recognized by the foreign-function
> interface and wrapped by / copied into native Windows 8 datatypes?

Yes, this is exactly what we would do. The FFI creates a WinRT object that forwards calls between Python and Windows as necessary. (This is a general mechanism that we use for many types, so it doesn't matter how the Future is created. On a related note, returning a Future from Python code into Windows will not be a common occurrence - it is far more common for Python to consume Futures that are passed in.)

> I understand your event loop understands Futures? All of them? Or only the ones
> of the specific type that it also returns?

It's based on an interface, so as long as we can provide (equivalents of) add_done_callback() and result() then the FFI will do the rest.

>> Even with tulip, we would probably still require a decorator for this case so
>> that we can marshal regular generators as iterables (for which there is a
>> specific type).
> 
> I can't quite follow you here, probably due to lack of imagination on my part.
> Can you help me with a (somewhat) concrete example?

Given a (Windows) prototype:

IIterable<String> GetItems();

We want to allow the Python function to be implemented as:

def get_items():
    for data in ['a', 'b', 'c']:
        yield data

This is a pretty straightforward mapping: Python returns a generator, which supports the same interface as IIterable, so we can marshal the object out and convert each element to a string.

The problem is when a (possibly too keen) user writes the following code:

def get_items():
    data = yield from get_data_async()
    return data

Now the returned generator is full of None, which we will happily convert to a sequence of empty strings (==null pointers in Win8). With wattle, the yielded objects would be Futures, which would still be converted to strings, but at least are obviously incorrect. Also, since the user should be in the habit of adding @async already, we can raise an error even earlier when the return value is a future and not a generator.

Unfortunately, nothing can fix this code (except maybe a new keyword):

def get_items():
    data = yield from get_data_async()
    for item in data:
        yield item 


>> Without a decorator, we would probably have to ban both cases to prevent
> subtly misbehaving programs.
> 
> Concrete example?

Given above. By banning both cases we would always raise TypeError when a generator is returned, even if an iterable or an async operation is expected, because we can't be sure which one we have.

>> At least with wattle, the user does not have to do anything different from any
>> of their other @async functions.
> 
> This is because you can put type checks inside @async, which sees the function
> object before it's called, rather than the scheduler, which only sees what it
> returned, right? That's a trick I use in NDB as well and I think tulip will end
> up requiring a decorator too -- but it will just "mark" the function rather than
> wrap it in another one, unless the function is not a generator (in which case it
> will probably have to wrap it in something that is a generator). I could imagine
> a debug version of the decorator that added wrappers in all cases though.

It's not so much the type checks inside @async - those are basically to support non-generator functions being wrapped (though there is little benefit to this apart from maintaining a consistent interface). The benefit is that the _returned object_ is always going to be some sort of Future. 

Because of the way that our FFI will work, a simple marker on the function would be sufficient for our interop purposes. However, I don't think it is a general enough solution (for example, if the caller is already in Python then they may not get to see the function before it is called - Twisted might be affected by this, though I'm not sure).

What might work best is allowing the replacement scheduler/pollster to provide or override the decorator somehow, though I don't see any convenient way to do this.


>> Despite this intended application, I have tried to approach this design task
>> independently to produce an API that will work for many cases, especially given
>> the narrow focus on sockets. If people decide to get hung up on "the Microsoft
>> way" or similar rubbish then I will feel vindicated for not mentioning it
>> earlier :-) - it has not had any more influence on wattle than any of my other
>> past experience has.
> 
> No worries about that. I agree that we need concrete examples that takes us
> beyond the world of sockets; it's just that sockets are where most of the
> interest lies (Tornado is a webserver, Twisted is often admired because of its
> implementations of many internet protocols, people benchmark async frameworks on
> how many HTTP requests per second they can serve) and I haven't worked with any
> type of GUI framework in a very long time. (Kudos for trying your way Tk!)

I don't blame you for avoiding GUI frameworks... there are very few that work well. Hopefully when we fully support XAML-based GUIs that will change somewhat, at least for Windows developers.

Also, I didn't include the Tk scheduler in BitBucket, but just to illustrate the simplicity of wrapping an existing loop I've posted the full code below (it still has some old names in it):

import contexts

class TkContext(contexts.CallableContext):
    def __init__(self, app):
        self.app = app

    @staticmethod
    def invoke(callable, args, kwargs):
        callable(*args, **kwargs)

    def submit(self, callable, *args, **kwargs):
        '''Adds a callable to invoke within this context.'''
        self.app.after(0, TkContext.invoke, callable, args, kwargs)
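
Using it is straightforward - something along these lines (untested here,
and assuming an existing Tk application object):

    import tkinter

    app = tkinter.Tk()
    context = TkContext(app)
    context.submit(print, 'hello from the Tk event loop')
    app.mainloop()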


Cheers,
Steve


From tjreedy at udel.edu  Thu Nov  1 18:50:56 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 01 Nov 2012 13:50:56 -0400
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <A7269F03D11BC245BD52843B195AC4F0019B2815@TK5EX14MBXC292.redmond.corp.microsoft.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2815@TK5EX14MBXC292.redmond.corp.microsoft.com>
Message-ID: <k6ucqf$pcf$1@ger.gmane.org>

On 11/1/2012 12:44 PM, Steve Dower wrote:

>>> C# and VB use
>>> their async/await keywords (good 8 min intro video on those:
>>> http://www.visualstudiolaunch.com/vs2012vle/Theater?sid=1778

Thanks for the link. It makes much of this discussion more concrete for
me. As a potential user, the easy async = @async, await = yield from
transformation (additions) is what I would like for Python.

I do realize that the particular task was picked to be easy and that 
other things might be harder (on Windows), and that Python has the 
additional problem of working on multiple platforms. But I think 'make 
easy things easy and difficult things possible' applies here.

I have no problem with 'yield from' instead of 'await' = 'wait for'. 
Actually, the caller of the movie list fetcher did *not* wait for the 
entire list to be fetched, even asynchronously. Rather, it displayed 
items as they were available (yielded). So the app does less waiting, 
and 'yield as available' is what 'await' does in that example.

Actually, I do not see how just adding 4 keywords would necessarily have
the effect it did. I imagine there is a bit more to the story than was
shown, like the 'original' code being carefully written so that the
change would work as demonstrated. The video is, after all, an
advertorial. Nonetheless, it was impressive.

-- 
Terry Jan Reedy



From Steve.Dower at microsoft.com  Thu Nov  1 19:07:56 2012
From: Steve.Dower at microsoft.com (Steve Dower)
Date: Thu, 1 Nov 2012 18:07:56 +0000
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <k6ucqf$pcf$1@ger.gmane.org>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2815@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<k6ucqf$pcf$1@ger.gmane.org>
Message-ID: <A7269F03D11BC245BD52843B195AC4F0019B3891@TK5EX14MBXC292.redmond.corp.microsoft.com>

Terry Reedy wrote:
> On 11/1/2012 12:44 PM, Steve Dower wrote:
>>>> C# and VB use
>>>> their async/await keywords (good 8 min intro video on those:
>>>> http://www.visualstudiolaunch.com/vs2012vle/Theater?sid=1778
>
> [SNIP]
>
> Actually, I do not see how just adding 4 keywords would necessarily have 
> the effect it did. I imagine there is a bit more to the story than was shown,
> like the 'original' code being carefully written so that the change would
> have the effect it did. The video is, after all, an advertorial. Nonetheless,
> it was impressive.

It is certainly a dramatic demo, and you are right to be skeptical. The "carefully written" part is that the code already used paging as part of its query - the "give me movies from 1950" request is actually a series of "give me 10 movies from 1950 starting from {0, 10, 20, 30, ...}" requests (this is why you see the progress counter go up by 10 each time) - and it's already updating the UI between each page. The "4 keywords" also activate a significant amount of compiler machinery that actually rewrites the original code, much like the conversion to a generator, so there is quite a bit of magic.

There are plenty of videos at http://channel9.msdn.com/search?term=async+await that go much deeper into how it all works, including the 3rd-party extensibility mechanisms. 

(And apologies about the video only being available with Silverlight - I didn't realise this when I originally posted it. The videos at the later link are much more readily available, but also very deeply technical and quite long.)

Cheers,
Steve




From ubershmekel at gmail.com  Thu Nov  1 23:06:20 2012
From: ubershmekel at gmail.com (Yuval Greenfield)
Date: Fri, 2 Nov 2012 00:06:20 +0200
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
References: <20121031113853.66fb0514@resist>
	<CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
Message-ID: <CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>

On Thu, Nov 1, 2012 at 1:40 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> I've been remiss in not mentioning the new alternative in 3.3 for
> handling nesting of complex context management stacks:
>
>     with contextlib.ExitStack() as cm:
>         p1 = cm.enter_context(open('/etc/passwd'))
>         p2 = cm.enter_context(open('/etc/passwd'))
>
> (Note: ExitStack is really intended for cases where the number of
> context managers involved varies dynamically, such as when you want to
> make a CM optional, but you *can* use it for static cases if it seems
> appropriate)
>
>
Go's "defer" is quite a neat solution for these hassles if anyone's in the
mood for a time machine discussion.

http://golang.org/doc/effective_go.html#defer

Yuval Greenfield
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121102/4dd7f154/attachment.html>

From cs at zip.com.au  Thu Nov  1 23:14:31 2012
From: cs at zip.com.au (Cameron Simpson)
Date: Fri, 2 Nov 2012 09:14:31 +1100
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <CABicbJJWSHvyFNh4BBkTczZ9=gSLCaP59LqMN4dhAYao1pEn3g@mail.gmail.com>
References: <CABicbJJWSHvyFNh4BBkTczZ9=gSLCaP59LqMN4dhAYao1pEn3g@mail.gmail.com>
Message-ID: <20121101221431.GA24191@cskk.homeip.net>

On 31Oct2012 08:17, Devin Jeanpierre <jeanpierreda at gmail.com> wrote:
| Is there a reason the tokenizer can't ignore newlines and
| indentation/deindentation between with/etc. and the trailing colon?
| This would solve the problem in general, without ambiguous syntax.

To my mind one of the attractive features of the current syntax is that
forgetting the colon causes an immediate complaint. Once one allows an
arbitrary number of lines between (if,while,with) and the colon one is
into the same zone as several other languages, where a common mistake
can cause a syntax comoplain many lines beyond where the mistake was
made.

I understand the attractiveness here, but I think I would prefer staying
with the status quo (overt brackets or icky trailing sloshes) for extending
the lines in a condition, rather than opening the syntax up to complaints
far beyond the mistake.

Just 2c,
-- 
Cameron Simpson <cs at zip.com.au>

If you don't know what your program is supposed to do, you'd better not start
writing it. - Edsger W. Dijkstra


From greg.ewing at canterbury.ac.nz  Thu Nov  1 23:37:59 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 02 Nov 2012 11:37:59 +1300
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<CAP7+vJLpq05bTfQWGvQpMWj0tz=vP29iSPpsdNMy+rQwpFWSKw@mail.gmail.com>
	<k6m9bt$g3r$1@ger.gmane.org>
	<CAP7+vJ+atdR5cTKVrSYKoUrvzbzVH4GoPFLcm6LkPgSL3RmYOQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC0648@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJLLN15dzFZ4ynaUko-XEJMA9mvUUSromA3TsugaTnUe6Q@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DC2180@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJjJfzSEJ-0dWRkGHrf65U3Xb69=-3mXrm0FToE4rYP5w@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B2232@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJJzZyf5U_JRv0rRdOJSnFzsZrhzCRvNfWwjffnUEjthnA@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B23F2@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKD5O9oLRNHufnTYVJo-U_+Ha-iSgspcdEJDT-F8Vg0Aw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019B24D3@TK5EX14MBXC292.redmond.corp.microsoft.com>
	<CAP7+vJKRQFXj_V2Y0ryyjmjuXpEn1i9S6ZQPkQUfOzzpGFRxfQ@mail.gmail.com>
Message-ID: <5092F9C7.1060504@canterbury.ac.nz>

Guido van Rossum wrote:
> If we weren't so reluctant to introduce new keywords in Python we
> might introduce await as an alias for yield from in the future.

Or 'cocall'. :-)

> I think tulip will end up requiring a decorator too -- but it will
> just "mark" the function rather than wrap it in another one, unless
> the function is not a generator (in which case it will probably have
> to wrap it in something that is a generator).

I don't see how that helps much, because the scheduler
doesn't see generators used in yield-from calls. There
is *no* way to catch the mistake of writing

    foo()

when you should have written

    yield from foo()

instead.

This is one way that codef/cocall (or some variant on it)
would help, by clearly diagnosing that mistake.

-- 
Greg


From ncoghlan at gmail.com  Thu Nov  1 23:41:37 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 2 Nov 2012 08:41:37 +1000
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>
References: <20121031113853.66fb0514@resist>
	<CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
	<CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>
Message-ID: <CADiSq7fPHSiO=kDMb=GiacJtO6tO4455tSj_+eOMLQqa3aX5Kg@mail.gmail.com>

On Nov 2, 2012 8:06 AM, "Yuval Greenfield" <ubershmekel at gmail.com> wrote:
>
> On Thu, Nov 1, 2012 at 1:40 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>
>> I've been remiss in not mentioning the new alternative in 3.3 for
>> handling nesting of complex context management stacks:
>>
>>     with contextlib.ExitStack() as cm:
>>         p1 = cm.enter_context(open('/etc/passwd'))
>>         p2 = cm.enter_context(open('/etc/passwd'))
>>
>> (Note: ExitStack is really intended for cases where the number of
>> context managers involved varies dynamically, such as when you want to
>> make a CM optional, but you *can* use it for static cases if it seems
>> appropriate)
>>
>
> Go's "defer" is quite a neat solution for these hassles if anyone's in
> the mood for a time machine discussion.
>
> http://golang.org/doc/effective_go.html#defer

Go was one of the reference points for the ExitStack design (it's a large
part of why the API also supports providing callbacks directly to the exit
stack, not just as context managers).

Cheers,
Nick.

--
Sent from my phone, thus the relative brevity :)
>
> Yuval Greenfield
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121102/545938dc/attachment.html>

From bruce at leapyear.org  Thu Nov  1 23:44:12 2012
From: bruce at leapyear.org (Bruce Leban)
Date: Thu, 1 Nov 2012 15:44:12 -0700
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121101221431.GA24191@cskk.homeip.net>
References: <CABicbJJWSHvyFNh4BBkTczZ9=gSLCaP59LqMN4dhAYao1pEn3g@mail.gmail.com>
	<20121101221431.GA24191@cskk.homeip.net>
Message-ID: <CAGu0Anuc0WSmo-LO7Hvm=R1iThk00o8Wu+kXj5_uG7K6+mQrAQ@mail.gmail.com>

On Thu, Nov 1, 2012 at 3:14 PM, Cameron Simpson <cs at zip.com.au> wrote:

> To my mind one of the attractive features of the current syntax is that
> forgetting the colon causes an immediate complaint.


I agree

I understand the attractiveness here, but I think I would prefer staying
> with the status quo (overt brackets or icky trailing sloshes) to extend
> the lines in a condition over opening the syntax to complaints far beyond
> the mistake.
>
>
Ditto except for the part about \ continuation being icky. I don't think
it's that bad. The two things that make \ continuation less attractive than
() continuation are that you can't put comments on \-continued lines and
that invisible trailing white space is a syntax error.

I don't understand the reason for either restriction.

with open('/etc/passwd') as p1, \      # source
     open('/etc/passwd') as p2:        # destination

seems more readable than

with open('/etc/passwd') as p1, \
     open('/etc/passwd') as p2:        # source, destination

A reasonable restriction (to my mind) would be to require at least two
spaces or a tab after a \ before a comment (although requiring just one
space would also be ok with me although I personally would always use
more). This change couldn't break existing code since \ is currently a
syntax error if followed by whitespace or a comment.

I would ignore whitespace after a final \ in a string, but would not allow
comments.

(Yes, I realize that better variable names would obviate the need for these
particular comments but comments are still useful sometimes :-)

--- Bruce
Follow me: http://www.twitter.com/Vroo http://www.vroospeak.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121101/8b002d36/attachment.html>

From sturla at molden.no  Fri Nov  2 22:29:09 2012
From: sturla at molden.no (Sturla Molden)
Date: Fri, 2 Nov 2012 22:29:09 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
Message-ID: <2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>

On 19 Oct 2012, at 18:05, Guido van Rossum <guido at python.org> wrote:

> An issue in the design of the I/O loop is the strain between a
> ready-based and completion-based design. The typical Unix design
> (whether based on select or any of the poll variants) is usually
> ready-based; but on Windows, the only way to get high performance is
> to base it on IOCP, which is completion-based (i.e. you start a
> specific async operation, like writing N bytes, and the I/O loop tells
> you when it is done). I would like people to be able to write fast
> event handling programs on Windows too, and ideally the only change
> would be the implementation of the I/O loop. But I don't know how
> tenable that is given the dramatically different style used by IOCP
> and the need to use native Windows API for all async I/O -- it sounds
> like we could only do this if the library providing the I/O loop
> implementation also wrapped all I/O operations, andthat may be a bit
> much.


Not really, no.

IOCP might be the easiest way to get high performance on Windows, but certainly not the only.

IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.

Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of a Unix select() function.

Then the problem is polling for "ready-to-read" and "ready-to-write". The annoying part is that different types of files (disk files, sockets, pipes, named pipes, hardware devices) must be polled with different Windows API calls -- but there are non-blocking calls to poll them all. For this reason, Cygwin's select function spawns one thread to poll each type of file. Threads are very cheap on Windows, and polling loops can use Sleep(0) to release the remainder of their time-slice, so this kind of polling is not very expensive. However, if we use a thread-pool for the polling, instead of spawning new threads on each call to select, we would be doing more or less the same as Windows' built-in IOCPs, except we are signalling "ready" instead of "finished". 

Thus, I think it is possible to get high performance without IOCP. But Microsoft has only implemented a select call for sockets. My suggestion would be to forget about IOCP and implement select for more than just sockets on Windows. The reason for this is that select and IOCP signal on different sides of the I/O operation (ready vs. completed). So programs based on select and IOCP tend to have opposite logics with respect to scheduling I/O. And as the general trend today is to develop for Unix and then port to Windows (as most programmers find the Windows API annoying), I think it would be better to port select (and perhaps poll and epoll) to Windows than provide IOCP to Python. 
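
To make the ready-signalling idea concrete, here is a rough Windows-only
ctypes sketch (an illustration only -- the event handles, and the SetEvent
call standing in for a per-file-type poller thread, are invented for it):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    kernel32.CreateEventW.restype = wintypes.HANDLE
    kernel32.SetEvent.argtypes = [wintypes.HANDLE]
    kernel32.WaitForMultipleObjects.argtypes = [
        wintypes.DWORD, ctypes.POINTER(wintypes.HANDLE),
        wintypes.BOOL, wintypes.DWORD]
    WAIT_TIMEOUT = 0x102

    # One manual-reset event per "file"; a poller thread would call
    # SetEvent() on the matching handle when that file becomes ready.
    events = [kernel32.CreateEventW(None, True, False, None) for _ in range(8)]

    def wait_ready(handles, timeout_ms):
        """Index of the first signalled handle, or None on timeout."""
        arr = (wintypes.HANDLE * len(handles))(*handles)
        rc = kernel32.WaitForMultipleObjects(len(handles), arr, False, timeout_ms)
        return None if rc == WAIT_TIMEOUT else rc   # WAIT_OBJECT_0 is 0

    kernel32.SetEvent(events[3])       # pretend file #3 just became readable
    print(wait_ready(events, 1000))    # -> 3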


Sturla

From solipsis at pitrou.net  Fri Nov  2 23:14:17 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 2 Nov 2012 23:14:17 +0100
Subject: [Python-ideas] The async API of the future
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
Message-ID: <20121102231417.12407875@pitrou.net>

On Fri, 2 Nov 2012 22:29:09 +0100
Sturla Molden <sturla at molden.no> wrote:
> 
> IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
> 
> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
> 
> Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.

Hmm, but the basic problem with WaitForMultipleObjects is that it has a
hard limit of 64 objects you can wait on.

Regards

Antoine.




From sturla at molden.no  Sat Nov  3 00:21:43 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 00:21:43 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121102231417.12407875@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
Message-ID: <E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>

On 2 Nov 2012, at 23:14, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Fri, 2 Nov 2012 22:29:09 +0100
> Sturla Molden <sturla at molden.no> wrote:
>> 
>> IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
>> 
>> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
>> 
>> Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
> 
> Hmm, but the basic problem with WaitForMultipleObjects is that it has a
> hard limit of 64 objects you can wait on.
> 

Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously. At the end of the loop, call Sleep(0) to avoid burning the CPU. A small number of threads could also be used to run this loop in parallel.

Sturla







> Regards
> 
> Antoine.
> 
> 
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas


From solipsis at pitrou.net  Sat Nov  3 00:30:36 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 3 Nov 2012 00:30:36 +0100
Subject: [Python-ideas] The async API of the future
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
Message-ID: <20121103003036.74621d59@pitrou.net>

On Sat, 3 Nov 2012 00:21:43 +0100
Sturla Molden <sturla at molden.no> wrote:
> Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis at pitrou.net>:
> 
> > On Fri, 2 Nov 2012 22:29:09 +0100
> > Sturla Molden <sturla at molden.no> wrote:
> >> 
> >> IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
> >> 
> >> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
> >> 
> >> Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
> > 
> > Hmm, but the basic problem with WaitForMultipleObjects is that it has a
> > hard limit of 64 objects you can wait on.
> > 
> 
> Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.

Well, that's basically O(number of objects), isn't it?

Regards

Antoine.




From sturla at molden.no  Sat Nov  3 00:10:26 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 00:10:26 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121102231417.12407875@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
Message-ID: <49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>


On 2 Nov 2012, at 23:14, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Fri, 2 Nov 2012 22:29:09 +0100
> Sturla Molden <sturla at molden.no> wrote:
>> 
>> IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
>> 
>> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
>> 
>> Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
> 
> Hmm, but the basic problem with WaitForMultipleObjects is that it has a
> hard limit of 64 objects you can wait on.
> 


So you nest them in a tree, each node having up to 64 children... 

The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...

For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
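
A rough sketch of that fan-out in Python (illustration only; wait_group()
stands in for a WaitForMultipleObjects-style call limited to 64 handles):

    import queue
    import threading

    GROUP = 64

    def multiplex_wait(handles, wait_group, timeout_ms):
        """Wait on arbitrarily many handles by fanning out worker threads."""
        results = queue.Queue()

        def worker(offset, group):
            idx = wait_group(group, timeout_ms)   # waits on <= 64 handles
            if idx is not None:
                results.put(offset + idx)         # report back to the master

        for i in range(0, len(handles), GROUP):
            threading.Thread(target=worker,
                             args=(i, handles[i:i + GROUP])).start()
        try:
            return results.get(timeout=timeout_ms / 1000.0)
        except queue.Empty:
            return None       # nothing signalled; workers time out on their own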

Sturla



From grosser.meister.morti at gmx.net  Sat Nov  3 00:47:41 2012
From: grosser.meister.morti at gmx.net (=?ISO-8859-1?Q?Mathias_Panzenb=F6ck?=)
Date: Sat, 03 Nov 2012 00:47:41 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
Message-ID: <50945B9D.8010002@gmx.net>

On 10/31/2012 08:54 AM, Paul Moore wrote:
> On Wednesday, 31 October 2012, Mathias Panzenb?ck wrote:
>
>     Sometimes it would be handy to read data:-urls just like any other url. While it is pretty easy
>     to parse a data: url yourself I think it would be nice if urllib could do this for you.
>
>     Example data url parser:
>
> [...]
>
> IIUC, this should be possible with a custom opener. While it might be nice to have this in the
> stdlib, it would also be a really useful recipe to have in the docs, showing how to create and
> install a simple custom opener into the default set of openers (so that urllib.request gains the
> ability to handle data rules automatically). Would you be willing to submit a doc patch to cover this?
>
> Paul


Ok, I wrote something here:
https://gist.github.com/4004353

I wrote two versions. One just returns an io.BytesIO and the other returns a DataResponse 
(derived from io.BytesIO) that has a few properties/methods like HTTPResponse: msg, headers, length, 
getheader and getheaders, plus an additional mediatype.

I also added two examples: one writes the binary data read to stdout (stdout reopened as "wb") 
and one reads the text data in the defined encoding (requires the version with the 
DataResponse) and writes it to stdout as a string.

Which version do you think is the best for the recipe? I guess losing the mediatype (and thus the 
charset) is not so good, therefore the version with the DataResponse is better? Maybe with a note 
that if you don't need the mediatype you can simply return an io.BytesIO as well? How does one 
submit a doc patch anyway? Is there a hg repo for the documentation and a web interface through 
which one can submit a pull request?

Note:
Handling of buggy data URLs is buggy. E.g. missing padding characters at the end of the URL raise an 
exception. Browsers like Firefox and Chrome correct the padding (Chrome only if the padding is 
completely missing, Firefox corrects/ignores any garbage at the end). I could correct the padding as 
well, but I'd rather not perform such magic.

RFC 4648[1] (Base64 Data Encoding) states that specifications referring to it have to explicitly 
state if there are characters that can be ignored or if the padding is not required. RFC 2397[2] 
(data URL scheme) does not state any such thing, but it doesn't specifically refer to RFC 4648 
either (as it was written before RFC 4648). Chrome and Firefox ignore any kind of whitespace in 
data URLs. I think that is a good idea, because it lets you wrap long data URLs in image tags. 
binascii.a2b_base64 ignores whitespace anyway, so I don't have to do anything there.

Firefox and Chrome both allow %-encoding of base64 characters like "/", "+" and "=". That this 
should work is not mentioned in the data URL RFC, but I think one can assume as much.

Also note that a minimal base64 data URL is "data:;base64," and not "data:base64," (note the ";"). 
The latter would specify the (illegal) MIME type "base64" and not a base64 encoding. This is handled 
correctly by my example code.
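
For reference, a stripped-down parser along those lines (just an
illustration of the points above, not the code from the gist):

    import base64
    from urllib.parse import unquote_to_bytes

    def parse_data_url(url):
        """Return (mediatype, data) for a data: URL -- minimal illustration."""
        if not url.startswith('data:'):
            raise ValueError('not a data: URL')
        header, _, payload = url[5:].partition(',')
        if header.endswith(';base64'):
            mediatype = header[:-7]
            # b64decode() discards non-alphabet bytes such as whitespace by
            # default, but still raises binascii.Error on bad padding.
            data = base64.b64decode(unquote_to_bytes(payload))
        else:
            mediatype = header
            data = unquote_to_bytes(payload)
        return mediatype or 'text/plain;charset=US-ASCII', data

    print(parse_data_url('data:;base64,SGVsbG8='))   # (..., b'Hello')
    print(parse_data_url('data:,Hello%20World'))     # (..., b'Hello World')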


	-panzi

[1] http://tools.ietf.org/html/rfc4648#section-3
[2] http://tools.ietf.org/html/rfc2397


From sturla at molden.no  Sat Nov  3 00:50:15 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 00:50:15 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121103003036.74621d59@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
Message-ID: <15F02388-6EC1-4413-A42A-800F92804144@molden.no>



Sent from my iPad

On 3 Nov 2012, at 00:30, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Sat, 3 Nov 2012 00:21:43 +0100
> Sturla Molden <sturla at molden.no> wrote:
>> Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis at pitrou.net>:
>> 
>>> On Fri, 2 Nov 2012 22:29:09 +0100
>>> Sturla Molden <sturla at molden.no> wrote:
>>>> 
>>>> IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
>>>> 
>>>> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
>>>> 
>>>> Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
>>> 
>>> Hmm, but the basic problem with WaitForMultipleObjects is that it has a
>>> hard limit of 64 objects you can wait on.
>> 
>> Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
> 
> Well, that's basically O(number of objects), isn't it?
> 

Yes, but nesting would be O(log64 n).










From solipsis at pitrou.net  Sat Nov  3 00:54:00 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 3 Nov 2012 00:54:00 +0100
Subject: [Python-ideas] The async API of the future
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
Message-ID: <20121103005400.6fb1735f@pitrou.net>

On Sat, 3 Nov 2012 00:50:15 +0100
Sturla Molden <sturla at molden.no> wrote:
> >> 
> >> Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
> > 
> > Well, that's basically O(number of objects), isn't it?
> > 
> 
> Yes, but nesting would be O(log64 n).

No, you still have O(n) calls to WaitForMultipleObjects, just arranged
differently.
(in other words, the depth of your tree is O(log n), but its number of
nodes is O(n))

Regards

Antoine.




From greg.ewing at canterbury.ac.nz  Sat Nov  3 00:54:34 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 03 Nov 2012 12:54:34 +1300
Subject: [Python-ideas] The async API of the future
In-Reply-To: <2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
Message-ID: <50945D3A.7000407@canterbury.ac.nz>

Sturla Molden wrote:
> Windows certainly has a function to select among multiple wait objects,
> called WaitForMultipleObjects.
> 
> Then the problem is polling for "ready-to-read" and "ready-to-write". The
> annoying part is that different types of files (disk files, sockets, pipes,
> named pipes, hardware devices) must be polled with different Windows API
> calls

I don't follow. Isn't the point of WaitForMultipleObjects that you
can make a single call that blocks until any kind of object is
ready?

-- 
Greg


From guido at python.org  Sat Nov  3 00:59:49 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 2 Nov 2012 16:59:49 -0700
Subject: [Python-ideas] The async API of the future
In-Reply-To: <2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
Message-ID: <CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>

Working code or it didn't happen. (And it should scale too.)

--Guido van Rossum (sent from Android phone)
On Nov 2, 2012 2:58 PM, "Sturla Molden" <sturla at molden.no> wrote:

> Den 19. okt. 2012 kl. 18:05 skrev Guido van Rossum <guido at python.org>:
>
> > An issue in the design of the I/O loop is the strain between a
> > ready-based and completion-based design. The typical Unix design
> > (whether based on select or any of the poll variants) is usually
> > ready-based; but on Windows, the only way to get high performance is
> > to base it on IOCP, which is completion-based (i.e. you start a
> > specific async operation, like writing N bytes, and the I/O loop tells
> > you when it is done). I would like people to be able to write fast
> > event handling programs on Windows too, and ideally the only change
> > would be the implementation of the I/O loop. But I don't know how
> > tenable that is given the dramatically different style used by IOCP
> > and the need to use native Windows API for all async I/O -- it sounds
> > like we could only do this if the library providing the I/O loop
> > implementation also wrapped all I/O operations, andthat may be a bit
> > much.
>
>
> Not really, no.
>
> IOCP might be the easiest way to get high performance on Windows, but
> certainly not the only.
>
> IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e.
> asynchronous) i/o. There is nothing IOCP can do that cannot be done with a
> pool of threads and non-blocking read or write operations.
>
> Windows certainly has a function to select among multiple wait objects,
> called WaitForMultipleObjects. If open files are associated with event
> objects signalling "ready-to-read" or "ready-to-write", that is the basic
> machinery of an Unix select() function.
>
> Then the problem is polling for "ready-to-read" and "ready-to-write". The
> annoying part is that different types of files (disk files, sockets, pipes,
> named pipes, hardware devices) must be polled with different Windows API
> calls ? but there are non-blocking calls to poll them all. For this reason,
> Cygwin's select function spawn one thread to poll each type of file.
> Threads are very cheap on Windows, and polling loops can use Sleep(0) to
> relese the remainder of their time-slice, so this kind of polling is not
> very expensive. However, if we use a thread-pool for the polling, instead
> of spawing new threads on each call to select, we would be doing more or
> less the same as Windows built-in IOCPs, except we are signalling "ready"
> instead of "finished".
>
> Thus, I think it is possible to get high performance without IOCP. But
> Microsoft has only implemented a select call for sockets. My suggestion
> would be to forget about IOCP and implement select for more than just
> sockets on Windows. The reason for this is that select and IOCP are
> signalling on different side of the I/O operation (ready vs. completed). So
> programs based on select ans IOCP tend to have opposite logics with respect
> to scheduling I/O. And as the general trend today is to develop for Unix
> and then port to Windows (as most programmers find the Windows API
> annoying), I think it would be better to port select (and perhaps poll and
> epoll) to Windows than provide IOCP to Python.
>
>
> Sturla
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>

From sturla at molden.no  Sat Nov  3 01:19:33 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 01:19:33 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <50945D3A.7000407@canterbury.ac.nz>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<50945D3A.7000407@canterbury.ac.nz>
Message-ID: <F8A64520-F8DF-4179-A8BC-1C99C784B9E5@molden.no>


On 3 Nov 2012, at 00:54, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

> 
> I don't follow. Isn't the point of WaitForMultipleObjects that you
> can make a single call that blocks until any kind of object is
> ready?


WaitForMultipleObjects will wait for a "wait object" to be signalled -- i.e. a thread, process, event, mutex, or semaphore handle.

The Unix select() function signals that a file object is ready for read or write. There are different functions to poll file objects for readiness in Windows, depending on their type. That is different from Unix, which treats all files the same.

When WaitForMultipleObjects is used with overlapped i/o and IOCP, the OVERLAPPED struct has an event object that is signalled on completion (the hEvent member). It is not a wait on the file handle itself. WaitForMultipleObjects cannot wait for a file.

Sturla




From shibturn at gmail.com  Sat Nov  3 01:32:00 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Sat, 03 Nov 2012 00:32:00 +0000
Subject: [Python-ideas] The async API of the future
In-Reply-To: <49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
Message-ID: <k71om1$4uq$1@ger.gmane.org>

On 02/11/2012 11:10pm, Sturla Molden wrote:
> So you nest them in a tree, each node having up to 64 children...
>
> The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
>
> For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.

Windows already has RegisterWaitForSingleObject() which basically does 
what you describe:

http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685061%28v=vs.85%29.aspx

--
Richard.



From sturla at molden.no  Sat Nov  3 10:22:46 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 10:22:46 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <k71om1$4uq$1@ger.gmane.org>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
Message-ID: <FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>



On 3 Nov 2012, at 01:32, Richard Oudkerk <shibturn at gmail.com> wrote:

> On 02/11/2012 11:10pm, Sturla Molden wrote:
>> So you nest them in a tree, each node having up to 64 children...
>> 
>> The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
>> 
>> For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
> 
> Windows already has RegisterWaitForSingleObject() which basically does what you describe:
> 
> http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685061%28v=vs.85%29.aspx
> 

No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects.

Sturla





From ncoghlan at gmail.com  Sat Nov  3 10:35:50 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 3 Nov 2012 19:35:50 +1000
Subject: [Python-ideas] The async API of the future
In-Reply-To: <49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
Message-ID: <CADiSq7cq5SJfbiVdWvZ879honm_pbq6OGQ5AJMmFqFtWNW6ZmQ@mail.gmail.com>

On Sat, Nov 3, 2012 at 9:10 AM, Sturla Molden <sturla at molden.no> wrote:
> The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
>
> For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.

Given that the purpose of using async IO is to improve scalability on
a single machine by a factor of 100 or more beyond what is typically
possible with threads or processes, hard capping the scaling
improvement on Windows at 64x the thread limit by relying on
WaitForMultipleObjects seems to be rather missing the point.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From sturla at molden.no  Sat Nov  3 10:37:41 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 10:37:41 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121103005400.6fb1735f@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
Message-ID: <3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>

On 3 Nov 2012, at 00:54, Antoine Pitrou <solipsis at pitrou.net> wrote:

>>>> Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
>>> 
>>> Well, that's basically O(number of objects), isn't it?
>> 
>> Yes, but nesting would be O(log64 n).
> 
> No, you still have O(n) calls to WaitForMultipleObjects, just arranged
> differently.
> (in other words, the depth of your tree is O(log n), but its number of
> nodes is O(n))
> 
> 

True, but is the time latency O(n) or O(log n)? 

Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) latency wait, O(n) polling) I think. 

Another interesting strategy for high-performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex.
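
For illustration, the thread-per-client model is trivial to express with
the stdlib -- a sketch, not a benchmark of anything discussed here:

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # Plain blocking I/O; the server dedicates one thread per client.
            for line in self.rfile:
                self.wfile.write(line)

    server = socketserver.ThreadingTCPServer(('', 8000), EchoHandler)
    server.serve_forever()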

But would it scale with Python threads and the GIL as well? You would be better placed to answer that.

Sturla












From sturla at molden.no  Sat Nov  3 10:44:59 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 10:44:59 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CADiSq7cq5SJfbiVdWvZ879honm_pbq6OGQ5AJMmFqFtWNW6ZmQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<CADiSq7cq5SJfbiVdWvZ879honm_pbq6OGQ5AJMmFqFtWNW6ZmQ@mail.gmail.com>
Message-ID: <B6476551-E0A0-42E9-B0AA-F82772AD200B@molden.no>

On 3 Nov 2012, at 10:35, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On Sat, Nov 3, 2012 at 9:10 AM, Sturla Molden <sturla at molden.no> wrote:
>> The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
>> 
>> For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
> 
> Given that the purposes of using async IO is to improve scalability on
> a single machine by a factor of 100 or more beyond what is typically
> possible with threads or processes, hard capping the scaling
> improvement on Windows at 64x the thread limit by relying on
> WaitForMultipleObjects seems to be rather missing the point.
> 

The only thread limitation on Windows 64 is the amount of RAM. 

IOCPs are also thread-based (they are actually user-space thread pools).


Sturla

From solipsis at pitrou.net  Sat Nov  3 11:14:18 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 3 Nov 2012 11:14:18 +0100
Subject: [Python-ideas] The async API of the future
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
Message-ID: <20121103111418.1dae7525@pitrou.net>

On Sat, 3 Nov 2012 10:37:41 +0100
Sturla Molden <sturla at molden.no> wrote:
> Den 3. nov. 2012 kl. 00:54 skrev Antoine Pitrou <solipsis at pitrou.net>:
> 
> >>>> Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
> >>> 
> >>> Well, that's basically O(number of objects), isn't it?
> >> 
> >> Yes, but nesting would be O(log64 n).
> > 
> > No, you still have O(n) calls to WaitForMultipleObjects, just arranged
> > differently.
> > (in other words, the depth of your tree is O(log n), but its number of
> > nodes is O(n))
> > 
> > 
> 
> True, but is the time latency O(n) or O(log n)? 

Right, that's the difference. However, I think here we are concerned
about CPU load on the server, not individual latency (as long as it is
acceptable, e.g. lower than 5 ms).

> Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) katency wait, O(n) polling) I think. 

epoll and kqueue are better than O(number of objects) though.

> Another interesting strategy for high-performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex.

That's still a huge waste of RAM, isn't it?
Also, by relying on preemptive threading you have to use Python's
synchronization primitives (locks, etc.), and I don't know how these
would scale.

> But would it scale with Python threads and the GIL as well? You would be better to answer that.

I haven't done any tests with a large number of threads, but the GIL
certainly has a (per-thread as well as per-context switch) overhead.

Regards

Antoine.




From p.f.moore at gmail.com  Sat Nov  3 11:40:30 2012
From: p.f.moore at gmail.com (Paul Moore)
Date: Sat, 3 Nov 2012 10:40:30 +0000
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <50945B9D.8010002@gmx.net>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
Message-ID: <CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>

On 2 November 2012 23:47, Mathias Panzenböck
<grosser.meister.morti at gmx.net> wrote:
> Which version do you think is the best for the recipe? I guess losing the
> mediatype (and thus the charset) is not so good, therefore the version with
> the DataResponse is better? Maybe with a note that if you don't need the
> mediatype you can simply return an io.BytesIO as well? How does one submit a
> doc patch anyway? Is there a hg repo for the documentation and a web
> interface through which one can submit a pull request?

You should probably be consistent with urllib's behaviour for other
URLs - from the documentation of urlopen:

"""
This function returns a file-like object that works as a context
manager, with two additional methods from the urllib.response module

geturl() -- return the URL of the resource retrieved, commonly used to
determine if a redirect was followed
info() -- return the meta-information of the page, such as headers, in
the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)
Raises URLError on errors.
"""

To create a doc patch, open a feature request on bugs.python.org and
attach a patch. The documentation is in the core Python repository,
from hg.python.org. You can clone that and use Mercurial to generate a
patch, but there's no "pull request" mechanism that I know of.

Paul


From shibturn at gmail.com  Sat Nov  3 12:20:54 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Sat, 03 Nov 2012 11:20:54 +0000
Subject: [Python-ideas] The async API of the future
In-Reply-To: <FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
Message-ID: <k72umn$qc0$1@ger.gmane.org>

On 03/11/2012 9:22am, Sturla Molden wrote:
> No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects.

By using an appropriate callback you can easily implement something like 
WaitForMultipleObjects() which does not have the 64 handle limit 
(without having to explicitly start any threads).

More usefully, if the callback posts a message to an IOCP, it lets 
you use the IOCP to wait on non-IO things.

--
Richard



From sturla at molden.no  Sat Nov  3 12:47:53 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 12:47:53 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121103111418.1dae7525@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
	<20121103111418.1dae7525@pitrou.net>
Message-ID: <8136D88B-5345-4260-BD03-1D286C799938@molden.no>

On 3 Nov 2012, at 11:14, Antoine Pitrou <solipsis at pitrou.net> wrote:

>> 
>> True, but is the time latency O(n) or O(log n)? 
> 
> Right, that's the difference. However, I think here we are concerned
> about CPU load on the server, not individual latency (as long as it is
> acceptable, e.g. lower than 5 ms).

Ok, I can do som tests on Windows :-)

> 
>> Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) katency wait, O(n) polling) I think. 
> 
> epoll and kqueue are better than O(number of objects) though.

I know, they claim the wait to be about O(1). I guess that magic happens in the kernel. 

With IOCP on Windows there is a thread-pool that continuously polls the i/o tasks for completion. So I think IOCPs might approach O(n) at some point. 

I assume as long as we are staying in user-space, there will always be an O(n) overhead somewhere. To avoid it one would need the kernel to trigger callbacks from hardware interrupts, which presumably is what epoll and kqueue do. But at least on Windows, anything except "one-thread-per-client" involves O(n) polling by user-space threads (even IOCPs and RegisterWaitForSingleObject do that). The kernel only schedules threads; it does not trigger i/o callbacks from hardware.

But who in their right mind uses Windows for this kind of server anyway?


> 
>> Another interesting strategy for high-performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex.
> 
> That's still a huge waste of RAM, isn't it?

That depends on perspective :-) If threads are a simpler design pattern than IOCPs, the latter is a huge waste of work hours. Which is cheaper today? I think to some extent IOCPs solve a problem related to 32-bit address spaces or limited RAM. But if RAM is cheaper than programming effort, just go ahead and waste as much as you need :-)

Also, those who need this kind of server can certainly afford to buy enough RAM.



> Also, by relying on preemptive threading you have to use Python's
> synchronization primitives (locks, etc.), and I don't know how these
> would scale.
> 
>> But would it scale with Python threads and the GIL as well? You would be better to answer that.
> 
> I haven't done any tests with a large number of threads, but the GIL
> certainly has a (per-thread as well as per-context switch) overhead.
> 

That is the thing, plain Windows threads and Python threads in huge numbers might not behave similarly. It would be interesting to test.


Sturla

From itamar at futurefoundries.com  Sat Nov  3 13:02:09 2012
From: itamar at futurefoundries.com (Itamar Turner-Trauring)
Date: Sat, 3 Nov 2012 08:02:09 -0400
Subject: [Python-ideas] The async API of the future
In-Reply-To: <2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
Message-ID: <CAOp9P3rqJHC2_5BT0RZorRdGfuf-jRW1nFRBB6wCHnMiZs=Uyw@mail.gmail.com>

On Fri, Nov 2, 2012 at 5:29 PM, Sturla Molden <sturla at molden.no> wrote:

> Thus, I think it is possible to get high performance without IOCP. But
> Microsoft has only implemented a select call for sockets. My suggestion
> would be to forget about IOCP and implement select for more than just
> sockets on Windows. The reason for this is that select and IOCP are
> signalling on different side of the I/O operation (ready vs. completed). So
> programs based on select ans IOCP tend to have opposite logics with respect
> to scheduling I/O. And as the general trend today is to develop for Unix
> and then port to Windows (as most programmers find the Windows API
> annoying), I think it would be better to port select (and perhaps poll and
> epoll) to Windows than provide IOCP to Python.


Twisted supports both select()-style loops and IOCP, in a way that is
transparent to user code. The key is presenting an async API to users
(e.g. Protocol.dataReceived gets called with bytes), rather than e.g.
trying to pretend they're talking to a socket-like object you can call
recv() on.
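
A minimal Twisted protocol, to make the shape of that API concrete (a
sketch, not code from Twisted itself):

    from twisted.internet import protocol, reactor

    class Echo(protocol.Protocol):
        def dataReceived(self, data):
            # Called with bytes whenever the transport has data for us,
            # whichever reactor (select, epoll, IOCP, ...) is driving it.
            self.transport.write(data)

    factory = protocol.ServerFactory()
    factory.protocol = Echo
    reactor.listenTCP(8000, factory)
    reactor.run()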

From itamar at futurefoundries.com  Sat Nov  3 13:18:01 2012
From: itamar at futurefoundries.com (Itamar Turner-Trauring)
Date: Sat, 3 Nov 2012 08:18:01 -0400
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAOp9P3rqJHC2_5BT0RZorRdGfuf-jRW1nFRBB6wCHnMiZs=Uyw@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAOp9P3rqJHC2_5BT0RZorRdGfuf-jRW1nFRBB6wCHnMiZs=Uyw@mail.gmail.com>
Message-ID: <CAOp9P3qka9qxWeyn7wDLgoWF7MibA9P35xMzvknYQ25BNi0u+w@mail.gmail.com>

On Sat, Nov 3, 2012 at 8:02 AM, Itamar Turner-Trauring <
itamar at futurefoundries.com> wrote:

>
> Twisted supports both select()-style loops and IOCP, in a way that is
> transparent to user code. They key is presenting an async API to users
> (e.g. Protocol.dataReceived gets called with bytes), rather than e.g.
> trying to pretend they're talking to a socket-like object you can call
> recv() on.
>

Although, if you're using a yield-based API (or coroutines) you can have a
recv()/read()-style API with IOCP as well.

From sturla at molden.no  Sat Nov  3 13:35:38 2012
From: sturla at molden.no (Sturla Molden)
Date: Sat, 3 Nov 2012 13:35:38 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <k72umn$qc0$1@ger.gmane.org>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
Message-ID: <6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>

On 3 Nov 2012, at 12:20, Richard Oudkerk <shibturn at gmail.com> wrote:

> On 03/11/2012 9:22am, Sturla Molden wrote:
>> No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects.
> 
> By using an appropriate callback you easily implement something like WaitForMultipleObjects() which does not have the 64 handle limit (without having to explicitly start any threads).
> 

But it uses a thread-pool that polls the registered wait objects, so the overhead (with respect to latency) will still be O(n). It does not matter if you ask Windows to allocate a thread-pool for the polling or if you do the polling yourself. It is still user-space threads that poll N objects with O(n) complexity. But if you nest WaitForMultipleObjects, you can get the latency down to O(log n). 

IOCP is just an abstraction for a thread-pool and a FIFO. If you want to use a thread-pool and a FIFO to wait for something other than I/O there are easier ways. For example, you can use the queue functions in NT6 and enqueue whatever APC you want -- or just use a list of threads and a queue in Python.

Sturla




From solipsis at pitrou.net  Sat Nov  3 18:22:55 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 3 Nov 2012 18:22:55 +0100
Subject: [Python-ideas] The async API of the future
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
	<20121103111418.1dae7525@pitrou.net>
	<8136D88B-5345-4260-BD03-1D286C799938@molden.no>
Message-ID: <20121103182255.70ea9c5a@pitrou.net>

On Sat, 3 Nov 2012 12:47:53 +0100
Sturla Molden <sturla at molden.no> wrote:
> > 
> >> Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) katency wait, O(n) polling) I think. 
> > 
> > epoll and kqueue are better than O(number of objects) though.
> 
> I know, they claim the wait to be about O(1). I guess that magic happens in the
> kernel.

They are not O(1), they are O(number of ready objects).

> With IOCP on Windows there is a thread-pool that continuously polls the i/o tasks
> for completion. So I think IOCPs might approach O(n) at some point.

Well, I don't know about the IOCP implementation, but "continuously
polling the I/O tasks" sounds like a costly way to do it (what system
call would that use?). If the kernel cooperates, no continuous polling
should be required.

> That is depending on perspective :-)If threads are a simpler design pattern
> than IOCPs, the latter is a huge waste of work hours.

Er, the whole point of this discussion is to design a library so that
the developer does *not* have to deal with IOCPs.
As for "simpler design pattern", I think it's mostly a matter of habit.
Writing a network daemon with Twisted is not difficult. And making
multi-threaded code scale properly might not be trivial, depending on
the problem.

Regards

Antoine.




From shibturn at gmail.com  Sat Nov  3 22:20:18 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Sat, 03 Nov 2012 21:20:18 +0000
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
Message-ID: <k741qk$hat$1@ger.gmane.org>

On 02/11/2012 11:59pm, Guido van Rossum wrote:
> Working code or it didn't happen. (And it should scale too.)

I have some (mostly) working code which replaces tulip's "pollster" 
classes with "proactor" classes for select(), poll(), epoll() and IOCP.  See

 
https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee7795adb85

The IOCP proactor does not support ssl (or ipv6) so main.py does not 
succeed in downloading from xkcd.com using ssl.  Using the other 
proactors it works correctly.

The basic interface for the proactor looks like

     class Proactor:
         def recv(self, sock, n): ...
         def send(self, sock, buf): ...
         def connect(self, sock, address): ...
         def accept(self, sock): ...

         def poll(self, timeout=None): ...
         def pollable(self): ...

recv(), send(), connect() and accept() initiate io operations and return 
futures.  poll() returns a list of ready futures.  pollable() returns 
true if there are any outstanding operations registered with the 
proactor.  You use a pattern like

     f = proactor.recv(sock, 100)
     if not f.done():
         yield from scheduling.block_future(f)
     res = f.result()

--
Richard



From guido at python.org  Sat Nov  3 23:06:26 2012
From: guido at python.org (Guido van Rossum)
Date: Sat, 3 Nov 2012 15:06:26 -0700
Subject: [Python-ideas] The async API of the future
In-Reply-To: <k741qk$hat$1@ger.gmane.org>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
	<k741qk$hat$1@ger.gmane.org>
Message-ID: <CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>

This is awesome! I have to make time to understand in more detail how
it works and what needs to change in the platform-independent API -- I
want to get to the point where the *only* thing you change is the
pollster/proactor (both kind of lame terms :-). I am guessing that the
socket operations (or the factory for the transport class) need to be
made part of the pollster; the Twisted folks are telling me the same
thing.

FWIW, I've been studying other event loops. It's interesting to see
the similarities (and differences) between e.g. the tulip eventloop,
pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two
look very similar, except that 0MQ makes the poller pluggable, but
generally there are lots of similarities between the structure of all
four. Twisted, as usual, stands apart. :-)

--Guido

On Sat, Nov 3, 2012 at 2:20 PM, Richard Oudkerk <shibturn at gmail.com> wrote:
> On 02/11/2012 11:59pm, Guido van Rossum wrote:
>>
>> Working code or it didn't happen. (And it should scale too.)
>
>
> I have some (mostly) working code which replaces tulip's "pollster" classes
> with "proactor" classes for select(), poll(), epoll() and IOCP.  See
>
>
> https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee7795adb85
>
> The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed
> in downloading from xkcd.com using ssl.  Using the other proactors it works
> correctly.
>
> The basic interface for the proactor looks like
>
>     class Proactor:
>         def recv(self, sock, n): ...
>         def send(self, sock, buf): ...
>         def connect(self, sock, address): ...
>         def accept(self, sock): ...
>
>         def poll(self, timeout=None): ...
>         def pollable(self): ...
>
> recv(), send(), connect() and accept() initiate io operations and return
> futures.  poll() returns a list of ready futures.  pollable() returns true
> if there are any outstanding operations registered with the proactor.  You
> use a pattern like
>
>     f = proactor.recv(sock, 100)
>     if not f.done():
>         yield from scheduling.block_future(f)
>     res = f.result()
>
> --
> Richard
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas



-- 
--Guido van Rossum (python.org/~guido)


From solipsis at pitrou.net  Sat Nov  3 23:39:26 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 3 Nov 2012 23:39:26 +0100
Subject: [Python-ideas] SSL and IOCP
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
	<k741qk$hat$1@ger.gmane.org>
Message-ID: <20121103233926.5a7d5c45@pitrou.net>

On Sat, 03 Nov 2012 21:20:18 +0000
Richard Oudkerk <shibturn at gmail.com>
wrote:
> On 02/11/2012 11:59pm, Guido van Rossum wrote:
> > Working code or it didn't happen. (And it should scale too.)
> 
> I have some (mostly) working code which replaces tulip's "pollster" 
> classes with "proactor" classes for select(), poll(), epoll() and IOCP.  See
> 
>  
> https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee7795adb85
> 
> The IOCP proactor does not support ssl (or ipv6) so main.py does not 
> succeed in downloading from xkcd.com using ssl.  Using the other 
> proactors it works correctly.

It wouldn't be crazy to add an in-memory counterpart to SSLSocket in
Python 3.4 (*). It could re-use the same underlying _ssl._SSLSocket, but
initialized with a "memory BIO" in OpenSSL jargon. PyOpenSSL already
has something similar, which is used in Twisted.

(an in-memory SSL object probably only makes sense in non-blocking mode)

(*) patches welcome :-)
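
The pyOpenSSL flavour of this looks roughly as follows (the method names
are pyOpenSSL's; the surrounding flow is only a sketch of how memory-BIO
mode is driven):

    from OpenSSL import SSL

    ctx = SSL.Context(SSL.TLSv1_METHOD)
    conn = SSL.Connection(ctx, None)     # no socket -> memory BIO mode
    conn.set_connect_state()

    try:
        conn.do_handshake()
    except SSL.WantReadError:
        pass                             # needs more data from the peer

    out = conn.bio_read(4096)            # TLS bytes to send over the wire
    # The event loop ships `out` to the peer, feeds whatever comes back in
    # with conn.bio_write(...), and retries until the handshake completes;
    # application data then flows through send()/recv() plus the BIO calls.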

Regards

Antoine.




From greg.ewing at canterbury.ac.nz  Sun Nov  4 00:49:41 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 04 Nov 2012 12:49:41 +1300
Subject: [Python-ideas] The async API of the future
In-Reply-To: <6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
Message-ID: <5095AD95.4010509@canterbury.ac.nz>

Sturla Molden wrote:
> But it uses a thread-pool that polls the registered wait objects, so the 
> overhead (with respect to latency) will still be O(n).

I'm not sure exactly what you mean by "polling" here. I'm
pretty sure that *none* of the mechanisms we're talking about
here (select, poll, kqueue, IOCP, WaitForMultipleWhatever, etc)
indulge in busy-waiting while looping over the relevant handles.
They all ultimately make use of hardware interrupts to wake up
a thread when something interesting happens.

The scaling issue, as I understand it, is that select() and
WaitForMultipleObjects() require you to pass in the entire list
of fds or handles on every call, so that there is an O(n) setup
cost every time you wait.

A more scaling-friendly API would let you pre-register the set
of interesting objects, so that the actual waiting call is
O(1). I believe this is the reason things like epoll, kqueue
and IOCP are considered more scalable.
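
A small Linux-flavoured illustration of that difference (select() re-states
its interest set on every call, while epoll keeps registrations in the
kernel; the socketpairs are just stand-ins for client connections):

    import select
    import socket

    pairs = [socket.socketpair() for _ in range(2)]

    # With select(), the whole interest list is passed on every call:
    #   ready, _, _ = select.select([a for a, b in pairs], [], [], 0)

    # With epoll, fds are registered once up front...
    ep = select.epoll()
    for a, b in pairs:
        ep.register(a.fileno(), select.EPOLLIN)

    pairs[1][1].send(b'ping')        # make one fd readable

    # ...and each wait only reports the fds that are actually ready.
    print(ep.poll(0))                # [(fd of pairs[1][0], select.EPOLLIN)]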

-- 
Greg


From grosser.meister.morti at gmx.net  Sun Nov  4 02:54:10 2012
From: grosser.meister.morti at gmx.net (=?windows-1252?Q?Mathias_Panzenb=F6ck?=)
Date: Sun, 04 Nov 2012 02:54:10 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
Message-ID: <5095CAC2.6010309@gmx.net>

On 11/03/2012 11:40 AM, Paul Moore wrote:
> On 2 November 2012 23:47, Mathias Panzenb?ck
> <grosser.meister.morti at gmx.net> wrote:
>> Which version do you think is the best for the recipe? I guess losing the
>> mediatype (and thus the charset) is not so good, therefore the version with
>> the DataResponse is better? Maybe with a note that if you don't need the
>> mediatype you can simply return an io.BytesIO as well? How does one submit a
>> doc patch anyway? Is there a hg repo for the documentation and a web
>> interface through which one can submit a pull request?
>
> You should probably be consistent with urllib's behaviour for other
> URLs - from the documentation of urlopen:
>
> """
> This function returns a file-like object that works as a context
> manager, with two additional methods from the urllib.response module
>
> geturl() ? return the URL of the resource retrieved, commonly used to
> determine if a redirect was followed
> info() ? return the meta-information of the page, such as headers, in
> the form of an email.message_from_string() instance (see Quick
> Reference to HTTP Headers)
> Raises URLError on errors.
> """
>

Ok, I added the two methods. Now there are 3 ways to get the headers:
req.headers, req.msg, req.info()

Shouldn't there be *one* obvious way to do this? req.headers?

> To create a doc patch, open a feature request on bugs.python.org and
> attach a patch. The documentation is in the core Python repository,
> from hg.python.org. You can clone that and use Mercurial to generate a
> patch, but there's no "pull request" mechanism that I know of.
>
> Paul
>



From p.f.moore at gmail.com  Sun Nov  4 09:28:40 2012
From: p.f.moore at gmail.com (Paul Moore)
Date: Sun, 4 Nov 2012 08:28:40 +0000
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <5095CAC2.6010309@gmx.net>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
Message-ID: <CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>

On Sunday, 4 November 2012, Mathias Panzenböck wrote:
>
>
> Shouldn't there be *one* obvious way to do this? req.headers
>

Well, I'd say that the stdlib docs imply that req.info is the required way
so that's the "one obvious way". If you want to add extra methods for
convenience, fair enough, but code that doesn't already know it is handling
a data URL can't use them so I don't see the point, personally.

But others may have different views...

Paul

From paul at colomiets.name  Sun Nov  4 12:58:05 2012
From: paul at colomiets.name (Paul Colomiets)
Date: Sun, 4 Nov 2012 13:58:05 +0200
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
	<k741qk$hat$1@ger.gmane.org>
	<CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>
Message-ID: <CAA0gF6pvMJDhW312zhY0x8QsA-dj64E9DrDBHjdBRPVHqE3M+w@mail.gmail.com>

On Sun, Nov 4, 2012 at 12:06 AM, Guido van Rossum <guido at python.org> wrote:
> FWIW, I've been studying other event loops. It's interesting to see
> the similarities (and differences) between e.g. the tulip eventloop,
> pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two
> look very similar, except that 0MQ makes the poller pluggable, but
> generally there are lots of similarities between the structure of all
> four. Twisted, as usual, stands apart. :-)
>

AFAIK, Twisted is the only framework of those listed that supports
IOCP. This is probably why it's so different.

-- 
Paul


From barry at python.org  Sun Nov  4 15:32:39 2012
From: barry at python.org (Barry Warsaw)
Date: Sun, 4 Nov 2012 09:32:39 -0500
Subject: [Python-ideas] with-statement syntactic quirk
References: <20121031113853.66fb0514@resist>
	<CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
	<CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>
	<CADiSq7fPHSiO=kDMb=GiacJtO6tO4455tSj_+eOMLQqa3aX5Kg@mail.gmail.com>
Message-ID: <20121104093239.37e777b8@resist.wooz.org>

On Nov 02, 2012, at 08:41 AM, Nick Coghlan wrote:

>> Go's "defer" is quite a neat solution for these hassles if anyone's in
>the mood for a time machine discussion.
>>
>> http://golang.org/doc/effective_go.html#defer
>
>Go was one of the reference points for the ExitStack design (it's a large
>part of why the API also supports providing callbacks directly to the exit
>stack, not just as context managers).

Is it fair to say that the difference between Go's defer and ExitStack is that
the latter gives you the opportunity to clean up earlier than at function
exit?

Cheers,
-Barry

From guido at python.org  Sun Nov  4 16:26:23 2012
From: guido at python.org (Guido van Rossum)
Date: Sun, 4 Nov 2012 07:26:23 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <5095AD95.4010509@canterbury.ac.nz>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
Message-ID: <CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>

On Sat, Nov 3, 2012 at 4:49 PM, Greg Ewing <greg.ewing at canterbury.ac.nz>
wrote:
> Sturla Molden wrote:
>>
>> But it uses a thread-pool that polls the registered wait objects, so the
>> overhead (with respect to latency) will still be O(n).
>
>
> I'm not sure exactly what you mean by "polling" here. I'm
> pretty sure that *none* of the mechanisms we're talking about
> here (select, poll, kqueue, IOCP, WaitForMultipleWhatever, etc)
> indulge in busy-waiting while looping over the relevant handles.
> They all ultimately make use of hardware interrupts to wake up
> a thread when something interesting happens.
>
> The scaling issue, as I understand it, is that select() and
> WaitForMultipleObjects() require you to pass in the entire list
> of fds or handles on every call, so that there is an O(n) setup
> cost every time you wait.
>
> A more scaling-friendly API would let you pre-register the set
> of interesting objects, so that the actual waiting call is
> O(1). I believe this is the reason things like epoll, kqueue
> and IOCP are considered more scalable.

I've been thinking about this too. I can see the scalability issues with
select(),  but frankly, poll(), epoll(), and even kqueue() all look similar
in O() behavior to me from an API perspective. I guess the differences are
in the kernel -- but is it a constant factor or an unfortunate O(N) or
worse? To what extent would this be overwhelmed by overhead in the Python
code we're writing around it? How bad is it to add extra
register()/unregister() (or modify()) calls per read operation?
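
Concretely, the pattern in question is something like this (a sketch,
using Linux epoll and a pipe just to make it runnable):

    import os
    import select

    r, w = os.pipe()
    os.write(w, b"x" * 32)

    ep = select.epoll()
    ep.register(r, select.EPOLLIN)   # extra syscall before the read...
    ep.poll()                        # wait until readable
    data = os.read(r, 32)
    ep.unregister(r)                 # ...and another one after it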


--
--Guido van Rossum (python.org/~guido)



From ncoghlan at gmail.com  Sun Nov  4 16:41:35 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 5 Nov 2012 01:41:35 +1000
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <20121104093239.37e777b8@resist.wooz.org>
References: <20121031113853.66fb0514@resist>
	<CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
	<CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>
	<CADiSq7fPHSiO=kDMb=GiacJtO6tO4455tSj_+eOMLQqa3aX5Kg@mail.gmail.com>
	<20121104093239.37e777b8@resist.wooz.org>
Message-ID: <CADiSq7dZo65HFwMgSiYUfUYpXP4PRqyfjTbmrT5tRXAz1PKF6Q@mail.gmail.com>

On Mon, Nov 5, 2012 at 12:32 AM, Barry Warsaw <barry at python.org> wrote:
> On Nov 02, 2012, at 08:41 AM, Nick Coghlan wrote:
>
>>> Go's "defer" is quite a neat solution for these hassles if anyone's in
>>the mood for a time machine discussion.
>>>
>>> http://golang.org/doc/effective_go.html#defer
>>
>>Go was one of the reference points for the ExitStack design (it's a large
>>part of why the API also supports providing callbacks directly to the exit
>>stack, not just as context managers).
>
> Is it fair to say that the difference between Go's defer and ExitStack is that
> the latter gives you the opportunity to clean up earlier than at function
> exit?

Yep. You can also do some pretty interesting things with ExitStack
because of the pop_all() operation (which moves all of the registered
operations to a *new* ExitStack instance).

I wrote up a few of the motivating use cases as examples and recipes
in the 3.3 docs:
http://docs.python.org/3/library/contextlib#examples-and-recipes

I hope to see more interesting uses over time as more people explore
the possibilities of a dynamic tool for composing context managers
without needing to worry about the messy details of unwinding them
correctly (ExitStack.__exit__ is by far the most complicated aspect of
the implementation).
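
As a sketch of what pop_all() makes possible (close everything if any
step fails, hand the cleanup over to the caller if they all succeed):

    from contextlib import ExitStack

    def open_all(*filenames):
        with ExitStack() as stack:
            files = [stack.enter_context(open(name)) for name in filenames]
            # If any open() above failed, the stack has already closed
            # the earlier files.  On success, move the registered closes
            # to a new stack that the caller now owns.
            close_all = stack.pop_all().close
            return files, close_all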

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From ben at bendarnell.com  Sun Nov  4 17:00:33 2012
From: ben at bendarnell.com (Ben Darnell)
Date: Sun, 4 Nov 2012 08:00:33 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
	<k741qk$hat$1@ger.gmane.org>
	<CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>
Message-ID: <CAFkYKJ7rfx9h3_d8-6KTcaMrs5OBU9RjPZ05y2CRC3WLpj8gEg@mail.gmail.com>

On Sat, Nov 3, 2012 at 3:06 PM, Guido van Rossum <guido at python.org> wrote:

> FWIW, I've been studying other event loops. It's interesting to see
> the similarities (and differences) between e.g. the tulip eventloop,
> pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two
> look very similar, except that 0MQ makes the poller pluggable, but
> generally there are lots of similarities between the structure of all
> four. Twisted, as usual, stands apart. :-)
>

Pyzmq's IOLoop is actually a fork/monkey-patch of Tornado's, and they have
the same pluggable-poller implementation (In the master branch of Tornado
it's been moved to the PollIOLoop subclass).

-Ben


>
> --Guido
>
> On Sat, Nov 3, 2012 at 2:20 PM, Richard Oudkerk <shibturn at gmail.com>
> wrote:
> > On 02/11/2012 11:59pm, Guido van Rossum wrote:
> >>
> >> Working code or it didn't happen. (And it should scale too.)
> >
> >
> > I have some (mostly) working code which replaces tulip's "pollster"
> classes
> > with "proactor" classes for select(), poll(), epoll() and IOCP.  See
> >
> >
> >
> https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee7795adb85
> >
> > The IOCP proactor does not support ssl (or ipv6) so main.py does not
> succeed
> > in downloading from xkcd.com using ssl.  Using the other proactors it
> works
> > correctly.
> >
> > The basic interface for the proactor looks like
> >
> >     class Proactor:
> >         def recv(self, sock, n): ...
> >         def send(self, sock, buf): ...
> >         def connect(self, sock, address): ...
> >         def accept(self, sock): ...
> >
> >         def poll(self, timeout=None): ...
> >         def pollable(self): ...
> >
> > recv(), send(), connect() and accept() initiate io operations and return
> > futures.  poll() returns a list of ready futures.  pollable() returns
> true
> > if there are any outstanding operations registered with the proactor.
>  You
> > use a pattern like
> >
> >     f = proactor.recv(sock, 100)
> >     if not f.done():
> >         yield from scheduling.block_future(f)
> >     res = f.result()
> >
> > --
> > Richard
> >
> >
> > _______________________________________________
> > Python-ideas mailing list
> > Python-ideas at python.org
> > http://mail.python.org/mailman/listinfo/python-ideas
>
>
>
> --
> --Guido van Rossum (python.org/~guido)
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>

From guido at python.org  Sun Nov  4 17:10:42 2012
From: guido at python.org (Guido van Rossum)
Date: Sun, 4 Nov 2012 08:10:42 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAFkYKJ7rfx9h3_d8-6KTcaMrs5OBU9RjPZ05y2CRC3WLpj8gEg@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<CAP7+vJKXgmTXA7JnHw0=uGst5P=mxv3HhFMxh71GDGOn4ZFQDQ@mail.gmail.com>
	<k741qk$hat$1@ger.gmane.org>
	<CAP7+vJLs3tFrr7JYrXx_3br6i8D--rLU8PZH-uPOLaCOKL_WOQ@mail.gmail.com>
	<CAFkYKJ7rfx9h3_d8-6KTcaMrs5OBU9RjPZ05y2CRC3WLpj8gEg@mail.gmail.com>
Message-ID: <CAP7+vJ+QFOiM_EHY6k-J_EvYAb2-on+OXHwRsFdXB=3SXX774g@mail.gmail.com>

On Sun, Nov 4, 2012 at 8:00 AM, Ben Darnell <ben at bendarnell.com> wrote:
> On Sat, Nov 3, 2012 at 3:06 PM, Guido van Rossum <guido at python.org> wrote:
>> FWIW, I've been studying other event loops. It's interesting to see
>> the similarities (and differences) between e.g. the tulip eventloop,
>> pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two
>> look very similar, except that 0MQ makes the poller pluggable, but
>> generally there are lots of similarities between the structure of all
>> four. Twisted, as usual, stands apart. :-)

> Pyzmq's IOLoop is actually a fork/monkey-patch of Tornado's, and they have
> the same pluggable-poller implementation (In the master branch of Tornado
> it's been moved to the PollIOLoop subclass).

I was beginning to suspect as much. :-)

Have you had the time to look at tulip's eventloop? I'd love your feedback:
http://code.google.com/p/tulip/source/browse/polling.py

Also, Richard has a modified version that supports IOCP, which changes
the APIs around quite a bit. (Does Tornado try anything with IOCP?
Does it even support Windows?) Any thoughts on this vs. my version?
https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee7795adb85

-- 
--Guido van Rossum (python.org/~guido)


From ben at bendarnell.com  Sun Nov  4 17:11:24 2012
From: ben at bendarnell.com (Ben Darnell)
Date: Sun, 4 Nov 2012 08:11:24 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
Message-ID: <CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>

On Sun, Nov 4, 2012 at 7:26 AM, Guido van Rossum <guido at python.org> wrote:

> I've been thinking about this too. I can see the scalability issues with
> select(),  but frankly, poll(), epoll(), and even kqueue() all look similar
> in O() behavior to me from an API perspective. I guess the differences are
> in the kernel -- but is it a constant factor or an unfortunate O(N) or
> worse? To what extent would this be overwhelmed by overhead in the Python
> code we're writing around it? How bad is it to add extra
> register()/unregister() (or modify()) calls per read operation?
>


The extra system calls add up.  The interface of Tornado's IOLoop was based
on epoll (where the internal state is roughly a mapping {fd: event_set}),
so it requires more register/unregister operations when running on kqueue
(where the internal state is roughly a set of (fd, event) pairs).  This
shows up in benchmarks of the HTTPServer; it's faster on platforms with
epoll than platforms with kqueue.  In low-concurrency scenarios it's
actually faster to use select() even when kqueue is available (or maybe
that's a mac-specific quirk).
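
Roughly what the mismatch looks like in code (a sketch only; it assumes
select.kqueue is available, and real code would track the previous mask
to avoid deleting filters that were never added):

    import select

    def set_mask(kq, fd, read, write):
        # One epoll-style modify(fd, mask) becomes one kevent change per
        # (fd, filter) pair; an epoll-shaped IOLoop that doesn't batch
        # these ends up making more kevent() calls than epoll_ctl() calls.
        changes = [
            select.kevent(fd, select.KQ_FILTER_READ,
                          select.KQ_EV_ADD if read else select.KQ_EV_DELETE),
            select.kevent(fd, select.KQ_FILTER_WRITE,
                          select.KQ_EV_ADD if write else select.KQ_EV_DELETE),
        ]
        kq.control(changes, 0)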

-Ben

From guido at python.org  Sun Nov  4 17:19:08 2012
From: guido at python.org (Guido van Rossum)
Date: Sun, 4 Nov 2012 08:19:08 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
	<CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
Message-ID: <CAP7+vJKmXVWnNtOdg_YbCTC-_4vwGyeBLdaSCt1eyTQjFOXuTg@mail.gmail.com>

On Sun, Nov 4, 2012 at 8:11 AM, Ben Darnell <ben at bendarnell.com> wrote:
> The extra system calls add up.  The interface of Tornado's IOLoop was based
> on epoll (where the internal state is roughly a mapping {fd: event_set}), so
> it requires more register/unregister operations when running on kqueue
> (where the internal state is roughly a set of (fd, event) pairs).  This
> shows up in benchmarks of the HTTPServer; it's faster on platforms with
> epoll than platforms with kqueue.  In low-concurrency scenarios it's
> actually faster to use select() even when kqueue is available (or maybe
> that's a mac-specific quirk).

Awesome info!

-- 
--Guido van Rossum (python.org/~guido)


From shibturn at gmail.com  Sun Nov  4 18:32:52 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Sun, 04 Nov 2012 17:32:52 +0000
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
Message-ID: <k768s3$ms1$1@ger.gmane.org>

On 04/11/12 15:26, Guido van Rossum wrote:
> I've been thinking about this too. I can see the scalability issues with
> select(),  but frankly, poll(), epoll(), and even kqueue() all look
> similar in O() behavior to me from an API perspective. I guess the
> differences are in the kernel -- but is it a constant factor or an
> unfortunate O(N) or worse? To what extent would this be overwhelmed by
> overhead in the Python code we're writing around it? How bad is it to
> add extra register()/unregister() (or modify()) calls per read operation?

At the C level poll() and epoll() have quite different APIs.  Each time 
you use poll() you have to pass an array which describes the events you 
are interested in.  That is not necessary with epoll().

The Python API hides the difference.
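
For example, the Python-level usage has the same register-then-wait
shape either way (a sketch; poll() here, with a pipe to make it concrete):

    import os
    import select

    r, w = os.pipe()
    os.write(w, b"x")

    poller = select.poll()    # select.epoll() has the same shape
                              # (its timeout is in seconds, not ms)
    poller.register(r, select.POLLIN)
    events = poller.poll(1000)   # [(r, select.POLLIN)]; timeout in ms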

--
Richard




From techtonik at gmail.com  Sun Nov  4 22:49:24 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Mon, 5 Nov 2012 00:49:24 +0300
Subject: [Python-ideas] sys.py3k
Message-ID: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>

if sys.py3k:
  # some py2k specific code
  pass


Why?
 1. readable
 2. traceable

Explained:
 1. self-explanatory
 2. sys.version_info >= (3, 0) or  sys.version[0] == '3' is harder to
trace when you need to find all python 3 related hacks

--
anatoly t.


From lists at studiosola.com  Sun Nov  4 22:50:20 2012
From: lists at studiosola.com (Kevin LaTona)
Date: Sun, 4 Nov 2012 13:50:20 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <k768s3$ms1$1@ger.gmane.org>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
	<k768s3$ms1$1@ger.gmane.org>
Message-ID: <AF415991-9BA0-49FA-95E0-F5EC798CA5DC@studiosola.com>




I came upon a set of blog posts today that some folks who are tracking  
this async discussion might find interesting for further research and  
ideas.


http://blog.incubaid.com/2012/04/02/tracking-asynchronous-io-using-type-systems/


-Kevin



From phd at phdru.name  Sun Nov  4 22:57:20 2012
From: phd at phdru.name (Oleg Broytman)
Date: Mon, 5 Nov 2012 01:57:20 +0400
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
Message-ID: <20121104215720.GA26449@iskra.aviel.ru>

On Mon, Nov 05, 2012 at 12:49:24AM +0300, anatoly techtonik <techtonik at gmail.com> wrote:
> if sys.py3k:

1. import sys
2. sys.py3k = sys.version_info >= (3, 0)
3. ???
4. PROFIT!

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.


From pyideas at rebertia.com  Sun Nov  4 23:02:28 2012
From: pyideas at rebertia.com (Chris Rebert)
Date: Sun, 4 Nov 2012 14:02:28 -0800
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
Message-ID: <CAMZYqRSBq4t05bA+-bQiQbmMDzCHjC7r8-Np5O2=g+04fpW3Lw@mail.gmail.com>

On Sun, Nov 4, 2012 at 1:49 PM, anatoly techtonik <techtonik at gmail.com> wrote:
> if sys.py3k:
>   # some py2k specific code
(I assume your comment has a numeric typo?)
>   pass

You would need that attribute to also be present in Python 2.x for
your snippet to work, and my understanding is that 2.x is closed to
feature additions.

> Why?
>  1. readable
>  2. traceable
>
> Explained:
>  1. self-explanatory

"py3k" is a nickname/codename that not everyone using Python 3 may know about.

>  2. sys.version_info >= (3, 0) or  sys.version[0] == '3' is harder to
> trace when you need to find all python 3 related hacks


Rebutted:
"There should be one-- and preferably only one --obvious way to do
it." Apparently we already have at least 2 ways to do it; let's not
muddle things further by adding yet another.

Cheers,
Chris


From steve at pearwood.info  Sun Nov  4 23:33:42 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Mon, 05 Nov 2012 09:33:42 +1100
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
Message-ID: <5096ED46.20502@pearwood.info>

On 05/11/12 08:49, anatoly techtonik wrote:
> if sys.py3k:
>    # some py2k specific code
>    pass

Do you expect every single Python 3.x version will have exactly the same
feature set? That's not true now, and it won't be true in the future.

In my opinion, a better approach is more verbose and a little more work, but
safer and more reliable: check for the actual feature you care about, not
some version number. E.g. I do things like this:


# Bring back reload in Python 3.
try:
     reload
except NameError:
     from imp import reload


Now your code is future-proofed: if Python 3.5 moves reload back into
the builtins, your code won't needlessly replace it. Or if you're running
under some environment that monkey-patches the builtins (I don't know,
IDLE or IPython or something?) you will use their patched reload instead
of the one in the imp module.


Or I go the other way:


try:
     any
except NameError:
     # Python 2.4 compatibility.
     def any(items):
         for item in items:
             if item:
                 return True
         return False


Now if I'm running under a version of 2.4 that has backported the
"any" function, I will prefer the backported version to my own.



-- 
Steven


From oscar.j.benjamin at gmail.com  Mon Nov  5 03:08:33 2012
From: oscar.j.benjamin at gmail.com (Oscar Benjamin)
Date: Mon, 5 Nov 2012 02:08:33 +0000
Subject: [Python-ideas] sys.py3k
In-Reply-To: <5096ED46.20502@pearwood.info>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<5096ED46.20502@pearwood.info>
Message-ID: <CAHVvXxSJ1wP8+VK1gZP5cyCL9Xr9iu209DeQWgGAOo4AmZ6UZQ@mail.gmail.com>

On 4 November 2012 22:33, Steven D'Aprano <steve at pearwood.info> wrote:
> On 05/11/12 08:49, anatoly techtonik wrote:
>>
>> if sys.py3k:
>>    # some py2k specific code
>>    pass
>
>
> Do you expect every single Python 3.x version will have exactly the same
> feature set? That's not true now, and it won't be true in the future.
>
> In my opinion, a better approach is more verbose and a little more work, but
> safer and more reliable: check for the actual feature you care about, not
> some version number. E.g. I do things like this:
>
>
> # Bring back reload in Python 3.
> try:
>     reload
> except NameError:
>     from imp import reload

There are certain cases where explicitly checking the version makes
sense. I think that Python 3 vs Python 2 is sometimes such a case.
Python 3 changes the meaning of a number of elementary aspects of
Python so that the same code can run without error but with different
semantics under the two different version series.

Checking the version rather than checking the attribute/name would
often be a mistake when comparing, say, Python 2.6 and Python 2.7, since
you're better off just sticking to 2.6 syntax and checking for
potentially useful names available under 2.7 as you describe.

On the other hand if you are distinguishing between 2.x and 3.x then
it is sometimes clearer and more robust to explicitly make a version
check rather than think hard about how to write code that works in
both cases (and hope that you remember your reasoning later). It also
makes it easier for you to clean up your codebase when you eventually
drop support for 2.x.


Oscar


From steve at pearwood.info  Mon Nov  5 07:30:09 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Mon, 5 Nov 2012 17:30:09 +1100
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAHVvXxSJ1wP8+VK1gZP5cyCL9Xr9iu209DeQWgGAOo4AmZ6UZQ@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<5096ED46.20502@pearwood.info>
	<CAHVvXxSJ1wP8+VK1gZP5cyCL9Xr9iu209DeQWgGAOo4AmZ6UZQ@mail.gmail.com>
Message-ID: <20121105063008.GA14836@ando>

On Mon, Nov 05, 2012 at 02:08:33AM +0000, Oscar Benjamin wrote:

> There are certain cases where explicitly checking the version makes
> sense. I think that Python 3 vs Python 2 is sometimes such a case.
> Python 3 changes the meaning of a number of elementary aspects of
> Python so that the same code can run without error but with different
> semantics under the two different version series.

You can test for that without an explicit version check.

if isinstance(map(lambda x: x, []), list):
    # Python 2 semantics
    ...
else:
    # Python 3 semantics
    ...

This now guards you against (e.g.) somebody backporting Python 3 
semantics to "Python 2.8" (it's opensource, somebody could fork 
CPython), or running your code under "FooPython" which has 3.x semantics 
and a 1.x version number.

This is more work than just mechanically looking at the version number, 
but it's not that much more work, and is more reliable since it 
explicitly checks for the feature you want, rather than an implicit 
check based on the version number.

In any case, arguments about defensive coding style are getting 
off-topic. The point is that there are various ways to test for the 
existence of features, and adding yet another coarse-grained test 
"sys.py3k" doesn't gain us much (if anything).


-- 
Steven


From techtonik at gmail.com  Mon Nov  5 07:48:49 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Mon, 5 Nov 2012 09:48:49 +0300
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
Message-ID: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>

Why?
Because it is critical when comparing paths and non-trivial.

http://stackoverflow.com/questions/4579908/cross-platform-splitting-of-path-in-python
http://stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python
http://www.gossamer-threads.com/lists/python/dev/654410


--
anatoly t.


From g.brandl at gmx.net  Mon Nov  5 07:55:24 2012
From: g.brandl at gmx.net (Georg Brandl)
Date: Mon, 05 Nov 2012 07:55:24 +0100
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
Message-ID: <k77nsr$4i3$1@ger.gmane.org>

Am 04.11.2012 22:49, schrieb anatoly techtonik:
> if sys.py3k:
>   # some py2k specific code
>   pass
> 
> 
> Why?
>  1. readable
>  2. traceable
> 
> Explained:
>  1. self-explanatory
>  2. sys.version_info >= (3, 0) or  sys.version[0] == '3' is harder to
> trace when you need to find all python 3 related hacks

This proposal is roughly 3 minor versions late.  I can offer you
a sys.py3_4 attribute though... ;)

Georg



From ncoghlan at gmail.com  Mon Nov  5 09:04:41 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 5 Nov 2012 18:04:41 +1000
Subject: [Python-ideas] sys.py3k
In-Reply-To: <k77nsr$4i3$1@ger.gmane.org>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<k77nsr$4i3$1@ger.gmane.org>
Message-ID: <CADiSq7d4WEttOrOm_48D_fYgdopTaqBszxnxBCUx1phig-SGiQ@mail.gmail.com>

On Mon, Nov 5, 2012 at 4:55 PM, Georg Brandl <g.brandl at gmx.net> wrote:
> Am 04.11.2012 22:49, schrieb anatoly techtonik:
>> if sys.py3k:
>>   # some py2k specific code
>>   pass
>>
>>
>> Why?
>>  1. readable
>>  2. traceable
>>
>> Explained:
>>  1. self-explanatory
>>  2. sys.version_info >= (3, 0) or  sys.version[0] == '3' is harder to
>> trace when you need to find all python 3 related hacks
>
> This proposal is roughly 3 minor versions late.  I can offer you
> a sys.py3_4 attribute though... ;)

Even better (http://packages.python.org/six/#package-contents):

    import six

    if six.PY3:
        # Ooh, Python 3
    else:
        # Not Python 3

If anyone is trying to do single code base Python 2/3 support without
relying on six, they're doing it wrong. Even bundling a copy (if you
don't want to deal with dependency management issues) is a better idea
than reinventing that wheel.

If you *are* rolling your own (or need additional compatibility fixes
that six doesn't provide), then all Python 2/3 compatibility hacks
should be located in a small number of compatibility modules. They
*shouldn't* be distributed widely throughout your codebase.
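
For example, a hand-rolled compat module can be as small as this sketch:

    # compat.py -- keep the 2/3 switching in one place
    import sys

    PY3 = sys.version_info[0] >= 3

    if PY3:
        text_type = str
        from io import StringIO
    else:
        text_type = unicode          # only defined on Python 2
        from StringIO import StringIO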

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From ubershmekel at gmail.com  Mon Nov  5 09:10:41 2012
From: ubershmekel at gmail.com (Yuval Greenfield)
Date: Mon, 5 Nov 2012 10:10:41 +0200
Subject: [Python-ideas] with-statement syntactic quirk
In-Reply-To: <CADiSq7dZo65HFwMgSiYUfUYpXP4PRqyfjTbmrT5tRXAz1PKF6Q@mail.gmail.com>
References: <20121031113853.66fb0514@resist>
	<CADiSq7fFnpm8kA6ewJvTD5W5Tdnon7R0crQWm6afW28XBmGz0g@mail.gmail.com>
	<CANSw7KzqACFjifrz0LwEoTiQqnTKv0LtierJxRQbp5V_wOEqmQ@mail.gmail.com>
	<CADiSq7fPHSiO=kDMb=GiacJtO6tO4455tSj_+eOMLQqa3aX5Kg@mail.gmail.com>
	<20121104093239.37e777b8@resist.wooz.org>
	<CADiSq7dZo65HFwMgSiYUfUYpXP4PRqyfjTbmrT5tRXAz1PKF6Q@mail.gmail.com>
Message-ID: <CANSw7Kwat0_jaYU6bnh5Mp1hX=ZjDgAGc3wFH5Ut-gj3KQMKuQ@mail.gmail.com>

On Sun, Nov 4, 2012 at 5:41 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> Yep. You can also do some pretty interesting things with ExitStack
> because of the pop_all() operation (which moves all of the registered
> operations to a *new* ExitStack instance).
>
> I wrote up a few of the motivating use cases as examples and recipes
> in the 3.3 docs:
> http://docs.python.org/3/library/contextlib#examples-and-recipes
>
> I hope to see more interesting uses over time as more people explore
> the possibilities of a dynamic tool for composing context managers
> without needing to worry about the messy details of unwinding them
> correctly (ExitStack.__exit__ is by far the most complicated aspect of
> the implementation).
>
> Cheers,
> Nick.
>
>
Pretty interesting things indeed. It does look like the concept is very
powerful. If I were to design a language I might have just given every
function an optional ExitStack. I.e. an explicit, dynamic, introspectable
version of defer.
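
Something in that spirit can already be approximated with a decorator
(a hypothetical helper, just to illustrate the idea):

    import functools
    from contextlib import ExitStack

    def with_exitstack(func):
        """Pass each call its own ExitStack -- a poor man's per-call defer."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with ExitStack() as stack:
                return func(stack, *args, **kwargs)
        return wrapper

    @with_exitstack
    def copy_prefix(stack, src_name, dst_name, n=1024):
        src = stack.enter_context(open(src_name, "rb"))
        dst = stack.enter_context(open(dst_name, "wb"))
        dst.write(src.read(n))   # both files are closed when the call returns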

From techtonik at gmail.com  Mon Nov  5 12:08:05 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Mon, 5 Nov 2012 14:08:05 +0300
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
In-Reply-To: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
References: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
Message-ID: <CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>

Implementation.

pathsplit('asd/asd\\asd\sad')  == ['asd', 'asd', 'asd', 'sad']

def pathsplit(pathstr):
    """split relative path into list"""
    path = list(os.path.split(pathstr))
    while '' not in path[:2]:
        path[:1] = list(os.path.split(path[0]))
    if path[0] == '':
        return path[1:]
    return path[:1] + path[2:]
--
anatoly t.


On Mon, Nov 5, 2012 at 9:48 AM, anatoly techtonik <techtonik at gmail.com> wrote:
> Why?
> Because it is critical when comparing paths and non-trivial.
>
> http://stackoverflow.com/questions/4579908/cross-platform-splitting-of-path-in-python
> http://stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python
> http://www.gossamer-threads.com/lists/python/dev/654410
>
>
> --
> anatoly t.


From techtonik at gmail.com  Mon Nov  5 13:41:13 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Mon, 5 Nov 2012 15:41:13 +0300
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
In-Reply-To: <CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>
References: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
	<CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>
Message-ID: <CAPkN8xK=YQQjrx1ZegHvnOfKj+FtYbWXQv80Sai=CLFjKhT_WA@mail.gmail.com>

It appears that implementation fails on 'foo/bar/'
--
anatoly t.


On Mon, Nov 5, 2012 at 2:08 PM, anatoly techtonik <techtonik at gmail.com> wrote:
> Implementation.
>
> pathsplit('asd/asd\\asd\sad')  == '['asd', 'asd', 'asd', 'sad']
>
> def pathsplit(pathstr):
>     """split relative path into list"""
>     path = list(os.path.split(pathstr))
>     while '' not in path[:2]:
>         path[:1] = list(os.path.split(path[0]))
>     if path[0] == '':
>         return path[1:]
>     return path[:1] + path[2:]
> --
> anatoly t.
>
>
> On Mon, Nov 5, 2012 at 9:48 AM, anatoly techtonik <techtonik at gmail.com> wrote:
>> Why?
>> Because it is critical when comparing paths and non-trivial.
>>
>> http://stackoverflow.com/questions/4579908/cross-platform-splitting-of-path-in-python
>> http://stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python
>> http://www.gossamer-threads.com/lists/python/dev/654410
>>
>>
>> --
>> anatoly t.


From ned at nedbatchelder.com  Mon Nov  5 13:52:44 2012
From: ned at nedbatchelder.com (Ned Batchelder)
Date: Mon, 05 Nov 2012 07:52:44 -0500
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
In-Reply-To: <CAPkN8xK=YQQjrx1ZegHvnOfKj+FtYbWXQv80Sai=CLFjKhT_WA@mail.gmail.com>
References: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
	<CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>
	<CAPkN8xK=YQQjrx1ZegHvnOfKj+FtYbWXQv80Sai=CLFjKhT_WA@mail.gmail.com>
Message-ID: <5097B69C.5080202@nedbatchelder.com>

Anatoly, I appreciate the energy and dedication you've shown to the 
Python community, but maybe you should spend a little more time on each 
proposal?  For example, the subject line here uses a different (and
already taken) function name than the implementation, and has a maxsplit
argument that the implementation doesn't have.

Get everything the way you want it, and then propose it.

--Ned.

On 11/5/2012 7:41 AM, anatoly techtonik wrote:
> It appears that implementation fails on 'foo/bar/'
> --
> anatoly t.
>
>
> On Mon, Nov 5, 2012 at 2:08 PM, anatoly techtonik <techtonik at gmail.com> wrote:
>> Implementation.
>>
>> pathsplit('asd/asd\\asd\sad')  == '['asd', 'asd', 'asd', 'sad']
>>
>> def pathsplit(pathstr):
>>      """split relative path into list"""
>>      path = list(os.path.split(pathstr))
>>      while '' not in path[:2]:
>>          path[:1] = list(os.path.split(path[0]))
>>      if path[0] == '':
>>          return path[1:]
>>      return path[:1] + path[2:]
>> --
>> anatoly t.
>>
>>
>> On Mon, Nov 5, 2012 at 9:48 AM, anatoly techtonik <techtonik at gmail.com> wrote:
>>> Why?
>>> Because it is critical when comparing paths and non-trivial.
>>>
>>> http://stackoverflow.com/questions/4579908/cross-platform-splitting-of-path-in-python
>>> http://stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python
>>> http://www.gossamer-threads.com/lists/python/dev/654410
>>>
>>>
>>> --
>>> anatoly t.
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>



From sturla at molden.no  Mon Nov  5 15:19:31 2012
From: sturla at molden.no (Sturla Molden)
Date: Mon, 05 Nov 2012 15:19:31 +0100
Subject: [Python-ideas] The async API of the future
In-Reply-To: <20121103182255.70ea9c5a@pitrou.net>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
	<20121103111418.1dae7525@pitrou.net>
	<8136D88B-5345-4260-BD03-1D286C799938@molden.no>
	<20121103182255.70ea9c5a@pitrou.net>
Message-ID: <5097CAF3.3030702@molden.no>

On 03.11.2012 18:22, Antoine Pitrou wrote:

 >> With IOCP on Windows there is a thread-pool that continuously polls
 >> the i/o tasks for completion. So I think IOCPs might approach O(n)
 >> at some point.
 >
 > Well, I don't know about the IOCP implementation, but "continuously
 > polling the I/O tasks" sounds like a costly way to do it (what system
 > call would that use?).

The polling uses the system call GetOverlappedResult, and if the task is
unfinished, calls Sleep(0) to release the time-slice and polls again.

Specifically, if the last argument to GetOverlappedResult is FALSE, and 
the return value is FALSE, we must call GetLastError to retrieve an 
error code. If GetLastError returns ERROR_IO_INCOMPLETE, we know that 
the task was not finished.

A bit more sophisticated: Put all these asynchronous i/o tasks in a fifo 
queue, and set up a thread-pool that pops tasks off the queue and polls 
with GetOverlappedResult and GetLastError. A task that is unfinished 
goes back into the queue. If a task is complete, the thread that popped 
it off the queue executes a callback. A thread-pool that operates like
this will reduce or prevent the excessive context switches in the
kernel that multiple threads hammering on Sleep(0) would otherwise incur. Then
invent a fancy name for this scheme, e.g. call it "I/O Completion Ports".

Then you notice that because of the queue, the latency is O(n), with n
the number of pending i/o tasks in the "I/O Completion Port". To keep
this from hurting latency, you patch your program by setting up multiple
"I/O Completion Ports", and reinvent the load balancer to distribute i/o
tasks across them. With a bit of work, the server will remain responsive
and "rather scalable" as long as it is still i/o bound. The moment the
number of i/o tasks makes the server go CPU bound (which will happen
rather soon because of the way IOCPs operate), the computer overheats
and goes up in smoke. And
that is when the MBA manager starts to curse Windows as well, and 
finally agrees to use Linux or *BSD/Apple instead ;-)


 > If the kernel cooperates, no continuous polling
 > should be required.

Indeed.

However:

My main problem with IOCPs is that they provide the "wrong" signal. They
tell us when I/O is completed. But then the work is already done, and 
how did we know when to start?

The async i/o mechanisms in select, poll, epoll, kqueue, /dev/poll, etc. do the
opposite. They inform us when to start an i/o task, which makes more 
sense to me at least.

Typically, programs that use IOCP must invent their own means of 
signalling "i/o ready to start", which might kill any advantage of using 
IOCPs over simpler means (e.g. blocking i/o).

This, by the way, makes me wonder what Windows SUA does. It is OpenBSD
based. Does it have kqueue or /dev/poll? If so, there must be support
for it in ntdll.dll, and we might use those functions instead of pesky 
IOCPs.



Sturla


From barry at python.org  Mon Nov  5 17:11:06 2012
From: barry at python.org (Barry Warsaw)
Date: Mon, 5 Nov 2012 11:11:06 -0500
Subject: [Python-ideas] sys.py3k
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<k77nsr$4i3$1@ger.gmane.org>
	<CADiSq7d4WEttOrOm_48D_fYgdopTaqBszxnxBCUx1phig-SGiQ@mail.gmail.com>
Message-ID: <20121105111106.7f271238@resist.wooz.org>

On Nov 05, 2012, at 06:04 PM, Nick Coghlan wrote:

>Even better (http://packages.python.org/six/#package-contents):
>
>    import six
>
>    if six.PY3:
>        # Ooh, Python 3
>    else:
>        # Not Python 3
>
>If anyone is trying to do single code base Python 2/3 support without
>relying on six, they're doing it wrong. Even bundling a copy (if you
>don't want to deal with dependency management issues) is a better idea
>than reinventing that wheel.
>
>If you *are* rolling your own (or need additional compatibility fixes
>that six doesn't provide), then all Python 2/3 compatibility hacks
>should be located in a small number of compatibility modules. They
>*shouldn't* be distributed widely throughout your codebase.

While I agree with the sentiment, and also agree that six is an excellent
package that can be very useful, I'll just point out that it's often very
possible and not at all painful to write to a single code base without using
it.  It all depends on what your code does/needs.

Cheers,
-Barry

From guido at python.org  Mon Nov  5 18:41:20 2012
From: guido at python.org (Guido van Rossum)
Date: Mon, 5 Nov 2012 09:41:20 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <5097CAF3.3030702@molden.no>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
	<20121103111418.1dae7525@pitrou.net>
	<8136D88B-5345-4260-BD03-1D286C799938@molden.no>
	<20121103182255.70ea9c5a@pitrou.net> <5097CAF3.3030702@molden.no>
Message-ID: <CAP7+vJJDjAL41nYy+FTDqW3ipFUe4edfe62A6WuT4NA+m+Fhpg@mail.gmail.com>

On Mon, Nov 5, 2012 at 6:19 AM, Sturla Molden <sturla at molden.no> wrote:
> My main problem with IOCP is that they provide the "wrong" signal. They tell
> us when I/O is completed. But then the work is already done, and how did we
> know when to start?
>
> The asynch i/o in select, poll, epoll, kqueue, /dev/poll, etc. do the
> opposite. They inform us when to start an i/o task, which makes more sense
> to me at least.
>
> Typically, programs that use IOCP must invent their own means of signalling
> "i/o ready to start", which might kill any advantage of using IOCPs over
> simpler means (e.g. blocking i/o).

This sounds like you are thoroughly used to the UNIX way and don't
appreciate how odd that feels to someone first learning about it
(after having used blocking I/O, perhaps in threads for years).

From that perspective, the Windows model is actually easier to grasp
than the UNIX model, because it is more similar to the synchronous
model: in the synchronous model, you say e.g. "fetch the next 32
bytes"; in the async model you say, "start fetching the next 32 bytes
and tell me when you've got them".

Whereas in the select()-based model, you have to change your code to
say "tell me when I can fetch some more bytes without blocking" and
when you are told, you have to fetch *some* bytes, but you may not get
all 32 bytes, and it is even possible that the signal was an outright
lie, so you have to build a loop around this until you actually have
gotten 32 bytes. Same if instead of 32 bytes you want the next line --
select() and friends don't tell you whether you can read a whole line,
just when at least one more byte is ready.

So it's all a matter of perspective, and there is nothing "wrong" with
IOCP. Note, I don't think there is anything wrong with the select()
model either -- they're just different but equally valid models of the
world, that cause you to structure your code vastly differently. Like
wave vs. particle, almost.
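
The readiness-style loop described above looks roughly like this (a
sketch with a blocking socket, just to show the shape):

    def read_exactly(sock, n):
        # "Readable" only promises *some* bytes, so keep reading until
        # we really have n of them or the peer closes the connection.
        chunks = []
        remaining = n
        while remaining:
            chunk = sock.recv(remaining)
            if not chunk:
                raise EOFError("connection closed early")
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)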

-- 
--Guido van Rossum (python.org/~guido)


From sam-pydeas at rushing.nightmare.com  Mon Nov  5 20:30:40 2012
From: sam-pydeas at rushing.nightmare.com (Sam Rushing)
Date: Mon, 05 Nov 2012 11:30:40 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
	<CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
Message-ID: <509813E0.3080600@rushing.nightmare.com>

On 11/4/12 8:11 AM, Ben Darnell wrote:
>
> The extra system calls add up.  The interface of Tornado's IOLoop was
> based on epoll (where the internal state is roughly a mapping {fd:
> event_set}), so it requires more register/unregister operations when
> running on kqueue (where the internal state is roughly a set of (fd,
> event) pairs).  This shows up in benchmarks of the HTTPServer; it's
> faster on platforms with epoll than platforms with kqueue.  In
> low-concurrency scenarios it's actually faster to use select() even
> when kqueue is available (or maybe that's a mac-specific quirk).
>
>
Just so I have this right, you're saying that HTTPServer is slower on
kqueue because of the IOLoop design, yes?

I've just looked over the epoll interface and I see at least one huge
difference compared to kqueue: it requires a system call for each fd
registration event.  With kevent() you can accumulate thousands of
registrations, shove them into a single kevent() call and get thousands
of events out.  It's a little all-singing-all-dancing, but it's hard to
imagine a way to do it using fewer system calls. 8^)
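
In Python terms, roughly (a sketch; assumes select.kqueue is available):

    import select
    import socket

    socks = [socket.socket() for _ in range(1000)]
    kq = select.kqueue()

    # Accumulate all the registrations and submit them in one kevent() call.
    changes = [select.kevent(s.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD)
               for s in socks]
    kq.control(changes, 0)

    # A single further call waits and can return many events at once.
    events = kq.control(None, 64, 1.0)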

-Sam



From markus at unterwaditzer.net  Mon Nov  5 20:28:41 2012
From: markus at unterwaditzer.net (Markus Unterwaditzer)
Date: Mon, 5 Nov 2012 20:28:41 +0100
Subject: [Python-ideas] The "in"-statement
Message-ID: <20121105192841.GA26572@untibox>

This is my first post on this mailing list and I haven't lurked a long time yet,
so please be gentle.

While mocking objects, I got annoyed by the following code pattern I had to
use when modifying multiple attributes on a single object::

    obj.first_attr = "value"
    obj.second_attr = "value2"

    some_other = "lel"

I thought it would be neat if I could do::

    in obj:
        first_attr = "value"
        second_attr = "value2"

    some_other = "lel"  # indenting this would cause it to appear as an attribute of obj

Just a vague idea. Tell me what you think.

-- Markus


From bruce at leapyear.org  Mon Nov  5 20:47:12 2012
From: bruce at leapyear.org (Bruce Leban)
Date: Mon, 5 Nov 2012 11:47:12 -0800
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <CAGu0AntHN21qC+38zB_UGVAwns+okvskS=3PnA0Tz4C74_cAAw@mail.gmail.com>

On Mon, Nov 5, 2012 at 11:28 AM, Markus Unterwaditzer <
markus at unterwaditzer.net> wrote:

> While mocking objects, i got annoyed by the following code pattern i had to
> use when modifying multiple attributes on a single object::
>
>     obj.first_attr = "value"
>     obj.second_attr = "value2"
>
>     some_other = "lel"
>
> I thought it would be neat if i could do::
>
>     in obj:
>         first_attr = "value"
>         second_attr = "value2"
>
>     some_other = "lel"  # indenting this would cause it to appear as an
> attribute of obj
>

Hard to read, error-prone and ill-defined. Does it create new attributes or
only change existing ones? What about identifiers on right hand sides? What
would
    third_attr = lambda: first_attr
do?

And this is easy enough:

def multi_setattr(obj, **kwargs):
    for k in kwargs:
        setattr(obj, k, kwargs[k])

multi_setattr(obj,
    first_attr = "value",
    second_attr = "value2")

--- Bruce

From storchaka at gmail.com  Mon Nov  5 20:49:20 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Mon, 05 Nov 2012 21:49:20 +0200
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <k79586$sp0$1@ger.gmane.org>

On 05.11.12 21:28, Markus Unterwaditzer wrote:
> I thought it would be neat if i could do::
>
>      in obj:
>          first_attr = "value"
>          second_attr = "value2"

     vars(obj).update(
         first_attr="value",
         second_attr="value2",
         )

Or obj.__dict__.update.



From mikegraham at gmail.com  Mon Nov  5 21:29:06 2012
From: mikegraham at gmail.com (Mike Graham)
Date: Mon, 5 Nov 2012 15:29:06 -0500
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <k79586$sp0$1@ger.gmane.org>
References: <20121105192841.GA26572@untibox>
	<k79586$sp0$1@ger.gmane.org>
Message-ID: <CAEBZo3PZTkFOH2G99PNyXriT0v_m80EM_nO1LnYd2e9Lkdqnng@mail.gmail.com>

On Mon, Nov 5, 2012 at 2:49 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> On 05.11.12 21:28, Markus Unterwaditzer wrote:
>>
>> I thought it would be neat if i could do::
>>
>>      in obj:
>>          first_attr = "value"
>>          second_attr = "value2"
>
>
>     vars(obj).update(
>         first_attr="value",
>         second_attr="value2",
>         )
>
> Or obj.__dict__.update.

Tinkering with the object's attribute dict directly using either of
these is extremely error-prone because it does not work for many
objects and because it circumvents the descriptor protocol.

Mike


From masklinn at masklinn.net  Mon Nov  5 21:44:26 2012
From: masklinn at masklinn.net (Masklinn)
Date: Mon, 5 Nov 2012 21:44:26 +0100
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <CAGu0AntHN21qC+38zB_UGVAwns+okvskS=3PnA0Tz4C74_cAAw@mail.gmail.com>
References: <20121105192841.GA26572@untibox>
	<CAGu0AntHN21qC+38zB_UGVAwns+okvskS=3PnA0Tz4C74_cAAw@mail.gmail.com>
Message-ID: <881094A2-7116-4273-8D9D-16687142BC58@masklinn.net>


On 2012-11-05, at 20:47 , Bruce Leban wrote:

> On Mon, Nov 5, 2012 at 11:28 AM, Markus Unterwaditzer <
> markus at unterwaditzer.net> wrote:
> 
>> While mocking objects, i got annoyed by the following code pattern i had to
>> use when modifying multiple attributes on a single object::
>> 
>>    obj.first_attr = "value"
>>    obj.second_attr = "value2"
>> 
>>    some_other = "lel"
>> 
>> I thought it would be neat if i could do::
>> 
>>    in obj:
>>        first_attr = "value"
>>        second_attr = "value2"
>> 
>>    some_other = "lel"  # indenting this would cause it to appear as an
>> attribute of obj
>> 
> 
> Hard to read, error-prone and ill-defined. Does it create new attributes or
> only change existing ones? What about identifiers on right hand sides? What
> would
>    third_attr = lambda: first_attr
> do?

Even well-defined, it sounds like a shortcut creating more issues than
it solves. JavaScript has something which sounds very, very similar in
`with`, and apart from being hell on performance and having very broken
semantics[0] its one and only non-hacky use case is for `eval` (because
javascript's eval doesn't have `locals` and `globals` parameters).

[0] `with` uses the object it is provided as an internal scope, which
     means this:

    with (a) {
        b = 3;
    }

    will result in `a.b` being set to 3 if `a` already has a `b`
    attribute, otherwise it may clobber an existing `b` in a higher
    lexical scope, and if none exists it will just create a brand new
    global `b`.

From tjreedy at udel.edu  Mon Nov  5 22:48:42 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Mon, 05 Nov 2012 16:48:42 -0500
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <k79c7s$s6b$1@ger.gmane.org>

On 11/5/2012 2:28 PM, Markus Unterwaditzer wrote:
> This is my first post on this mailinglist and i haven't lurked a long time yet,
> so please be gentle.
>
> While mocking objects, i got annoyed by the following code pattern i had to
> use when modifying multiple attributes on a single object::
>
>      obj.first_attr = "value"
>      obj.second_attr = "value2"

o = obj  # solves most of the 'problem' of retyping 'obj' over and over
o.first_attr = 'aval'
o.sec = o.third(1, o.fourth)

If the original is

obj.first = first
obj.second = second

as is common in __init__, then your solution

in obj:
   first = first
   second = second

requires disambiguation by, I presume, position.

New syntax must add something that is not trivial to do now.

-- 
Terry Jan Reedy



From joshua.landau.ws at gmail.com  Mon Nov  5 23:05:33 2012
From: joshua.landau.ws at gmail.com (Joshua Landau)
Date: Mon, 5 Nov 2012 22:05:33 +0000
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <CAGu0AntHN21qC+38zB_UGVAwns+okvskS=3PnA0Tz4C74_cAAw@mail.gmail.com>
References: <20121105192841.GA26572@untibox>
	<CAGu0AntHN21qC+38zB_UGVAwns+okvskS=3PnA0Tz4C74_cAAw@mail.gmail.com>
Message-ID: <CAN1F8qWWoi4VJUdVB_Nqhf70fJwjS6Mu_US3JZnpHEbNchvpTw@mail.gmail.com>

On 5 November 2012 19:47, Bruce Leban <bruce at leapyear.org> wrote:

> On Mon, Nov 5, 2012 at 11:28 AM, Markus Unterwaditzer <
> markus at unterwaditzer.net> wrote:
>
>> While mocking objects, i got annoyed by the following code pattern i had
>> to
>> use when modifying multiple attributes on a single object::
>>
>>     obj.first_attr = "value"
>>     obj.second_attr = "value2"
>>
>>     some_other = "lel"
>>
>> I thought it would be neat if i could do::
>>
>>     in obj:
>>         first_attr = "value"
>>         second_attr = "value2"
>>
>>     some_other = "lel"  # indenting this would cause it to appear as an
>> attribute of obj
>>
>
> Hard to read, error-prone and ill-defined. Does it create new attributes
> or only change existing ones? What about identifiers on right hand sides?
> What would
>     third_attr = lambda: first_attr
> do?
>

My solution has always been:

.third_attr = lambda: .first_attr
.fourth_attr = not_attr

Although a single "." is hard for some [weaklings; pah!] to see.

It has exactly one use-case in my opinion:
def m(self, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u,
v, w, x, y, z):
    in self:
        .a, .b, .c, .d, .e, .f, .g, .h, .i, .j, .k, .l, .m, .n, .o, .p, .q,
.r, .s, .t, .u, .v, .w, .x, .y, .z = \
            a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u,
v, w, x, y, z

....not that having an excuse to do this is a good thing...

And this is easy enough:
>

> def multi_setattr(obj, **kwargs):
>     for k in kwargs:
>         setattr(obj, k, kwargs[k])
>
> multi_setattr(obj,
>     first_attr = "value",
>     second_attr = "value2")
>

TYVM for this, I'd never have thought of it.

From greg.ewing at canterbury.ac.nz  Mon Nov  5 23:23:43 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 06 Nov 2012 11:23:43 +1300
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAP7+vJJDjAL41nYy+FTDqW3ipFUe4edfe62A6WuT4NA+m+Fhpg@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<E2D5E1C4-EA8A-48CE-A2B9-5ED48C527049@molden.no>
	<20121103003036.74621d59@pitrou.net>
	<15F02388-6EC1-4413-A42A-800F92804144@molden.no>
	<20121103005400.6fb1735f@pitrou.net>
	<3BCCACF3-B24E-4CF6-AA16-8837224BCA2D@molden.no>
	<20121103111418.1dae7525@pitrou.net>
	<8136D88B-5345-4260-BD03-1D286C799938@molden.no>
	<20121103182255.70ea9c5a@pitrou.net> <5097CAF3.3030702@molden.no>
	<CAP7+vJJDjAL41nYy+FTDqW3ipFUe4edfe62A6WuT4NA+m+Fhpg@mail.gmail.com>
Message-ID: <50983C6F.9050303@canterbury.ac.nz>

Guido van Rossum wrote:
> when you are told you have to fetch *some* bytes" but you may not get
> all 32 bytes ... so you have to build a loop around this until you actually have
> gotten 32 bytes. Same if instead of 32 bytes you want the next line --

You have to build a loop for these reasons when using
synchronous calls, too. You just don't usually notice this
because the libraries take care of it for you.

-- 
Greg


From steve at pearwood.info  Mon Nov  5 23:22:01 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Tue, 06 Nov 2012 09:22:01 +1100
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
In-Reply-To: <5097B69C.5080202@nedbatchelder.com>
References: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
	<CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>
	<CAPkN8xK=YQQjrx1ZegHvnOfKj+FtYbWXQv80Sai=CLFjKhT_WA@mail.gmail.com>
	<5097B69C.5080202@nedbatchelder.com>
Message-ID: <50983C09.7030802@pearwood.info>

On 05/11/12 23:52, Ned Batchelder wrote:

> Anatoly, I appreciate the energy and dedication you've shown to the
>Python community, but maybe you should spend a little more time on
>each proposal? For example, the subject line here is a different (and
>already taken) function name than the implementation, and has a
>maxsplit argument that the implementation doesn't have.
>
> Get everything the way you want it, and then propose it.

+1

Also consider publishing it as a recipe on ActiveState, where many
people will view it, use it, and offer feedback. This has many
benefits:

* You will gauge community interest;

* Many eyeballs make bugs shallow;

* You are providing a useful recipe that others can use, even
   if it doesn't get included in the std lib.

Some of the most useful parts of the std lib, like namedtuple,
started life on ActiveState.

http://code.activestate.com/recipes/langs/python/

-- 
Steven


From steve at pearwood.info  Mon Nov  5 23:36:56 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Tue, 06 Nov 2012 09:36:56 +1100
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <50983F88.9010105@pearwood.info>

On 06/11/12 06:28, Markus Unterwaditzer wrote:

> I thought it would be neat if i could do::
>
>      in obj:
>          first_attr = "value"
>          second_attr = "value2"
>
>      some_other = "lel"  # indenting this would cause it to appear as an attribute of obj
>
> Just a vague idea. Tell me what you think.


This is Pascal's old "with" block. It works well for static languages like
Pascal, where the compiler can tell ahead of time which names belong to what,
but less well for dynamic languages like Python. Non-trivial examples of this
design feature will be ambiguous and error-prone.

There's even a FAQ about it:

http://docs.python.org/2/faq/design.html#why-doesn-t-python-have-a-with-statement-for-attribute-assignments



-- 
Steven


From ncoghlan at gmail.com  Mon Nov  5 23:45:57 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 6 Nov 2012 08:45:57 +1000
Subject: [Python-ideas] sys.py3k
In-Reply-To: <20121105111106.7f271238@resist.wooz.org>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<k77nsr$4i3$1@ger.gmane.org>
	<CADiSq7d4WEttOrOm_48D_fYgdopTaqBszxnxBCUx1phig-SGiQ@mail.gmail.com>
	<20121105111106.7f271238@resist.wooz.org>
Message-ID: <CADiSq7c1j15KNFnt4g6XeR4WKEqf4Emb6JmvuRv=ttvkXErm3w@mail.gmail.com>

On Nov 6, 2012 2:12 AM, "Barry Warsaw" <barry at python.org> wrote:
>
> On Nov 05, 2012, at 06:04 PM, Nick Coghlan wrote:
>
> >Even better (http://packages.python.org/six/#package-contents):
> >
> >    import six
> >
> >    if six.PY3:
> >        # Ooh, Python 3
> >    else:
> >        # Not Python 3
> >
> >If anyone is trying to do single code base Python 2/3 support without
> >relying on six, they're doing it wrong. Even bundling a copy (if you
> >don't want to deal with dependency management issues) is a better idea
> >than reinventing that wheel.
> >
> >If you *are* rolling your own (or need additional compatibility fixes
> >that six doesn't provide), then all Python 2/3 compatibility hacks
> >should be located in a small number of compatibility modules. They
> >*shouldn't* be distributed widely throughout your codebase.
>
> While I agree with the sentiment, and also agree that six is an excellent
> package that can be very useful, I'll just point out that it's often very
> possible and not at all painful to write to a single code base without
using
> it.  It all depends on what your code does/needs.

True, my own 2/3 compatible projects don't use it, but they also don't have
any significant special cases for either version. I guess stick a "for
non-trivial cases" qualifier in there somewhere :)

Cheers,
Nick.

--
Sent from my phone, thus the relative brevity :)

> Cheers,
> -Barry
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/767f4b28/attachment.html>

From wuwei23 at gmail.com  Tue Nov  6 04:53:44 2012
From: wuwei23 at gmail.com (alex23)
Date: Mon, 5 Nov 2012 19:53:44 -0800 (PST)
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <c5ef0c20-4bea-43bf-9a5f-ae6d2d762ebb@kt16g2000pbb.googlegroups.com>

On Nov 6, 5:39 am, Markus Unterwaditzer <mar... at unterwaditzer.net>
wrote:
> I thought it would be neat if i could do::
>     in obj:
>         first_attr = "value"
>         second_attr = "value2"

My concern is that it would promote its own code flow at the expense
of readability. It's great for simple assignment, but doesn't allow
for anything else:

    in obj:
        first_attr = function1()
        temp_var_not_an_attr = function2(first_attr)
        second_attr = function2(temp_var_not_an_attr)

And what is the expected behaviour for something like:

    first_attr = "FOO"
    in obj:
        first_attr = "BAR"
        second_attr = first_attr

Is obj.second_attr "FOO" or "BAR"?  How do I distinguish between outer
& inner scope labels?


From jeanpierreda at gmail.com  Tue Nov  6 08:55:17 2012
From: jeanpierreda at gmail.com (Devin Jeanpierre)
Date: Tue, 6 Nov 2012 02:55:17 -0500
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <CAEBZo3PZTkFOH2G99PNyXriT0v_m80EM_nO1LnYd2e9Lkdqnng@mail.gmail.com>
References: <20121105192841.GA26572@untibox> <k79586$sp0$1@ger.gmane.org>
	<CAEBZo3PZTkFOH2G99PNyXriT0v_m80EM_nO1LnYd2e9Lkdqnng@mail.gmail.com>
Message-ID: <CABicbJLipzg4MkXfmE6xjkp-0b+8PTgNf-MOsxHm3BgfexC8Fw@mail.gmail.com>

On Mon, Nov 5, 2012 at 3:29 PM, Mike Graham <mikegraham at gmail.com> wrote:
> On Mon, Nov 5, 2012 at 2:49 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>> On 05.11.12 21:28, Markus Unterwaditzer wrote:
>>>
>>> I thought it would be neat if i could do::
>>>
>>>      in obj:
>>>          first_attr = "value"
>>>          second_attr = "value2"
>>
>>
>>     vars(obj).update(
>>         first_attr="value",
>>         second_attr="value2",
>>         )
>>
>> Or obj.__dict__.update.
>
> Tinkering with the object's attribute dict directly using either of
> these is extremely error-prone because it does not work for many
> objects and because it circumvents the descriptor protocol.

def update_vars(obj, **kwargs):
    for k, v in kwargs.iteritems():
        setattr(obj, k, v)

update_vars(obj,
    a=b,
    c=d)
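
For contrast, a tiny sketch of the circumvention described above -- the
Account class is made up purely for illustration; writing to the instance
dict skips a property's setter, while setattr() runs it:

    class Account(object):
        def __init__(self):
            self._balance = 0
        @property
        def balance(self):
            return self._balance
        @balance.setter
        def balance(self, value):
            if value < 0:
                raise ValueError("negative balance")
            self._balance = value

    a = Account()
    setattr(a, "balance", 10)    # runs the setter, validation applies
    vars(a).update(balance=-5)   # skips the setter: no validation, the value
                                 # just lands in a.__dict__
    print(a.balance)             # -> 10; the property still shadows the entry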

-- Devin


From masklinn at masklinn.net  Tue Nov  6 09:49:36 2012
From: masklinn at masklinn.net (Masklinn)
Date: Tue, 6 Nov 2012 09:49:36 +0100
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <50983F88.9010105@pearwood.info>
References: <20121105192841.GA26572@untibox> <50983F88.9010105@pearwood.info>
Message-ID: <065886AB-E750-402A-8C6E-A2838CE21888@masklinn.net>


On 2012-11-05, at 23:36 , Steven D'Aprano wrote:

> On 06/11/12 06:28, Markus Unterwaditzer wrote:
> 
>> I thought it would be neat if i could do::
>> 
>>     in obj:
>>         first_attr = "value"
>>         second_attr = "value2"
>> 
>>     some_other = "lel"  # indenting this would cause it to appear as an attribute of obj
>> 
>> Just a vague idea. Tell me what you think.
> 
> 
> This is Pascal's old "with" block. It works well for static languages like
> Pascal, where the compiler can tell ahead of time which names belong to what,
> but less well for dynamic languages like Python. Non-trivial examples of this
> design feature will be ambiguous and error-prone.

A possible alternative (though I'm not sure how integration in Python
would work, syntactically) which might be a better fit for dynamically
typed languages would be Smalltalk's "message cascading" with allowances
for assignment.

Using `|` as the cascading operator (as Smalltalk's ";" is already used
in Python) for the example,

    obj| first_attr = "value"
       | second_attr = "value2"
       | some_method()
       | some_other_method(2)

this would essentially desugar to:

    obj.first_attr = "value"
    obj.second_attr = "value2"
    obj.some_method()
    obj.some_other_method(2)

but as a single expression, and returning the value of the last call. I
see most of the value of cascading in the "chaining" of method calls
which can't be chained, but extending it to attribute access (get/set)
could be neat-ish.

Of course this might also need a new self/yourself attribute on
``object`` to top up the cascade.
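
For comparison, a cascade-like helper can be approximated today without new
syntax; this is only a rough sketch, and the name ``cascade`` plus its
argument layout are invented for illustration:

    def cascade(obj, attrs=(), calls=()):
        """Set attributes and call methods on obj in order, returning
        the result of the last call (or None)."""
        result = None
        for name, value in attrs:
            setattr(obj, name, value)           # honours descriptors/properties
        for name, args in calls:
            result = getattr(obj, name)(*args)  # plain method dispatch
        return result

    # hypothetical usage, mirroring the desugared form above:
    # cascade(obj,
    #         attrs=[("first_attr", "value"), ("second_attr", "value2")],
    #         calls=[("some_method", ()), ("some_other_method", (2,))])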

From jsbueno at python.org.br  Tue Nov  6 11:18:11 2012
From: jsbueno at python.org.br (Joao S. O. Bueno)
Date: Tue, 6 Nov 2012 08:18:11 -0200
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <065886AB-E750-402A-8C6E-A2838CE21888@masklinn.net>
References: <20121105192841.GA26572@untibox> <50983F88.9010105@pearwood.info>
	<065886AB-E750-402A-8C6E-A2838CE21888@masklinn.net>
Message-ID: <CAH0mxTTQ7WkXdr28+gA-xm2Dq3WvfSqESU0cpQfVrThv1hMRaw@mail.gmail.com>

On 6 November 2012 06:49, Masklinn <masklinn at masklinn.net> wrote:
> Using `|` as the cascading operator (as Smalltalk's ";" is already used
> in Python) for the example,

Just for the record, so is "|"


btw, I am -1 on the whole idea - it is trivial to implement through
various 2-3 line snippets, as shown along the thread - it makes any
case beyond the absolutely trivial impossible to write, and it
complicates readability.

I would suggest people liking the idea to get a programming editor with
a "copy word above" feature or plug-in instead.

  js
 -><-


From storchaka at gmail.com  Tue Nov  6 16:27:28 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Tue, 06 Nov 2012 17:27:28 +0200
Subject: [Python-ideas] os.path.commonpath()
Message-ID: <k7ba97$n5f$1@ger.gmane.org>

See http://bugs.python.org/issue10395.

os.path.commonpath() should be a function which returns right longest common sub-path for specified paths (os.path.commonprefix() is completely useless for this).

There are some open questions about details of *right* behavior.



What should be a common prefix of '/var/log/apache2' and
 '/var//log/mysql'?
What should be a common prefix of '/usr' and '//usr'?
What should be a common prefix of '/usr/local/' and '/usr/local/'?
What should be a common prefix of '/usr/local/' and '/usr/local/bin'?
What should be a common prefix of '/usr/bin/..' and '/usr/bin'?

Please, those who are interested in this feature, give consistent answers to these questions.



From ben at bendarnell.com  Tue Nov  6 16:41:20 2012
From: ben at bendarnell.com (Ben Darnell)
Date: Tue, 6 Nov 2012 07:41:20 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <509813E0.3080600@rushing.nightmare.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
	<CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
	<509813E0.3080600@rushing.nightmare.com>
Message-ID: <CAFkYKJ6v0OcdX_RfxL3WK=0Vx_Fam94qHnCufmRJARQWC0QxYg@mail.gmail.com>

On Mon, Nov 5, 2012 at 11:30 AM, Sam Rushing <
sam-pydeas at rushing.nightmare.com> wrote:

> On 11/4/12 8:11 AM, Ben Darnell wrote:
> >
> > The extra system calls add up.  The interface of Tornado's IOLoop was
> > based on epoll (where the internal state is roughly a mapping {fd:
> > event_set}), so it requires more register/unregister operations when
> > running on kqueue (where the internal state is roughly a set of (fd,
> > event) pairs).  This shows up in benchmarks of the HTTPServer; it's
> > faster on platforms with epoll than platforms with kqueue.  In
> > low-concurrency scenarios it's actually faster to use select() even
> > when kqueue is available (or maybe that's a mac-specific quirk).
> >
> >
> Just so I have this right, you're saying that HTTPServer is slower on
> kqueue because of the IOLoop design, yes?
>

Yes.  When the server processes a request and switches from listening for
readability to listening for writability, with epoll it's one call directly
into the C module to set the event mask for the socket.  With kqueue
something in the IOLoop must store the previous state and generate the two
separate actions to remove the read listener and add a write listener.  I
misspoke when I mentioned system call; the difference is actually the
amount of python code that must be run to call the right C functions.  This
would get a lot better if more of the IOLoop were written in C.
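
For concreteness, a minimal sketch of the difference using the standard
select module (platform-dependent: epoll exists only on Linux, kqueue only
on BSD/OS X, and the throwaway socket is just there to provide a valid fd):

    import select, socket

    sock = socket.socket()
    fd = sock.fileno()

    if hasattr(select, "epoll"):
        # epoll: switching interest from "readable" to "writable" is a
        # single state change on the registered fd.
        ep = select.epoll()
        ep.register(fd, select.EPOLLIN)
        ep.modify(fd, select.EPOLLOUT)
        ep.close()
    elif hasattr(select, "kqueue"):
        # kqueue: the same switch needs two change records (drop the read
        # filter, add the write filter), which the loop has to derive from
        # the previous state it remembered for this fd.
        kq = select.kqueue()
        kq.control([select.kevent(fd, select.KQ_FILTER_READ,
                                  select.KQ_EV_ADD)], 0)
        kq.control([select.kevent(fd, select.KQ_FILTER_READ,
                                  select.KQ_EV_DELETE),
                    select.kevent(fd, select.KQ_FILTER_WRITE,
                                  select.KQ_EV_ADD)], 0)
        kq.close()

    sock.close()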


>
> I've just looked over the epoll interface and I see at least one huge
> difference compared to kqueue: it requires a system call for each fd
> registration event.  With kevent() you can accumulate thousands of
> registrations, shove them into a single kevent() call and get thousands
> of events out.  It's a little all-singing-all-dancing, but it's hard to
> imagine a way to do it using fewer system calls. 8^)
>

True, although whenever I've tried to be clever and batch up kevent calls I
haven't gotten the performance I'd hoped for because system calls aren't
actually that expensive in comparison to python opcodes.  Also at least
some versions of Mac OS have a bug where you can only pass one event at a
time.

-Ben


>
> -Sam
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/a8f6db51/attachment.html>

From ronaldoussoren at mac.com  Tue Nov  6 16:49:42 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 06 Nov 2012 16:49:42 +0100
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <k7ba97$n5f$1@ger.gmane.org>
References: <k7ba97$n5f$1@ger.gmane.org>
Message-ID: <6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>


On 6 Nov, 2012, at 16:27, Serhiy Storchaka <storchaka at gmail.com> wrote:

> See http://bugs.python.org/issue10395.
> 
> os.path.commonpath() should be a function which returns right longest common sub-path for specified paths (os.path.commonprefix() is completely useless for this).
> 
> There are some open questions about details of *right* behavior.
> 
> 
> 
> What should be a common prefix of '/var/log/apache2' and
> '/var//log/mysql'?

/var/log

> What should be a common prefix of '/usr' and '//usr'?

/usr

> What should be a common prefix of '/usr/local/' and '/usr/local/'?

/usr/local

> What should be a common prefix of '/usr/local/' and '/usr/local/bin'?

/usr/local

> What should be a common prefix of '/usr/bin/..' and '/usr/bin'?

/usr/bin

In all cases the path is first split into its elements, then the largest common prefix of the two sequences of elements is calculated, and then the elements are joined back up again.
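
To make that concrete, here is a rough, POSIX-only sketch
(commonpath_sketch is an invented name, not the proposed API) that
reproduces the answers above:

    def commonpath_sketch(p1, p2):
        # split into elements, dropping empty parts (doubled slashes) and '.'
        parts1 = [p for p in p1.split('/') if p not in ('', '.')]
        parts2 = [p for p in p2.split('/') if p not in ('', '.')]
        common = []
        for a, b in zip(parts1, parts2):
            if a != b:
                break
            common.append(a)
        prefix = '/' if p1.startswith('/') else ''
        return prefix + '/'.join(common)

    # commonpath_sketch('/var/log/apache2', '/var//log/mysql') -> '/var/log'
    # commonpath_sketch('/usr', '/var')                        -> '/'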

Some cases you don't mention:

* Relative paths that don't share a prefix should raise an exception
* On windows two paths that don't have the same drive should raise an exception

The alternative is to return some arbitrary value (like None) that you have to test for, which would IMHO make it too easy to accidently pass an useless value to some other API and get a confusing exeption later on.

> 
> Please, those who are interested in this feature, give consistent answers to these questions.

Ronald



From sam-pydeas at rushing.nightmare.com  Tue Nov  6 20:43:57 2012
From: sam-pydeas at rushing.nightmare.com (Sam Rushing)
Date: Tue, 06 Nov 2012 11:43:57 -0800
Subject: [Python-ideas] The async API of the future
In-Reply-To: <CAFkYKJ6v0OcdX_RfxL3WK=0Vx_Fam94qHnCufmRJARQWC0QxYg@mail.gmail.com>
References: <CAP7+vJLzct4p_SHyMHPc6C0aDE=-zbHw-L6F9502xi8zfGpj9w@mail.gmail.com>
	<2CEFACA8-FB96-4C17-9D14-CADEE217F662@molden.no>
	<20121102231417.12407875@pitrou.net>
	<49169B74-5776-4A0C-BD0B-07B7D18C77F6@molden.no>
	<k71om1$4uq$1@ger.gmane.org>
	<FD48777E-56BC-47D7-8BB5-64D6D74D7128@molden.no>
	<k72umn$qc0$1@ger.gmane.org>
	<6841C4C6-B6B2-44A9-A773-637EE2839CDF@molden.no>
	<5095AD95.4010509@canterbury.ac.nz>
	<CAP7+vJ+-BAFVV677jO_sWoynPd9opOPz9NC+c1CdEeis4AdKKg@mail.gmail.com>
	<CAFkYKJ5bm5a4_p+5SYiN-GObwSu9Tn3zai6hnpP6YBehwcsgbQ@mail.gmail.com>
	<509813E0.3080600@rushing.nightmare.com>
	<CAFkYKJ6v0OcdX_RfxL3WK=0Vx_Fam94qHnCufmRJARQWC0QxYg@mail.gmail.com>
Message-ID: <5099687D.8040707@rushing.nightmare.com>

On 11/6/12 7:41 AM, Ben Darnell wrote:
> Yes.  When the server processes a request and switches from listening
> for readability to listening for writability, with epoll it's one call
> directly into the C module to set the event mask for the socket.  With
> kqueue something in the IOLoop must store the previous state and
> generate the two separate actions to remove the read listener and add
> a write listener.

Does that mean you're not using EV_ONESHOT?

>  I misspoke when I mentioned system call; the difference is actually
> the amount of python code that must be run to call the right C
> functions.  This would get a lot better if more of the IOLoop were
> written in C.

That's what we did with shrapnel, though we split the difference and
wrote everything in Pyrex.

> True, although whenever I've tried to be clever and batch up kevent
> calls I haven't gotten the performance I'd hoped for because system
> calls aren't actually that expensive in comparison to python opcodes.
>

And yeah, of course all this is dominated by time in the python VM...
Also, you still have to execute all the read/write system calls, so it
only cuts it in half.

-Sam



From ironfroggy at gmail.com  Tue Nov  6 23:17:43 2012
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Tue, 6 Nov 2012 17:17:43 -0500
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <20121105192841.GA26572@untibox>
References: <20121105192841.GA26572@untibox>
Message-ID: <CAGaVwhSbLYhKK7kpTooHF24foC-wNKGwPMnTSu29ThVmRND0_g@mail.gmail.com>

On Mon, Nov 5, 2012 at 2:28 PM, Markus Unterwaditzer <
markus at unterwaditzer.net> wrote:

> This is my first post on this mailinglist and i haven't lurked a long time
> yet,
> so please be gentle.
>
> While mocking objects, i got annoyed by the following code pattern i had to
> use when modifying multiple attributes on a single object::
>
>     obj.first_attr = "value"
>     obj.second_attr = "value2"
>
>     some_other = "lel"
>
> I thought it would be neat if i could do::
>
>     in obj:
>         first_attr = "value"
>         second_attr = "value2"
>
>     some_other = "lel"  # indenting this would cause it to appear as an
> attribute of obj
>

in obj_1:
    in obj_2:
        a = b

Tell me what this does, then.


> Just a vague idea. Tell me what you think.
>
> -- Markus
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>



-- 
Read my blog! I depend on your acceptance of my opinion! I am interesting!
http://techblog.ironfroggy.com/
Follow me if you're into that sort of thing:
http://www.twitter.com/ironfroggy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/81b3ac85/attachment.html>

From eliben at gmail.com  Wed Nov  7 02:01:50 2012
From: eliben at gmail.com (Eli Bendersky)
Date: Tue, 6 Nov 2012 17:01:50 -0800
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
Message-ID: <CAF-Rda_o951BPup63auxSeNoqzvTsVFUz89s2uAD2xC=6+twyg@mail.gmail.com>

> On 6 Nov, 2012, at 16:27, Serhiy Storchaka <storchaka at gmail.com> wrote:
>
> > See http://bugs.python.org/issue10395.
> >
> > os.path.commonpath() should be a function which returns right longest
> common sub-path for specified paths (os.path.commonprefix() is completely
> useless for this).
> >
> > There are some open questions about details of *right* behavior.
> >
> >
> >
> > What should be a common prefix of '/var/log/apache2' and
> > '/var//log/mysql'?
>
> /var/log
>
> > What should be a common prefix of '/usr' and '//usr'?
>
> /usr
>
> > What should be a common prefix of '/usr/local/' and '/usr/local/'?
>
> /usr/local
>
> > What should be a common prefix of '/usr/local/' and '/usr/local/bin'?
>
> /usr/local
>
> > What should be a common prefix of '/usr/bin/..' and '/usr/bin'?
>
> /usr/bin
>
> In all cases the path is first split into its elements, then calculate the
> largest common prefix of the two sets of elements, then join the elements
> back up again.
>

+1

Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/728496a4/attachment.html>

From jsbueno at python.org.br  Wed Nov  7 03:00:12 2012
From: jsbueno at python.org.br (Joao S. O. Bueno)
Date: Wed, 7 Nov 2012 00:00:12 -0200
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <CAGaVwhSbLYhKK7kpTooHF24foC-wNKGwPMnTSu29ThVmRND0_g@mail.gmail.com>
References: <20121105192841.GA26572@untibox>
	<CAGaVwhSbLYhKK7kpTooHF24foC-wNKGwPMnTSu29ThVmRND0_g@mail.gmail.com>
Message-ID: <CAH0mxTRfVoaCQ+B8V3C4YRcG0oQFA7_Mbq=CTV8PR=tyqXQR3w@mail.gmail.com>

On 6 November 2012 20:17, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> in obj_1:
>     in obj_2:
>         a = b
>
> Tell me what this does, then.
>
I think it is quite clear the snippet above should be equivalent to:

from itertools import product
from os import fork

for combination in product(obj1, obj2):
    pid = fork()
    if pid == 0:
        setattr(combination[0], "a", getattr(combination[1], "b"))
        break
-------
Actually near equivalent - it would be more proper if we had a fork
variant where each subproccess would
run in a different parallel universe. Maybe when Python 5 be
overhauled to be optimized for quantum computing we can get close
enough.

  js
 -><-


From bruce at leapyear.org  Wed Nov  7 03:05:30 2012
From: bruce at leapyear.org (Bruce Leban)
Date: Tue, 6 Nov 2012 18:05:30 -0800
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
Message-ID: <CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>

It would be nice if in conjunction with this os.path.commonprefix is
renamed as string.commonprefix with the os.path.commonprefix kept for
backwards compatibility (and deprecated).

more inline

On Tue, Nov 6, 2012 at 7:49 AM, Ronald Oussoren <ronaldoussoren at mac.com>wrote:

>
> On 6 Nov, 2012, at 16:27, Serhiy Storchaka <storchaka at gmail.com> wrote:
> > What should be a common prefix of '/var/log/apache2'
> and '/var//log/mysql'?
> /var/log
>
> > What should be a common prefix of '/usr' and '//usr'?
> /usr
>
> > What should be a common prefix of '/usr/local/' and '/usr/local/'?
> /usr/local
>
It appears that you want the result to never include a trailing /.
However, you've left out one key test case:

What is commonpath('/usr', '/var')?

It seems to me that the only reasonable value is '/'.

If you change the semantics so that it either (1) always includes a trailing
/ or (2) includes a trailing slash if the two paths have it in common, then
you don't have the weirdness that in this case it returns a slash and in
others it doesn't. I am slightly inclined towards (1) at this point.

It would also be a bit surprising that there are cases where
commonpath(a,a) != a.



>  > What should be a common prefix of '/usr/local/' and '/usr/local/bin'?
> /usr/local
>
> > What should be a common prefix of '/usr/bin/..' and '/usr/bin'?
> /usr/bin
>

seems better than the alternative of interpreting the '..'.

>
> * Relative paths that don't share a prefix should raise an exception
>

Why? Why is an empty path not a reasonable result?


> * On windows two paths that don't have the same drive should raise an
> exception
>

I disagree. On unix systems, should two paths that don't have the same
drive also raise an exception? What if I'm using this function on windows
to compare two http paths or two paths to a remote unix system? Raising an
exception in either case would be wrong.


> The alternative is to return some arbitrary value (like None) that you
> have to test for, which would IMHO make it too easy to accidently pass an
> useless value to some other API and get a confusing exeption later on.
>

Yes, don't return a useless value. An empty string is useful in the
relative path case and '/' is useful in the non-relative but paths don't
have common prefix at all case.


--- Bruce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/29c7c97a/attachment.html>

From grosser.meister.morti at gmx.net  Wed Nov  7 04:45:46 2012
From: grosser.meister.morti at gmx.net (=?ISO-8859-1?Q?Mathias_Panzenb=F6ck?=)
Date: Wed, 07 Nov 2012 04:45:46 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
Message-ID: <5099D96A.2090602@gmx.net>

Ok, I've filed an issue in the Python bug tracker and attached a doc patch for the recipe:

http://bugs.python.org/issue16423

On 11/04/2012 09:28 AM, Paul Moore wrote:
> On Sunday, 4 November 2012, Mathias Panzenböck wrote:
>
>
>     Shouldn't there be *one* obvious way to do this? req.headers
>
>
> Well, I'd say that the stdlib docs imply that req.info <http://req.info> is the required way so
> that's the "one obvious way". If you want to add extra methods for convenience, fair enough, but
> code that doesn't already know it is handling a data URL can't use them so I don't see the point,
> personally.
>
> But others may have different views...
>
> Paul



From senthil at uthcode.com  Wed Nov  7 06:08:23 2012
From: senthil at uthcode.com (Senthil Kumaran)
Date: Tue, 6 Nov 2012 21:08:23 -0800
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <5099D96A.2090602@gmx.net>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
Message-ID: <CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>

I had not known about the 'data' URL scheme. Thanks for pointing it out
( http://tools.ietf.org/html/rfc2397 ) and for the documentation patch.
BTW, a documentation patch is easy to get in, but shouldn't support in a
more natural form -- where the data URL is parsed internally by the
module and the expected results are returned -- also be considered? That
could be targeted for 3.4, while the docs recipe serves for all the
other releases.
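
For concreteness, a minimal hand-rolled sketch of what such internal
parsing could look like (this is not the recipe attached to the issue;
the function name is made up, and percent-decoding of the non-base64
form is left out):

    import base64

    def parse_data_url(url):
        assert url.startswith("data:")
        header, _, payload = url[len("data:"):].partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload.encode("ascii"))
        return payload.encode("ascii")   # plain form, still percent-encoded

    # parse_data_url("data:text/plain;base64,SGVsbG8=") -> b'Hello'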


Thank you,
Senthil


On Tue, Nov 6, 2012 at 7:45 PM, Mathias Panzenböck
<grosser.meister.morti at gmx.net> wrote:
> Ok, I've written an issue in the python bug tracker and attached a doc patch
> for the recipe:
>
> http://bugs.python.org/issue16423
>
>
> On 11/04/2012 09:28 AM, Paul Moore wrote:
>>
>> On Sunday, 4 November 2012, Mathias Panzenböck wrote:
>>
>>
>>     Shouldn't there be *one* obvious way to do this? req.headers
>>
>>
>> Well, I'd say that the stdlib docs imply that req.info <http://req.info>
>> is the required way so
>>
>> that's the "one obvious way". If you want to add extra methods for
>> convenience, fair enough, but
>> code that doesn't already know it is handling a data URL can't use them so
>> I don't see the point,
>> personally.
>>
>> But others may have different views...
>>
>> Paul
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas


From greg.ewing at canterbury.ac.nz  Wed Nov  7 06:15:48 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 07 Nov 2012 18:15:48 +1300
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
Message-ID: <5099EE84.8000101@canterbury.ac.nz>

Bruce Leban wrote:

> If you change the semantics so that it either (1) it always always 
> includes a trailing / or (2) it includes a trailing slash if the two 
> paths have it in common, then you don't have the weirdness that in this 
> case it returns a slash and in others it doesn't. I am slightly inclined 
> to (1) at this point.

But then the common prefix of "/a/b" and "/a/c" would be "/a/",
which would be very unexpected -- usually the dirname of a path is
not considered to include a trailing slash.

The special treatment of the root directory is no weirder than it
is anywhere else. It's already special, since in unix it's the
only case where a trailing slash is semantically significant.
(To the kernel, at least -- a few command line utilities break this
rule, but they're screwy.)
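
(For reference, the dirname behaviour being appealed to, via posixpath so
it reads the same on any platform:)

    >>> import posixpath
    >>> posixpath.dirname('/a/b')
    '/a'
    >>> posixpath.dirname('/a')
    '/'
    >>> posixpath.dirname('/')
    '/'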

-- 
Greg


From aquavitae69 at gmail.com  Wed Nov  7 06:59:49 2012
From: aquavitae69 at gmail.com (David Townshend)
Date: Wed, 7 Nov 2012 07:59:49 +0200
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <5099EE84.8000101@canterbury.ac.nz>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
	<5099EE84.8000101@canterbury.ac.nz>
Message-ID: <CAEgL-ffBL=+hYh4P6EmvYF4UDhY6Zrc3fr1bcW5AqFwX-7FA9g@mail.gmail.com>

This seems to be overlapping quite a lot with the recent discussion on
object-oriented paths (
http://mail.python.org/pipermail/python-ideas/2012-October/016338.html) and
this question of how paths are represented on different systems was
discussed quite extensively.  I'm not sure where the thread left off, but
if PEP 428 is still going ahead then maybe this is something that should be
brought into it.

David


On Wed, Nov 7, 2012 at 7:15 AM, Greg Ewing <greg.ewing at canterbury.ac.nz>wrote:

> Bruce Leban wrote:
>
>  If you change the semantics so that it either (1) it always always
>> includes a trailing / or (2) it includes a trailing slash if the two paths
>> have it in common, then you don't have the weirdness that in this case it
>> returns a slash and in others it doesn't. I am slightly inclined to (1) at
>> this point.
>>
>
> But then the common prefix of "/a/b" and "/a/c" would be "/a/",
> which would be very unexpected -- usually the dirname of a path is
> not considered to include a trailing slash.
>
> The special treatment of the root directory is no weirder than it
> is anywhere else. It's already special, since in unix it's the
> only case where a trailing slash is semantically significant.
> (To the kernel, at least -- a few command line utilities break this
> rule, but they're screwy.)
>
> --
> Greg
>
> ______________________________**_________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/**mailman/listinfo/python-ideas<http://mail.python.org/mailman/listinfo/python-ideas>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121107/a76307b9/attachment.html>

From bruce at leapyear.org  Wed Nov  7 07:05:58 2012
From: bruce at leapyear.org (Bruce Leban)
Date: Tue, 6 Nov 2012 22:05:58 -0800
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <5099EE84.8000101@canterbury.ac.nz>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
	<5099EE84.8000101@canterbury.ac.nz>
Message-ID: <CAGu0Anur-dYiLdHLL=6_SwBdHkc=hZiwaJy-Wym__zuaf-9d0A@mail.gmail.com>

On Tue, Nov 6, 2012 at 9:15 PM, Greg Ewing <greg.ewing at canterbury.ac.nz>wrote:

> Bruce Leban wrote:
>
>  If you change the semantics so that it either (1) it always always
>> includes a trailing / or (2) it includes a trailing slash if the two paths
>> have it in common, then you don't have the weirdness that in this case it
>> returns a slash and in others it doesn't. I am slightly inclined to (1) at
>> this point.
>>
>
> But then the common prefix of "/a/b" and "/a/c" would be "/a/",
> which would be very unexpected -- usually the dirname of a path is
> not considered to include a trailing slash.
>

Although less confusing than the current behavior :-)

>
> The special treatment of the root directory is no weirder than it
> is anywhere else. It's already special, since in unix it's the
> only case where a trailing slash is semantically significant.
> (To the kernel, at least -- a few command line utilities break this
> rule, but they're screwy.)


That's reasonable. Perhaps it's sufficient to document it clearly.

--- Bruce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121106/3bf5ba5f/attachment.html>

From techtonik at gmail.com  Wed Nov  7 07:23:49 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 09:23:49 +0300
Subject: [Python-ideas] os.path.split(path, maxsplit=1)
In-Reply-To: <50983C09.7030802@pearwood.info>
References: <CAPkN8xLCog9ZT=_QaprLbo82x-QFEuayyWTqSyBVxZewN68kOA@mail.gmail.com>
	<CAPkN8xKT07Mm295cQ0k8+sc88hVDtpP3Kg1Mjb8f3KUzyo4GnQ@mail.gmail.com>
	<CAPkN8xK=YQQjrx1ZegHvnOfKj+FtYbWXQv80Sai=CLFjKhT_WA@mail.gmail.com>
	<5097B69C.5080202@nedbatchelder.com> <50983C09.7030802@pearwood.info>
Message-ID: <CAPkN8x+CmhouxKX0Hv+WndjmOcT5AHZ62e3va-sh7pMAzVyUpQ@mail.gmail.com>

On Tue, Nov 6, 2012 at 1:22 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On 05/11/12 23:52, Ned Batchelder wrote:
>
>> Anatoly, I appreciate the energy and dedication you've shown to the
>> Python community, but maybe you should spend a little more time on
>> each proposal? For example, the subject line here is a different (and
>> already taken) function name than the implementation, and has a
>> maxsplit argument that the implementation doesn't have.

It's the idea of adding a maxsplit argument to os.path.split(). If the idea
is good, it will be developed into an actual proposal.

I've included prototype code, because in the past people complained
about the absence of source code. The name of the prototype function is
different, because it uses os.path.split internally, which would clash.
Here is the working prototype; attached is a version with the test case
from SO. Note that it behaves differently on Windows with Python 3
because of the regression http://bugs.python.org/issue16424

import os

def pathsplit(pathstr, maxsplit=None):
    """split relative path into a list of components"""
    path = [pathstr]
    while True:
        oldpath = path[:]
        # split the leftmost element into (head, tail) and splice it back in
        path[:1] = list(os.path.split(path[0]))
        if path[0] == '':
            path = path[1:]
        elif path[1] == '':
            path = path[:1] + path[2:]
        if path == oldpath:
            return path
        if maxsplit is not None and len(path) > maxsplit:
            return path
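
A short hypothetical session showing the intended behaviour (POSIX
separators assumed):

    >>> pathsplit('a/b/c')
    ['a', 'b', 'c']
    >>> pathsplit('a/b/c', maxsplit=1)
    ['a/b', 'c']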
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pathsplit.py
Type: application/octet-stream
Size: 1144 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121107/cdc131a3/attachment.obj>

From techtonik at gmail.com  Wed Nov  7 07:46:30 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 09:46:30 +0300
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
Message-ID: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>

On Tue, Nov 6, 2012 at 1:22 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On 05/11/12 23:52, Ned Batchelder wrote:
>>
>> Get everything the way you want it, and then propose it.
>
>
> +1
>
> Also consider publishing it as a recipe on ActiveState, where many
> people will view it, use it, and offer feedback. This has many
> benefits:
>
> * You will gauge community interest;
>
> * Many eyeballs make bugs shallow;
>
> * You are providing a useful recipe that others can use, even
>   if it doesn't get included in the std lib.
>
> Some of the most useful parts of the std lib, like namedtuple,
> started life on ActiveState.
>
> http://code.activestate.com/recipes/langs/python/

Why I don't use ActiveState:

1. StackOverflow is much easier to access - just one click to log in
with a Google Account versus several clicks, data entry and copy/paste
operations to recover the password on ActiveState - I want to log in
there with a Python account
2. StackOverflow is problem-search oriented - not recipe-catalog
oriented, which makes it better for solving problems, which I do more
often than reading the recipe book (although I must admit that when I
was starting Python, the Cookbook from O'Reilly in CHM format was mega
awesome)
3. I post code as gists, since they include the notion of history,
unlike ActiveState, whose interface looks a little outdated - it was
not obvious to me that recipes have history until today
4. Recipes are licensed, which is too much of a burden for a snippet
5. The ActiveState site makes it clear that it is an ActiveState site -
20% of my screen is taken by the ActiveState header, so it looks like a
company site - not a site for the community

Otherwise the idea of a community recipe site is very nice.


From techtonik at gmail.com  Wed Nov  7 08:14:01 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 10:14:01 +0300
Subject: [Python-ideas] Code snippet signature (Was: Publishing ideas on
	ActiveState recipe site)
Message-ID: <CAPkN8xKfmf_FMt3NeNW3+dz+Yigiv5k97XF7sUzz6o89UD5pfg@mail.gmail.com>

Speaking about snippets in the thread about the ActiveState recipe site
reminded me of this idea.

A snippet signature is a checksum over normalized source code with
comments stripped. The signature looks like `# snippet:site:hash`,
added before the function as a comment. Normalized means that the code
is unindented and PEPified. Comments are stripped because they may be
customized locally.
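
A very rough sketch of how such a signature could be computed (the
function name, the normalization steps and the hash choice are all
illustrative assumptions; real "PEPification" would need a formatter):

    import hashlib
    import io
    import textwrap
    import tokenize

    def snippet_signature(source, site="example.org"):
        source = textwrap.dedent(source)
        if not source.endswith("\n"):
            source += "\n"
        keep = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            tok_type, tok_string = tok[0], tok[1]
            if tok_type in (tokenize.COMMENT, tokenize.NL):
                continue            # drop comments and blank-line tokens
            keep.append(tok_string)
        digest = hashlib.sha1(" ".join(keep).encode("utf-8")).hexdigest()
        return "# snippet:%s:%s" % (site, digest[:12])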

Signatures are useful to:
1. identify pieces of code repeatedly used in Python apps for potential
inclusion in the Python library
2. measure popularity of snippets using automated code analysis
3. automatically check for updates by looking up the given hash
4. automatically check if local code has improvements over the
original snippet by recalculating and comparing the hash
5. look up whether snippets were integrated into the standard (or some
other) library
6. compile a Python library mostly from reused snippets on the server side

--
anatoly t.


On Wed, Nov 7, 2012 at 9:46 AM, anatoly techtonik <techtonik at gmail.com> wrote:
> On Tue, Nov 6, 2012 at 1:22 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>> On 05/11/12 23:52, Ned Batchelder wrote:
>>>
>>> Get everything the way you want it, and then propose it.
>>
>>
>> +1
>>
>> Also consider publishing it as a recipe on ActiveState, where many
>> people will view it, use it, and offer feedback. This has many
>> benefits:
>>
>> * You will gauge community interest;
>>
>> * Many eyeballs make bugs shallow;
>>
>> * You are providing a useful recipe that others can use, even
>>   if it doesn't get included in the std lib.
>>
>> Some of the most useful parts of the std lib, like namedtuple,
>> started life on ActiveState.
>>
>> http://code.activestate.com/recipes/langs/python/
>
> Why I don't use ActiveState:
>
> 1. StackOverflow is much easier to access - just one click to login
> with Google Account versus several clicks, data entry and copy/paste
> operations to remind the password on ActiveState - I want to login
> there with Python account
> 2. StackOverflow is problem search oriented - not recipe catalog
> oriented, which makes it better for solving problems, which I do more
> often than reading the recipe book (although I must admin when I was
> starting Python - the Cookbook from O'Reilly in CHM format was mega
> awesome)
> 3. I post the code as gists as it includes the notion of history,
> unlike ActiveState, which interface looks a little outdated - it was
> not obvious for me that recipes have history until today
> 4. Recipes are licensed, which is a too much of a burden for a snippet
> 5. ActiveState site makes it clear that it is ActiveState site - the
> 20% of my screen is taken by ActiveState header, so it looks like
> company site - not a site for community
>
> Otherwise the idea of community recipe site is very nice.


From rosuav at gmail.com  Wed Nov  7 08:19:10 2012
From: rosuav at gmail.com (Chris Angelico)
Date: Wed, 7 Nov 2012 18:19:10 +1100
Subject: [Python-ideas] The "in"-statement
In-Reply-To: <CAH0mxTRfVoaCQ+B8V3C4YRcG0oQFA7_Mbq=CTV8PR=tyqXQR3w@mail.gmail.com>
References: <20121105192841.GA26572@untibox>
	<CAGaVwhSbLYhKK7kpTooHF24foC-wNKGwPMnTSu29ThVmRND0_g@mail.gmail.com>
	<CAH0mxTRfVoaCQ+B8V3C4YRcG0oQFA7_Mbq=CTV8PR=tyqXQR3w@mail.gmail.com>
Message-ID: <CAPTjJmrE9DaFje7M4=4DTNfxvXSnsd+y9VvRGet8r3pKgsNndQ@mail.gmail.com>

On Wed, Nov 7, 2012 at 1:00 PM, Joao S. O. Bueno <jsbueno at python.org.br> wrote:
> Actually near equivalent - it would be more proper if we had a fork
> variant where each subproccess would
> run in a different parallel universe. Maybe when Python 5 be
> overhauled to be optimized for quantum computing we can get close
> enough.

TBH, there's no reason not to implement that with today's Python. The
specifications, like RFC 2795, do not preclude implementations
involving multiple universes and/or subatomic monkeys.

ChrisA


From ronaldoussoren at mac.com  Wed Nov  7 08:22:40 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Wed, 07 Nov 2012 08:22:40 +0100
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
Message-ID: <FBA7DA6E-F1FE-4156-AAB9-2A80358EF59E@mac.com>


On 7 Nov, 2012, at 3:05, Bruce Leban <bruce at leapyear.org> wrote:

> It would be nice if in conjunction with this os.path.commonprefix is renamed as string.commonprefix with the os.path.commonprefix kept for backwards compatibility (and deprecated).
> 
> more inline
> 
> On Tue, Nov 6, 2012 at 7:49 AM, Ronald Oussoren <ronaldoussoren at mac.com> wrote:
> 
> On 6 Nov, 2012, at 16:27, Serhiy Storchaka <storchaka at gmail.com> wrote:
> > What should be a common prefix of '/var/log/apache2' and '/var//log/mysql'?
> /var/log
> 
> > What should be a common prefix of '/usr' and '//usr'?
> /usr
> 
> > What should be a common prefix of '/usr/local/' and '/usr/local/'?
> /usr/local
> 
> It appears that you want the result to never include a trailing /. However, you've left out one key test case:
> 
> What is commonpath('/usr', '/var')?
> 
> It seems to me that the only reasonable value is '/'.

I agree

> 
> If you change the semantics so that it either (1) it always always includes a trailing / or (2) it includes a trailing slash if the two paths have it in common, then you don't have the weirdness that in this case it returns a slash and in others it doesn't. I am slightly inclined to (1) at this point.

I'd prefer to only have a path separator at the end when it has semantic meaning. That would mean that only the root of a filesystem tree ("/" on Unix, but also "C:\" and "\\server\share\" on Windows) has a separator at the end.

> 
> It would also be a bit surprising that there are cases where commonpath(a,a) != a.

That's already true, commonpath('/usr//bin', '/usr//bin') would be  '/usr/bin' and not '/usr//bin'.

> 
>  
> > What should be a common prefix of '/usr/local/' and '/usr/local/bin'?
> /usr/local
> 
> > What should be a common prefix of '/usr/bin/..' and '/usr/bin'?
> /usr/bin
> 
> seems better than the alternative of interpreting the '..'.

That was the hard choice in the list. My reason for picking this result is that interpreting '..' can change the meaning of a path when dealing with symbolic links and therefore would make the function less useful (and you can always call os.path.normpath when you do want to interpret '..').

Stripping '.' elements would be fine, e.g. commonpath('/usr/./bin/ls', '/usr/bin/sh') could be '/usr/bin'. 

> 
> * Relative paths that don't share a prefix should raise an exception
> 
> Why? Why is an empty path not a reasonable result?

An empty string is not a valid path.  Now that I reconsider this question: "." would be a valid path, and would have a sane meaning.

>  
> * On windows two paths that don't have the same drive should raise an exception
> 
> I disagree. On unix systems, should two paths that don't have the same drive also raise an exception? What if I'm using this function on windows to compare two http paths or two paths to a remote unix system? Raising an exception in either case would be wrong.

The paths in URLs don't have a drive, hence both URL paths would have the "same" drive.  More importantly: posixpath.commonpath would be better for comparing two http or remote unix paths, as that function uses the correct separator (ntpath.commonpath uses a backslash as separator).

Also: when two paths have a different drive letter or UNC share name there is no way to have a value for the prefix that allows for the construction of a path from the common prefix to one of those paths.

That is,

     path1 = "c:\windows"
     path2 = "d:\data"

     pfx = commonpath(path1, path2)

The only value of pfx that would result in there being a value of 'sfx' such that os.path.join(pfx, sfx) == path1 is the empty string, but that value does not refer to a filesystem location.  That means you have to explicitly test if commonpath returns the empty string, because you likely have to behave differently when there is no shared prefix. I'd then prefer if commonpath raises an exception, because it would be too easy to forget to check for this (especially when developing on a unix-based platform and later porting to windows).  An exception would mean the code blows up, instead of giving unexpected results (leading to questions like "Why is your program writing junk in my home directory?").

> 
> 
> The alternative is to return some arbitrary value (like None) that you have to test for, which would IMHO make it too easy to accidently pass an useless value to some other API and get a confusing exeption later on.
> 
> Yes, don't return a useless value. An empty string is useful in the relative path case and '/' is useful in the non-relative but paths don't have common prefix at all case. 

"/" *is* the common prefix for absolute paths on Unix that don't share any path elements.  As mentioned above "." (or rather os.path.curdir) would be a sane result for relative paths.

Ronald 
> 
> 
> --- Bruce

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121107/d8b2fed9/attachment.html>

From ubershmekel at gmail.com  Wed Nov  7 09:03:27 2012
From: ubershmekel at gmail.com (Yuval Greenfield)
Date: Wed, 7 Nov 2012 10:03:27 +0200
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
Message-ID: <CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>

On Wed, Nov 7, 2012 at 8:46 AM, anatoly techtonik <techtonik at gmail.com>wrote:

> On Tue, Nov 6, 2012 at 1:22 AM, Steven D'Aprano <steve at pearwood.info>
> wrote:
> > On 05/11/12 23:52, Ned Batchelder wrote:
> >>
> >> Get everything the way you want it, and then propose it.
> >
> >
> > +1
> >
> > Also consider publishing it as a recipe on ActiveState, where many
> > people will view it, use it, and offer feedback. This has many
> > benefits:
> >
> > * You will gauge community interest;
> >
> > * Many eyeballs make bugs shallow;
> >
> > * You are providing a useful recipe that others can use, even
> >   if it doesn't get included in the std lib.
> >
> > Some of the most useful parts of the std lib, like namedtuple,
> > started life on ActiveState.
> >
> > http://code.activestate.com/recipes/langs/python/
>
> Why I don't use ActiveState:
>
> 1. StackOverflow is much easier to access - just one click to login
> with Google Account versus several clicks, data entry and copy/paste
> operations to remind the password on ActiveState - I want to login
> there with Python account
> 2. StackOverflow is problem search oriented - not recipe catalog
> oriented, which makes it better for solving problems, which I do more
> often than reading the recipe book (although I must admin when I was
> starting Python - the Cookbook from O'Reilly in CHM format was mega
> awesome)
> 3. I post the code as gists as it includes the notion of history,
> unlike ActiveState, which interface looks a little outdated - it was
> not obvious for me that recipes have history until today
> 4. Recipes are licensed, which is a too much of a burden for a snippet
> 5. ActiveState site makes it clear that it is ActiveState site - the
> 20% of my screen is taken by ActiveState header, so it looks like
> company site - not a site for community
>
> Otherwise the idea of community recipe site is very nice.
>
>
https://gist.github.com/ works great too. But I believe we are a bit OT.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121107/dfd6b380/attachment.html>

From techtonik at gmail.com  Wed Nov  7 09:19:34 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 11:19:34 +0300
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
	<CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
Message-ID: <CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>

On Wed, Nov 7, 2012 at 11:03 AM, Yuval Greenfield <ubershmekel at gmail.com>wrote:

> On Wed, Nov 7, 2012 at 8:46 AM, anatoly techtonik <techtonik at gmail.com>wrote:
>
>> On Tue, Nov 6, 2012 at 1:22 AM, Steven D'Aprano <steve at pearwood.info>
>> wrote:
>> > On 05/11/12 23:52, Ned Batchelder wrote:
>> >>
>> >> Get everything the way you want it, and then propose it.
>> >
>> >
>> > +1
>> >
>> > Also consider publishing it as a recipe on ActiveState, where many
>> > people will view it, use it, and offer feedback. This has many
>> > benefits:
>> >
>> > * You will gauge community interest;
>> >
>> > * Many eyeballs make bugs shallow;
>> >
>> > * You are providing a useful recipe that others can use, even
>> >   if it doesn't get included in the std lib.
>> >
>> > Some of the most useful parts of the std lib, like namedtuple,
>> > started life on ActiveState.
>> >
>> > http://code.activestate.com/recipes/langs/python/
>>
>> Why I don't use ActiveState:
>>
>> 1. StackOverflow is much easier to access - just one click to login
>> with Google Account versus several clicks, data entry and copy/paste
>> operations to remind the password on ActiveState - I want to login
>> there with Python account
>> 2. StackOverflow is problem search oriented - not recipe catalog
>> oriented, which makes it better for solving problems, which I do more
>> often than reading the recipe book (although I must admin when I was
>> starting Python - the Cookbook from O'Reilly in CHM format was mega
>> awesome)
>> 3. I post the code as gists as it includes the notion of history,
>> unlike ActiveState, which interface looks a little outdated - it was
>> not obvious for me that recipes have history until today
>> 4. Recipes are licensed, which is a too much of a burden for a snippet
>> 5. ActiveState site makes it clear that it is ActiveState site - the
>> 20% of my screen is taken by ActiveState header, so it looks like
>> company site - not a site for community
>>
>> Otherwise the idea of community recipe site is very nice.
>>
>>
> https://gist.github.com/ works great too. But I believe we are a bit OT.
>

Yea. That's point no.3

We're not offtopic, because people are proposing to post ideas on
ActiveState first.

I also found that ActiveState site doesn't allow to release recipes into
Public Domain.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121107/1784f934/attachment.html>

From storchaka at gmail.com  Wed Nov  7 09:20:59 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Wed, 07 Nov 2012 10:20:59 +0200
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
Message-ID: <k7d5le$d47$1@ger.gmane.org>

On 06.11.12 17:49, Ronald Oussoren wrote:
> On 6 Nov, 2012, at 16:27, Serhiy Storchaka <storchaka at gmail.com> wrote:
>> There are some open questions about details of *right* behavior.

I only asked the questions for which there are different opinions or about which I myself have doubts.

>> What should be a common prefix of '/var/log/apache2' and
>> '/var//log/mysql'?
> /var/log

I think so too.

>> What should be a common prefix of '/usr' and '//usr'?
> /usr

normpath() preserves leading double slash (but not triple).  That's why I asked the question.
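
(The behaviour in question:)

    >>> import posixpath
    >>> posixpath.normpath('//usr')
    '//usr'
    >>> posixpath.normpath('///usr')
    '/usr'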

>> What should be a common prefix of '/usr/local/' and '/usr/local/'?
> /usr/local

os.path.split('/usr/local/') is ('/usr/local', '').  Repeated application of os.path.split() gives us ('/', 'usr', 'local', '').  That's why I assume that it may be appropriate here to preserve the trailing slash.  I'm not sure.

>> What should be a common prefix of '/usr/local/' and '/usr/local/bin'?
> /usr/local

Here the same considerations apply as for the previous question.  In any case a common prefix of '/usr/local/etc' and '/usr/local/bin' should be '/usr/local'.

> * Relative paths that don't share a prefix should raise an exception

I disagree.  A common prefix for relative paths on the same drive is the current directory on this drive (if we decide to drop '..').

> * On windows two paths that don't have the same drive should raise an exception
> The alternative is to return some arbitrary value (like None) that you have to test for, which would IMHO make it too easy to accidently pass an useless value to some other API and get a confusing exeption later on.

Maybe.  This should be the same result (None or an exception) as for an empty list or a mix of absolute and relative paths.

Thank you for your answers.
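
For concreteness, here is a rough POSIX-only sketch of the semantics discussed
above (purely illustrative - it sidesteps the open questions about '..', drive
letters, Windows separators and what to return when there is no common part):

def commonpath(paths):
    if not paths:
        raise ValueError("commonpath() arg is an empty sequence")
    isabs = {p.startswith('/') for p in paths}
    if len(isabs) != 1:
        raise ValueError("can't mix absolute and relative paths")
    # Drop empty components produced by repeated or trailing slashes,
    # so '/var//log/' is treated like '/var/log'.
    split = [[c for c in p.split('/') if c] for p in paths]
    prefix = []
    for components in zip(*split):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ('/' if isabs.pop() else '') + '/'.join(prefix)

# commonpath(['/var/log/apache2', '/var//log/mysql']) == '/var/log'
# commonpath(['/usr', '//usr']) == '/usr'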



From stefan_ml at behnel.de  Wed Nov  7 09:29:09 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 07 Nov 2012 09:29:09 +0100
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
	<CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
	<CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>
Message-ID: <k7d64j$gui$1@ger.gmane.org>

anatoly techtonik, 07.11.2012 09:19:
>>> 4. Recipes are licensed, which is too much of a burden for a snippet
> 
> I also found that the ActiveState site doesn't allow releasing recipes into
> the Public Domain.

Which is ok because "Public Domain" is not a universal concept. It won't
work in all countries where your recipes will be read and where people want
to use them. Better use a suitable license.

Stefan




From storchaka at gmail.com  Wed Nov  7 09:30:55 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Wed, 07 Nov 2012 10:30:55 +0200
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
Message-ID: <k7d684$hpt$1@ger.gmane.org>

On 07.11.12 04:05, Bruce Leban wrote:
> It would be nice if in conjunction with this os.path.commonprefix is
> renamed as string.commonprefix with the os.path.commonprefix kept for
> backwards compatibility (and deprecated).

Agree.
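
For anyone who hasn't hit it before, the reason commonprefix() is really a
string operation (a throwaway interpreter example, just for illustration):

>>> import os.path
>>> os.path.commonprefix(['/usr/local', '/usr/lib'])
'/usr/l'

It compares character by character, so the result need not end on a path
component boundary.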

> more inline

In most cases I agree with Greg and Ronald.




From storchaka at gmail.com  Wed Nov  7 09:51:46 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Wed, 07 Nov 2012 10:51:46 +0200
Subject: [Python-ideas] os.path.commonpath()
In-Reply-To: <FBA7DA6E-F1FE-4156-AAB9-2A80358EF59E@mac.com>
References: <k7ba97$n5f$1@ger.gmane.org>
	<6EEDCE1C-414A-4A8A-B535-38DF57655CBB@mac.com>
	<CAGu0AnvO2up8WmShg5ZtE9RKz0MjA=vOUKeb+oTkFX+_Q4+i5Q@mail.gmail.com>
	<FBA7DA6E-F1FE-4156-AAB9-2A80358EF59E@mac.com>
Message-ID: <k7d7f4$sjn$1@ger.gmane.org>

On 07.11.12 09:22, Ronald Oussoren wrote:
>> It would also be a bit surprising that there are cases where 
>> commonpath(a,a) != a.
> 
> That's already true, commonpath('/usr//bin', '/usr//bin') would be 
>   '/usr/bin' and not '/usr//bin'.

Yes, the current implementation does not preserve repeated slashes; this is an argument for the answer that commonpath(['/usr//bin', '/usr/bin']) should return '/usr/bin' and not '/usr'.

However, it would be a bit surprising that there are cases where commonpath([normpath(a), normpath(a)]) != normpath(a).

> Stripping '.' elements would be fine, e.g. commonpath('/usr/./bin/ls', 
> '/usr/bin/sh') could be '/usr/bin'.

Maybe.

> An empty string is not a valid path.  Now that I reconsider this 
> question: "." would be a valid path, and would have a sane meaning.

Looks reasonable, but I am not sure.  The returned value will most probably be used in join(), and this will add an unexpected './' at the start of the path.
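
For illustration, the join() behaviour in question (POSIX rules):

>>> import posixpath
>>> posixpath.join('.', 'usr/local')
'./usr/local'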




From jeanpierreda at gmail.com  Wed Nov  7 10:11:47 2012
From: jeanpierreda at gmail.com (Devin Jeanpierre)
Date: Wed, 7 Nov 2012 04:11:47 -0500
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <64FEC457-659B-47DB-BE0E-E830F61F4200@twistedmatrix.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<64FEC457-659B-47DB-BE0E-E830F61F4200@twistedmatrix.com>
Message-ID: <CABicbJL3gwrSyELRgKP8iVkkwCpYR4sxRBvHGLHZXtR6WwqtQA@mail.gmail.com>

It's been a week, and nobody has responded to Glyph's email. I don't
think I know enough to agree or disagree with what he said, but it was
well-written and it looked important. Also, Glyph has a lot of
experience with this sort of thing, and it would be a shame if he was
discouraged by the lack of response. We can't really expect people to
contribute if their opinions are ignored.

Can relevant people please take another look at his post?

-- Devin

On Wed, Oct 31, 2012 at 6:10 AM, Glyph <glyph at twistedmatrix.com> wrote:
> Finally getting around to this one...
>
> I am sorry if I'm repeating any criticism that has already been rehashed in
> this thread.  There is really a deluge of mail here and I can't keep up with
> it.  I've skimmed some of it and avoided or noted things that I did see
> mentioned, but I figured I should write up something before next week.
>
> To make a long story short, my main points here are:
>
> - I think tulip unfortunately has a lot of the problems I tried to describe
>   in earlier messages,
> - it would be really great if we could have a core I/O interface that we
>   could use for interoperability with Twisted before bolting a requirement
>   for coroutine trampolines on to everything,
> - twisted-style protocol/transport separation is really important and this
>   should not neglect it.  As I've tried to illustrate in previous messages,
>   an API where applications have to call send() or recv() is just not going
>   to behave intuitively in edge cases or perform well,
> - I know it's a prototype, but this isn't such an unexplored area that it
>   should be developed without TDD: all this code should both have tests and
>   provide testing support to show how applications that use it can be
>   tested,
> - the scheduler module needs some example implementation of something like
>   Twisted's gatherResults for me to critique its expressiveness; it looks
>   like it might be missing something in the area of one task coordinating
>   multiple others but I can't tell
>
>
> On Oct 28, 2012, at 4:52 PM, Guido van Rossum <guido at python.org> wrote:
>
> The pollster has a very simple API: add_reader(fd, callback, *args),
>
> add_writer(<ditto>), remove_reader(fd), remove_writer(fd), and
> poll(timeout) -> list of events. (fd means file descriptor.) There's
> also pollable() which just checks if there are any fds registered. My
> implementation requires fd to be an int, but that could easily be
> extended to support other types of event sources.
>
>
> I don't see how that is.  All of the mechanisms I would leverage within
> Twisted to support other event sources are missing (e.g.: abstract
> interfaces for those event sources).  Are you saying that a totally
> different pollster could just accept a different type to add_reader, and not
> an integer?  If so, how would application code know how to construct
> something else?
>
> I'm not super happy that I have parallel reader/writer APIs, but passing a
> separate read/write flag didn't come out any more elegant, and I don't
> foresee other operation types (though I may be wrong).
>
>
> add_reader and add_writer is an important internal layer of the API for
> UNIX-like operating systems, but the design here is fundamentally flawed in
> that application code (e.g. echosvr.py) needs to import concrete
> socket-handling classes like SocketTransport and BufferedReader in order to
> synthesize a transport.  These classes might need to vary their behavior
> significantly between platforms, and application code should not be
> manipulating them unless there is a serious low-level need to.
>
> It looks like you've already addressed the fact that some transports need to
> be platform-specific.  That's not quite accurate, unless you take a very
> broad definition of "platform".  In Twisted, the basic socket-based TCP
> transport is actually supported across all platforms; but some other *APIs*
> (well, let's be honest, right now, just IOCP, but there have been others,
> such as java's native I/O APIs under Jython, in the past).
>
> You have to ask the "pollster" (by which I mean: reactor) for transport
> objects, because different multiplexing mechanisms can require different I/O
> APIs, even for basic socket I/O.  This is why I keep talking about IOCP.
> It's not that Windows is particularly great, but that the IOCP API, if used
> correctly, is fairly alien, and is a good proxy for other use-cases which
> are less direct to explain, like interacting with GUI libraries where you
> need to interact with the GUI's notion of a socket to get notifications,
> rather than a raw FD.  (GUI libraries often do this because they have to
> support Windows and therefore IOCP.)  Others in this thread have already
> mentioned the fact that ZeroMQ requires the same sort of affordance.  This
> is really a design error on 0MQ's part, but, you have to deal with it anyway
> ;-).
>
> More importantly, concretely tying everything to sockets is just bad design.
> You want to be able to operate on pipes and PTYs (which need to call read(),
> or, a bunch of gross ioctl()s and then read(), not recv()).  You want to be
> able to operate on these things in unit tests without involving any
> actual file descriptors or syscalls.  The higher level of abstraction makes
> regular application code a lot shorter, too: I was able to compress
> echosvr.py down to 22 lines by removing all the comments and logging and
> such, but that is still more than twice as long as the (9 line) echo server
> example on the front page of <http://twistedmatrix.com/trac/>.  It's closer
> in length to the (19 line) full line-based publish/subscribe protocol over
> on the third tab.
>
> Also, what about testing? You want to be able to simulate the order of
> responses of multiple syscalls to coerce your event-driven program to
> receive its events in different orders.  One of the big advantages of event
> driven programming is that everything's just a method call, so your unit
> tests can just call the methods to deliver data to your program and see what
> it does, without needing to have a large, elaborate simulation edifice to
> pretend to be a socket.  But, once you mix in the magic of the generator
> trampoline, it's somewhat hard to assemble your own working environment
> without some kind of test event source; at least, it's not clear to me how
> to assemble a Task without having a pollster anywhere, or how to make my own
> basic pollster for testing.
>
> The event loop has two basic ways to register callbacks:
> call_soon(callback, *args) causes callback(*args) to be called the
> next time the event loop runs; call_later(delay, callback, *args)
> schedules a callback at some time (relative or absolute) in the
> future.
>
>
> "relative or absolute" is hiding the whole monotonic-clocks discussion
> behind a simple phrase, but that probably does not need to be resolved
> here... I'll let you know if we ever figure it out :).
>
> sockets.py: http://code.google.com/p/tulip/source/browse/sockets.py
>
> This implements some internet primitives using the APIs in
> scheduling.py (including block_r() and block_w()). I call them
> transports but they are different from transports in Twisted; they are
> closer to idealized sockets. SocketTransport wraps a plain socket,
> offering recv() and send() methods that must be invoked using yield
> from.
>
>
> I feel I should note that these methods behave inconsistently; send()
> behaves as sendall(), re-trying its writes until it receives a full buffer,
> but recv() may yield a short read.
>
> (But most importantly, block_r and block_w are insufficient as primitives;
> you need a separate pollster that uses write_then_block(data) and
> read_then_block() too, which may need to dispatch to WSASend/WSARecv or
> WriteFile/ReadFile.)
>
> SslTransport wraps an ssl socket (luckily in Python 2.6 and up,
> stdlib ssl sockets have good async support!).
>
>
> stdlib ssl sockets have async support that makes a number of UNIX-y
> assumptions.  The wrap_socket trick doesn't work with IOCP, because the I/O
> operations are initiated within the SSL layer, and therefore can't be
> associated with a completion port, so they won't cause a queued completion
> status trigger and therefore won't wake up the loop.  This plagued us for
> many years within Twisted and has only relatively recently been fixed:
> <http://tm.tl/593>.
>
> Since probably 99% of the people on this list don't actually give a crap
> about Windows, let me give a more practical example: you can't do SSL over a
> UNIX pipe.  Off the top of my head, this means you can't write a
> command-line tool to encrypt a connection via a shell pipeline, but there
> are many other cases where you'd expect to be able to get arbitrary I/O over
> stdout.
>
> It's reasonable, of course, for lots of Python applications to not care
> about high-performance, high-concurrency SSL on Windows; select() works
> okay for many applications on Windows.  And most SSL happens on sockets, not
> pipes, hence the existence of the OpenSSL API that the stdlib ssl module
> exposes for wrapping sockets.  But, as I'll explain in a moment, this is one
> reason that it's important to be able to give your code a turbo boost with
> Twisted (or other third-party extensions) once you start encountering
> problems like this.
>
> I don't particularly care about the exact abstractions in this module;
> they are convenient and I was surprised how easy it was to add SSL,
> but still these mostly serve as somewhat realistic examples of how to
> use scheduling.py.
>
>
> This is where I think we really differ.
>
> I think that the whole attempt to build a coroutine scheduler at the low
> level is somewhat misguided and will encourage people to write misleading,
> sloppy, incorrect programs that will be tricky to debug (although, to be
> fair, not quite as tricky as even more misleading/sloppy/incorrect
> multi-threaded ones).  However, I'm more than happy to agree to disagree on
> this point: clearly you think that forests of yielding coroutines are a big
> part of the future of Python.  Maybe you're even right to do so, since I
> have no interest in adding language features, whereas if you hit a rough
> edge in 'yield' syntax you can sand it off rather than living with it.  I
> will readily concede that 'yield from' and 'return' are nicer than the
> somewhat ad-hoc idioms we ended up having to contend with in the current
> iteration of @inlineCallbacks.  (Except for the exit-at-a-distance problem,
> which it doesn't seem that return->StopIteration addresses - does this
> happen, with PEP-380 generators?
> <http://twistedmatrix.com/trac/ticket/4157>)
>
> What I'm not happy to disagree about is the importance of a good I/O
> abstraction and interoperation layer.
>
> Twisted is not going away; there are oodles of good reasons that it's built
> the way it is, as I've tried to describe in this and other messages, and
> none of our plans for its future involve putting coroutine trampolines at
> the core of the event loop; those are just fine over on the side with
> inlineCallbacks.  However, lots of Python programmers are going to use what
> you come up with.  They'd use it even if it didn't really work, just because
> it's bundled in and it's convenient.  But I think it'll probably work fine
> for many tasks, and it will appeal to lots of people new to event-driven I/O
> because of the seductive deception of synchronous control flow and the
> superiority to scheduling I/O operations with threads.
>
> What I think is really very important in the design of this new system is to
> present an API whereby:
>
> - if someone wants to write a basic protocol or data-format parser for the
>   stdlib, it should be easy to write it as a feed parser without needing
>   generator coroutines (for example, if they're pushing data into a C
>   library, they shouldn't have to write a while loop that calls recv; they
>   should be able to just transform some data callback in Python into some
>   data callback in C); it should be able to leverage tulip without much
>   more work,
> - if users of tulip (read: the stdlib) need access to some functionality
>   implemented within Twisted, like an event-driven DNS client that is more
>   scalable than getaddrinfo, they can call into Twisted without re-writing
>   their entire program,
> - if users of Twisted need to invoke some functionality implemented on top
>   of tulip, they can construct a task and weave in a scheduler, similarly
>   without re-writing much,
> - if users of tulip want to just use Twisted to get better performance or
>   reliability than the built-in stdlib multiplexor, they ideally shouldn't
>   have to change anything, just run it with a different import line or
>   something, and
> - if (when) users of tulip realize that their generators have devolved into
>   a mess of spaghetti ;-) and they need to migrate to Twisted-style
>   event-driven callbacks and maybe some formal state machines or generated
>   parsers to deal with their inputs, that process can be done incrementally
>   and not in one giant shoot-the-moon effort which will make them hate
>   Twisted.
>
>
> As an added bonus, such an API would provide a great basis for Tornado and
> Twisted to interoperate.
>
> It would also be nice to have a more discrete I/O layer to insulate
> application code from common foibles like the fact that, for example, if you
> call send() in tulip multiple times but forget to 'yield from ...send()',
> you may end up writing interleaved garbage on the connection, then raising
> an assertion error, but only if there's a sufficient quantity of data and it
> needs to block; it will otherwise appear to work, leading to bugs that only
> start happening when you are pushing large volumes of data through a system
> at rates exceeding wire speed.  In other words, "only in production, only
> during the holiday season, only during traffic spikes, only when it's really
> really important for the system to keep working".
>
> This is why I think that step 1 here needs to be a common low-level API for
> event-triggered operations that does not have anything to do with
> generators.  I don't want to stop you from doing interesting things with
> generators, but I do really want to decouple the tasks so that their
> responsibilities are not unnecessarily conflated.
>
> task.unblock() is a method; protocol.data_received is a method.  Both can be
> invoked at the same level by an event loop.  Once that low-level event loop
> is delivering data to that callback's satisfaction, the callbacks can
> happily drive a coroutine scheduler, and the coroutine scheduler can have
> much less of a deep integration with the I/O itself; it just needs some kind
> of sentinel object (a Future, a Deferred) to keep track of what exactly it's
> waiting for.
>
> I'm most interested in feedback on the design of polling.py and
> scheduling.py, and to a lesser extent on the design of sockets.py;
> main.py is just an example of how this style works out in practice.
>
>
> It looks to me like there's a design error in scheduling.py with respect to
> coordinating concurrent operations.  If you try to block on two operations
> at once, you'll get an assertion error ('assert not self.blocked', in
> block), so you can't coordinate two interesting I/O requests without
> spawning a bunch of new Tasks and then having them unblock their parent Task
> when they're done.  I may just be failing to imagine how one would implement
> something like Twisted's gatherResults, but this looks like it would be
> frustrating, tedious, and involve creating lots of extra objects and making
> the scheduler do a bunch more work.
>
> Also, shouldn't there be a lot more real exceptions and a lot fewer
> assertions in this code?
>
> Relatedly, add_reader/writer will silently stomp on a previous FD
> registration, so if two tasks end up calling recv() on the same socket, it
> doesn't look like there's any way to find out that they both did that.  It
> looks like the first task to call it will just hang forever, and the second
> one will "win"?  What are the intended semantics?
>
> Speaking from the perspective of I/O scheduling, it will also be thrashing
> any stateful multiplexor with a ton of unnecessary syscalls.  A Twisted
> protocol in normal operation just receiving data from a single connection,
> using, let's say, a kqueue-based multiplexor will call kevent() once to
> register interest, then kqueue() to block, and then just keep getting
> data-available notifications and processing them unless some downstream
> buffer fills up and the transport is told to pause producing data, at which
> point another kevent() gets issued.  tulip, by contrast, will call kevent()
> over and over again, removing and then re-adding its reader repeatedly for
> every packet, since it can never know if someone is about to call recv()
> again any time soon.  Once again, request/response is not the best model for
> retrieving data from a transport; active connections need to be prepared to
> receive more data at any time and not in response to any particular request.
>
> Finally, apologies for spelling / grammar errors; I didn't have a lot of
> time to copy-edit.
>
> -glyph
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>


From techtonik at gmail.com  Wed Nov  7 12:10:30 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 14:10:30 +0300
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <k7d64j$gui$1@ger.gmane.org>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
	<CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
	<CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>
	<k7d64j$gui$1@ger.gmane.org>
Message-ID: <CAPkN8xLpZe1WjwCZvvonPRm8+0uk2ThA89L92zB-41cbzPjz1w@mail.gmail.com>

On Wed, Nov 7, 2012 at 11:29 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:

> anatoly techtonik, 07.11.2012 09:19:
> >>> 4. Recipes are licensed, which is too much of a burden for a snippet
> >
> > I also found that the ActiveState site doesn't allow releasing recipes into
> > the Public Domain.
>
> Which is ok because "Public Domain" is not a universal concept. It won't
> work in all countries where your recipes will be read and where people want
> to use them. Better use a suitable license.


The MIT license or the GPL is not a universal concept either and won't work
universally outside of the U.S. In court it will come down to a special case
of a personal copyright agreement between user and author. In this respect it
is absolutely no different from a public domain notice.

From jrwren at xmtp.net  Wed Nov  7 16:18:20 2012
From: jrwren at xmtp.net (Jay Wren)
Date: Wed, 7 Nov 2012 10:18:20 -0500
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <k7d64j$gui$1@ger.gmane.org>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
	<CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
	<CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>
	<k7d64j$gui$1@ger.gmane.org>
Message-ID: <5064EA98-EDC2-4C29-A7E3-B25E0BE5D109@xmtp.net>


On Nov 7, 2012, at 3:29 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:

> anatoly techtonik, 07.11.2012 09:19:
>>>> 4. Recipes are licensed, which is too much of a burden for a snippet
>> 
>> I also found that the ActiveState site doesn't allow releasing recipes into
>> the Public Domain.
> 
> Which is ok because "Public Domain" is not a universal concept. It won't
> work in all countries where your recipes will be read and where people want
> to use them. Better use a suitable license.

Creative Commons did a lot of work on making CC0 a universal "Public Domain". While other CC licenses are not suitable for code, CC0 does make sense when you want to release code as what some of us know as "Public Domain".

https://creativecommons.org/publicdomain/zero/1.0/

As for ActiveState not allowing it, StackOverflow might confuse things as well since all contributions on StackExchange sites are licensed under CC-BY-SA. It may be difficult to put a CC0 license alongside each post on StackOverflow.

https://stackexchange.com/legal/terms-of-service

--
Jay R. Wren

From guido at python.org  Wed Nov  7 16:19:32 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 7 Nov 2012 07:19:32 -0800
Subject: [Python-ideas] Async API: some code to review
In-Reply-To: <CABicbJL3gwrSyELRgKP8iVkkwCpYR4sxRBvHGLHZXtR6WwqtQA@mail.gmail.com>
References: <CAP7+vJJhkGEK6=BQ6SNtCSZUgN21S6e90YOb7tLJgds2Le+rGA@mail.gmail.com>
	<64FEC457-659B-47DB-BE0E-E830F61F4200@twistedmatrix.com>
	<CABicbJL3gwrSyELRgKP8iVkkwCpYR4sxRBvHGLHZXtR6WwqtQA@mail.gmail.com>
Message-ID: <CAP7+vJJxvsnRC0Bi=a2Dmu-jmCOY4XR0Rvp+GVD--xfGcGBxdw@mail.gmail.com>

Glyph and three other Twisted developers visited me yesterday. All is well.
We're behind in reporting -- I have a variety of trips and other activities
coming up, but I am still very much planning to act on what we discussed.
(And no, they didn't convince me to add Twisted to the stdlib. :-)

--Guido


On Wed, Nov 7, 2012 at 1:11 AM, Devin Jeanpierre <jeanpierreda at gmail.com> wrote:

> It's been a week, and nobody has responded to Glyph's email. I don't
> think I know enough to agree or disagree with what he said, but it was
> well-written and it looked important. Also, Glyph has a lot of
> experience with this sort of thing, and it would be a shame if he was
> discouraged by the lack of response. We can't really expect people to
> contribute if their opinions are ignored.
>
> Can relevant people please take another look at his post?
>
> -- Devin



-- 
--Guido van Rossum (python.org/~guido)

From grosser.meister.morti at gmx.net  Wed Nov  7 18:24:21 2012
From: grosser.meister.morti at gmx.net (=?UTF-8?B?TWF0aGlhcyBQYW56ZW5iw7Zjaw==?=)
Date: Wed, 07 Nov 2012 18:24:21 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
	<CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
Message-ID: <509A9945.6040609@gmx.net>

Sorry, I don't quite understand.

On 11/07/2012 06:08 AM, Senthil Kumaran wrote:
> Had not known about the 'data' url scheme. Thanks for pointing out (
> http://tools.ietf.org/html/rfc2397 ) and the documentation patch.
> BTW, documentation patch is easy to get in, but should the support in
> a more natural form, where data url is parsed internally by the module

Do you mean the parse_data_url function should be removed and put into DataResponse (or DataHandler)?

> and expected results be returned should be considered?

What expected results? And in what way should they be considered? Considered for what?

> That could be
> targeted for 3.4 and docs recipe does serve for all the other
> releases.
>
>
> Thank you,
> Senthil
>
>
> On Tue, Nov 6, 2012 at 7:45 PM, Mathias Panzenböck
> <grosser.meister.morti at gmx.net> wrote:
>> Ok, I've written an issue in the python bug tracker and attached a doc patch
>> for the recipe:
>>
>> http://bugs.python.org/issue16423
>>
>>
>> On 11/04/2012 09:28 AM, Paul Moore wrote:
>>>
>>> On Sunday, 4 November 2012, Mathias Panzenböck wrote:
>>>
>>>
>>>      Shouldn't there be *one* obvious way to do this? req.headers
>>>
>>>
>>> Well, I'd say that the stdlib docs imply that req.info is the required
>>> way, so that's the "one obvious way". If you want to add extra methods
>>> for convenience, fair enough, but code that doesn't already know it is
>>> handling a data URL can't use them, so I don't see the point, personally.
>>>
>>> But others may have different views...
>>>
>>> Paul
>>
>>


From techtonik at gmail.com  Wed Nov  7 19:35:42 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 21:35:42 +0300
Subject: [Python-ideas] sys.py3k
In-Reply-To: <20121105063008.GA14836@ando>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<5096ED46.20502@pearwood.info>
	<CAHVvXxSJ1wP8+VK1gZP5cyCL9Xr9iu209DeQWgGAOo4AmZ6UZQ@mail.gmail.com>
	<20121105063008.GA14836@ando>
Message-ID: <CAPkN8xLbr1Xn+m_LWw92KKrVB5K7AeZs_+Ju9CZq0Q=yG-5wSA@mail.gmail.com>

On Mon, Nov 5, 2012 at 9:30 AM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Mon, Nov 05, 2012 at 02:08:33AM +0000, Oscar Benjamin wrote:
>
> > There are certain cases where explicitly checking the version makes
> > sense. I think that Python 3 vs Python 2 is sometimes such a case.
> > Python 3 changes the meaning of a number of elementary aspects of
> > Python so that the same code can run without error but with different
> > semantics under the two different version series.
>

...

In any case, arguments about defensive coding style are getting
> off-topic. The point is that there are various ways to test for the
> existence of features, and adding yet another coarse-grained test
> "sys.py3k" doesn't gain us much (if anything).
>

The problem is maintaining the code in the long term. The Python 3
documentation already omits things about Python 2 modules, so with
implicit feature tests legacy code checks may quickly get out of control.
It's not uncommon to see unwieldy projects with a huge codebase of repetitive
code in corporate environments where people are afraid to bring down legacy
stuff, because they don't know why it was inserted in the first place.

I thought of the sys.py3k check as an explicit way to guard the code that
should be maintained extra carefully for Python 3 compatibility, so that
you can grep the source for this constant and remove all the hacks (such as
bytes-to-string conversion) required to maintain the compatibility when the
time comes to switch. Now I see that all the points raised about it being too
late and not sufficient (because API breakage occurs even between minor
versions) are valid. The six module is an awesome alternative. Too bad it
doesn't come bundled by default or as an easy "after installation" update.

The granularity of a "feature" is interesting. Previously it was `from
__future__ import feature` for forward compatibility, but it required
planning features beforehand. Now there is a need to test for a feature
which is not present in an earlier version, and we have only implicit ways.
These ways assume that we know what a feature and its symptoms are, which
is not the case when the code is not yours or you have little experience with
either Python 2 or 3. I hoped that `if sys.py3k` could help make code more
readable, but it would be more useful to have a shared 'features' module which
can contain explicit checks for existing features and return False for
things that are not present.
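
For example, such a shared 'features' module could look roughly like this
(a hypothetical sketch - the module and flag names are made up purely for
illustration):

# features.py - hypothetical shared feature checks (illustration only)
import sys

PY3 = sys.version_info[0] >= 3

# Explicit, greppable flags instead of version checks scattered around:
try:
    callable                          # absent in Python 3.0/3.1, back in 3.2
    HAS_CALLABLE = True
except NameError:
    HAS_CALLABLE = False

HAS_SEPARATE_BYTES = bytes is not str  # True on Python 3 only

Code would then do `from features import HAS_CALLABLE` and branch on that,
which stays easy to grep for when the compatibility hacks are removed.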

`six` is awesome though.
-- 
anatoly t.

From techtonik at gmail.com  Wed Nov  7 19:46:15 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 21:46:15 +0300
Subject: [Python-ideas] Publishing ideas on ActiveState recipe site
In-Reply-To: <5064EA98-EDC2-4C29-A7E3-B25E0BE5D109@xmtp.net>
References: <CAPkN8xJ5Aam1tLyBPOgbBMwCuDoKs8gCp4aDwtr42H3YU56E+g@mail.gmail.com>
	<CANSw7KygPJxsNuWksjm-DVo5HKAJ8fRTyOCgG9BQRo=j=8akBA@mail.gmail.com>
	<CAPkN8xKT2Ww_h-njz=Ot_ZxVD4GPH8hXm-+WO=cJgvPV8gw6sw@mail.gmail.com>
	<k7d64j$gui$1@ger.gmane.org>
	<5064EA98-EDC2-4C29-A7E3-B25E0BE5D109@xmtp.net>
Message-ID: <CAPkN8xLmL2usprwRFkwTbeBWLUa=_U17Xwco-iea7EdMAxTtTA@mail.gmail.com>

On Wed, Nov 7, 2012 at 6:18 PM, Jay Wren <jrwren at xmtp.net> wrote:

>
> On Nov 7, 2012, at 3:29 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
>
> > anatoly techtonik, 07.11.2012 09:19:
> >>>> 4. Recipes are licensed, which is a too much of a burden for a snippet
> >>
> >> I also found that ActiveState site doesn't allow to release recipes into
> >> Public Domain.
> >
> > Which is ok because "Public Domain" is not a universal concept. It won't
> > work in all countries where your recipes will be read and where people
> want
> > to use them. Better use a suitable license.
>
> Creative Commons did a lot of work on making CC0 a universal "Public
> Domain". While other CC licenses are not suitable for code, CC0 does make
> sense when you want to release code as what some of us know as "Public
> Domain".
>
> https://creativecommons.org/publicdomain/zero/1.0/
>
> As for ActiveState not allowing it, StackOverflow might confuse things as
> well since all contributions on StackExchange sites are licensed under
> CC-BY-SA. It may be difficult to put a CC0 license along side each post in
> StackOverflow.
>
> https://stackexchange.com/legal/terms-of-service


Wow. I didn't know that - it looks like all code on SO is copylefted and
can not be used in commercial products without giving up the rest of your
commercial code -
http://meta.stackoverflow.com/questions/18883/what-license-should-be-on-sample-code
-- 
anatoly t.

From techtonik at gmail.com  Wed Nov  7 20:16:21 2012
From: techtonik at gmail.com (anatoly techtonik)
Date: Wed, 7 Nov 2012 22:16:21 +0300
Subject: [Python-ideas] Windows temporary file association for Python
	files
In-Reply-To: <k63gqe$643$1@ger.gmane.org>
References: <CAPkN8x+bEUaFuJ7mRO3vcOV0oBfNQn1f_ZnfzymwYrsq=jMXPA@mail.gmail.com>
	<CACac1F9LwEtbs0b8qY3NCK12Dn08GHtm6fLoMAwXrRdSid2eDA@mail.gmail.com>
	<CAPkN8xLGLTjBbDSTOHJKTzEwLvYT9EgzqkepN18-ZGdEk4jsdw@mail.gmail.com>
	<k63gqe$643$1@ger.gmane.org>
Message-ID: <CAPkN8xLMDm1rFFFaaHbzNZAcnr8RRkeDiO-m-Ymz6VDnMiKMPQ@mail.gmail.com>

On Mon, Oct 22, 2012 at 4:16 PM, Mark Lawrence <breamoreboy at yahoo.co.uk>wrote:

> On 22/10/2012 13:42, anatoly techtonik wrote:
>
>> On Mon, Oct 22, 2012 at 2:44 PM, Paul Moore <p.f.moore at gmail.com> wrote:
>>
>>> On 22 October 2012 11:51, anatoly techtonik <techtonik at gmail.com> wrote:
>>>
>>>> I wonder if it will make the life easier if Python was installed with
>>>> .py association to "%PYTHON_HOME%\python.exe" "%1" %*
>>>> It will remove the need to run .py scripts in virtualenv with explicit
>>>> 'python' prefix.
>>>>
>>>
>>> In Python 3.3 and later, the "py.exe" launcher is installed, and this
>>> is the association for ".py" files by default. It looks at the #! line
>>> of .py files, so you can run a specific Python interpreter by giving
>>> its full path. You can also specify (for example) "python3" or
>>> "python3.2" to run a specific Python version.
>>>
>>
>> Yes, I've noticed that this nasty launcher gets in the way. So, do you
>> propose to edit source files every time I need to test them with a new
>> version of Python? My original user story:
>>
>
> I see nothing nasty in the launcher, rather it's extremely useful.  You
> don't have to edit your scripts.  Just use py -3.2, py -2 or whatever to
> run the script, the launcher will work out which version to run for you if
> you're not specific.


Nice. Didn't know about that.


>       I want to execute scripts in virtual environment (i.e. with Python
>> installed for this virtual environment) without 'python' prefix.
>>
>> Here is another one. Currently Sphinx doesn't install with Python 3.2
>> or with Python 3.3 [1]. Normally I'd create 3 environments to
>> troubleshoot it, and I cannot modify all Sphinx files to point to the
>> correct interpreter just to execute 'setup.py install'.
>>
>
> Please try running your scripts with the mechanism I've given above and
> report back what happens, hopefully success :)


Not really. It doesn't work with virtualenv at all.

E:\p>cat version.py
import sys
print(sys.version)
E:\p>python version.py
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit
(Intel)]

E:\p>32\Scripts\activate
(32) E:\p>python version.py
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]

(32) E:\p>version.py
2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]

(32) E:\p>py version.py
2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]


>
>> A solution would be to teach the launcher to honor a PYTHON_PATH variable if
>> it is set (please don't confuse it with PYTHONPATH, whose purpose is
>> still unclear on Windows).
>>
>
> What is PYTHON_PATH?  IIRC I was told years ago *NOT* to use PYTHONPATH on
> Windows so its purpose to me isn't unclear, it's completely baffling.
>

Sorry, it was PYTHON_HOME - the variable from the .py association I proposed,
i.e. "%PYTHON_HOME%\python.exe" "%1" %*
If the association is made this way, then virtualenv could override PYTHON_HOME
and have the `version.py` command execute with its own interpreter.

Now, with py.exe, it looks like a better idea to have a PYTHONBIN environment
variable and the association set to "%PYTHONBIN%" "%1" %*. This way
virtualenv can easily override it during activation to make .py files
executable with whatever Python (or PyPy) it has configured.

From rosuav at gmail.com  Thu Nov  8 01:06:34 2012
From: rosuav at gmail.com (Chris Angelico)
Date: Thu, 8 Nov 2012 11:06:34 +1100
Subject: [Python-ideas] sys.py3k
In-Reply-To: <CAPkN8xLbr1Xn+m_LWw92KKrVB5K7AeZs_+Ju9CZq0Q=yG-5wSA@mail.gmail.com>
References: <CAPkN8xL1Hh72_cur91xTPVcyxMLzMVPpZcUp_oumUFKhMZZTDg@mail.gmail.com>
	<5096ED46.20502@pearwood.info>
	<CAHVvXxSJ1wP8+VK1gZP5cyCL9Xr9iu209DeQWgGAOo4AmZ6UZQ@mail.gmail.com>
	<20121105063008.GA14836@ando>
	<CAPkN8xLbr1Xn+m_LWw92KKrVB5K7AeZs_+Ju9CZq0Q=yG-5wSA@mail.gmail.com>
Message-ID: <CAPTjJmoes+=v5zyQm0xL_O=DDDuBGwJjmzU-hpVxZVGX25qZNw@mail.gmail.com>

On Thu, Nov 8, 2012 at 5:35 AM, anatoly techtonik <techtonik at gmail.com> wrote:
> I thought of sys.py3k check as an explicit way to guard the code that should
> be maintained extra carefully for Python 3 compatibility, so that you can
> grep the source for this constant and remove all the hacks (such as bytes to
> string conversion) required to maintain the compatibility when the time
> comes to switch.

I agree about greppability, it's a huge help. Hence the code comment;
as long as you're consistent and you pick a keyword long enough or
unusual enough to not occur anywhere else, you can easily do a "find
across files" or "grep XYZ *" to find them all. And if you put the
comment on the most significant line of code, line-based tools will be
more useful.

# Unideal:
# py3k
try:
    reload
except NameError:
    from imp import reload

# Better:
try: # py3k
    reload
except NameError:
    from imp import reload

# Best:
try:
    reload # py3k
except NameError:
    from imp import reload

# Also best:
try:
    reload
except NameError:
    from imp import reload # py3k

Taking just the line with the keyword "py3k" on it will tell you
exactly what that file is doing.

ChrisA


From senthil at uthcode.com  Thu Nov  8 07:11:18 2012
From: senthil at uthcode.com (Senthil Kumaran)
Date: Wed, 7 Nov 2012 22:11:18 -0800
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <509A9945.6040609@gmx.net>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
	<CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
	<509A9945.6040609@gmx.net>
Message-ID: <CAPOVWOQ6KYBOCWQkRhrzpYBUhBBfVeZAkAYFg0LqTghJOx4rQg@mail.gmail.com>

On Wed, Nov 7, 2012 at 9:24 AM, Mathias Panzenböck
<grosser.meister.morti at gmx.net> wrote:
> Sorry, I don't quite understand.
> Do you mean the parse_data_url function should be removed and put into
> DataResponse (or DataHandler)?
>
>
>> and expected results be returned should be considered?
>
>
> What expected results? And in what way should they be considered? Considered
> for what?

I meant, urlopen("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==")

should work out of the box, with the DataHandler example from the
documentation made available in request.py and added to
OpenerDirector by default. I find it hard to gauge the utility, but
documentation is of course a +1.
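
For concreteness, wiring such a handler in by hand today looks roughly like
this -- a sketch only, not the documented example itself nor the code that
would land in request.py; it handles just the base64 case and skips proper
media-type and percent-decoding handling:

    import base64, io, urllib.request, urllib.response

    class DataHandler(urllib.request.BaseHandler):
        def data_open(self, req):
            url = req.full_url
            header, _, payload = url.partition(',')
            if header.endswith(';base64'):
                data = base64.b64decode(payload)
            else:
                data = payload.encode('ascii')
            # very rough: no Content-Type headers, no error checking
            return urllib.response.addinfourl(io.BytesIO(data), None, url)

    opener = urllib.request.build_opener(DataHandler())
    print(opener.open('data:text/plain;base64,aGVsbG8=').read())  # b'hello'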

Thanks,
Senthil


From christian at python.org  Thu Nov  8 23:13:49 2012
From: christian at python.org (Christian Heimes)
Date: Thu, 08 Nov 2012 23:13:49 +0100
Subject: [Python-ideas] CLI option for isolated mode
Message-ID: <509C2E9D.3080707@python.org>

Hi everybody,

I'd like to propose a new option for the Python interpreter:

  python -I

It shall start the interpreter in isolated mode, which ignores any
environment variables set by the user and any files installed by the
user. The mode segregates a Python program from anything an unprivileged
user is able to modify and uses only files that are installed by a
system administrator.

The isolated mode implies -E (ignore all PYTHON* environment vars) and
-s (don't add the user site directory). It also refrains from adding
'' or getcwd() to sys.path. Tkinter doesn't load and execute Python
scripts from the user's home directory. Other parts of the stdlib should
be checked, too.

The option is intended for OS and application scripts that don't want
to be affected by user-installed files or files in the user's current
working directory.

The idea is motivated by a couple of bug reports, for example:

https://bugs.launchpad.net/bugs/938869  lsb_release crashed with SIGABRT
in Py_FatalError()

http://bugs.python.org/issue16202  sys.path[0] security issues

http://bugs.python.org/issue16248  Security bug in tkinter allows for
untrusted, arbitrary code execution.

Regards,
Christian


From michael.weylandt at gmail.com  Thu Nov  8 23:35:02 2012
From: michael.weylandt at gmail.com (R. Michael Weylandt)
Date: Thu, 8 Nov 2012 22:35:02 +0000
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <509C2E9D.3080707@python.org>
References: <509C2E9D.3080707@python.org>
Message-ID: <CAAmySGMKogmQTn97=dwRtLk8mZgCYZm2ZSw5jXMUau2cUCtsbg@mail.gmail.com>

On Thu, Nov 8, 2012 at 10:13 PM, Christian Heimes <christian at python.org> wrote:
> Hi everybody,
>
> I like to propose a new option for the Python interpreter:
>
>   python -I
>
> It shall start the interpreter in isolated mode which ignores any
> environment variables set by the user and any files installed by the
> user. The mode segregate a Python program from anything an unpriviliged
> user is able to modify and uses only files that are installed by a
> system adminstrator.
>
> The isolated mode implies -E (ignore all PYTHON* environment vars) and
> -s (don't add user site directory). It also refrains from the inclusion
> of '' or getcwd() to sys.path. TKinter doesn't load and execute Python
> scripts from the user's home directory. Other parts of the stdlib should
> be checked, too.
>
> The option is intended for OS and application scripts that doesn't want
> to become affected by user installed files or files in the current
> working path of a user.
>

R allows something quite similar with the "--vanilla" option. I can
attest that it's also quite helpful for debugging, particularly for
working around corrupted startup settings.

I do think I slightly prefer "vanilla" to possible I/l ["i" or "L"]
confusion, but I do appreciate that long flags are not the norm for
Python.

Michael


From barry at python.org  Fri Nov  9 02:16:15 2012
From: barry at python.org (Barry Warsaw)
Date: Thu, 8 Nov 2012 20:16:15 -0500
Subject: [Python-ideas] CLI option for isolated mode
References: <509C2E9D.3080707@python.org>
Message-ID: <20121108201615.7e429e46@resist.wooz.org>

On Nov 08, 2012, at 11:13 PM, Christian Heimes wrote:

>I like to propose a new option for the Python interpreter:
>
>  python -I
>
>It shall start the interpreter in isolated mode which ignores any
>environment variables set by the user and any files installed by the
>user. The mode segregate a Python program from anything an unpriviliged
>user is able to modify and uses only files that are installed by a
>system adminstrator.
>
>The isolated mode implies -E (ignore all PYTHON* environment vars) and
>-s (don't add user site directory). It also refrains from the inclusion
>of '' or getcwd() to sys.path. TKinter doesn't load and execute Python
>scripts from the user's home directory. Other parts of the stdlib should
>be checked, too.
>
>The option is intended for OS and application scripts that doesn't want
>to become affected by user installed files or files in the current
>working path of a user.
>
>The idea is motivated by a couple of bug reports, for example:
>
>https://bugs.launchpad.net/bugs/938869  lsb_release crashed with SIGABRT
>in Py_FatalError()
>
>http://bugs.python.org/issue16202  sys.path[0] security issues
>
>http://bugs.python.org/issue16248  Security bug in tkinter allows for
>untrusted, arbitrary code execution.

As someone who worked on the lsb_release problem, I'm generally supportive of
this proposal.  Here's a link to the thread on the debian-python mailing list
where I suggested "system" scripts always use -Es in the shebang line:

http://thread.gmane.org/gmane.linux.debian.devel.python/8188
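
For concreteness, that's roughly the difference between what such scripts
have to carry today and what they could use under Christian's proposal
(assuming the option ends up spelled -I, which is still being discussed):

    today:    #!/usr/bin/python3 -Es
    proposed: #!/usr/bin/python3 -I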

The responses were cautious but mostly supportive.  One poster said:

"If I set PYTHONWARNINGS, I want it to affect all Python scripts."

I wonder also if we might want some other set of defaults, like -B enabled.

Cheers,
-Barry

From mal at egenix.com  Fri Nov  9 09:19:04 2012
From: mal at egenix.com (M.-A. Lemburg)
Date: Fri, 09 Nov 2012 09:19:04 +0100
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <509C2E9D.3080707@python.org>
References: <509C2E9D.3080707@python.org>
Message-ID: <509CBC78.4040602@egenix.com>

On 08.11.2012 23:13, Christian Heimes wrote:
> Hi everybody,
> 
> I like to propose a new option for the Python interpreter:
> 
>   python -I
> 
> It shall start the interpreter in isolated mode which ignores any
> environment variables set by the user and any files installed by the
> user. The mode segregate a Python program from anything an unpriviliged
> user is able to modify and uses only files that are installed by a
> system adminstrator.
> 
> The isolated mode implies -E (ignore all PYTHON* environment vars) and
> -s (don't add user site directory). It also refrains from the inclusion
> of '' or getcwd() to sys.path. TKinter doesn't load and execute Python
> scripts from the user's home directory. Other parts of the stdlib should
> be checked, too.
> 
> The option is intended for OS and application scripts that doesn't want
> to become affected by user installed files or files in the current
> working path of a user.
> 
> The idea is motivated by a couple of bug reports, for example:
> 
> https://bugs.launchpad.net/bugs/938869  lsb_release crashed with SIGABRT
> in Py_FatalError()
> 
> http://bugs.python.org/issue16202  sys.path[0] security issues
> 
> http://bugs.python.org/issue16248  Security bug in tkinter allows for
> untrusted, arbitrary code execution.

Sounds like a good idea. I'd be interested in this, because it would
make debugging user installation problems easier.

The only thing I'm not sure about is the option character "-I". It
reminds me too much of the -I typically used for include paths
in C compilers :-)

BTW: In order to have Python applications respect this flag, there
should be an easy way to access this flag in Python programs, e.g.
sys.ignore_user_env.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 09 2012)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/


From benhoyt at gmail.com  Fri Nov  9 11:29:33 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 9 Nov 2012 23:29:33 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file attributes
 from FindFirst/NextFile() and readdir()
Message-ID: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>

I've noticed that os.walk() is a lot slower than it needs to be because it
does an os.stat() call for every file/directory. It does this because it uses
listdir(), which doesn't return any file attributes.

So instead of using the file information provided by FindFirstFile() /
FindNextFile() and readdir() -- which listdir() calls -- os.walk() does a
stat() on every file to see whether it's a directory or not. Both
FindFirst/FindNext and readdir() give this information already. Using it would
basically bring the number of system calls down from one per file to one per directory.

I've written a proof-of-concept (see [1] below) using ctypes and
FindFirst/FindNext on Windows, showing that for sizeable directory trees it
gives a 4x to 6x speedup -- so this is not a micro-optimization!

I started trying the same thing with opendir/readdir on Linux, but don't have
as much experience there, and wanted to get some feedback on the concept
first. I assume it'd be a similar speedup by using d_type & DT_DIR from
readdir().

The problem is even worse when you're calling os.walk() and then doing your
own stat() on each file, for example, to get the total size of all files in a
tree -- see [2]. It means it's calling stat() twice on every file, and I see
about a 9x speedup in this scenario using the info FindFirst/Next provide.
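
(For reference, the kind of caller code I mean is just the usual pattern
below -- a sketch; os.walk() already stat()s every name internally to split
files from directories, and getsize() then stat()s each file a second time:)

    import os

    def total_size(top):
        total = 0
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                # a second stat() per file, on top of the one os.walk() did
                total += os.path.getsize(os.path.join(dirpath, name))
        return total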

So there are a couple of things here:

1) The simplest thing to do would be to keep the APIs exactly the same, and
get the ~5x speedup on os.walk() -- on Windows, unsure of the exact speedup on
Linux. And on OS's where readdir() doesn't return "is directory" information,
obviously it'd fall back to using the stat on each file.

2) We could significantly improve the API by adding a listdir_stat() or
similar function, which would return a list of (filename, stat_result) tuples
instead of just the names. That might be useful in its own right, but of
course os.walk() could use it to speed itself up. Then of course it might be
good to have walk_stat() which you could use to speed up the "summing sizes"
cases.

Other related improvements to the listdir/walk APIs that could be considered
are:

* Using the wildcard/glob that FindFirst/FindNext take to do filtering -- this
would avoid fnmatch-ing and might speed up large operations, though I don't
have numbers. Obviously we'd have to simulate this with fnmatch on non-Windows
OSs, but this kind of filtering is something I've done with os.walk() many
times, so just having the API option would be useful ("glob" keyword arg?).

* Changing listdir() to yield instead of return a list (or adding yieldir?).
This fits both the FindNext/readdir APIs, and would address issues like [3].

Anyway, cutting a long story short -- do folks think 1) is a good idea? What
about some of the thoughts in 2)? In either case, what would be the best way
to go further on this?

Thanks,
Ben.

[1]: https://gist.github.com/4044946
[2]: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder/2485843#2485843
[3]: http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately


From ncoghlan at gmail.com  Fri Nov  9 13:23:38 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 9 Nov 2012 22:23:38 +1000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
Message-ID: <CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>

On Fri, Nov 9, 2012 at 8:29 PM, Ben Hoyt <benhoyt at gmail.com> wrote:

> Anyway, cutting a long story short -- do folks think 1) is a good idea?
> What
> about some of the thoughts in 2)? In either case, what would be the best
> way
> to go further on this?
>

It's even worse when you add NFS (and other network filesystems) into the
mix, so yes, +1 on devising a more efficient API design for bulk stat
retrieval than the current listdir+explicit-stat approach that can lead to
an excessive number of network round trips.

It's a complex enough idea that it definitely needs some iteration outside
the stdlib before it could be added, though.

You could either start exploring this as a new project, or else if you
wanted to fork my walkdir project on BitBucket I'd be interested in
reviewing any pull requests you made along those lines - redundant stat
calls are currently one of the issues with using walkdir for more complex
tasks. (However you decide to proceed, you'll need to set things up to
build an extension module, though - walkdir is pure Python at this point).

Another alternative you may want to explore is whether or not Antoine
Pitrou would be interested in adding such a capability to his pathlib
module. pathlib already includes stat result caching in Path objects, and
thus may be able to support a clean API for returning path details with the
stat results precached.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From christian at python.org  Fri Nov  9 15:42:18 2012
From: christian at python.org (Christian Heimes)
Date: Fri, 09 Nov 2012 15:42:18 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
Message-ID: <509D164A.7080901@python.org>

Am 09.11.2012 11:29, schrieb Ben Hoyt:
> * Changing listdir() to yield instead of return a list (or adding yieldir?).
> This fits both the FindNext/readdir APIs, and would address issues like [3].
> 
> Anyway, cutting a long story short -- do folks think 1) is a good idea? What
> about some of the thoughts in 2)? In either case, what would be the best way
> to go further on this?

+1 for something like yielddir().

A while ago I proposed that the os module should get another function for
iterating over a directory. The new function would return a generator
that yields structs. The structs contain the name and additional
metadata like the d_type. On Unix it should use the reentrant
version of readdir(), too.

A struct has the benefit that it can grow additional fields or contain
operating-system-dependent information like the inode on Unix.
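
Roughly the shape I have in mind, as a pure-Python sketch only -- the names
are made up, and a real implementation would live in posixmodule.c and get
d_type and d_ino from readdir_r() for free rather than calling lstat() as
this stand-in does:

    import collections, os, stat

    DirEntry = collections.namedtuple("DirEntry", "name type inode")

    def iterdir(path="."):
        for name in os.listdir(path):                # stand-in for readdir_r()
            st = os.lstat(os.path.join(path, name))  # the real thing avoids this
            yield DirEntry(name, stat.S_IFMT(st.st_mode), st.st_ino)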

Christian


From storchaka at gmail.com  Fri Nov  9 15:54:44 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Fri, 09 Nov 2012 16:54:44 +0200
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <509D164A.7080901@python.org>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<509D164A.7080901@python.org>
Message-ID: <k7j5fm$dvu$1@ger.gmane.org>

On 09.11.12 16:42, Christian Heimes wrote:
> +1 for something like yielddir().

See also http://bugs.python.org/issue11406.

> I while ago I proposed that the os module shall get another function for
> iterating over a directory. The new function is to return a generator
> that yields structs. The structs contain the name and additional
> metadata that like the d_type. On Unix it should use the reentrant
> version of readdir(), too.

The only fields in the dirent structure that are mandated by POSIX.1 are:
d_name[], of unspecified size, with at most NAME_MAX characters preceding
the terminating null byte; and (as an XSI extension) d_ino. The other
fields are unstandardized, and not present on all systems.




From christian at python.org  Fri Nov  9 16:06:48 2012
From: christian at python.org (Christian Heimes)
Date: Fri, 09 Nov 2012 16:06:48 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <k7j5fm$dvu$1@ger.gmane.org>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<509D164A.7080901@python.org> <k7j5fm$dvu$1@ger.gmane.org>
Message-ID: <509D1C08.8000400@python.org>

Am 09.11.2012 15:54, schrieb Serhiy Storchaka:
> The only fields in the dirent structure that are mandated by POSIX.1 are: d_name[], of unspecified size, with at most NAME_MAX characters preceding the terminating null byte; and (as an XSI extension) d_ino. The other fields are unstandardized, and not present on all systems.


I'm well aware of that fact. The idea is to use as much information as we
can get for free and acquire missing information from other sources like
stat().

Also some information depend on the file system:

Currently,  only  some file systems (among them: Btrfs, ext2, ext3, and
ext4) have full support returning the file type in d_type.  All
applications must properly handle a return of DT_UNKNOWN.





From christian at python.org  Fri Nov  9 17:27:28 2012
From: christian at python.org (Christian Heimes)
Date: Fri, 09 Nov 2012 17:27:28 +0100
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <509CBC78.4040602@egenix.com>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
Message-ID: <509D2EF0.8010209@python.org>

Am 09.11.2012 09:19, schrieb M.-A. Lemburg:
> Sounds like a good idea. I'd be interested in this, because it would
> make debugging user installation problems easier.
> 
> The only thing I'm not sure about is the option character "-I". It
> reminds me too much of the -I typically used for include paths
> in C compilers :-)

I'm open to suggestions for a better name and character. Michael also
pointed out that capital i (india) can look like a lower case l (lima).
-R is still unused. I hesitate to call it restricted mode because it can
be confused with PyPy's restricted Python.

> BTW: In order to have Python applications respect this flag, there
> should be an easy way to access this flag in Python programs, e.g.
> sys.ignore_user_env.

Of course! I assumed that I didn't have to spell it out explicitly. A new
CLI option will be accompanied by a new sys.flags attribute like
sys.flags.isolated_mode.
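
On the program's side that would look something like this (purely
illustrative -- neither the option nor the flag exists yet):

    import sys
    if getattr(sys.flags, "isolated_mode", 0):
        # started in isolated mode: user env vars and user site are ignored
        ...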

Christian


From mark.hackett at metoffice.gov.uk  Fri Nov  9 17:33:54 2012
From: mark.hackett at metoffice.gov.uk (Mark Hackett)
Date: Fri, 9 Nov 2012 16:33:54 +0000
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <509D2EF0.8010209@python.org>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
Message-ID: <201211091633.54293.mark.hackett@metoffice.gov.uk>

On Friday 09 Nov 2012, Christian Heimes wrote:
> Am 09.11.2012 09:19, schrieb M.-A. Lemburg:
> > Sounds like a good idea. I'd be interested in this, because it would
> > make debugging user installation problems easier.
> >
> > The only thing I'm not sure about is the option character "-I". It
> > reminds me too much of the -I typically used for include paths
> > in C compilers :-)
> 
> I'm open to suggestions for a better name and character. Michael also
> pointed out that capital i (india) can look like a lower case l (lima).
> -R is still unused. I hesitate to call it restricted mode because it can
> be confused with PyPy's restricted Python.

Are you restricted to the restricted ASCII set?

-^

as an option? The caret isn't already taken on the command line like & or ;.

Mind you, it may be best to bite the bullet and go "No one character option". 
You only get 52 options there and every option wants to be "e" because e's are 
good (sorry, UK pop quiz pun there). You see it often enough. Since -h is 
taken for help, you need -x for hex. Picking a character from the long word 
only gets you so far.

Very few z's.

So why not just call it a day and use long case rather than contort the 
language like acronym writers.


From python at mrabarnett.plus.com  Fri Nov  9 18:13:15 2012
From: python at mrabarnett.plus.com (MRAB)
Date: Fri, 09 Nov 2012 17:13:15 +0000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <509D164A.7080901@python.org>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<509D164A.7080901@python.org>
Message-ID: <509D39AB.7030309@mrabarnett.plus.com>

On 2012-11-09 14:42, Christian Heimes wrote:
> Am 09.11.2012 11:29, schrieb Ben Hoyt:
>> * Changing listdir() to yield instead of return a list (or adding yieldir?).
>> This fits both the FindNext/readdir APIs, and would address issues like [3].
>>
>> Anyway, cutting a long story short -- do folks think 1) is a good idea? What
>> about some of the thoughts in 2)? In either case, what would be the best way
>> to go further on this?
>
> +1 for something like yielddir().
>
+1, although I would prefer it to be called something like iterdir().

> I while ago I proposed that the os module shall get another function for
> iterating over a directory. The new function is to return a generator
> that yields structs. The structs contain the name and additional
> metadata that like the d_type. On Unix it should use the reentrant
> version of readdir(), too.
>
> A struct has the benefit that it can grow additional fields or contain
> operating dependent information like inode on Unix.
>



From vinay_sajip at yahoo.co.uk  Fri Nov  9 21:30:10 2012
From: vinay_sajip at yahoo.co.uk (Vinay Sajip)
Date: Fri, 9 Nov 2012 20:30:10 +0000 (UTC)
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
	attributes from FindFirst/NextFile() and readdir()
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
Message-ID: <loom.20121109T212622-692@post.gmane.org>

Ben Hoyt <benhoyt at ...> writes:

> I've written a proof-of-concept (see [1] below) using ctypes and
> FindFirst/FindNext on Windows, showing that for sizeable directory trees it
> gives a 4x to 6x speedup -- so this is not a micro-optimization!
> 
> I started trying the same thing with opendir/readdir on Linux, but don't have
> as much experience there, and wanted to get some feedback on the concept
> first. I assume it'd be a similar speedup by using d_type & DT_DIR from
> readdir().
> 
> The problem is even worse when you're calling os.walk() and then doing your
> own stat() on each file, for example, to get the total size of all files in a
> tree -- see [2]. It means it's calling stat() twice on every file, and I see
> about a 9x speedup in this scenario using the info FindFirst/Next provide.

Sounds good. I recently answered a Stack Overflow question [1] which showed
Python performing an order of magnitude slower than Ruby. Ruby's Dir
implementation is written in C and less flexible than os.walk, but there's room
for improvement, as you've shown.

Regards,

Vinay Sajip


[1]
http://stackoverflow.com/questions/13138160/benchmarks-does-python-have-a-faster-way-of-walking-a-network-folder



From storchaka at gmail.com  Fri Nov  9 23:56:47 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 10 Nov 2012 00:56:47 +0200
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <loom.20121109T212622-692@post.gmane.org>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<loom.20121109T212622-692@post.gmane.org>
Message-ID: <k7k1nh$dln$1@ger.gmane.org>

On 09.11.12 22:30, Vinay Sajip wrote:
> Sounds good. I recently answered a Stack Overflow question [1] which showed
> Python performing an order of magnitude slower than Ruby. Ruby's Dir
> implementation is written in C and less flexible than os.walk, but there's room
> for improvement, as you've shown.

This is not so much about walking, as about recursive globbing.  See http://bugs.python.org/issue13968.

Also note that os.fwalk can be much faster than os.walk (if the OS supports it).



From sam-pydeas at rushing.nightmare.com  Sat Nov 10 04:06:44 2012
From: sam-pydeas at rushing.nightmare.com (Sam Rushing)
Date: Fri, 09 Nov 2012 19:06:44 -0800
Subject: [Python-ideas] CPS transform for Python
Message-ID: <509DC4C4.8000802@rushing.nightmare.com>

The discussion last week with Greg Ewing got me to thinking about the
CPS transform, and how it might be a useful technique for
callback/event-driven code like asyncore & Twisted.  I'm pretty sure
when I first thought about this eons ago it was before Python had
closures.  They definitely make it a bit easier!

I put some simple demo code together, I think it demonstrates that the
idea is feasible, I'm curious to know if anyone is interested.

Don't get hung up on the poor quality of the generated code, big
improvements could be made with a little bit of work.

https://github.com/samrushing/cps-python/

-Sam



From _ at lvh.cc  Sat Nov 10 12:57:57 2012
From: _ at lvh.cc (Laurens Van Houtven)
Date: Sat, 10 Nov 2012 12:57:57 +0100
Subject: [Python-ideas] CPS transform for Python
In-Reply-To: <509DC4C4.8000802@rushing.nightmare.com>
References: <509DC4C4.8000802@rushing.nightmare.com>
Message-ID: <CAE_Hg6bu_biCwPmr+JM0suCPXDHuAs_OJr1NYysyQt6v6x5QjA@mail.gmail.com>

The README suggests that doing this optimization on a bytecode level may be
better than doing it on a source/AST level. Can you explain why?


On Sat, Nov 10, 2012 at 4:06 AM, Sam Rushing <
sam-pydeas at rushing.nightmare.com> wrote:

> The discussion last week with Greg Ewing got me to thinking about the
> CPS transform, and how it might be a useful technique for
> callback/event-driven code like asyncore & Twisted.  I'm pretty sure
> when I first thought about this eons ago it was before Python had
> closures.  They definitely make it a bit easier!
>
> I put some simple demo code together, I think it demonstrates that the
> idea is feasible, I'm curious to know if anyone is interested.
>
> Don't get hung up on the poor quality of the generated code, big
> improvements could be made with a little bit of work.
>
> https://github.com/samrushing/cps-python/
>
> -Sam
>
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
>


-- 
cheers
lvh

From jeanpierreda at gmail.com  Sat Nov 10 16:50:26 2012
From: jeanpierreda at gmail.com (Devin Jeanpierre)
Date: Sat, 10 Nov 2012 10:50:26 -0500
Subject: [Python-ideas] CPS transform for Python
In-Reply-To: <CAE_Hg6bu_biCwPmr+JM0suCPXDHuAs_OJr1NYysyQt6v6x5QjA@mail.gmail.com>
References: <509DC4C4.8000802@rushing.nightmare.com>
	<CAE_Hg6bu_biCwPmr+JM0suCPXDHuAs_OJr1NYysyQt6v6x5QjA@mail.gmail.com>
Message-ID: <CABicbJJ0PdSQv==UqEgc+FzaT2xZbmui26T=PBwo48_3b2K5DQ@mail.gmail.com>

On Sat, Nov 10, 2012 at 6:57 AM, Laurens Van Houtven <_ at lvh.cc> wrote:
> The README suggests that doing this optimization on a bytecode level may be
> better than doing it on a source/AST level. Can you explain why?

I would suggest it's because CPS directly represents control flow, as
does bytecode (which has explicit jumps and so on). ASTs represent
control flow indirectly, in that there are more constructs that can do
more varied forms of jumping around. I think that it's often harder to
write program analysis or transformation on syntax trees than on
bytecode or control flow graphs, but that's from my own experience
doing such, and may not be true for some sorts of situations.

In particular there's a directly analogous CPS form for bytecode,
where every bytecode instruction in a bytecode string is considered to
be equivalent to a CPS function, and continuations represent returns
from functions and not expression evaluation, and the CPS functions
just navigate through the bytecode stream using tail calls, while
keeping track of the expression evaluation stack and the call stack
and globals dict.

Some examples follow here (I'm doing purely functional stuff because I'm
too tired to think through the effects of mutation). I'm obviously not
trying to be completely correct here, just trying to get the idea
across that bytecode and CPS are more closely linked than AST and CPS.
Note that exception support is entirely missing, and needs its own
stack to keep track of exception handlers.

A[i] = JUMP_FORWARD(d)
def B[i](continuation, stack, callstack, globals):
    B[i + d](continuation, stack, callstack, globals)

A[i] = BINARY_MULTIPLY()
def B[i](continuation, stack, callstack, globals):
    B[i+1](continuation, stack[:-2] + [stack[-2] * stack[-1]],
callstack, globals)

A[i] = LOAD_FAST(local)
def B[i](continuation, stack, callstack, globals):
    B[i+1](continuation, stack + [callstack[-1][local]], callstack, globals)

A[i] = RETURN_VALUE()
def B[i](continuation, stack, callstack, globals):
    continuation(stack[-1])

A[i] = CALL_FUNCTION(argc)
def B[i](continuation, stack, callstack, globals):
    f = stack[-1]
    stack = stack[:-1]
    args = stack[-argc:]
    stack = stack[:-argc]
    # let's please pretend getcallargs does the right thing
    # (it doesn't.)
    locals = inspect.getcallargs(f, *args)
    # get where f is in the bytecode (magic pseudocode)
    jump_location = f.magic_bytecode_location
    B[jump_location](
        lambda returnvalue: B[i+1](
            continuation,
            stack + [returnvalue],
            callstack,
            globals),
        [],
        callstack + [locals], # new locals dict built by getcallargs above
        f.func_globals)

and so on.

At least, I think I have that right.

-- Devin


From sam-pydeas at rushing.nightmare.com  Sat Nov 10 21:11:40 2012
From: sam-pydeas at rushing.nightmare.com (Sam Rushing)
Date: Sat, 10 Nov 2012 12:11:40 -0800
Subject: [Python-ideas] CPS transform for Python
In-Reply-To: <CABicbJJ0PdSQv==UqEgc+FzaT2xZbmui26T=PBwo48_3b2K5DQ@mail.gmail.com>
References: <509DC4C4.8000802@rushing.nightmare.com>
	<CAE_Hg6bu_biCwPmr+JM0suCPXDHuAs_OJr1NYysyQt6v6x5QjA@mail.gmail.com>
	<CABicbJJ0PdSQv==UqEgc+FzaT2xZbmui26T=PBwo48_3b2K5DQ@mail.gmail.com>
Message-ID: <509EB4FC.6060908@rushing.nightmare.com>

On 11/10/12 7:50 AM, Devin Jeanpierre wrote:
> On Sat, Nov 10, 2012 at 6:57 AM, Laurens Van Houtven <_ at lvh.cc> wrote:
>> The README suggests that doing this optimization on a bytecode level may be
>> better than doing it on a source/AST level. Can you explain why?
> [...]
>
> In particular there's a directly analogous CPS form for bytecode,
> where every bytecode instruction in a bytecode string is considered to
> be equivalent to a CPS function, and continuations represent returns
> from functions and not expression evaluation, and the CPS functions
> just navigate through the bytecode stream using tail calls, while
> keeping track of the expression evaluation stack and the call stack
> and globals dict.

Right... looking at an example output from the transform:

    v13 = n
    v14 = 1
    v4 = v13 == v14
    if v4:
       ... 


Viewed in CPS this might look like:

(VAR, [v13, (INT, [v14, ((EQ, v13, v14), TEST, ...)])])

Where each node is (EXP, CONT).  In this case the result of each
operation is put into a variable/register (e.g., 'v13'), but python's
bytecodes actually operate on the frame stack.  So if there were some
way to change this to

(VAR, [PUSH, (INT, [PUSH, ((EQ 0 1)), TEST, ...)])])

Where (EQ 0 1) means 'apply EQ to the top two items on the stack'.

The code above puts each value into a local variable, which gets pushed
onto the stack anyway by the compiled bytecode.

Another advantage to generating bytecode directly would be support for
python 2, since I think 'nonlocal' can be done at the bytecode level.

-Sam


From benhoyt at gmail.com  Mon Nov 12 10:17:41 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 12 Nov 2012 22:17:41 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
Message-ID: <CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>

It seems many folks think that an os.iterdir() is a good idea, and
some that agree that something like os.iterdir_stat() for efficient
directory traversal + stat combination is a good idea. And if we get a
faster os.walk() for free, that's great too. :-)

Nick Coghlan mentioned his walkdir and Antoine's pathlib. While I
think these are good third-party libraries, I admit I'm not the
biggest fan of either of their APIs. HOWEVER, mainly I think that the
stdlib's os.listdir() and os.walk() aren't going away anytime soon, so
we might as well make incremental (though significant) improvements to
them in the meantime.

So I'm going to propose a couple of minimally-invasive changes (API-
wise), in what I think is order of importance, highest to lowest:

1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
on Windows by not calling stat() on each file. And on Linux/BSD this
same data is available from readdir()'s dirent, so I presume there's
be a similar speedup, though it may not be quite 5x.

2) I also propose adding os.iterdir(path='.') to do exactly the same
thing as os.listdir(), but yield the results as it gets them instead
of returning the whole list at once.

3) Partly for implementing the more efficient walk(), but also for
general use, I propose adding os.iterdir_stat() which would be like
iterdir but yield (filename, stat) tuples. If stat-while-iterating
isn't available on the system, the stat item would be None. If it is
available, the stat_result fields that the OS presents would be
available -- the other fields would be None. In practice,
iterdir_stat() would call FindFirst/Next on Windows and readdir_r on
Linux/BSD/Mac OS X, and be implemented in posixmodule.c.

This means that on Linux/BSD/Mac OS X it'd return a stat_result with
st_mode set but the other fields None, on Windows it'd basically
return the full stat_result, and on other systems it'd return
(filename, None).

The usage pattern (and exactly how os.walk would use it) would be as
follows:

    for filename, st in os.iterdir_stat(path):
        if st is None or st.st_mode is None:
            st = os.stat(os.path.join(path, filename))
        if stat.S_ISDIR(st.st_mode):
            # handle directory
        else:
            # handle file

I'm very keen on 1). And I think adding 2) and 3) makes sense, because
they're (a) asked for by various folks, (b) fairly simple and self-
explanatory APIs, and (c) they'll be needed to implement the faster
os.walk() anyway.
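
For what it's worth, the "summing sizes" example from my first mail would
then look something like this (sketch only -- iterdir_stat() is the proposed
function, nothing that exists yet, and fields the OS can't provide from the
directory entry would come back as None):

    def total_size(top):
        total = 0
        for filename, st in os.iterdir_stat(top):
            full = os.path.join(top, filename)
            if st is None or st.st_mode is None or st.st_size is None:
                st = os.stat(full)
            if stat.S_ISDIR(st.st_mode):
                total += total_size(full)
            else:
                total += st.st_size
        return total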

Thoughts? What's the next step? If I come up with a patch against
posixmodule.c, tests, etc, is this likely to be accepted? I could
also flesh out my pure-Python proof of concept [1] to do what I'm
suggesting above and go from there...

Thanks,
Ben

[1] https://gist.github.com/4044946

On Sat, Nov 10, 2012 at 1:23 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Fri, Nov 9, 2012 at 8:29 PM, Ben Hoyt <benhoyt at gmail.com> wrote:
>>
>> Anyway, cutting a long story short -- do folks think 1) is a good idea?
>> What
>> about some of the thoughts in 2)? In either case, what would be the best
>> way
>> to go further on this?
>
>
> It's even worse when you add NFS (and other network filesystems) into the
> mix, so yes, +1 on devising a more efficient API design for bulk stat
> retrieval than the current listdir+explicit-stat approach that can lead to
> an excessive number of network round trips.
>
> It's a complex enough idea that it definitely needs some iteration outside
> the stdlib before it could be added, though.
>
> You could either start exploring this as a new project, or else if you
> wanted to fork my walkdir project on BitBucket I'd be interested in
> reviewing any pull requests you made along those lines - redundant stat
> calls are currently one of the issues with using walkdir for more complex
> tasks. (However you decide to proceed, you'll need to set things up to build
> an extension module, though - walkdir is pure Python at this point).
>
> Another alternative you may want to explore is whether or not Antoine Pitrou
> would be interested in adding such a capability to his pathlib module.
> pathlib already includes stat result caching in Path objects, and thus may
> be able to support a clean API for returning path details with the stat
> results precached.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From ncoghlan at gmail.com  Mon Nov 12 12:37:20 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 12 Nov 2012 21:37:20 +1000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
Message-ID: <CADiSq7dUo_p+1af8Bs=Znhq5Zmr657JP=z_O6uzLC-yAWptqiA@mail.gmail.com>

On Mon, Nov 12, 2012 at 7:17 PM, Ben Hoyt <benhoyt at gmail.com> wrote:

> Thoughts? What's the next step? If I come up with a patch against
> posixmodule.c, tests, etc, is this likely to be accepted? I could
> also flesh out my pure-Python proof of concept [1] to do what I'm
> suggesting above and go from there...
>

The issue with patching the stdlib directly rather than releasing something
on PyPI is that you likely won't get any design or usability feedback until
the first 3.4 alpha, unless it happens to catch the interest of someone
willing to tinker with a patched version earlier than that.

It only takes one or two savvy users to get solid feedback by publishing
something on PyPI (that's all I got for contextlb2, and the design of
contextlib.ExitStack in 3.3 benefited greatly from the process). Just the
discipline of writing docs, tests and giving people a rationale for
downloading your module can help a great deal with making the case for the
subsequent stdlib change.

The reason I suggested walkdir as a possible venue is that I think your
idea here may help with some of walkdir's *other* API design problems (of
which there are quite a few, which is why I stopped pushing for it as a
stdlib addition in its current state - it has too many drawbacks to be
consistently superior to rolling your own custom solution)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From python at mrabarnett.plus.com  Mon Nov 12 18:43:15 2012
From: python at mrabarnett.plus.com (MRAB)
Date: Mon, 12 Nov 2012 17:43:15 +0000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
Message-ID: <50A13533.8090807@mrabarnett.plus.com>

On 2012-11-12 09:17, Ben Hoyt wrote:
> It seems many folks think that an os.iterdir() is a good idea, and
> some that agree that something like os.iterdir_stat() for efficient
> directory traversal + stat combination is a good idea. And if we get a
> faster os.walk() for free, that's great too. :-)
>
> Nick Coughlan mentioned his walkdir and Antoine's pathlib. While I
> think these are good third-party libraries, I admit I'm not the
> biggest fan of either of their APIs. HOWEVER, mainly I think that the
> stdlib's os.listdir() and os.walk() aren't going away anytime soon, so
> we might as well make incremental (though significant) improvements to
> them in the meantime.
>
> So I'm going to propose a couple of minimally-invasive changes (API-
> wise), in what I think is order of importance, highest to lowest:
>
> 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
> on Windows by not calling stat() on each file. And on Linux/BSD this
> same data is available from readdir()'s dirent, so I presume there's
> be a similar speedup, though it may not be quite 5x.
>
> 2) I also propose adding os.iterdir(path='.') to do exactly the same
> thing as os.listdir(), but yield the results as it gets them instead
> of returning the whole list at once.
>
> 3) Partly for implementing the more efficient walk(), but also for
> general use, I propose adding os.iterdir_stat() which would be like
> iterdir but yield (filename, stat) tuples. If stat-while-iterating
> isn't available on the system, the stat item would be None. If it is
> available, the stat_result fields that the OS presents would be
> available -- the other fields would be None. In practice,
> iterdir_stat() would call FindFirst/Next on Windows and readdir_r on
> Linux/BSD/Mac OS X, and be implemented in posixmodule.c.
>
> This means that on Linux/BSD/Mac OS X it'd return a stat_result with
> st_mode set but the other fields None, on Windows it'd basically
> return the full stat_result, and on other systems it'd return
> (filename, None).
>
> The usage pattern (and exactly how os.walk would use it) would be as
> follows:
>
>      for filename, st in os.iterdir_stat(path):
>          if st is None or st.st_mode is None:
>              st = os.stat(os.path.join(path, filename))
>          if stat.S_ISDIR(st.st_mode):
>              # handle directory
>          else:
>              # handle file
>
I'm not sure that I like "st is None or st.st_mode is None".

You say that if a stat field is not available, it's None.

That being the case, if no stat fields are available at all, couldn't a
stat_result still be returned with all of its fields set to None?

That would lead to just "st.st_mode is None".

> I'm very keen on 1). And I think adding 2) and 3) make sense, because
> they're (a) asked for by various folks, (b) fairly simple and self-
> explanatory APIs, and (c) they'll be needed to implement the faster
> os.walk() anyway.
>
> Thoughts? What's the next step? If I come up with a patch against
> posixmodule.c, tests, etc, is this likely to be accepted? I could
> also flesh out my pure-Python proof of concept [1] to do what I'm
> suggesting above and go from there...
>
[snip]




From benhoyt at gmail.com  Mon Nov 12 21:55:19 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Tue, 13 Nov 2012 09:55:19 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CADiSq7dUo_p+1af8Bs=Znhq5Zmr657JP=z_O6uzLC-yAWptqiA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CADiSq7dUo_p+1af8Bs=Znhq5Zmr657JP=z_O6uzLC-yAWptqiA@mail.gmail.com>
Message-ID: <CAL9jXCHuxvNnT-JEkWHF_ZNLPMDsRHywxmkmuK39u6G4ju7XZA@mail.gmail.com>

> The issue with patching the stdlib directly rather than releasing something
> on PyPI is that you likely won't get any design or usability feedback ...
> ... Just the
> discipline of writing docs, tests and giving people a rationale for
> downloading your module can help a great deal with making the case for the
> subsequent stdlib change.

Yes, those are good points. I'll see about making a "betterwalk" or similar
module and releasing it on PyPI.

-Ben


From benhoyt at gmail.com  Mon Nov 12 21:58:49 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Tue, 13 Nov 2012 09:58:49 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
Message-ID: <CAL9jXCGJbCzxxo9jzmm4TcD1YFw8mNLVXVqKSHb0j2BcWW-ugA@mail.gmail.com>

MRAB said:
> I'm not sure that I like "st is None or st.st_mode is None".
> You say that if a stat field is not available, it's None.
> That being the case, if no stat fields are available, couldn't their
> fields be None?
> That would lead to just "st.st_mode is None".

Ah, yes, that's a good idea. That would simplify still further.

-Ben


From ronaldoussoren at mac.com  Tue Nov 13 08:07:32 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 13 Nov 2012 08:07:32 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
Message-ID: <4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>


On 12 Nov, 2012, at 10:17, Ben Hoyt <benhoyt at gmail.com> wrote:
> 
> 
> This means that on Linux/BSD/Mac OS X it'd return a stat_result with
> st_mode set but the other fields None, on Windows it'd basically
> return the full stat_result, and on other systems it'd return
> (filename, None).

Where would st_mode be retrieved from?  The readdir(3) interface
only provides d_type (and that field is not in POSIX or SUS).

The d_type field contains a file type, and while you could use that
to construct a value for st_mode that can be used to test the file
type, you cannot reconstruct the file permissions from that.
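
(The mapping in question would be along these lines -- note that it can only
ever fill in the file-type bits of st_mode, never the permission bits:)

    import stat

    # the usual dirent.h d_type values; DT_UNKNOWN (0) carries no information
    D_TYPE_TO_MODE = {
        1: stat.S_IFIFO,    # DT_FIFO
        2: stat.S_IFCHR,    # DT_CHR
        4: stat.S_IFDIR,    # DT_DIR
        6: stat.S_IFBLK,    # DT_BLK
        8: stat.S_IFREG,    # DT_REG
        10: stat.S_IFLNK,   # DT_LNK
        12: stat.S_IFSOCK,  # DT_SOCK
    }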

Ronald




From benhoyt at gmail.com  Tue Nov 13 08:13:23 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Tue, 13 Nov 2012 20:13:23 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
Message-ID: <CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>

> Where would st_mode be retrieved from?  The readdir(3) interface
> only provides d_type (and that field is not in POSIX or SUS).
>
> The d_type field contains a file type, and while you could use that
> to construct a value for st_mode that can be used to test the file
> type, you cannot reconstruct the file permissions from that.

Yes, you're right. The amount of information in st_mode would be
implementation dependent. I don't see a huge problem with that -- it's
already true for os.stat(), because on Windows stat()'s st_mode
contains a much more limited subset of info than on POSIX systems.

-Ben


From ronaldoussoren at mac.com  Tue Nov 13 10:06:30 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 13 Nov 2012 10:06:30 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
Message-ID: <7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>


On 13 Nov, 2012, at 8:13, Ben Hoyt <benhoyt at gmail.com> wrote:

>> Where would st_mode be retrieved from?  The readdir(3) interface
>> only provides d_type (and that field is not in POSIX or SUS).
>> 
>> The d_type field contains a file type, and while you could use that
>> to construct a value for st_mode that can be used to test the file
>> type, you cannot reconstruct the file permissions from that.
> 
> Yes, you're right. The amount of information in st_mode would be
> implementation dependent. I don't see a huge problem with that -- it's
> already true for os.stat(), because on Windows stat()'s st_mode
> contains a much more limited subset of info than on POSIX systems.

It would be very odd to have an st_mode that contains a subset of 
the information the platform can provide. In particular having st_mode
would give the impression that it is the full mode.

Why not return the information that can be cheaply provided, is useful
and can be provided by major platforms (at least Windows, Linux and
OS X)?  And "useful" would be information for which there is a clear 
usecase, such as the filetype as it can significantly speed up os.walk.

OTOH, FindNextFile on Windows can return most information in struct stat.

Ronald



From benhoyt at gmail.com  Tue Nov 13 21:00:10 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Wed, 14 Nov 2012 09:00:10 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
Message-ID: <CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>

> It would be very odd to have an st_mode that contains a subset of
> the information the platform can provide. In particular having st_mode
> would give the impression that it is the full mode.

Yes, it's slightly odd, but not as odd as you'd think. This is
especially true for Windows users, because we're used to st_mode only
being a subset of the information -- the permission bits are basically
meaningless on Windows.

The alternative is to introduce yet another new tuple/struct with
"type size atime ctime mtime" fields. But you still have to specify
that it's implementation dependent (Linux/BSD only provides type,
Windows provides all those fields), and then you have to have ways of
testing what type the type is. stat_result and the stat module already
give you those things, which is why I think it's best to stick with
the stat_result structure.

In terms of what's useful, certainly "type" and "size" are, so you may
as well throw in atime/ctime/mtime, which Windows also gives us for
free.

-Ben


From random832 at fastmail.us  Wed Nov 14 00:28:27 2012
From: random832 at fastmail.us (Random832)
Date: Tue, 13 Nov 2012 18:28:27 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
Message-ID: <50A2D79B.6020803@fastmail.us>

My very first post, and I screw up the reply to list option. Sorry about 
that.

On 11/13/2012 2:07 AM, Ronald Oussoren wrote:
> Where would st_mode be retrieved from?  The readdir(3) interface
> only provides d_type (and that field is not in POSIX or SUS).
>
> The d_type field contains a file type, and while you could use that
> to construct a value for st_mode that can be used to test the file
> type, you cannot reconstruct the file permissions from that.

I think he had the idea that it would just return this incomplete
st_mode, and you'd have to deal with it. But maybe the solution is to
add an st_type property to stat_result that returns either this or the
appropriate bits from st_mode. Is there a reason these are a single
field, other than the historical artifact that they are/were packed into
a single 16-bit field in UNIX?

Note that according to the documentation not all filesystems on linux
support d_type, either. You have to be prepared for the possibility of
getting DT_UNKNOWN





From random832 at fastmail.us  Wed Nov 14 00:50:23 2012
From: random832 at fastmail.us (Random832)
Date: Tue, 13 Nov 2012 18:50:23 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <50A2D79B.6020803@fastmail.us>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<50A2D79B.6020803@fastmail.us>
Message-ID: <50A2DCBF.5050109@fastmail.us>

P.S. Something else occurs to me.

In the presence of reparse points (i.e. symbolic links) on win32, I 
believe information about whether the destination is meant to be a 
directory is still provided (I haven't confirmed this, but I know you're 
required to provide it when making a symlink). This is discarded when 
the st_mode field is populated with the information that it is a 
symlink. If the goal is "speed up os.walk", it might be worth keeping 
this information and using it in os.walk(..., followlinks=True) - maybe 
the Windows version of the stat result has a field for the Windows 
attributes?

It's arguable, though, that symbolic links on Windows are rare enough 
not to matter.


From ronaldoussoren at mac.com  Wed Nov 14 08:14:52 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Wed, 14 Nov 2012 08:14:52 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
Message-ID: <2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>


On 13 Nov, 2012, at 21:00, Ben Hoyt <benhoyt at gmail.com> wrote:

>> It would be very odd to have an st_mode that contains a subset of
>> the information the platform can provide. In particular having st_mode
>> would give the impression that it is the full mode.
> 
> Yes, it's slightly odd, but not as odd as you'd think. This is
> especially true for Windows users, because we're used to st_mode only
> being a subset of the information -- the permission bits are basically
> meaningless on Windows.

That's one more reason for returning a new tuple/struct with a type field:
the full st_mode is not useful on Windows, and on Unix readdir doesn't
return a full st_mode in the first place.

> 
> The alternative is to introduce yet another new tuple/struct with
> "type size atime ctime mtime" fields. But you still have to specify
> that it's implementation dependent (Linux/BSD only provides type,
> Windows provides all those fields), and then you have to have ways of
> testing what type the type is. stat_result and the stat module already
> give you those things, which is why I think it's best to stick with
> the stat_result structure.

The interface of the stat module for determining the file type is not very
pretty.

> 
> In terms of what's useful, certainly "type" and "size" are, so you may
> as well throw in atime/ctime/mtime, which Windows also gives us for
> free.

How did you measure the 5x speedup you saw with your modified os.walk?

It would be interesting to see if Unix platforms have a similar speedup, because
if they don't the new API could just return the results of stat (or lstat ...).

Ronald



From benhoyt at gmail.com  Wed Nov 14 08:22:40 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Wed, 14 Nov 2012 20:22:40 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
Message-ID: <CAL9jXCEiSqR00fRePO=62nrSg=_eZEswfKXt26iYpjzkDQ1KHw@mail.gmail.com>

>> Yes, it's slightly odd, but not as odd as you'd think. This is
>> especially true for Windows users, because we're used to st_mode only
>> being a subset of the information -- the permission bits are basically
>> meaningless on Windows.
>
> That's one more reason for returning a new tuple/struct with a type field:
> the full st_mode is not useful on Windows, and on Unix readdir doesn't
> return a full st_mode in the first place.

Hmmm, I'm not sure I agree: st_mode from the new iterdir_stat() will
be as useful as that currently returned by os.stat(), and it is very
useful (mainly for checking whether an entry is a directory or not).
You're right that it won't return a full st_mode on Linux/BSD, but I
think it's better for folks to use the existing "if
stat.S_ISDIR(st.st_mode): ..." idiom than introduce a new thing.
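
For what it's worth, a sketch of the intended usage (iterdir_stat is
the proposed function, not an existing API; on Linux/BSD st_mode may
carry only the type bits filled in from d_type):

    import stat

    for name, st in iterdir_stat('/some/dir'):
        if stat.S_ISDIR(st.st_mode):
            print('directory:', name)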

> How did you measure the 5x speedup you saw with your modified os.walk?

Just by os.walk()ing through a large directory tree with basically
nothing in the inner loop, and comparing that to the same thing with
my version.
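
Roughly this kind of timing loop, run twice so the second pass hits the
OS cache (the path and the walker being compared are placeholders):

    import os
    import time

    start = time.time()
    count = 0
    for root, dirs, files in os.walk(r'C:\some\large\tree'):
        count += len(files)
    print(count, 'files listed in', time.time() - start, 'seconds')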

> It would be interesting to see if Unix platforms have a similar speedup, because
> if they don't the new API could just return the results of stat (or lstat ...).

Yeah, true. I'll do that and post results in the next few days when I
get it done. I'm *hoping* for a similar speedup there too, given the
reduction in system calls, but you never know till you benchmark ...
maybe system calls are much faster on Linux, or stat() is cached
better or whatever.

-Ben


From ncoghlan at gmail.com  Wed Nov 14 10:33:03 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 14 Nov 2012 19:33:03 +1000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
Message-ID: <CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>

On Wed, Nov 14, 2012 at 5:14 PM, Ronald Oussoren <ronaldoussoren at mac.com> wrote:

> How did you measure the 5x speedup you saw with your modified os.walk?
>
> It would be interesting to see if Unix platforms have a similar speedup,
> because
> if they don't the new API could just return the results of stat (or lstat
> ...).
>
>
One thing to keep in mind with these kinds of metrics is that I/O latency is
a major factor. Solid state vs spinning disk vs network drive is going to
make a *big* difference to the relative performance of the different
mechanisms. With NFS (et al), it's particularly important to minimise the
number of round trips to the server (that's why the new dir listing caching
in the 3.3 import system results in such dramatic speed-ups when some of
the sys.path entries are located on network drives).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121114/257a1a03/attachment.html>

From robertc at robertcollins.net  Wed Nov 14 10:53:44 2012
From: robertc at robertcollins.net (Robert Collins)
Date: Wed, 14 Nov 2012 22:53:44 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
Message-ID: <CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>

On Wed, Nov 14, 2012 at 10:33 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Wed, Nov 14, 2012 at 5:14 PM, Ronald Oussoren <ronaldoussoren at mac.com>
> wrote:
>>
>> How did you measure the 5x speedup you saw with your modified os.walk?
>>
>> It would be interesting to see if Unix platforms have a similar speedup,
>> because
>> if they don't the new API could just return the results of stat (or lstat
>> ...).
>>
>
> One thing to keep in mind with these kinds of metrics is that I/O latency is
> a major factor. Solid state vs spinning disk vs network drive is going to
> make a *big* difference to the relative performance of the different
> mechanisms. With NFS (et al), it's particularly important to minimise the
> number of round trips to the server (that's why the new dir listing caching
> in the 3.3 import system results in such dramatic speed-ups when some of the
> sys.path entries are located on network drives).


Data from bzr:
 you can get a very significant speed up by doing two things:
 - use readdir to get the inode numbers of the files in the directory
and stat the files in-increasing-number-order. (this gives you
monotonically increasing IO).
 - chdir to the directory before you stat and use a relative path: it
turns out when working with many files that the overhead of absolute
paths is substantial.

We got (IIRC) a 90% reduction in 'bzr status' time by applying both of
these things, and you can grab the pyrex module needed to do readdir
from bzr - though we tuned what we had to match the needs of a VCS, so
it's likely too convoluted for general purpose use.
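
A rough sketch of both tricks (readdir_inodes() is a hypothetical
helper returning (name, inode) pairs, since the stdlib doesn't expose
d_ino; note also that chdir() is process-global, as discussed below --
bzr's real code is pyrex and far more involved):

    import os

    def stat_entries(dirpath):
        # stat in increasing inode order, via short relative names
        entries = sorted(readdir_inodes(dirpath), key=lambda e: e[1])
        old = os.getcwd()
        os.chdir(dirpath)
        try:
            return [(name, os.lstat(name)) for name, _ino in entries]
        finally:
            os.chdir(old)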

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services


From benhoyt at gmail.com  Wed Nov 14 11:08:57 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Wed, 14 Nov 2012 23:08:57 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
Message-ID: <CAL9jXCGnQpR4K7t9PG3BcxD14DeOX2bv_SEoOWMK3JJ2ZAzUOQ@mail.gmail.com>

> One thing to keep in mind with these kinds of metrics is that I/O latency is
> a major factor. Solid state vs spinning disk vs network drive is going to
> make a *big* difference to the relative performance of the different
> mechanisms.

Yeah, you're right. The benchmarks I've been doing are only stable
after the first run, when I suppose everything is cached and you've
taken most of the I/O latency out of the picture. I guess this also
means the speed-up for the first run won't be nearly as significant.
But I'll do some further testing.

-Ben


From benhoyt at gmail.com  Wed Nov 14 11:15:07 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Wed, 14 Nov 2012 23:15:07 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
Message-ID: <CAL9jXCG-3+KnrTdrAbZkXB=NQu1N-+25DGZrihvyiMwjWVq0jA@mail.gmail.com>

> Data from bzr:
>  you can get a very significant speed up by doing two things:
>  - use readdir to get the inode numbers of the files in the directory
> and stat the files in-increasing-number-order. (this gives you
> monotonically increasing IO).
>  - chdir to the directory before you stat and use a relative path: it
> turns out when working with many files that the overhead of absolute
> paths is substantial.

Huh, very interesting, thanks. On the first point, did you need to
separately stat() the files after the readdir()? Presumably you needed
information other than the info in the d_type field from readdir's
dirent struct.

> We got (IIRC) a 90% reduction in 'bzr status' time by applying both of
> these things, and you can grab the pyrex module needed to do readdir
> from bzr - though we tuned what we had to match the needs of a VCS, so
> it's likely too convoluted for general purpose use.

Do you have a web link to said source code? I'm having trouble (read:
being lazy) figuring out the bzr source repo.

-Ben


From storchaka at gmail.com  Wed Nov 14 11:54:07 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Wed, 14 Nov 2012 12:54:07 +0200
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
Message-ID: <k7vt8h$jau$1@ger.gmane.org>

On 14.11.12 11:53, Robert Collins wrote:
>   - chdir to the directory before you stat and use a relative path: it
> turns out when working with many files that the overhead of absolute
> paths is substantial.

Look at fwalk(). It achieves the same benefits without changing the process-global cwd (i.e. it is thread-safe).
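
A minimal sketch of that pattern (assuming Python 3.3's os.fwalk(),
which yields a directory file descriptor alongside the names):

    import os
    import stat

    total = 0
    for dirpath, dirnames, filenames, dirfd in os.fwalk('/usr/include'):
        for name in filenames:
            # stat by short name, relative to the open directory fd
            st = os.stat(name, dir_fd=dirfd, follow_symlinks=False)
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    print(total, 'bytes')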




From solipsis at pitrou.net  Wed Nov 14 11:52:04 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 14 Nov 2012 11:52:04 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
Message-ID: <20121114115204.4a4b98d3@pitrou.net>

Le Wed, 14 Nov 2012 22:53:44 +1300,
Robert Collins
<robertc at robertcollins.net> a écrit :
> 
> Data from bzr:
>  you can get a very significant speed up by doing two things:
>  - use readdir to get the inode numbers of the files in the directory
> and stat the files in-increasing-number-order. (this gives you
> monotonically increasing IO).

This assumes directory entries are sorted by inode number (in a btree,
I imagine). Is this assumption specific to some Linux / Ubuntu
filesystem?

>  - chdir to the directory before you stat and use a relative path: it
> turns out when working with many files that the overhead of absolute
> paths is substantial.

How about using fstatat() instead? chdir() is a no-no since it's
a process-wide setting.

Regards

Antoine.




From mwm at mired.org  Wed Nov 14 13:16:13 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 14 Nov 2012 06:16:13 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <20121114115204.4a4b98d3@pitrou.net>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
Message-ID: <CAD=7U2CvPMhVJzy_trgV2X9denDQJJ-=SeCXF4Xq7Ggfi-Jx_A@mail.gmail.com>

On Nov 14, 2012 4:55 AM, "Antoine Pitrou" <solipsis at pitrou.net> wrote:
>
> Le Wed, 14 Nov 2012 22:53:44 +1300,
> Robert Collins
> <robertc at robertcollins.net> a écrit :
> >
> > Data from bzr:
> >  you can get a very significant speed up by doing two things:
> >  - use readdir to get the inode numbers of the files in the directory
> > and stat the files in-increasing-number-order. (this gives you
> > monotonically increasing IO).
>
> This assumes directory entries are sorted by inode number (in a btree,
> I imagine). Is this assumption specific to some Linux / Ubuntu
> filesystem?

It doesn't assume that, because inodes aren't stored in directories on
Posix file systems. Instead, directories hold names and inode numbers.
The inode (which is where the data stat returns lives) is stored
elsewhere in the file system, typically in on-disk arrays indexed by
inode number (and that's grossly oversimplified).

I'm not sure how this would work on modern file systems (zfs, btrfs).

     <mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121114/1d0cb039/attachment.html>

From eliben at gmail.com  Wed Nov 14 14:57:37 2012
From: eliben at gmail.com (Eli Bendersky)
Date: Wed, 14 Nov 2012 05:57:37 -0800
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <509D2EF0.8010209@python.org>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
Message-ID: <CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>

On Fri, Nov 9, 2012 at 8:27 AM, Christian Heimes <christian at python.org> wrote:

> Am 09.11.2012 09:19, schrieb M.-A. Lemburg:
> > Sounds like a good idea. I'd be interested in this, because it would
> > make debugging user installation problems easier.
> >
> > The only thing I'm not sure about is the option character "-I". It
> > reminds me too much of the -I typically used for include paths
> > in C compilers :-)
>
> I'm open to suggestions for a better name and character. Michael also
> pointed out that capital i (india) can look like a lower case l (lima).
> -R is still unused. I hesitate to call it restricted mode because it can
> be confused with PyPy's restricted Python.
>

Why does it have to be a single letter? Many tools today demand fully
spelled out command-line flags for readability, and this also helps avoid
clashes and cryptic flags no one remembers anyway. So if you want isolated
mode, what's wrong with "--isolated"?

Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121114/ad13dbc2/attachment.html>

From random832 at fastmail.us  Wed Nov 14 19:20:31 2012
From: random832 at fastmail.us (random832 at fastmail.us)
Date: Wed, 14 Nov 2012 13:20:31 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <20121114115204.4a4b98d3@pitrou.net>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
Message-ID: <1352917231.6612.140661153656349.08D119BE@webmail.messagingengine.com>

On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
> This assumes directory entries are sorted by inode number (in a btree,
> I imagine). Is this assumption specific to some Linux / Ubuntu
> filesystem?

I think he was proposing listing the whole directory in advance (which
os.walk already does), sorting it, and then looping over it calling
stat. If the idea is for an API that exposes more information returned
by readdir, though, why not get d_type too when it's available?

> >  - chdir to the directory before you stat and use a relative path: it
> > turns out when working with many files that the overhead of absolute
> > paths is substantial.
> 
> How about using fstatat() instead? chdir() is a no-no since it's
> a process-wide setting.

A) is fstatat even implemented in python?
B) is fstatat even possible under windows?
C) using *at functions for this the usual way incurs overhead in the
form of having to maintain a number of open file handles equal to the
depth of your directory. IIRC, some gnu tools will fork the process to
avoid this limit. Though, since we're not doing this for security
reasons we could fall back on absolute [or deep relative] paths or
reopen '..' to ascend back up, instead.
D) You have to close those handles eventually. What if the caller
doesn't finish the generator?


From solipsis at pitrou.net  Wed Nov 14 19:43:31 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 14 Nov 2012 19:43:31 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
	<1352917231.6612.140661153656349.08D119BE@webmail.messagingengine.com>
Message-ID: <20121114194331.79386a1c@pitrou.net>

On Wed, 14 Nov 2012 13:20:31 -0500
random832 at fastmail.us wrote:
> On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
> > This assumes directory entries are sorted by inode number (in a btree,
> > I imagine). Is this assumption specific to some Linux / Ubuntu
> > filesystem?
> 
> I think he was proposing listing the whole directory in advance (which
> os.walk already does), sorting it, and then looping over it calling
> stat.

But I don't understand why sorting (by inode? by name?) would make
stat() calls faster. That's what I'm trying to understand.

> If the idea is for an API that exposes more information returned
> by readdir, though, why not get d_type too when it's available?
> 
> > >  - chdir to the directory before you stat and use a relative path: it
> > > turns out when working with many files that the overhead of absolute
> > > paths is substantial.
> > 
> > How about using fstatat() instead? chdir() is a no-no since it's
> > a process-wide setting.
> 
> A) is fstatat even implemented in python?

Yup, it's available as a special parameter to os.stat():
http://docs.python.org/dev/library/os.html#os.stat
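
A minimal sketch of that form, assuming Python 3.3's dir_fd support
(stat() then sees only the short entry name, relative to the open
directory, rather than a long absolute path):

    import os

    fd = os.open('/var/log', os.O_RDONLY)
    try:
        stats = {name: os.stat(name, dir_fd=fd, follow_symlinks=False)
                 for name in os.listdir('/var/log')}
    finally:
        os.close(fd)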

> B) is fstatat even possible under windows?

No, but Windows has its own functions to solve the issue (as explained
elsewhere in this thread).

> C) using *at functions for this the usual way incurs overhead in the
> form of having to maintain a number of open file handles equal to the
> depth of your directory.

Indeed. But directory trees are usually much wider than they are deep.

Regards

Antoine.




From robertc at robertcollins.net  Wed Nov 14 19:52:12 2012
From: robertc at robertcollins.net (Robert Collins)
Date: Thu, 15 Nov 2012 07:52:12 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <k7vt8h$jau$1@ger.gmane.org>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<k7vt8h$jau$1@ger.gmane.org>
Message-ID: <CAJ3HoZ3f9bdd5uVa7ko-JNpsFC+YuBM4XNpA-jesCx06tHn_vA@mail.gmail.com>

On Wed, Nov 14, 2012 at 11:54 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> On 14.11.12 11:53, Robert Collins wrote:
>>   - chdir to the directory before you stat and use a relative path: it
>> turns out when working with many files that the overhead of absolute
>> paths is substantial.
>
> Look at fwalk(). It reaches the same benefits without changing of process-global cwd (i.e. it is thread-safe).

cwd is thread-safe on unix (well, Linux anyhow :P), and on Windows the
native OS directory walking is better anyway.

The only definitions I can find for fwalk are on BSD, not on Linux, so
fwalk is also likely something that would need porting. The
definitions I did find take a vector of file pointers, and so are
totally irrelevant for the point I made.

-Rob


From robertc at robertcollins.net  Wed Nov 14 19:55:32 2012
From: robertc at robertcollins.net (Robert Collins)
Date: Thu, 15 Nov 2012 07:55:32 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <20121114115204.4a4b98d3@pitrou.net>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
Message-ID: <CAJ3HoZ3xL72Q67Td1+NYz4uoTp8uSYxeoxkj+wvOmV4o+sf-+g@mail.gmail.com>

On Wed, Nov 14, 2012 at 11:52 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Le Wed, 14 Nov 2012 22:53:44 +1300,
> Robert Collins
> <robertc at robertcollins.net> a écrit :
>>
>> Data from bzr:
>>  you can get a very significant speed up by doing two things:
>>  - use readdir to get the inode numbers of the files in the directory
>> and stat the files in-increasing-number-order. (this gives you
>> monotonically increasing IO).
>
> This assumes directory entries are sorted by inode number (in a btree,
> I imagine). Is this assumption specific to some Linux / Ubuntu
> filesystem?

It's definitely not applicable globally (but it's no worse in general
than any arbitrary sort, so it's safe to do everywhere). On the ext*
family of file systems, inode A < inode B implies inode A is located
on a lower sector than B.

>>  - chdir to the directory before you stat and use a relative path: it
>> turns out when working with many files that the overhead of absolute
>> paths is substantial.
>
> How about using fstatat() instead? chdir() is a no-no since it's
> a process-wide setting.

fstatat looks *perfect*. Thanks for the pointer.

I forget the win32 behaviour, but on Linux a thread is a process ->
chdir is localised to the process and not altered across threads.

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services


From robertc at robertcollins.net  Wed Nov 14 20:00:49 2012
From: robertc at robertcollins.net (Robert Collins)
Date: Thu, 15 Nov 2012 08:00:49 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <20121114194331.79386a1c@pitrou.net>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
	<1352917231.6612.140661153656349.08D119BE@webmail.messagingengine.com>
	<20121114194331.79386a1c@pitrou.net>
Message-ID: <CAJ3HoZ1be=Tf3A4NCUUxeh8SuTZkh2OqDecGWr+-xDLru8HtvA@mail.gmail.com>

On Thu, Nov 15, 2012 at 7:43 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 14 Nov 2012 13:20:31 -0500
> random832 at fastmail.us wrote:
>> On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
>> > This assumes directory entries are sorted by inode number (in a btree,
>> > I imagine). Is this assumption specific to some Linux / Ubuntu
>> > filesystem?
>>
>> I think he was proposing listing the whole directory in advance (which
>> os.walk already does), sorting it, and then looping over it calling
>> stat.
>
> But I don't understand why sorting (by inode? by name?) would make
> stat() calls faster. That's what I'm trying to understand.

Fewer head seeks.

Consider 1000 files in a directory. They will likely be grouped on the
disk for locality of reference (e.g. same inode group) though mv's and
other operations mean this is only 'likely' not 'a given'.

When you do 1000 stat calls in a row, the kernel is working with a
command depth of 1, so it can't elevator-seek-optimise the workload.
If it sees linear IO it will trigger readahead, but that's less likely.
What the sort by inode does is ensure that the disk head never needs to
seek backwards, so you get a single outward pass over the needed
sectors.

This is a widely spread technique:
http://git.661346.n2.nabble.com/RFC-HACK-refresh-index-lstat-in-inode-order-td7347768.html
is a recent recurrence of the discussion, with sources cited.

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services


From solipsis at pitrou.net  Wed Nov 14 20:00:55 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 14 Nov 2012 20:00:55 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<k7vt8h$jau$1@ger.gmane.org>
	<CAJ3HoZ3f9bdd5uVa7ko-JNpsFC+YuBM4XNpA-jesCx06tHn_vA@mail.gmail.com>
Message-ID: <20121114200055.6fc7b25d@pitrou.net>

On Thu, 15 Nov 2012 07:52:12 +1300
Robert Collins
<robertc at robertcollins.net> wrote:
> On Wed, Nov 14, 2012 at 11:54 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> > On 14.11.12 11:53, Robert Collins wrote:
> >>   - chdir to the directory before you stat and use a relative path: it
> >> turns out when working with many files that the overhead of absolute
> >> paths is substantial.
> >
> > Look at fwalk(). It reaches the same benefits without changing of process-global cwd (i.e. it is thread-safe).
> 
> cwd is thread-safe on unix (well, Linux anyhow :P)

Not really:

>>> print(os.getcwd())
/home/antoine/cpython/opt
>>> evt = threading.Event()
>>> def f():
...   evt.wait()
...   print(os.getcwd())
... 
>>> threading.Thread(target=f).start()
>>> os.chdir('/tmp')
>>> evt.set()
/tmp


Regards

Antoine.




From Steve.Dower at microsoft.com  Wed Nov 14 20:04:18 2012
From: Steve.Dower at microsoft.com (Steve Dower)
Date: Wed, 14 Nov 2012 19:04:18 +0000
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAJ3HoZ3xL72Q67Td1+NYz4uoTp8uSYxeoxkj+wvOmV4o+sf-+g@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
	<CAJ3HoZ3xL72Q67Td1+NYz4uoTp8uSYxeoxkj+wvOmV4o+sf-+g@mail.gmail.com>
Message-ID: <A7269F03D11BC245BD52843B195AC4F0019CAF05@TK5EX14MBXC294.redmond.corp.microsoft.com>

Robert Collins wrote:
>On Wed, Nov 14, 2012 at 11:52 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>>  - chdir to the directory before you stat and use a relative path: it
>>> turns out when working with many files that the overhead of absolute
>>> paths is substantial.
>>
>> How about using fstatat() instead? chdir() is a no-no since it's
>> a process-wide setting.
>
> fstatat looks *perfect*. Thanks for the pointer.
>
> I forget the win32 behaviour, but on Linux a thread is a process ->
> chdir is localised to the process and not altered across threads.

chdir is a very bad idea for libraries on Windows - it is meant mainly for the user and not the application. It has also been removed completely for Windows 8 apps (not desktop applications, just the new style ones).

Full paths may involve some overhead, but at least with FindFirst/NextFile this should be limited to memory copies and not file lookups.

Cheers,
Steve



From random832 at fastmail.us  Wed Nov 14 20:10:12 2012
From: random832 at fastmail.us (random832 at fastmail.us)
Date: Wed, 14 Nov 2012 14:10:12 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAJ3HoZ3xL72Q67Td1+NYz4uoTp8uSYxeoxkj+wvOmV4o+sf-+g@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
	<CAJ3HoZ3xL72Q67Td1+NYz4uoTp8uSYxeoxkj+wvOmV4o+sf-+g@mail.gmail.com>
Message-ID: <1352920212.23571.140661153681457.44EE5441@webmail.messagingengine.com>

On Wed, Nov 14, 2012, at 13:55, Robert Collins wrote:
> I forget the win32 behaviour, but on Linux a thread is a process ->
> chdir is localised to the process and not altered across threads.

This claim needs to be unpacked, to point out where you are right and
where you are wrong:
Linux threads are processes: true.
cwd _can be_ localized to a single [thread] process rather than shared
in the whole [traditional] process: also true.
(see http://linux.die.net/man/2/clone, note specifically the CLONE_FS
flag)
cwd _actually is_ localized to a process when created in the ordinary
manner as a thread: false.
(see http://linux.die.net/man/7/pthreads )


From barry at python.org  Wed Nov 14 22:57:32 2012
From: barry at python.org (Barry Warsaw)
Date: Wed, 14 Nov 2012 16:57:32 -0500
Subject: [Python-ideas] CLI option for isolated mode
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
Message-ID: <20121114165732.69dcd274@resist.wooz.org>

On Nov 14, 2012, at 05:57 AM, Eli Bendersky wrote:

>Why does it have to be a single letter? Many tools today demand fully
>spelled out command-line flags for readability, and this also helps avoid
>clashes and cryptic flags no one remembers anyway. So if you want isolated
>mode, what's wrong with "--isolated"?

% head -1 foo.py
#!/usr/bin/python3 -Es
% ./foo.py
hello world

% head -1 bar.py
#!/usr/bin/python3 -E -s
% ./bar.py
Unknown option: - 
usage: /usr/bin/python3 [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.

So if you need to put multiple options on your shebang line, long options
won't work.

Cheers,
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121114/5fcdacb5/attachment.pgp>

From greg.ewing at canterbury.ac.nz  Wed Nov 14 21:55:27 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 15 Nov 2012 09:55:27 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2CvPMhVJzy_trgV2X9denDQJJ-=SeCXF4Xq7Ggfi-Jx_A@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<4BBEEE93-E2F6-487B-BD92-BE9792ABB5CC@mac.com>
	<CAL9jXCHzVd1f+Zs5twB5ODQ8N28FL59GyXU0awA_8dBExvh3qA@mail.gmail.com>
	<7858D168-7675-4A29-937A-4E9F4FD11F8E@mac.com>
	<CAL9jXCGE4DCxHGKSLCSvqjdjMLVn6-Ki4y09qM9aMUQ0-p_CeQ@mail.gmail.com>
	<2D464467-28B5-4CE5-83CA-AC7034D8933D@mac.com>
	<CADiSq7c3jzyAAsqQnkPO1CXpsKWYuqt7m9O4iuGH=9HNfKB6ZA@mail.gmail.com>
	<CAJ3HoZ0iWXmcZgR7suND3vgt79k3LzpmqxQHCe5SyMShTYMbvA@mail.gmail.com>
	<20121114115204.4a4b98d3@pitrou.net>
	<CAD=7U2CvPMhVJzy_trgV2X9denDQJJ-=SeCXF4Xq7Ggfi-Jx_A@mail.gmail.com>
Message-ID: <50A4053F.7080207@canterbury.ac.nz>

Mike Meyer wrote:
> 
> On Nov 14, 2012 4:55 AM, "Antoine Pitrou" <solipsis at pitrou.net 
> <mailto:solipsis at pitrou.net>> wrote:
>  >
>  > This assumes directory entries are sorted by inode number
> 
> It doesn't assume that, because inodes aren't stored in directories on 
> Posix file systems.

It seems to assume that the inodes referenced by a particular
directory will often be clustered, so that many of them reside
in the same or nearby disk blocks. But it's hard to imagine this
having much effect with the large file system caches used these
days.

-- 
Greg


From jimjjewett at gmail.com  Thu Nov 15 00:51:42 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 14 Nov 2012 18:51:42 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
Message-ID: <CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>

On 11/12/12, Ben Hoyt <benhoyt at gmail.com> wrote:

> 1) Speeding up os.walk(). I've shown we can easily get a ~5x speedup
> on Windows by not calling stat() on each file. And on Linux/BSD this
> same data is available from readdir()'s dirent, so I presume there's
> be a similar speedup, though it may not be quite 5x.

> 2) I also propose adding os.iterdir(path='.') to do exactly the same
> thing as os.listdir(), but yield the results as it gets them instead
> of returning the whole list at once.

I know that two functions may be better than a keyword, but a
combinatorial explosion of functions ... isn't.  Even given that
listdir can't change for backwards compatibility, and given that
iteration might be better for large directories, I'm still not sure an
exact analogue is worth it.

Could this be somehow combined with your 3rd proposal?  For example,
instead of returning a list of str (or bytes) names, could you return
a generator that would yield some sort of File objects?  (Well,
obviously you *could*, the question is whether that goes too far down
the rabbit hole of what a Path object should have.)  My strawman is an
object such that

(a)  No extra system calls will be made just to fill in data not
available from the dir entry itself.  I wouldn't even promise a name,
though I can't think of a sensible directory listing that doesn't
provide the name.

(b)  Any metadata you do have -- name, fd, results of stat, results of
getxattr ... will be available as an attribute.  That way users of
filesystems that do send back the size or type won't have to call
stat.

(c)  Attributes will default to None, supporting the "if x is None:
x=stat()" pattern for the users who do care about attributes that were
not available quickly.  (If there is an attribute for which "None" is
actually meaningful, the user can use hasattr -- but that is a corner
case, not worth polluting the API for.)

*Maybe* teach open about these objects, so that it can look for the
name or fd attributes.

Alternatively, it could return a str (or bytes) subclass that has the
other attributes when they are available.  That seems a bit contrived,
but might be better for backwards compatibility.  (os.walk could
return such objects too, instead of extracting the name.)
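
A very rough sketch of that strawman (all names here are illustrative,
not a real API):

    class DirEntryInfo(object):
        __slots__ = ('name', 'st_mode', 'st_size', 'st_mtime')

        def __init__(self, name, st_mode=None, st_size=None,
                     st_mtime=None):
            self.name = name          # (a) always cheap to provide
            self.st_mode = st_mode    # (b) filled only if readdir gave it
            self.st_size = st_size
            self.st_mtime = st_mtime

    # (c) caller-side pattern:
    #     if entry.st_size is None:
    #         entry = refresh_from_stat(entry)   # hypothetical helper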

-jJ


From random832 at fastmail.us  Thu Nov 15 00:57:09 2012
From: random832 at fastmail.us (Random832)
Date: Wed, 14 Nov 2012 18:57:09 -0500
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <20121114165732.69dcd274@resist.wooz.org>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
Message-ID: <50A42FD5.6050405@fastmail.us>

On 11/14/2012 4:57 PM, Barry Warsaw wrote:
> So if you need to put multiple options on your shebang line, long options
> won't work.
This seems like something that could be worked around in the interpreter.


From benhoyt at gmail.com  Thu Nov 15 01:12:48 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Thu, 15 Nov 2012 13:12:48 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
Message-ID: <CAL9jXCG+K9UtTBP58ffjg4OAasXvyny8shV_JNeZZ1ofSKYjMQ@mail.gmail.com>

> I know that two functions may be better than a keyword, but a
> combinatorial explosion of functions ... isn't.  Even given that
> listdir can't change for backwards compatibility, and given that
> iteration might be better for large directories, I'm still not sure an
> exact analogue is worth it.

"Combinatorial explosion of functions"? Slight exaggeration. :-) I'm
proposing adding two new functions, iterdir and iterdir_stat. And I
definitely think two new functions with pretty standard names and
almost self-documenting signatures is much simpler than keyword args
and magical return values. For example, Python 2.x has dict.items()
and dict.iteritems(), and it's clear what the "iter" version does at a
glance.

> Could this be somehow combined with your 3rd proposal?  For example,
> instead of returning a list of str (or bytes) names, could you return
> a generator that would yield some sort of File objects?

I considered this, as well as a str subclass with a ".stat" attribute.
But iterdir_stat's (filename, stat) tuple is much more explicit,
whereas the str subclass just seemed too magical -- though I might be
able to be convinced otherwise.

Yes, as you mention, I really think the File/Path object is a rabbit
hole, and the intent of my proposal is very minimal changes / minimal
learning curve -- an incremental addition.

-Ben


From steve at pearwood.info  Thu Nov 15 01:25:09 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Thu, 15 Nov 2012 11:25:09 +1100
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <50A42FD5.6050405@fastmail.us>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us>
Message-ID: <50A43665.4010406@pearwood.info>

On 15/11/12 10:57, Random832 wrote:
> On 11/14/2012 4:57 PM, Barry Warsaw wrote:
>> So if you need to put multiple options on your shebang line, long options
>> won't work.
> This seems like something that could be worked around in the interpreter.


Shebang lines aren't interpreted by Python, but by the shell.

To be precise, it isn't the shell either, but the program loader, I think.
But whatever it is, it isn't Python.


-- 
Steven


From phd at phdru.name  Thu Nov 15 01:57:27 2012
From: phd at phdru.name (Oleg Broytman)
Date: Thu, 15 Nov 2012 04:57:27 +0400
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <50A43665.4010406@pearwood.info>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us> <50A43665.4010406@pearwood.info>
Message-ID: <20121115005727.GA31723@iskra.aviel.ru>

On Thu, Nov 15, 2012 at 11:25:09AM +1100, Steven D'Aprano <steve at pearwood.info> wrote:
> Shebang lines aren't interpreted by Python, but by the shell.
> 
> To be precise, it isn't the shell either, but the program loader, I think.
> But whatever it is, it isn't Python.

   By the OS kernel -- it looks into the header to find out if the file
to be exec'd is ELF or DWARF or a script -- and then it invokes an
appropriate loader.

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.


From mwm at mired.org  Thu Nov 15 02:30:07 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 14 Nov 2012 19:30:07 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
Message-ID: <CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>

On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 11/12/12, Ben Hoyt <benhoyt at gmail.com> wrote:
[...]
> (c)  Attributes will default to None, supporting the "if x is None:
> x=stat()" pattern for the users who do care about attributes that were
> not available quickly.  (If there is an attribute for which "None" is
> actually meaningful, the user can use hasattr -- but that is a corner
> case, not worth polluting the API for.)

Two questions:

1) Is there some way to distinguish that your st_mode field is only
partially there (i.e. - you get the Linux/BSD d_type value, but not
the rest of st_mode)?

2) How about making these attributes properties, so that touching one
that isn't there causes them all to be populated.

     <mike


From benhoyt at gmail.com  Thu Nov 15 02:37:08 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Thu, 15 Nov 2012 14:37:08 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
Message-ID: <CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>

Very good questions.

> 1) Is there some way to distinguish that your st_mode field is only
> partially there (i.e. - you get the Linux/BSD d_type value, but not
> the rest of st_mode)?

Not currently. I haven't thought about this too hard -- there may be a
bit that's always set/not set within st_mode itself. Otherwise I'd
have to add a full_st_mode or similar property.

> 2) How about making these attributes properties, so that touching one
> that isn't there causes them all to be populated.

Interesting thought. Might be a bit too magical for my taste.

One concern I have with either of the above (adding a property to
stat_result) is that then I'd have to use my own class or namedtuple
rather than os.stat_result. Presumably that's not a big deal, though,
as you'd never do isinstance testing on it. In the case of 2), it
might have to be a class, meaning somewhat heavier memory-wise for
lots and lots of files.

-Ben


From greg.ewing at canterbury.ac.nz  Thu Nov 15 03:13:15 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 15 Nov 2012 15:13:15 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
Message-ID: <50A44FBB.3020105@canterbury.ac.nz>

On 15/11/12 14:37, Ben Hoyt wrote:
>
>> 1) Is there some way to distinguish that your st_mode field is only
>> partially there (i.e. - you get the Linux/BSD d_type value, but not
>> the rest of st_mode)?
>
> Not currently. I haven't thought about this too hard -- there may be a
> bit that's always set/not set within st_mode itself.

Maybe the call should have a bit mask indicating which of
the st_mode fields you want populated.

-- 
Greg



From abarnert at yahoo.com  Thu Nov 15 04:44:44 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Wed, 14 Nov 2012 19:44:44 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
Message-ID: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>

First, I realize that people regularly propose with expressions. This is not the 
same thing.

The problem with the with statement is not that it can't be postfixed 
perl-style, or used in expressions. The problem is that it can't be used with 
generator expressions.

Here's the suggestion:

    upperlines = (line.upper() for line in file with open('foo', 'r') as file)

This would be equivalent to:

    def foo():
        with open('foo', 'r') as file:
            for line in file:
                yield line.upper()
    upperlines = foo()

The motivation is that there is no way to write this properly using a with 
statement and a generator expression -- in fact, the only way to get this right is 
with the generator function above. And almost nobody ever gets it right, even 
when you push them in the right direction (although occasionally they write a 
complex class that has the same effect).

That's why we still have tons of code like this lying around:

    upperlines = (line.upper() for line in open('foo', 'r'))

Everyone knows that this only works with CPython, and isn't even quite right 
there, and yet people write it anyway, because there's no good alternative.

The with clause is inherently part of the generator expression, because the 
scope has to be dynamic. The file has to be closed when iteration finishes, not 
when creating the generator finishes (or when the generator is cleaned up -- which 
is closer, but still wrong).

That's why a general-purpose "with expression" wouldn't actually help here; in 
fact, it would just make generator expressions with with clauses harder to 
parse. A with expression would have to be statically scoped to be general.

For more details, see this:

http://stupidpythonideas.blogspot.com/2012/11/with-clauses-for-generator-expressions.html


From jimjjewett at gmail.com  Thu Nov 15 05:19:29 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 14 Nov 2012 23:19:29 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
Message-ID: <CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>

On 11/14/12, Mike Meyer <mwm at mired.org> wrote:
> On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett at gmail.com> wrote:
>> On 11/12/12, Ben Hoyt <benhoyt at gmail.com> wrote:

>> (c)  Attributes will default to None, supporting the "if x is None:
>> x=stat()" pattern for the users who do care about attributes that were
>> not available quickly.  ...

> Two questions:

> 1) Is there some way to distinguish that your st_mode field is only
> partially there (i.e. - you get the Linux/BSD d_type value, but not
> the rest of st_mode)?

os.iterdir did not call stat; you have partial information.

Or are you saying that you want to distinguish between "This
filesystem doesn't track that information", "This process couldn't get
that information right now", and "That particular piece of information
requires a second call that hasn't been made yet"?

> 2) How about making these attributes properties, so that touching one
> that isn't there causes them all to be populated.

Part of the motivation was to minimize extra system calls; that
suggests making another one should be a function call instead of a
property.

That said, I can see value in making that optional call return
something with the same API.  Perhaps:

    if thefile.X is None:
        thefile.restat()    # namespace clash, but is "restat" really
a likely file attribute?

or

    if thefile.X is None:
        thefile=stat(thefile)
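
For illustration only, a rough sketch of the kind of result object being
discussed (the class name, the cached fields and restat() are all hypothetical,
not an existing os API):

    import os

    class DirEntryInfo:
        """Directory entry whose stat fields default to None."""
        def __init__(self, path, st_mode=None, st_size=None):
            self.path = path
            self.st_mode = st_mode    # may hold only the "is type" bits
            self.st_size = st_size    # None until a real stat() happens

        def restat(self):
            """Fill everything in with one explicit os.stat() call."""
            st = os.stat(self.path)
            self.st_mode = st.st_mode
            self.st_size = st.st_size
            return self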

-jJ


From mwm at mired.org  Thu Nov 15 06:03:21 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 14 Nov 2012 23:03:21 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
Message-ID: <CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>

On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett at gmail.com> wrote:
>
> On 11/14/12, Mike Meyer <mwm at mired.org> wrote:
> > On Wed, Nov 14, 2012 at 5:51 PM, Jim Jewett <jimjjewett at gmail.com>
wrote:
> >> On 11/12/12, Ben Hoyt <benhoyt at gmail.com> wrote:
>
> >> (c)  Attributes will default to None, supporting the "if x is None:
> >> x=stat()" pattern for the users who do care about attributes that were
> >> not available quickly.  ...
>
> > Two questions:
>
> > 1) Is there some way to distinguish that your st_mode field is only
> > partially there (i.e. - you get the Linux/BSD d_type value, but not
> > the rest of st_mode)?
>
> os.iterdir did not call stat; you have partial information.

Note that you're eliding the proposal these questions were about, that
os.iterdir return some kind of object that had attributes that carried the
stat values, or None if they weren't available.

> Or are you saying that you want to distinguish between "This
> filesystem doesn't track that information", "This process couldn't get
> that information right now", and "That particular piece of information
> requires a second call that hasn't been made yet"?

I want to distinguish between the case where st_mode is filled from the
BSD/Unix d_type directory entry - meaning there is information so st_mode
is not None, but the information is incomplete and requires a second system
call to fetch - and the case where it's filled via the Windows calls which
provide all the information that is available for st_mode, so no second
system call is needed.

> > 2) How about making these attributes properties, so that touching one
> > that isn't there causes them all to be populated.
> Part of the motivation was to minimize extra system calls; that
> suggests making another one should be a function call instead of a
> property.

Except that I don't see that there's anything to do once you've found a
None-valued attribute *except* make that extra call. If there's a use case
where you find one of the attributes is None and then not get the value
from the system, I agree with you. If there isn't, then you might as well
roll that one use case into the object rather than force every client to do
the stat call and extract the information from it in that case.
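
A minimal sketch of what "rolling that use case into the object" might look
like (the names here are hypothetical, not a real os API):

    import os

    class LazyStatEntry:
        def __init__(self, path, st_mode=None):
            self.path = path
            self._st_mode = st_mode   # whatever readdir() provided, or None

        @property
        def st_mode(self):
            # Touching the attribute triggers the extra system call only
            # when the cheap value wasn't available.
            if self._st_mode is None:
                self._st_mode = os.stat(self.path).st_mode
            return self._st_mode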

From jimjjewett at gmail.com  Thu Nov 15 06:53:39 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 15 Nov 2012 00:53:39 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
Message-ID: <CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>

On 11/15/12, Mike Meyer <mwm at mired.org> wrote:
> On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett at gmail.com> wrote:

> Note that you're eliding the proposal these questions were about, that
> os.iterdir return some kind of object that had attributes that carried the
> stat values, or None if they weren't available.

>> Or are you saying that you want to distinguish between "This
>> filesystem doesn't track that information", "This process couldn't get
>> that information right now", and "That particular piece of information
>> requires a second call that hasn't been made yet"?

> I want to distinguish between the case where st_mode is filled from the
> BSD/Unix d_type directory entry - meaning there is information so st_mode
> is not None, but the information is incomplete and requires a second system
> call to fetch - and the case where it's filled via the Windows calls which
> provide all the information that is available for st_mode, so no second
> system call is needed.

So you're basically saying that you want to know whether an explicit
stat call would make a difference?  (Other than updating the
information if it has changed.)

>> > 2) How about making these attributes properties, so that touching one
>> > that isn't there causes them all to be populated.
>> Part of the motivation was to minimize extra system calls; that
>> suggests making another one should be a function call instead of a
>> property.

> Except that I don't see that there's anything to do once you've found a
> None-valued attribute *except* make that extra call.

ah.  I was thinking of reporting, where you could just leave a column
off the report.

Or of some sort of optimization, where knowing the size (or last
change date) is not required, but may be helpful.  I suppose these
might be sufficiently uncommon that triggering the extra stat call
instead of returning None might be justified.

-jJ


From mwm at mired.org  Thu Nov 15 07:09:41 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 15 Nov 2012 00:09:41 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
Message-ID: <CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>

On Wed, Nov 14, 2012 at 11:53 PM, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 11/15/12, Mike Meyer <mwm at mired.org> wrote:
>> On Nov 14, 2012 10:19 PM, "Jim Jewett" <jimjjewett at gmail.com> wrote:
>> I want to distinguish between the case where st_mode is filled from the
>> BSD/Unix d_type directory entry - meaning there is information so st_mode
>> is not None, but the information is incomplete and requires a second system
>> call to fetch - and the case where it's filled via the Windows calls which
>> provide all the information that is available for st_mode, so no second
>> system call is needed.
> So you're basically saying that you want to know whether an explicit
> stat call would make a difference?  (Other than updating the
> information if it has changed.)

That's one way of looking at it. The problem is that you tell whether a
value has been filled or not by checking for None.  But st_mode is
itself multi-valued, and you don't always get all the available
values. Maybe d_type should be its own attribute? If readdir returns
it, we use it as is. If not, then the caller either does the None/stat
dance or we make it a property that gets filled from the stat
structure.
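
To make that alternative concrete, a hedged sketch of "d_type as its own
attribute" (the DT_* values and attribute names are assumptions, not an
existing interface):

    import os
    import stat

    DT_UNKNOWN, DT_DIR, DT_REG = 0, 4, 8    # d_type values as on Linux/BSD

    class Entry:
        def __init__(self, path, d_type=DT_UNKNOWN):
            self.path = path
            self.d_type = d_type            # used as-is if readdir gave it

        @property
        def st_mode(self):
            # Fall back to a real stat() only when d_type is unknown.
            if self.d_type == DT_DIR:
                return stat.S_IFDIR
            if self.d_type == DT_REG:
                return stat.S_IFREG
            return os.stat(self.path).st_mode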

	Thanks,
	<mike


From pconnell at gmail.com  Thu Nov 15 08:35:55 2012
From: pconnell at gmail.com (Phil Connell)
Date: Thu, 15 Nov 2012 07:35:55 +0000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <20121115073555.GA7582@odell.Belkin>

On Wed, Nov 14, 2012 at 07:44:44PM -0800, Andrew Barnert wrote:
> First, I realize that people regularly propose with expressions. This is not the 
> same thing.
> 
> The problem with the with statement is not that it can't be postfixed 
> perl-style, or used in expressions. The problem is that it can't be used with 
> generator expressions.
> 
> Here's the suggestion:
> 
>     upperlines = (line.upper() for line in file with open('foo', 'r') as file)

While this looks very clean, how do you propose the following should be written
as a generator expression?

def foo():
    with open('foo') as f:
        for line in f:
            if 'bar' in line:
                yield line


An obvious suggestion is as follows, but I'm not totally convinced about the
out-of-order with, for and if clauses (compared with the equivalent generator)

(line
 for line in f
 if 'bar' in line
 with open('foo') as f)

Cheers,
Phil

> 
> This would be equivalent to:
> 
>     def foo():
>         with open('foo', 'r') as file:
>             for line in file:
>                 yield line.upper()
>     upperlines = foo()
> 
> The motivation is that there is no way to write this properly using a with 
> statement and a generator expression -- in fact, the only way to get this right is 
> with the generator function above. And almost nobody ever gets it right, even 
> when you push them in the right direction (although occasionally they write a 
> complex class that has the same effect).
> 
> That's why we still have tons of code like this lying around:
> 
>     upperlines = (line.upper() for line in open('foo', 'r'))
> 
> Everyone knows that this only works with CPython, and isn't even quite right 
> there, and yet people write it anyway, because there's no good alternative.
> 
> The with clause is inherently part of the generator expression, because the 
> scope has to be dynamic. The file has to be closed when iteration finishes, not 
> when creating the generator finishes (or when the generator is cleaned up -- which 
> is closer, but still wrong).
> 
> That's why a general-purpose "with expression" wouldn't actually help here; in 
> fact, it would just make generator expressions with with clauses harder to 
> parse. A with expression would have to be statically scoped to be general.
> 
> For more details, see this:
> 
> http://stupidpythonideas.blogspot.com/2012/11/with-clauses-for-generator-expressions.html
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas


From benhoyt at gmail.com  Thu Nov 15 08:40:57 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Thu, 15 Nov 2012 20:40:57 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
Message-ID: <CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>

> That's one way of looking at it. The problem is that you tell whether a
> value has been filled or not by checking for None.  But st_mode is
> itself multi-valued, and you don't always get all the available
> values. Maybe d_type should be its own attribute? If readdir returns
> it, we use it as is. If not, then the caller either does the None/stat
> dance or we make it a property that gets filled from the stat
> structure.

I'm inclined to KISS and just let the caller handle it. Many other
things in the "os" module are system dependent, including os.stat(),
so if the stat_results results returned by iterdir_stat() are system
dependent, that's just par for the course. I'm thinking of a docstring
something like:

"""Yield tuples of (filename, stat_result) for each filename in
directory given by "path". Like listdir(), '.' and '..' are skipped.
The values are yielded in system-dependent order.

Each stat_result is an object like you'd get by calling os.stat() on
that file, but not all information is present on all systems, and st_*
fields that are not available will be None.

In practice, stat_result is a full os.stat() on Windows, but only the
"is type" bits of the st_mode field are available on Linux/OS X/BSD.
"""

So in real life, if you're using more than stat.S_ISDIR() of st_mode,
you'll need to call stat separately. But 1) it's quite rare to need eg
the permissions bits in this context, and 2) if you're expecting
st_mode to have that extra stuff your code's already system-dependent,
as permission bits don't mean much on Windows.

But the main point is that what the OS gives you for free is easily available.
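
For example, the caller-handles-it pattern under that docstring might look
roughly like this (iterdir_stat is the function being proposed here, not
something in today's os module):

    import os
    import stat

    def list_subdirs(path):
        dirs = []
        for name, st in iterdir_stat(path):       # proposed API, assumed
            if st.st_mode is None:                # nothing cheap available
                st = os.stat(os.path.join(path, name))
            if stat.S_ISDIR(st.st_mode):
                dirs.append(name)
        return dirs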

-Ben


From storchaka at gmail.com  Thu Nov 15 08:50:52 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Thu, 15 Nov 2012 09:50:52 +0200
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <k82b48$u25$1@ger.gmane.org>

On 15.11.12 05:44, Andrew Barnert wrote:
> That's why we still have tons of code like this lying around:
>
>      upperlines = (line.upper() for line in open('foo', 'r'))
>
> Everyone knows that this only works with CPython, and isn't even quite right
> there, and yet people write it anyway, because there's no good alternative.

Not every piece of code should be written as a one-liner.

Use a generator function.




From masklinn at masklinn.net  Thu Nov 15 10:29:22 2012
From: masklinn at masklinn.net (Masklinn)
Date: Thu, 15 Nov 2012 10:29:22 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>


On 2012-11-15, at 04:44 , Andrew Barnert wrote:

> First, I realize that people regularly propose with expressions. This is not the 
> same thing.
> 
> The problem with the with statement is not that it can't be postfixed 
> perl-style, or used in expressions. The problem is that it can't be used with 
> generator expressions.
> 
> Here's the suggestion:
> 
>    upperlines = (line.upper() for line in file with open('foo', 'r') as file)
> 
> This would be equivalent to:
> 
>    def foo():
>        with open('foo', 'r') as file:
>            for line in file:
>                yield line.upper()
>    upperlines = foo()
> 
> The motivation is that there is no way to write this properly using a with 
> statement and a generator expression -- in fact, the only way to get this right is 
> with the generator function above.

Actually, it's extremely debatable that the generator function is
correct: if the generator is not fully consumed (terminating iteration
on the file) I'm pretty sure the file will *not* get closed save by the
GC doing a pass on all dead objects maybe. This means this function is
*not safe* as a lazy source to an arbitrary client, as that client may
very well use itertools.islice or itertools.takewhile and only partially
consume the generator.

Here's an example:

--
import itertools

class Manager(object):
    def __enter__(self):
        return self

    def __exit__(self, *args):
        print("Exited")

    def __iter__(self):
        for i in range(5):
            yield i

def foo():
    with Manager() as ms:
        for m in ms:
            yield m

def bar():
    print("1")
    f = foo()
    print("2")
    # Only consume part of the iterable
    list(itertools.islice(f, None, 2))
    print("3")

bar()
print("4")
--

CPython output, I'm impressed that the refcounting GC actually bothers
unwinding the stack and running the __exit__ handler *once bar has
finished executing*:

> python3 withgen.py 
1
2
3
Exited
4

But here's the (just as correct, as far as I can tell) output from pypy:

> pypy-c withgen.py 
1
2
3
4

If the program was long running, it is possible that pypy would run
__exit__ when the containing generator is released (though by no means
certain, I don't know if this is specified at all).

This is in fact one of the huge issues with faking dynamic scopes via
threadlocals and context managers (as e.g. Flask might do, I'm not sure
what actual strategy it uses), they interact rather weirdly with
generators (it's also why I think Python should support actually
dynamically scoped variables, it would also fix the thread-broken
behavior of e.g. warnings.catch_warnings)

From abarnert at yahoo.com  Thu Nov 15 11:08:27 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 02:08:27 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
Message-ID: <1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>

From: Masklinn <masklinn at masklinn.net>
Sent: Thu, November 15, 2012 1:29:46 AM


> On 2012-11-15, at 04:44 , Andrew Barnert wrote:
> 
> > Here's the suggestion:
> > 
> >    upperlines = (line.upper() for line in file with open('foo', 'r') as file)
> > 
> > This would be equivalent  to:
> > 
> >    def foo():
> >        with open('foo', 'r') as file:
> >            for line in file:
> >                yield line.upper()
> >    upperlines = foo()
> > 
> > The motivation is that there is no way to write this properly using a with
> > statement and a generator expression -- in fact, the only way to get this
> > right is with the generator function above.
> 
> Actually, it's extremely debatable that the generator function is
> correct: if the generator is not fully consumed (terminating iteration
> on the file) I'm pretty sure the file will *not* get closed save by the
> GC doing a pass on all dead objects maybe.  This means this function is
> *not safe* as a lazy source to an arbitrary client, as that client may
> very well use itertools.islice or itertools.takewhile and only partially
> consume the generator.

Well, yes, *no possible object* is safe as a lazy source to an arbitrary client 
that might not fully consume, close, or destroy it. By definition, the object 
must stay alive as long as an arbitrary client might use it, so a client that 
never finishes using it means the object must stay alive forever. And, 
similarly, in the case of a client that does finish using it, but the only way 
to detect that is by GCing the client, the object must stay alive until the GC 
collects the client. So, the correct thing for the generator function to do in
that case is... exactly what it does.

Of course in that case, it would arguably be just as correct to just do "ms = 
Manager()" or "file = open('foo', 'r')" instead of "with Manager() as ms:" or 
"with open('foo', 'r') as file:".

The difference is that, in cases where the client does fully consume, close, or 
destroy the iterator deterministically, the with version will still do the 
right 
thing, while the leaky version will not. You can test this very easily by 
adding an "f.close()" to the end of bar, or changing "f = foo()" to "with 
closing(foo()) as f:", and compare the two versions of the generator function.
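
Concretely, a properly-written client for Masklinn's example above would be
something along these lines (a sketch, using contextlib.closing so that the
generator, and therefore its with block, is closed deterministically):

    import itertools
    from contextlib import closing

    def bar():
        # closing() calls f.close() on exit, which raises GeneratorExit
        # inside foo() and runs the Manager's __exit__ even though the
        # generator was only partially consumed.
        with closing(foo()) as f:
            list(itertools.islice(f, None, 2))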

Put another way, if your point is an argument against with clauses, it's also 
an 
argument against with statements, and manual resource cleanup, and in fact 
anything but a magical GC.

> This is  in fact one of the huge issues with faking dynamic scopes via
> threadlocals  and context managers (as e.g. Flask might do, I'm not sure
> what actual  strategy it uses), they interact rather weirdly with
> generators (it's also  why I think Python should support actually
> dynamically scoped variables, it  would also fix the thread-broken
> behavior of e.g.  warnings.catch_warnings)


This is an almost-unrelated side issue. A generator used in a single thread 
defines a fully deterministic dynamic scope, one that can and often should be 
used for cleanup. The fact that sometimes it's not the right scope for some 
cleanups, or that you can use them in multithreaded programs in a way that 
makes 
them indeterministic, isn't an argument that it should be hard to use them for 
cleanup when appropriate, is it?


From masklinn at masklinn.net  Thu Nov 15 11:25:27 2012
From: masklinn at masklinn.net (Masklinn)
Date: Thu, 15 Nov 2012 11:25:27 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>
Message-ID: <894D7633-596F-4E16-85D6-42562AADEB02@masklinn.net>

On 2012-11-15, at 11:08 , Andrew Barnert wrote:
> This is an almost-unrelated side issue. A generator used in a single thread 
> defines a fully deterministic dynamic scope

I think you meant "a context manager" not "a generator", and my example
quite clearly demonstrates that the interaction between context managers
and generators completely breaks context managers as dynamic scopes.

> , one that can and often should be 
> used for cleanup. The fact that sometimes it's not the right scope for some 
> cleanups, or that you can use them in multithreaded programs

Using context managers on threadlocals means the context manager itself
is in a single-threaded environment; the multithreading is not the
issue, the interaction between context managers and generators is.

> isn't an argument that it should be hard to use them for 
> cleanup when appropriate, is it?

I never wrote that, I only noted that your assertion about the function
you posted (namely that it is "properly written") is dubious and risky.

From mwm at mired.org  Thu Nov 15 11:29:24 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 15 Nov 2012 04:29:24 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
Message-ID: <CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>

On Nov 15, 2012 1:40 AM, "Ben Hoyt" <benhoyt at gmail.com> wrote:
>
> > That's one way of looking at it. The problem is that you tell whether a
> > value has been filled or not by checking for None.  But st_mode is
> > itself multi-valued, and you don't always get all the available
> > values. Maybe d_type should be its own attribute? If readdir returns
> > it, we use it as is. If not, then the caller either does the None/stat
> > dance or we make it a property that gets filled from the stat
> > structure.
>
> I'm inclined to KISS and just let the caller handle it. Many other
> things in the "os" module are system dependent, including os.stat(),
> so if the stat_results results returned by iterdir_stat() are system
> dependent, that's just par for the course. I'm thinking of a docstring
> something like:
>
> """Yield tuples of (filename, stat_result) for each filename in
> directory given by "path". Like listdir(), '.' and '..' are skipped.
> The values are yielded in system-dependent order.
>
> Each stat_result is an object like you'd get by calling os.stat() on
> that file, but not all information is present on all systems, and st_*
> fields that are not available will be None.
>
> In practice, stat_result is a full os.stat() on Windows, but only the
> "is type" bits of the st_mode field are available on Linux/OS X/BSD.
> """

There's a code smell here, in that the doc for Unix variants is incomplete
and wrong. Whether or not you get the d_type values depends on the OS
having that extension. Further, there's a d_type value (DT_UNKNOWN) that
isn't a valid value for the S_IFMT bits in st_mode (at least on BSD).

> So in real life, if you're using more than stat.S_ISDIR() of st_mode,
> you'll need to call stat separately. But 1) it's quite rare to need eg
> the permissions bits in this context, and 2) if you're expecting
> st_mode to have that extra stuff your code's already system-dependent,
> as permission bits don't mean much on Windows.
>
> But the main point is that what the OS gives you for free is easily
available.

If the goal is to make os.walk fast, then it might be better (on Posix
systems, anyway) to see if it can be built on top of ftw instead of
low-level directory scanning routines.

    <mike

From abarnert at yahoo.com  Thu Nov 15 11:37:58 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 02:37:58 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k82b48$u25$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<k82b48$u25$1@ger.gmane.org>
Message-ID: <1352975878.59084.YahooMailRC@web184702.mail.ne1.yahoo.com>

> From: Serhiy Storchaka <storchaka at gmail.com>
> Sent: Thu, November 15, 2012 1:05:27 AM
> 
> On 15.11.12 05:44, Andrew Barnert wrote:
> > That's why we still have tons of code like this lying around:
> >
> >      upperlines = (line.upper() for line in open('foo', 'r'))
> >
> > Everyone knows that this only works with CPython, and isn't even quite right
> > there, and yet people write it anyway, because there's no good alternative.
> 
> Not every  piece of code should be written as one-liner.

But a piece of code that everyone needs on a regular basis should be writable, 
and readable, by a novice Python user. I don't care whether it's one line or 
four, but I do care that a task that seems to require nothing that you don't 
learn in your first week with the language is beyond the ability of not just 
novices, but people who post modules on PyPI, write answers on StackOverflow, 
etc.

> Use a generator function.

Of course the right answer is obvious to you and me, because we understand the 
difference between static and dynamic scopes, and that a generator defines a 
dynamic scope, and what context managers actually do, and how to translate a 
generator expression into a generator function.

It's not that the generator function is hard to write; it's that people who 
don't understand how all this stuff works won't even think of the idea that an 
explicit generator function would help them here.


From abarnert at yahoo.com  Thu Nov 15 12:11:07 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 03:11:07 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <20121115073555.GA7582@odell.Belkin>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
Message-ID: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>

> From: Phil Connell <pconnell at gmail.com>
> Sent: Thu, November 15, 2012 1:24:52 AM
> 
> On Wed, Nov 14, 2012 at 07:44:44PM -0800, Andrew Barnert wrote:
> > 
> >     upperlines =  (lines.upper() for line in file with open('foo', 'r') as 
>file)
> 
> While this  looks very clean, how do you propose the following should be 
>written
> as a  generator expression?
> 
> def foo():
>     with open('foo') as  f:
>         for line in f:
>              if 'bar' in line:
>                  yield line

Exactly as you suggest (quoting you out of order to make the answer clearer):

> (line
> for line in f
> if 'bar' in line
> with open('foo') as f)


> An obvious suggestion is as follows, but I'm not totally convinced about the
> out-of-order with, for and if clauses (compared with the equivalent
> generator)


The clauses have *always* been out of order. In the function, the "if" comes 
between the "for" and the yield expression. In the expression, the "for" comes 
in between. If the clause order implies the statement order (I would have put it 
in terms of the clause structure implying the scoping, but they're effectively 
the same idea), then our syntax has been wrong since list comprehensions were 
added in 2.0. So, I think (and hope!) that implication was never intended.

Which means the only question is, which one looks more readable:

1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

Or, in the trivial case (where versions 1 and 2 are indistinguishable):

1. (line for line in f with open('foo') as f)
2. (line for line in f with open('foo') as f)
3. (line with open('foo') as f for line in f)

My own intuition is that 1 is the clearest, and 3 by far the worst. So, that's 
why I proposed order 1. But I'm not at all married to it.


From pconnell at gmail.com  Thu Nov 15 12:22:51 2012
From: pconnell at gmail.com (Phil Connell)
Date: Thu, 15 Nov 2012 11:22:51 +0000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <20121115112251.GA13472@phconnel-ws.cisco.com>

On Thu, Nov 15, 2012 at 03:11:07AM -0800, Andrew Barnert wrote:
> > From: Phil Connell <pconnell at gmail.com>
> > Sent: Thu, November 15, 2012 1:24:52 AM
> > 
> > On Wed, Nov 14, 2012 at 07:44:44PM -0800, Andrew Barnert wrote:
> > > 
> > >     upperlines = (line.upper() for line in file with open('foo', 'r') as file)
> > 
> > While this looks very clean, how do you propose the following should be written
> > as a generator expression?
> > 
> > def foo():
> >     with open('foo') as f:
> >         for line in f:
> >             if 'bar' in line:
> >                 yield line
> 
> Exactly as you suggest (quoting you out of order to make the answer clearer):
> 
> > (line
> > for line in f
> > if 'bar' in line
> > with open('foo') as f)
> 
> 
> > An obvious suggestion is as follows, but I'm not totally convinced about the
> > out-of-order with, for and if clauses (compared with the equivalent
> > generator)
> 
> 
> The clauses have *always* been out of order. In the function, the "if" comes 
> between the "for" and the yield expression. In the expression, the "for" comes 
> in between. If the clause order implies the statement order (I would have put it 
> in terms of the clause structure implying the scoping, but they're effectively 
> the same idea), then our syntax has been wrong since list comprehensions were 
> added in 2.0. So, I think (and hope!) that implication was never intended.

I was mostly playing devil's advocate :)

In my experience, the ordering of comprehension clauses is already a source of
confusion for those new to the language. So, if it's not obvious where the "if"
should come it may well make matters worse in this regard (but I wouldn't say
that this is enough to kill the proposal).


> 
> Which means the only question is, which one looks more readable:
> 
> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

To me, 1 feels like it captures the semantics the best - the "with" clause is
tacked onto the generator expression "(foo(line) ... for ... if)" and applies
to the whole of that expression.


Cheers,
Phil

> 
> Or, in the trivial case (where versions 1 and 2 are indistinguishable):
> 
> 1. (line for line in f with open('foo') as f)
> 2. (line for line in f with open('foo') as f)
> 3. (line with open('foo') as f for line in f)
> 
> My own intuition is that 1 is the clearest, and 3 by far the worst. So, that's 
> why I proposed order 1. But I'm not at all married to it.


From abarnert at yahoo.com  Thu Nov 15 12:37:32 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 03:37:32 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <894D7633-596F-4E16-85D6-42562AADEB02@masklinn.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<894D7633-596F-4E16-85D6-42562AADEB02@masklinn.net>
Message-ID: <1352979452.86358.YahooMailRC@web184703.mail.ne1.yahoo.com>

> From: Masklinn <masklinn at masklinn.net>
> Sent: Thu, November 15, 2012 2:25:47 AM
> 
> On 2012-11-15, at 11:08 , Andrew Barnert wrote:
> > This is an almost-unrelated side issue. A generator used in a single thread
> > defines a fully deterministic dynamic scope
> 
> I think you meant "a context  manager" not "a generator"

No, I meant a generator. "As long as the generator has values to generate, and 
has not been closed or destroyed" is a dynamic scope. "Until the end of this 
with statement block" is a static scope. The only reason the context managers in 
both your example and mine have dynamic scope is because they're embedded in 
generators.

> and my example
> quite clearly demonstrates that  the interaction between context managers
> and generators completely break  context managers as dynamic scopes.

No it doesn't. It demonstrates that it's possible to create indeterminate 
scopes, and context managers cannot help you if you do so. "Until the client 
exhausts the iterator, given that the client is not going to exhaust the 
iterator" effectively means "Until the client goes away". Which means you need a 
context manager around the client. The fact that you don't have one means that 
your client is inherently broken. You'll have the exact same problems with a 
trivial local object (e.g., its __del__ method won't get called by PyPy).

However, if the client *did* have a context manager (or exhausted, closed, or 
explicitly deleted the generator), a properly-written generator would clean 
itself up, while a naively-written one would not. That's what I meant by 
"properly-written". Not that it's guaranteed to clean up even when used by a 
broken client, because that is completely impossible for any object (generator 
or otherwise), but that it is guaranteed to clean up when used by a 
properly-written client.


From storchaka at gmail.com  Thu Nov 15 13:16:42 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Thu, 15 Nov 2012 14:16:42 +0200
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <k82mfv$120$1@ger.gmane.org>

On 15.11.12 13:11, Andrew Barnert wrote:
> Which means the only question is, which one looks more readable:
>
> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

What about such generator?

def gen():
     with a() as f:
         for x in f:
             if p(x):
                 with b(x) as g:
                     for y in g:
                         if q(y):
                             yield y




From masklinn at masklinn.net  Thu Nov 15 13:20:49 2012
From: masklinn at masklinn.net (Masklinn)
Date: Thu, 15 Nov 2012 13:20:49 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352979452.86358.YahooMailRC@web184703.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<894D7633-596F-4E16-85D6-42562AADEB02@masklinn.net>
	<1352979452.86358.YahooMailRC@web184703.mail.ne1.yahoo.com>
Message-ID: <2128CF08-7C13-48D4-AA6C-7508D749D25B@masklinn.net>

On 2012-11-15, at 12:37 , Andrew Barnert wrote:
>> From: Masklinn <masklinn at masklinn.net>
>> Sent: Thu, November 15, 2012 2:25:47 AM
>> 
>> On 2012-11-15, at 11:08 , Andrew Barnert wrote:
>>> This is an almost-unrelated side issue. A generator used in a single thread
>>> defines a fully deterministic dynamic scope
>> 
>> I think you meant "a context  manager" not "a generator"
> 
> No, I meant a generator. "As long as the generator has values to generate, and 
> has not been closed or destroyed" is a dynamic scope.

It isn't a dynamic scope in the sense of "dynamic scoping", which is the
sense I used it in, and the one usually understood when talking about
dynamic scopes: a function of the stack context in which the
code executes, not the lifecycle of an object.

> "Until the end of this with statement block" is a static scope.

Not from the POV of callees within the stack of which the with block is
part, which again is the standard interpretation for "dynamic scopes".

>> and my example
>> quite clearly demonstrates that  the interaction between context managers
>> and generators completely break  context managers as dynamic scopes.
> 
> No it doesn't. It demonstrates that it's possible to create indeterminate 
> scopes

There is nothing indeterminate about the scopes in a classical and usual
sense, neither the dynamic scope nor the lexical scope. And languages with
proper dynamic scoping support have no issue with this kind of constructs.

Neither does Python when walking through the whole stack, naturally.

From ncoghlan at gmail.com  Thu Nov 15 13:39:40 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 15 Nov 2012 22:39:40 +1000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>

On Thu, Nov 15, 2012 at 9:11 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> The clauses have *always* been out of order. In the function, the "if"
> comes
> between the "for" and the yield expression. In the expression, the "for"
> comes
> in between. If the clause order implies the statement order (I would have
> put it
> in terms of the clause structure implying the scoping, but they're
> effectively
> the same idea), then our syntax has been wrong since list comprehensions
> were
> added in 2.0. So, I think (and hope!) that implication was never intended.
>

One, and only one, clause in a comprehension or generator expression is
written out of sequence: the innermost clause is lifted out and written
first. The rest of the expression is just written in the normal statement
order with the colons and newlines omitted. This is why you can chain
comprehensions to arbitrary depths without any ambiguity from the
compiler's point of view:

>>> seq = [0, [0, [0, 1]]]
>>> [z for x in seq if x for y in x if y for z in y if z]
[1]

(Note: even though you *can* chain the clauses like this, please don't, as
it's thoroughly unreadable for humans, even though it makes sense to the
compiler)

So *if* a context management clause was added to comprehension syntax, it
couldn't reasonably be added using the same design as was used to determine
the placement of the current iteration and filtering clauses.

If the determination is "place it at the end, and affect the whole
comprehension/generator expression regardless of the number of clauses",
then you're now very close to the point of it making more sense to propose
allowing context managers on arbitrary expressions, as it would be
impossible to explain why this was allowed:

    lines = list(line for line in f with open(name) as f)

But this was illegal:

    lines = (f.readlines() with open(name) as f)

And if arbitrary subexpressions are allowed, *then* you're immediately
going to have people wanting a "bind" builtin:

    class bind:
        def __init__(self, value):
            self.value = value
        def __enter__(self):
            return self.value
        def __exit__(self, *args):
            pass

    if (m is None with bind(pattern.match(data)) as m):
        raise ValueError("{} does not match {}".format(data, pattern))
    # Do things with m...
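
For comparison, the plain statement form that such a bind expression would be
standing in for (no new syntax, just today's Python):

    m = pattern.match(data)
    if m is None:
        raise ValueError("{} does not match {}".format(data, pattern))
    # Do things with m...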

This is not an endorsement of the above concepts, just making it clear that
I don't believe that attempting to restrict this to generator expressions
is a viable proposal, as the restriction is far too arbitrary (from a user
perspective) to form part of a coherent language design.

Generator expressions, like lambda expressions, are deliberately limited.
If you want to avoid those limits, it's time to upgrade to a named
generator or function. If you feel that puts things in the wrong order in
your code then please, send me your use cases so I can consider adding
them as examples in PEP 403 and PEP 3150. If you really want to enhance the
capabilities of expressions, then the more general proposal is the only one
with even a remote chance, and that's contingent on proving that the
compiler can be set up to give generator expressions the semantics you
propose (Off the top of my head, I suspect it should be possible, since the
compiler will know it's in a comprehension or generator expression by the
time it hits the with token, but there may be other technical limitations
that ultimately rule it out).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From abarnert at yahoo.com  Thu Nov 15 15:25:57 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 06:25:57 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k82mfv$120$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<k82mfv$120$1@ger.gmane.org>
Message-ID: <1352989557.54116.YahooMailRC@web184705.mail.ne1.yahoo.com>

From: Serhiy Storchaka <storchaka at gmail.com>
Sent: Thu, November 15, 2012 4:17:42 AM


> On 15.11.12 13:11, Andrew Barnert wrote:
> > Which means the only question  is, which one looks more readable:
> > 
> > 1. (foo(line) for line in  baz(f) if 'bar' in line with open('foo') as f)
> > 2. (foo(line) for line in  baz(f) with open('foo') as f if 'bar' in line)
> > 3. (foo(line) with  open('foo') as f for line in baz(f) if 'bar' in line)
> 
> What about such  generator?
> 
> def gen():
>     with a() as f:
>         for x in f:
>             if p(x):
>                 with b(x) as g:
>                     for y in g:
>                         if q(y):
>                             yield y


Mechanically transforming that is easy. You just insert each with along with its 
corresponding for and if. There are no ambiguities for any of the three 
potential rules:

1. (y for x in f if p(x) with a() as f for y in g if q(y) with b(x) as g)
2. (y for x in f with a() as f if p(x) for y in g with b(x) as g if q(y))
3. (y with a() as f for x in f if p(x) with b(x) as g for y in g if q(y))

I suppose you could also argue for a "super-1" where we stick all the withs at 
the end, or a "super-3" where we stick them all at the beginning -- but it's hard 
to see a compelling argument for that. In fact, despite everything I said about 
clause structure not implying nesting, either one of those would look to me as 
if all the withs were at the outermost scope.

At any rate, unlike the simpler cases, here I have no opinion on which of those 
is clearest. They're all impossible to read at a glance (although breaking them 
up into multiple lines helps, I still don't have any clue what even the original
function means -- all those one-letter variables and functions, with
easily-confused letters to boot). But they're all quite easy to work out, or to 
construct, if you understand nested generator expressions and know the rule for 
where each clause goes.


From storchaka at gmail.com  Thu Nov 15 15:37:59 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Thu, 15 Nov 2012 16:37:59 +0200
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352989557.54116.YahooMailRC@web184705.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<k82mfv$120$1@ger.gmane.org>
	<1352989557.54116.YahooMailRC@web184705.mail.ne1.yahoo.com>
Message-ID: <k82uo9$bro$1@ger.gmane.org>

On 15.11.12 16:25, Andrew Barnert wrote:
> Mechanically transforming that is easy. You just insert each with along with its
> corresponding for and if. There are no ambiguities for any of the three
> potential rules:
>
> 1. (y for x in f if p(x) with a() as f for y in g if q(y) with b(x) as g)
> 2. (y for x in f with a() as f if p(x) for y in g with b(x) as g if q(y))
> 3. (y with a() as f for x in f if p(x) with b(x) as g for y in g if q(y))

And what about this (only one for/if for simplicity)?

     with a() as f:
         for x in f:
             with b() as g:
                 if p(x):
                     with c() as h:
                         yield x




From abarnert at yahoo.com  Thu Nov 15 16:25:14 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 07:25:14 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
Message-ID: <1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>

From: Nick Coghlan <ncoghlan at gmail.com>
Sent: Thu, November 15, 2012 4:39:42 AM


>On Thu, Nov 15, 2012 at 9:11 PM, Andrew Barnert <abarnert at yahoo.com> wrote:


> One, and only one, clause in a comprehension or generator expression is written
> out of sequence: the innermost clause is lifted out and written first.

Given that there are only three clauses, "flatten in order, then move expression 
to front" and "flatten in reverse order, then move if clause to back" are 
identical. I suppose you're right that, given that the rule for nested 
expressions is to preserve the order of nesting, the first description is more 
natural.

But at any rate, I don't think any such rule is what most Python programmers 
have internalized. People obviously know how to nest clauses in general (we 
couldn't speak human languages otherwise), but they do not know how to write, or 
even read, nested comprehensions. What they know is that there are three 
clauses, and they go expression-for-if, period. And those who do learn about 
nesting seem to guess the order wrong at least half the time (hence all the 
StackOverflow posts on "why does [x for x in range(y) for y in range(5)] give me 
a NameError?").

> So *if* a context management clause was added to comprehension syntax, it
> couldn't reasonably be added using the same design as was used to determine the
> placement of the current iteration and filtering clauses.

Sure it could. If you want to flatten in order, then lift the innermost 
expression to the left, exactly my "option 3". And any of the three options nest 
just as well as current generator expressions. This follows exactly the same 
rules you described:

(line with open(path, 'r') as file for line in file if line 
 with open('filelist', 'r') as filelistfile for path in filelistfile)
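
Spelled out as the generator function it is meant to stand for (my reading of
the intended nesting, with the filelist loop as the outer one since that is
where path is bound), that would be roughly:

    def lines_from_filelist():
        with open('filelist', 'r') as filelistfile:
            for path in filelistfile:
                with open(path, 'r') as file:
                    for line in file:
                        if line:
                            yield line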

I personally find option 1 more readable than option 3, but as I said, that's 
just a bare intuition, and I'm not married to it at all.

> If the determination is "place it at the end, and affect the whole 
> comprehension/generator expression regardless of the number of clauses"

I didn't even consider that. For the nested case, each for clause can have 0 or 
1 with clauses, just as it can have 0 or 1 if clauses. I can't see how anything 
else is reasonable -- how else could you handle the example above without keeping 
hundreds of files open unnecessarily? Of course for the non-nested case, there 
is no real distinction between "at the end" and "at the end of each nesting 
level".

> you're now very close to the point of it making more sense to propose allowing
> context managers on arbitrary expressions

No, not at all. If you read the blog post I linked, I explain this. But I'll try 
to answer your points below.

> as it would be impossible to explain why this was allowed:
> 
>     lines = list(line for line in f with open(name) as f)
>
> But this was illegal:
> 
>     lines = (f.readlines() with open(name) as f)

Those aren't at all the same. Or, if they are the same, the first one doesn't 
work.

In the second one, your with expression clearly modifies the expression 
f.readlines(), so the file is open until f.readlines() finishes. Great. 

But in the first, it clearly modifies f, so the file is open until f 
finishes -- that is, it gets closed immediately. There may be other things you 
could attach it to, but I can't think of any way either a human or a computer 
could parse it as being attached to the iteration inside the implicit generator 
created by the larger expression that this expression is a part of. And you're 
going to have the exact same problem with, e.g., "lines = (f with open(name) as 
f)" -- the only useful thing this could do is attach to something *inside the file 
object*, which is ridiculous. 

In fact, any statement that cannot be trivially rewritten as "with open(name) as 
f: lines = f.readlines()" also cannot possibly work right using a general with 
expression. Which means it's useless.

The only way you could possibly make this work is to make a context manager mean 
different things in different kinds of expressions. That's a horribly bad idea. 
It means you're building the with clause that I wanted, and a variety of other 
with clauses, all of which look similar enough to confuse both parsers and human 
beings, despite doing different things. It's like suggesting that we don't need 
if clauses in generator expressions because the more general ternary if 
expression already takes care of it.

So, in short, adding general with expressions not only doesn't solve my problem, 
it makes my problem harder to solve. And it's a bad idea in general, because it 
only works in cases where it's not needed. So, I'm very strongly against it.

> Generator expressions, like lambda expressions, are deliberately limited. If you
> want to avoid those limits, it's time to upgrade to a named generator or
> function. If you feel that puts things in the wrong order in your code then
> please, send me your use cases so I can consider adding them as examples in
> PEP 403 and PEP 3150.

My use case was at the top of my first email:

    upperlines = (line.upper() for line in file with open(path, 'r') as file)

This is the kind of thing beginners need to write and don't know how, and 
experienced developers do all the time in quick-n-dirty scripts and don't do 
properly even if they do know how, because it's difficult to write today.

And I'm not sure how PEP 403 or PEP 3150 would help.

> If you really want to enhance the capabilities of
> expressions, then the more general proposal is the only one with even a remote
> chance

I don't want to do things like turn assignments into expressions, add general 
with expressions, add bind expressions, etc. I want to make iterating over a 
context manager easy, and the with clause is the best way I could come up with 
to do it.

> and that's contingent on proving that the compiler can be set up to give
> generator expressions the semantics you propose (Off the top of my head, I
> suspect it should be possible, since the compiler will know it's in a
> comprehension or generator expression by the time it hits the with token, but
> there may be other technical limitations that ultimately rule it out).


I plan to experiment with implementing it in PyPy over the weekend, and if that 
works out I'll take a look at CPython. But I don't see any reason that options 1 
or 2 should be any problem; option 3 might be, but I'll see when I get there.


From abarnert at yahoo.com  Thu Nov 15 16:32:06 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 07:32:06 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k82uo9$bro$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<k82mfv$120$1@ger.gmane.org>
	<1352989557.54116.YahooMailRC@web184705.mail.ne1.yahoo.com>
	<k82uo9$bro$1@ger.gmane.org>
Message-ID: <1352993526.23110.YahooMailRC@web184701.mail.ne1.yahoo.com>

> From: Serhiy Storchaka <storchaka at gmail.com>
> Sent: Thu, November 15, 2012 6:42:35 AM
> 
> And what about this (only one for/if for simplicity)?
> 
>     with a() as f:
>         for x in f:
>             with b() as g:
>                 if p(x):
>                     with c() as h:
>                         yield x


If you can only have one with per for, this doesn't have a direct translation.

However, if you want to extend it to have any number of withs per for, that 
seems to rule out option 2, and maybe option 1, but seems fine with option 3:

(x with a() as f for x in f with b() as g if p(x) with c() as h)

The fact that option 3 can obviously do something which seems impossible in 
option 2, and which I can't work out in a few seconds off the top of my head 
with option 1, may be a more compelling argument than the fact that option 1 
instinctively looked cleaner to me (and the one other person who commented on 
the three choices).



From abarnert at yahoo.com  Thu Nov 15 16:36:34 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 15 Nov 2012 07:36:34 -0800 (PST)
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
	attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
Message-ID: <1352993794.20158.YahooMailRC@web184703.mail.ne1.yahoo.com>

From: Mike Meyer <mwm at mired.org>
Sent: Thu, November 15, 2012 2:29:44 AM

>If the goal is to make os.walk fast, then it might be better (on Posix systems, 

>anyway) to see if it can be built on top of ftw instead of low-level directory 
>scanning routines.


You can't actually use ftw, because it doesn't give any way to handle the 
options to os.walk. Plus, it's "obsolescent" (POSIX 2008), "deprecated" (Linux), 
or "legacy" (OS X), and at least some platforms will throw warnings when you 
call it. It's also somewhat underspecified, and different platforms, even 
different versions of the same platform, will give you different behavior in 
various cases (especially with symlinks).

But you could, e.g., use fts on platforms that have it, nftw on platforms that 
have a version close enough to recent linux/glibc for our purposes, and fall 
back to readdir+stat for the rest. That could give a big speed improvement on 
the most popular platforms, and on the others, at least things would be no worse 
than today (and anyone who cared could much more easily write the appropriate 
nftw/fts/whatever port for their platform once the framework was in place).
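
For illustration, here's roughly what the portable fallback path could look like
at the Python level (iterdir_stat_fallback is a made-up name for this sketch; the
whole point of the proposal is to avoid the per-entry stat() call where the
platform allows it):

    import os

    def iterdir_stat_fallback(path):
        # Portable fallback sketch: one os.stat() per entry, i.e. no faster
        # than what os.walk() already does today.  The fts/nftw/FindFirstFile
        # paths would fill in the stat information without this extra call.
        for name in os.listdir(path):
            try:
                st = os.stat(os.path.join(path, name))
            except OSError:
                st = None
            yield name, st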



From tjreedy at udel.edu  Thu Nov 15 18:54:49 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 15 Nov 2012 12:54:49 -0500
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <k83a99$r6k$1@ger.gmane.org>

On 11/15/2012 6:11 AM, Andrew Barnert wrote:
>> From: Phil Connell <pconnell at gmail.com>

>> While this  looks very clean, how do you propose the following should be
>> written
>> as a  generator expression?
>>
>> def foo():
>>      with open('foo') as  f:
>>          for line in f:
>>               if 'bar' in line:
>>                   yield line
>
> Exactly as you suggest (quoting you out of order to make the answer clearer):
>
>> (line
>> for line in f
>> if bar in 'line'
>> with open('foo') as f)

The simple rule for comprehensions is that the append (list comprehension) or 
yield (generator expression) is moved from last to first and the other 
statements/clauses are left in the same order.

> Which means the only question is, which one looks more readable:
>
> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

Which means that 3 is the proper one. In particular, if with clauses 
were added, f must be defined in the with clause before being used in the for 
clause, just as line must be defined in the for clause before being used in 
the if clause.

-- 
Terry Jan Reedy



From tjreedy at udel.edu  Thu Nov 15 18:56:47 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 15 Nov 2012 12:56:47 -0500
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k82mfv$120$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<k82mfv$120$1@ger.gmane.org>
Message-ID: <k83acu$r6k$2@ger.gmane.org>

On 11/15/2012 7:16 AM, Serhiy Storchaka wrote:
> On 15.11.12 13:11, Andrew Barnert wrote:
>> Which means the only question is, which one looks more readable:
>>
>> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
>> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
>> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)
>
> What about such generator?
>
> def gen():
>      with a() as f:
>          for x in f:
>              if p(x):
>                  with b(x) as g:
>                      for y in g:
>                          if q(y):
>                              yield y

The yield expression is moved from last to first and the rest are left 
in order, as now.
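
With today's syntax that rule is easy to check directly; a small self-contained
example (a, p and b are just dummy stand-ins for the names above):

    a = range(10)
    p = lambda x: x % 2 == 0

    def b(x):
        return range(x)

    def gen_statement():
        for x in a:
            if p(x):
                for y in b(x):
                    yield y

    gen_expr = (y for x in a if p(x) for y in b(x))
    assert list(gen_statement()) == list(gen_expr)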

-- 
Terry Jan Reedy



From storchaka at gmail.com  Thu Nov 15 19:15:30 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Thu, 15 Nov 2012 20:15:30 +0200
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352993526.23110.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<k82mfv$120$1@ger.gmane.org>
	<1352989557.54116.YahooMailRC@web184705.mail.ne1.yahoo.com>
	<k82uo9$bro$1@ger.gmane.org>
	<1352993526.23110.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <k83bg3$6pg$1@ger.gmane.org>

On 15.11.12 17:32, Andrew Barnert wrote:
> If you can only have one with per for, this doesn't have a direct translation.

Even with one "with" per "for" an ambiguity is possible for some 
options. with/for/if/yield, for/with/if/yield, for/if/with/yield, 
for/if/with/if/yield,... should have different transcription.

> The fact that option 3 can obviously do something which seems impossible in
> option 2, and which I can't work out in a few seconds off the top of my head
> with option 1, may be a more compelling argument than the fact that option 1
> instinctively looked cleaner to me (and the one other person who commented on
> the three choices).

Yes, that is what I wanted to show. If even for you, the author of the 
proposal, the most consistent option is the least obvious, then for 
others it will always lead to confusion.



From tjreedy at udel.edu  Thu Nov 15 19:36:45 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Thu, 15 Nov 2012 13:36:45 -0500
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <k83cnt$iau$1@ger.gmane.org>

On 11/15/2012 10:25 AM, Andrew Barnert wrote:
> From: Nick Coghlan <ncoghlan at gmail.com>
> Sent: Thu, November 15, 2012 4:39:42 AM
>
>
>> On Thu, Nov 15, 2012 at 9:11 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>
>> One, and only one, clause in a comprehension or generator expression is written
>> out of sequence: the innermost clause is lifted out and written first.

This is how list comps were designed and initially defined.

> Given that there are only three clauses,
 > "flatten in order, then move expression to front"

This is the simple and correct rule.

> and "flatten in reverse order, then move if clause to back"

This is more complicated and wrong.

> are identical.

Until one adds more clauses.

> I suppose you're right that, given that the rule for nested
> expressions is to preserve the order of nesting, the first description is more
> natural.
>
> But at any rate, I don't think any such rule is what most Python programmers
> have internalized.

It *is* the rule, and a very simple one. The reference manual gives it, 
though it could perhaps be clearer. The tutorial List Comprehension 
section does give a clear example:
'''
For example, this listcomp combines the elements of two lists if they 
are not equal:

 >>> [(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
and it's equivalent to:

 >>> combs = []
 >>> for x in [1,2,3]:
...     for y in [3,1,4]:
...         if x != y:
...             combs.append((x, y))
...
 >>> combs
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
Note how the order of the for and if statements is the same in both 
these snippets.
'''

> People obviously know how to nest clauses in general (we
> couldn't speak human languages otherwise), but they do not know how to write, or
> even read, nested comprehensions. What they know is that there are three
> clauses, and they go expression-for-if, period. And those who do learn about
> nesting seem to guess the order wrong at least half the time (hence all the
> StackOverflow posts on "why does [x for x in range(y) for y in range(5)] give me
> a NameError?").

Anyone who read and understood that snippet in the tutorial, which took 
me a minute to find, would not ask such a question. There are people who 
program Python without ever reading the manuals and guess as they go 
and, when they stumble, prefer to post questions on forums and wait for 
a customized answer rather than dig it out themselves.

-- 
Terry Jan Reedy




From benhoyt at gmail.com  Thu Nov 15 21:06:31 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 16 Nov 2012 09:06:31 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
Message-ID: <CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>

>> """Yield tuples of (filename, stat_result) for each filename in
>> directory given by "path". Like listdir(), '.' and '..' are skipped.
>> The values are yielded in system-dependent order.
>>
>> Each stat_result is an object like you'd get by calling os.stat() on
>> that file, but not all information is present on all systems, and st_*
>> fields that are not available will be None.
>>
>> In practice, stat_result is a full os.stat() on Windows, but only the
>> "is type" bits of the st_mode field are available on Linux/OS X/BSD.
>> """
>
> There's a code smell here, in that the doc for Unix variants is incomplete
> and wrong. Whether or not you get the d_type values depends on the OS having
> that extension. Further, there's a d_type value (DT_UNKNOWN) that isn't a
> valid value for the S_IFMT bits in st_mode (at least on BSD).

Not sure I understand why the docstring above is incomplete/wrong. I
say "Not all information is present on all systems" and "In practice
... only the 'is type' bits of the st_mode field are available on
Linux/OS X/BSD". All three of those systems provide d_type, so I think
that's correct.

-Ben


From greg.ewing at canterbury.ac.nz  Thu Nov 15 22:17:58 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 16 Nov 2012 10:17:58 +1300
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <50A55C06.3060802@canterbury.ac.nz>

Andrew Barnert wrote:
> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

Order 3 is the most consistent with existing features. The
only out-of-order thing about comprehensions currently is that
the result expression comes first instead of last. Everything
else is in the same order as the statement expansion.

-- 
Greg


From greg.ewing at canterbury.ac.nz  Thu Nov 15 22:30:32 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 16 Nov 2012 10:30:32 +1300
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352979452.86358.YahooMailRC@web184703.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<1352974107.39869.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<894D7633-596F-4E16-85D6-42562AADEB02@masklinn.net>
	<1352979452.86358.YahooMailRC@web184703.mail.ne1.yahoo.com>
Message-ID: <50A55EF8.2020909@canterbury.ac.nz>

Andrew Barnert wrote:
>"As long as the generator has values to generate, and 
> has not been closed or destroyed" is a dynamic scope.

It would be less confusing to call this a "dynamic lifetime".
The term "dynamic scoping" already exists and means something
different.

-- 
Greg




From greg.ewing at canterbury.ac.nz  Thu Nov 15 22:42:12 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 16 Nov 2012 10:42:12 +1300
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <20121115112251.GA13472@phconnel-ws.cisco.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<20121115112251.GA13472@phconnel-ws.cisco.com>
Message-ID: <50A561B4.2010408@canterbury.ac.nz>

Phil Connell wrote:
> To me, 1 feels like it captures the semantics the best - the "with" clause is
> tacked onto the generator expression "(foo(line) ... for ... if)" and applies
> to the whole of that expression.

But that would break the existing rule that binding clauses
in a comprehension have to precede the expressions that use
the bound variable. If you're allowed to write

   (foo(line) for line in baz(f) with open('foo') as f)

then it's not obvious why you can't write

   (foo(line) if 'bar' in line for line in lines)

Are you suggesting that the latter should be allowed as
well?

-- 
Greg


From random832 at fastmail.us  Thu Nov 15 22:43:08 2012
From: random832 at fastmail.us (random832 at fastmail.us)
Date: Thu, 15 Nov 2012 16:43:08 -0500
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <50A43665.4010406@pearwood.info>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us> <50A43665.4010406@pearwood.info>
Message-ID: <1353015788.22979.140661154200249.5F8D101F@webmail.messagingengine.com>

On Wed, Nov 14, 2012, at 19:25, Steven D'Aprano wrote:
> Shebang lines aren't interpreted by Python, but by the shell.
> 
> To be precise, it isn't the shell either, but the program loader, I
> think.
> But whatever it is, it isn't Python.

That's obviously untrue - the shell or the kernel or whatever piece it
is doesn't know what -E or -s does; it simply passes them to
Python. Now, as the error messages show, it passes them as a single
string rather than (as you would ordinarily expect) as two strings, but
it's all _there_ for Python to see, even without trying to read it from
the file (which it also could do).


From random832 at fastmail.us  Thu Nov 15 22:45:22 2012
From: random832 at fastmail.us (random832 at fastmail.us)
Date: Thu, 15 Nov 2012 16:45:22 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
Message-ID: <1353015922.23812.140661154202993.035FA1E6@webmail.messagingengine.com>

On Wed, Nov 14, 2012, at 20:37, Ben Hoyt wrote:
> Not currently. I haven't thought about this too hard -- there may be a
> bit that's always set/not set within st_mode itself. Otherwise I'd
> have to add a full_st_mode or similar property

Why not just add d_type?


From benhoyt at gmail.com  Thu Nov 15 22:50:32 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 16 Nov 2012 10:50:32 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <1353015922.23812.140661154202993.035FA1E6@webmail.messagingengine.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
	<1353015922.23812.140661154202993.035FA1E6@webmail.messagingengine.com>
Message-ID: <CAL9jXCGOWZVCu5G6V34ER_GYY5wv1dPw30L=PMuTHC-=s_kHAw@mail.gmail.com>

>> Not currently. I haven't thought about this too hard -- there may be a
>> bit that's always set/not set within st_mode itself. Otherwise I'd
>> have to add a full_st_mode or similar property
>
> Why not just add d_type?

It's not a bad idea, but neither am I super-keen on it, for
these two reasons:

1) You'd have to add a whole new way / set of constants / functions
to test for the different values of d_type, whereas there's already
stuff (the stat module) to test for st_mode values.

2) It'd make the typical use case more complex. For example, the
straight "if st.st_mode is None ... else ..." I gave earlier becomes
this:

for filename, st in iterdir_stat(path):
     if st.d_type is None:
          if st.st_mode is None:
               st = os.stat(os.path.join(path, filename))
          is_dir = stat.S_ISDIR(st.st_mode)
     else:
          is_dir = st.d_type == DT_DIR

-Ben


From grosser.meister.morti at gmx.net  Thu Nov 15 23:45:31 2012
From: grosser.meister.morti at gmx.net (=?UTF-8?B?TWF0aGlhcyBQYW56ZW5iw7Zjaw==?=)
Date: Thu, 15 Nov 2012 23:45:31 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CAPOVWOQ6KYBOCWQkRhrzpYBUhBBfVeZAkAYFg0LqTghJOx4rQg@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
	<CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
	<509A9945.6040609@gmx.net>
	<CAPOVWOQ6KYBOCWQkRhrzpYBUhBBfVeZAkAYFg0LqTghJOx4rQg@mail.gmail.com>
Message-ID: <50A5708B.1020507@gmx.net>

On 11/08/2012 07:11 AM, Senthil Kumaran wrote:
> On Wed, Nov 7, 2012 at 9:24 AM, Mathias Panzenböck
> <grosser.meister.morti at gmx.net> wrote:
>> Sorry, I don't quite understand.
>> Do you mean the parse_data_url function should be removed and put into
>> DataResponse (or DataHandler)?
>>
>>
>>> and expected results be returned should be considered?
>>
>>
>> What expected results? And in what way should they be considered? Considered
>> for what?
>
> I meant, urlopen("data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
> AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
> 9TXL0Y4OHwAAAABJRU5ErkJggg==")
>
> should work out of box, wherein the DataHandler example that is the
> documentation is made available in request.py and added to
> OpenerDirector by default. I find it hard to gauge the utility, but
> documentation is ofcourse a +1.
>
> Thanks,
> Senthil
>

Yes, I would also be in favor of including this in Python, but I was told here that I 
should write it as a recipe in the documentation.

It is useful, e.g., for crawlers/spiders that analyze webpages including their images.
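
For anyone curious, the core of such a handler is quite small; a rough sketch of
just the parsing step (illustrative only -- this is not the actual code under
discussion, and a real handler also has to produce the response headers) could
look like this:

    import base64
    from urllib.parse import unquote_to_bytes

    def parse_data_url(url):
        # Rough sketch only.  Format: data:[<mediatype>][;base64],<data>
        if not url.startswith('data:'):
            raise ValueError('not a data: URL')
        header, sep, data = url[5:].partition(',')
        if not sep:
            raise ValueError('missing "," in data: URL')
        if header.endswith(';base64'):
            mediatype, body = header[:-7], base64.b64decode(data)
        else:
            mediatype, body = header, unquote_to_bytes(data)
        return mediatype or 'text/plain;charset=US-ASCII', body

    # parse_data_url('data:text/plain;base64,SGVsbG8=') == ('text/plain', b'Hello')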


From p.f.moore at gmail.com  Thu Nov 15 23:48:27 2012
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 15 Nov 2012 22:48:27 +0000
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <50A5708B.1020507@gmx.net>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
	<CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
	<509A9945.6040609@gmx.net>
	<CAPOVWOQ6KYBOCWQkRhrzpYBUhBBfVeZAkAYFg0LqTghJOx4rQg@mail.gmail.com>
	<50A5708B.1020507@gmx.net>
Message-ID: <CACac1F8m3pKRS5bYr85oA=WwzhuP4OjhQUcYiA4jp8RuJWj8Dw@mail.gmail.com>

On 15 November 2012 22:45, Mathias Panzenböck <grosser.meister.morti at gmx.net
> wrote:

> Yes, I would also be in favor to including this in python, but I was told
> here I should write it as recipe in the documentation.
>
> It is e.g. useful for crawlers/spiders, that analyze webpages including
> their images.
>

It would be good in the stdlib. By all means submit a patch for adding it.
Paul

From mwm at mired.org  Fri Nov 16 00:18:10 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 15 Nov 2012 17:18:10 -0600
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <1353015788.22979.140661154200249.5F8D101F@webmail.messagingengine.com>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us> <50A43665.4010406@pearwood.info>
	<1353015788.22979.140661154200249.5F8D101F@webmail.messagingengine.com>
Message-ID: <CAD=7U2BbudYcuJpmnY7PAw8cSOj7Y-SvFCEATDhvNOmc3_wkbQ@mail.gmail.com>

On Nov 15, 2012 3:43 PM, <random832 at fastmail.us> wrote:
>
> On Wed, Nov 14, 2012, at 19:25, Steven D'Aprano wrote:
> > Shebang lines aren't interpreted by Python, but by the shell.
> >
> > To be precise, it isn't the shell either, but the program loader, I
> > think.
> > But whatever it is, it isn't Python.
>
> That's obviously untrue - the shell or the kernel or whatever piece it
> is doesn't know what an -E or a -s does, it simply passes them to
> python. Now, as the error messages show, it passes them as a single
> string rather than (as you would ordinarily expect) as two strings, but
> it's all _there_ for python to see, even without trying to read it from
> the file (which it also could do).

It's obviously true. The kernel (or shell, as the case may be) interprets
the shebang line to find the executable and to pick out the arguments to pass
to the executable. The executable (Python) then interprets the arguments,
without ever having seen the shebang line.

While Python could in theory start reading and interpreting the shebang
line, I don't think there's a sane way to decide when to do so, since you
can also set the arguments on the command line by invoking scripts explicitly.

      <mike

From mwm at mired.org  Fri Nov 16 00:47:06 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 15 Nov 2012 17:47:06 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
Message-ID: <CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>

On Nov 15, 2012 2:06 PM, "Ben Hoyt" <benhoyt at gmail.com> wrote:
>
> >> """Yield tuples of (filename, stat_result) for each filename in
> >> directory given by "path". Like listdir(), '.' and '..' are skipped.
> >> The values are yielded in system-dependent order.
> >>
> >> Each stat_result is an object like you'd get by calling os.stat() on
> >> that file, but not all information is present on all systems, and st_*
> >> fields that are not available will be None.
> >>
> >> In practice, stat_result is a full os.stat() on Windows, but only the
> >> "is type" bits of the st_mode field are available on Linux/OS X/BSD.
> >> """
> >
> > There's a code smell here, in that the doc for Unix variants is
incomplete
> > and wrong. Whether or not you get the d_type values depends on the OS
having
> > that extension. Further, there's a d_type value (DT_UNKNOWN) that isn't
a
> > valid value for the S_IFMT bits in st_mode (at least on BSD).
>
> Not sure I understand why the docstring above is incomplete/wrong.

It's incomplete because it doesn't say what happens on other Posix
systems. It's wrong because it implies that the type bits of st_mode are
always available, when that's not the case.

Better would be 'on Posix systems, if st_mode is not None, only the type
bits are valid' -- assuming that the underlying code translates DT_UNKNOWN
into binding st_mode to None.

From ncoghlan at gmail.com  Fri Nov 16 01:46:28 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 16 Nov 2012 10:46:28 +1000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>

On Fri, Nov 16, 2012 at 1:25 AM, Andrew Barnert <abarnert at yahoo.com> wrote:

> My use case was at the top of my first email:
>
>     upperlines = (line.upper() for line in file with open(path, 'r') as
> file)
>

And it's *awful*. If the only thing on the RHS of a simple assignment
statement is a lambda or generator expression, that code should almost
always be rewritten with def as a matter of style, regardless of other
considerations.

However, I realised there's a more serious problem with your idea: the
outermost clause in a list comprehension or generator expression is
evaluated immediately and passed as an argument to the inner scope that
implements the loop, so you have an unresolved sequencing problem between
the evaluation of that argument and the evaluation of the context manager.
If you want the context manager inside the generator, you *can't* reference
the name bound in the as clause in the outermost iterable.
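
A quick way to see that sequencing for yourself with today's syntax (made-up
names, purely to show when the outermost iterable is evaluated):

    def outermost():
        print("outermost iterable evaluated now")
        return [1, 2, 3]

    g = (x * 2 for x in outermost())   # prints immediately, at definition time
    print("generator created, not yet iterated")
    print(list(g))                     # only now does the loop body run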

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

From jimjjewett at gmail.com  Fri Nov 16 02:11:25 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Thu, 15 Nov 2012 20:11:25 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
Message-ID: <CA+OGgf7JB9txVoRwOFeKvzS-P1T6De4OWQMuqO3CgUn_0y-+iw@mail.gmail.com>

On 11/15/12, Mike Meyer <mwm at mired.org> wrote:
> On Nov 15, 2012 2:06 PM, "Ben Hoyt" <benhoyt at gmail.com> wrote:
>>
>> >> """Yield tuples of (filename, stat_result) for each filename in
>> >> directory given by "path". Like listdir(), '.' and '..' are skipped.
>> >> The values are yielded in system-dependent order.

>> >> Each stat_result is an object like you'd get by calling os.stat() on
>> >> that file, but not all information is present on all systems, and st_*
>> >> fields that are not available will be None.

>> >> In practice, stat_result is a full os.stat() on Windows, but only the
>> >> "is type" bits of the st_mode field are available on Linux/OS X/BSD.
>> >> """

> Better would be 'on Posix systems, if st_mode is not None only the type
> bits are valid.' Assuming that the underlying code translates DT_UNKNOWN to
> binding st_mode to None.

The specification allows other fields as well; is it really the case
that *no* filesystem supports them?

Perhaps:

"""Yield tuples of (filename, stat_result) for each file in the "path"
directory, excluding the '.' and '..' entries.

The order of results is arbitrary, and the effect of modifying a
directory after generator creation is filesystem-dependent.

Each stat_result is similar to the result of os.stat(filename), except
that only the directory entry itself is examined; any attribute which
would require a second system call (even os.stat) is set to None.

In practice, Windows will typically fill in all attributes; other
systems are most likely to fill in only the "is type" bits, or even
nothing at all.
"""

-jJ


From benhoyt at gmail.com  Fri Nov 16 02:15:33 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 16 Nov 2012 14:15:33 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
Message-ID: <CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>

> Better would be 'on Posix systems, if st_mode is not None only the type bits
> are valid.' Assuming that the underlying code translates DT_UNKNOWN to
> binding st_mode to None.

Yep, fair enough -- I'll update it.

-Ben


From grosser.meister.morti at gmx.net  Fri Nov 16 04:51:37 2012
From: grosser.meister.morti at gmx.net (=?ISO-8859-1?Q?Mathias_Panzenb=F6ck?=)
Date: Fri, 16 Nov 2012 04:51:37 +0100
Subject: [Python-ideas] Support data: URLs in urllib
In-Reply-To: <CACac1F8m3pKRS5bYr85oA=WwzhuP4OjhQUcYiA4jp8RuJWj8Dw@mail.gmail.com>
References: <5090B0FC.1030801@gmx.net>
	<CACac1F-j74ZbAwCq38KhkVB3iZCNC1aQM0wefcAYKm+1CNeppA@mail.gmail.com>
	<50945B9D.8010002@gmx.net>
	<CACac1F_P4L7b26fu1sh7hz0QMLKRP-vpLAx45MGBOgd9JNOoow@mail.gmail.com>
	<5095CAC2.6010309@gmx.net>
	<CACac1F8AnEsairyxf8YKYxMERan+C04rGRaik_OxAdpEBz6wfg@mail.gmail.com>
	<5099D96A.2090602@gmx.net>
	<CAPOVWOTNTZ9Enxc_PbetpgNu6v9iS3xf9G92Jgu4tzOuH5BpjA@mail.gmail.com>
	<509A9945.6040609@gmx.net>
	<CAPOVWOQ6KYBOCWQkRhrzpYBUhBBfVeZAkAYFg0LqTghJOx4rQg@mail.gmail.com>
	<50A5708B.1020507@gmx.net>
	<CACac1F8m3pKRS5bYr85oA=WwzhuP4OjhQUcYiA4jp8RuJWj8Dw@mail.gmail.com>
Message-ID: <50A5B849.40104@gmx.net>

On 11/15/2012 11:48 PM, Paul Moore wrote:
> On 15 November 2012 22:45, Mathias Panzenböck <grosser.meister.morti at gmx.net
> <mailto:grosser.meister.morti at gmx.net>> wrote:
>
>     Yes, I would also be in favor to including this in python, but I was told here I should write it
>     as recipe in the documentation.
>
>     It is e.g. useful for crawlers/spiders, that analyze webpages including their images.
>
>
> It would be good in the stdlib. By all means submit a patch for adding it.
> Paul

Ok, I added a patch that adds this to the stdlib to this issue:
http://bugs.python.org/issue16423

I changed my code so it is more aligned with the existing code in urllib.request.


	-panzi


From grosser.meister.morti at gmx.net  Fri Nov 16 05:12:06 2012
From: grosser.meister.morti at gmx.net (=?windows-1252?Q?Mathias_Panzenb=F6ck?=)
Date: Fri, 16 Nov 2012 05:12:06 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
Message-ID: <50A5BD16.6040405@gmx.net>

I think this syntax would still make sense for list comprehensions:

upperlines = [line.upper() for line in file with open('foo', 'r') as file]

On 11/15/2012 10:29 AM, Masklinn wrote:
>
> On 2012-11-15, at 04:44 , Andrew Barnert wrote:
>
>> First, I realize that people regularly propose with expressions. This is not the
>> same thing.
>>
>> The problem with the with statement is not that it can't be postfixed
>> perl-style, or used in expressions. The problem is that it can't be used with
>> generator expressions.
>>
>> Here's the suggestion:
>>
>>     upperlines = (lines.upper() for line in file with open('foo', 'r') as file)
>>
>> This would be equivalent to:
>>
>>     def foo():
>>         with open('foo', 'r') as file:
>>             for line in file:
>>                 yield line.upper()
>>     upperlines = foo()
>>
>> The motivation is that there is no way to write this properly using a with
>> statement and a generator expression -- in fact, the only way to get this right is
>> with the generator function above.
>
> Actually, it's extremely debatable that the generator function is
> correct: if the generator is not fully consumed (terminating iteration
> on the file) I'm pretty sure the file will *not* get closed save by the
> GC doing a pass on all dead objects maybe. This means this function is
> *not safe* as a lazy source to an arbitrary client, as that client may
> very well use itertools.slice or itertools.takewhile and only partially
> consume the generator.
>
> Here's an example:
>
> --
> import itertools
>
> class Manager(object):
>      def __enter__(self):
>          return self
>
>      def __exit__(self, *args):
>          print("Exited")
>
>      def __iter__(self):
>          for i in range(5):
>              yield i
>
> def foo():
>      with Manager() as ms:
>          for m in ms:
>              yield m
>
> def bar():
>      print("1")
>      f = foo()
>      print("2")
>      # Only consume part of the iterable
>      list(itertools.islice(f, None, 2))
>      print("3")
>
> bar()
> print("4")
> --
>
> CPython output, I'm impressed that the refcounting GC actually bothers
> unwinding the stack and running the __exit__ handler *once bar has
> finished executing*:
>
>> python3 withgen.py
> 1
> 2
> 3
> Exited
> 4
>
> But here's the (just as correct, as far as I can tell) output from pypy:
>
>> pypy-c withgen.py
> 1
> 2
> 3
> 4
>
> If the program was long running, it is possible that pypy would run
> __exit__ when the containing generator is released (though by no means
> certain, I don't know if this is specified at all).
>
> This is in fact one of the huge issues with faking dynamic scopes via
> threadlocals and context managers (as e.g. Flask might do, I'm not sure
> what actual strategy it uses), they interact rather weirdly with
> generators (it's also why I think Python should support actually
> dynamically scoped variables, it would also fix the thread-broken
> behavior of e.g. warnings.catch_warnings)
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>



From guido at python.org  Fri Nov 16 05:27:49 2012
From: guido at python.org (Guido van Rossum)
Date: Thu, 15 Nov 2012 20:27:49 -0800
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A5BD16.6040405@gmx.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<50A5BD16.6040405@gmx.net>
Message-ID: <CAP7+vJL8=qs80QATqNqKSAB7crc_ouugrB_3tNdtVWap67ptNw@mail.gmail.com>

On Thu, Nov 15, 2012 at 8:12 PM, Mathias Panzenböck
<grosser.meister.morti at gmx.net> wrote:
> I think this syntax would still make sense for list comprehensions:
>
> upperlines = [line.upper() for line in file with open('foo', 'r') as file]

-1000. There is no discernible advantage over

with open(...) as file:
  upperlines = [line.upper() for line in file]

Also you've got the order backwards -- when there's a sequence of
'for' and 'if' clauses in a comprehension, they are to be read from
left to right, but here you're tacking something onto the end that's
supposed to go first.

Please don't destroy my beautiful language.

--Guido

> On 11/15/2012 10:29 AM, Masklinn wrote:
>>
>>
>> On 2012-11-15, at 04:44 , Andrew Barnert wrote:
>>
>>> First, I realize that people regularly propose with expressions. This is
>>> not the
>>> same thing.
>>>
>>> The problem with the with statement is not that it can't be postfixed
>>> perl-style, or used in expressions. The problem is that it can't be used
>>> with
>>> generator expressions.
>>>
>>> Here's the suggestion:
>>>
>>>     upperlines = (lines.upper() for line in file with open('foo', 'r') as
>>> file)
>>>
>>> This would be equivalent to:
>>>
>>>     def foo():
>>>         with open('foo', 'r') as file:
>>>             for line in file:
>>>                 yield line.upper()
>>>     upperlines = foo()
>>>
>>> The motivation is that there is no way to write this properly using a
>>> with
>>> statement and a generator expression?in fact, the only way to get this
>>> right is
>>> with the generator function above.
>>
>>
>> Actually, it's extremely debatable that the generator function is
>> correct: if the generator is not fully consumed (terminating iteration
>> on the file) I'm pretty sure the file will *not* get closed save by the
>> GC doing a pass on all dead objects maybe. This means this function is
>> *not safe* as a lazy source to an arbitrary client, as that client may
>> very well use itertools.slice or itertools.takewhile and only partially
>> consume the generator.
>>
>> Here's an example:
>>
>> --
>> import itertools
>>
>> class Manager(object):
>>      def __enter__(self):
>>          return self
>>
>>      def __exit__(self, *args):
>>          print("Exited")
>>
>>      def __iter__(self):
>>          for i in range(5):
>>              yield i
>>
>> def foo():
>>      with Manager() as ms:
>>          for m in ms:
>>              yield m
>>
>> def bar():
>>      print("1")
>>      f = foo()
>>      print("2")
>>      # Only consume part of the iterable
>>      list(itertools.islice(f, None, 2))
>>      print("3")
>>
>> bar()
>> print("4")
>> --
>>
>> CPython output, I'm impressed that the refcounting GC actually bothers
>> unwinding the stack and running the __exit__ handler *once bar has
>> finished executing*:
>>
>>> python3 withgen.py
>>
>> 1
>> 2
>> 3
>> Exited
>> 4
>>
>> But here's the (just as correct, as far as I can tell) output from pypy:
>>
>>> pypy-c withgen.py
>>
>> 1
>> 2
>> 3
>> 4
>>
>> If the program was long running, it is possible that pypy would run
>> __exit__ when the containing generator is released (though by no means
>> certain, I don't know if this is specified at all).
>>
>> This is in fact one of the huge issues with faking dynamic scopes via
>> threadlocals and context managers (as e.g. Flask might do, I'm not sure
>> what actual strategy it uses), they interact rather weirdly with
>> generators (it's also why I think Python should support actually
>> dynamically scoped variables, it would also fix the thread-broken
>> behavior of e.g. warnings.catch_warnings)
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> http://mail.python.org/mailman/listinfo/python-ideas
>>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas



-- 
--Guido van Rossum (python.org/~guido)


From grosser.meister.morti at gmx.net  Fri Nov 16 05:33:18 2012
From: grosser.meister.morti at gmx.net (=?UTF-8?B?TWF0aGlhcyBQYW56ZW5iw7Zjaw==?=)
Date: Fri, 16 Nov 2012 05:33:18 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <50A5C20E.4050604@gmx.net>

Just throwing random syntax variations at the wall to see what, if anything, sticks (because I think 
the "as file" assignment serves no purpose here):

     upperlines = (line.upper() for line in with open('foo', 'r'))
     upperlines = (line.upper() for line with open('foo', 'r'))
     upperlines = (line.upper() with for line in open('foo', 'r'))

Or should the for loop check if there are __enter__ and __exit__ methods and call them? Guess not, 
but I thought I'd just mention it as an alternative.

For now one can do this, which is functionally equivalent but adds the overhead of another generator:

     def managed(sequence):
         with sequence:
             for item in sequence:
                 yield item

     upperlines = (line.upper() for line in managed(open('foo', 'r')))

You could even call this helper function "with_", if you like.

Or write a helper like this:

     def iterlines(filename,*args,**kwargs):
         with open(filename,*args,**kwargs) as f:
             for line in f:
                 yield line

     upperlines = (line.upper() for line in iterlines('foo', 'r'))

Maybe there should be a way to let a file be automatically closed when EOF is encountered? Maybe an 
"autoclose" wrapper object that passes every method call through to the file object, but closes the 
file object when EOF is encountered during a read? Then one could write (a rough sketch of such a 
wrapper follows below):

     upperlines = (line.upper() for line in autoclose(open('foo', 'r')))
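
A minimal sketch of the wrapper idea, covering only iteration (a real one would
have to intercept the other read methods too):

     class autoclose(object):
         """Wrap a file and close it as soon as iteration hits EOF."""
         def __init__(self, f):
             self._f = f
         def __iter__(self):
             return self
         def __next__(self):
             line = self._f.readline()
             if not line:
                 self._f.close()
                 raise StopIteration
             return line
         next = __next__                       # Python 2 spelling
         def __getattr__(self, name):
             return getattr(self._f, name)     # pass everything else through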


On 11/15/2012 04:44 AM, Andrew Barnert wrote:
> First, I realize that people regularly propose with expressions. This is not the
> same thing.
>
> The problem with the with statement is not that it can't be postfixed
> perl-style, or used in expressions. The problem is that it can't be used with
> generator expressions.
>
> Here's the suggestion:
>
>      upperlines = (lines.upper() for line in file with open('foo', 'r') as file)
>
> This would be equivalent to:
>
>      def foo():
>          with open('foo', 'r') as file:
>              for line in file:
>                  yield line.upper()
>      upperlines = foo()
>
> The motivation is that there is no way to write this properly using a with
> statement and a generator expression -- in fact, the only way to get this right is
> with the generator function above. And almost nobody ever gets it right, even
> when you push them in the right direction (although occasionally they write a
> complex class that has the same effect).
>
> That's why we still have tons of code like this lying around:
>
>      upperlines = (lines.upper() for line in open('foo', 'r'))
>
> Everyone knows that this only works with CPython, and isn't even quite right
> there, and yet people write it anyway, because there's no good alternative.
>
> The with clause is inherently part of the generator expression, because the
> scope has to be dynamic. The file has to be closed when iteration finishes, not
> when creating the generator finishes (or when the generator is cleaned up -- which
> is closer, but still wrong).
>
> That's why a general-purpose "with expression" wouldn't actually help here; in
> fact, it would just make generator expressions with with clauses harder to
> parse. A with expression would have to be statically scoped to be general.
>
> For more details, see this:
>
> http://stupidpythonideas.blogspot.com/2012/11/with-clauses-for-generator-expressions.html



From grosser.meister.morti at gmx.net  Fri Nov 16 05:39:16 2012
From: grosser.meister.morti at gmx.net (=?windows-1252?Q?Mathias_Panzenb=F6ck?=)
Date: Fri, 16 Nov 2012 05:39:16 +0100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CAP7+vJL8=qs80QATqNqKSAB7crc_ouugrB_3tNdtVWap67ptNw@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<50A5BD16.6040405@gmx.net>
	<CAP7+vJL8=qs80QATqNqKSAB7crc_ouugrB_3tNdtVWap67ptNw@mail.gmail.com>
Message-ID: <50A5C374.2000306@gmx.net>

Oh yes, you're right. Didn't think of this. Maybe I should go to bed/not write comments here at 5:38am.

On 11/16/2012 05:27 AM, Guido van Rossum wrote:
> On Thu, Nov 15, 2012 at 8:12 PM, Mathias Panzenböck
> <grosser.meister.morti at gmx.net> wrote:
>> I think this syntax would still make sense for list comprehensions:
>>
>> upperlines = [line.upper() for line in file with open('foo', 'r') as file]
>
> -1000. There is no discernible advantage over
>
> with open(...) as file:
>    upperlines = [line.upper() for line in file]
>
> Also you've got the order backwards -- when there's a sequence of
> 'for' and 'if' clauses in a comprehension, they are to be read from
> left to right, but here you're tacking something onto the end that's
> supposed to go first.
>
> Please don't destroy my beautiful language.
>
> --Guido
>
>> On 11/15/2012 10:29 AM, Masklinn wrote:
>>>
>>>
>>> On 2012-11-15, at 04:44 , Andrew Barnert wrote:
>>>
>>>> First, I realize that people regularly propose with expressions. This is
>>>> not the
>>>> same thing.
>>>>
>>>> The problem with the with statement is not that it can't be postfixed
>>>> perl-style, or used in expressions. The problem is that it can't be used
>>>> with
>>>> generator expressions.
>>>>
>>>> Here's the suggestion:
>>>>
>>>>      upperlines = (lines.upper() for line in file with open('foo', 'r') as
>>>> file)
>>>>
>>>> This would be equivalent to:
>>>>
>>>>      def foo():
>>>>          with open('foo', 'r') as file:
>>>>              for line in file:
>>>>                  yield line.upper()
>>>>      upperlines = foo()
>>>>
>>>> The motivation is that there is no way to write this properly using a
>>>> with
>>>> statement and a generator expression?in fact, the only way to get this
>>>> right is
>>>> with the generator function above.
>>>
>>>
>>> Actually, it's extremely debatable that the generator function is
>>> correct: if the generator is not fully consumed (terminating iteration
>>> on the file) I'm pretty sure the file will *not* get closed save by the
>>> GC doing a pass on all dead objects maybe. This means this function is
>>> *not safe* as a lazy source to an arbitrary client, as that client may
>>> very well use itertools.slice or itertools.takewhile and only partially
>>> consume the generator.
>>>
>>> Here's an example:
>>>
>>> --
>>> import itertools
>>>
>>> class Manager(object):
>>>       def __enter__(self):
>>>           return self
>>>
>>>       def __exit__(self, *args):
>>>           print("Exited")
>>>
>>>       def __iter__(self):
>>>           for i in range(5):
>>>               yield i
>>>
>>> def foo():
>>>       with Manager() as ms:
>>>           for m in ms:
>>>               yield m
>>>
>>> def bar():
>>>       print("1")
>>>       f = foo()
>>>       print("2")
>>>       # Only consume part of the iterable
>>>       list(itertools.islice(f, None, 2))
>>>       print("3")
>>>
>>> bar()
>>> print("4")
>>> --
>>>
>>> CPython output, I'm impressed that the refcounting GC actually bothers
>>> unwinding the stack and running the __exit__ handler *once bar has
>>> finished executing*:
>>>
>>>> python3 withgen.py
>>>
>>> 1
>>> 2
>>> 3
>>> Exited
>>> 4
>>>
>>> But here's the (just as correct, as far as I can tell) output from pypy:
>>>
>>>> pypy-c withgen.py
>>>
>>> 1
>>> 2
>>> 3
>>> 4
>>>
>>> If the program was long running, it is possible that pypy would run
>>> __exit__ when the containing generator is released (though by no means
>>> certain, I don't know if this is specified at all).
>>>
>>> This is in fact one of the huge issues with faking dynamic scopes via
>>> threadlocals and context managers (as e.g. Flask might do, I'm not sure
>>> what actual strategy it uses), they interact rather weirdly with
>>> generators (it's also why I think Python should support actually
>>> dynamically scoped variables, it would also fix the thread-broken
>>> behavior of e.g. warnings.catch_warnings)




From steve at pearwood.info  Fri Nov 16 07:05:02 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Fri, 16 Nov 2012 17:05:02 +1100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <50A5D78E.9040406@pearwood.info>

On 15/11/12 22:11, Andrew Barnert wrote:
> Which means the only question is, which one looks more readable:
>
> 1. (foo(line) for line in baz(f) if 'bar' in line with open('foo') as f)
> 2. (foo(line) for line in baz(f) with open('foo') as f if 'bar' in line)
> 3. (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)

Is that a trick question?

Answer: None of them.

In my opinion, they are all too busy for a generator expression and should
be re-written as a generator function.

As far as the given use-case is concerned:

upperlines = (line.upper() for line in open('foo'))

I don't see what the concern is. The file will remain open so long as the
generator is not exhausted, but that has to be the case no matter what you
do. If the generator is thrown away before being exhausted, the file will
eventually be closed by the garbage collector, if only when the application
or script exits. For short-lived scripts, the temporary leakage of a file
handle or two is hardly likely to be a serious problem.

Presumably if you have a long-lived application with many such opened
files, you might risk running out of file handles when running under Jython
or IronPython. But I think that's a sufficiently unusual and advanced
use-case that I'm not worried that this is a problem that needs solving with
syntax instead of education.



-- 
Steven


From abarnert at yahoo.com  Fri Nov 16 10:09:24 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 01:09:24 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A55C06.3060802@canterbury.ac.nz>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
Message-ID: <1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>

So far, nearly everyone is discussing things which are tangential, or arguing 
that one of the optional variants is bad. So let me strip down the proposal, 
without any options in it, and expand on a use case. The syntax is:


    (foo(line) with open('bar') as f for line in baz(f))

This translates to calling this function:

    def gen():
        with open('bar') as f:
            for line in baz(f):
                yield foo(line)

The translation for with clauses is identical to that for for and if clauses, and 
nesting works in the obvious way.

So, why do I want to create a generator that wraps a file or other generator 
inside a with clause?

There are a wide range of modules that have functions that can take a generator 
of strings in place of a file. Some examples off the top of my head include 
numpy.loadtxt, poster.multipart_encode, and line_protocol.connection.send. Many 
of these are asynchronous, so I can't just wrap the call in a with statement; I 
have to send a generator that will close the wrapped file (or other generator) 
when it's exhausted or closed, instead of when the function returns.

So, imagine a simple "get" command in a mail server, a method in the Connection 
class:

    def handle_get(self, message_id):
        path = os.path.join(mailbox_path, message_id)
        self.send_async(open(path, 'r'))

Now, let's say I want to do some kind of processing on the file as I send it 
(e.g., remove excessive curse words, or add new ones in if there aren't enough 
in any line):

    def handle_get(self, message_id):
        path = os.path.join(mailbox_path, message_id)
        def censored_file():
            with open(path, 'r') as file:
                for line in file:
                    yield self.censor(line)
        self.send_async(censored_file())

With my suggested idea, the last 5 lines could be replaced by this:

        self.send_async(self.censor(line) with open(path, 'r') as file
                        for line in file)

Of course this async_chat-style model isn't the only way to write a server, but 
it is a common way to write a server, and I don't think it should be 
complicated.

----- Original Message ----
> From: Greg Ewing <greg.ewing at canterbury.ac.nz>
> To: python-ideas at python.org
> Sent: Thu, November 15, 2012 1:18:24 PM
> Subject: Re: [Python-ideas] With clauses for generator expressions
> 
> Andrew Barnert wrote:
> > 1. (foo(line) for line in baz(f) if 'bar' in line  with open('foo') as f)
> > 2. (foo(line) for line in baz(f) with open('foo')  as f if 'bar' in line)
> > 3. (foo(line) with open('foo') as f for line in  baz(f) if 'bar' in line)
> 
> Order 3 is the most consistent with existing  features. The
> only out-of-order thing about comprehensions currently is  that
> the result expression comes first instead of last. Everything
> else is  in the same order as the statement expansion.
> 
> --  Greg
> _______________________________________________
> Python-ideas mailing  list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
> 


From abarnert at yahoo.com  Fri Nov 16 10:26:03 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 01:26:03 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A5D78E.9040406@pearwood.info>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A5D78E.9040406@pearwood.info>
Message-ID: <1353057963.64085.YahooMailRC@web184706.mail.ne1.yahoo.com>

From: Steven D'Aprano <steve at pearwood.info>
Sent: Thu, November 15, 2012 10:05:36 PM


> As far as the given use-case is concerned:
> 
> upperlines =  (line.upper() for line in open('foo'))
> 
> I don't see what the concern is.  The file will remain open so long as the
> generator is not exhausted, but that  has to be the case no matter what you
> do. If the generator is thrown away  before being exhausted, the file will
> eventually be closed by the garbage  collector, if only when the application
> or script exits. For short-lived  scripts, the temporarily leakage of a file
> handle or two is hardly likely to  be a serious problem.
> 
> Presumably if you have a long-lived application  with many such opened
> files, you might risk running out of file handles when  running under Jython
> or IronPython. But I think that's a sufficiently unusual  and advanced use-
> case that I'm not worried that this is a problem that needs  solving with
> syntax instead of education.


This seems to be an argument against with statements, or any other kind of 
resource management at all besides "trust the GC". I'm pretty sure PEP 310, PEP 
340, PEP 343, and the discussion around them already had plenty of 
counter-arguments, but here's a couple quick ones: If you've opened a file for 
exclusive access (the default on Windows), you can't safely open it again if you 
can't predict when it will be closed. If the context in question is a mutex lock 
rather than a file open, you can't safely lock it again if you can't predict 
when it will be released (and, even if you never want to lock it again, you 
could end up deadlocked against another thread that does).
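
A quick sketch of the lock case (threading.Lock standing in for whatever
mutex the application actually uses):

    import threading

    lock = threading.Lock()

    def locked_items(seq):
        with lock:
            for item in seq:
                yield item

    g = locked_items(range(5))
    next(g)    # the suspended generator is now holding the lock
    # Until g is exhausted, closed, or collected, any further lock.acquire()
    # -- in this thread or another -- will block; with a non-reentrant lock,
    # this thread would deadlock itself.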


From abarnert at yahoo.com  Fri Nov 16 10:32:22 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 01:32:22 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A5C374.2000306@gmx.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<2A701905-63E7-4C50-9D92-8FB8BBC95EF2@masklinn.net>
	<50A5BD16.6040405@gmx.net>
	<CAP7+vJL8=qs80QATqNqKSAB7crc_ouugrB_3tNdtVWap67ptNw@mail.gmail.com>
	<50A5C374.2000306@gmx.net>
Message-ID: <1353058342.89958.YahooMailRC@web184705.mail.ne1.yahoo.com>

I'm pretty sure both my original message and the blog post linked from there 
explained why this is not particularly useful for list comprehensions. (If 
you're guaranteed to exhaust the iteration in the current block, which you 
obviously always are for comprehensions, just make the with a statement with its 
own block.)

The only reason I suggested it for comprehensions as well as generator 
expressions is that someone convinced me that it would be slightly easier to 
implement, and to teach to users, than if it were only available for generator 
expressions.


From: Mathias Panzenböck <grosser.meister.morti at gmx.net>
Sent: Thu, November 15, 2012 8:39:34 PM
> 
> Oh yes, you're right. Didn't think of this. Maybe I should go to bed/not write  
>comments here at 5:38am.
> 
> On 11/16/2012 05:27 AM, Guido van Rossum  wrote:
> > On Thu, Nov 15, 2012 at 8:12 PM, Mathias Panzenböck
> >  <grosser.meister.morti at gmx.net>  wrote:
> >> I think this syntax would still make sense for list  comprehensions:
> >>
> >> upperlines = [lines.upper() for line in  file with open('foo', 'r') as 
file]
> >
> > -1000. There is no  discernible advantage over
> >
> > with open(...) as file:
> >     upperlines = [lines.upper() for line in file]
> >
> > Also you've  got the order backwards -- when there's a sequence of
> > 'for' and 'if'  clauses in a comprehension, they are to be read from
> > left to right, but  here you're tacking something onto the end that's
> > supposed to go  first.
> >
> > Please don't destroy my beautiful  language.
> >
> > --Guido
> >
> >> On 11/15/2012 10:29 AM,  Masklinn wrote:
> >>>
> >>>
> >>> On 2012-11-15,  at 04:44 , Andrew Barnert wrote:
> >>>
> >>>> First, I  realize that people regularly propose with expressions. This  
is
> >>>> not the
> >>>> same  thing.
> >>>>
> >>>> The problem with the with  statement is not that it can't be postfixed
> >>>> perl-style, or  used in expressions. The problem is that it can't be used
> >>>>  with
> >>>> generator  expressions.
> >>>>
> >>>> Here's the  suggestion:
> >>>>
> >>>>       upperlines = (lines.upper() for line in file with open('foo', 'r')  
>as
> >>>> file)
> >>>>
> >>>> This would  be equivalent to:
> >>>>
> >>>>       def foo():
> >>>>          with  open('foo', 'r') as file:
> >>>>               for line in file:
> >>>>                   yield line.upper()
> >>>>       upperlines = foo()
> >>>>
> >>>> The  motivation is that there is no way to write this properly using  a
> >>>> with
> >>>> statement and a generator  expression?in fact, the only way to get this
> >>>> right  is
> >>>> with the generator function  above.
> >>>
> >>>
> >>> Actually, it's extremely  debatable that the generator function is
> >>> correct: if the  generator is not fully consumed (terminating iteration
> >>> on the  file) I'm pretty sure the file will *not* get closed save by the
> >>>  GC doing a pass on all dead objects maybe. This means this function  is
> >>> *not safe* as a lazy source to an arbitrary client, as that  client may
> >>> very well use itertools.slice or itertools.takewhile  and only partially
> >>> consume the  generator.
> >>>
> >>> Here's an  example:
> >>>
> >>> --
> >>> import  itertools
> >>>
> >>> class  Manager(object):
> >>>       def  __enter__(self):
> >>>           return  self
> >>>
> >>>       def __exit__(self,  *args):
> >>>            print("Exited")
> >>>
> >>>       def  __iter__(self):
> >>>           for i in  range(5):
> >>>               yield  i
> >>>
> >>> def foo():
> >>>        with Manager() as ms:
> >>>           for m  in ms:
> >>>               yield  m
> >>>
> >>> def bar():
> >>>        print("1")
> >>>       f = foo()
> >>>        print("2")
> >>>       # Only consume  part of the iterable
> >>>        list(itertools.islice(f, None, 2))
> >>>        print("3")
> >>>
> >>> bar()
> >>>  print("4")
> >>> --
> >>>
> >>> CPython output,  I'm impressed that the refcounting GC actually bothers
> >>> unwinding  the stack and running the __exit__ handler *once bar has
> >>>  finished executing*:
> >>>
> >>>> python3 withgen.py
> >>>
> >>>  1
> >>> 2
> >>> 3
> >>> Exited
> >>>  4
> >>>
> >>> But here's the (just as correct, as far as I  can tell) output from pypy:
> >>>
> >>>> pypy-c  withgen.py
> >>>
> >>> 1
> >>> 2
> >>>  3
> >>> 4
> >>>
> >>> If the program was long  running, it is possible that pypy would run
> >>> __exit__ when the  containing generator is released (though by no means
> >>> certain, I  don't know if this is specified at all).
> >>>
> >>> This is  in fact one of the huge issues with faking dynamic scopes via
> >>>  threadlocals and context managers (as e.g. Flask might do, I'm not  sure
> >>> what actual strategy it uses), they interact rather weirdly  with
> >>> generators (it's also why I think Python should support  actually
> >>> dynamically scoped variables, it would also fix the  thread-broken
> >>> behavior of e.g.  warnings.catch_warnings)
> 
> 
> _______________________________________________
> Python-ideas  mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
> 


From steve at pearwood.info  Fri Nov 16 10:53:18 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Fri, 16 Nov 2012 20:53:18 +1100
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1353057963.64085.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A5D78E.9040406@pearwood.info>
	<1353057963.64085.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <50A60D0E.9030204@pearwood.info>

On 16/11/12 20:26, Andrew Barnert wrote:
> From: Steven D'Aprano<steve at pearwood.info>
> Sent: Thu, November 15, 2012 10:05:36 PM
>
>
>> As far as the given use-case is concerned:
>>
>> upperlines =  (line.upper() for line in open('foo'))
>>
>> I don't see what the concern is.  The file will remain open so long as the
>> generator is not exhausted, but that  has to be the case no matter what you
>> do. If the generator is thrown away  before being exhausted, the file will
>> eventually be closed by the garbage  collector, if only when the application
>> or script exits. For short-lived  scripts, the temporarily leakage of a file
>> handle or two is hardly likely to  be a serious problem.
>>
>> Presumably if you have a long-lived application  with many such opened
>> files, you might risk running out of file handles when  running under Jython
>> or IronPython. But I think that's a sufficiently unusual  and advanced use-
>> case that I'm not worried that this is a problem that needs  solving with
>> syntax instead of education.
>
>
> This seems to be an argument against with statements, or any other kind of
> resource management at all besides "trust the GC".


Certainly not. I'm saying that for many applications, explicit resource
management is not critical -- letting the GC close the file (or whatever
resource you're working with) is a perfectly adequate strategy. The mere
existence of "faulty" gen expressions like the above example is not
necessarily a problem.

Think of it this way: you can optimize code for speed, for memory, and for
resource usage. (Memory of course being a special case of resource usage.)
You're worried about making it easy to micro-optimize generator expressions
for resource usage. I'm saying that's usually premature optimization. It's
not worth new syntax complicating generator expressions to optimize the
closing of a few files.

If your application is not one of those applications where a laissez-faire
approach to resource management is acceptable, that's fine. I'm not saying
that nobody needs care about resource management! If you need to care about
your resources with more attention than benign neglect, then do so.

The only limitation here is that you can't use a context manager in a list
comprehension or generator expression. I don't care about that. Not every
problem that requires a function needs to be solvable with lambda, and not
every problem that requires a generator needs to be solvable with a generator
expression.

The beauty of generator expressions is that they are deliberately lean. The
bar to fatten them up with more syntax is quite high, and I don't think you
have come even close to getting over it.



-- 
Steven


From storchaka at gmail.com  Fri Nov 16 11:29:17 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Fri, 16 Nov 2012 12:29:17 +0200
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
	<1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
Message-ID: <k854hu$cmf$1@ger.gmane.org>

On 16.11.12 11:09, Andrew Barnert wrote:
> With my suggested idea, the last 5 lines could be replaced by this:
> 
>          self.send_async(self.censor(line) with open(path, 'r') as file for line
> in file)

    self.send_async(self.censor(line) for line in open(path, 'r'))

or

    self.send_async(map(self.censor, open(path, 'r')))

This is *not worse* than your first example:

    self.send_async(open(path, 'r'))

How do you write a managed uncensored variant? You can use the wrapper suggested by Mathias Panzenböck.

    self.send_async(managed(open(path, 'r')))
    self.send_async(self.censor(line) for line in managed(open(path, 'r')))

It is easy, clear, universal and requires no changes to syntax.
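
For reference, the managed() wrapper being referred to (quoted in full later
in this thread) is essentially:

    def managed(sequence):
        # Enter the context manager, and leave it when iteration finishes --
        # whether the consumer exhausts the generator or closes it early.
        with sequence:
            for item in sequence:
                yield item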




From abarnert at yahoo.com  Fri Nov 16 11:32:36 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 02:32:36 -0800 (PST)
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
	attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
Message-ID: <1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>

From: Mike Meyer <mwm at mired.org>
Sent: Thu, November 15, 2012 2:29:44 AM

> If the goal is to make os.walk fast, then it might be better (on Posix systems,
> anyway) to see if it can be built on top of ftw instead of low-level directory
> scanning routines.

After a bit of experimentation, I'm not sure there actually is any significant 
improvement to be had on most POSIX systems working that way.

Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a 
stat family call) on each file, unless you ask for no stat info. A quick test on 
OS X shows that calling fts via ctypes is about 5% faster than os.walk, and 5% 
slower than find -ls or find -mtime (which will stat every file).

Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk 
with the stat calls hacked out, and 40% slower than find.

So, a "nostat" option is a potential performance improvement, but switching to 
ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.


From mwm at mired.org  Fri Nov 16 11:45:13 2012
From: mwm at mired.org (Mike Meyer)
Date: Fri, 16 Nov 2012 04:45:13 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
Message-ID: <CAD=7U2BwsWhL+tce=Hi5UHNubJd1hbWRfYraSdvTdMBaX4fsSA@mail.gmail.com>

On Fri, Nov 16, 2012 at 4:32 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
> From: Mike Meyer <mwm at mired.org>
> Sent: Thu, November 15, 2012 2:29:44 AM
>
>>If the goal is to make os.walk fast, then it might be better (on Posix systems,
>
>>anyway) to see if it can be built on top of ftw instead of low-level directory
>>scanning routines.
> After a bit of experimentation, I'm not sure there actually is any significant
> improvement to be had on most POSIX systems working that way.

I agree with that, so long as you have to stay with the os.walk
interface.

> Looking at the source from FreeBSD, OS X, and glibc, they all call stat (or a
> stat family call) on each file, unless you ask for no stat info.

Right. Either they give you *all* the stat information, or they
don't give you *any* of it. So there's no way to use it to create the
directory/other split in os.walk without doing the stat calls.

> Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk
> with the stat calls hacked out, and 40% slower than find.

That's actually a good thing to know. With FTS_NOSTAT, fts winds up
using the d_type field to figure out what's a directory (assuming you
were on a file system that has those). That's the proposed change for
os.walk, so we now have an estimate of how fast we can expect it to
be.

I'm surprised that it's slower than find. The FreeBSD version of find
uses fts_open/fts_read. Could it be that you used FTS_NOCHDIR to emulate
os.walk, whereas find doesn't?

    <mike


From mwm at mired.org  Fri Nov 16 12:03:22 2012
From: mwm at mired.org (Mike Meyer)
Date: Fri, 16 Nov 2012 05:03:22 -0600
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
	<CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>
Message-ID: <CAD=7U2A97gCo8sDegX3M70mv6vvCCCaCFQqDnYduUOT0gbbzpA@mail.gmail.com>

I'm pretty much convinced that - if the primary goal is to speed up
os.walk by leveraging the Windows calls and the existence of d_type on
some posix file systems - the proposed iterdir_stat interface is about
as good as we can do.

However, as a tool for making it easy to iterate through files in a
directory getting some/all stat information, I think it's ugly. It's
designed specifically for one system, with another common case sort of
wedged in. There's no telling how well it will handle any other
systems, but I can see that they might be problematic. Worse yet,
you wind up with stat information you can't trust, so you have to
basically write code to access multiple attributes like:

	  if st.attr1 is None:
	     st = os.stat(...)
	  func(st.attr1)
	  if st.attr2 is None:
	     st = os.stat(...)
	  func(st.attr2)

Not bad if you only want one or two values, but ugly if you want four
or more.

I can see a number of alternatives to improve this situation:

1) wrap the returned partial stat info in a proxy object that will do a
real stat if a request is made for a value that isn't there. This has
already been rejected.

2) Make iterdir_stat an os.walk internal tool, and don't export it.

3) Add some kind of "we have a full stat" indicator, so that clients
that want to use lots of attributes can just check that and do the
stat if needed.

4) Pick and document one of the a stat values as a "we have a full
stat" indicator, to use like case 3.

5) Add a keyword argument to iterdir_stat that causes it to always
just do the full stat. Actually, having three modes might be useful:
the default is None, which is the currently proposed behavior. Setting
it to True causes the full stat always be done, and setting it to
False just returns file names.

6) Depreciate os.walk, and provide os.itertree with an interface that
lets us leverage the available tools better. That's a whole other can
of worms, though.

   Thanks,
   <mike


From abarnert at yahoo.com  Fri Nov 16 12:06:43 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 03:06:43 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A60D0E.9030204@pearwood.info>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A5D78E.9040406@pearwood.info>
	<1353057963.64085.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<50A60D0E.9030204@pearwood.info>
Message-ID: <1353064003.57789.YahooMailRC@web184702.mail.ne1.yahoo.com>

From: Steven D'Aprano <steve at pearwood.info>
Sent: Fri, November 16, 2012 1:53:42 AM


> > This seems to be an argument against with  statements, or any other kind of
> > resource management at all besides  "trust the GC".
> 
> Certainly not. I'm saying that for many applications,  explicit resource
> management is not critical -- letting the GC close the file  (or whatever
> resource you're working with) -- is a perfectly adequate  strategy. The mere
> existence of "faulty" gen expressions like the above  example is not
> necessarily a problem.
> 
> Think of it this way: you can  optimize code for speed, for memory, and for
> resource usage. (Memory of  course being a special case of resource usage.)
> You're worried about making  it easy to micro-optimize generator expressions
> for resource usage.

It's not a micro-optimization, or an optimization at all. It has nothing to do 
with performance, and everything to do with making your code work at all. (Or, 
in some cases, making it robust: your code may work 99% of the time, or work with 
CPython or POSIX but not PyPy or Windows.) For example, see Google's Python 
Style Guide 
at http://google-styleguide.googlecode.com/svn/trunk/pyguide.html#Files_and_Sockets
 for why they recommend always closing files.

> The only limitation here is that you can't use a context manager in a  list
> comprehension or generator expression.

Yes, that's exactly the limitation (but only in generator expressions; in list 
comprehensions, it can't ever matter).

> The beauty of generator expressions is that they  are deliberately lean. The
> bar to fatten them up with more syntax is quite  high, and I don't think you
> have come even close to getting over  it.

This is one of those cases where it won't hurt you when you don't use it. You 
don't have to put if clauses into generator expressions, or nest multiple 
loops, and very often you don't, in which case they don't get in the way, and 
your expression is concise and simple. Similarly, you won't have to put with 
clauses into generator expressions, and very often you won't, in which case they 
won't get in the way.

And I don't think anyone would have trouble learning or understanding it. The 
expression still maps to a generator function that's just a simple tree of 
one-line nested statements with a yield statement at the bottom, the only 
difference is that instead of the two most common kinds of statements in such 
functions, you can now use the three most common.
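
For concreteness, the three-clause variant from earlier in the thread would,
under the proposal, expand in the same way as for and if clauses already do:

    # (foo(line) with open('foo') as f for line in baz(f) if 'bar' in line)
    # becomes, roughly:
    def _gen():
        with open('foo') as f:
            for line in baz(f):
                if 'bar' in line:
                    yield foo(line)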


From abarnert at yahoo.com  Fri Nov 16 13:09:49 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 04:09:49 -0800 (PST)
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
	attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2BwsWhL+tce=Hi5UHNubJd1hbWRfYraSdvTdMBaX4fsSA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
	<CAD=7U2BwsWhL+tce=Hi5UHNubJd1hbWRfYraSdvTdMBaX4fsSA@mail.gmail.com>
Message-ID: <1353067789.84594.YahooMailRC@web184703.mail.ne1.yahoo.com>

> From: Mike Meyer <mwm at mired.org>
> Sent: Fri, November 16, 2012 2:45:15 AM
> 
> On Fri, Nov 16, 2012 at 4:32 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
> > From:  Mike Meyer <mwm at mired.org>
> > Sent: Thu, November  15, 2012 2:29:44 AM
> >
> > Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than
> > os.walk with the stat calls hacked out, and 40% slower than find.
> 
> That's actually a good thing to know. With FTS_NOSTAT, fts winds  up
> using the d_type field to figure out what's a directory (assuming  you
> were on a file system that has those). That's the proposed change  for
> os.walk, so we now have an estimate of how fast we can expect it  to
> be.

I'm not sure I'd put too much confidence in the 3x difference as generally 
applicable to POSIX. Apple uses FreeBSD's fts unmodified, even though in a quick 
browse of the source I saw at least one case where a trivial change would have made a 
difference (the link count check that's only used with ufs/nfs/nfs4/ext2fs would 
also work on hfs+). Also, OS X with HFS+ drives does some bizarre stuff with 
disk caching, especially with an SSD (which in itself probably changes the 
performance characteristics).

But I'd guess it's somewhere in the right ballpark, and if anything it'll 
probably be even more improvement on FreeBSD and linux than on OS X.

> I'm surprised that it's slower than find. The FreeBSD version  of find
> uses fts_open/fts_read. Could it be that used FTS_NOCHDIR to  emulate
> os.walk, whereas find doesn't?


No, just FTS_PHYSICAL (with or without FTS_NOSTAT).

It looks like more than half of the difference is due to 
print(ent.fts_path.decode('utf8')) in Python vs. puts(entry->fts_path) in find 
(based on removing the print entirely). I don't think it's worth the effort to 
investigate further; let's get the 3x faster before we worry about the last 40%. 
But if you want to, the source I used is at https://github.com/abarnert/py-fts


From de.rouck.robrecht at gmail.com  Fri Nov 16 13:28:46 2012
From: de.rouck.robrecht at gmail.com (Robrecht De Rouck)
Date: Fri, 16 Nov 2012 13:28:46 +0100
Subject: [Python-ideas] Uniquify attribute for lists
Message-ID: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>

Hello,

I just wanted to bring to your attention that an *attribute for removing
duplicate elements* for lists would be a nice feature.

def uniquify(lis):
    seen = set()
    seen_add = seen.add
    return [x for x in lis if x not in seen and not seen_add(x)]

The code is from this
post<http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order>.
Also
check out this performance
comparison<http://www.peterbe.com/plog/uniqifiers-benchmark> of
uniquifying snippets.
It would be useful to have a uniquify attribute for containers in general.

Best regards, Robrecht
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/39a4f942/attachment.html>

From p.f.moore at gmail.com  Fri Nov 16 13:39:48 2012
From: p.f.moore at gmail.com (Paul Moore)
Date: Fri, 16 Nov 2012 12:39:48 +0000
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
Message-ID: <CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>

On 16 November 2012 12:28, Robrecht De Rouck <de.rouck.robrecht at gmail.com>wrote:

> Hello,
>
> I just wanted to bring to your attention that an *attribute for removing
> duplicate elements* for lists would be a nice feature.
>
> *def uniquify(lis):
>     seen = set()
>     seen_add = seen.add
>     return [ x for x in lis if x not in seen and not seen_add(x)]*
> *
> *
> The code is from this post<http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order>. Also
> check out this performance comparison<http://www.peterbe.com/plog/uniqifiers-benchmark> of
> uniquifying snippets.
> It would be useful to have a uniquify attribute for containers in general.
>


list(set(ls))

Paul
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/f2e2a356/attachment.html>

From fuzzyman at gmail.com  Fri Nov 16 14:17:56 2012
From: fuzzyman at gmail.com (Michael Foord)
Date: Fri, 16 Nov 2012 13:17:56 +0000
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>
Message-ID: <CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>

On 16 November 2012 12:39, Paul Moore <p.f.moore at gmail.com> wrote:

> On 16 November 2012 12:28, Robrecht De Rouck <de.rouck.robrecht at gmail.com>wrote:
>
>> Hello,
>>
>> I just wanted to bring to your attention that an *attribute for removing
>> duplicate elements* for lists would be a nice feature.
>>
>> *def uniquify(lis):
>>     seen = set()
>>     seen_add = seen.add
>>     return [ x for x in lis if x not in seen and not seen_add(x)]*
>> *
>> *
>> The code is from this post<http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order>. Also
>> check out this performance comparison<http://www.peterbe.com/plog/uniqifiers-benchmark> of
>> uniquifying snippets.
>> It would be useful to have a uniquify attribute for containers in
>> general.
>>
>
>
> list(set(ls))
>

This loses order. Both solutions suffer from the problem that they only
work with hashable objects.

Michael


>
> Paul
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
>


-- 

http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/dac620fa/attachment.html>

From masklinn at masklinn.net  Fri Nov 16 14:49:12 2012
From: masklinn at masklinn.net (Masklinn)
Date: Fri, 16 Nov 2012 14:49:12 +0100
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>
	<CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>
Message-ID: <B26B6B40-6ADF-4AF9-B51C-933B23E1FD54@masklinn.net>

On 2012-11-16, at 14:17 , Michael Foord wrote:

> On 16 November 2012 12:39, Paul Moore <p.f.moore at gmail.com> wrote:
> 
>> On 16 November 2012 12:28, Robrecht De Rouck <de.rouck.robrecht at gmail.com>wrote:
>> 
>>> Hello,
>>> 
>>> I just wanted to bring to your attention that an *attribute for removing
>>> duplicate elements* for lists would be a nice feature.
>>> 
>>> *def uniquify(lis):
>>>    seen = set()
>>>    seen_add = seen.add
>>>    return [ x for x in lis if x not in seen and not seen_add(x)]*
>>> *
>>> *
>>> The code is from this post<http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order>. Also
>>> check out this performance comparison<http://www.peterbe.com/plog/uniqifiers-benchmark> of
>>> uniquifying snippets.
>>> It would be useful to have a uniquify attribute for containers in
>>> general.
>>> 
>> 
>> 
>> list(set(ls))
>> 
> 
> This loses order. Both solutions suffer from the problem that they only
> work with hashable objects.

Though in both cases they also have the advantage that they work in
(roughly) O(n), whereas an eq-based uniquifier (such as Haskell's
nub/nubBy) works in O(n^2).
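
For comparison, an order-preserving, equality-based uniquifier along the
lines of nub -- O(n^2), but it copes with unhashable items -- might look
like:

    def nub(iterable):
        # Quadratic, but order-preserving and usable on unhashable elements.
        seen = []
        for x in iterable:
            if x not in seen:
                seen.append(x)
                yield x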

From storchaka at gmail.com  Fri Nov 16 15:37:54 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Fri, 16 Nov 2012 16:37:54 +0200
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>
	<CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>
Message-ID: <k85j43$f64$1@ger.gmane.org>

On 16.11.12 15:17, Michael Foord wrote:
> On 16 November 2012 12:39, Paul Moore
> <p.f.moore at gmail.com
> <mailto:p.f.moore at gmail.com>> wrote:

>     list(set(ls))
>
>
> This loses order.

list(collections.OrderedDict.fromkeys(ls))




From ncoghlan at gmail.com  Fri Nov 16 16:53:14 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 17 Nov 2012 01:53:14 +1000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>
Message-ID: <CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>

On Fri, Nov 16, 2012 at 10:46 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> However, I realised there's a more serious problem with your idea: the
> outermost clause in a list comprehension or generator expression is
> evaluated immediately and passed as an argument to the inner scope that
> implements the loop, so you have an unresolved sequencing problem between
> the evaluation of that argument and the evaluation of the context manager.
> If you want the context manager inside the generator, you *can't* reference
> the name bound in the as clause in the outermost iterable.
>

(Andrew's reply here dropped the list from the cc, but I figure my
subsequent clarification is worth sharing more widely)

When you write a genexp like this:

    gen = (x for x in get_seq())

The expansion is *NOT* this:

    def _g():
        for x in get_seq():
            yield x

    gen = _g()

Instead, it is actually:

    def _g(iterable):
        for x in iterable:
            yield x

    gen = _g(get_seq())

That is, the outermost iterable is evaluated in the *current* scope, not
inside the generator. Thus, the entire proposal is rendered incoherent, as
there is no way for the context manager expression to be executed both
*before* the outermost iterable expression and *inside* the generator
function, since the generator doesn't get called until *after* the
outermost iterable expression has already been evaluated. (And, to stave off
the obvious question, no, this order of evaluation is *not* negotiable, as
changing it would be a huge backwards compatibility breach, as well as
leading to a lot more obscure errors with generator expressions)
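
A small illustration of the consequence (runnable as-is):

    # The outermost iterable is evaluated immediately, so the error shows up
    # at creation time here:
    try:
        gen = (x for x in 1)
    except TypeError:
        print("outermost clause evaluated eagerly")

    # ...whereas an error in an inner clause only surfaces on iteration:
    gen = (x for y in [1] for x in y)   # no error yet
    try:
        next(gen)
    except TypeError:
        print("inner clauses evaluated lazily")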

The reason PEP 403 is potentially relevant is because it lets you write a
one-shot generator function using the long form and still make it clear
that it *is* a one shot operation that creates the generator-iterator
directly, without exposing the generator function itself:

    @in gen = g()
    def g():
        for x in get_seq():
            yield x

Or, going back to the use case in the original post:

    @in upperlines = f()
    def f():
        with open('foo', 'r') as file:
            for line in file:
                yield line.upper()


Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/1ec34a7f/attachment.html>

From jstpierre at mecheye.net  Fri Nov 16 19:22:00 2012
From: jstpierre at mecheye.net (Jasper St. Pierre)
Date: Fri, 16 Nov 2012 13:22:00 -0500
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <k85j43$f64$1@ger.gmane.org>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<CACac1F-dKMFx4BtrorKc4LBwj1nt+TBiaYRnOffbKCNWex_AJw@mail.gmail.com>
	<CAKCKLWyotck7AHef-Yi6ieP9mLGyisEXewv5XmD2PonTVDJ35A@mail.gmail.com>
	<k85j43$f64$1@ger.gmane.org>
Message-ID: <CAA0H+QRhTm9orBcOJHvonAveo6gRYdCQ3SiS-APUsW-TgaJjjA@mail.gmail.com>

As long as we're giving terrible code suggestions:

[a for a in L if a not in locals("_[0]")]


On Fri, Nov 16, 2012 at 9:37 AM, Serhiy Storchaka <storchaka at gmail.com>wrote:

> On 16.11.12 15:17, Michael Foord wrote:
>
>> On 16 November 2012 12:39, Paul Moore
>> <p.f.moore at gmail.com
>> <mailto:p.f.moore at gmail.com>> wrote:
>>
>
>      list(set(ls))
>>
>>
>> This loses order.
>>
>
> list(collections.OrderedDict.**fromkeys(ls))
>
>
>
> ______________________________**_________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/**mailman/listinfo/python-ideas<http://mail.python.org/mailman/listinfo/python-ideas>
>



-- 
  Jasper
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/b5a773c4/attachment.html>

From steve at pearwood.info  Fri Nov 16 19:21:24 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Sat, 17 Nov 2012 05:21:24 +1100
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
Message-ID: <50A68424.9000509@pearwood.info>

On 16/11/12 23:28, Robrecht De Rouck wrote:
> Hello,
>
> I just wanted to bring to your attention that an *attribute for removing
> duplicate elements* for lists would be a nice feature.
>
> *def uniquify(lis):
>      seen = set()
>      seen_add = seen.add
>      return [ x for x in lis if x not in seen and not seen_add(x)]*

That won't work for a general purpose function, because lists can hold
unhashable items, and sets require hashable.

Here's an old recipe predating sets that solves the problem in a number
of different ways. Read the comments for versions that don't lose order.



-- 
Steven


From abarnert at yahoo.com  Fri Nov 16 19:55:43 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 10:55:43 -0800
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>
	<CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>
Message-ID: <1294F7BB-58C9-4476-BC3A-EAB4330097BF@yahoo.com>

Ah, you're right. The only way this would work is if the with clause were second or later, which would be very uncommon. And the fact that it doesn't work in the most common case means that, even if it were occasionally useful, it would cause a lot more confusion than benefit.

So, never mind...

Sent from my iPhone

On Nov 16, 2012, at 7:53, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On Fri, Nov 16, 2012 at 10:46 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> However, I realised there's a more serious problem with your idea: the outermost clause in a list comprehension or generator expression is evaluated immediately and passed as an argument to the inner scope that implements the loop, so you have an unresolved sequencing problem between the evaluation of that argument and the evaluation of the context manager. If you want the context manager inside the generator, you *can't* reference the name bound in the as clause in the outermost iterable.
> 
> (Andrew's reply here dropped the list from the cc, but I figure my subsequent clarification is worth sharing more widely) 
> 
> When you write a genexp like this:
> 
>     gen = (x for x in get_seq())
> 
> The expansion is *NOT* this:
> 
>     def _g():
>         for x in get_seq():
>             yield x
> 
>     gen = _g()
> 
> Instead, it is actually:
> 
>     def _g(iterable):
>         for x in iterable:
>             yield x
> 
>     gen = _g(get_seq())
> 
> That is, the outermost iterable is evaluated in the *current* scope, not inside the generator. Thus, the entire proposal is rendered incoherent, as there is no way for the context manager expression to be executed both *before* the outermost iterable expression and *inside* the generator function, since the generator doesn't get called until *after* the outermost iterable expression has already been evaluated. (And, to stave of the obvious question, no this order of evaluation is *not* negotiable, as changing it would be a huge backwards compatibility breach, as well as leading to a lot more obscure errors with generator expressions)
> 
> The reason PEP 403 is potentially relevant is because it lets you write a one-shot generator function using the long form and still make it clear that it *is* a one shot operation that creates the generator-iterator directly, without exposing the generator function itself:
> 
>     @in gen = g()
>     def g():
>         for x in get_seq():
>             yield x
> 
> Or, going back to the use case in the original post:
> 
>     @in upperlines = f()
>     def f():
>         with open('foo', 'r') as file:
>             for line in file:
>                 yield line.upper()
> 
> 
> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/bc684235/attachment.html>

From abarnert at yahoo.com  Fri Nov 16 20:02:49 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 11:02:49 -0800
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A5C20E.4050604@gmx.net>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<50A5C20E.4050604@gmx.net>
Message-ID: <56FA138C-D75C-4F4A-9D3B-E5FBBC727186@yahoo.com>

I missed this the first time through among all the other alternative suggestions:

Sent from my iPhone

On Nov 15, 2012, at 20:33, Mathias Panzenböck <grosser.meister.morti at gmx.net> wrote:
> 
> For now one can do this, which is functional equivalent but adds the overhead of another generator:
> 
>   def managed(sequence):
>       with sequence:
>           for item in sequence:
>               yield item
> 
>   upperlines = (lines.upper() for line in managed(open('foo', 'r')))


I think this ought to be in itertools in the standard library.

I don't think the extra overhead will be a problem most of the time.

It solves at least the simplest cases for when a with clause would be useful, and it's even a better solution for some cases where you'd write a with statement today.

In some cases you'd have to write things like managed(closing(foo)), but in those cases you probably wouldn't have wanted the with clause, either.


From abarnert at yahoo.com  Fri Nov 16 20:12:43 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 11:12:43 -0800
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
Message-ID: <F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>

If I understand this right, the problem you want to solve is that there is no obvious way to uniquify lists that's order preserving and efficient, so you want a good implementation to be added as an attribute of the list type. Right?

As others have pointed out, your implementation only works for lists with hashable elements, so no lists of lists, for example.

Also, making it an attribute of list means you can't use it on, say, a tuple, or a dict key iterator, or a file. Why restrict it like that? I'd much rather have an itertools.uniquify(seq) than a list method. (If I'm just misreading your use of the word "attribute", I apologize.)

And, once it's a separate function rather than a member of list, why do you want it to return a list rather than a generator?

All that being said, if getting this right is difficult enough that a bunch of people working together on a blog over 6 years didn't come up with a good version that supports non-hashable elements, maybe a good implementation does belong in the standard library itertools.

Sent from my iPhone

On Nov 16, 2012, at 4:28, Robrecht De Rouck <de.rouck.robrecht at gmail.com> wrote:

> Hello,
> 
> I just wanted to bring to your attention that an attribute for removing duplicate elements for lists would be a nice feature. 
> 
> def uniquify(lis):
>     seen = set()
>     seen_add = seen.add
>     return [ x for x in lis if x not in seen and not seen_add(x)]
> 
> The code is from this post. Also check out this performance comparison of uniquifying snippets.
> It would be useful to have a uniquify attribute for containers in general. 
> 
> Best regards, Robrecht
> 
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121116/a3af38a6/attachment.html>

From abarnert at yahoo.com  Fri Nov 16 21:14:17 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 12:14:17 -0800 (PST)
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>
Message-ID: <1353096857.53493.YahooMailRC@web184701.mail.ne1.yahoo.com>

From: Andrew Barnert <abarnert at yahoo.com>
Sent: Fri, November 16, 2012 11:15:18 AM

>All that being said, if getting this right is difficult enough that a bunch of 
>people working together on a blog over 6 years didn't come up with a good 
>version that supports non-hashable elements, maybe a good implementation does 
>belong in the standard library itertools.

Actually, it looks like it's already there. The existing unique_everseen 
function in http://docs.python.org/3/library/itertools.html#itertools-recipes 
(also available from the more-itertools PyPI module at 
http://packages.python.org/more-itertools/api.html#more_itertools.unique_everseen)
 is an improvement on this idea.

So, unless someone has done performance tests showing that the suggested 
implementation is significantly faster than unique_everseen (I suppose the 
"__contains__" vs. "in" might make a difference?), and this is a critical 
bottleneck for your app, I think the right way to write this function is:

    uniquify = more_itertools.unique_everseen

Unfortunately, it's still not going to work on non-hashable elements. Maybe 
itertools (either the module or the documentation recipe list) needs a version 
that does?
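
Something along these lines, perhaps -- a sketch based on the
unique_everseen recipe that falls back to a (slower) list scan when the key
is unhashable:

    def unique_everseen_any(iterable, key=None):
        seen_set = set()     # hashable keys
        seen_list = []       # unhashable keys, linear scan
        for element in iterable:
            k = element if key is None else key(element)
            try:
                if k in seen_set:
                    continue
                seen_set.add(k)
            except TypeError:        # k is unhashable
                if k in seen_list:
                    continue
                seen_list.append(k)
            yield element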


From zuo at chopin.edu.pl  Fri Nov 16 22:59:34 2012
From: zuo at chopin.edu.pl (Jan Kaliszewski)
Date: Fri, 16 Nov 2012 22:59:34 +0100
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>
Message-ID: <23ad1c9b272e06d1eaabb2e722e9f3b7@chopin.edu.pl>

Here is my quick'n'dirty implementation of different variants 
(with/without key, preserving/not preserving order, accepting 
hashable-only/any items):

http://wklej.org/id/872623/

Timeit-ed on my machine:

$ python3.3 iteruniq.py
test1_nokey_noorder_hashonly [0.08257626800332218, 0.08304202905856073, 
0.08718552498612553]
test2_nokey_noorder_universal [2.48601198696997, 2.4620621589710936, 
2.453364996938035]
test3_nokey_withorder_hashonly [0.3661507030483335, 0.3646505419164896, 
0.36500189593061805]
test4_nokey_withorder_universal [7.532308181049302, 7.397191203082912, 
7.316833758028224]
test5_withkey_noorder_hashonly [0.9567891559563577, 0.9690931889927015, 
0.9598639439791441]
test6_withkey_noorder_universal [3.142076837946661, 3.144917198107578, 
3.150129645015113]
test7_withkey_withorder_hashonly [0.9493958179373294, 
0.9514245060272515, 0.9517305289627984]
test8_withkey_withorder_universal [10.233501984039322, 
10.404869885998778, 10.786898656049743]

Cheers.
*j



From greg.ewing at canterbury.ac.nz  Fri Nov 16 23:45:11 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 17 Nov 2012 11:45:11 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAD=7U2A97gCo8sDegX3M70mv6vvCCCaCFQqDnYduUOT0gbbzpA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
	<CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>
	<CAD=7U2A97gCo8sDegX3M70mv6vvCCCaCFQqDnYduUOT0gbbzpA@mail.gmail.com>
Message-ID: <50A6C1F7.9060801@canterbury.ac.nz>

Mike Meyer wrote:
> I can see a number of alternatives to improve this situation:
> 
> 1) wrap the return partial stat info in a proxy object
> 2) Make iterdir_stat an os.walk internal tool, and don't export it.
> 3) Add some kind of "we have a full stat" indicator,
> 4) document one of the a stat values as a "we have a full stat" indicator,
> 5) Add a keyword argument to ... always do the full stat.
> 6) Depreciate os.walk, and provide os.itertree

7) Provide an iterdir() with a way of specifying exactly
which stat fields you're interested in. Then it can perform
stat calls if and only if needed, and the user doesn't have
to tediously test for presence of things in the result.
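
A rough pure-Python sketch of that interface (illustrative only -- a real
implementation would fill in whatever readdir()/FindFirstFile already
provides and call stat only for the remaining fields):

    import os

    def iterdir(path, fields=()):
        # Yields (name, info) where info maps each requested stat field
        # to its value; with no fields requested, no stat call is made.
        for name in os.listdir(path):
            if not fields:
                yield name, {}
                continue
            st = os.stat(os.path.join(path, name))
            yield name, {f: getattr(st, f, None) for f in fields}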

-- 
Greg


From abarnert at yahoo.com  Sat Nov 17 00:25:31 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 15:25:31 -0800 (PST)
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <23ad1c9b272e06d1eaabb2e722e9f3b7@chopin.edu.pl>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<F723AF5E-961C-401E-B551-3D0DCC5B7F40@yahoo.com>
	<23ad1c9b272e06d1eaabb2e722e9f3b7@chopin.edu.pl>
Message-ID: <1353108331.50013.YahooMailRC@web184702.mail.ne1.yahoo.com>

Comparing your test3 and test7 to the equivalent calls with the itertools 
recipe, it's about 32% slower in 2.7 and 46% slower in 3.3. But the added 
flexibility might easily make up for the cost; it certainly does if you need it.

More interestingly, with this:

    if hashable_only is None:
        try:
            return iteruniq(iterable, key, preserve_order, True)
        except TypeError:
            return iteruniq(iterable, key, preserve_order, False)

... hashable_only=None is only 8% slower than hashable_only=False when you have 
non-hashables, and 91% faster when you don't. (And trying unique_everseen 
instead of iteruniq if hashable_only is None and preserve_order makes that 7% 
and 94%.)

The current unique_everseen is already by far the longest recipe on the 
itertools docs page, but it still might be worth updating with a synthesized 
best-of-all-options version, or actually adding to itertools instead of leaving 
as a recipe.

----- Original Message ----

> From: Jan Kaliszewski <zuo at chopin.edu.pl>
> To: python-ideas at python.org
> Sent: Fri, November 16, 2012 2:00:06 PM
> Subject: Re: [Python-ideas] Uniquify attribute for lists
> 
> Here is my quick'n'dirty implementation of different variants (with/without 
>key,  preserving/not preserving order, accepting hashable-only/any  items):
> 
> http://wklej.org/id/872623/
> 
> Timeit-ed on my  machine:
> 
> $ python3.3 iteruniq.py
> test1_nokey_noorder_hashonly  [0.08257626800332218, 0.08304202905856073,  
>0.08718552498612553]
> test2_nokey_noorder_universal [2.48601198696997,  2.4620621589710936, 
>2.453364996938035]
> test3_nokey_withorder_hashonly  [0.3661507030483335, 0.3646505419164896,  
>0.36500189593061805]
> test4_nokey_withorder_universal [7.532308181049302,  7.397191203082912, 
>7.316833758028224]
> test5_withkey_noorder_hashonly  [0.9567891559563577, 0.9690931889927015,  
>0.9598639439791441]
> test6_withkey_noorder_universal [3.142076837946661,  3.144917198107578, 
>3.150129645015113]
> test7_withkey_withorder_hashonly  [0.9493958179373294, 0.9514245060272515,  
>0.9517305289627984]
> test8_withkey_withorder_universal [10.233501984039322,  10.404869885998778,  
>10.786898656049743]
> 
> Cheers.
> *j
> 
> _______________________________________________
> Python-ideas  mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
> 


From greg.ewing at canterbury.ac.nz  Sat Nov 17 00:28:34 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 17 Nov 2012 12:28:34 +1300
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>
	<CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>
Message-ID: <50A6CC22.90506@canterbury.ac.nz>

Nick Coghlan wrote:
> That is, the outermost iterable is evaluated in the *current* scope, not 
> inside the generator.

I've always felt it was a bad idea to bake this kludge into
the language. It sweeps a certain class of problems under the
rug, but only in *some* cases. For example, in

    ((x, y) for x in foo for y in blarg)

rebinding of foo is guarded against, but not blarg. And if
that's not arbitrary enough, in the otherwise completely
equivalent

    ((x, y) for y in blarg for x in foo)

it's the other way around.

Anyhow, it wouldn't be *impossible* to incorporate a with-clause
into this scheme. Given

    (upper(line) with open(name) as f for line in f)

you either pick open(name) to be the pre-evaluated expression,
or do no pre-evaluation at all in that case. Either way,
it can't break any *existing* code, because nobody is writing
genexps containing with-clauses yet.
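
For comparison, the effect of that example can already be written today
as an explicit generator function; a minimal sketch, not part of the
proposal:

def upper_lines(name):
    # The file is opened lazily (on the first next()) and closed when
    # the generator finishes or is closed explicitly.
    with open(name) as f:
        for line in f:
            yield line.upper()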

-- 
Greg


From guido at python.org  Sat Nov 17 00:50:45 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 16 Nov 2012 15:50:45 -0800
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <50A6CC22.90506@canterbury.ac.nz>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dYXhZKbQgMuMr1XBMRPUhL5MZ3Nd0rYFazr1G9cGbB2A@mail.gmail.com>
	<1352993114.99957.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<CADiSq7dKE_7Z6s2dcRG5XHjdTvxELLYJyB8=VeRATuOvqv2o9w@mail.gmail.com>
	<CADiSq7f84F3mC+1B8OjVydut=C+fev0Qqrj_SXZZh+HRjR7sXg@mail.gmail.com>
	<50A6CC22.90506@canterbury.ac.nz>
Message-ID: <CAP7+vJLv4qapTmQ+ZjymP_042BOW6TbH-xJ9Tj6uypWEr3SPgQ@mail.gmail.com>

On Fri, Nov 16, 2012 at 3:28 PM, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Nick Coghlan wrote:
>>
>> That is, the outermost iterable is evaluated in the *current* scope, not
>> inside the generator.
>
>
> I've always felt it was a bad idea to bake this kludge into
> the language. It sweeps a certain class of problems under the
> rug, but only in *some* cases. For example, in
>
>    ((x, y) for x in foo for y in blarg)
>
> rebinding of foo is guarded against, but not blarg. And if
> that's not arbitrary enough, in the otherwise completely
> equivalent
>
>    ((x, y) for y in blarg for x in foo)
>
> it's the other way around.

I wouldn't call it arbitrary -- the second and following clauses
*must* be re-evaluated because they may reference the loop variable of
the first. And the two versions you show aren't equivalent unless
iterating over blarg and foo is completely side-effect-free.

> Anyhow, it wouldn't be *impossible* to incorporate a with-clause
> into this scheme. Given
>
>    (upper(line) with open(name) as f for line in f)
>
> you either pick open(name) to be the pre-evaluated expression,
> or not do any pre-evaluation at all in that case. Either way,
> it can't break any *existing* code, because nobody is writing
> genexps containing with-clauses yet.

And nobody ever will. It's too ugly.

-- 
--Guido van Rossum (python.org/~guido)


From tjreedy at udel.edu  Sat Nov 17 01:00:19 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Fri, 16 Nov 2012 19:00:19 -0500
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
	<1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
Message-ID: <k86k2m$non$1@ger.gmane.org>

On 11/16/2012 4:09 AM, Andrew Barnert wrote:
> So far, nearly everyone is discussing things which are tangential, or arguing
> that one of the optional variants is bad. So let me strip down the proposal,
> without any options in it, and expand on a use case. The syntax is:
>
>
>      (foo(line) with open('bar') as f for line in baz(f))

OK, that's helpful. Now let me strip down my objection to this: your 
proposal is conceptually wrong because it mixes two distinct and 
different ideas -- collection definition and context management. It 
conflicts with a well-defined notion of long standing.

To explain: in math, one can define a set explicitly by displaying the 
members or implicitly as a subset based on one or more base sets. 
Using one version of the notation
{0, 2, 4} == {2*i| i in N; i < 3}
The latter is 'set-builder notation' or a 'set comprehension' (and would 
usually use the epsilon-like member symbol instead of 'in'). The idea 
goes back at least a century.
https://en.wikipedia.org/wiki/Set-builder_notation

In Python, the latter directly translates to
   {2*i for i in itertools.count() if i < 3} ==
   {i for i in range(0, 5, 2)}
(Python does not require the base collection to match the result class.)
Another pair of examples:
   {(i,j)| i in N, j in N; i+j <= 5}
   {(i,j) for i in count() for j in count() if i+j <= 5}

Similar usage in programming goes back over half a century.
https://en.wikipedia.org/wiki/List_comprehension
While notation in both math and CS varies, the components are always 
input source collection variables, conditions or predicates, and an 
output expression.

The Python reference manual documents comprehensions as an alternate 
atomic display form. In Chapter 6, Expressions, Section 2, Atoms,

"For constructing a list, a set or a dictionary Python provides special 
syntax called "displays", each of them in two flavors:
either the container contents are listed explicitly, or
they are computed via a set of looping and filtering instructions, 
called a comprehension.
...
list_display ::=  "[" [expression_list | comprehension] "]"
<etc>"
A generator expression similarly represents an untyped abstract 
sequence, rather than a concrete class.
---

In summary: A context-manager, as an object with __enter__ and __exit__ 
methods, is not a proper component of a comprehension. For instance, 
replace "open('xxx')" in your proposal with a lock creation function. On 
the other hand, an iterable managed resource, as suggested by Mathias 
Panzenböck, works fine as a source. So it does work (as you noticed also).

-- 
Terry Jan Reedy




From steve at pearwood.info  Sat Nov 17 02:12:31 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Sat, 17 Nov 2012 12:12:31 +1100
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <50A68424.9000509@pearwood.info>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info>
Message-ID: <50A6E47F.3030304@pearwood.info>

On 17/11/12 05:21, Steven D'Aprano wrote:

> Here's an old recipe predating sets that solves the problem in a number
> of different ways. Read the comments for versions that don't lose order.

And it would help if I actually included the URL. Sorry about that.


http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/




-- 
Steven


From abarnert at yahoo.com  Sat Nov 17 04:17:59 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Fri, 16 Nov 2012 19:17:59 -0800 (PST)
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <50A6E47F.3030304@pearwood.info>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info> <50A6E47F.3030304@pearwood.info>
Message-ID: <1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>

I created a fork of more-itertools, with unique_everseen modified to work with 
non-hashable elements. It starts off assuming everything is hashable, and using 
a set; if it gets a TypeError, it converts the set to a list and carries on. 
Very simple, and it's within 5% of the speed of the original in the best case.

From a quick glance at your recipe, it looks like you had the same idea long 
ago.

Meanwhile, I've been thinking that you don't really have to fall all the way 
back to a list if things aren't hashable; there are plenty of intermediate 
steps:

There are plenty of types that aren't generally hashable, but you can be sure 
that your algorithm won't mutate them, and you can devise a hash function that 
guarantees the hashable requirement. For some easy examples, hash mutable 
sequences as (type(x), tuple(x)), mutable sets as (type(x), frozenset(x)), 
mutable mappings as (type(x), tuple(dict(x).items())), mutable buffers as 
(type(x), bytes(x)), etc. Or, if you have a bunch of your own mutable classes, 
maybe add a "fakehash" method to them and use that.

As another option, a set of pickle.dumps(x) would work for many types, and it's 
still O(N), although with a huge multiplier, so it's not worth it 
unless len(sequence) >> avg(len(element)). Also, it's not guaranteed that x==y 
implies dumps(x)==dumps(y), so you'd need to restrict it to types for which this 
is known to be true.

There are plenty of types that are not hashable, but are fully ordered, and 
(type(x), x) is fully ordered as long as all of the types are, so in such cases 
you can use a sorted collection (blist.sortedset?) and get O(N log N) time.

Of course none of these works for all types, so you'd still have to fall back to 
linear searching through a list in some cases.

At any rate, I don't think any of these alternatives needs to be added to a 
general-purpose uniquifier, but they should all be doable if your use case 
requires better performance for, e.g., a list of lists or a generator of mutable 
class objects or a huge collection of quickly-picklable objects.
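
For reference, a minimal sketch of that set-to-list fallback (an
illustration only, not the actual more-itertools fork):

def unique_everseen(iterable, key=None):
    # Assume hashable keys and track them in a set; on the first
    # TypeError switch the "seen" container to a list and carry on
    # with linear membership tests.
    seen = set()
    hashable = True
    for element in iterable:
        k = element if key is None else key(element)
        if hashable:
            try:
                if k in seen:
                    continue
                seen.add(k)
            except TypeError:
                seen = list(seen)
                hashable = False
        if not hashable:
            if k in seen:
                continue
            seen.append(k)
        yield element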


----- Original Message ----
> From: Steven D'Aprano <steve at pearwood.info>
> To: python-ideas at python.org
> Sent: Fri, November 16, 2012 5:18:02 PM
> Subject: Re: [Python-ideas] Uniquify attribute for lists
> 
> On 17/11/12 05:21, Steven D'Aprano wrote:
> 
> > Here's an old recipe  predating sets that solves the problem in a number
> > of different ways.  Read the comments for versions that don't lose order.
> 
> And it would help  if I actually included the URL. Sorry about  that.
> 
> 
> http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
> 
> 
> 
> 
> --  Steven
> _______________________________________________
> Python-ideas  mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
> 


From stefan at drees.name  Sat Nov 17 08:23:32 2012
From: stefan at drees.name (Stefan Drees)
Date: Sat, 17 Nov 2012 08:23:32 +0100
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <50A6C1F7.9060801@canterbury.ac.nz>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
	<CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>
	<CAD=7U2A97gCo8sDegX3M70mv6vvCCCaCFQqDnYduUOT0gbbzpA@mail.gmail.com>
	<50A6C1F7.9060801@canterbury.ac.nz>
Message-ID: <50A73B74.7050003@drees.name>

Greg Ewing suggested:
> Mike Meyer wrote:
>> I can see a number of alternatives to improve this situation:
>>
>> 1) wrap the return partial stat info in a proxy object
>> 2) Make iterdir_stat an os.walk internal tool, and don't export it.
>> 3) Add some kind of "we have a full stat" indicator,
>> 4) document one of the a stat values as a "we have a full stat"
>> indicator,
>> 5) Add a keyword argument to ... always do the full stat.
>> 6) Depreciate os.walk, and provide os.itertree
>
> 7) Provide an iterdir() with a way of specifying exactly
> which stat fields you're interested in. Then it can perform
> stat calls if and only if needed, and the user doesn't have
> to tediously test for presence of things in the result.
>

+1 for following that seventh path. It offers an additional benefit for 
the library code: the constraints of the backend functionality in use 
become clearer to handle. If the attribute values are requested and 
available, although expensive to obtain, yielding them nevertheless is 
then a valid strategy.

All the best,
Stefan.


From chris at kateandchris.net  Sat Nov 17 19:11:09 2012
From: chris at kateandchris.net (Chris Lambacher)
Date: Sat, 17 Nov 2012 13:11:09 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCHuxvNnT-JEkWHF_ZNLPMDsRHywxmkmuK39u6G4ju7XZA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CADiSq7dUo_p+1af8Bs=Znhq5Zmr657JP=z_O6uzLC-yAWptqiA@mail.gmail.com>
	<CAL9jXCHuxvNnT-JEkWHF_ZNLPMDsRHywxmkmuK39u6G4ju7XZA@mail.gmail.com>
Message-ID: <CAAXXHg+J5k82dsEYRZSu=tbqo=JqMah-qZT4ri+qunLNEYSuUg@mail.gmail.com>

On Mon, Nov 12, 2012 at 3:55 PM, Ben Hoyt <benhoyt at gmail.com> wrote:

> Yes, those are good points. I'll see about making a "betterwalk" or similar
> module and releasing on PyPI.
>

You should call it "speedwalk" ;)

-Chris

-- 
Christopher Lambacher
chris at kateandchris.net
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/de788be9/attachment.html>

From joshua.landau.ws at gmail.com  Sat Nov 17 20:37:38 2012
From: joshua.landau.ws at gmail.com (Joshua Landau)
Date: Sat, 17 Nov 2012 19:37:38 +0000
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info> <50A6E47F.3030304@pearwood.info>
	<1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>
Message-ID: <CAN1F8qWD0watc8d=QsV4dgYxr2U7Wss3dyT4RJ4a4Yocz2UWPw@mail.gmail.com>

Surely the best choice is to have *two* caches: one for hashables and
another for the rest.

This might be improvable with a *third* cache if some non-hashables had
total ordering, but that could also introduce bugs I think. It'd also be a
lot harder and likely be slower anyway.
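
For concreteness, a rough sketch of the two-cache idea (just an
illustration, not the code in the attached iteruniq.py):

def iteruniq(iterable, key=None):
    # Hashable keys go in a set, unhashable ones in a list, so a few
    # unhashable items don't force linear lookups for everything else.
    seen_set = set()
    seen_list = []
    for element in iterable:
        k = element if key is None else key(element)
        try:
            if k in seen_set:
                continue
            seen_set.add(k)
        except TypeError:
            if k in seen_list:
                continue
            seen_list.append(k)
        yield element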

Timings (approximate and rough):
"""
Using 16 elements

% Hashable:    100     90     66     33      0
Original:     1.47   1.00   1.06   1.09   1.01
w/ key=str:   3.73   1.91   2.12   3.15   3.00
New:          1.20   1.46   1.81   2.13   3.38
w/ key=str:   1.72   2.00   2.48   2.76   3.01

Using 64 elements

% Hashable:    100     90     66     33      0
Original:     1.15   1.29   1.61   1.64   1.43
w/ key=str:   1.98   2.50   3.09   3.55   3.99
New:          1.00   1.47   2.18   3.01   3.60
w/ key=str:   1.87   2.30   2.79   3.41   3.84

Using 256 elements

% Hashable:    100     90     66     33      0
Original:     2.70   3.66   5.19   5.34   4.41
w/ key=str:   4.06   5.07   6.26   6.93   6.98
New:          1.00   1.65   2.92   5.28   7.62
w/ key=str:   2.28   2.71   3.76   4.36   4.93

Using 1024 elements

% Hashable:    100     90     66     33      0
Original:     9.30   12.4   18.8   21.4   16.9
w/ key=str:   11.1   13.1   16.3   17.5   13.9
New:          1.00   1.84   6.20   13.1   19.8
w/ key=str:   2.31   2.79   3.59   4.50   5.16

Using 4096 elements

% Hashable:    100     90     66     33      0
Original:     33.7   44.3   69.1   79.4   60.5
w/ key=str:   36.7   44.2   59.3   60.1   40.4
New:          1.00   3.73   18.1   42.2   63.7
w/ key=str:   2.23   2.56   3.33   4.19   4.93

Using 16384 elements

% Hashable:    100     90     66     33      0
Original:     126.   173.   265.   313.   243.
w/ key=str:   136.   164.   215.   213.   147.
New:          1.00   12.5   68.6   173.   263.
w/ key=str:   2.24   2.60   3.28   4.14   4.80
"""

--------------

Code attached, unless I forget ;).
No guarantees that it still works the same way, or works at all, or is the
right file.

Every item is repeated 5 times on average for any length of the list being
unique-ified. I'll try it with this changed later.

Basically, the new version is faster on all but ~100% non-hashable lists
when there are more than ~1000 elements, and on more-hashable lists it's
quadratically faster. When slower, it's by only about 10% to 20%. Of
course, specifying whether or not your list is fully-hashable would be more
efficient (probably 10%) but that's not as nice to use.

Additionally, if you use key="" to uniquify by hashable values you are able
to get a good speed up with the new version.

Example of use:
iteruniq(<list of set>, key=frozenset)

With really small non-hashable lists, the original is significantly better
(3x).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/569601d9/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: iteruniq.py
Type: application/octet-stream
Size: 6268 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/569601d9/attachment.obj>

From joshua.landau.ws at gmail.com  Sat Nov 17 21:11:57 2012
From: joshua.landau.ws at gmail.com (Joshua Landau)
Date: Sat, 17 Nov 2012 20:11:57 +0000
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k86k2m$non$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
	<1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<k86k2m$non$1@ger.gmane.org>
Message-ID: <CAN1F8qWvsxXNLTNBRgtbE63uw_Btr95WQ_NWYBLD+4xw3ZL=Lw@mail.gmail.com>

On 17 November 2012 00:00, Terry Reedy <tjreedy at udel.edu> wrote:

> On 11/16/2012 4:09 AM, Andrew Barnert wrote:
>
>> So far, nearly everyone is discussing things which are tangential, or
>> arguing
>> that one of the optional variants is bad. So let me strip down the
>> proposal,
>> without any options in it, and expand on a use case. The syntax is:
>>
>>
>>      (foo(line) with open('bar') as f for line in baz(f))
>>
>
> OK, that's helpful. Now let me strip down my objection to this: your
> proposal is conceptually wrong because it mixes two distinct and different
> ideas -- collection definition and context management. It conflicts with a
> well-defined notion of long standing.
>
> To explain: in math, one can define a set explicitly by displaying the
> members or implicitly as a subset of based on one or more base sets. Using
> one version of the notation
> {0, 2, 4} == {2*i| i in N; i < 3}
> The latter is 'set-builder notation' or a 'set comprehension' (and would
> usually use the epsilon-like member symbol instead of 'in'). The idea goes
> back at least a century.
> https://en.wikipedia.org/wiki/**Set-builder_notation<https://en.wikipedia.org/wiki/Set-builder_notation>
>
> In Python, the latter directly translates to
>   {2*i for i in itertools.count() if i < 3} ==
>   {i for i in range(0, 5, 2)}
> (Python does not require the base collection to match the result class.)
> Another pair of examples:
>   {(i,j)| i in N, j in N; i+j <= 5}
>   {(i,j) for i in count() for j in count if i+j <= 5}
>
> Similar usage in programming go back over half a century.
> https://en.wikipedia.org/wiki/**List_comprehension<https://en.wikipedia.org/wiki/List_comprehension>
> While notation in both math and CS varies, the components are always input
> source collection variables, conditions or predicates, and an output
> expression.
>
> The Python reference manual documents comprehensions as an alternate
> atomic display form. In Chapter 6, Expressions, Section 2, Atoms,
>
> "For constructing a list, a set or a dictionary Python provides special
> syntax called ?displays?, each of them in two flavors:
> either the container contents are listed explicitly, or
> they are computed via a set of looping and filtering instructions, called
> a comprehension.
> ...
> list_display ::=  "[" [expression_list | comprehension] "]"
> <etc>"
> A generator expression similarly represents an untyped abstract sequence,
> rather than a concrete class.
> ---
>
> In summary: A context-manager, as an object with __enter__ and __exit__
> methods, is not a proper component of a comprehension. For instance,
> replace "open('xxx')" in your proposal with a lock creation function. On
> the other hand, an iterable managed resource, as suggested by Mathias
> Panzenb?ck, works fine as a source. So it does work (as you noticed also).


I don't follow how you made these two leaps:
* It doesn't apply to set comprehensions in *math* -> it doesn't apply to
set comprehensions in *Python*
* it doesn't apply to *set* comprehensions in Python -> it doesn't apply to
*any* comprehensions in Python
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/ef8aea71/attachment.html>

From storchaka at gmail.com  Sat Nov 17 22:18:29 2012
From: storchaka at gmail.com (Serhiy Storchaka)
Date: Sat, 17 Nov 2012 23:18:29 +0200
Subject: [Python-ideas] Fast pickle
Message-ID: <k88uv6$loh$1@ger.gmane.org>

I have an idea which can increase pickle/unpickle performance. This requires a change of format, so we need a new version of the protocol. May be it will be a 4 (PEP 3154) or 5.

All items should be aligned to 4 bytes. This allows fast reading/writing of small integers, and memcpy and the UTF-8 codec should be faster on aligned data.

In order not to waste space, a byte code should be combined with the data or with the size of the data.

For integers:

<code> <24-bit integer>
<code> <24-bit size> <size-byte integer> <padding>
<code> <56-bit size> <size-byte integer> <padding>

For strings:

<code> <8-bit size> <size-byte string> <padding>
<code> <24-bit size> <size-byte string> <padding>
<code> <56-bit size> <size-byte string> <padding>

For collections:

<code> <24-bit size> <item1> <item2> ... <item #size>
<code> <56-bit size> <item1> <item2> ... <item #size>

For references:
 
<code> <24-bit index>
<code> <56-bit index>

For 1- and 2-byte integers, this can be expensive. We can add a special code for grouping. It will be even shorter than in the old protocols.

<group code> <item code> <16-bit count> <integer1> <integer2> ... <integer #count> <padding>
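
For concreteness, a minimal sketch of how one such record could be packed
(the exact codes and layout above are the proposal, not an existing
protocol):

def pack_small_int(code, value):
    # One code byte (assumed to fit in 8 bits) plus a 24-bit
    # little-endian payload: a 4-byte, naturally aligned record.
    assert 0 <= value < (1 << 24)
    return bytes([code]) + value.to_bytes(3, 'little')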

What do you think about this?




From solipsis at pitrou.net  Sat Nov 17 22:36:13 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 17 Nov 2012 22:36:13 +0100
Subject: [Python-ideas] Fast pickle
References: <k88uv6$loh$1@ger.gmane.org>
Message-ID: <20121117223613.10838e97@pitrou.net>


Hello,

On Sat, 17 Nov 2012 23:18:29 +0200
Serhiy Storchaka <storchaka at gmail.com>
wrote:
> I have an idea which can increase pickle/unpickle performance. This requires a change of format, so we need a new version of the protocol. May be it will be a 4 (PEP 3154) or 5.
> 
> All items should be aligned to 4 bytes. It allow fast reading/writing of small integers, memcpy and UTF-8 codec should be faster on aligned data.

I can see several drawbacks here:
- you will still have to support unaligned data (because of memoryview
  slices)
- it may not be significantly faster, because actual creation of
  objects also contributes to unpickling performance
- there will be less code sharing between different protocol versions,
  making pickle harder to maintain

If you think this is worthwhile, I think you should first draft a
proof of concept to evaluate the performance gain.

Regards

Antoine.




From tjreedy at udel.edu  Sun Nov 18 20:55:32 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Sun, 18 Nov 2012 14:55:32 -0500
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <CAN1F8qWvsxXNLTNBRgtbE63uw_Btr95WQ_NWYBLD+4xw3ZL=Lw@mail.gmail.com>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
	<1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<k86k2m$non$1@ger.gmane.org>
	<CAN1F8qWvsxXNLTNBRgtbE63uw_Btr95WQ_NWYBLD+4xw3ZL=Lw@mail.gmail.com>
Message-ID: <k8befn$7ba$1@ger.gmane.org>

On 11/17/2012 3:11 PM, Joshua Landau wrote:
> On 17 November 2012 00:00, Terry Reedy

>     OK, that's helpful. Now let me strip down my objection to this: your
>     proposal is conceptually wrong because it mixes two distinct and
>     different ideas -- collection definition and context management. It
>     conflicts with a well-defined notion of long standing.
>
>     To explain: in math, one can define a set explicitly by displaying
>     the members or implicitly as a subset of based on one or more base
>     sets. Using one version of the notation
>     {0, 2, 4} == {2*i| i in N; i < 3}
>     The latter is 'set-builder notation' or a 'set comprehension' (and
>     would usually use the epsilon-like member symbol instead of 'in').
>     The idea goes back at least a century.
>     https://en.wikipedia.org/wiki/__Set-builder_notation
>     <https://en.wikipedia.org/wiki/Set-builder_notation>
>
>     In Python, the latter directly translates to
>        {2*i for i in itertools.count() if i < 3} ==
>        {i for i in range(0, 5, 2)}
>     (Python does not require the base collection to match the result class.)
>     Another pair of examples:
>        {(i,j)| i in N, j in N; i+j <= 5}
>        {(i,j) for i in count() for j in count if i+j <= 5}
>
>     Similar usage in programming go back over half a century.
>     https://en.wikipedia.org/wiki/__List_comprehension
>     <https://en.wikipedia.org/wiki/List_comprehension>
>     While notation in both math and CS varies, the components are always
>     input source collection variables, conditions or predicates, and an
>     output expression.
>
>     The Python reference manual documents comprehensions as an alternate
>     atomic display form. In Chapter 6, Expressions, Section 2, Atoms,
>
>     "For constructing a list, a set or a dictionary Python provides
>     special syntax called ?displays?, each of them in two flavors:
>     either the container contents are listed explicitly, or
>     they are computed via a set of looping and filtering instructions,
>     called a comprehension.
>     ...
>     list_display ::=  "[" [expression_list | comprehension] "]"
>     <etc>"
>     A generator expression similarly represents an untyped abstract
>     sequence, rather than a concrete class.
>     ---
>
>     In summary: A context-manager, as an object with __enter__ and
>     __exit__ methods, is not a proper component of a comprehension. For
>     instance, replace "open('xxx')" in your proposal with a lock
>     creation function. On the other hand, an iterable managed resource,
>     as suggested by Mathias Panzenb?ck, works fine as a source. So it
>     does work (as you noticed also).
>
>
> I don't follow how you made these two leaps:
> * It doesn't apply to set comprehensions in *math* -> it doesn't apply
> to set comprehensions in *Python*
> * it doesn't apply to *set* comprehensions in Python -> it doesn't apply
> to *any* comprehensions in Python

Since the OP withdrew his suggestion, its a moot point. However, I 
talked about the general, coherent concept of comprehensions, as used in 
both math and CS, as an alternative to explicit listing. Do look at the 
references, including the Python manual. It presents the general idea 
and implementation first and then the four specific versions. I only 
used sets for an example.

-- 
Terry Jan Reedy




From benhoyt at gmail.com  Sun Nov 18 21:52:39 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 19 Nov 2012 09:52:39 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
Message-ID: <CAL9jXCFypRhY9ux=a-=f8CU89xA+udpJma8Gyw9r8c88d2ZVSA@mail.gmail.com>

> Passing FTS_NOSTAT to fts is about 3x faster, but only 8% faster than os.walk
> with the stat calls hacked out, and 40% slower than find.

Relatedly, I've just finished a proof-of-concept version of
iterdir_stat() for Linux using readdir_r and ctypes, and it was only
about 10% faster than the existing os.walk on large directories. I was
surprised by this, given that I saw a 400% speedup removing the
stat()s on Windows, but I guess it means that stat() and/or system
calls in general are *much* faster or better cached on Linux.

Still, it's definitely worth the huge speedup on Windows, and I think
it's the right thing to use the dirent d_type info on Linux, even
though the speed gain is small -- it's still faster, and it still
saves all those os.stat()s. Also, I'm doing this via ctypes in pure
Python, so doing it in C may give another small boost especially for
the Linux version.

If anyone wants to test what speeds they're getting on Linux or
Windows, or critique my proof of concept, please try it at
https://github.com/benhoyt/betterwalk -- just run "python benchmark.py
[directory]" on a large directory. Note this is only a proof of
concept at this stage, not hardened code!

> So, a "nostat" option is a potential performance improvement, but switching to
> ftw/nftw/fts, with or without the nostat flag, doesn't seem to be worth it.

Agreed. Also, this is beyond the scope of my initial suggestion.

-Ben


From benhoyt at gmail.com  Sun Nov 18 22:00:56 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 19 Nov 2012 10:00:56 +1300
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <50A73B74.7050003@drees.name>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<CAL9jXCHFrtey3onf4dwUtkR1AmXeRwfy8OdK+NHjrAFehxqxWg@mail.gmail.com>
	<CAD=7U2A08BEPThUs8jmJDkyDOeLOybN-1BBVjgp+cSoMOzf+Ng@mail.gmail.com>
	<CAL9jXCH7L7gggmzYTaunp6hfYsLzT5wq2=k1Z0-04DYzdtAewg@mail.gmail.com>
	<CAD=7U2A97gCo8sDegX3M70mv6vvCCCaCFQqDnYduUOT0gbbzpA@mail.gmail.com>
	<50A6C1F7.9060801@canterbury.ac.nz> <50A73B74.7050003@drees.name>
Message-ID: <CAL9jXCFYeOU3Kj39=MREWv3dKNLn2Z0aZidC=34CEGGyyDvVsg@mail.gmail.com>

>>> 1) wrap the return partial stat info in a proxy object
>>> 2) Make iterdir_stat an os.walk internal tool, and don't export it.
>>> 3) Add some kind of "we have a full stat" indicator,
>>> 4) document one of the a stat values as a "we have a full stat"
>>> indicator,
>>> 5) Add a keyword argument to ... always do the full stat.
>>> 6) Depreciate os.walk, and provide os.itertree

I don't love most of these solutions as they seem to complicate
things. I'd be happy with 2) if it came to it -- but I think it's a
very useful tool, especially on Windows, because it means Windows
users would have "pure Python access" to FindFirst/FindNext and a very
good speed improvement for os.walk.

>> 7) Provide an iterdir() with a way of specifying exactly
>> which stat fields you're interested in. Then it can perform
>> stat calls if and only if needed, and the user doesn't have
>> to tediously test for presence of things in the result.
>>
>
> +1 for following that seventh path. It offers the additional benefit for the
> library code, that constraints of the backend functionality used are more
> clearer to handle: If requested and available allthough expensive, "yield
> nevertheless the attribute values" is then a valid strategy.

Ah, that's an interesting idea. It's nice and explicit. I'd go for
either the status quo I've proposed or this one. Though only thing I'd
want to push for is a clean API. Any suggestions? Mine would be
something like:

for filename, st in iterdir_stat(path, stat_fields=['st_mode', 'st_size']):
    ...

However, you might also need a 'd_type' field option, because for
os.walk(), for example, you don't actually need all of st_mode, just the type info.
This is odd (that most of the fields are named "st_*" but one
"d_type") but not terrible.

-Ben


From random832 at fastmail.us  Mon Nov 19 02:23:27 2012
From: random832 at fastmail.us (Random832)
Date: Sun, 18 Nov 2012 20:23:27 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCGOWZVCu5G6V34ER_GYY5wv1dPw30L=PMuTHC-=s_kHAw@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CAL9jXCFaT6tbT49yH5tRDbW09vyj4dPSO9c5q71aeFghhj9+ZA@mail.gmail.com>
	<1353015922.23812.140661154202993.035FA1E6@webmail.messagingengine.com>
	<CAL9jXCGOWZVCu5G6V34ER_GYY5wv1dPw30L=PMuTHC-=s_kHAw@mail.gmail.com>
Message-ID: <50A98A0F.7010604@fastmail.us>

On 11/15/2012 4:50 PM, Ben Hoyt wrote:
> 1) You've have to add a whole new way / set of constants / functions
> to test for the different values of d_type. Whereas there's already
> stuff (stat module) to test for st_mode values.
>
> 2) It'd make the typical use case more complex, for example, the
> straight "if st.st_mode is None ... else ..." I gave earlier becomes
> this:
>
> for filename, st in iterdir_stat(path):
>       if st.d_type is None:
>            if st.st_mode is None:
>                 st = os.stat(os.path.join(path, filename))
>            is_dir = stat.S_ISDIR(st.st_mode)
>       else:
>            is_dir = st.d_type == DT_DIR
>
> -Ben
I actually meant adding d_type *everywhere*...

if st.d_type is None:
     st = os.stat(os.path.join(path, filename))
is_dir = st.d_type == DT_DIR

Of course, your code would ultimately be more complex anyway since when 
followlinks=True you want to use isdir, and when it's not you want to 
use lstat. And consider what information iterdir_stat actually returns 
when the results are symlinks: if it's readdir/d_type, it's going to say 
"it's a symlink" and you need another call to follow the link; if it's 
WIN32_FIND_DATA you have the information for both in principle, but the 
stat structure can only hold one. (Do we need an iterdir_lstat? If so, 
should iterdir_stat return None if d_type is DT_LNK, or DT_LNK?)

...and ultimately deprecating the S_IS___ stuff. It made sense in the 
1970s, when there were real savings in packing your 4 bits of type 
information and your 12 bits of permission information into a single 
16-bit field; now it's just a historical artifact whose only apparent 
justification is a vague "make os feel like C on Unix" principle.


From random832 at fastmail.us  Mon Nov 19 02:32:44 2012
From: random832 at fastmail.us (Random832)
Date: Sun, 18 Nov 2012 20:32:44 -0500
Subject: [Python-ideas] Speed up os.walk() 5x to 9x by using file
 attributes from FindFirst/NextFile() and readdir()
In-Reply-To: <CAL9jXCFypRhY9ux=a-=f8CU89xA+udpJma8Gyw9r8c88d2ZVSA@mail.gmail.com>
References: <CAL9jXCFGVGhzT46TRiD85w1LCS83XLcE+DN07+OW9SQtWqor3A@mail.gmail.com>
	<CADiSq7dRrWhA0hN_KPPiTnyzovQjYMnpk9gEUYeGmTCXLGdTfg@mail.gmail.com>
	<CAL9jXCG6MKyXjQbVGOL4CTOqtcnkEFYS8ZRouN-bntD7y2BfvA@mail.gmail.com>
	<CA+OGgf6WrJFpCH=MNmCkj-QAmoDPZpEigXEBYUR6+gycyxjE+w@mail.gmail.com>
	<CAD=7U2ChxJz9XtGCOJVwubV6tLSvdZNJuYOg+DVvj8QBkc-x5w@mail.gmail.com>
	<CA+OGgf7oRfeeKqBcOgz=QML8tz9oe-uPiZDOR7v=WJQ4uB4MSQ@mail.gmail.com>
	<CAD=7U2Aw1oOC98VYA-XU8PJ=+St3L4sx=7-LPiCutR5+-pBxFg@mail.gmail.com>
	<CA+OGgf7an_1-ebWs=5cCEaXZhcKRLUCe-rpkFYbgdmQMe4MyEw@mail.gmail.com>
	<CAD=7U2AmS=Lz+mBu684v4AwrCeTYy-itxaAny41VGPRboALjOQ@mail.gmail.com>
	<CAL9jXCGAEJVrXKKgZOyA=sxj823vnETHfbK9kj_XVBr9Ut0+WQ@mail.gmail.com>
	<CAD=7U2DBfCMVTz2eGbq1sgftRj7LG4iAcyPpxquX5vxNQ+Pjwg@mail.gmail.com>
	<1353061956.13722.YahooMailRC@web184703.mail.ne1.yahoo.com>
	<CAL9jXCFypRhY9ux=a-=f8CU89xA+udpJma8Gyw9r8c88d2ZVSA@mail.gmail.com>
Message-ID: <50A98C3C.9020807@fastmail.us>

On 11/18/2012 3:52 PM, Ben Hoyt wrote:
> Still, it's definitely worth the huge speedup on Windows
It occurs to me that we need to be careful in defining what we can 
actually get on Linux or Windows.

Linux readdir can return the file type, but it's always going to be 
DT_LNK for symlinks, requiring an extra stat call if followlinks=True.

Windows functions return both "is a symlink" and "is a directory" in a 
single call (ntpath.isdir / nt._isdir takes advantage of this, and the 
same information is present in the Find functions' data) but not other 
information about the symlink target beyond whether it is a directory. 
Neither Python's existing stat function nor anything proposed here has a 
way of representing this, but it would be useful for os.walk to be able 
to get this information.


From mwm at mired.org  Mon Nov 19 04:32:20 2012
From: mwm at mired.org (Mike Meyer)
Date: Sun, 18 Nov 2012 21:32:20 -0600
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <50A967BF.1030106@fastmail.us>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us> <50A43665.4010406@pearwood.info>
	<1353015788.22979.140661154200249.5F8D101F@webmail.messagingengine.com>
	<CAD=7U2BbudYcuJpmnY7PAw8cSOj7Y-SvFCEATDhvNOmc3_wkbQ@mail.gmail.com>
	<50A967BF.1030106@fastmail.us>
Message-ID: <b959e67d-35a5-4ba3-9262-7f632f315c78@email.android.com>



Random832 <random832 at fastmail.us> wrote:

>On 11/15/2012 6:18 PM, Mike Meyer wrote:
>> It's obviously true. The kernel (or shell, as the case may be) 
>> interprets the shebang line to find the executable an pick out the 
>> arguments to pass to the executable. The executable (Python) then 
>> interprets the arguments, without ever having seen the shebang line.
>Right. It **interprets the arguments**. In this case, the single 
>argument is "-E -s", there's no reason it couldn't or shouldn't treat 
>that the same way as "-E" "-s".

I've got three reasons to not do that:

  Special cases aren't special enough to break the rules.
  Errors should never pass silently.
  In the face of ambiguity, refuse the temptation to guess.

If you knew you were handling a shebang line, it might be different. But if a user went out of their way to pass those arguments on the command line, you want to be very careful about undoing what they did.

I think providing single-letter variants for options (which is why this was pointed out) is a better solution than guessing at what the user wanted and silently suppressing an error, just to handle the special case of not being able to pass multiple arguments on a shebang line (on some OSes?).
-- 
Sent from my Android tablet with K-9 Mail. Please excuse my swyping.


From abarnert at yahoo.com  Mon Nov 19 05:57:15 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Sun, 18 Nov 2012 20:57:15 -0800 (PST)
Subject: [Python-ideas] With clauses for generator expressions
In-Reply-To: <k8befn$7ba$1@ger.gmane.org>
References: <1352951084.92039.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<20121115073555.GA7582@odell.Belkin>
	<1352977867.77252.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50A55C06.3060802@canterbury.ac.nz>
	<1353056964.61039.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<k86k2m$non$1@ger.gmane.org>
	<CAN1F8qWvsxXNLTNBRgtbE63uw_Btr95WQ_NWYBLD+4xw3ZL=Lw@mail.gmail.com>
	<k8befn$7ba$1@ger.gmane.org>
Message-ID: <1353301035.98686.YahooMailRC@web184705.mail.ne1.yahoo.com>

From: Terry Reedy <tjreedy at udel.edu>
Sent: Sun, November 18, 2012 11:56:04 AM


> On 11/17/2012 3:11 PM, Joshua Landau wrote:
> > On 17 November 2012 00:00,  Terry Reedy
> 
> >     OK, that's helpful. Now let me strip down  my objection to this: your
> >     proposal is conceptually wrong  because it mixes two distinct and
> >     different ideas --  collection definition and context management. It
> >     conflicts  with a well-defined notion of long standing.


> > I don't follow  how you made these two leaps:
> > * It doesn't apply to set comprehensions  in *math* -> it doesn't apply
> > to set comprehensions in  *Python*
> > * it doesn't apply to *set* comprehensions in Python -> it  doesn't apply
> > to *any* comprehensions in Python
> 
> Since the OP  withdrew his suggestion, its a moot point. 

I agree that it is a moot point. The idea would require a larger semantic change 
than I initially anticipated, and I disagree with Greg Ewing that the immediate 
evaluation of the outer source is a kluge that should be abandoned, so I've 
withdrawn it. (Of course if Greg Ewing or Joshua Landau or anyone else wants to 
pick up the idea, I apologize for presuming, but I no longer think it's a good 
idea.)

That's why I ignored the point about set builder notation. But if you want to 
continue to argue it:

> However, I 
> talked about the  general, coherent concept of comprehensions, as used in 
> both math and CS, as  an alternative to explicit listing. Do look at the 
> references, including the  Python manual. It presents the general idea 
> and implementation first and  then the four specific versions. I only 
> used sets for an example.


Nested comprehensions already break the analogy with set builder notation. For 
one thing, nobody would define the rationals as {i/j | j in Z: j != 0 | i in Z}. 
People would probably figure out what you meant, but you wouldn't write it that 
way. Nested comprehensions (even more so when one is dependent on the other) 
make it blatant that a comprehension is actually an iterative sequence builder, 
not a declarative set builder.

The analogy is a loose one, and it already leaks. It really only holds when 
you've got a single, well-ordered, finite source. It's obvious that (i/j for j 
in itertools.count(2) for i in range(1, j)) generates the rationals in (0, 1), 
in a specific order (with repeats), but you wouldn't write anything remotely 
similar in set builder notation. In fact, you'd probably define that set just as 
{q | i, j in N+: qj=i, q<1}, and you can't translate that to Python at all.


From random832 at fastmail.us  Mon Nov 19 07:05:34 2012
From: random832 at fastmail.us (Random832)
Date: Mon, 19 Nov 2012 01:05:34 -0500
Subject: [Python-ideas] CLI option for isolated mode
In-Reply-To: <b959e67d-35a5-4ba3-9262-7f632f315c78@email.android.com>
References: <509C2E9D.3080707@python.org> <509CBC78.4040602@egenix.com>
	<509D2EF0.8010209@python.org>
	<CAF-Rda8CbrGnQw7qhr_jWfak7Jt6ATER2pNrLCEnx4y0Lv-Zug@mail.gmail.com>
	<20121114165732.69dcd274@resist.wooz.org>
	<50A42FD5.6050405@fastmail.us> <50A43665.4010406@pearwood.info>
	<1353015788.22979.140661154200249.5F8D101F@webmail.messagingengine.com>
	<CAD=7U2BbudYcuJpmnY7PAw8cSOj7Y-SvFCEATDhvNOmc3_wkbQ@mail.gmail.com>
	<50A967BF.1030106@fastmail.us>
	<b959e67d-35a5-4ba3-9262-7f632f315c78@email.android.com>
Message-ID: <50A9CC2E.4050207@fastmail.us>

On 11/18/2012 10:32 PM, Mike Meyer wrote:
> I've got three reasons to not do that:
>
>    Special cases aren't special enough to break the rules.
>    Errors should never pass silently.
>    In the face of ambiguity, refuse the temptation to guess.
Except that Python uses normal getopt-style parsing for its own arguments.

Simply declare this to be the meaning of the previously undefined 
'space' option flag. Nothing special, no error once it's defined, and 
not ambiguous.
> If you knew you were handling a shebang line, it might be different. But if a user went out of their way to pass those arguments on the command line, you want to be very careful about undoing what they did.
Undoing what? We're not talking about an option to a script, we're 
talking about an option ('\x20') to the python interpreter that has no 
other meaning. And it's no longer an error once it is defined.

The only other possible meaning of an argument beginning with '-' and 
containing a space is e.g. '-m mod' using the (undocumented, I might 
add) syntax of following the option directly with its argument to load a 
module called '\x20mod' (can module names even begin with a space?)

Alternately, we could introduce a special syntax to allow additional 
interpreter options to be set within the file.

I also note that the -W, -Q, -m, and -c options already violate the 
principle of requiring all options to be able to be specified in a 
single argument to the interpreter, as would be required to allow all 
combinations of options to be able to be specified on a shebang line on 
such a system.

The more compelling case for not using a long option is simply that 
python does not use any other long options, not any logic about saying 
that all combinations of options must be able to be specified on a 
shebang line.


From a.kruis at science-computing.de  Mon Nov 19 18:24:08 2012
From: a.kruis at science-computing.de (Anselm Kruis)
Date: Mon, 19 Nov 2012 18:24:08 +0100
Subject: [Python-ideas] thread safe dictionary initialisation from mappings:
	dict(non_dict_mapping)
Message-ID: <50AA6B38.4060903@science-computing.de>

Hello,

I found the following annoying behaviour of dict(non_dict_mapping) and 
dict.update(non_dict_mapping), if non_dict_mapping implements 
collections.abc.Mapping but is not an instance of dict. In this case the 
implementations of dict() and dict.update() use PyDict_Merge(PyObject 
*a, PyObject *b, int override).

The essential part of PyDict_Merge(a,b, override) is

# update dict a with the content of mapping b.
keys = b.keys()
for key in keys:
    ...
    a[key] = b.__getitem__(key)

This algorithm is susceptible to race conditions, if a second thread 
modifies the source mapping b between "b.keys()" and b.__getitem__(key):
- If the second thread deletes an item from b, PyDict_Merge fails with a
KeyError exception.
- If the second thread inserts a new value and then modifies an existing 
value, a contains the modified value but not the new value.

Of course the current behaviour is the best you can get with a "minimum 
mapping interface".


My Goal
-------
I want to be able to implement a mapping so that "dict(mapping)" works 
in an "atomic" way.


Requirements for a solution
---------------------------
- it does not modify the behaviour of existing code
- the performance must be similar to the current implementation
- simple to implement
- sufficiently generic


Proposal
--------
My idea is to define a new optional method "__update_other__(setter)" 
for mappings. A minimal implementation of this method looks like:

def __update_other__(self, setter):
   for key in self.keys():
      setter(key, self[key])

Now it is possible to extend PyDict_Merge(PyObject *a, PyObject *b, int 
override) to check, if the source mapping b implements 
"__update_other__()". If it does, PyDict_Merge simply calls 
b.__update_other__. Otherwise it falls back to the current 
implementation using keys() and __getitem__().

Pseudo code for PyDict_Merge(PyObject *a, PyObject *b, int override)

if hasattr(b, "__update_other__") and callable(b.__update_other__):
   if override:
     b.__update_other__(a.__setitem__)
   else:
     b.__update_other__(a.setdefault)
   return
# old code follows
keys = b.keys()
for key in keys:
   a[key] = b.__getitem__(key)


Example
-------
A typical implementation of __update_other__ for a mapping with 
concurrent access looks like:

def __update_other__(self, setter):
   # perform an appropriate locking
   with self._lock:
     for key in self._keys:
       setter(key, self._values[key])
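
A pure-Python emulation of the proposed dispatch, only to illustrate the
idea (the real change would live in PyDict_Merge):

def merge(a, b, override=True):
    # Use the new hook if the source mapping provides it; otherwise fall
    # back to the current keys()/__getitem__ based copy.
    update_other = getattr(b, "__update_other__", None)
    if callable(update_other):
        update_other(a.__setitem__ if override else a.setdefault)
        return
    for key in b.keys():
        if override or key not in a:
            a[key] = b[key]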


Note 1:
The __update_other__(self, setter) interface hides the nature of the 
caller. It is also possible to retrieve only the keys or only the values 
using suitable setter arguments.

Note 2:
In order to preserve the behaviour of existing code, the method 
__update_other__() must not be added to existing mapping implementations 
(collections.abc.Mapping, UserDict, ...). But adding support for 
__update_other__ to __init__() and update() of existing implementations 
is OK.

Note 3:
If we add add a test for __update_other__ to PyDict_Merge in front of 
the "PyDict_Check(b)" test this would also solve the issues described in 
http://bugs.python.org/issue1615701. A sub-class of dict could implement 
__update_other__ to prevent PyDict_Merge from accessing its underlying 
dict implementation.

Note 4: Alternative approach
Obviously it is possible to write "dict(non_dict_mapping.items())" (or 
even "dict(non_dict_mapping.iteritems())", if iteritems() is implemented 
in a suitable way), but constructing the list of items is an expensive 
operation. Additionally it is fairly common and pythonic to write 
"dict(mapping)" to get a copy of mapping.


Anselm

-- 
  Dipl. Phys. Anselm Kruis                       science + computing ag
  Senior Solution Architect                      Ingolstädter Str. 22
  email A.Kruis at science-computing.de             80807 München, Germany
  phone +49 89 356386 874  fax 737               www.science-computing.de



From tjreedy at udel.edu  Mon Nov 19 22:09:57 2012
From: tjreedy at udel.edu (Terry Reedy)
Date: Mon, 19 Nov 2012 16:09:57 -0500
Subject: [Python-ideas] thread safe dictionary initialisation from
	mappings: dict(non_dict_mapping)
In-Reply-To: <50AA6B38.4060903@science-computing.de>
References: <50AA6B38.4060903@science-computing.de>
Message-ID: <k8e77a$js$1@ger.gmane.org>

On 11/19/2012 12:24 PM, Anselm Kruis wrote:
> Hello,
>
> I found the following annoying behaviour of dict(non_dict_mapping) and
> dict.update(non_dict_mapping), if non_dict_mapping implements
> collections.abc.Mapping but is not an instance of dict. In this case the
> implementations of dict() and dict.update() use PyDict_Merge(PyObject
> *a, PyObject *b, int override).
>
> The essential part of PyDict_Merge(a,b, override) is
>
> # update dict a with the content of mapping b.
> keys = b.keys()
> for key in keys:
>     ...
>     a[key] = b.__getitem__(key)
>
> This algorithm is susceptible to race conditions, if a second thread
> modifies the source mapping b between "b.keys()" and b.__getitem__(key):
> - If the second thread deletes an item from b, PyDict_Merge fails with a
> KeyError exception.
> - If the second thread inserts a new value and then modifies an existing
> value, a contains the modified value but not the new value.

It is well-known that mutating a collection while iterating over it can 
lead to unexpected or undesired behavior, including exceptions. This is 
not limited to updating a dict from a non-dict source. The generic answer 
is Don't Do That.

> Of course the current behaviour is the best you can get with a "minimum
> mapping interface".

To me, if you know that the source in d.update(source) is managed (and 
mutated) in another thread, the obvious solution (to Not Do That) is to 
lock the source. This should work for any source and for any similar 
operation. What am I missing?

Instead, you propose to add a specialized, convoluted method that only 
works for updates of dicts by non_dict_mappings that happen to have a 
new and very specialized magic method that automatically does the lock. 
Sorry, I don't see the point. It is not at all a generic solution to a 
generic problem.

-- 
Terry Jan Reedy



From a.kruis at science-computing.de  Tue Nov 20 10:33:53 2012
From: a.kruis at science-computing.de (Anselm Kruis)
Date: Tue, 20 Nov 2012 10:33:53 +0100
Subject: [Python-ideas] thread safe dictionary initialisation from
 mappings: dict(non_dict_mapping)
In-Reply-To: <k8e77a$js$1@ger.gmane.org>
References: <50AA6B38.4060903@science-computing.de> <k8e77a$js$1@ger.gmane.org>
Message-ID: <50AB4E81.4030705@science-computing.de>

On 19.11.2012 22:09, Terry Reedy wrote:
> On 11/19/2012 12:24 PM, Anselm Kruis wrote:
>> Hello,
>>
>> I found the following annoying behaviour of dict(non_dict_mapping) and
>> dict.update(non_dict_mapping), if non_dict_mapping implements
>> collections.abc.Mapping but is not an instance of dict. In this case the
>> implementations of dict() and dict.update() use PyDict_Merge(PyObject
>> *a, PyObject *b, int override).
>>
>> The essential part of PyDict_Merge(a,b, override) is
>>
>> # update dict a with the content of mapping b.
>> keys = b.keys()
>> for key in keys:
>>     ...
>>     a[key] = b.__getitem__(key)
>>
>> This algorithm is susceptible to race conditions, if a second thread
>> modifies the source mapping b between "b.keys()" and b.__getitem__(key):
>> - If the second thread deletes an item from b, PyDict_Merge fails with a
>> KeyError exception.
>> - If the second thread inserts a new value and then modifies an existing
>> value, a contains the modified value but not the new value.
>
> It is well-known that mutating a collection while iterating over it can
> lead to unexpected or undesired behavior, including exceptions. This is
> not limited updating a dict from a non-dict source. The generic answer
> is Don't Do That.

Actually that's not the case here: the implementation of dict does not 
iterate over the collection while another thread mutates the collection. 
It iterates over a list of the keys and this list does not change.

>
>> Of course the current behaviour is the best you can get with a "minimum
>> mapping interface".
>
> To me, if you know that the source in d.update(source) is managed (and
> mutated) in another thread, the obvious solution (to Not Do That) is to
> lock the source. This should work for any source and for any similar
> operation. What am I missing?

> Instead, you propose to add a specialized, convoluted method that only
> works for updates of dicts by non_dict_mappings that happen to have a
> new and very specialized magic method that automatically does the lock.
> Sorry, I don't see the point. It is not at all a generic solution to a
> generic problem.

It is the automatic locking. For list- and set-like collections it is 
already possible to implement this kind of automatic locking, because 
iterating over them returns the complete information. Mappings are 
special because of their key-value items.

Whether automatic locking of a collection is the right solution to a 
particular problem depends on the problem. There are problems where 
automatic locking is the best choice, and I think Python should support it.

(Whether my particular application belongs to this class of problems is 
another question and not relevant here.)

Regards
   Anselm

-- 
  Dipl. Phys. Anselm Kruis                       science + computing ag
  Senior Solution Architect                      Ingolstädter Str. 22
  email A.Kruis at science-computing.de             80807 München, Germany
  phone +49 89 356386 874  fax 737               www.science-computing.de
-- 
Vorstandsvorsitzender/Chairman of the board of management:
Gerd-Lothar Leonhart
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Michael Heinrichs, 
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Philippe Miltin
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196



From ncoghlan at gmail.com  Tue Nov 20 12:20:03 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 20 Nov 2012 21:20:03 +1000
Subject: [Python-ideas] thread safe dictionary initialisation from
 mappings: dict(non_dict_mapping)
In-Reply-To: <50AB4E81.4030705@science-computing.de>
References: <50AA6B38.4060903@science-computing.de> <k8e77a$js$1@ger.gmane.org>
	<50AB4E81.4030705@science-computing.de>
Message-ID: <CADiSq7cx6iwWBrpX4U4xuswxr6ckiZvcA2EVcpzZCPpswK4zdg@mail.gmail.com>

On Tue, Nov 20, 2012 at 7:33 PM, Anselm Kruis
<a.kruis at science-computing.de>wrote:

> Am 19.11.2012 22:09, schrieb Terry Reedy:
>
>  On 11/19/2012 12:24 PM, Anselm Kruis wrote:
>>
>>> Hello,
>>>
>>> I found the following annoying behaviour of dict(non_dict_mapping) and
>>> dict.update(non_dict_mapping), if non_dict_mapping implements
>>> collections.abc.Mapping but is not an instance of dict. In this case the
>>> implementations of dict() and dict.update() use PyDict_Merge(PyObject
>>> *a, PyObject *b, int override).
>>>
>>> The essential part of PyDict_Merge(a,b, override) is
>>>
>>> # update dict a with the content of mapping b.
>>> keys = b.keys()
>>> for key in keys:
>>>     ...
>>>     a[key] = b.__getitem__(key)
>>>
>>> This algorithm is susceptible to race conditions, if a second thread
>>> modifies the source mapping b between "b.keys()" and b.__getitem__(key):
>>> - If the second thread deletes an item from b, PyDict_Merge fails with a
>>> KeyError exception.
>>> - If the second thread inserts a new value and then modifies an existing
>>> value, a contains the modified value but not the new value.
>>>
>>
>> It is well-known that mutating a collection while iterating over it can
>> lead to unexpected or undesired behavior, including exceptions. This is
>> not limited to updating a dict from a non-dict source. The generic answer
>> is Don't Do That.
>>
>
> Actually that's not the case here: the implementation of dict does not
> iterate over the collection while another thread mutates the collection. It
> iterates over a list of the keys and this list does not change.


Whether or not the keys() method makes a copy of the underlying keys is
entirely up to the collection - e.g. the Python 3 dict type returns a live
view of the underlying dictionary from keys()/values()/items() rather than
returning a list copy as it did in Python 2.

Building and working with containers in a thread safe manner is inherently
challenging, and given that the standard Python containers only make
minimal efforts in that direction (relying heavily on the GIL and the way
it interacts with components written in C) it's unlikely you're ever going
to achieve adequate results without exposing an explicit locking API. For
example, you could make your container a context manager, so people could
write:

    with my_threadsafe_mapping:
        dict_copy = dict(my_threadsafe_mapping)

This has the advantage of making it easy to serialise *any* multi-step
operation on your container on its internal lock, not just the specific
case of copying to a builtin dictionary.
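
A minimal sketch of such a container, just to illustrate the shape (all names
here are made up, and a real implementation would need more care):

    import threading
    from collections.abc import MutableMapping   # collections.MutableMapping on 2.x

    class LockedMapping(MutableMapping):
        """Mapping whose internal lock doubles as a context manager."""

        def __init__(self, *args, **kwargs):
            # Reentrant, so the normal methods still work inside "with m:".
            self._lock = threading.RLock()
            self._data = dict(*args, **kwargs)

        def __enter__(self):
            self._lock.acquire()
            return self

        def __exit__(self, *exc_info):
            self._lock.release()

        def __getitem__(self, key):
            with self._lock:
                return self._data[key]

        def __setitem__(self, key, value):
            with self._lock:
                self._data[key] = value

        def __delitem__(self, key):
            with self._lock:
                del self._data[key]

        def __iter__(self):
            with self._lock:
                return iter(list(self._data))

        def __len__(self):
            with self._lock:
                return len(self._data)

Because the lock is reentrant, dict(my_threadsafe_mapping) inside the with
block simply re-acquires it from __iter__() and __getitem__() rather than
deadlocking.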

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121120/141c9172/attachment.html>

From solipsis at pitrou.net  Tue Nov 20 19:30:44 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 20 Nov 2012 19:30:44 +0100
Subject: [Python-ideas] thread safe dictionary initialisation from
 mappings: dict(non_dict_mapping)
References: <50AA6B38.4060903@science-computing.de> <k8e77a$js$1@ger.gmane.org>
	<50AB4E81.4030705@science-computing.de>
Message-ID: <20121120193044.746730b5@pitrou.net>

On Tue, 20 Nov 2012 10:33:53 +0100
Anselm Kruis
<a.kruis at science-computing.de> wrote:
> 
> It is the automatic locking. For list- and set-like collections it is 
> already possible to implement this kind of automatic locking, because 
> iterating over them returns the complete information. Mappings are 
> special because of their key-value items.
> 
> If automatic locking of a collection is the right solution to a 
> particular problem, depends on the problem. There are problems, where 
> automatic locking is the best choice. I think, python should support it.

Automatic locking is rarely the solution to any problem, especially
with such general-purpose objects as associative containers. In many
cases, you have a bit more to protect than simply the dict's contents.

Most of Python's types and routines promise to be "safe" in the face of
threads in the sense that they won't crash, but they don't promise to
do "what you want" since what you want will really depend on the use
case.

(there are a couple of special cases such as binary files where we try
to ensure meaningful thread-safety, because that's what users expect
from their experience with system APIs)

Regards

Antoine.




From phihag at phihag.de  Tue Nov 20 19:21:00 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Tue, 20 Nov 2012 19:21:00 +0100
Subject: [Python-ideas] python_modules as default directory for dependencies
	in distutils
Message-ID: <50ABCA0C.50001@phihag.de>

Currently, there are the following methods for installing dependencies:

• Use the distribution's packaging (ignored here, all further points
refer to setup.py/distutils)
• Install them system-wide (default). This requires superuser rights and
is basically guaranteed to conflict with some other application,
especially if applications are choosy about the versions of Python
packages they like.
• Install them user-wide (--user), with pretty much the same downsides,
plus that the application is now bound to the user installing it.
• Manually invoke distutils with another path (error-prone and
non-standard).
• Give up and use virtualenv. While this works fine, it's a little bit
heavy-handed to modify one's shell just to launch a potentially trivial
application.

Therefore, I'd like to suggest a new alternative location (--here =
--root "./python_modules", intended to become default in Python 5),
modeled after node's packaging system (http://goo.gl/dMRTC).

The obvious advantage of installing all dependencies into a directory in
the application root is that the application will work for every user,
never conflict with any other application, and it is both easy to
package dependencies (say, for an sftp-only rollout) and to delete all
dependencies. Of course, this is not sufficient to replace virtualenv,
but I believe a large majority of applications will (or at least should)
run under any common python interpreter.

Aside from the new flag in distutils, the site module should
automatically look into ./python_modules , as if it were a second USER_SITE.
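
Until something like that exists, an application can approximate the lookup
itself - a rough sketch, assuming the directory layout proposed above and run
from the application's entry-point script:

    import os
    import site

    _local = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                          'python_modules')
    if os.path.isdir(_local):
        site.addsitedir(_local)   # adds the directory and processes its .pth files

The point of the proposal is that nobody should have to paste this boilerplate.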

In node, this scheme works so well that virtually nobody bothers to use
system-wide installation, except when they want a binary to be available
in the PATH for all users.

This suggestion seems so obvious that it probably has been discussed
before, but my google-fu is too weak to find it. If it has, I'd be glad
to get a link to the old discussion. Thanks!

- Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121120/63725c69/attachment.pgp>

From dholth at gmail.com  Tue Nov 20 20:13:12 2012
From: dholth at gmail.com (Daniel Holth)
Date: Tue, 20 Nov 2012 14:13:12 -0500
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <50ABCA0C.50001@phihag.de>
References: <50ABCA0C.50001@phihag.de>
Message-ID: <CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>

On Tue, Nov 20, 2012 at 1:21 PM, Philipp Hagemeister <phihag at phihag.de>wrote:

> Currently, there are the following methods for installing dependencies:
>
> • Use the distribution's packaging (ignored here, all further points
> refer to setup.py/distutils)
> • Install them system-wide (default). This requires superuser rights and
> is basically guaranteed to conflict with some other application,
> especially if applications are choosy about the versions of Python
> packages they like.
> • Install them user-wide (--user), with pretty much the same downsides,
> plus that the application is now bound to the user installing it.
> • Manually invoke distutils with another path (error-prone and
> non-standard).
> • Give up and use virtualenv. While this works fine, it's a little bit
> heavy-handed to modify one's shell just to launch a potentially trivial
> application.
>
> Therefore, I'd like to suggest a new alternative location (--here =
> --root "./python_modules", intended to become default in Python 5),
> modeled after node's packaging system (http://goo.gl/dMRTC).
>
> The obvious advantage of installing all dependencies into a directory in
> the application root is that the application will work for every user,
> never conflict with any other application, and it is both easy to
> package dependencies (say, for an sftp-only rollout) and to delete all
> dependencies. Of course, this is not sufficient to replace virtualenv,
> but I believe a large majority of applications will (or at least should)
> run under any common python interpreter.
>
> Aside from the new flag in distutils, the site module should
> automatically look into ./python_modules , as if it were a second
> USER_SITE.
>
> In node, this scheme works so well that virtually nobody bothers to use
> system-wide installation, except when they want a binary to be available
> in the PATH for all users.
>
> This suggestion seems so obvious that it probably has been discussed
> before, but my google-fu is too weak to find it. If it has, I'd be glad
> to get a link to the old discussion. Thanks!
>
> - Philipp


You wouldn't need stdlib support to do this. I believe setuptools'
pkg_resources can look in a directory full of eggs, adding the required
ones to PYTHONPATH based on requirements as specified in a wrapper script.
Gem uses a directory full of versioned packages like
~/.gem/ruby/1.8/gems/sinatra-1.3.3/.

The feature is something like having a dynamic linker. It is a useful thing
to have.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121120/f5ea6d4f/attachment.html>

From chris.jerdonek at gmail.com  Tue Nov 20 20:56:41 2012
From: chris.jerdonek at gmail.com (Chris Jerdonek)
Date: Tue, 20 Nov 2012 11:56:41 -0800
Subject: [Python-ideas] thread safe dictionary initialisation from
 mappings: dict(non_dict_mapping)
In-Reply-To: <20121120193044.746730b5@pitrou.net>
References: <50AA6B38.4060903@science-computing.de> <k8e77a$js$1@ger.gmane.org>
	<50AB4E81.4030705@science-computing.de>
	<20121120193044.746730b5@pitrou.net>
Message-ID: <CAOTb1wdtUy-eCn1aMPs37z1WWdjutt1PEnmN61+WnkS75qkuDA@mail.gmail.com>

On Tue, Nov 20, 2012 at 10:30 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Tue, 20 Nov 2012 10:33:53 +0100
> Anselm Kruis
> <a.kruis at science-computing.de> wrote:
>>
>> It is the automatic locking. For list- and set-like collections it is
>> already possible to implement this kind of automatic locking, because
>> iterating over them returns the complete information. Mappings are
>> special because of their key-value items.
>>
>> If automatic locking of a collection is the right solution to a
>> particular problem, depends on the problem. There are problems, where
>> automatic locking is the best choice. I think, python should support it.
>
> Automatic locking is rarely the solution to any problem, especially
> with such general-purposes objects as associative containers. In many
> cases, you have a bit more to protect than simply the dict's contents.
>
> Most of Python's types and routines promise to be "safe" in the face of
> threads in the sense that they won't crash, but they don't promise to
> do "what you want" since what you want will really depend on the use
> case.

There is an open documentation issue related to this here:

http://bugs.python.org/issue15339

--Chris


From jimjjewett at gmail.com  Tue Nov 20 21:35:44 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 20 Nov 2012 15:35:44 -0500
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>
Message-ID: <CA+OGgf6uLzzNcp-aiQqJ850chugScuie3G6vi-Ui++8X99+x6w@mail.gmail.com>

On 11/20/12, Daniel Holth <dholth at gmail.com> wrote:
> On Tue, Nov 20, 2012 at 1:21 PM, Philipp Hagemeister
> <phihag at phihag.de>wrote:

>> Currently, there are the following methods for installing dependencies:
...
>> Therefore, I'd like to suggest a new alternative location (--here =
>> --root "./python_modules", intended to become default in Python 5),
>> modeled after node's packaging system (http://goo.gl/dMRTC).

If I'm understanding correctly,  you just mean "install dependencies
in the same place as the application that asked for them", or maybe in
a magically named subdirectory.  That does sound like a reasonable
policy -- similar to the windows or java solution of packing
everything into a single bundle.

>> Aside from the new flag in distutils, the site module should
>> automatically look into ./python_modules , as if it were a second
>> USER_SITE.

As opposed to just putting them a layer up, and looking into the
application package's own directory for relative imports?

> You wouldn't need stdlib support to do this. I believe setuptools'
> pkg_resources can look in a directory full of eggs, adding the required
> ones to PYTHONPATH based on requirements as specified in a wrapper script.
> Gem uses a directory full of versioned packages like
> ~/.gem/ruby/1.8/gems/sinatra-1.3.3/.

If I understand correctly, that just provides a way to include the
version number when choosing the system-wide package location (and
later, when importing).  Also useful, but different from bundling the
dependencies inside each application that requires them.

Most notably, the bundle-inside solution will* find exactly the module
it shipped with, including custom patches.  The versioned-packages
solution will have conflicts when more than one application provides
for the same dependency, but will better support independent
maintenance (or at least security patches) for the 4th-party modules.

* Err, unless the module was loaded before the application, or
modified locally, or something odd happened with import, or ...

-jJ


From phihag at phihag.de  Tue Nov 20 22:30:44 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Tue, 20 Nov 2012 22:30:44 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CA+OGgf6uLzzNcp-aiQqJ850chugScuie3G6vi-Ui++8X99+x6w@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>
	<CA+OGgf6uLzzNcp-aiQqJ850chugScuie3G6vi-Ui++8X99+x6w@mail.gmail.com>
Message-ID: <50ABF684.6000803@phihag.de>

On 11/20/2012 09:35 PM, Jim Jewett wrote:
>>> >> Aside from the new flag in distutils, the site module should
>>> >> automatically look into ./python_modules , as if it were a second
>>> >> USER_SITE.
> As opposed to just putting them a layer up, and looking into the
> application package's own directory for relative imports?

Precisely, because that kind of clutters the application's root
directory, especially when the number of dependencies reaches triple
digits. Think of all the entries in .hgignore/.gitignore alone.

> Most notably, the bundle-inside solution will* find exactly the module
> it shipped with, including custom patches.  The versioned-packages
> solution will have conflicts when more than one application provides
> for the same dependency, but will better support independent
> maintenance (or at least security patches) for the 4th-party modules.

Yeah, not having automatic security updates is a definite downside of
bundling into a local directory; but that's no different from the
situation with a virtualenv (or user-specific packages).

- Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121120/72e3d12b/attachment.pgp>

From phihag at phihag.de  Tue Nov 20 22:54:53 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Tue, 20 Nov 2012 22:54:53 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CAG8k2+7tquKOf4hbcgT84ZMUNvnuHQd3GRF9-Tz0CiQ4JMjNNQ@mail.gmail.com>
Message-ID: <50ABFC2D.3090106@phihag.de>

On 11/20/2012 08:13 PM, Daniel Holth wrote:
> You wouldn't need stdlib support to do this. I believe setuptools'
> pkg_resources can look in a directory full of eggs, adding the required
> ones to PYTHONPATH based on requirements as specified in a wrapper script.

I'm not quite sure which aspect you're referring to - changing
distutils(=setup.py) to have the --here option, or changing site?

If it's the former, I still have to download the directory full of eggs
to somewhere, don't I? And the point of this suggestion is that
instead of *somewhere*, there's a dedicated "standard" location.

And the point of the change to  site  would be that one doesn't need to
do anything,

git clone http://example.org/app
python setup.py install --here
./app.py

would just work without modification to the application (and not disturb
any other application).

My limited understanding of pkg_resources may impede me though. Can you
link me to or describe how I can use setuptools here?

Thanks,

Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121120/ccb9492e/attachment.pgp>

From ncoghlan at gmail.com  Wed Nov 21 04:27:01 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 21 Nov 2012 13:27:01 +1000
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <50ABCA0C.50001@phihag.de>
References: <50ABCA0C.50001@phihag.de>
Message-ID: <CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>

On Wed, Nov 21, 2012 at 4:21 AM, Philipp Hagemeister <phihag at phihag.de>wrote:

> Currently, there are the following methods for installing dependencies:
>
> • Use the distribution's packaging (ignored here, all further points
> refer to setup.py/distutils)
> • Install them system-wide (default). This requires superuser rights and
> is basically guaranteed to conflict with some other application,
> especially if applications are choosy about the versions of Python
> packages they like.
> • Install them user-wide (--user), with pretty much the same downsides,
> plus that the application is now bound to the user installing it.
> • Manually invoke distutils with another path (error-prone and
> non-standard).
> • Give up and use virtualenv. While this works fine, it's a little bit
> heavy-handed to modify one's shell just to launch a potentially trivial
> application.
>

Or install them all in a single directory, add a __main__.py file to that
directory and then just pass that directory name on the command line
instead of a script name. The directory will be added as sys.path[0] and
the __main__.py file will be executed as the main module (If your
additional application dependencies are all pure Python files, you can even
zip up that directory and pass that on the command line instead). This
approach has been supported since at least Python 2.6, but was missing from
the original What's New, and nobody ever goes back to read the "using"
documentation on the website because they assume they already know how
invoking the interpreter works.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/04a61788/attachment.html>

From phihag at phihag.de  Wed Nov 21 10:50:57 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Wed, 21 Nov 2012 10:50:57 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
Message-ID: <50ACA401.2080303@phihag.de>

On 11/21/2012 04:27 AM, Nick Coghlan wrote:
> Or install them all in a single directory, add a __main__.py file to that
> directory and then just pass that directory name on the command line
> instead of a script name. The directory will be added as sys.path[0] and
> the __main__.py file will be executed as the main module (If your
> additional application dependencies are all pure Python files, you can even
> zip up that directory and pass that on the command line instead).
I'm well aware of that approach, but didn't apply it to dependencies,
and am still not sure how to. Can you describe what a hypothetical
helloworld application with one dependency would look like? And wouldn't
one sacrifice the ability to seamlessly import from the application's
code itself?

As far as I understand, you suggest a setup like

./main.py (with content:
  import lxml.etree
  import myapp
  myapp.hello(lxml.etree.fromstring('<foo/>'))")
)
./myapp/__init__.py
./python_modules/__main__.py -> ../main.py
./python_modules/myapp -> ../myapp  # Or a path fixup in main
./python_modules/lxml/...   # or equivalent .pth
./myapp.sh (chmod +x, with content:
  python -m python_modules
)

which strikes me as really complex (and would still benefit from a
--here option to distutils). And how would the setup.py in . look to set
up all the symlinks?

- Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/d2a91925/attachment.pgp>

From ncoghlan at gmail.com  Wed Nov 21 14:38:39 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 21 Nov 2012 23:38:39 +1000
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <50ACA401.2080303@phihag.de>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
	<50ACA401.2080303@phihag.de>
Message-ID: <CADiSq7d1sCytBav-dehkaPsVo--nYrbBCpPfkMkiLDxP6j6AMQ@mail.gmail.com>

On Wed, Nov 21, 2012 at 7:50 PM, Philipp Hagemeister <phihag at phihag.de>wrote:

> On 11/21/2012 04:27 AM, Nick Coghlan wrote:
> > Or install them all in a single directory, add a __main__.py file to that
> > directory and then just pass that directory name on the command line
> > instead of a script name. The directory will be added as sys.path[0] and
> > the __main__.py file will be executed as the main module (If your
> > additional application dependencies are all pure Python files, you can
> even
> > zip up that directory and pass that on the command line instead).
> I'm well-aware of that approach, but didn't apply it to dependencies,
> and am still not sure how to. Can you describe how a hypothetical
> helloworld application with one dependency would look like? And wouldn't
> one sacrifice the ability to seamlessly import from the application's
> code itself.
>

One directory containing:

  runthis/
       __main__.py
           (with content as described for your main.py)
       lxml
       myapp

Execute "python runthis" (via a +x shell script if you prefer). Note the
lack of -m: you're executing the directory contents, not a package. You can
also bundle it all into a zip file, but that only works if you don't need C
extension support (since zipimport can't handle the necessary step of
extracting the shared libraries out to separate files so the OS can load
them).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/62d35ac5/attachment.html>

From phihag at phihag.de  Wed Nov 21 16:07:47 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Wed, 21 Nov 2012 16:07:47 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CADiSq7d1sCytBav-dehkaPsVo--nYrbBCpPfkMkiLDxP6j6AMQ@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
	<50ACA401.2080303@phihag.de>
	<CADiSq7d1sCytBav-dehkaPsVo--nYrbBCpPfkMkiLDxP6j6AMQ@mail.gmail.com>
Message-ID: <50ACEE43.8030806@phihag.de>

On 11/21/2012 02:38 PM, Nick Coghlan wrote:
>   runthis/
>        __main__.py
>            (with content as described for your main.py)
>        lxml
>        myapp

But how is that different from putting everything into the root
directory of the application?

In particular, assume that I have lxml001..lxml100 . Don't I still have
to gitignore/hgignore all of them, and write a convoluted target to
delete all of them?

Plus, how would I install all these dependencies with distutils?

- Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/cb7b665d/attachment.pgp>

From sturla at molden.no  Wed Nov 21 16:12:04 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 21 Nov 2012 16:12:04 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
Message-ID: <50ACEF44.1090705@molden.no>

See this:

http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063593.html

According to Apple engineers:

"""
For API outside of POSIX, including GCD and technologies like
Accelerate, we do not support usage on both sides of a fork(). For
this reason among others, use of fork() without exec is discouraged in
general in processes that use layers above POSIX.
"""

Multiprocessing on OSX calls os.fork, but not os.exec.

Thus, is multiprocessing erroneously implemented on Mac? Forking 
without calling exec means that only APIs inside POSIX can be used by 
the child process.

For NumPy, it even affects functions like matrix multiplication when the 
Accelerate framework is used for BLAS.

Does multiprocessing need a reimplementation on Mac to behave as it 
does on Windows? (Yes it would cripple it similarly to the crippled 
multiprocessing on Windows.)

And what about Python itself? Is there any non-POSIX code in the 
interpreter? If it is, os.fork should be removed on Mac.


Sturla





From solipsis at pitrou.net  Wed Nov 21 20:25:26 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 21 Nov 2012 20:25:26 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
References: <50ACEF44.1090705@molden.no>
Message-ID: <20121121202526.206ccc84@pitrou.net>

On Wed, 21 Nov 2012 16:12:04 +0100
Sturla Molden <sturla at molden.no> wrote:

> See this:
> 
> http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063593.html
> 
> According to Apple engineers:
> 
> """
> For API outside of POSIX, including GCD and technologies like
> Accelerate, we do not support usage on both sides of a fork(). For
> this reason among others, use of fork() without exec is discouraged in
> general in processes that use layers above POSIX.
> """
> 
> Multiprocessing on OSX calls os.fork, but not os.exec.
> 
> Thus, is multiprocessing erroneously implemented on Mac? Forking 
> without calling exec means that only APIs inside POSIX can be used by 
> the child process.

Or perhaps "fork()" is erroneously implemented on Mac.
Regardless, http://bugs.python.org/issue8713 has a proposal by Richard
to make things more configurable on all POSIX platforms.

Regards

Antoine.




From mwm at mired.org  Wed Nov 21 20:34:36 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 21 Nov 2012 13:34:36 -0600
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <50ACEF44.1090705@molden.no>
References: <50ACEF44.1090705@molden.no>
Message-ID: <CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>

On Wed, Nov 21, 2012 at 9:12 AM, Sturla Molden <sturla at molden.no> wrote:
> See this:
>
> http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063593.html
>
> According to Apple engineers:
>
> """
> For API outside of POSIX, including GCD and technologies like
> Accelerate, we do not support usage on both sides of a fork(). For
> this reason among others, use of fork() without exec is discouraged in
> general in processes that use layers above POSIX.
> """
>
> Multiprocessing on OSX calls os.fork, but not os.exec.
>
> Thus, is multiprocessing erroneously implemented on Mac? Forking without
> calling exec means that only APIs inside POSIX can be used by the child
> process.

That isn't the way I read the quoted statement - and it's probably not
true. The way I read it, you have to make all your above-the-POSIX
calls either before you fork, or make them all *after* you fork, but
not on "both sides of a fork()." The reality is probably that you're
ok so long as you make sure the above-the-POSIX data is in the "right"
state before you fork. Rather than trying to describe the "right"
state, Apple provides a simple rule to do that - and then provides a
simple-minded rule to force compliance with that rule.

Note that the same warning applies to some of the objects that are
defined by POSIX interfaces. If you fork with them in the "wrong"
state, you're going to get broken behavior.

> For NumPy, it even affects functions like matrix multiplication when the
> accelerate framework is used for BLAS.
>
> Does multiprocessing need a reimplementation on Mac to behave as it does on
> Windows? (Yes it would cripple it similarly to the crippled multiprocessing
> on Windows.)

-1.

Macs make nice Unix desktops (if you like their GUI), and it's not
uncommon to find people writing & testing software on Macs for
deployment to POSIX servers. Such people using the multiprocessing
module need it to provide proper POSIX behavior.

> And what about Python itself? Is there any non-POSIX code in the
> interpreter? If it is, os.fork should be removed on Mac.

Well, the Mac-specific portions might. But those of us developing for
deployment to non-Mac systems won't be using those.

In general, these issues are simply a case of needing to be aware of
what your program is doing, and making sure you don't do things that
can cause problems - whether you're using above-the-POSIX Mac APIs, or
the problematic POSIX APIs.  A case might be made that the
multiprocessing module should help with that, by providing a way to
say "I'm doing things that make fork dangerous, please use the
system-appropriate fork+exec call instead."

       <mike


From abarnert at yahoo.com  Wed Nov 21 21:25:01 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Wed, 21 Nov 2012 12:25:01 -0800 (PST)
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <20121121202526.206ccc84@pitrou.net>
References: <50ACEF44.1090705@molden.no> <20121121202526.206ccc84@pitrou.net>
Message-ID: <1353529501.66820.YahooMailRC@web184704.mail.ne1.yahoo.com>

From: Antoine Pitrou <solipsis at pitrou.net>

To: python-ideas at python.org
> 
> On Wed, 21 Nov 2012 16:12:04 +0100
> Sturla Molden <sturla at molden.no> wrote:
> 
> > According to Apple engineers:
> > 
> > """
> > For API  outside of POSIX, including GCD and technologies like
> > Accelerate, we do  not support usage on both sides of a fork(). For
> > this reason among  others, use of fork() without exec is discouraged in
> > general in  processes that use layers above POSIX.
> > """
> > 
> >  Multiprocessing on OSX calls os.fork, but not os.exec.
> > 
> > Thus, is  multiprocessing erroneously implemented on Mac? Forking 
> > without  calling exec means that only APIs inside POSIX can be used by 
> > the child  process.
> 
> Or perhaps "fork()" is erroneously implemented on  Mac.

No, it's not that fork is erroneously implemented, it's that CoreFoundation, as 
designed, doesn't work across a properly-implemented POSIX fork. And presumably 
the same is true for various other Apple technologies, but they won't give a  
complete list of what is and isn't safe and why - instead, they just say "don't 
use any non-POSIX stuff if you fork without exec".

And the problem Python users face isn't that multiprocessing works differently 
on OS X vs. FreeBSD or linux, but that their programs may be quietly using 
non-fork-safe things like Accelerate.framework on OS X but not on FreeBSD or 
linux.



From greg at krypto.org  Wed Nov 21 21:29:23 2012
From: greg at krypto.org (Gregory P. Smith)
Date: Wed, 21 Nov 2012 12:29:23 -0800
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
Message-ID: <CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>

On Wed, Nov 21, 2012 at 11:34 AM, Mike Meyer <mwm at mired.org> wrote:

> On Wed, Nov 21, 2012 at 9:12 AM, Sturla Molden <sturla at molden.no> wrote:
> > See this:
> >
> > http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063593.html
> >
> > According to Apple engineers:
> >
> > """
> > For API outside of POSIX, including GCD and technologies like
> > Accelerate, we do not support usage on both sides of a fork(). For
> > this reason among others, use of fork() without exec is discouraged in
> > general in processes that use layers above POSIX.
> > """
> >
> > Multiprocessing on OSX calls os.fork, but not os.exec.
> >
> > Thus, is multiprocessing erroneously implemented on Mac? Forking without
> > calling exec means that only APIs inside POSIX can be used by the child
> > process.
>

I don't care how this is read or what the default behavior is. All I want
is for someone who cares about multiprocessing actually working well for
people to implement optional support for *not* using os.fork() on posixish
systems.  ie: port an equivalent of the windows stuff over.

I don't use multiprocessing so I've never looked into adding it.  It
shouldn't be difficult given the legwork has already been done for use on
windows.  this is exactly what issue8713 is asking for.


> > Does multiprocessing need a reimplementation on Mac to behave as it
> does on
> > Windows? (Yes it would cripple it similarly to the crippled
> multiprocessing
> > on Windows.)
>
> -1.
>

+10 - though I wouldn't call it a "re" implementation, just a port of the
windows stuff.

Mac's make nice Unix desktops (if you like their GUI), and it's not
> uncommon to find people writing & testing software on Mac's for
> deployment to POSIX servers. Such people using the multiprocessing
> module need it to provide proper POSIX behavior.
>

Those people are delusional, they need to test their software on the OS it
is destined to run on. Not on their terminal.


> In general, these issues are simply a case of needing to be aware of
> what your program is doing, and making sure you don't do things that
> can cause problems - whether you're using above-the-POSIX Mac APIs, or
> POSIX the problematic Posix APIs.  A case might be made that the
> multiprocessing module should help with that, by providing a way to
> say "I'm doing things that make fork dangerous, please use the
> system-appropriate fork+exec call instead."
>

Yep.  Like I said, I don't personally care what the default is.  I just
want support for both options so that people have a way to _not_ shoot
themselves in the foot.

-gps
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/a4429eeb/attachment.html>

From ronaldoussoren at mac.com  Wed Nov 21 21:44:24 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Wed, 21 Nov 2012 21:44:24 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <20121121202526.206ccc84@pitrou.net>
References: <50ACEF44.1090705@molden.no> <20121121202526.206ccc84@pitrou.net>
Message-ID: <949C7486-A094-4DC5-AFB7-B7EB45FC8DA0@mac.com>


On 21 Nov, 2012, at 20:25, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Wed, 21 Nov 2012 16:12:04 +0100
> Sturla Molden <sturla at molden.no> wrote:
> 
>> See this:
>> 
>> http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063593.html
>> 
>> According to Apple engineers:
>> 
>> """
>> For API outside of POSIX, including GCD and technologies like
>> Accelerate, we do not support usage on both sides of a fork(). For
>> this reason among others, use of fork() without exec is discouraged in
>> general in processes that use layers above POSIX.
>> """
>> 
>> Multiprocessing on OSX calls os.fork, but not os.exec.
>> 
>> Thus, is multiprocessing erroneously implemented on Mac? Forking 
>> without calling exec means that only APIs inside POSIX can be used by 
>> the child process.
> 
> Or perhaps "fork()" is erroneously implemented on Mac.

Fork works fine, it's "just" that most system libraries above the POSIX layer don't bother to clean
up their state in the child process. 

There may be good reasons for that (cleaning up state changes by background threads
might be hard), but it's pretty annoying nonetheless.

Ronald


From ronaldoussoren at mac.com  Wed Nov 21 21:49:28 2012
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Wed, 21 Nov 2012 21:49:28 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <50ACEF44.1090705@molden.no>
References: <50ACEF44.1090705@molden.no>
Message-ID: <0A5D8405-754A-49EA-B74C-B3288E284EB2@mac.com>


On 21 Nov, 2012, at 16:12, Sturla Molden <sturla at molden.no> wrote:

> 
> 
> And what about Python itself? Is there any non-POSIX code in the interpreter? If it is, os.fork should be removed on Mac.

Not necessarily in the interpreter itself, but the proxy-detection code in _scproxy uses non-POSIX code for detecting
the user's proxy settings.

BTW. removing os.fork is overkill, some system APIs don't work properly on OSX after fork and complain loudly when
you try to use them. So don't do that.

Ronald


From greg at krypto.org  Wed Nov 21 21:57:22 2012
From: greg at krypto.org (Gregory P. Smith)
Date: Wed, 21 Nov 2012 12:57:22 -0800
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <0A5D8405-754A-49EA-B74C-B3288E284EB2@mac.com>
References: <50ACEF44.1090705@molden.no>
	<0A5D8405-754A-49EA-B74C-B3288E284EB2@mac.com>
Message-ID: <CAGE7PNK9bgYWmr_zKd8q55jdLd3BW2b-SvmLs-NvLD+xFv-5-A@mail.gmail.com>

On Wed, Nov 21, 2012 at 12:49 PM, Ronald Oussoren <ronaldoussoren at mac.com>wrote:

>
> On 21 Nov, 2012, at 16:12, Sturla Molden <sturla at molden.no> wrote:
>
> >
> >
> > And what about Python itself? Is there any non-POSIX code in the
> interpreter? If it is, os.fork should be removed on Mac.
>
> Not necessarily in the interpeter itself, but the proxy-detection code in
> _scproxy uses non-POSIX code for detecting
> the user's proxy settings.
>

well, it depends.  it's not right to ask for "non-POSIX code" as the
restrictions of what you can use after a fork() are related to what you've
done before the fork (as someone else stated).  if your process has spawned
threads, the entire Python interpreter is unsafe to use after a fork(), as
POSIX requires that only a limited subset of safe functions be used
from then on...


>
> BTW. removing os.fork is overkill, some system APIs don't work properly on
> OSX after fork and complain loudly when
> you try to use them. So don't do that.
>
> Ronald
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/78a7e098/attachment.html>

From mwm at mired.org  Wed Nov 21 23:14:07 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 21 Nov 2012 16:14:07 -0600
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
Message-ID: <CAD=7U2AzUbf+-nU1M9YnPRYb9rNrw=be=RJz-GVSphWmiRvXGA@mail.gmail.com>

On Wed, Nov 21, 2012 at 2:29 PM, Gregory P. Smith <greg at krypto.org> wrote:
> On Wed, Nov 21, 2012 at 11:34 AM, Mike Meyer <mwm at mired.org> wrote:
>> On Wed, Nov 21, 2012 at 9:12 AM, Sturla Molden <sturla at molden.no> wrote:
>> Mac's make nice Unix desktops (if you like their GUI), and it's not
>> uncommon to find people writing & testing software on Mac's for
>> deployment to POSIX servers. Such people using the multiprocessing
>> module need it to provide proper POSIX behavior.
>
> Those people are delusional, they need to test their software on the OS it
> is destined to run on. Not on their terminal.

Are you saying that it's delusional to try and write portable code
that runs on POSIX-compliant platforms in Python?

Of course, any such attempt always has to be tested against the
platforms it's supposed to run on - after all, they may fail to
implement the standard correctly. But those are generally considered
to be bugs in the platform in question that need to be worked around,
*not* bugs in software. Since Apple is one of the few vendors
releasing products with a Unix certification, their products are prime
candidates as development platforms for anyone working on Unix
software.

	<mike


From shibturn at gmail.com  Wed Nov 21 23:18:31 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Wed, 21 Nov 2012 22:18:31 +0000
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
Message-ID: <k8jjvq$7a7$1@ger.gmane.org>

On 21/11/2012 8:29pm, Gregory P. Smith wrote:
> I don't care how this is read or what the default behavior is. All I
> want is for someone who cares about multiprocessing actually working
> well for people to implement optional support for *not* using os.fork()
> on posixish systems.  ie: port an equivalent of the windows stuff over.
>
> I don't use multiprocessing so I've never looked into adding it.  It
> shouldn't be difficult given the legwork has already been done for use
> on windows.  this is exactly what issue8713 is asking for.

An implementation is available at

     http://hg.python.org/sandbox/sbt#spawn

You just need to stick

     multiprocessing.set_start_method('spawn')

at the beginning of the program to use fork+exec instead of fork.  The 
test suite passes.  (I would not say that making this work was that 
straightforward though.)

That branch also supports the starting of processes via a server 
process.  That gives an alternative solution to the problem of mixing 
fork() with threads, but has the advantage of being as fast as the 
default fork start method.  However, it does not work on systems where 
fd passing is unsupported like OpenSolaris.

-- 
Richard



From jimjjewett at gmail.com  Wed Nov 21 23:21:36 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 21 Nov 2012 17:21:36 -0500
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <50ACA401.2080303@phihag.de>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
	<50ACA401.2080303@phihag.de>
Message-ID: <CA+OGgf5kTWf3FV49F66qNXEbD0nqq=zWgunPn9nDbxrFgPt1mQ@mail.gmail.com>

On 11/21/12, Philipp Hagemeister <phihag at phihag.de> wrote:
> On 11/21/2012 04:27 AM, Nick Coghlan wrote:
>> Or install them all in a single directory, add a __main__.py file to that
>> directory and then just pass that directory name on the command line
>> instead of a script name. The directory will be added as sys.path[0] and
>> the __main__.py file will be executed as the main module

>... And wouldn't one sacrifice the ability to seamlessly import from the
> application's code itself.

Do you mean from within the application, or from the supposedly
independent libraries that you depend upon?


> As far as I understand, you suggest a setup like ...

> ./python_modules/__main__.py -> ../main.py
> ./python_modules/myapp -> ../myapp  # Or a path fixup in main

Skip those two ... if something inside python_modules is looking
at your application, then it really shouldn't be segregated into a
python_modules directory.  (And if you need to anyhow, make
those imports explicit, so that you don't end up with two copies of
the "same" module.)

That said, I think (but haven't tested) that import __main__ or
import myprojutils will do the right thing, because of sys.path[0]
being the root directory of myapp.

-jJ


From mwm at mired.org  Wed Nov 21 23:25:53 2012
From: mwm at mired.org (Mike Meyer)
Date: Wed, 21 Nov 2012 16:25:53 -0600
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <CAGE7PNK9bgYWmr_zKd8q55jdLd3BW2b-SvmLs-NvLD+xFv-5-A@mail.gmail.com>
References: <50ACEF44.1090705@molden.no>
	<0A5D8405-754A-49EA-B74C-B3288E284EB2@mac.com>
	<CAGE7PNK9bgYWmr_zKd8q55jdLd3BW2b-SvmLs-NvLD+xFv-5-A@mail.gmail.com>
Message-ID: <CAD=7U2BXMxdj=5hFUeyZ6xdSNh3_du-e_cZR+bs6LURHBBae2Q@mail.gmail.com>

On Wed, Nov 21, 2012 at 2:57 PM, Gregory P. Smith <greg at krypto.org> wrote:
> On Wed, Nov 21, 2012 at 12:49 PM, Ronald Oussoren <ronaldoussoren at mac.com>
> wrote:
>> On 21 Nov, 2012, at 16:12, Sturla Molden <sturla at molden.no> wrote:
> well, it depends.  its not right to ask for "non-posix code" as the
> restrictions of what you can use after a fork() are related to what you've
> done before the fork (as someone else stated).  if your process has spawned
> threads, the entire python interpreter is unsafe to use after a fork()

If your process has spawned threads, POSIX fork() is unsafe. It
creates a clone of your process with every thread but the one calling
fork() stopped dead. Unless you can guarantee that this was a safe
thing to do then (i.e. - all your other threads hold no locks, have no
shared data structures in violation of invariant, etc.), you can be
hosed. Some cases can even hose the parent process.

Calling exec shortly after doing the fork will help with some of these
issues. Not all of them.
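
A minimal sketch of the classic case (Python 3, POSIX only, purely
illustrative):

    import os
    import threading
    import time

    lock = threading.Lock()

    def worker():
        with lock:            # a background thread takes the lock...
            time.sleep(5)

    threading.Thread(target=worker).start()
    time.sleep(0.1)           # ...and is still holding it when we fork

    pid = os.fork()
    if pid == 0:              # child inherits the locked lock, but not the thread
        if lock.acquire(timeout=2):
            print("child: got the lock")
        else:
            print("child: stuck - nothing in this process will ever release it")
        os._exit(0)
    os.waitpid(pid, 0)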

	<mike


From phihag at phihag.de  Wed Nov 21 23:34:17 2012
From: phihag at phihag.de (Philipp Hagemeister)
Date: Wed, 21 Nov 2012 23:34:17 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CA+OGgf5kTWf3FV49F66qNXEbD0nqq=zWgunPn9nDbxrFgPt1mQ@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
	<50ACA401.2080303@phihag.de>
	<CA+OGgf5kTWf3FV49F66qNXEbD0nqq=zWgunPn9nDbxrFgPt1mQ@mail.gmail.com>
Message-ID: <50AD56E9.3090204@phihag.de>

On 11/21/2012 11:21 PM, Jim Jewett wrote:
> That said, I think (but haven't tested) that import __main__ or
> import myprojutils will do the right thing, because of sys.path[0]
> being the root directory of myapp.
Nick already clarified what he meant in his other mail, archived at
http://mail.python.org/pipermail/python-ideas/2012-November/017928.html

I just misunderstood his proposal - as far as my current understanding
goes, he suggests putting the application, its __main__.py file, as well
as the dependencies in one subdirectory.

Cheers,

Philipp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121121/88228cc4/attachment.pgp>

From abarnert at yahoo.com  Thu Nov 22 01:06:07 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Wed, 21 Nov 2012 16:06:07 -0800 (PST)
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CAN1F8qWD0watc8d=QsV4dgYxr2U7Wss3dyT4RJ4a4Yocz2UWPw@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info> <50A6E47F.3030304@pearwood.info>
	<1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAN1F8qWD0watc8d=QsV4dgYxr2U7Wss3dyT4RJ4a4Yocz2UWPw@mail.gmail.com>
Message-ID: <1353542767.51180.YahooMailRC@web184705.mail.ne1.yahoo.com>

From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM

>Surely the best choice is to have *two* caches; one for hashables and another 
>for the rest.

Your implementation does a try: hash() to decide whether to check the set or the 
list, instead of just doing a try: item in set except: item in list. Is there a 
reason for this? It's more complicated, and it's measurably slower.

>This might be improvable with a *third* cache if some non-hashables had total 
>ordering, but that could also introduce bugs I think. It'd also be a lot harder 

>and likely be slower anyway.

I agree that it's probably not worth adding to something in the standard 
library, or a recipe given in the documentation (in fact, I think I already said 
as much earlier in the thread), but I think you've got most of those facts 
wrong.

It's not a lot harder. The same 4 lines you have to add to do a 
try-set-except-list, you just do again, so it's 
try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And 
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique, 
which is obviously better, but probably slower for tiny M, and another 5-10% 
overhead for inappropriate values.
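
A rough sketch of that three-tier fallback, with bisect on a plain list
standing in for a real sorted container (so insertion here is still O(M)), and
with extra equality checks to guard against items that straddle tiers:

    from bisect import bisect_left, insort

    def uniquify(items):
        seen_set = set()        # hashable items
        seen_sorted = []        # unhashable but orderable items, kept sorted
        seen_list = []          # everything else, linear scan
        for item in items:
            try:
                if item in seen_set:
                    continue
                seen_set.add(item)
            except TypeError:                    # unhashable
                try:
                    i = bisect_left(seen_sorted, item)
                    if i < len(seen_sorted) and seen_sorted[i] == item:
                        continue
                    if item in seen_list:        # seen earlier via the last tier
                        continue
                    insort(seen_sorted, item)
                except TypeError:                # unorderable here as well
                    if item in seen_list or item in seen_sorted:
                        continue
                    seen_list.append(item)
            yield item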

The problem is finding an appropriate sortedcollection type. There isn't one in 
the standard library. There's a link to an external SortedCollection reference 
in the bisect docs page, but that has O(N) insertion time, so it won't help. The 
most popular library I know of is blist.sortedlist, and that works, but it has 
quite a bit of overhead for searching tiny lists. As I understand it, the reason 
there isn't a standard sorted collection is the assumption that people dealing 
with huge sequences ought to expect  to have some searching, comparing, and 
profiling of algorithms in their  future, while those people dealing with len<10 
sequences shouldn't have to think at all.

At any rate, I tried a few different sorted collections. The performance for 
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the 
crossover was around M=100, and you get 10x faster by around M=100K. Deciding 
whether this is appropriate, and which implementation to use, and so on - well, 
that's exactly why there's no sorted list in the stdlib in the first place.



From benhoyt at gmail.com  Thu Nov 22 12:39:42 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 00:39:42 +1300
Subject: [Python-ideas] BetterWalk, a better and faster os.walk() for Python
Message-ID: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>

In the recent thread I started called "Speed up os.walk()..." [1] I
was encouraged to create a module to flesh out the idea, so I present
you with BetterWalk:

https://github.com/benhoyt/betterwalk#readme

It's basically all there, and works on Windows, Linux, and Mac OS X.
It probably works on FreeBSD too, but I haven't tested that. I also
haven't written thorough unit tests yet, but intend to after some
further feedback.

In terms of the API for iterdir_stat(), I settled on the more explicit
"pass in what stat fields you want" (the 'fields' parameter). I also
added a 'pattern' parameter to allow you to make use of the wildcard
matching that FindFirst/FindNext provide (it's useful for globbing on
POSIX too, but not a performance improvement).

As for benchmarks, it's about what I saw earlier on Windows (2-6x on
recent versions, depending). My initial tests on Mac OS X show it's
5-10x as fast on that platform! I haven't double-checked those results
yet though.

The results on Linux were somewhat disappointing -- only a 10% speed
improvement on large directories, and it's actually slower on small
directories. It's still doing half the number of system calls ... so I
believe this is because cached os.stat() is super fast on Linux, and
so the slowdown from using ctypes / pure Python is outweighing the
gain from not doing the system call. That said, I've also only tested
Linux in a VirtualBox setup, so maybe that's affecting it too.

Still, if it's a significant win for Windows and OS X users, it's a good thing.

In any case, I'd love it if folks could run the benchmark on their
system (with and without -s) and comment further on the idea and API.

Thanks,
Ben.

[1] http://mail.python.org/pipermail/python-ideas/2012-November/017770.html


From stefan at drees.name  Thu Nov 22 16:58:19 2012
From: stefan at drees.name (Stefan Drees)
Date: Thu, 22 Nov 2012 16:58:19 +0100
Subject: [Python-ideas] BetterWalk,
 a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <50AE4B9B.4050008@drees.name>

Hi Ben,

Am 22.11.12 12:39, schrieb Ben Hoyt:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
>
> https://github.com/benhoyt/betterwalk#readme
>
> ...
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.
>

thanks a lot. I tried it out. Inside the git repo:

$> source /somepath/venv/bin/activate
$(venv)> python ./benchmark.py
Creating tree at ./benchtree: depth=4, num_dirs=5, num_files=50
Traceback (most recent call last):
   File "./benchmark.py", line 121, in <module>
     main()
   File "./benchmark.py", line 116, in main
     create_tree(tree_dir)
   File "./benchmark.py", line 26, in create_tree
     f.write(line * 20000)
TypeError: 'str' does not support the buffer interface
$(venv)> python -V
Python 3.3.0
$(venv)> python
Python 3.3.0 (default, Oct 24 2012, 11:01:23)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.65))]
  ...

Some better way to test it :?)

All the best,
Stefan.
>...
> [1] http://mail.python.org/pipermail/python-ideas/2012-November/017770.html
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>



From mwm at mired.org  Thu Nov 22 17:33:52 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 22 Nov 2012 10:33:52 -0600
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <CAD=7U2A5+sCBxsBgbs3Hb04U4GqkFVf=O6tZVrPRtVSQ-cHSxA@mail.gmail.com>

On Thu, Nov 22, 2012 at 5:39 AM, Ben Hoyt <benhoyt at gmail.com> wrote:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
>
> https://github.com/benhoyt/betterwalk#readme
>
> It's basically all there, and works on Windows, Linux, and Mac OS X.
> It probably works on FreeBSD too, but I haven't tested that.

It doesn't work on FreeBSD 9.1. Here's a quick failure in  the source
directory that might help
you diagnose the problem (I probably won't be able to look into it
deeper until after the holiday weekend):

>>>from betterwalk import walk
>>>for x in walk('.'):
... print x
...
('.', [u'', u'', u'erwalk.py', u'erwalk.pyc', u'p.py'], [u'ignore',
u'hmark.py', u's', u'erwalk.py', u'htree', u'attributes', u'GES.txt',
u'ME.md', u'NSE.txt']) (u'./', [u'', u'', u'erwalk.py', u'erwalk.pyc',
u'p.py'], [u'ignore', u'hmark.py', u's', u'erwalk.py', u'htree',
u'attributes', u'GES.txt', u'ME.md', u'NSE.txt'])
<break>

I also got an error trying to run the benchmark program on 3.2 (this
is python-ideas, which means things discussed here are bound for
Python 3):

bhuda% python3.2 benchmark.py
Creating tree at benchtree: depth=4, num_dirs=5, num_files=50
Traceback (most recent call last):
  File "benchmark.py", line 121, in <module>
    main()
  File "benchmark.py", line 116, in main
    create_tree(tree_dir)
  File "benchmark.py", line 26, in create_tree
    f.write(line * 20000)
TypeError: 'str' does not support the buffer interface

You need to use a bytestring instead of a string here. The best way
will depend on how old a version of python you want to support.
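
For instance, something along these lines works under both (illustrative only;
the real content and filename in benchmark.py will differ):

# Keep the data as bytes so the same write works on Python 2 and Python 3:
line = b'The quick brown fox jumps over the lazy dog\n'
with open('benchfile.txt', 'wb') as f:
    f.write(line * 20000)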

     <mike


From benhoyt at gmail.com  Thu Nov 22 20:20:35 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 08:20:35 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <50AE4B9B.4050008@drees.name>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AE4B9B.4050008@drees.name>
Message-ID: <CAL9jXCEwXTF-YhN-McPXAPnkyQfWYV9-dDg4dh-Vyhodqk0Esg@mail.gmail.com>

> TypeError: 'str' does not support the buffer interface
...
> Some better way to test it :?)

Oops, sorry! I'd created the directory tree using Python 2.x, and then
tested using Python 3.x with the dir tree still present.

Fixed in the GitHub repo now.

-Ben


From benhoyt at gmail.com  Thu Nov 22 20:23:31 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 08:23:31 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAD=7U2A5+sCBxsBgbs3Hb04U4GqkFVf=O6tZVrPRtVSQ-cHSxA@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<CAD=7U2A5+sCBxsBgbs3Hb04U4GqkFVf=O6tZVrPRtVSQ-cHSxA@mail.gmail.com>
Message-ID: <CAL9jXCE5ZoTOOuFSae55NHXs80d1SkWaZsaBo9nuEN9f2tEqKg@mail.gmail.com>

> It doesn't work on FreeBSD 9.1. Here's a quick failure in  the source
...
> ('.', [u'', u'', u'erwalk.py', u'erwalk.pyc', u'p.py'], [u'ignore',

Thanks. FreeBSD seemed to have the same "problem" as Mac OS X (issue
#2). I fixed this on the GitHub repo now -- please try again.

> I also got [a TypeError] trying to run the benchmark program on 3.2 (this
> is python-ideas, which means things discussed here are bound for
> Python 3):

Yeah, sorry about that. I'd certainly tested the benchmark on Python
3.x, however not the directory tree creation. Again, fixed now on
GitHub.

-Ben


From solipsis at pitrou.net  Thu Nov 22 20:43:32 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 22 Nov 2012 20:43:32 +0100
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <20121122204332.2fcb6f66@pitrou.net>

On Fri, 23 Nov 2012 00:39:42 +1300
Ben Hoyt <benhoyt at gmail.com> wrote:
> 
> Still, if it's a significant win for Windows and OS X users, it's a good thing.
> 
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.

On Mageia Linux 1, I had the following results:
- Python 2.7: 0.7x as fast
- Python 3.2: 1.1x as fast
- Python 3.3: 1.2x as fast

The -s flag didn't make a difference.

Do note that the benchmark is very fast - around 40 ms for a walk.

Regards

Antoine.




From tismer at stackless.com  Thu Nov 22 21:23:14 2012
From: tismer at stackless.com (Christian Tismer)
Date: Thu, 22 Nov 2012 21:23:14 +0100
Subject: [Python-ideas] python_modules as default directory for
 dependencies in distutils
In-Reply-To: <CADiSq7d1sCytBav-dehkaPsVo--nYrbBCpPfkMkiLDxP6j6AMQ@mail.gmail.com>
References: <50ABCA0C.50001@phihag.de>
	<CADiSq7cKuL+VXW8fTET9LYoQvKWkHvZ_=73EWG8j3wSEu9+fwg@mail.gmail.com>
	<50ACA401.2080303@phihag.de>
	<CADiSq7d1sCytBav-dehkaPsVo--nYrbBCpPfkMkiLDxP6j6AMQ@mail.gmail.com>
Message-ID: <50AE89B2.8080100@stackless.com>

On 11/21/12 2:38 PM, Nick Coghlan wrote:
> On Wed, Nov 21, 2012 at 7:50 PM, Philipp Hagemeister <phihag at phihag.de 
> <mailto:phihag at phihag.de>> wrote:
>
>     On 11/21/2012 04:27 AM, Nick Coghlan wrote:
>     > Or install them all in a single directory, add a __main__.py
>     file to that
>     > directory and then just pass that directory name on the command line
>     > instead of a script name. The directory will be added as
>     sys.path[0] and
>     > the __main__.py file will be executed as the main module (If your
>     > additional application dependencies are all pure Python files,
>     you can even
>     > zip up that directory and pass that on the command line instead).
>     I'm well-aware of that approach, but didn't apply it to dependencies,
>     and am still not sure how to. Can you describe how a hypothetical
>     helloworld application with one dependency would look like? And
>     wouldn't
>     one sacrifice the ability to seamlessly import from the application's
>     code itself.
>
>
> One directory containing:
>
>   runthis/
>        __main__.py
>            (with content as described for your main.py)
>        lxml
>        myapp
>
> Execute "python runthis" (via a +x shell script if you prefer). Note 
> the lack of -m: you're executing the directory contents, not a 
> package. You can also bundle it all into a zip file, but that only 
> works if you don't need C extension support (since zipimport can't 
> handle the necessary step of extracting the shared libraries out to 
> separate files so the OS can load them.)

Hi Nick,

This would actually be very nice if we could go this far! ;-)
Maybe with some Ramdisk support or something.
A problem might be to handle RPath issues on the fly.
I actually learnt about this when working on the pyside setup;
I was not aware of the problem before.

Do you think it would make sense for me to put time into this?

cheers - chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?   http://www.stackless.com/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121122/63be05f7/attachment.html>

From benhoyt at gmail.com  Thu Nov 22 21:23:51 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 09:23:51 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <20121122204332.2fcb6f66@pitrou.net>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<20121122204332.2fcb6f66@pitrou.net>
Message-ID: <CAL9jXCHvuaEPPriw1ZigdbiroMfbwozk1_gLKdV+87SrgB0G4g@mail.gmail.com>

> On Mageia Linux 1, I had the following results:
> - Python 2.7: 0.7x as fast
> - Python 3.2: 1.1x as fast
> - Python 3.3: 1.2x as fast
> The -s flag didn't make a difference.

Thanks. Out of interest, 64 bit or 32 bit (system and Python)?

I wonder if ctypes got significantly faster in Python 3.x, or what's
going on here. Python 3 is significantly faster in my tests too --
noticeable on Linux.

> Do note that the benchmark is very fast - around ~40 ms for a walk.

Yeah, that's over too quickly for a real test, isn't it? I'm already
creating a 230MB dir, but maybe I need to add more, smaller files.
Easy to tweak with the NUM_FILES and NUM_DIRS constants (though you'll
need to delete your current "benchtree" dir so it recreates it). Maybe
I should bump up the defaults too.

-Ben


From solipsis at pitrou.net  Thu Nov 22 21:29:03 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 22 Nov 2012 21:29:03 +0100
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<20121122204332.2fcb6f66@pitrou.net>
	<CAL9jXCHvuaEPPriw1ZigdbiroMfbwozk1_gLKdV+87SrgB0G4g@mail.gmail.com>
Message-ID: <20121122212903.469aff53@pitrou.net>

On Fri, 23 Nov 2012 09:23:51 +1300
Ben Hoyt <benhoyt at gmail.com> wrote:
> > On Mageia Linux 1, I had the following results:
> > - Python 2.7: 0.7x as fast
> > - Python 3.2: 1.1x as fast
> > - Python 3.3: 1.2x as fast
> > The -s flag didn't make a difference.
> 
> Thanks. Out of interest, 64 bit or 32 bit (system and Python)?

64 bit.

> I wonder if ctypes got significantly faster in Python 3.x, or what's
> going on here. Python 3 is significantly faster in my tests too --
> noticeable on Linux.

Since you're using a ctypes solution, it's difficult to compare against
built-in listdir() and stat() (which are written in C). So I'd suggest
either rewriting your core function in C (better), or using a ctypes
emulation of listdir() and stat() (easier?).
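
For the record, such an emulation need not be long. A rough sketch (Python 3,
assuming the 64-bit Linux glibc dirent layout; field sizes and offsets differ
on other platforms, so this is for timing comparisons only):

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

class Dirent(ctypes.Structure):
    # Assumed layout for 64-bit Linux glibc.
    _fields_ = [
        ("d_ino", ctypes.c_ulong),
        ("d_off", ctypes.c_long),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]

libc.opendir.argtypes = [ctypes.c_char_p]
libc.opendir.restype = ctypes.c_void_p
libc.readdir.argtypes = [ctypes.c_void_p]
libc.readdir.restype = ctypes.POINTER(Dirent)
libc.closedir.argtypes = [ctypes.c_void_p]

def ctypes_listdir(path):
    # Pure-ctypes stand-in for os.listdir(), to compare like with like.
    entries = []
    handle = libc.opendir(os.fsencode(path))
    if not handle:
        raise OSError(ctypes.get_errno(), "opendir failed", path)
    try:
        while True:
            entry = libc.readdir(handle)
            if not entry:          # NULL pointer means end of directory
                break
            name = os.fsdecode(entry.contents.d_name)
            if name not in (".", ".."):
                entries.append(name)
    finally:
        libc.closedir(handle)
    return entries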

Regards

Antoine.




From mwm at mired.org  Thu Nov 22 21:36:22 2012
From: mwm at mired.org (Mike Meyer)
Date: Thu, 22 Nov 2012 14:36:22 -0600
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCE5ZoTOOuFSae55NHXs80d1SkWaZsaBo9nuEN9f2tEqKg@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<CAD=7U2A5+sCBxsBgbs3Hb04U4GqkFVf=O6tZVrPRtVSQ-cHSxA@mail.gmail.com>
	<CAL9jXCE5ZoTOOuFSae55NHXs80d1SkWaZsaBo9nuEN9f2tEqKg@mail.gmail.com>
Message-ID: <CAD=7U2DNUab_d+TCvUzqqzdAe3QNDtZLRcRM4p1uLd4qPAgo9Q@mail.gmail.com>

On Thu, Nov 22, 2012 at 1:23 PM, Ben Hoyt <benhoyt at gmail.com> wrote:
>> It doesn't work on FreeBSD 9.1. Here's a quick failure in  the source
> ...
>> ('.', [u'', u'', u'erwalk.py', u'erwalk.pyc', u'p.py'], [u'ignore',
>
> Thanks. FreeBSD seemed to have the same "problem" as Mac OS X (issue
> #2). I fixed this on the GitHub repo now -- please try again.

That's not surprising. OSX started life as Mach with the BSD
personality and a FreeBSD userland.

>> I also got [a TypeError] trying to run the benchmark program on 3.2 (this
>> is python-ideas, which means things discussed here are bound for
>> Python 3):
> Yeah, sorry about that. I'd certainly tested the benchmark on Python
> 3.x, however not the directory tree creation. Again, fixed now on
> GitHub.

Both things seem to be fixed. On a zfs file system, I get

python3.2:  	 1.8x as fast
python2.7:	 1.3x as fast

There are two that worry me, though:

python3.2:    	   0.8x as fast
python 2.7:	   0.6x as fast

I get the same results on an nfs mount of a zfs file system (the
remote fs should not matter) and a memory-backed file system
(typically used for /tmp). I had to hunt for a disk-based fs to get the
first set of results :-(.

I suspect that neither of these have d_type on the fs, so we're seeing
a serious performance hit for systems that don't have d_type. That
certainly bears further investigation. Could it just be the
python/ctype implementation vs. native code?

      <mike


From benhoyt at gmail.com  Thu Nov 22 22:34:42 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 10:34:42 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAD=7U2DNUab_d+TCvUzqqzdAe3QNDtZLRcRM4p1uLd4qPAgo9Q@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<CAD=7U2A5+sCBxsBgbs3Hb04U4GqkFVf=O6tZVrPRtVSQ-cHSxA@mail.gmail.com>
	<CAL9jXCE5ZoTOOuFSae55NHXs80d1SkWaZsaBo9nuEN9f2tEqKg@mail.gmail.com>
	<CAD=7U2DNUab_d+TCvUzqqzdAe3QNDtZLRcRM4p1uLd4qPAgo9Q@mail.gmail.com>
Message-ID: <CAL9jXCFkiGBWYHTqPNBWCtpML+n463K9uGdjgCGG9_itcy2+dg@mail.gmail.com>

> There are two that worry me, though:
>
> python3.2:         0.8x as fast
> python 2.7:        0.6x as fast
>
> I get the same results on an nfs mount of a zfs file system (the
> remote fs should not matter) and an memory backed file system
> (typically used for /tmp). I had hunt for a disk-based fs to get the
> first set of results :-(.
>
> I suspect that neither of these have d_type on the fs, so we're seeing
> a serious performance hit for systems that don't have d_type. That
> certainly bears further investigation. Could it just be the
> python/ctype implementation vs. native code?

The fallback when d_type isn't present (or returns DT_UNKNOWN) is to
call os.stat() anyway, which is almost exactly what the standard
os.walk() does. So yes, the slowdown here is almost certainly due to
my pure Python ctypes implementation vs os.listdir()'s C version.
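
For anyone following along, the logic is roughly this (a sketch of the idea,
not BetterWalk's actual code; the DT_* values are from <dirent.h>):

import os
import stat

DT_UNKNOWN, DT_DIR = 0, 4   # DT_UNKNOWN: the filesystem didn't report a type

def entry_is_dir(dirpath, name, d_type):
    # Trust d_type when the filesystem provides it; otherwise fall back to
    # an explicit stat(), which is essentially what plain os.walk() does.
    if d_type != DT_UNKNOWN:
        return d_type == DT_DIR
    return stat.S_ISDIR(os.stat(os.path.join(dirpath, name)).st_mode)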

Antoine's suggestion is a good one: rewriting iterdir_stat() in C or
using a ctypes emulation of listdir. Thanks! I'll see what I get time
for.

-Ben


From stefan at drees.name  Thu Nov 22 23:17:59 2012
From: stefan at drees.name (Stefan Drees)
Date: Thu, 22 Nov 2012 23:17:59 +0100
Subject: [Python-ideas] BetterWalk,
 a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCHvuaEPPriw1ZigdbiroMfbwozk1_gLKdV+87SrgB0G4g@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<20121122204332.2fcb6f66@pitrou.net>
	<CAL9jXCHvuaEPPriw1ZigdbiroMfbwozk1_gLKdV+87SrgB0G4g@mail.gmail.com>
Message-ID: <50AEA497.6040409@drees.name>

Hi Ben,

On 22.11.12 21:23, Ben Hoyt wrote:
>> On Mageia Linux 1, I had the following results:
>> - Python 2.7: 0.7x as fast
>> - Python 3.2: 1.1x as fast
>> - Python 3.3: 1.2x as fast
>> The -s flag didn't make a difference.
>
> Thanks. Out of interest, 64 bit or 32 bit (system and Python)?
>
> I wonder if ctypes got significantly faster in Python 3.x, or what's
> going on here. Python 3 is significantly faster in my tests too --
> noticeable on Linux.
>
>> Do note that the benchmark is very fast - around ~40 ms for a walk.
>
> Yeah, that's over too quickly for a real test, isn't it? I'm already
> creating a 230MB dir, but maybe I need to add more, smaller files.
> Easy to tweak with the NUM_FILES and NUM_DIRS constants (though you'll
> need to delete your current "benchtree" dir so it recreates it. Maybe
> I should bump up the defaults too.
>

thanks for providing a fix. Now using revision  f975b2a5... with fixed 
tree creation on a Mac BookPro (8 GB RAM) and OS X 10.8.2 no real-time 
virus scanning during testrun ;-) walking a solid-state disk:

Oh, and I varied a bit the constants ...

With Python 3.3.0:

  + depth=4, num_dirs=5, num_files=50
    os.walk took 0.122s, BetterWalk took 0.076s -- 1.6x as fast

  + depth=4, num_dirs=5, num_files=100
    os.walk took 0.142s, BetterWalk took 0.098s -- 1.4x as fast

  + depth=4, num_dirs=10, num_files=50
    os.walk took 0.840s, BetterWalk took 0.634s -- 1.3x as fast

  + depth=5, num_dirs=5, num_files=50
    os.walk took 0.617s, BetterWalk took 0.446s -- 1.4x as fast

With Python 2.7.3:

  + depth=4, num_dirs=5, num_files=50
    os.walk took 0.060s, BetterWalk took 0.059s -- 1.0x as fast

  + depth=4, num_dirs=5, num_files=100
    os.walk took 0.121s, BetterWalk took 0.136s -- 0.9x as fast

  + depth=4, num_dirs=10, num_files=50
    os.walk took 0.658s, BetterWalk took 0.664s -- 1.0x as fast

  + depth=5, num_dirs=5, num_files=50
    os.walk took 0.473s, BetterWalk took 0.506s -- 0.9x as fast

All the best,
Stefan.


From stefan at drees.name  Thu Nov 22 23:27:19 2012
From: stefan at drees.name (Stefan Drees)
Date: Thu, 22 Nov 2012 23:27:19 +0100
Subject: [Python-ideas] BetterWalk,
 a better and faster os.walk() for Python
In-Reply-To: <50AEA497.6040409@drees.name>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<20121122204332.2fcb6f66@pitrou.net>
	<CAL9jXCHvuaEPPriw1ZigdbiroMfbwozk1_gLKdV+87SrgB0G4g@mail.gmail.com>
	<50AEA497.6040409@drees.name>
Message-ID: <50AEA6C7.4000902@drees.name>

Maybe of interest: 64-bit Intel-based OS X system, and the filesystem is
HFS, local and journaled, on a fully encrypted account.

Am 22.11.12 23:17, schrieb Stefan Drees:
> Hi Ben,
>
> ...
> With Python 3.3.0:
>
>   + depth=4, num_dirs=5, num_files=50
>     os.walk took 0.122s, BetterWalk took 0.076s -- 1.6x as fast
>
>   + depth=4, num_dirs=5, num_files=100
>     os.walk took 0.142s, BetterWalk took 0.098s -- 1.4x as fast
>
>   + depth=4, num_dirs=10, num_files=50
>     os.walk took 0.840s, BetterWalk took 0.634s -- 1.3x as fast
>
>   + depth=5, num_dirs=5, num_files=50
>     os.walk took 0.617s, BetterWalk took 0.446s -- 1.4x as fast
>
> With Python 2.7.3:
>
>   + depth=4, num_dirs=5, num_files=50
>     os.walk took 0.060s, BetterWalk took 0.059s -- 1.0x as fast
>
>   + depth=4, num_dirs=5, num_files=100
>     os.walk took 0.121s, BetterWalk took 0.136s -- 0.9x as fast
>
>   + depth=4, num_dirs=10, num_files=50
>     os.walk took 0.658s, BetterWalk took 0.664s -- 1.0x as fast
>
>   + depth=5, num_dirs=5, num_files=50
>     os.walk took 0.473s, BetterWalk took 0.506s -- 0.9x as fast
>
> ...


From joshua.landau.ws at gmail.com  Thu Nov 22 23:33:49 2012
From: joshua.landau.ws at gmail.com (Joshua Landau)
Date: Thu, 22 Nov 2012 22:33:49 +0000
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <1353542767.51180.YahooMailRC@web184705.mail.ne1.yahoo.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info> <50A6E47F.3030304@pearwood.info>
	<1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAN1F8qWD0watc8d=QsV4dgYxr2U7Wss3dyT4RJ4a4Yocz2UWPw@mail.gmail.com>
	<1353542767.51180.YahooMailRC@web184705.mail.ne1.yahoo.com>
Message-ID: <CAN1F8qWDERu=xeOqB+NF9_twQ40uCn8mf4jXj+Q8=_Wab74Z-g@mail.gmail.com>

On 22 November 2012 00:06, Andrew Barnert <abarnert at yahoo.com> wrote:

> From: Joshua Landau <joshua.landau.ws at gmail.com>
> Sent: Sat, November 17, 2012 11:38:22 AM
>
> >Surely the best choice is two have *two* caches; one for hashables and
> another
> >for the rest.
>
> Your implementation does a try: hash() to decide whether to check the set
> or the
> list, instead of just doing a try: item in set except: item in list. Is
> there a
> reason for this? It's more complicated, and it's measurably slower.


I did not realise that "[] in set()" raised an error! I'd just assumed it
returned False.

Thank you, this does make a small but significant difference.
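
For the record, a minimal sketch of the two-cache idea with that fix applied
(illustrative only, not the exact code we've been benchmarking):

def uniquify(iterable):
    # Preserve order while dropping duplicates: hashable items go in a set,
    # unhashable ones fall back to a (linear-scan) list.
    seen_hashable = set()
    seen_other = []
    result = []
    for item in iterable:
        try:
            if item in seen_hashable:
                continue
            seen_hashable.add(item)
        except TypeError:   # unhashable: "in set" raises, not returns False
            if item in seen_other:
                continue
            seen_other.append(item)
        result.append(item)
    return result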


>  >This might be improvable with a *third* chache if some non-hashables had
> total
> >ordering, but that could also introduce bugs I think. It'd also be a lot
> harder
> >and likely be slower anyway.
>
> I agree that it's probably not worth adding to something in the standard
> library, or a recipe given in the documentation (in fact, I think I
> already said
> as much earlier in the thread), but I think you've got most of those facts
> wrong.
>
> It's not a lot harder. The same 4 lines you have to add to do a
> try-set-except-list, you just do again, so it's
> try-set-except-try-sortedlist-except-list.


Well, I'd sort-of assumed that this included adding a sorted collection to
the mix, as it isn't in the standard library.


> And it won't introduce any bugs.


This took me a while to prove, so I'm proud of this:

>>> from blist import sortedlist
>>> {2} in sortedlist([{1, 2}, {1, 3}, {2}])
False

You *cannot* assume that a data set has total ordering on the basis that
it's working so far: sets are only partially ordered, so the sorted list's
binary search can walk straight past an element that really is present.


> And
> as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M
> unique,
> which is obviously better, but probably slower for tiny M, and another
> 5-10%
> overhead for inappropriate values.
>

Well yes... bar the fact that you may be using it on something with a
non-even distribution of "things" where some types are not comparable to
each other:

[ {1, 2}, [3, 4], [1, 2], [7, 4], [2, 3], (5, 2), [2, 1] ... ]

Where you'll get nowhere near O(NlogM).

*And* then there's the fact that sorted collections have intrinsically more
overhead, and so are likely to give large overhead.

> The problem is finding an appropriate sortedcollection type. There isn't
> one in
> the standard library. There's a link to an external SortedCollection
> reference
> in the bisect docs page, but that has O(N) insertion time, so it won't
> help. The
> most popular library I know of is blist.sortedlist, and that works, but it
> has
> quite a bit of overhead for searching tiny lists. As I understand it, the
> reason
> there isn't a standard sorted collection is the assumption that people
> dealing
> with huge sequences ought to expect  to have some searching, comparing, and
> profiling of algorithms in their  future, while those people dealing with
> len<10
> sequences shouldn't have to think at all.
>
> At any rate, I tried a few different sorted collections. The performance
> for
> single-digit M was anywhere from 2.9x slower to 38x slower (8x with
> blist); the
> crossover was around M=100, and you get 10x faster by around M=100K.
> Deciding
> whether this is appropriate, and which implementation to use, and so on?
> well,
> that's exactly why there's no sorted list in the stdlib in the first place.


Thank you for the numbers. May I ask what libraries you used?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121122/bf9bd3cd/attachment.html>

From abarnert at yahoo.com  Thu Nov 22 23:33:14 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 14:33:14 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>

From: Ben Hoyt <benhoyt at gmail.com>
Sent: Thu, November 22, 2012 3:40:00 AM

> In any case, I'd love it if folks could run the benchmark on  their
> system (with and without -s) and comment further on the idea and  API.


I tested on OS X 10.8.2 on a Retina MBP 15" with 16GB and the stock SSD, using 
Apple 2.6 and 2.7 and python.org 3.3. It seems to be a bit slower in 2.x, a bit 
faster in 3.x, more so in 32-bit mode, and better without -s. The best result I 
got anywhere was 1.5x (3.3, 32-bit, no -s), but repeating that test gave 
anywhere from 1.2x to 1.5x.

Here are the last test runs for each run:

Retina:betterwalk abarnert$ python2.6 benchmark.py
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk took 0.057s, BetterWalk took 0.061s -- 0.9x as fast

Retina:betterwalk abarnert$ python2.7 benchmark.py 
Creating tree at benchtree: depth=4, num_dirs=5, num_files=50
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk took 0.059s, BetterWalk took 0.066s -- 0.9x as fast

Retina:betterwalk abarnert$ python3.3 benchmark.py 
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk took 0.074s, BetterWalk took 0.058s -- 1.3x as fast

Retina:betterwalk abarnert$ python2.6 benchmark.py -s
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk size 226395000, BetterWalk size 226395000 -- equal
os.walk took 0.097s, BetterWalk took 0.104s -- 0.9x as fast

Retina:betterwalk abarnert$ python2.7 benchmark.py -s
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk size 226395000, BetterWalk size 226395000 -- equal
os.walk took 0.100s, BetterWalk took 0.109s -- 0.9x as fast

Retina:betterwalk abarnert$ python3.3 benchmark.py -s
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk size 226395000, BetterWalk size 226395000 -- equal
os.walk took 0.121s, BetterWalk took 0.099s -- 1.2x as fast

Retina:betterwalk abarnert$ python3.3-32 benchmark.py
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk took 0.073s, BetterWalk took 0.048s -- 1.5x as fast

Retina:betterwalk abarnert$ python3.3-32 benchmark.py -s
Priming the system's cache...
Benchmarking walks on benchtree, repeat 1/3...
Benchmarking walks on benchtree, repeat 2/3...
Benchmarking walks on benchtree, repeat 3/3...
os.walk size 226395000, BetterWalk size 226395000 -- equal
os.walk took 0.129s, BetterWalk took 0.100s -- 1.3x as fast


From benhoyt at gmail.com  Thu Nov 22 23:43:59 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 11:43:59 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
Message-ID: <CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>

> I tested on OS X 10.8.2 on a Retina MBP 15" with 16GB and the stock SSD, using
> Apple 2.6 and 2.7 and python.org 3.3. It seems to be a bit slower in 2.x, a bit
> faster in 3.x, more so in 32-bit mode, and better without -s. The best result I
> got anywhere was 1.5x (3.3, 32-bit, no -s), but repeating that test gave
> anywhere from 1.2x to 1.5x.

Yeah, that's about what I'm seeing on Linux and OS X. (Though for some
weird reason I'm seeing 10x as fast on OS X when I do "python
benchmark.py /usr" -- hence my comments in the README.)

Anyway, thanks for the benchmarks, guys. I'm satisfied with the
Windows results. But I'll need to rewrite the Linux version in C (to
match os.listdir) before these results are really meaningful.

-Ben


From benhoyt at gmail.com  Thu Nov 22 23:44:47 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 23 Nov 2012 11:44:47 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
Message-ID: <CAL9jXCFtePZM38vibwjRAFMCO22iwBuN-WBe7CgOJEUKnZ2vxw@mail.gmail.com>

> Anyway, thanks for the benchmarks, guys. I'm satisfied with the
> Windows results. But I'll need to rewrite the Linux version in C (to
> match os.listdir) before these results are really meaningful.

In the meantime, anyone who wants to comment on the iterdir_stat() API
or other issues, go ahead!

-Ben


From abarnert at yahoo.com  Fri Nov 23 00:57:33 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 15:57:33 -0800 (PST)
Subject: [Python-ideas] Uniquify attribute for lists
In-Reply-To: <CAN1F8qWDERu=xeOqB+NF9_twQ40uCn8mf4jXj+Q8=_Wab74Z-g@mail.gmail.com>
References: <CACWWysfAAGyWOUQcQzcp5UCLO0gy7S1uchorTgS9mFTRR7V5kg@mail.gmail.com>
	<50A68424.9000509@pearwood.info> <50A6E47F.3030304@pearwood.info>
	<1353122279.34236.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAN1F8qWD0watc8d=QsV4dgYxr2U7Wss3dyT4RJ4a4Yocz2UWPw@mail.gmail.com>
	<1353542767.51180.YahooMailRC@web184705.mail.ne1.yahoo.com>
	<CAN1F8qWDERu=xeOqB+NF9_twQ40uCn8mf4jXj+Q8=_Wab74Z-g@mail.gmail.com>
Message-ID: <1353628653.5937.YahooMailRC@web184704.mail.ne1.yahoo.com>

From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Thu, November 22, 2012 2:34:30 PM
>
>On 22 November 2012 00:06, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>From: Joshua Landau <joshua.landau.ws at gmail.com>
>>Sent: Sat, November 17, 2012 11:38:22 AM
>>
>>
>>>Surely the best choice is two have *two* caches; one for hashables and 
another
>>>for the rest.
>>
>>Your implementation does a try: hash() to decide whether to check the set or 
>the
>>list, instead of just doing a try: item in set except: item in list. Is there 
a
>>reason for this? It's more complicated, and it's measurably slower.

> I did not realise that "[] in set()" raised an error! I'd just assumed it 
> returned False.

I just realized that this doesn't seem to be documented anywhere. It's obvious 
that set.add would have to raise for a non-hashable, but x in set could be 
implemented as either raising or returning False without violating any of the 
requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere 
else that I can see?

I did a quick test with PyPy and Jython built-in sets, the old sets module, and 
the Python recipe that used to be linked for pre-2.3 compat, and they all do the 
same thing as CPython. (The pure Python versions are all just implemented as a 
dict where all the values are True.) But that still doesn't necessarily 
guarantee that it's safe in all possible future implementations?

Maybe the documentation should be updated to guarantee this: it's a useful thing
to rely on, all current implementations provide it, and it's hard to think of a 
good reason why breaking it could improve performance or implementation 
simplicity.

> Well, I'd sort-of assumed that this included adding  sorted collection to the
> mix, as it isn't in the standard library.

Yes, as I said later, that's the biggest reason not to consider it as a general 
solution.
 
> You *cannot* assume that a data set has total ordering on the basis that it's 
> working so far.

You're right. I was thinking that a sorted collection should reject adding 
elements that aren't totally ordered with the existing elements, but now that I
think about it, there's obviously no way to do that in O(log N) time.
 
>> And
>>as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
>>which is obviously better, but probably slower for tiny M, and another 5-10%
>>overhead for inappropriate values.
>>
>
> Well yes... bar the fact that you may be using it on something with a non-even 

> distribution of "things" where some types are not comparable to each-other:

I didn't speak very precisely here, because it's hard to be concise, but the 
total performance is O(A) + O(BlogM) + O(CN), where A is the number of hashable 
elements, B is the number of non-hashable but sortable elements that are 
comparable to the first non-hashable but sortable element, M is the number of 
unique elements within B, C is the number of elements that are neither hashable 
nor comparable with the elements of B, and N is the number of unique elements 
within C.

The point is that, if a significant subset of the elements are in B, this will 
be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is 
better than O(CN) if a significant subset of the elements are in A, otherwise 
the same. So, it's an improvement in the same way that adding the set is an 
improvement.

> *And* then there's the fact that sorted collections have intrinsically more 
> overhead, and so are likely to give large overhead.

I mentioned that later (and you commented on it). When M is very small 
(especially if B is also very large), there's a substantial added cost. Of 
course the same is true for the set, but the crossover for the set happens 
somewhere between 2-10 unique elements instead of 100, and the cost below that 
crossover is much smaller.

>>At any rate, I tried a few different sorted collections. The performance for
>>single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); 
the
>>crossover was around M=100, and you get 10x faster by around M=100K. Deciding
>>whether this is appropriate, and which implementation to use, and so on? well,
>>that's exactly why there's no sorted list in the stdlib in the first place.

> Thank you for the numbers. May I ask what libraries you used?

* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via jython): red-black tree
* java.util.concurrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list

Some of these are hacked-up implementations that only handle just enough of the 
interface I need, in some cases even faking the comparisons, and in some cases 
not even complete enough to run the real test (so I just tested the time to 
test-and-insert B values M/B values). So, it's not a scientific test or 
anything, but they were all in the expected ballpark (and the few that weren't 
turned out not to be O(log N) insert time, so I threw them out).

The most thorough tests were with blist; I think I posted the complete numbers 
before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x 
with 128; 0.1x with 128K, all with 256K total values.


From abarnert at yahoo.com  Fri Nov 23 01:05:39 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 16:05:39 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
Message-ID: <1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>

From: Ben Hoyt <benhoyt at gmail.com>
Sent: Thu, November 22, 2012 2:44:02 PM


> > I tested on OS X 10.8.2 on a Retina MBP 15" with 16GB and the stock SSD,  
>using
> > Apple 2.6 and 2.7 and python.org 3.3. It seems to be a bit slower in 2.x,  a 
>bit
> > faster in 3.x, more so in 32-bit mode, and better without -s. The  best 
>result I
> > got anywhere was 1.5x (3.3, 32-bit, no -s), but repeating  that test gave
> > anywhere from 1.2x to 1.5x.
> 
> Yeah, that's about  what I'm seeing on Linux and OS X. (Though for some
> weird reason I'm seeing  10x as fast on OS X when I do "python
> benchmark.py /usr" -- hence my comments in the  README.)

I get exactly 1.0x on this test with 2.6 and 2.7, 1.3x with 3.3, 1.4x with 
32-bit 3.3. Any chance your /usr has a symlink to a remote or otherwise slow or 
non-HFS+ filesystem? Is that worth testing?

Also, the -s version seems to fail on dangling symlinks:

$ python2.7 benchmark.py -s /usr
Priming the system's cache...Benchmarking walks on /usr, repeat 1/3...
Traceback (most recent call last):
  File "benchmark.py", line 121, in <module>
    main()
  File "benchmark.py", line 118, in main
    benchmark(tree_dir, get_size=options.size)
  File "benchmark.py", line 83, in benchmark
    os_walk_time = min(os_walk_time, timeit.timeit(do_os_walk, number=1))
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
 line 228, in timeit
    return Timer(stmt, setup, timer).timeit(number)
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
 line 194, in timeit
    timing = self.inner(it, self.timer)
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
 line 100, in inner
    _func()
  File "benchmark.py", line 57, in do_os_walk
    size += os.path.getsize(fullname)
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py",
 line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: 
'/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h'


$ readlink 
/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h

../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h
$ ls 
/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h

ls: 
/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h:
 No such file or directory



From robertc at robertcollins.net  Fri Nov 23 01:26:48 2012
From: robertc at robertcollins.net (Robert Collins)
Date: Fri, 23 Nov 2012 13:26:48 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
Message-ID: <CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>

If you want to test cold cache behaviour, see /proc/sys/vm/drop_caches
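
Something along these lines, for example (Linux only, needs root; not part of
the benchmark itself):

import subprocess

# Flush dirty pages, then drop the page/dentry/inode caches so the
# next walk starts cold.
subprocess.check_call(["sync"])
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")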

-Rob

On Fri, Nov 23, 2012 at 1:05 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> From: Ben Hoyt <benhoyt at gmail.com>
> Sent: Thu, November 22, 2012 2:44:02 PM
>
>
>> > I tested on OS X 10.8.2 on a Retina MBP 15" with 16GB and the stock SSD,
>>using
>> > Apple 2.6 and 2.7 and python.org 3.3. It seems to be a bit slower in 2.x,  a
>>bit
>> > faster in 3.x, more so in 32-bit mode, and better without -s. The  best
>>result I
>> > got anywhere was 1.5x (3.3, 32-bit, no -s), but repeating  that test gave
>> > anywhere from 1.2x to 1.5x.
>>
>> Yeah, that's about  what I'm seeing on Linux and OS X. (Though for some
>> weird reason I'm seeing  10x as fast on OS X when I do "python
>> benchmark.py /usr" -- hence my comments in the  README.)
>
> I get exactly 1.0x on this test with 2.6 and 2.7, 1.3x with 3.3, 1.4x with
> 32-bit 3.3. Any chance your /usr has a symlink to a remote or otherwise slow or
> non-HFS+ filesystem? Is that worth testing?
>
> Also, the -s version seems to fail on dangling symlinks:
>
> $ python2.7 benchmark.py -s /usr
> Priming the system's cache...Benchmarking walks on /usr, repeat 1/3...
> Traceback (most recent call last):
>   File "benchmark.py", line 121, in <module>
>     main()
>   File "benchmark.py", line 118, in main
>     benchmark(tree_dir, get_size=options.size)
>   File "benchmark.py", line 83, in benchmark
>     os_walk_time = min(os_walk_time, timeit.timeit(do_os_walk, number=1))
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>  line 228, in timeit
>     return Timer(stmt, setup, timer).timeit(number)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>  line 194, in timeit
>     timing = self.inner(it, self.timer)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>  line 100, in inner
>     _func()
>   File "benchmark.py", line 57, in do_os_walk
>     size += os.path.getsize(fullname)
>   File
> "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py",
>  line 49, in getsize
>     return os.stat(filename).st_size
> OSError: [Errno 2] No such file or directory:
> '/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h'
>
>
> $ readlink
> /usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h
>
> ../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h
> $ ls
> /usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h
>
> ls:
> /usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h:
>  No such file or directory
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas



-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services


From abarnert at yahoo.com  Fri Nov 23 03:48:36 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 18:48:36 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>
Message-ID: <1353638916.20401.YahooMailRC@web184701.mail.ne1.yahoo.com>

> From: Robert Collins <robertc at robertcollins.net>
> Sent: Thu, November 22, 2012 4:26:49 PM
> 
> If you want to test cold cache behaviour, see  /proc/sys/vm/drop_caches
> 
> -Rob


On a Mac? There's no /proc filesystem on OS X; that's linux-specific.

> On Fri, Nov 23, 2012 at 1:05 PM,  Andrew Barnert <abarnert at yahoo.com> wrote:
> > From:  Ben Hoyt <benhoyt at gmail.com>
> > Sent: Thu,  November 22, 2012 2:44:02 PM
> >
> >
> >> > I tested on OS X  10.8.2 on a Retina MBP 15" with 16GB and the stock  
SSD,
> >>using
> >> > Apple 2.6 and 2.7 and python.org 3.3. It seems to be a bit slower in  
>2.x,  a
> >>bit
> >> > faster in 3.x, more so in 32-bit  mode, and better without -s. The  best
> >>result I
> >> >  got anywhere was 1.5x (3.3, 32-bit, no -s), but repeating  that test  
>gave
> >> > anywhere from 1.2x to 1.5x.
> >>
> >> Yeah,  that's about  what I'm seeing on Linux and OS X. (Though for  some
> >> weird reason I'm seeing  10x as fast on OS X when I do  "python
> >> benchmark.py /usr" -- hence my comments in  the  README.)
> >
> > I get exactly 1.0x on this test with 2.6 and  2.7, 1.3x with 3.3, 1.4x with
> > 32-bit 3.3. Any chance your /usr has a  symlink to a remote or otherwise slow 
>or
> > non-HFS+ filesystem? Is that  worth testing?
> >
> > Also, the -s version seems to fail on dangling  symlinks:
> >
> > $ python2.7 benchmark.py -s /usr
> > Priming the  system's cache...Benchmarking walks on /usr, repeat 1/3...
> > Traceback  (most recent call last):
> >   File "benchmark.py", line 121, in  <module>
> >     main()
> >   File "benchmark.py",  line 118, in main
> >     benchmark(tree_dir,  get_size=options.size)
> >   File "benchmark.py", line 83, in  benchmark
> >     os_walk_time = min(os_walk_time,  timeit.timeit(do_os_walk, number=1))
> >   File
> >  
>"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>
> >   line 228, in timeit
> >     return Timer(stmt, setup,  timer).timeit(number)
> >   File
> >  
>"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>
> >   line 194, in timeit
> >     timing = self.inner(it,  self.timer)
> >   File
> >  
>"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py",
>
> >   line 100, in inner
> >     _func()
> >   File  "benchmark.py", line 57, in do_os_walk
> >     size +=  os.path.getsize(fullname)
> >   File
> >  
>"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py",
>
> >   line 49, in getsize
> >     return  os.stat(filename).st_size
> > OSError: [Errno 2] No such file or  directory:
> >  
>'/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h'
>
> >
> >
> >  $ readlink
> >  
>/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/ppc_intrinsics.h
>
> >
> >  ../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h
> > $ ls
> >  
>/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h
>
> >
> >  ls:
> >  
>/usr/local/Cellar/gfortran/4.2.4-5666.3/lib/gcc/i686-apple-darwin11/4.2.1/include/../../../../../include/gcc/darwin/4.2/ppc_intrinsics.h:
>
> >   No such file or directory
> >
> >  _______________________________________________
> > Python-ideas mailing  list
> > Python-ideas at python.org
> >  http://mail.python.org/mailman/listinfo/python-ideas
> 
> 
> 
> -- 
> Robert Collins <rbtcollins at hp.com>
> Distinguished  Technologist
> HP Cloud Services
> 


From random832 at fastmail.us  Fri Nov 23 04:06:05 2012
From: random832 at fastmail.us (Random832)
Date: Thu, 22 Nov 2012 22:06:05 -0500
Subject: [Python-ideas] BetterWalk,
 a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <50AEE81D.5060707@fastmail.us>

Some thoughts:

I'm suspicious of your use of Windows' built-in pattern matching. There 
are a number of quirks to it you haven't accounted for... for example: 
it matches short filenames, the behavior you noted of "?" at the end of 
patterns also applies to the end of the 'filename portion' (e.g. 
foo?.txt can match foo.txt), and the behavior of patterns ending in ".*" 
or "." isn't like fnmatch.

Your find_data_to_stat function ignores the symlink flag, and so will 
indicate that a symlink is either a normal file or a directory. This is 
desirable for internal use within walk _only if_ followlinks=True. 
Meanwhile, the Linux function will simply report DT_LNK, which means 
this should really be called iterdir_lstat.

To get the benefit of Windows' directory flag, and to define the minimum 
required for os.walk(..., followlinks=True), maybe the API should be an 
iterdir_lstat with a specific option to request "isdir". When that option 
is set, it would populate st_mode with S_IFDIR, either from the win32 find 
data or, on Linux, by calling stat on anything that comes back with 
DT_LNK; when it is false, it would always populate st_mode with S_IFLNK 
for symbolic links.
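
Roughly like this, in other words (an illustrative sketch only, with made-up
names, not BetterWalk's actual API):

import os
import stat

DT_DIR, DT_LNK = 4, 10   # values from <dirent.h>

def entry_mode(dirpath, name, d_type, want_isdir=False):
    # Symlinks are reported as S_IFLNK unless the caller asked the "isdir"
    # question, in which case the link is resolved with a stat() call.
    if d_type == DT_LNK:
        if want_isdir:
            # Follows the symlink; a dangling link would raise OSError here.
            st = os.stat(os.path.join(dirpath, name))
            return stat.S_IFDIR if stat.S_ISDIR(st.st_mode) else stat.S_IFREG
        return stat.S_IFLNK
    return stat.S_IFDIR if d_type == DT_DIR else stat.S_IFREG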


From abarnert at yahoo.com  Fri Nov 23 05:42:28 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 20:42:28 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFtePZM38vibwjRAFMCO22iwBuN-WBe7CgOJEUKnZ2vxw@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<CAL9jXCFtePZM38vibwjRAFMCO22iwBuN-WBe7CgOJEUKnZ2vxw@mail.gmail.com>
Message-ID: <1353645748.87470.YahooMailRC@web184706.mail.ne1.yahoo.com>

From: Ben Hoyt <benhoyt at gmail.com>
Sent: Thu, November 22, 2012 2:44:49 PM


> In the  meantime, anyone who wants to comment on the iterdir_stat() API
> or other  issues, go ahead!


I already mentioned the problem following symlinks into nonexistent paths.

The followlinks implementation in os.walk seems wrong. If need_stat is false, 
iterdir_stat will return S_IFLNK, but then os.walk only checks for S_IFDIR, so 
it won't recurse into them. Plus, it looks like even if you got that right, 
you'd be trying to opendir every symlink, even the ones you know aren't links to 
directories.

On a related note, if you call iterdir_stat with just None or st_mode_type, 
symlinks will show up as links, but if you call with anything else, they'll show 
up as the referenced file. I think you really want a "physical" flag to control 
whether you call stat or lstat, although I'm not sure what should happen for the 
no-stat version in that case. (Maybe look at what fts and nftw do?)

It might also be handy to be able to not call stat on directories, so if you 
wanted an "iterwalk_stat", it could just call fstat(dirfd(d)) after opendir (as 
nftw does).

Your code assumes that all paths are UTF-8. That's not guaranteed for linux or 
FreeBSD (although it is for OS X); you want sys.getfilesystemencoding().
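
Something like this, for example (a sketch, not your actual code):

import sys

def decode_name(raw_name):
    # Decode a raw d_name (bytes) from readdir() with the filesystem encoding
    # rather than assuming UTF-8; surrogateescape matches how os.listdir()
    # handles undecodable bytes on Python 3.
    return raw_name.decode(sys.getfilesystemencoding(), "surrogateescape")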

Windows wildcards and fnmatch are not the same, and your check for '[' in 
pattern or pattern.endswith('?') is not sufficient to distinguish between the 
two.

The docstring should mention that fields=None returns nothing "for free" on 
other platforms.

The docstring (and likewise the comments) refers to "BSD" in general, but what 
you actually check for is "freebsd". I think OpenBSD, NetBSD, etc. will work 
with the BSD code; if not, the docs shouldn't imply that they do.

I believe cygwin can also use the BSD code. (I know they use FreeBSD's fts.c 
unmodified.)


From steve at pearwood.info  Fri Nov 23 05:52:25 2012
From: steve at pearwood.info (Steven D'Aprano)
Date: Fri, 23 Nov 2012 15:52:25 +1100
Subject: [Python-ideas] BetterWalk,
 a better and faster os.walk() for Python
In-Reply-To: <1353638916.20401.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>
	<1353638916.20401.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <50AF0109.3000906@pearwood.info>

On 23/11/12 13:48, Andrew Barnert wrote:
>> From: Robert Collins<robertc at robertcollins.net>
>> Sent: Thu, November 22, 2012 4:26:49 PM
>>
>> If you want to test cold cache behaviour, see  /proc/sys/vm/drop_caches
>>
>> -Rob
>
>
> On a Mac? There's no /proc filesystem on OS X; that's linux-specific.

I don't think that is correct. /proc is a UNIX feature, not just Linux. It
exists on Unixes such as FreeBSD, OpenBSD, NetBSD, Solaris, AIX, as well as
Unix-like Linux and QNX. Even Plan 9, which is not Unix, has /proc.

OS X is also a Unix. Since 10.5, OS X has been registered with the SUS
("Single UNIX Specification") and has received UNIX certification.



-- 
Steven


From abarnert at yahoo.com  Fri Nov 23 06:59:40 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Thu, 22 Nov 2012 21:59:40 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <50AF0109.3000906@pearwood.info>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>
	<1353638916.20401.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50AF0109.3000906@pearwood.info>
Message-ID: <1353650380.46106.YahooMailRC@web184705.mail.ne1.yahoo.com>

From: Steven D'Aprano <steve at pearwood.info>
Sent: Thu, November 22, 2012 8:52:56 PM


> 
> On 23/11/12 13:48, Andrew Barnert wrote:
> >> From: Robert Collins<robertc at robertcollins.net>
> >>  Sent: Thu, November 22, 2012 4:26:49 PM
> >>
> >> If you want to  test cold cache behaviour, see   /proc/sys/vm/drop_caches
> >>
> >> -Rob
> >
> >
> > On  a Mac? There's no /proc filesystem on OS X; that's linux-specific.
> 
> I  don't think that is correct. /proc is a UNIX feature, not just Linux. 

To make it more clear:

The existence of /proc is a non-standardized feature that some, but not all, 
UNIXes have; OS X is one of those that does not.

Almost all UNIX and UNIX-like systems that do have /proc have a directory per 
process, with read-only binary information about those processes, and nothing 
else.

The idea of writing to magic text files under /proc/sys to control the OS is 
entirely specific to linux.

See https://blogs.oracle.com/eschrock/entry/the_power_of_proc for the Solaris 
perspective, 
and http://lists.freebsd.org/pipermail/freebsd-fs/2011-February/010760.html for 
the FreeBSD perspective, to get an idea of how unique linux /proc is, and why 
it's likely to stay that way.

> OS X is also a Unix. Since 10.5, OS X has been registered with the  SUS
> ("Single UNIX Specification") and has received UNIX  certification.

Yes, and SUS/POSIX/OpenGroup doesn't specify /proc/sys, or even basic /proc, or 
epoll or /dev/cdrom or GNOME or half the other things you have on your Linux 
box. OS X is a UNIX system, not a Linux system.


From stefan_ml at behnel.de  Fri Nov 23 08:05:59 2012
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 23 Nov 2012 08:05:59 +0100
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
Message-ID: <k8n78l$6iu$1@ger.gmane.org>

Ben Hoyt, 22.11.2012 23:43:
> I'll need to rewrite the Linux version in C (to
> match os.listdir) before these results are really meaningful.

Start by dropping your code into Cython for now, that allows you to call
the C function directly. Should be a quicker way than rewriting parts of it
in C.

Stefan




From mwm at mired.org  Fri Nov 23 15:32:52 2012
From: mwm at mired.org (Mike Meyer)
Date: Fri, 23 Nov 2012 08:32:52 -0600
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <50AF0109.3000906@pearwood.info>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353623594.34875.YahooMailRC@web184702.mail.ne1.yahoo.com>
	<CAL9jXCF6PBYU-TaEqfNB2f2S=KdL2hh+eHznCo_NCGR2rXM0UQ@mail.gmail.com>
	<1353629139.13694.YahooMailRC@web184704.mail.ne1.yahoo.com>
	<CAJ3HoZ3sNY25iTpLTxcEUUG-Q5sruBRTKi31g+VOrXcXLLturg@mail.gmail.com>
	<1353638916.20401.YahooMailRC@web184701.mail.ne1.yahoo.com>
	<50AF0109.3000906@pearwood.info>
Message-ID: <b5fe0977-e97b-4d78-902a-32b1abcb1c47@email.android.com>



Steven D'Aprano <steve at pearwood.info> wrote:
>On 23/11/12 13:48, Andrew Barnert wrote:
>>> From: Robert Collins<robertc at robertcollins.net>
>>> Sent: Thu, November 22, 2012 4:26:49 PM
>>> If you want to test cold cache behaviour, see 
>/proc/sys/vm/drop_caches
>> On a Mac? There's no /proc filesystem on OS X; that's linux-specific.
>I don't think that is correct. /proc is a UNIX feature, not just Linux.
>It
>exists on Unixes such as FreeBSD, OpenBSD, NetBSD, Solaris, AIX, as
>well as
>Unix-like Linux and QNX. Even Plan 9, which is not Unix, has /proc.

I believe /proc came from Plan 9. However, it's not a standard Unix feature. Last time I checked it was optional on FreeBSD and disabled by default. The FreeBSD version is also different from the Linux version, sufficiently so that there's a second /proc for the Linux emulation layer.  And like Andrew, I find no /proc on my Macs. 

>OS X is also a Unix. Since 10.5, OS X has been registered with the SUS
>("Single UNIX Specification") and has received UNIX certification.

Unlike the free unices. Makes it a good platform for developing Unix software on.
-- 
Sent from my Android tablet with K-9 Mail. Please excuse my swyping.


From phlogistonjohn at asynchrono.us  Fri Nov 23 17:08:03 2012
From: phlogistonjohn at asynchrono.us (John Mulligan)
Date: Fri, 23 Nov 2012 11:08:03 -0500
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
Message-ID: <1360817.kTgMK23t3p@giza>

Hi, I've been idly watching python-ideas and this thread piqued my 
interest, so I'm unlurking for the first time.

I'm really happy that someone's looking into this.  I've done some 
similar work for my day job and have some thoughts about the APIs and 
approach. 

I come at this from a C & Linux POV and wrote some similar wrappers to 
your iterdir_stat. What I do similarly is to provide a "flags" field 
like your "fields" argument (in my case a bitmask) that controls what 
extra information is returned/yielded from the call. What I do 
differently is that I (a) always return the d_type value instead of 
falling back to stat'ing the item (b) do not provide the pattern 
argument. 

I like returning the d_type directly because in the unix style APIs the 
dirent structure doesn't provide the same stuff as the stat result and I 
don't want to trick myself into thinking I have all the information 
available from the readdir call. I also like to have my Python functions 
map pretty closely to the C calls. I know that my Python is only issuing 
the same syscalls that the equivalent C code would. In addition, I like 
control over error handling that calling stat as a separate call gives 
me. For example there is a potential race condition between calling the 
readdir and the stat, like if the object is removed between calls. I can 
be very granular (for lack of a better word) about my error handling in 
these cases.
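
A rough sketch of the kind of per-entry handling I mean, with plain
os.listdir standing in for the readdir wrapper:

    import os

    def stats_for(path):
        for name in os.listdir(path):     # stand-in for the readdir wrapper
            try:
                yield name, os.lstat(os.path.join(path, name))
            except OSError:
                # entry vanished between the listing and the stat: skip it,
                # log it, or re-raise -- whatever this particular caller needs
                continue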

Because I come at this from a Linux platform I am also not so keen on 
the built-in pattern matching that the FindFirst/FindNext Windows API 
provides for "free". It just feels like this should be provided at a 
higher layer. But I can't say for sure because I don't know how much of 
a performance boost this is on Windows.

I have a confession to make: I don't often use an os.walk equivalent 
when I use my library. I often call the listdir equivalents directly. So 
I've never benchmarked any os.walk equivalent even though I wrote one 
for fun!

In addition I have a fditerdir call that supports a directory file 
descriptor as the first argument. This is handy because I also have a 
wrapper for fstatat (this was all created for Python 2 and before 3.3 
was released).

I really like that your library can get more fields from the direntry; 
I only support the d_type field at this time and have been meaning to 
extend the API. I can only yield tuples at the moment but a namedtuple 
style would be much nicer. IMO, I think the ideal value 
would be some sort of abstract direntry structure that could be filled 
in with the values that readdir or FindFirst provide and then possibly 
provide a higher level function that combines iterdir + stat if you get 
DT_UNKNOWN.  In other words, provide an easy call like iterdir_stat that 
builds on an iterdir that gets the detailed dentry data.  
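
Very roughly, something shaped like this (the iterdir argument and its
.d_type entries are hypothetical, just to show the idea):

    import os

    DT_UNKNOWN = 0   # the <dirent.h> value, repeated here only for illustration

    def iterdir_stat(path, iterdir):
        # iterdir is assumed to yield entries with .name and .d_type attributes
        for entry in iterdir(path):
            st = None
            if entry.d_type == DT_UNKNOWN:
                # the filesystem couldn't say what this is; fall back to stat
                st = os.lstat(os.path.join(path, entry.name))
            yield entry.name, entry.d_type, st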

PS. 
If anyone is curious my library is available here: 
https://bitbucket.org/nasuni/fsnix 


Thanks!
-- John M.


On Friday, November 23, 2012 12:39:42 AM Ben Hoyt wrote:
> In the recent thread I started called "Speed up os.walk()..." [1] I
> was encouraged to create a module to flesh out the idea, so I present
> you with BetterWalk:
> 
> https://github.com/benhoyt/betterwalk#readme
> 
> It's basically all there, and works on Windows, Linux, and Mac OS X.
> It probably works on FreeBSD too, but I haven't tested that. I also
> haven't written thorough unit tests yet, but intend to after some
> further feedback.
> 
> In terms of the API for iterdir_stat(), I settled on the more explicit
> "pass in what stat fields you want" (the 'fields' parameter). I also
> added a 'pattern' parameter to allow you to make use of the wildcard
> matching that FindFirst/FindNext provide (it's useful for globbing on
> POSIX too, but not a performance improvement).
> 
> As for benchmarks, it's about what I saw earlier on Windows (2-6x on
> recent versions, depending). My initial tests on Mac OS X show it's
> 5-10x as fast on that platform! I haven't double-checked those results
> yet though.
> 
> The results on Linux were somewhat disappointing -- only a 10% speed
> improvement on large directories, and it's actually slower on small
> directories. It's still doing half the number of system calls ... so I
> believe this is because cached os.stat() is super fast on Linux, and
> so the slowdown from using ctypes / pure Python is outweighing the
> gain from not doing the system call. That said, I've also only tested
> Linux in a VirtualBox setup, so maybe that's affecting it too.
> 
> Still, if it's a significant win for Windows and OS X users, it's a
> good thing.
> 
> In any case, I'd love it if folks could run the benchmark on their
> system (with and without -s) and comment further on the idea and API.
> 
> Thanks,
> Ben.
> 
> [1]
> http://mail.python.org/pipermail/python-ideas/2012-November/017770.html
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas


From abarnert at yahoo.com  Sun Nov 25 00:27:09 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Sat, 24 Nov 2012 15:27:09 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1360817.kTgMK23t3p@giza>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
Message-ID: <1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>

First, another thought on the whole thing:

Wrapping readdir is useful. Replacing os.walk is also useful. But they don't 
necessarily have to be tied together at all.

In particular, instead of trying to write an iterdir_stat that can properly 
support os.walk on all platforms, why not just implement os.walk differently on 
platforms where iterdir_stat can't support it? (In fact, I think an os.walk 
replacement based on the fts API, which never uses iterdir_stat, would be the 
best answer, but let me put more thought into that idea...)

Anyway, comments:

From: John Mulligan <phlogistonjohn at asynchrono.us>
Sent: Fri, November 23, 2012 8:13:22 AM


> I like returning the d_type directly because in  the unix style APIs the 
> dirent structure doesn't provide the same stuff as  the stat result and I 
> don't want to trick myself into thinking I have all  the information 
> available from the readdir call. I also like to have my  Python functions 
> map pretty closely to the C calls.

Of course that means that implementing the same interface on Windows means 
faking d_type from the stat result, and making the functions map less closely to 
the C calls?

> In addition I have a fditerdir call that supports a directory file 
> descriptor as the first argument. This is handy because I also have a 
> wrapper for fstatat (this was all created for Python 2 and before 3.3 
> was released).

This can only be implemented on platforms that support the *at functions. I 
believe that means just linux and OpenBSD right now, other *BSD (including OS X) 
at some unspecified point in the future. Putting something like that in the 
stdlib would probably require also adding another function like os_supports_at 
(similar to supports_fd, supports_dirfd, etc.), but that's not a big deal.



From mwm at mired.org  Sun Nov 25 01:56:14 2012
From: mwm at mired.org (Mike Meyer)
Date: Sat, 24 Nov 2012 18:56:14 -0600
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <CAD=7U2A4PLazTR1tNT9PsW+Dn8WW5MOM+iUvZ0UVn__Dtfo8mg@mail.gmail.com>

On Sat, Nov 24, 2012 at 5:27 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> (In fact, I think an os.walk replacement based on the fts API, which
> never uses iterdir_stat, would be the best answer, but let me put
> more thought into that idea...)

A couple of us have looked into this, and there's an impedance
mismatch between the os.walk API and the fts API, in that fts doesn't
provide the d_type information, so you're forced to make the stat
calls that this rewrite was trying to avoid.

If you're thinking about providing an API that looks more like fts,
that might be a better answer - if you design it so it also works on
Windows.

	<mike


From ncoghlan at gmail.com  Sun Nov 25 04:47:05 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 25 Nov 2012 13:47:05 +1000
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <CADiSq7fhC=eYnSyEugOC7J0vGJsnEMU1F4hj9mDmv4P7nrF0bw@mail.gmail.com>

On Sun, Nov 25, 2012 at 9:27 AM, Andrew Barnert <abarnert at yahoo.com> wrote:

> This can only be implemented on platforms that support the *at functions. I
> believe that means just linux and OpenBSD right now, other *BSD (including
> OS X)
> at some unspecified point in the future. Putting something like that in the
> stdlib would probably require also adding another function like
> os_supports_at
> (similar to supports_fd, supports_dirfd, etc.), but that's not a big deal.
>

FWIW, if "supports_dirfd" is non-empty, you can be pretty sure that the
underlying OS supports the *at APIs, as that's how the dirfd argument gets
used by the affected functions.
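
(With the 3.3 spelling, os.supports_dir_fd, the check is just:)

    import os

    if os.stat in os.supports_dir_fd:
        ...  # os.stat accepts dir_fd here, i.e. the underlying fstatat exists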

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121125/990d2161/attachment.html>

From abarnert at yahoo.com  Sun Nov 25 04:55:51 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Sat, 24 Nov 2012 19:55:51 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAD=7U2A4PLazTR1tNT9PsW+Dn8WW5MOM+iUvZ0UVn__Dtfo8mg@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<CAD=7U2A4PLazTR1tNT9PsW+Dn8WW5MOM+iUvZ0UVn__Dtfo8mg@mail.gmail.com>
Message-ID: <1353815751.17423.YahooMailRC@web184702.mail.ne1.yahoo.com>

From: Mike Meyer <mwm at mired.org>
Sent: Sat, November 24, 2012 4:56:16 PM


> On Sat, Nov 24, 2012 at 5:27 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> > (In  fact, I think an os.walk replacement based on the fts API, which
> > never  uses iterdir_stat, would be the best answer, but let me put
> > more thought  into that idea...)
> 
> A couple of us have looked into this, and there's an  impedance
> mismatch between the os.walk API and the fts API, in that fts  doesn't
> provide the d_type information, so you're forced to make the  stat
> calls that this rewrite was trying to avoid.

Actually, fts does provide all the information you need in fts_info -- but you 
don't need it anyway, because fts itself is driving the walk, not os.walk. Using 
fts to implement iterdir_stat, and then using that to implement os.walk, may or 
may not be doable, but it's silly.

> If you're thinking  about providing an API that looks more like fts,
> that might be a better  answer - if you design it so it also works  on
> Windows.


Yes, that was the idea, "an os.walk replacement based on the fts API", as 
opposed to "a reimplementation of os.walk that uses fts". I believe implementing 
the fts interface on Windows and other non-POSIX platforms would be much easier 
than implementing os.walk efficiently on top of iterdir_stat. Also, it would 
ensure that all of the complexities had been thought through (following broken 
links, detecting hardlink and symlink cycles, etc.). And I believe the interface 
is actually better.

That last claim may be a bit controversial. But simple use cases are trivially 
convertible between os.walk and fts, while for anything less simple, I think fts 
is much more readable. 

For example, if you want to skip over any subdirectories whose names start with 
'.':

    # make sure you passed topdown=True or this will silently do nothing
    for i, dn in reversed(list(enumerate(dirnames))):
        if dn.startswith('.'):
            del dirnames[i]

    if ent.info == fts.D and ent.name.startswith('.'):
        f.skip(ent)

Or, if you only want to traverse 3 levels down:

    if len(dirpath.split(os.sep)) > 3:
        del dirnames[:]

    if ent.depth > 3:
        f.skip(ent)

Or, if you want to avoid traversing devices:

    # make sure you passed topdown=True
    fs0 = os.stat(dirpath)
    for i, dn in reversed(list(enumerate(dirnames))):
        fs1 = os.stat(os.path.join(dirpath, dn))
        if fs0.st_dev != fs1.st_dev:
            del dirnames[i]

    # just add fts.NODEV to the open options

In every case, the fts version is simpler, more explicit, more concise, and 
easier to get right.

As a 0th draft, the only changes to the interface described 
at http://nixdoc.net/man-pages/FreeBSD/man3/fts_open.3.html would be:

 * fts.open instead of fts_open.

 * open returns an FTS object, which is a context manager and an iterator (which 
just calls read).

 * Everything else is a method on the FTS object, so you call f.read() instead 
of fts_read(f).

 * open takes key and reverse params instead of compar, which work just as in 
list.sort, except that key=None means no sorting. If you want to pass a 
compar-like function, use functools.cmp_to_key.

 * key=None doesn't guarantee that files are listed "in the order listed in the 
directory"; they're listed in an unspecified order.

 * setclient, getclient, getstream are unnecessary, as are the number and 
pointer fields in FTSENT; if you want to pass user data down to your key 
function, just bind it in.

 * read returns an FTSENT, which has all the same fields as the C API struct, 
but without the fts_ prefixes, except that number and pointer may not be 
present.

 * open's options are the same ones defined in fts.h, but without the FTS_ 
prefix, and if you pass neither PHYSICAL nor LOGICAL, you get PHYSICAL rather 
than an error.
The default is PHYSICAL, instead of 0 (which just returns an error if you pass 
it).

 * Not passing NOCHDIR doesn't guarantee that fts does chdir, just allows it to. 
This is actually already true for the BSD interface, but every popular 
implementation does the same thing (actually, they're all trivial variations on 
the same implementation), so people write code that relies on it, so the 
documentation should explicitly say that it's not guaranteed.

 * Instead of fts_set, provide specific again, follow, and skip methods.

 * children returns a list of FTSENT objects, instead of returning the first one 
with ent.link holding the next one.
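
Pulling those pieces together, usage would look roughly like this (a sketch
of the proposed interface only; there is no such fts module in the stdlib
today):

    import fts  # hypothetical module implementing the interface sketched above

    with fts.open(['/some/tree'], fts.PHYSICAL | fts.NOSTAT) as f:
        for ent in f:                          # iterating just calls read()
            if ent.info == fts.D and ent.name.startswith('.'):
                f.skip(ent)                    # prune hidden directories
            elif ent.info == fts.F:
                print(ent.path)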

I think we could add an fts.walk that's a drop-in replacement for simple uses of 
os.walk, but doesn't try to provide the entire interface, which could be useful 
as a transition. But I don't think it's necessary.

Another potential option would be to make the iterator close itself when 
consumed, so you don't need to make it a context manager, again making it easier 
to drop in as a replacement for os.walk, but again I don't think it's necessary.

On POSIX-like platforms without native fts, the FreeBSD fts.c is almost 
completely portable. It does rely on fchdir and statfs, but these can both be 
stubbed out. If you don't have fchdir, you always get the equivalent of NOCHDIR. 
If you don't have statfs, you don't get the ufslinks optimization.

The quick&dirty ctypes wrapper around native fts 
at https://github.com/abarnert/py-fts approximates this interface. However, I'll 
try to write up a complete implementation that wraps native fts if available, or 
does it in pure Python with iterdir_stat (possibly with some minor changes) and 
existing stuff in os otherwise. I suspect I'll come up with a few changes while 
implementing it -- and of course it's possible I'll find out that it's not so easy 
to do on Windows, but at the moment I'm pretty confident it will be.


From phlogistonjohn at asynchrono.us  Sun Nov 25 13:54:34 2012
From: phlogistonjohn at asynchrono.us (John Mulligan)
Date: Sun, 25 Nov 2012 07:54:34 -0500
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
Message-ID: <6959424.U36ixQ3iu6@jackal>

On Saturday, November 24, 2012 03:27:09 PM Andrew Barnert wrote:
> First, another thought on the whole thing:
> 
> Wrapping readdir is useful. Replacing os.walk is also useful. But they don't
> necessarily have to be tied together at all.
> 
> In particular, instead of trying to write an iterdir_stat that can properly
> support os.walk on all platforms, why not just implement os.walk differently
> on platforms where iterdir_stat can't support it? (In fact, I think an
> os.walk replacement based on the fts API, which never uses iterdir_stat,
> would be the best answer, but let me put more thought into that idea...)


Agreed, keeping things separate might be a better approach. I wanted to point 
out the usefulness of an enhanced listdir/iterdir as its own beast in addition 
to improving os.walk. 

There is one thing that is advantageous about creating an ideal enhanced 
os.walk. People would only have to change the module walk is getting imported 
from; no changes would have to be made anywhere else even if that code is 
using features like the ability to modify dirnames (when topdown=True).
I am not sure if fts or other platform specific API could be wrangled into an 
exact drop in replacement.

> 
> Anyway, comments:
> 
> From: John Mulligan <phlogistonjohn at asynchrono.us>
> Sent: Fri, November 23, 2012 8:13:22 AM
> 
> > I like returning the d_type directly because in  the unix style APIs the
> > dirent structure doesn't provide the same stuff as  the stat result and I
> > don't want to trick myself into thinking I have all  the information
> > available from the readdir call. I also like to have my  Python functions
> > map pretty closely to the C calls.
> 
> Of course that means that implementing the same interface on Windows means
> faking d_type from the stat result, and making the functions map less
> closely to the C calls?

I agree, I don't know if it would be better to simply have platform dependent 
fields/values in the struct or if it is better to abstract things in this case. 
Anyway, the betterwalk code is already converting constants from the Windows 
API to mode values. Something similar might be possible for d_type values as 
well.

See: https://github.com/benhoyt/betterwalk/blob/master/betterwalk.py#L62


> 
> > In addition I have a fditerdir call that supports a directory file
> > descriptor as the first argument. This is handy because I also have a
> > wrapper for fstatat (this was all created for Python 2 and before 3.3
> > was released).
> 
> This can only be implemented on platforms that support the *at functions. I
> believe that means just linux and OpenBSD right now, other *BSD (including
> OS X) at some unspecified point in the future. Putting something like that
> in the stdlib would probably require also adding another function like
> os_supports_at (similar to supports_fd, supports_dirfd, etc.), but that's
> not a big deal.

I agree that this requires supporting platforms. (I've run this on FreeBSD as 
well.) I didn't mean to imply that this should be required for a better walk 
function. I wanted to provide some color about the value of exposing alternate 
listdir-type functions themselves and not just as a stepping stone on the way 
to enhancing walk.



From phlogistonjohn at asynchrono.us  Sun Nov 25 14:46:54 2012
From: phlogistonjohn at asynchrono.us (John Mulligan)
Date: Sun, 25 Nov 2012 08:46:54 -0500
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CADiSq7fhC=eYnSyEugOC7J0vGJsnEMU1F4hj9mDmv4P7nrF0bw@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<CADiSq7fhC=eYnSyEugOC7J0vGJsnEMU1F4hj9mDmv4P7nrF0bw@mail.gmail.com>
Message-ID: <3392801.HGlhUSeoMx@jackal>

> On Sunday, November 25, 2012 01:47:05 PM Nick Coghlan wrote:
> 
> > On Sun, Nov 25, 2012 at 9:27 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
> > 
> > This can only be implemented on platforms that support the *at functions. I
> > believe that means just linux and OpenBSD right now, other *BSD (including OS X)
> > at some unspecified point in the future. Putting something like that in the
> > stdlib would probably require also adding another function like os_supports_at
> > (similar to supports_fd, supports_dirfd, etc.), but that's not a big deal.
> > 
> 
> FWIW, if "supports_dirfd" is non-empty, you can be pretty sure that the 
> underlying OS supports the *at APIs, as that's how the dirfd argument gets 
> used by the affected functions.
> 

When I realized that Python 3.3 listdir already supports a file descriptor 
argument I had to go back to the docs and read this section over again! I have 
not used Python 3 in anger yet so even though I've read the new docs before I 
find it easy to overlook things. 
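
For reference, on a platform where dir_fd is supported, the 3.3 pieces
already combine like this (minimal sketch, no error handling):

    import os

    fd = os.open('/some/dir', os.O_RDONLY)
    try:
        for name in os.listdir(fd):          # listdir on a directory fd, new in 3.3
            st = os.stat(name, dir_fd=fd, follow_symlinks=False)
            print(name, oct(st.st_mode))
    finally:
        os.close(fd)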

This means that (for my use cases at least) all that Python 3.3 is missing are 
the extra pieces of data in the direntry structure. I don't know if anyone 
else is interested in this particular low-level feature but if we were to come 
up with an interface that works well for both posix/windows I think the 
os.walk case (both walk and fwalk now) is much easier to deal with.





From abarnert at yahoo.com  Mon Nov 26 08:52:39 2012
From: abarnert at yahoo.com (Andrew Barnert)
Date: Sun, 25 Nov 2012 23:52:39 -0800 (PST)
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <6959424.U36ixQ3iu6@jackal>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<6959424.U36ixQ3iu6@jackal>
Message-ID: <1353916359.19441.YahooMailRC@web184701.mail.ne1.yahoo.com>

From: John Mulligan <phlogistonjohn at asynchrono.us>
Sent: Sun, November 25, 2012 4:54:45 AM

> Agreed, keeping things separate might be a better approach. I wanted to point 
> out the usefulness of an enhanced listdir/iterdir as its own beast in addition 
> to improving os.walk. 

Agreed. I would have uses for an iterdir_stat on its own.

> There is one thing that is advantageous about creating an ideal enhanced 
> os.walk. People would only have to change the module walk is getting imported 
> from, no changes would have to be made anywhere else even if that code is 
> using features like the ability to modify dirnames (when topdown=True).

Sure, there's a whole lot of existing code, and existing knowledge in people's 
heads, so, if it turns out to be easy, we might as well provide a drop-in 
replacement. 

But if it's not, I'd be happy enough with something like, "You can use fts.walk 
as a faster replacement for os.walk if you don't modify dirnames. If you do need 
to modify dirnames, either stick with os.path, or rewrite your code around fts."

Anyway, the code below is the obvious way to implement os.walk on top of fts, 
but I'd need to think it through and test it to see if it handles everything 
properly:

def walk(path, topdown=True, onerror=None, followlinks=False):
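    # NOTE: assumes the fts-style names sketched earlier in this thread
    # (open, LOGICAL, PHYSICAL, NOSTAT, NOCHDIR, COMFOLLOW, D, DC, DP,
    # DNR, ERR, DOT) are in scope, e.g. via "from fts import *".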
    level = None
    dirents, filenames = [], []
    dirpath = path
    with open([path], 
              ((LOGICAL if followlinks else PHYSICAL) | 
               NOSTAT | NOCHDIR | COMFOLLOW)) as f:
        for ent in f:
            if ent.level != level:
                dirnames = [dirent.name for dirent in dirents]
                yield dirpath, dirnames, filenames
                for dirent in dirents:
                    if dirent.name not in dirnames:
                        f.skip(dirent)
                level = ent.level
                dirents, filenames = [], []
                path = os.path.join(path, ent.path)
            else:
                if ent.info in (D, DC):
                    if topdown:
                        dirents.append(ent)
                elif ent.info == DP:
                    if not topdown:
                        dirents.append(ent)
                elif ent.info in (DNR, ERR):
                    if onerror:
                        # make OSError with filename member
                        onerror(None) 
                elif ent.info == DOT:
                    pass
                else:
                    filenames.append(ent.name)
    dirnames = [dirent.name for dirent in dirents]
    yield dirpath, dirnames, filenames

> I am not sure if fts or other platform specific API could be wrangled into an 
> exact drop in replacement.

My goal isn't to use fts as a platform-specific API to re-implement os.walk, but 
to replace os.walk with a better API which is just a pythonized version of the 
fts API (and is available on every platform, as efficiently as possible).

If that also gives us a drop-in replacement for os.walk, that's gravy.


> > From: John Mulligan <phlogistonjohn at asynchrono.us>
> >  Sent: Fri, November 23, 2012 8:13:22 AM
> > 
> > > I like returning  the d_type directly because in  the unix style APIs the
> > > dirent  structure doesn't provide the same stuff as  the stat result and I
> >  > don't want to trick myself into thinking I have all  the  information
> > > available from the readdir call. I also like to have  my  Python functions
> > > map pretty closely to the C  calls.
> > 
> > Of course that means that implementing the same  interface on Windows means
> > faking d_type from the stat result, and  making the functions map less
> > closely to the C calls?
> 
> I agree, I don't know if it would be better to simply have platform dependent 
> fields/values in the struct or if it is better to abstract things in this 
> case. 
>
> Anyway, the betterwalk code is already converting constants from the Windows 
> API to mode values. Something similar might be possible for d_type values as 
> well.

I was just bringing up the point that, in your quest for mapping Python to C as 
thinly as possible on POSIX, you're inherently making the mapping a little 
thicker on Windows. That isn't necessarily a problem -- the same thing is true for 
much of the os module today, after all -- just something to keep in mind.

Either way, I would want to have some way of knowing "is this entry a directory" 
without having to figure out which of two values I need to check based on my 
platform, if at all possible.

> > > In addition I have a fditerdir call that supports a directory  file
> > > descriptor as the first argument. This is handy because I also  have a
> > > wrapper for fstatat (this was all created for Python 2 and  before 3.3
> > > was released).
> > 
> > This can only be  implemented on platforms that support the *at functions. I
> > believe that  means just linux and OpenBSD right now, other *BSD (including
> > OS X) at  some unspecified point in the future. Putting something like that
> > in the  stdlib would probably require also adding another function like
> >  os_supports_at (similar to supports_fd, supports_dirfd, etc.), but  that's
> > not a big deal.
> 
> I agree that this requires supporting platforms. (I've run this on FreeBSD as 
> well.) I didn't mean to imply that this should be required for a better walk 
> function. I wanted to provide some color about the value of exposing alternate 
> listdir-type functions themselves and not just as a stepping stone on the way 
> to enhancing walk.


This also raises the point that there is no "ffts" or "ftsat" on any platform I 
know of, and in fact implementing the former wouldn't be totally trivial, 
because fts remembers the root paths... So, if we wanted an fwalk, it might be a 
bit trickier than an fiterdir.


From benhoyt at gmail.com  Mon Nov 26 09:01:00 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 26 Nov 2012 21:01:00 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <50AEE81D.5060707@fastmail.us>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AEE81D.5060707@fastmail.us>
Message-ID: <CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>

> I'm suspicious of your use of Windows' built-in pattern matching. There are
> a number of quirks to it you haven't accounted for... for example: it
> matches short filenames, the behavior you noted of "?" at the end of
> patterns also applies to the end of the 'filename portion' (e.g. foo?.txt
> can match foo.txt), and the behavior of patterns ending in ".*" or "." isn't
> like fnmatch.

Oh, you're right. What a pain. The FindFirstFile docs are terrible in
this regard, and simply say "the file name, which can include wildcard
characters, for example, an asterisk (*) or a question mark (?)."
Microsoft documents * and ? at [1], but it's very incomplete and
doesn't mention those quirks. Any idea where there's thorough
documentation of it?

Oh, looks like someone's had a go here:
http://digital.ni.com/public.nsf/allkb/0DBE16907A17717B86256F7800169797

And this article by Raymond Chen looks related and interesting:
http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx

Still, I think "pattern" is useful enough to get right (either that,
or drop it). It should be fairly straight-forward to find the patterns
that don't work and use wildcard='*' with fnmatch in those cases
instead.
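
Something along these lines, as a sketch only (the iterdir_stat signature and
the quirk test here are assumptions, not the real rules):

    import fnmatch

    def list_matching(path, pattern, iterdir_stat):
        # only a rough guess at the quirky endings; the real FindFirstFile
        # rules are murkier (see the links above)
        quirky = pattern.endswith(('?', '.', '.*'))
        if quirky:
            return [(name, st) for name, st in iterdir_stat(path, pattern='*')
                    if fnmatch.fnmatch(name, pattern)]
        return list(iterdir_stat(path, pattern=pattern))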

> your find_data_to_stat function ignores the symlink flag

Yes, you're right. I haven't tested symlink handling in my code so
far. I intend to once I've got the speed issues ironed out though.

-Ben

[1] http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/find_c_search_wildcard.mspx?mfr=true


From benhoyt at gmail.com  Mon Nov 26 09:19:55 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 26 Nov 2012 21:19:55 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1353916359.19441.YahooMailRC@web184701.mail.ne1.yahoo.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
	<1353799629.25753.YahooMailRC@web184706.mail.ne1.yahoo.com>
	<6959424.U36ixQ3iu6@jackal>
	<1353916359.19441.YahooMailRC@web184701.mail.ne1.yahoo.com>
Message-ID: <CAL9jXCHNcv79YxrwXUNA=h+Q7fzJs-787f9F3J0Qmdmw4naVjw@mail.gmail.com>

>> There is one thing that is advantageous about creating an ideal enhanced
>> os.walk. People would only have to change the module walk is getting imported
>> from, no changes would have to be made anywhere else even if that code is
>> using features like the ability to modify dirnames (when topdown=True).
>
> Sure, there's a whole lot of existing code, and existing knowledge in people's
> heads, so, if it turns out to be easy, we might as well provide a drop-in
> replacement.

Yeah, that's the main reason that if nothing else comes of all this,
I'd still love to speed up os.walk() on Windows by 3x to 6x. Whether
or not iterdir_stat() or anything else is added to the stdlib.

> But if it's not, I'd be happy enough with something like, "You can use fts.walk
> as a faster replacement for os.walk if you don't modify dirnames. If you do need
> to modify dirnames, either stick with os.path, or rewrite your code around fts."

I'm afraid looking into fts is beyond the scope of the work I'd like
to do at the moment.

> I was just bringing up the point that, in your quest for mapping Python to C as
> thinly as possibly on POSIX, you're inherently making the mapping a little
> thicker on Windows. That isn't necessarily a problem -- the same thing is true for
> much of the os module today, after all -- just something to keep in mind.

Totally agreed.

-Ben


From benhoyt at gmail.com  Mon Nov 26 09:14:49 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Mon, 26 Nov 2012 21:14:49 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <1360817.kTgMK23t3p@giza>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<1360817.kTgMK23t3p@giza>
Message-ID: <CAL9jXCFn2ngae9vbfB3UuZd=VRh72amMwvwi2aWzmLJC21vMPg@mail.gmail.com>

> I'm really happy that someones looking into this.  I've done some
> similar work for my day job and have some thoughts about the APIs and
> approach.

Thanks!

> extra information is returned/yielded from the call. What I do
> differently is that I (a) always return the d_type value instead of
> falling back to stat'ing the item (b) do not provide the pattern
> argument.

Yeah, returning more stat fields was a suggestion of someone's on
python-ideas, and (b) was my idea to allow me to tap into Windows
wildcard matching. Both of which I think are simple and good.

> I like returning the d_type directly because in the unix style APIs the
> dirent structure doesn't provide the same stuff as the stat result and I
> don't want to trick myself into thinking I have all the information
> available from the readdir call. I also like to have my Python functions
> map pretty closely to the C calls.

Here I disagree. Though I would, being a heavy Windows user. :-) As
somebody else mentioned, on Windows, the API here is nothing like the
FindFirst/Next C calls. In general, I think the stdlib should tend
towards getting more cross-platform, not more Linux-ish. In the case
of my stat fields, it's not any more cross-platform, but at least the
st_mode field is something the stdlib can already handle.
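
(Meaning the existing stat-module helpers apply to it directly, e.g.:)

    import os, stat

    st = os.lstat('.')                 # stand-in for an iterdir_stat result
    print(stat.S_ISDIR(st.st_mode), stat.S_ISLNK(st.st_mode))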

> For example there is a potential race condition between calling the
> readdir and the stat, like if the object is removed between calls. I can
> be very granular (for lack of a better word) about my error handling in
> these cases.

That's a good point. I'm not sure it'd be a big deal in practice. But
it's worth thinking about. Perhaps the os.stat() call should catch
OSError and return None for all fields if it failed. But maybe that's
suppressing too much. Or maybe it could be an option
(stat_errors=True).

> Because I come at this from a Linux platform I am also not so keen on
> the built in pattern matching that comes for "free" from the
> FindFirst/FindNext Window's API provides. It just feels like this should
> be provided at a higher layer. But I can't say for sure because I don't
> know how much of a performance boost this is on Windows.

I don't know about the performance boost here either. I suspect it's
significant only in certain cases (when you're matching a small
fragment of files in a large directory) but I should do some
performance tests.

-Ben


From sturla at molden.no  Mon Nov 26 22:23:33 2012
From: sturla at molden.no (Sturla Molden)
Date: Mon, 26 Nov 2012 22:23:33 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <k8jjvq$7a7$1@ger.gmane.org>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
Message-ID: <439017CA-40D4-4489-BD05-17B067FA1724@molden.no>


On 21 Nov 2012, at 23:18, Richard Oudkerk <shibturn at gmail.com> wrote:

> 
> An implementation is available at
> 
>    http://hg.python.org/sandbox/sbt#spawn
> 
> You just need to stick
> 
>    multiprocessing.set_start_method('spawn')
> 
> at the beginning of the program to use fork+exec instead of fork.  The test suite passes.  (I would not say that making this work was that straightforward though.)
> 
> That branch also supports the starting of processes via a server process.  That gives an alternative solution to the problem of mixing fork() with threads, but has the advantage of being as fast as the default fork start method.  However, it does not work on systems where fd passing is unsupported like OpenSolaris.
> 

That is very nice, thank you :-)

BTW, fd passing is possible on Windows too, using DuplicateHandle. One can "inject" an open file handle into a different process, but some means of ipc (e.g. a pipe) must be used to communicate its value.

Sturla





From shibturn at gmail.com  Mon Nov 26 23:11:02 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Mon, 26 Nov 2012 22:11:02 +0000
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
Message-ID: <k90pdu$l7v$1@ger.gmane.org>

On 26/11/2012 9:23pm, Sturla Molden wrote:
> BTW, fd passing is possible on Windows too, using DuplicateHandle. One can "inject" an open file handle into a different process, but some means of ipc (e.g. a pipe) must be used to communicate it's value.

multiprocessing on Windows already depends on that feature;-)

-- 
Richard



From sturla at molden.no  Tue Nov 27 01:26:36 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 01:26:36 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <k90pdu$l7v$1@ger.gmane.org>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org>
Message-ID: <854389E7-1FC7-4970-BFEB-990DB163BF23@molden.no>

On 26 Nov 2012, at 23:11, Richard Oudkerk <shibturn at gmail.com> wrote:

> On 26/11/2012 9:23pm, Sturla Molden wrote:
>> BTW, fd passing is possible on Windows too, using DuplicateHandle. One can "inject" an open file handle into a different process, but some means of ipc (e.g. a pipe) must be used to communicate it's value.
> 
> multiprocessing on Windows already depends on that feature;-)


Hmm, last time I looked at the code it used handle inheritance on Windows, which was why e.g. a lock or a file could not be sent over a pipe or queue.

Sturla




From trent at snakebite.org  Tue Nov 27 13:33:25 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 07:33:25 -0500
Subject: [Python-ideas] WSAPoll and tulip
Message-ID: <20121127123325.GH90314@snakebite.org>

    A weekend or two ago, I was planning on doing some work on some
    ideas I had regarding IOCP and the tulip/async-IO discussion.

    I ended up getting distracted by WSAPoll.  WSAPoll is a method
    that Microsoft introduced with Vista/2008 that is intended to
    be semantically equivalent to poll() on UNIX.

    I decided to play around and see what it would take to get it
    available via select.poll() on Windows, eventually hacking it
    into a working state.

    Issue: http://bugs.python.org/issue16507
    Patch: http://bugs.python.org/file28038/wsapoll.patch

    So, it basically works.  poll() on Windows, who would have thought.

    It's almost impossible to test with our current infrastructure; all
    our unit tests seem to pass pipes and other non-Winsock-backed sockets
    to poll(), which, like select()-on-Windows, isn't supported.

    I suspect Twisted's test suite would give it a much better work out
    (CC'd Glyph just so it's on his radar).  I ended up having to verify
    it worked with some admittedly-hacky dual-python-console sessions,
    one poll()'ing as a server, the other connecting as a client.  It
    definitely works, so, it's worth keeping it in mind for the future.

    It's still equivalent to poll()'s O(N) on UNIX, but that's better
    than the 64/512 limit select is currently stuck with on Windows.

    Didn't have much luck trying to get the patched Python working with
    tulip's PollReactor, unfortunately, so I just wanted to provide some
    feedback on that experience.

    First bit of feedback: man, debugging `yield from` stuff is *hard*.
    Tulip just flat out didn't work with the PollReactor from the start
    but it was dying in a non-obvious way.

    So, I attached both a Pdb debugger and Visual Studio debugger and
    tried to step through everything to figure out why the first call
    to poll() was blowing up (can't remember the exact error message
    but it was along the lines of "you can't poll() whatever it is you
    just asked me to poll(), it's defo' not a socket").

    I eventually, almost by pure luck, traced the problem to the fact
    that PollReactor's __init__ eventually results in code being called
    that calls poll() on two os.pipe() objects (in EventLoop I think).

    However, when I was looking at the code, it appeared as though the
    first poll() came from the getaddrinfo().  So all my breakpoints
    and whatnot were geared towards that, yet none of them were being
    hit, yet poll() was still being called somehow, somewhere.

    I ended up having to spend ages traipsing through every line in
    Visual Studio's debugger to try figure out what the heck was going
    on.  I believe the `yield from` aspect made that so much more of an
    arduous affair -- one moment I'm in selectmodule.c's getaddrinfo(),
    then I'm suddenly deep in the bowels of some cryptic eval frame
    black magic, then one 'step' later, I'm over in some completely
    different part of selectmodule.c, and so on.

    I think the reason I found it so tough was because when you're
    single stepping through each line of a C program, you can sort of
    always rely on the fact you know what's going to happen when you
    "step" the next line.

    In this case though, a step of an eval frame would wildly jump
    to seemingly unrelated parts of C code.  As far as I could tell,
    there was no easy/obvious way to figure the details out before
    stepping that instruction either (i.e. probing the various locals
    and whatnot).

    So, that's the main feedback from that weekend, I guess.  Granted,
    it's more of a commentary on `yield from` than tulip per se, but I
    figured it would be worth offering up my experience nevertheless.

    I ended up with the following patch to avoid the initial poll()
    against os.pipe() objects:

--- a/polling.py        Sat Nov 03 13:54:14 2012 -0700
+++ b/polling.py        Tue Nov 27 07:05:10 2012 -0500
@@ -41,6 +41,7 @@
 import os
 import select
 import time
+import sys
 
 
 class PollsterBase:
@@ -459,6 +460,10 @@
     """
 
     def __init__(self, eventloop, executor=None):
+        if sys.platform == 'win32':
+            # Work around the fact that we can't poll pipes on Windows.
+            if isinstance(eventloop.pollster, PollPollster):
+                eventloop = EventLoop(SelectPollster())
         self.eventloop = eventloop
         self.executor = executor  # Will be constructed lazily.
         self.pipe_read_fd, self.pipe_write_fd = os.pipe()

    By that stage it was pretty late in the day and I accepted defeat.
    My patch didn't really work, it just allowed the test to run to
    completion without the poll OSError exception being raised.

        Trent.


From shibturn at gmail.com  Tue Nov 27 14:24:02 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 13:24:02 +0000
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <854389E7-1FC7-4970-BFEB-990DB163BF23@molden.no>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org>
	<854389E7-1FC7-4970-BFEB-990DB163BF23@molden.no>
Message-ID: <k92etr$ml5$1@ger.gmane.org>

On 27/11/2012 12:26am, Sturla Molden wrote:
> Hmm, last time I looked at the code it used handle inheritance on Windows,
 > which was why e.g. a lock or a file could not be sent over a pipe or 
queue.

In 3.3 on Windows, connection objects can be pickled using a trick with 
DuplicateHandle() and DUPLICATE_CLOSE_SOURCE.  The same could be done 
for locks fairly easily.  But on Unix I cannot see a way to reliably 
manage the lifetime of picklable locks.

-- 
Richard



From ncoghlan at gmail.com  Tue Nov 27 15:30:43 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 28 Nov 2012 00:30:43 +1000
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <20121127123325.GH90314@snakebite.org>
References: <20121127123325.GH90314@snakebite.org>
Message-ID: <CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>

On Tue, Nov 27, 2012 at 10:33 PM, Trent Nelson <trent at snakebite.org> wrote:

>     In this case though, a step of an eval frame would wildly jump
>     to seemingly unrelated parts of C code.  As far as I could tell,
>     there was no easy/obvious way to figure the details out before
>     stepping that instruction either (i.e. probing the various locals
>     and whatnot).
>

I'm not sure that has anything to do with "yield from", but rather to do
with the use of computed gotos (
http://hg.python.org/cpython/file/default/Python/ceval.c#l821). For sane
stepping in the eval loop, you probably want to build with
"--without-computed-gotos" enabled (that's a configure option on Linux, I
have no idea how to turn it off on Windows). Even without that, the manual
opcode prediction macros are still a bit wacky (albeit easier to follow
than the compiler level trickery).

The eval loop commits many sins against debuggability and maintainability
in pursuit of speed, so it's not really fair to place all of that at the
feet of the yield from clause.

If you really did just mean the behaviour of jumping from
caller-frame-eval-loop to generator-frame-eval-loop and back out again,
then that again is really just about generator stepping at the C level
(where suspend/resume means passing through ceval.c), rather than being
specific to yield from.

    So, that's the main feedback from that weekend, I guess.  Granted,
>     it's more of a commentary on `yield from` than tulip per se, but I
>     figured it would be worth offering up my experience nevertheless.
>

From your description so far, it seems like more of a commentary on
pointing a C level debugger at our eval loop...

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121128/88efb791/attachment.html>

From solipsis at pitrou.net  Tue Nov 27 15:42:04 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 27 Nov 2012 15:42:04 +0100
Subject: [Python-ideas] WSAPoll and tulip
References: <20121127123325.GH90314@snakebite.org>
Message-ID: <20121127154204.5fc81457@pitrou.net>


Hi,

On Tue, 27 Nov 2012 07:33:25 -0500,
Trent Nelson <trent at snakebite.org> wrote:
>     So, it basically works.  poll() on Windows, who would have
> thought.
> 
>     It's almost impossible to test with our current infrastructure;
> all our unit tests seem to pass pipes and other
> non-Winsock-backed-socks to poll(), which, like select()-on-Windows,
> isn't supported.

Well, then you should write new tests that don't rely on pipes.
There's no reason it can't be done, and there are already lots of
examples of tests using TCP sockets in our test suite. It will also be
a nice improvement to the current test suite for Unix platforms.
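
Even something this small would exercise the socket path (just a sketch,
not a real test case):

    import select, socket

    srv = socket.socket()
    srv.bind(('127.0.0.1', 0))
    srv.listen(1)

    p = select.poll()
    p.register(srv.fileno(), select.POLLIN)
    print(p.poll(100))    # [] until a client actually connects
    srv.close()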

>     Visual Studio's debugger to try figure out what the heck was going
>     on.  I believe the `yield from` aspect made that so much more of
> an arduous affair -- one moment I'm in selectmodule.c's getaddrinfo(),
>     then I'm suddenly deep in the bowels of some cryptic eval frame
>     black magic, then one 'step' later, I'm over in some completely
>     different part of selectmodule.c, and so on.

I'm not sure why you're using Visual Studio to debug Python code?
It sounds like you want something higher-level, e.g. Python print()
calls or pdb.

Regards

Antoine.




From trent at snakebite.org  Tue Nov 27 16:03:31 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 10:03:31 -0500
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <20121127154204.5fc81457@pitrou.net>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
Message-ID: <20121127150330.GB91191@snakebite.org>

On Tue, Nov 27, 2012 at 06:42:04AM -0800, Antoine Pitrou wrote:
> 
> Hi,
> 
> On Tue, 27 Nov 2012 07:33:25 -0500,
> Trent Nelson <trent at snakebite.org> wrote:
> >     So, it basically works.  poll() on Windows, who would have
> > thought.
> > 
> >     It's almost impossible to test with our current infrastructure;
> > all our unit tests seem to pass pipes and other
> > non-Winsock-backed-socks to poll(), which, like select()-on-Windows,
> > isn't supported.
> 
> Well, then you should write new tests that don't rely on pipes.
> There's no reason it can't be done, and there are already lots of
> examples of tests using TCP sockets in our test suite. It will also be
> a nice improvement to the current test suite for Unix platforms.

    Agreed, there's more work required.  It's on the list.

> >     Visual Studio's debugger to try figure out what the heck was going
> >     on.  I believe the `yield from` aspect made that so much more of
> > an arduous affair -- one moment I'm in selectmodule.c's getaddrinfo(),
> >     then I'm suddenly deep in the bowels of some cryptic eval frame
> >     black magic, then one 'step' later, I'm over in some completely
> >     different part of selectmodule.c, and so on.
> 
> I'm not sure why you're using Visual Studio to debug Python code?
> It sounds like you want something higher-level, e.g. Python print()
> calls or pdb.

    Ah, right.  So, I was trying to figure out why poll was barfing up
    an WSAError on whatever it was being asked to poll.  So, I set out
    to find what it was polling, via breakpoints in register().

    That pointed to an fd with value 3.  That seemed a little strange,
    as all my other socket tests consistently had socket fd values above
    250-something.

    So, I wanted to track down where that fd was coming from, thinking
    it was related to the first poll()/register() instance I could find
    in getaddrinfo().  It wasn't, and through combined use of *both* pdb
    and VS, I eventually stumbled onto the attempt to poll os.pipe()
    FDs.  I think.

    (There were also other issues that I skipped over in the e-mail;
     like figuring out I had to &= ~POLLPRI in order for the poll call
     to work at all.)

    And... Visual Studio's debugger is sublime.  I'll jump at the chance
    to fire it up if I think it'll help me debug an issue.  You get much
    better situational awareness than stepping through with gdb.

        Trent.


From jimjjewett at gmail.com  Tue Nov 27 16:23:03 2012
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 27 Nov 2012 10:23:03 -0500
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AEE81D.5060707@fastmail.us>
	<CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>
Message-ID: <CA+OGgf7-ygtbp7bs4L_ipP1qQsMf4pAPSHh2d=zqGXBDoi0ZzA@mail.gmail.com>

On 11/26/12, Ben Hoyt <benhoyt at gmail.com> quoted (random?) as writing:
>> I'm suspicious of your use of Windows' built-in pattern matching. ...
>> ... for example: it matches short filenames,
>> ... ["?" at the end of a name means an *optional* any character]
>> ... the behavior of patterns ending in ".*" or "." isn't like fnmatch.

So?  Consistency would be better, but that horse left before the barn
was even built.  It is called filename "globbing" because even the
wild inconsistency between regular expression implementations
doesn't quite encompass most file globbing rules.

I'll grant that better documentation would be nice.  But at this point,
matching the platform expectation (at the cost of some additional
cross-platform inconsistency) may be the lesser of evils.

And frankly, for many use cases the windows algorithm is better.

It only hurts when it brings up something you weren't expecting *and*
you didn't double-check before performing a dangerous operation.
I can assure you that I've found unexpected file matches under unix
semantics as well.

-jJ


From guido at python.org  Tue Nov 27 17:06:02 2012
From: guido at python.org (Guido van Rossum)
Date: Tue, 27 Nov 2012 08:06:02 -0800
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <20121127150330.GB91191@snakebite.org>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
Message-ID: <CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>

I wasn't there, and it's been 7 years since I last saw Visual Studio,
but I do believe it is a decent way to debug C code. But it sounds
like it wa tough to explore the border between C and Python code,
which is why it took you so long to find the issue, right?

Also, please be aware that tulip is *not* ready for anything. As I
just stated in a thread on python-dev, it is just my way of trying to
understand the issues with async I/O (in a different context than
where I've understood them before, i.e. App Engine's NDB). I am well
aware of how hard it is to debug currently -- just read the last
section in the TODO file. I have not had to debug any C code, so my
experience is purely based on using pdb. Here, the one big difficulty
seems to be that it does the wrong thing when it hits a yield or
yield-from -- it treats these as if they were returns, and this
totally interrupts the debug flow. In the past, when I was debugging
NDB, I've asked in vain whether someone had already made the necessary
changes to pdb to let it jump over a yield instead of following it --
I may have to go in and develop a change myself, because this problem
isn't going away.

However, I have noted that a system using a yield-from-based scheduler
is definitely more pleasant to debug than one using yield <future> --
the big improvement is that if the system prints a traceback, it
automatically looks right. However there are still situations where
there's a suspended task that's holding on to relevant information,
and it's too hard to peek in its stack frame. I will be exploring
better solutions once I get back to working on tulip more intensely.

-- 
--Guido van Rossum (python.org/~guido)


From christian at python.org  Tue Nov 27 17:56:50 2012
From: christian at python.org (Christian Heimes)
Date: Tue, 27 Nov 2012 17:56:50 +0100
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>
Message-ID: <50B4F0D2.7060703@python.org>

On 27.11.2012 15:30, Nick Coghlan wrote:
> I'm not sure that has anything to do with "yield from", but rather to do
> with the use of computed gotos
> (http://hg.python.org/cpython/file/default/Python/ceval.c#l821). For
> sane stepping in the eval loop, you probably want to build with
> "--without-computed-gotos" enabled (that's a configure option on Linux,
> I have no idea how to turn it off on Windows). Even without that, the
> manual opcode prediction macros are still a bit wacky (albeit easier to
> follow than the compiler level trickery).

I don't think the problem is related to computed gotos. Visual Studio
doesn't support labels as values and therefore doesn't do computed
gotos either. It's a special feature of GCC and some other compilers.

tl;dr:
No computed gotos on Windows ;)

Christian


From sturla at molden.no  Tue Nov 27 18:03:36 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 18:03:36 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <k90pdu$l7v$1@ger.gmane.org>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org>
Message-ID: <50B4F268.1080203@molden.no>

On 26.11.2012 23:11, Richard Oudkerk wrote:

> multiprocessing on Windows already depends on that feature;-)

Indeed it does :) But it seems the handle duplication is only used when 
the Popen class is instantiated, so it is not more flexible than just 
inheriting handles on fork or CreateProcess. It would be nice to pass 
newly created fds to child processes that are already running.

I.e. what I would like to see is an advanced queue that can be used to 
pass files, sockets, locks, and other objects associated with a handle. 
That is, when a "special object" on the queue is deserialized 
(unpickled) by the receiver, it sends a request back to the sender for 
handle duplication. One obvious use case would be a "thread pool" design 
for a server app using processes instead of threads.
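
On Unix the low-level plumbing for that already exists in the form of
SCM_RIGHTS over an AF_UNIX socket. A rough sketch of the kind of helpers
such a queue could be built on (the names are just illustrative, not an
existing multiprocessing API):

import array
import socket

def send_fds(sock, fds, msg=b"fds"):
    # ship a list of file descriptors over an AF_UNIX socket; the kernel
    # duplicates them into the receiving process
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                          array.array("i", fds))])

def recv_fds(sock, maxfds, msglen=16):
    # receive the message plus any descriptors attached to it
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(
        msglen, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, list(fds)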

Sturla


From guido at python.org  Tue Nov 27 18:17:57 2012
From: guido at python.org (Guido van Rossum)
Date: Tue, 27 Nov 2012 09:17:57 -0800
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <50B4F0D2.7060703@python.org>
References: <20121127123325.GH90314@snakebite.org>
	<CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>
	<50B4F0D2.7060703@python.org>
Message-ID: <CAP7+vJKPeDdri0c0Fu_i_qszHBAGWwAODyutp=9esafDZQaD0g@mail.gmail.com>

Nevertheless the optimizer does crazy things to ceval.c. Trent, can you
confirm you were debugging unoptimized code?

--Guido van Rossum (sent from Android phone)
On Nov 27, 2012 8:57 AM, "Christian Heimes" <christian at python.org> wrote:

> Am 27.11.2012 15:30, schrieb Nick Coghlan:
> > I'm not sure that has anything to do with "yield from", but rather to do
> > with the use of computed gotos
> > (http://hg.python.org/cpython/file/default/Python/ceval.c#l821). For
> > sane stepping in the eval loop, you probably want to build with
> > "--without-computed-gotos" enabled (that's a configure option on Linux,
> > I have no idea how to turn it off on Windows). Even without that, the
> > manual opcode prediction macros are still a bit wacky (albeit easier to
> > follow than the compiler level trickery).
>
> I don't think the problem is related to computed gotos. Visual Studio
> doesn't support labels as values and therefore doesn't do computed
> gotos, too. It's a special feature of GCC and some other compilers.
>
> tl;dr:
> No computed gotos on Windows ;)
>
> Christian
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121127/ab79026f/attachment.html>

From sturla at molden.no  Tue Nov 27 18:22:24 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 18:22:24 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <50B4F268.1080203@molden.no>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org> <50B4F268.1080203@molden.no>
Message-ID: <50B4F6D0.4030007@molden.no>

On 27.11.2012 18:03, Sturla Molden wrote:

> Indeed it does :) But it seems the handle duplication is only used when
> the Popen class is initiated, so it is not more flexible than just
> inheriting handles on fork or CreateProcess. It would be nice to pass
> newly created fds to child processes that are already running.

Actually, it seems the non-Windows versions still rely on os.fork for 
inheriting fds. So fd passing is only used on Windows, and there only to 
simulate handle inheritance (which CreateProcess can do as well). Its 
wider use seems to be held back by the reliance on fork to inherit fds 
on non-Windows versions.

If fd passing were exposed through a specialized queue, e.g. called 
multiprocessing.fdqueue, the means of startup should not matter.


Sturla


From amauryfa at gmail.com  Tue Nov 27 18:32:54 2012
From: amauryfa at gmail.com (Amaury Forgeot d'Arc)
Date: Tue, 27 Nov 2012 18:32:54 +0100
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAP7+vJKPeDdri0c0Fu_i_qszHBAGWwAODyutp=9esafDZQaD0g@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>
	<50B4F0D2.7060703@python.org>
	<CAP7+vJKPeDdri0c0Fu_i_qszHBAGWwAODyutp=9esafDZQaD0g@mail.gmail.com>
Message-ID: <CAGmFidYmCiW_cQ2rYECCavnyBzOf1ng2Z1Z0QAH_cV910deVbA@mail.gmail.com>

2012/11/27 Guido van Rossum <guido at python.org>

> Nevertheless the optimizer does crazy things to ceval.c. Trent, can you
> confirm you were debugging unoptimized code?


ceval.c is always compiled with a lot of optimizations, even in "debug"
mode, because of the "#define PY_LOCAL_AGGRESSIVE" at the top of the file.

I sometimes had to remove this line to debug programs correctly.
OTOH, without it the stack usage is much higher and some recursion tests
will fail.

-- 
Amaury Forgeot d'Arc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121127/ca72e41d/attachment.html>

From shibturn at gmail.com  Tue Nov 27 18:44:51 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 17:44:51 +0000
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <50B4F268.1080203@molden.no>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org> <50B4F268.1080203@molden.no>
Message-ID: <k92u6s$f6a$1@ger.gmane.org>

On 27/11/2012 5:03pm, Sturla Molden wrote:
> I.e. what I would like to see is an advanced queue that can be used to
> pass files, sockets, locks, and other objects associated with a handle.
> That is, when a "special object" on the queue is deserialized
> (unpickled) by the receiver, it sends a request back to the sender for
> handle duplication. One obvious use case would be a "thread pool" design
> for a server app using processes instead of threads.

Socket pickling is already supported in 3.3.  Adding the same support 
for file objects would be easy enough.

But I don't see a sensible way to support general pickling of lock 
objects on Unix.  So I don't much like the idea of adding support only 
for Windows.

-- 
Richard



From trent at snakebite.org  Tue Nov 27 18:49:33 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 12:49:33 -0500
Subject: [Python-ideas] An alternate approach to async IO
Message-ID: <20121127174933.GC91191@snakebite.org>

    I was hoping to have some code to go along with this idea, but the
    WSAPoll stuff distracted me, so, I don't have any concrete examples
    just yet.

    As part of (slowly) catching up on all the async IO discussions, I
    reviewed both Twisted's current iocpreactor implementation, as well
    as Richard Oudkerk's IOCP/tulip work:

        http://mail.python.org/pipermail/python-ideas/2012-November/017684.html

    Both implementations caught me a little off-guard.  The Twisted
    iocpreactor appears to drive a 'one shot' iteration of GetQueued-
    CompletionStatus per event loop iteration -- sort of treating it
    just like select/poll (in that you call it once to get a list of
    things to do, do them, then call it again).

    Richard's work sort of works the same way... the proactor drives
    the completion port polling via GetQueuedCompletionStatus.

    From what I know about overlapped IO and IOCP, this seems like a
    really odd way to do things: by *not* using a thread pool, whose
    threads' process_some_io_that_just_completed() methods are auto-
    matically called* by the underlying OS, you're getting all of the
    additional complexity of overlapped IO and IOCP without any of the
    concurrency benefits.

        [*]: On AIX, Solaris and Windows XP/2003, you'd manually spawn
             a bunch of threads and have them call GetQueuedCompletion-
             Status or port_get().  Those methods return when IO is
             available on the given completion port.

             On Windows 7/2008R2, you can leverage the new thread pool
             APIs.  You don't need to create a single thread yourself,
             just spin up a pool and associate the IO completion with
             it -- Windows will automatically manage the underlying
             threads optimally.

             Windows 8/2012 introduces a new API, Registered I/O,
             which leverages pre-registered buffers and completion
             ports to minimize overhead of copying data from kernel
             to user space.  It's pretty nifty.  You use RIO in concert
             with thread pools and IOCP.

    Here's the "idea" I had, with zero working code to back it up:
    what if we had a bunch of threads in the background whose sole
    purpose it was to handle AIO?  On Windows/AIX, they would poll
    GetQueuedCompletionStatus, on Solaris, port_get().

    They're literally raw pthreads and have absolutely nothing to
    do with Python's threading.Thread() stuff.  They exist solely
    in C and can't be interfaced to directly from Python code.

    ....which means they're free to run outside the GIL, and thus,
    multiple cores could be leveraged concurrently.  (Only for
    processing completed I/O, but hey, it's better than nothing.)

    The threads would process the completion port events via C code
    and allocate necessary char * buffers on demand.  Upon completion
    of their processing, let's say, reading 4096 bytes from a socket,
    they push their processed event and data to an interlocked* list,
    then go back to GetQueuedCompletionStatus/get_event.

    You eventually need to process these events from Python code.
    Here's where I think this approach is neat: we could expose a new
    API that's semantically pretty close to how poll() works now.

    Except instead of polling a bunch of non-blocking file descriptors
    to see what's ready for reading/writing, you're simply getting a
    list of events that completed in the background.

    You process the events, then presumably, your event loop starts
    again: grab available events, process them.  The key here is that
    nothing blocks, but not because of non-blocking sockets.

    Nothing blocks because the reads() have already taken place and
    the writes() return immediately.  So your event loop now becomes
    this tight little chunk of code that spins as quickly as it can
    on processing events.  (This would lend itself very well to the
    Twisted notion of deferring blocking actions (like db stuff) via
    callFromThread(), but I'll ignore that detail for now.)

    So, presuming your Python code looks something like this:

        for event in aio.events():
            if event.type == EventType.DataReceived:
                ...
            elif event.type == ...

    Let's talk about how aio.events() would work.  I mentioned that the
    background threads, once they've completed processing their event,
    push the event details and data onto an interlocked list.

    I got this idea from the interlocked list methods available in
    Windows since XP.  They facilitate synchronized access to a singly
    linked list without the need for explicit mutexes.  Some sample
    methods:

        InterlockedPopEntrySList
        InterlockedPushEntrySList
        InterlockedFlushSList

        More info: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).aspx

    So, the last thing a background thread does before going back to
    poll GetQueuedCompletionStatus/get_event is an interlocked push
    onto a global list.  What would it push?  Depends on the event,
    at the least, an event identifier, at the most, an event identifier
    and pointer to the char * buffer allocated by the thread, perhaps?

    Now, when aio.events() is called, we have some C code that does
    an interlocked flush -- this basically pops all the entries off
    the list in an interlocked, atomic fashion.

    It then loops over all the events and creates the necessary CPython
    objects that can then be used in the subsequent Python code.  So,
    for data received, it would call PyBytes_FromStringAndSize(...) with
    the buffer indicated in the event, then free() that chunk of memory.

        (That was just the first idea that came to my mind; there are
         probably tons of better ways to do it in practice.  The point
         is that however it's done, the end result is a GC-tracked object
         with no memory leaks from the background thread buffer alloc.

         Perhaps there could be a separate interlocked list of shared
         buffers that the background threads pop from when they need
         a buffer, and the Python aio.events() code pushes to when
         it is done converting the buffer into a CPython object.

         ....and then down the track, subsequent optimizations that
         allow the CPython object to inherit the buffer for its life-
         time, removing the need to constantly copy data from back-
         ground thread buffers to CPython buffers.)

    And... I think that's the crux of it really.  Key points are actual
    asynchronous IO, carried out by threads that aren't GIL constrained
    and thus, able to run concurrently -- coupled with a pretty simple
    Python interface.

    Now, follow-on ideas from that core premise: the longer we can stay
    in C land, the more performant the solution.  Putting that little
    tidbit aside, I want to mention Twisted again, because I think their
    protocol approach is spot on:

        class Echo(protocol.Protocol):
            def dataReceived(self, data):
                self.transport.write(data)

            def lineReceived(self, line):
                self.transport.write(line)

    That's a completely nonsensical example, as you wouldn't have both a
    lineReceived and dataReceived method, but it illustrates the point
    of writing classes that are driven by events.

    As for maintaining how long we can stay in C versus Python, consider
    serving a HTTP request.  You accept a connection, wait for headers,
    then send headers+data back.  From a Python programmer's perspective
    you don't really care if data has been read unless you've received
    the entire, well-formed set of headers from the client.

    For an SMTP server, it's more chatty, read a line, send a line back,
    read a few more lines, send some more lines back.

    In both of these cases, you could apply some post-processing of data
    in C, perhaps as a simple regex match to start with.  It would be
    pretty easy to regex match an incoming HELO/GET line, queue the
    well-formed ones for processing by aio.events(), and automatically send
    pre-registered errors back for those that don't match.
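
    Purely as an illustration of the kind of check I mean (in the real
    thing this would live in C, and the pattern below is a made-up
    example, not part of any proposed API):

        import re

        # only queue events whose data looks like a legitimate request line;
        # anything else gets a pre-registered error sent straight back
        REQUEST_LINE = re.compile(br"^(GET|HEAD|POST) \S+ HTTP/1\.[01]\r\n")

        def prefilter(data):
            return REQUEST_LINE.match(data) is not None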

    Things like accept filters on BSD work like this; i.e. don't return
    back to the calling code until there's a legitimate event.  It
    greatly simplifies the eventual Python implementation, too.  Rather
    than write your own aio.events()-based event loop, you'd take the
    Twisted approach and register your protocol handlers with a global
    "reactor" that is responsible for processing the raw aio.events()
    and then invoking the relevant method on your class instances.
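
    To make that concrete, a rough sketch of such a reactor (aio.events(),
    EventType and the event attributes are all hypothetical -- whatever
    shape the real API ends up with):

        class EventType:
            # placeholder constants standing in for whatever aio exposes
            DataReceived, ConnectionLost = range(2)

        class Reactor:
            def __init__(self, events_source):
                self.events_source = events_source  # e.g. the proposed aio.events
                self.protocols = {}                 # connection id -> protocol

            def register(self, conn_id, protocol):
                self.protocols[conn_id] = protocol

            def run_once(self):
                # nothing here blocks: every event has already completed
                for event in self.events_source():
                    proto = self.protocols.get(event.conn_id)
                    if proto is None:
                        continue
                    if event.type == EventType.DataReceived:
                        proto.dataReceived(event.data)
                    elif event.type == EventType.ConnectionLost:
                        proto.connectionLost(event.reason)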

    So, let's assume that's all implemented and working in 3.4.  The
    drawback of this approach is that even though we've allowed for
    some actual threaded concurrency via background IO threads, the
    main Python code that loops over aio.events() is still limited
    to executing on a single core.  Albeit, in a very tight loop that
    never blocks and would probably be able to process an insane number
    of events per second when pegging a single core at 100%.

    So, that's 3.4.  Perhaps in 3.5 we could add automatic support for
    multiprocessing once the number of events per-poll reach a certain
    threshold.  The event loop automatically spreads out the processing
    of events via multiprocessing, facilitating multiple core usage both
    via background threads *and* Python code.  (And we could probably do
    some optimizations such that the background IO thread always queues
    up events for the same multiprocessing instance -- which would yield
    even more benefits if we had fancy "buffer inheritance" stuff that
    removes the need to continually copy data from the background IO
    buffers to the foreground CPython code.)

    As an added bonus, by the time 3.5 rolls around, perhaps the Linux
    and FreeBSD camps have seen how performant IOCP/Solaris-events can
    be and added similar support (the Solaris event API wouldn't be that
    hard to port elsewhere for an experienced kernel hacker.  It's quite
    elegant, and, hey, the source code is available).  (We could mimic
    it in the mean time with background threads that call epoll/kqueue,
    I guess.)

    Thoughts?  Example code or GTFO? ;-)

        Trent.



From trent at snakebite.org  Tue Nov 27 18:51:20 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 12:51:20 -0500
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAP7+vJKPeDdri0c0Fu_i_qszHBAGWwAODyutp=9esafDZQaD0g@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<CADiSq7cLbJ=KJnQuuDmr7KPzbB9i8rkp6V4MtBh42NF-uRZ=oA@mail.gmail.com>
	<50B4F0D2.7060703@python.org>
	<CAP7+vJKPeDdri0c0Fu_i_qszHBAGWwAODyutp=9esafDZQaD0g@mail.gmail.com>
Message-ID: <20121127175119.GD91191@snakebite.org>

On Tue, Nov 27, 2012 at 09:17:57AM -0800, Guido van Rossum wrote:
>    Nevertheless the optimizer does crazy things to ceval.c. Trent, can you
>    confirm you were debugging unoptimized code?

    Yup, definitely.  If it helps, I can fire up the dev env again and
    give specifics on the exact frame-jumping-voodoo that baffled me.

        Trent.


From shibturn at gmail.com  Tue Nov 27 19:30:10 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 18:30:10 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127174933.GC91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org>
Message-ID: <k930rr$8il$1@ger.gmane.org>

On 27/11/2012 5:49pm, Trent Nelson wrote:
>      Here's the "idea" I had, with zero working code to back it up:
>      what if we had a bunch of threads in the background whose sole
>      purpose it was to handle AIO?  On Windows/AIX, they would poll
>      GetQueuedCompletionStatus, on Solaris, get_event().
>
>      They're literally raw pthreads and have absolutely nothing to
>      do with Python's threading.Thread() stuff.  They exist solely
>      in C and can't be interfaced to directly from Python code.
>
>      ....which means they're free to run outside the GIL, and thus,
>      multiple cores could be leveraged concurrently.  (Only for
>      processing completed I/O, but hey, it's better than nothing.)
>
>      The threads would process the completion port events via C code
>      and allocate necessary char * buffers on demand.  Upon completion
>      of their processing, let's say, reading 4096 bytes from a socket,
>      they push their processed event and data to an interlocked* list,
>      then go back to GetQueuedCompletionStatus/get_event.

But you have to allocate the buffer *before* you initiate an overlapped 
read.  And you may as well make that buffer a Python bytes object (which 
can be shrunk if it is too large).  That leaves no "processing" that can 
usefully be done by a C level thread pool.

Also, note that (at least on Windows) overlapped IO automatically makes 
use of a hidden thread pool to do the IO.

-- 
Richard



From trent at snakebite.org  Tue Nov 27 20:59:13 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 14:59:13 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k930rr$8il$1@ger.gmane.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
Message-ID: <20121127195913.GE91191@snakebite.org>

On Tue, Nov 27, 2012 at 10:30:10AM -0800, Richard Oudkerk wrote:
> On 27/11/2012 5:49pm, Trent Nelson wrote:
> >      Here's the "idea" I had, with zero working code to back it up:
> >      what if we had a bunch of threads in the background whose sole
> >      purpose it was to handle AIO?  On Windows/AIX, they would poll
> >      GetQueuedCompletionStatus, on Solaris, get_event().
> >
> >      They're literally raw pthreads and have absolutely nothing to
> >      do with Python's threading.Thread() stuff.  They exist solely
> >      in C and can't be interfaced to directly from Python code.
> >
> >      ....which means they're free to run outside the GIL, and thus,
> >      multiple cores could be leveraged concurrently.  (Only for
> >      processing completed I/O, but hey, it's better than nothing.)
> >
> >      The threads would process the completion port events via C code
> >      and allocate necessary char * buffers on demand.  Upon completion
> >      of their processing, let's say, reading 4096 bytes from a socket,
> >      they push their processed event and data to an interlocked* list,
> >      then go back to GetQueuedCompletionStatus/get_event.
> 
> But you have to allocate the buffer *before* you initiate an overlapped 
> read.  And you may as well make that buffer a Python bytes object (which 
> can be shrunk if it is too large).  That leaves no "processing" that can 
> usefully be done by a C level thread pool.

    I'm a little confused by that last sentence.  The premise of my idea
    is being able to service AIO via simple GIL-independent threads that
    really just copy data from A to B.  The simple fact that they don't
    have to acquire the GIL each time the completion port has an event
    seems like a big win, no?

    (So I'm not sure if you're saying that this wouldn't work because
    you may as well use Python bytes objects, and they can't be accessed
    willy-nilly from non-GIL threads... or if you're saying they can,
    but there's no benefit from a C-level thread copying data to/from
    buffers independent of the GIL.)

> Also, note that (at least on Windows) overlapped IO automatically makes 
> use of a hidden thread pool to do the IO.

    I don't think that detail impacts my general idea though, right?

        Trent.


From shibturn at gmail.com  Tue Nov 27 21:13:18 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 20:13:18 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127195913.GE91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org>
Message-ID: <k936t7$2hc$1@ger.gmane.org>

On 27/11/2012 7:59pm, Trent Nelson wrote:
>> >But you have to allocate the buffer*before*  you initiate an overlapped
>> >read.  And you may as well make that buffer a Python bytes object (which
>> >can be shrunk if it is too large).  That leaves no "processing" that can
>> >usefully be done by a C level thread pool.
>      I'm a little confused by that last sentence.  The premise of my idea
>      is being able to service AIO via simple GIL-independent threads that
>      really just copy data from A to B.  The simple fact that they don't
>      have to acquire the GIL each time the completion port has an event
>      seems like a big win, no?
>
>      (So I'm not sure if you're saying that this wouldn't work because
>      you may as well use Python bytes objects, and they can't be accessed
>      willy-nilly from non-GIL threads... or if you're saying they can,
>      but there's no benefit from a C-level thread copying data to/from
>      buffers independent of the GIL.)
>

I am saying that there is no copying necessary if you use a bytes object 
as your buffer.  You can just use _PyBytes_Resize() afterwards to shrink 
it if necessary.

-- 
Richard



From trent at snakebite.org  Tue Nov 27 21:19:46 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 15:19:46 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k936t7$2hc$1@ger.gmane.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
Message-ID: <20121127201946.GF91191@snakebite.org>

On Tue, Nov 27, 2012 at 12:13:18PM -0800, Richard Oudkerk wrote:
> On 27/11/2012 7:59pm, Trent Nelson wrote:
> >> >But you have to allocate the buffer*before*  you initiate an overlapped
> >> >read.  And you may as well make that buffer a Python bytes object (which
> >> >can be shrunk if it is too large).  That leaves no "processing" that can
> >> >usefully be done by a C level thread pool.
> >      I'm a little confused by that last sentence.  The premise of my idea
> >      is being able to service AIO via simple GIL-independent threads that
> >      really just copy data from A to B.  The simple fact that they don't
> >      have to acquire the GIL each time the completion port has an event
> >      seems like a big win, no?
> >
> >      (So I'm not sure if you're saying that this wouldn't work because
> >      you may as well use Python bytes objects, and they can't be accessed
> >      willy-nilly from non-GIL threads... or if you're saying they can,
> >      but there's no benefit from a C-level thread copying data to/from
> >      buffers independent of the GIL.)
> >
> 
> I am saying that there is no copying necessary if you use a bytes object 
> as your buffer.  You can just use _PyBytes_Resize() afterwards to shrink 
> it if necessary.

    Got it.  So what about the "no processing that can be usefully done
    by a C level thread" bit?  I'm trying to discern whether or not you're
    highlighting a fundamental flaw in the theory/idea ;-)

    (That it's going to be more optimal to have background threads service
     IO without the need to acquire the GIL, basically.)

        Trent.


From solipsis at pitrou.net  Tue Nov 27 21:54:14 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 27 Nov 2012 21:54:14 +0100
Subject: [Python-ideas] An alternate approach to async IO
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
Message-ID: <20121127215414.22ffbefc@pitrou.net>

On Tue, 27 Nov 2012 15:19:46 -0500
Trent Nelson <trent at snakebite.org> wrote:
> 
>     Got it.  So what about the "no processing that can be usefully done
>     by a C level thread" bit?  I'm trying to discern whether or not you're
>     highlighting a fundamental flaw in the theory/idea ;-)
> 
>     (That it's going to be more optimal to have background threads service
>      IO without the need to acquire the GIL, basically.)

You don't need to acquire the GIL if you use the Py_buffer API in the
right way. You'll figure out the details :-)

Regards

Antoine.




From sturla at molden.no  Tue Nov 27 22:25:55 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 22:25:55 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127215414.22ffbefc@pitrou.net>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
Message-ID: <17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>


Den 27. nov. 2012 kl. 21:54 skrev Antoine Pitrou <solipsis at pitrou.net>:

> 
> You don't need to acquire the GIL if you use the Py_buffer API in the
> right way. You'll figure out the details :-)
> 

And if he doesn't, this is what Cython's memoryview syntax will help him to do :-)

As for the rest of his idea, I think it is fundamentally flawed in two ways:

First, there is no fundamental difference between a non-registered thread and a Python thread that has released the GIL. They are both native OS threads running freely without bothering the interpreter. So given that the GIL will be released in an extension module, a list of threading.Thread is much easier to set up than using POSIX or Windows threads from C.

Second, using a thread pool to monitor another thread pool (an IOCP on Windows) seems a bit confused. IOCPs can fire a callback function on completion. It is easier to let the IOCP use the simplified GIL API from the callback code to acquire the GIL, and notify the Python code with a callback. Or an even simpler solution is to use twisted (twistedmatrix.org) instead of reinventing the wheel.


Sturla








From sturla at molden.no  Tue Nov 27 22:37:50 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 22:37:50 +0100
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <k92u6s$f6a$1@ger.gmane.org>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org> <50B4F268.1080203@molden.no>
	<k92u6s$f6a$1@ger.gmane.org>
Message-ID: <55B2F855-807B-43FD-BEB6-D4D229304EE8@molden.no>

Den 27. nov. 2012 kl. 18:44 skrev Richard Oudkerk <shibturn at gmail.com>:

> 
> 
> But I don't see a sensible way to support general pickling of lock objects on Unix.  So I don't much like the idea of adding support only for Windows.
> 

I would suggest using a piece of shared memory and atomic compare-and-swap. Shared memory can be pickled (e.g. take a look at what I have on github).

Sturla









From shibturn at gmail.com  Tue Nov 27 22:42:33 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 21:42:33 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127201946.GF91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
Message-ID: <k93c4k$inu$1@ger.gmane.org>

On 27/11/2012 8:19pm, Trent Nelson wrote:
>      Got it.  So what about the "no processing that can be usefully done
>      by a C level thread" bit?  I'm trying to discern whether or not you're
>      highlighting a fundamental flaw in the theory/idea;-)
>
>      (That it's going to be more optimal to have background threads service
>       IO without the need to acquire the GIL, basically.)

I mean that I don't understand what sort of "servicing" you expect the 
background threads to do.

If you just mean consuming packets from GetQueuedCompletionStatus() and 
pushing them on an interlocked stack then why bother?

-- 
Richard



From trent at snakebite.org  Tue Nov 27 22:48:22 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 16:48:22 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
Message-ID: <20121127214821.GG91191@snakebite.org>

On Tue, Nov 27, 2012 at 01:25:55PM -0800, Sturla Molden wrote:
> 
> Den 27. nov. 2012 kl. 21:54 skrev Antoine Pitrou
> <solipsis at pitrou.net>:
> 
> > 
> > You don't need to acquire the GIL if you use the Py_buffer API in
> > the right way. You'll figure out the details :-)
> > 
> 
> And if he don't, this is what Cython's memoryview syntax will help him
> to do :-)
> 
> As for the rest of his idea, I think it is fundamentally flawed in two
> ways:
> 
> First, there is no fundamental difference between a non-registered
> thread and a Python thread that has released the GIL. They are both
> native OS threads running freely without bothering the interpreter. So
> given that the GIL will br released in an extension module, a list of
> threading.Thread is much easier to set up than use POSIX or Windows
> threads from C.
> 
> Second, using a thread pool to monitor another thread pool (an IOCP on
> Windows) seems a bit confused. IOCPs can fire a callback function on
> completion. It is easier to let the IOCP use the simplified GIL API
> from the callback code to acquire the GIL, and notify the Python code
> with a callback. Or en even simpler solution is to use twisted
> (twistedmatrix.org) instead of reinventing the wheel. 

    Hrm, neither of those points are really flaws with my idea, you're
    just suggesting two different ways of working with IOCP, both of
    which wouldn't scale as well as a GIL-independent approach.

    As for reinventing the wheel, I'm not.  I'm offering an idea for
    achieving high performance async IO.  In fact, Twisted would run
    like the dickens if there was an aio.events() API -- they wouldn't
    need any of the non-blocking magic behind the scenes like they do
    now to *simulate* an asynchronous, event-driven architecture.

        Trent.


From sturla at molden.no  Tue Nov 27 23:12:12 2012
From: sturla at molden.no (Sturla Molden)
Date: Tue, 27 Nov 2012 23:12:12 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127214821.GG91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
Message-ID: <96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>


Den 27. nov. 2012 kl. 22:48 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Hrm, neither of those points are really flaws with my idea, you're
>    just suggesting two different ways of working with IOCP, both of
>    which wouldn't scale as well as a GIL-independent approach.

CPython is not a GIL-independent approach. You can stack ten thread pools on top of each other, but sooner or later you must call back to Python.
 
sturla



From trent at snakebite.org  Tue Nov 27 23:19:53 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 17:19:53 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k93c4k$inu$1@ger.gmane.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org> <k93c4k$inu$1@ger.gmane.org>
Message-ID: <20121127221951.GH91191@snakebite.org>

On Tue, Nov 27, 2012 at 01:42:33PM -0800, Richard Oudkerk wrote:
> On 27/11/2012 8:19pm, Trent Nelson wrote:
> >      Got it.  So what about the "no processing that can be usefully done
> >      by a C level thread" bit?  I'm trying to discern whether or not you're
> >      highlighting a fundamental flaw in the theory/idea;-)
> >
> >      (That it's going to be more optimal to have background threads service
> >       IO without the need to acquire the GIL, basically.)
> 
> I mean that I don't understand what sort of "servicing" you expect the 
> background threads to do.
> 
> If you just mean consuming packets from GetQueuedCompletionStatus() and 
> pushing them on an interlocked stack then why bother?

    Theoretically: lower latency, higher throughput and better
    scalability (additional cores improve both) than alternate
    approaches when under load.

    Let's just say the goal of the new async IO framework is to
    be able to handle 65k simultaneous connections and/or saturate
    multiple 10Gb Ethernet links (or 16Gb FC, or 300Gb IB) on a
    system where a pure C/C++ solution using native libs (kqueue,
    epoll, IOCP, GCD etc) *could* do that.

    What async IO library of the future could come the closest?
    That's sort of the thought process I had, which lead to this
    idea.

    We should definitely have a systematic way of benchmarking
    this sort of stuff though, otherwise it's all conjecture.

    On that note, I came across a very interesting presentation
    a few weeks ago whilst doing research:

        http://www.mailinator.com/tymaPaulMultithreaded.pdf

    He makes some very interesting observations regarding contemporary
    performance of non-blocking versus thousands-of-blocking-threads.
    It highlights the importance of having a way to systematically test
    assumptions like "IOCP will handle load better than WSAPoll".

    Definitely worth the read.  The TL;DR version is:

        - Thousands of threads doing blocking IO isn't as bad as
          everyone thinks.  It used to suck, but these days, on
          multicore machines and contemporary kernels, it ain't
          so bad.
        - Throughput is much better using blocking IO than non.

    From Python's next-gen AIO perspective, I think it would be
    useful to define our goals.  Is absolute balls-to-the-wall
    as-close-to-metal-as-possible performance (like 65k clients
    or 1GB/s saturation) the ultimate goal?

    If not, then what?  Half that, but with scalability?  Quarter of
    that, but with a beautifully elegant/simple API?

        Trent.


From trent at snakebite.org  Tue Nov 27 23:36:45 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 17:36:45 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
Message-ID: <20121127223644.GJ91191@snakebite.org>

On Tue, Nov 27, 2012 at 02:12:12PM -0800, Sturla Molden wrote:
> 
> Den 27. nov. 2012 kl. 22:48 skrev Trent Nelson <trent at snakebite.org>:
> 
> > 
> >    Hrm, neither of those points are really flaws with my idea, you're
> >    just suggesting two different ways of working with IOCP, both of
> >    which wouldn't scale as well as a GIL-independent approach.
> 
> CPython is not a GIL-independent approach. You can stack ten thread pools om top of each other, but sooner or later you must call back to Python.

    Right, but with things like interlocked lists, you can make that
    CPython|background_IO synchronization barrier much more performant
    than relying on GIL acquisition.

    I think we'll have to agree to disagree at this point; there's not
    much point arguing further until there's some code+benchmarks on
    the table.

        Trent.


From greg.ewing at canterbury.ac.nz  Tue Nov 27 21:54:50 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 28 Nov 2012 09:54:50 +1300
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127201946.GF91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
Message-ID: <50B5289A.30405@canterbury.ac.nz>

Trent Nelson wrote:
>     So what about the "no processing that can be usefully done
>     by a C level thread" bit?  I'm trying to discern whether or not you're
>     highlighting a fundamental flaw in the theory/idea ;-)

You seem to be assuming that the little bit of processing
needed to get the data from kernel to user space is going
to be significant compared to whatever the Python code is
going to do with the data. That seems unlikely.

-- 
Greg


From shibturn at gmail.com  Tue Nov 27 23:53:52 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Tue, 27 Nov 2012 22:53:52 +0000
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <55B2F855-807B-43FD-BEB6-D4D229304EE8@molden.no>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org> <50B4F268.1080203@molden.no>
	<k92u6s$f6a$1@ger.gmane.org>
	<55B2F855-807B-43FD-BEB6-D4D229304EE8@molden.no>
Message-ID: <k93gab$q33$1@ger.gmane.org>

On 27/11/2012 9:37pm, Sturla Molden wrote:
> I would suggest to use a piece of shared memory and atomic compare-and-swap.
 > Shared memory can be pickled (e.g. take a look at what I have on github).

On unix (without assuming a recent gcc) there isn't any cross platform 
way of doing compare-and-swap or other atomic operations.  And even if 
there were, you need to worry about busy-waiting.

One could also consider using a process-shared semaphore or mutex 
allocated from shared memory, but that does not seem to be available on 
many platforms.

A simple (but wasteful) alternative would be to build a lock/semaphore 
from a pipe: write to release and read to acquire.  Then you can use 
normal fd passing.
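
A minimal sketch of that trick (purely illustrative, not a proposed API):

import os

class PipeSemaphore:
    # each byte sitting in the pipe represents one available "token"
    def __init__(self, initial=1):
        self._rfd, self._wfd = os.pipe()
        for _ in range(initial):
            os.write(self._wfd, b"\0")

    def acquire(self):
        os.read(self._rfd, 1)        # blocks until a token can be consumed

    def release(self):
        os.write(self._wfd, b"\0")   # put a token back

    def fds(self):
        # the pair of descriptors is what you would pass to other processes
        return self._rfd, self._wfd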

-- 
Richard



From trent at snakebite.org  Wed Nov 28 00:06:39 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 18:06:39 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B5289A.30405@canterbury.ac.nz>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<50B5289A.30405@canterbury.ac.nz>
Message-ID: <20121127230639.GL91191@snakebite.org>

On Tue, Nov 27, 2012 at 12:54:50PM -0800, Greg Ewing wrote:
> Trent Nelson wrote:
> >     So what about the "no processing that can be usefully done
> >     by a C level thread" bit?  I'm trying to discern whether or not you're
> >     highlighting a fundamental flaw in the theory/idea ;-)
> 
> You seem to be assuming that the little bit of processing
> needed to get the data from kernel to user space is going
> to be significant compared to whatever the Python code is
> going to do with the data.  That seems unlikely.

    Great response.  Again, highlights the need to have some
    standard way for benchmarking this sort of stuff.  Would
    be a good use of Snakebite; everything is on a gigabit
    switch on the same subnet, multiple boxes could simulate
    client load against each server being benchmarked.

    (I don't think my idea would really start to show gains
     until you're dealing with tens of thousands of connections
     and 1Gb/10Gb Ethernet traffic, to be honest.  So, as I
     mentioned in a previous e-mail, it depends on what our
     goals are for the next-gen AIO framework (performance
     at whatever the cost vs not).)


        Trent.


From sturla at molden.no  Wed Nov 28 00:33:55 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 00:33:55 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127223644.GJ91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
Message-ID: <471EE574-81CC-4260-A0FA-9198655674C4@molden.no>


Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Right, but with things like interlocked lists, you can make that
>    CPython|background_IO synchronization barrier much more performant
>    than relying on GIL acquisition.

You always need the GIL to call back to Python.  You don't need it for anything else. 

Sturla




From mwm at mired.org  Wed Nov 28 00:35:20 2012
From: mwm at mired.org (Mike Meyer)
Date: Tue, 27 Nov 2012 17:35:20 -0600
Subject: [Python-ideas] An error in multiprocessing on MacOSX?
In-Reply-To: <k93gab$q33$1@ger.gmane.org>
References: <50ACEF44.1090705@molden.no>
	<CAD=7U2DTLxGTw0=zcFGU5p9eWk171pHjRYuYe_3jr07KXGaUMw@mail.gmail.com>
	<CAGE7PN+qgoKj2nbe_ry0G36TUYYKp2YiJ-hBFgmZ5Z=UbVOX9A@mail.gmail.com>
	<k8jjvq$7a7$1@ger.gmane.org>
	<439017CA-40D4-4489-BD05-17B067FA1724@molden.no>
	<k90pdu$l7v$1@ger.gmane.org> <50B4F268.1080203@molden.no>
	<k92u6s$f6a$1@ger.gmane.org>
	<55B2F855-807B-43FD-BEB6-D4D229304EE8@molden.no>
	<k93gab$q33$1@ger.gmane.org>
Message-ID: <CAD=7U2Aqz9-vhaj3OP480wY1JP4z3T6Q2V_1auwcpbrSoLc=eg@mail.gmail.com>

On Tue, Nov 27, 2012 at 4:53 PM, Richard Oudkerk <shibturn at gmail.com> wrote:
> On 27/11/2012 9:37pm, Sturla Molden wrote:
>> I would suggest to use a piece of shared memory and atomic
>> compare-and-swap.
>
>> Shared memory can be pickled (e.g. take a look at what I have on github).
> On unix (without assuming a recent gcc) there isn't any cross platform way
> of doing compare-and-swap or other atomic operations.  And even if there
> were, you need to worry about busy-waiting.
>
> One could also consider using a process-shared semaphore or mutex allocated
> from shared memory, but that does not seem to be available on many
> platforms.

Do we need a cross-platform solution for all posix systems, or just a
wrapper that provides that functionality that can be implemented on
many of them? Something akin to the apr global mutex routines?

     <mike


From trent at snakebite.org  Wed Nov 28 00:41:29 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 18:41:29 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
References: <k930rr$8il$1@ger.gmane.org> <20121127195913.GE91191@snakebite.org>
	<k936t7$2hc$1@ger.gmane.org> <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
Message-ID: <20121127234129.GO91191@snakebite.org>

On Tue, Nov 27, 2012 at 03:33:55PM -0800, Sturla Molden wrote:
> 
> Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
> 
> > 
> >    Right, but with things like interlocked lists, you can make that
> >    CPython|background_IO synchronization barrier much more performant
> >    than relying on GIL acquisition.
> 
> You always need the GIL to call back to Python.  You don't need it for anything else. 

    Right, you *currently* need the GIL to call back to Python.

    I'm proposing an alternate approach that avoids the GIL.  On
    Windows, it would be via interlocked lists.

        Trent.


From sturla at molden.no  Wed Nov 28 00:50:18 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 00:50:18 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127174933.GC91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org>
Message-ID: <B0DC26DA-6EAC-4437-B3B2-9F6605F44102@molden.no>


Den 27. nov. 2012 kl. 18:49 skrev Trent Nelson <trent at snakebite.org>:

>     
>    Here's the "idea" I had, with zero working code to back it up:
>    what if we had a bunch of threads in the background whose sole
>    purpose it was to handle AIO?  On Windows/AIX, they would poll
>    GetQueuedCompletionStatus, on Solaris, get_event().
> 
>    They're literally raw pthreads and have absolutely nothing to
>    do with Python's threading.Thread() stuff.  They exist solely
>    in C and can't be interfaced to directly from Python code.
> 
>    ....which means they're free to run outside the GIL, and thus,
>    multiple cores could be leveraged concurrently.  (Only for
>    processing completed I/O, but hey, it's better than nothing.)


And herein lies the misunderstanding. 

A Python thread can do the same processing of completed I/O before it reacquires the GIL -- and thus Python can run on multiple cores concurrently. There is no difference between a pthread and a threading.Thread that has released the GIL. You don't need to spawn a pthread to process data independently of the GIL. You just need to process the data before the GIL is reacquired.

In fact, I use Python threads for parallel computing all the time. They scale as well as OpenMP threads on multiple cores. Why? Because I have made sure the computational kernels (e.g. LAPACK functions) release the GIL before they execute -- and the GIL is not reacquired before they are done. As long as the threads are running in C or Fortran land they don't need the GIL. I don't need to spawn pthreads or use OpenMP pragmas to create threads that can run freely on all cores. Python threads (threading.Thread) can do that too.



Sturla



From guido at python.org  Wed Nov 28 00:50:34 2012
From: guido at python.org (Guido van Rossum)
Date: Tue, 27 Nov 2012 15:50:34 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
Message-ID: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>

On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla at molden.no> wrote:
>
> Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
>
>>
>>    Right, but with things like interlocked lists, you can make that
>>    CPython|background_IO synchronization barrier much more performant
>>    than relying on GIL acquisition.
>
> You always need the GIL to call back to Python.  You don't need it for anything else.

You also need it for any use of an object, even INCREF, unless you
know no other thread yet knows about it.

-- 
--Guido van Rossum (python.org/~guido)


From sturla at molden.no  Wed Nov 28 00:59:43 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 00:59:43 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121127234129.GO91191@snakebite.org>
References: <k930rr$8il$1@ger.gmane.org> <20121127195913.GE91191@snakebite.org>
	<k936t7$2hc$1@ger.gmane.org> <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<20121127234129.GO91191@snakebite.org>
Message-ID: <E1CC6B39-8322-4B8D-973F-7E8682D081F0@molden.no>



Den 28. nov. 2012 kl. 00:41 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Right, you *currently* need the GIL to call back to Python.

You need the GIL to access the CPython interpreter. You did not suggest anything that will change that.


> 
>    I'm proposing an alternate approach that avoids the GIL.  On
>    Windows, it would be via interlocked lists.

No, you just misunderstood how the GIL works.

You seem to think the GIL serializes Python threads. But what it serializes is access to the CPython interpreter.

When Python threads process data in C land they don't need the GIL and can run freely.


Sturla





From trent at snakebite.org  Wed Nov 28 01:01:34 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 19:01:34 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <B0DC26DA-6EAC-4437-B3B2-9F6605F44102@molden.no>
References: <20121127174933.GC91191@snakebite.org>
	<B0DC26DA-6EAC-4437-B3B2-9F6605F44102@molden.no>
Message-ID: <20121128000133.GP91191@snakebite.org>

On Tue, Nov 27, 2012 at 03:50:18PM -0800, Sturla Molden wrote:
> 
> Den 27. nov. 2012 kl. 18:49 skrev Trent Nelson <trent at snakebite.org>:
> 
> >     
> >    Here's the "idea" I had, with zero working code to back it up:
> >    what if we had a bunch of threads in the background whose sole
> >    purpose it was to handle AIO?  On Windows/AIX, they would poll
> >    GetQueuedCompletionStatus, on Solaris, get_event().
> > 
> >    They're literally raw pthreads and have absolutely nothing to do
> >    with Python's threading.Thread() stuff.  They exist solely in C
> >    and can't be interfaced to directly from Python code.
> > 
> >    ....which means they're free to run outside the GIL, and thus,
> >    multiple cores could be leveraged concurrently.  (Only for
> >    processing completed I/O, but hey, it's better than nothing.)
> 
> 
> And herein lies the misunderstanding. 
> 
> A Python thread can do the same processing of completed I/O before it
> reacquires the GIL ? and thus Python can run on multiple cores
> concurrently. There is no difference between a pthread and a
> threading.Thread that has released the GIL. You don't need to spawn a
> pthread to process data independent if the GIL. You just need to
> process the data before the GIL is reacquired.
> 
> In fact, I use Python threads for parallel computing all the time.
> They scale as well as OpenMP threads on multiple cores. Why? Because I
> have made sure the computational kernels (e.g. LAPACK functions)
> releases the GIL before they execute ? and the GIL is not reacquired
> before they are done. As long as the threads are running in C or
> Fortran land they don't need the GIL. I don't need to spawn pthreads
> or use OpenMP pragmas to create threads that can run freely on all
> cores. Python threads (threading.Thread) can do that too.

    Perhaps I was a little too eager to highlight the ability for these
    background IO threads to run without needing to acquire the GIL.
    A Python thread could indeed do the same job; however, you still
    wouldn't interact with it from Python code as if it were a normal
    threading.Thread.

        Trent.


From sturla at molden.no  Wed Nov 28 01:07:03 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 01:07:03 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
References: <20121127174933.GC91191@snakebite.org> <k930rr$8il$1@ger.gmane.org>
	<20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
Message-ID: <E27E8DAD-4E90-452D-8AA0-45985A4C5356@molden.no>


Den 28. nov. 2012 kl. 00:50 skrev Guido van Rossum <guido at python.org>:

> On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla at molden.no> wrote:
>> 
>> Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
>> 
>>> 
>>>   Right, but with things like interlocked lists, you can make that
>>>   CPython|background_IO synchronization barrier much more performant
>>>   than relying on GIL acquisition.
>> 
>> You always need the GIL to call back to Python.  You don't need it for anything else.
> 
> You also need it for any use of an object, even INCREF, unless you
> know no other thread yet knows about it.

Yes.

But you don't need it to call Windows API functions like GetQueuedCompletionStatus, which is what Trent implied.

(Actually, calling GetQueuedCompletionStatus while holding the GIL would probably cause a deadlock.) 

Sturla



From trent at snakebite.org  Wed Nov 28 01:15:14 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 19:15:14 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
Message-ID: <20121128001514.GQ91191@snakebite.org>

On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
> On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla at molden.no> wrote:
> >
> > Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
> >
> >>
> >>    Right, but with things like interlocked lists, you can make that
> >>    CPython|background_IO synchronization barrier much more performant
> >>    than relying on GIL acquisition.
> >
> > You always need the GIL to call back to Python.  You don't need it for anything else.
> 
> You also need it for any use of an object, even INCREF, unless you
> know no other thread yet knows about it.

    Right, that's why I proposed using non-Python types as buffers
    whilst in the background IO threads.  Once the thread finishes
    processing the event, it pushes the necessary details onto a
    global interlocked list.  ("Details" being event type and possibly
    a data buffer if the event was 'data received'.)

    Then, when aio.events() is called, CPython code (holding the GIL)
    does an interlocked/atomic flush/pop_all, creates the relevant
    Python objects for the events, and returns them in a list for
    the calling code to iterate over.

    The rationale for all of this is that this approach should scale
    better when heavily loaded (i.e. tens of thousands of connections
    and/or Gb/s traffic).  When you're dealing with that sort of load
    on a many-core machine (let's say 16+ cores), an interlocked list
    is going to reduce latency versus 16+ threads constantly vying for
    the GIL.

    (That's the theory, at least.)

        Trent.


From sturla at molden.no  Wed Nov 28 01:25:10 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 01:25:10 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128000133.GP91191@snakebite.org>
References: <20121127174933.GC91191@snakebite.org>
	<B0DC26DA-6EAC-4437-B3B2-9F6605F44102@molden.no>
	<20121128000133.GP91191@snakebite.org>
Message-ID: <4D8D46FA-E172-4A62-B471-1ED1CFC3CD24@molden.no>



Den 28. nov. 2012 kl. 01:01 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Perhaps I was a little too eager to highlight the ability for these
>    background IO threads to run without needing to acquire the GIL.
>    A Python thread could indeed do the same job, however, you still
>    wouldn't interact with it from Python code as if it were a normal
>    threading.Thread.

threading.Thread allows us to spawn a thread in a platform-independent way.

Here is a thread pool:
pool = [Thread(target=foobar) for i in range(n)]

These threads can release the GIL and call GetQueuedCompletionStatus. They can do any post-processing they want without the GIL. And they can return back to Python while holding the GIL.
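
A rough, Windows-only sketch of such a pool with ctypes (the empty port,
the queue and the worker details are only illustrative assumptions; ctypes
releases the GIL around the foreign call, so the wait itself does not block
other Python threads):

    import ctypes
    import ctypes.wintypes as wt
    import queue
    import threading

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateIoCompletionPort.restype = wt.HANDLE
    kernel32.CreateIoCompletionPort.argtypes = [wt.HANDLE, wt.HANDLE,
                                                ctypes.c_void_p, wt.DWORD]
    kernel32.GetQueuedCompletionStatus.restype = wt.BOOL
    kernel32.GetQueuedCompletionStatus.argtypes = [
        wt.HANDLE, ctypes.POINTER(wt.DWORD), ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_void_p), wt.DWORD]

    INVALID_HANDLE_VALUE = wt.HANDLE(-1)
    INFINITE = 0xFFFFFFFF

    # An empty port for illustration; real code would associate sockets or
    # files with it so that completions actually arrive.
    port = kernel32.CreateIoCompletionPort(INVALID_HANDLE_VALUE, None, None, 0)
    completions = queue.Queue()

    def worker():
        nbytes = wt.DWORD()
        key = ctypes.c_void_p()
        overlapped = ctypes.c_void_p()
        while True:
            # The GIL is released for the duration of the foreign call.
            ok = kernel32.GetQueuedCompletionStatus(
                port, ctypes.byref(nbytes), ctypes.byref(key),
                ctypes.byref(overlapped), INFINITE)
            # Back in Python we hold the GIL again; hand the packet over.
            completions.put((bool(ok), nbytes.value, key.value, overlapped.value))

    pool = [threading.Thread(target=worker, daemon=True) for i in range(4)]
    for t in pool:
        t.start()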

Using a pool of non-Python threads in between would also take some of the scalability of IOCPs away. The thread that was the last to run is the first to be woken up on IO completion. That way the kernel wakes up a thread that is likely still in cache. But if you prevent the IOCP from waking up the thread that will call back to Python, then this scalability trick is of no value.

Sturla



From guido at python.org  Wed Nov 28 01:44:05 2012
From: guido at python.org (Guido van Rossum)
Date: Tue, 27 Nov 2012 16:44:05 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128001514.GQ91191@snakebite.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
Message-ID: <CAP7+vJL0MjTtMZCaGObscv05RrKY7h=tSh19XDAAmvWxduSMQQ@mail.gmail.com>

On Tue, Nov 27, 2012 at 4:15 PM, Trent Nelson <trent at snakebite.org> wrote:
>     The rationale for all of this is that this approach should scale
>     better when heavily loaded (i.e. tens of thousands of connections
>     and/or Gb/s traffic).  When you're dealing with that sort of load
>     on a many-core machine (let's say 16+ cores), an interlocked list
>     is going to reduce latency versus 16+ threads constantly vying for
>     the GIL.
>
>     (That's the theory, at least.)

But why would you need 15 cores to shuffle the bytes around when you
have only 1 to run the Python code that responds to those bytes?

-- 
--Guido van Rossum (python.org/~guido)


From sturla at molden.no  Wed Nov 28 01:47:26 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 01:47:26 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128001514.GQ91191@snakebite.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
Message-ID: <E64CF25E-AC32-420D-B342-573070A6FA55@molden.no>

Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Right, that's why I proposed using non-Python types as buffers
>    whilst in the background IO threads.  Once the thread finishes
>    processing the event, it pushes the necessary details onto a
>    global interlocked list.  ("Details" being event type and possibly
>    a data buffer if the event was 'data received'.)
> 
>    Then, when aio.events() is called, CPython code (holding the GIL)
>    does an interlocked/atomic flush/pop_all, creates the relevant
>    Python objects for the events, and returns them in a list for
>    the calling code to iterate over.
> 
>    The rationale for all of this is that this approach should scale
>    better when heavily loaded (i.e. tens of thousands of connections
>    and/or Gb/s traffic).  When you're dealing with that sort of load
>    on a many-core machine (let's say 16+ cores), an interlocked list
>    is going to reduce latency versus 16+ threads constantly vying for
>    the GIL.
> 
>    

Sure, the GIL is a lot more expensive than a simple mutex (or an interlocked list), so avoiding a GIL-release and reacquire on each io completion event might be an advantage.

Sturla

From sturla at molden.no  Wed Nov 28 01:56:19 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 01:56:19 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJL0MjTtMZCaGObscv05RrKY7h=tSh19XDAAmvWxduSMQQ@mail.gmail.com>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<CAP7+vJL0MjTtMZCaGObscv05RrKY7h=tSh19XDAAmvWxduSMQQ@mail.gmail.com>
Message-ID: <ABC1D9A2-7302-4CF5-870C-9F25D3CFC936@molden.no>

Den 28. nov. 2012 kl. 01:44 skrev Guido van Rossum <guido at python.org>:

> 
> But why would you need 15 cores to shuffle the bytes around when you
> have only 1 to run the Python code that responds to those bytes?
> 

I don't think he needs all 15 cores for that, assuming the Python processing is the more expensive part.

But it would avoid having the GIL change hands on each asynch i/o event.

It would also keep the thread that runs the interpreter in cache, if it never goes to sleep. That way objects associated with the interpreter would not be swapped in and out, but kept in cache.


Sturla

From sturla at molden.no  Wed Nov 28 02:09:17 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 02:09:17 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128001514.GQ91191@snakebite.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
Message-ID: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>


Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent at snakebite.org>:

> 
>    Right, that's why I proposed using non-Python types as buffers
>    whilst in the background IO threads.  Once the thread finishes
>    processing the event, it pushes the necessary details onto a
>    global interlocked list.  ("Details" being event type and possibly
>    a data buffer if the event was 'data received'.)
> 
>    Then, when aio.events() is called, CPython code (holding the GIL)
>    does an interlocked/atomic flush/pop_all, creates the relevant
>    Python objects for the events, and returns them in a list for
>    the calling code to iterate over.
> 
>    The rationale for all of this is that this approach should scale
>    better when heavily loaded (i.e. tens of thousands of connections
>    and/or Gb/s traffic).  When you're dealing with that sort of load
>    on a many-core machine (let's say 16+ cores), an interlocked list
>    is going to reduce latency versus 16+ threads constantly vying for
>    the GIL.
> 

Sorry. I changed my mind. I believe you are right after all :-)

I see two benefits:

1. It avoids contention for the GIL and avoids excessive context shifts in the CPython interpreter.

2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.

So yes, it might be better after all :-)


I don't think it would matter much for multicore scalability, as the Python processing is likely the more expensive part.


Sturla

From trent at snakebite.org  Wed Nov 28 03:11:40 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 21:11:40 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJL0MjTtMZCaGObscv05RrKY7h=tSh19XDAAmvWxduSMQQ@mail.gmail.com>
References: <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<CAP7+vJL0MjTtMZCaGObscv05RrKY7h=tSh19XDAAmvWxduSMQQ@mail.gmail.com>
Message-ID: <20121128021139.GA93345@snakebite.org>

On Tue, Nov 27, 2012 at 04:44:05PM -0800, Guido van Rossum wrote:
> On Tue, Nov 27, 2012 at 4:15 PM, Trent Nelson <trent at snakebite.org> wrote:
> >     The rationale for all of this is that this approach should scale
> >     better when heavily loaded (i.e. tens of thousands of connections
> >     and/or Gb/s traffic).  When you're dealing with that sort of load
> >     on a many-core machine (let's say 16+ cores), an interlocked list
> >     is going to reduce latency versus 16+ threads constantly vying for
> >     the GIL.
> >
> >     (That's the theory, at least.)
> 
> But why would you need 15 cores to shuffle the bytes around when you
> have only 1 to run the Python code that responds to those bytes?

    There are a few advantages.  For one, something like this:

        with aio.open('1GB-file-on-a-fast-SSD.raw', 'r') as f:
            data = f.read()

    Or even just:

        with aio.open('/dev/zero', 'rb') as f:
            data = f.read(1024 * 1024 * 1024)

    Would basically complete as fast as is physically possible to read
    the bytes off the device.  That's pretty cool.  Ditto for write.

    Sturla touched on some of the other advantages regarding cache
    locality, reduced context switching and absence of any lock
    contention.

    When using the `for event in aio.events()` approach, sure, you've
    only got one Python thread, but nothing blocks, and you'll be able
    to churn away on as many events per second as a single core allows.

    On more powerful boxes, you'll eventually hit a limit where the
    single core event loop can't keep up with the data being serviced by
    16+ threads.  That's where this chunk of my original e-mail becomes
    relevant:

        > So, let's assume that's all implemented and working in 3.4.  The
        > drawback of this approach is that even though we've allowed for
        > some actual threaded concurrency via background IO threads, the
        > main Python code that loops over aio.events() is still limited
        > to executing on a single core.  Albeit, in a very tight loop that
        > never blocks and would probably be able to process an insane number
        > of events per second when pegging a single core at 100%.

        > So, that's 3.4.  Perhaps in 3.5 we could add automatic support for
        > multiprocessing once the number of events per-poll reach a certain
        > threshold.  The event loop automatically spreads out the processing
        > of events via multiprocessing, facilitating multiple core usage both
        > via background threads *and* Python code.  (And we could probably do
        > some optimizations such that the background IO thread always queues
        > up events for the same multiprocessing instance -- which would yield
        > even more benefits if we had fancy "buffer inheritance" stuff that
        > removes the need to continually copy data from the background IO
        > buffers to the foreground CPython code.)

    So, down the track, we could explore options for future scaling via
    something like multiprocessing when the number of incoming events
    exceeds the ability of the single core `for event in aio.events()`
    event loop.
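
    For concreteness, that single-core loop might look roughly like the
    following (aio.events() is the proposed call; the event attributes and
    the handler table are purely illustrative assumptions):

        # Hypothetical sketch: only aio.events() comes from the proposal
        # above, everything else is made up for illustration.
        def run(handlers):
            while True:
                for event in aio.events():     # one flush of queued events
                    handler = handlers.get(event.type)
                    if handler is not None:
                        handler(event)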

        Trent.


From trent at snakebite.org  Wed Nov 28 03:18:17 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 21:18:17 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <4D8D46FA-E172-4A62-B471-1ED1CFC3CD24@molden.no>
References: <20121127174933.GC91191@snakebite.org>
	<B0DC26DA-6EAC-4437-B3B2-9F6605F44102@molden.no>
	<20121128000133.GP91191@snakebite.org>
	<4D8D46FA-E172-4A62-B471-1ED1CFC3CD24@molden.no>
Message-ID: <20121128021816.GB93345@snakebite.org>

On Tue, Nov 27, 2012 at 04:25:10PM -0800, Sturla Molden wrote:
> 
> 
> Den 28. nov. 2012 kl. 01:01 skrev Trent Nelson <trent at snakebite.org>:
> 
> > 
> >    Perhaps I was a little too eager to highlight the ability for
> >    these background IO threads to run without needing to acquire the
> >    GIL.  A Python thread could indeed do the same job, however, you
> >    still wouldn't interact with it from Python code as if it were a
> >    normal threading.Thread.
> 
> threading.Thread allows us to spawn a thread in a platform-independent
> way.
> 
> Here is a thread pool: pool = [Thread(target=foobar) for i in
> range(n)]
> 
> These threads can release the GIL and call GetQueuedCompletionStatus.
> They can do any post-processing they want without the GIL. And they
> can return back to Python while holding the GIL.
> 
> Using a pool of non-Python threads in between would also take some of
> the scalability of IOCPs away. The thread that was the last to run is
> the first to be woken up on IO completion. That way the kernel wakes
> up a thread that is likely still in cache. But if you prevent the IOCP
> from waking up the thread that will call back to Python, then this
> scalability trick is of no value.

    I'm not sure where you're getting the idea that I'll be hampering
    the Windows IOCP optimizations that wake the last thread.  Nothing
    I've described would have any impact on that.

        Trent.


From trent at snakebite.org  Wed Nov 28 03:24:27 2012
From: trent at snakebite.org (Trent Nelson)
Date: Tue, 27 Nov 2012 21:24:27 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
References: <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
Message-ID: <20121128022427.GC93345@snakebite.org>

On Tue, Nov 27, 2012 at 05:09:17PM -0800, Sturla Molden wrote:
> 
> Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent at snakebite.org>:
> 
> > 
> >    Right, that's why I proposed using non-Python types as buffers
> >    whilst in the background IO threads.  Once the thread finishes
> >    processing the event, it pushes the necessary details onto a
> >    global interlocked list.  ("Details" being event type and
> >    possibly a data buffer if the event was 'data received'.)
> > 
> >    Then, when aio.events() is called, CPython code (holding the GIL)
> >    does an interlocked/atomic flush/pop_all, creates the relevant
> >    Python objects for the events, and returns them in a list for the
> >    calling code to iterate over.
> > 
> >    The rationale for all of this is that this approach should scale
> >    better when heavily loaded (i.e. tens of thousands of connections
> >    and/or Gb/s traffic).  When you're dealing with that sort of load
> >    on a many-core machine (let's say 16+ cores), an interlocked list
> >    is going to reduce latency versus 16+ threads constantly vying
> >    for the GIL.
> > 
> 
> Sorry. I changed my mind. I believe you are right after all :-)
> 
> I see two benefits:
> 
> 1. It avoids contention for the GIL and avoids excessive context
> shifts in the CPython interpreter.
> 
> 2. It potentially keeps the thread that runs the CPython interpreter
> in cache, as it is always active. And thus it also keeps the objects
> associated with the CPython interpreter in cache.
> 
> So yes, it might be better after all :-)

    That's even sweeter to read considering your initial opposition ;-)

> I don't think it would matter much for multicore scalability, as the
> Python processing is likely the more expensive part.

    Yeah, there will eventually be a point where the number of incoming
    events exceeds the number of events that can be processed by the
    single core event loop.  That's where the ideas for distributing
    event loop processing via multiprocessing come in.  That would be
    way down the track though -- and there is still lots of benefit
    to having the async background IO stuff and a single thread event
    loop.

        Trent.


From guido at python.org  Wed Nov 28 03:59:41 2012
From: guido at python.org (Guido van Rossum)
Date: Tue, 27 Nov 2012 18:59:41 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
Message-ID: <CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>

On Tue, Nov 27, 2012 at 5:09 PM, Sturla Molden <sturla at molden.no> wrote:
>
> Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent at snakebite.org>:
>
>>
>>    Right, that's why I proposed using non-Python types as buffers
>>    whilst in the background IO threads.  Once the thread finishes
>>    processing the event, it pushes the necessary details onto a
>>    global interlocked list.  ("Details" being event type and possibly
>>    a data buffer if the event was 'data received'.)
>>
>>    Then, when aio.events() is called, CPython code (holding the GIL)
>>    does an interlocked/atomic flush/pop_all, creates the relevant
>>    Python objects for the events, and returns them in a list for
>>    the calling code to iterate over.
>>
>>    The rationale for all of this is that this approach should scale
>>    better when heavily loaded (i.e. tens of thousands of connections
>>    and/or Gb/s traffic).  When you're dealing with that sort of load
>>    on a many-core machine (let's say 16+ cores), an interlocked list
>>    is going to reduce latency versus 16+ threads constantly vying for
>>    the GIL.
>>
>
> Sorry. I changed my mind. I believe you are right after all :-)

It's always great to see people change their mind.

> I see two benefits:

I may not understand the proposal any more, but...

> 1. It avoids contention for the GIL and avoids excessive context shifts in the CPython interpreter.

Then why not just have one thread?

> 2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.

So what code runs in the other threads? I think I'm confused...

> So yes, it might be better after all :-)
>
>
> I don't think it would matter much for multicore scalability, as the Python processing is likely the more expensive part.

To benefit from multicore, you need to find something that requires a
lot of CPU time and can be done without holding on to the GIL. If it's
not copying bytes, what is it?

-- 
--Guido van Rossum (python.org/~guido)


From greg.ewing at canterbury.ac.nz  Wed Nov 28 06:20:30 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 28 Nov 2012 18:20:30 +1300
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128001514.GQ91191@snakebite.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
Message-ID: <50B59F1E.8060108@canterbury.ac.nz>

Trent Nelson wrote:
>     When you're dealing with that sort of load
>     on a many-core machine (let's say 16+ cores), an interlocked list
>     is going to reduce latency versus 16+ threads constantly vying for
>     the GIL.

I don't understand. Why is vying for access to an interlocked
list any less latentful than vying for the GIL?

-- 
Greg


From solipsis at pitrou.net  Wed Nov 28 08:02:58 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 28 Nov 2012 08:02:58 +0100
Subject: [Python-ideas] An alternate approach to async IO
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
Message-ID: <20121128080258.52321462@pitrou.net>

On Tue, 27 Nov 2012 19:15:14 -0500
Trent Nelson <trent at snakebite.org> wrote:

> On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
> > On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla at molden.no> wrote:
> > >
> > > Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
> > >
> > >>
> > >>    Right, but with things like interlocked lists, you can make that
> > >>    CPython|background_IO synchronization barrier much more performant
> > >>    than relying on GIL acquisition.
> > >
> > > You always need the GIL to call back to Python.  You don't need it for anything else.
> > 
> > You also need it for any use of an object, even INCREF, unless you
> > know no other thread yet knows about it.
> 
>     Right, that's why I proposed using non-Python types as buffers
>     whilst in the background IO threads.

Trent, once again, please read about Py_buffer.

Thanks

Antoine.




From andrew.svetlov at gmail.com  Wed Nov 28 10:53:36 2012
From: andrew.svetlov at gmail.com (Andrew Svetlov)
Date: Wed, 28 Nov 2012 11:53:36 +0200
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
	<CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
Message-ID: <CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>

On Tue, Nov 27, 2012 at 6:06 PM, Guido van Rossum <guido at python.org> wrote:
> In the past, when I was debugging
> NDB, I've asked in vain whether someone had already made the necessary
> changes to pdb to let it jump over a yield instead of following it --
> I may have to go in and develop a change myself, because this problem
> isn't going away.
>

Do you want new pdb command or change behavior of *step* or *next*?

--
Thanks,
Andrew Svetlov


From sturla at molden.no  Wed Nov 28 13:07:22 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 13:07:22 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
Message-ID: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>


Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido at python.org>:

> 
> Then why not just have one thread?

Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion. There is nothing to do about that, it is an event driven thread pool by design. 

The question is what a thread woken up on io completion should do. If it uses the simplified GIL API to ensure the GIL, this would mean excessive GIL shifting with 64k i/o tasks on a port: Each time one of the 64k tasks is complete, a thread would ensure the GIL. That is unlikely to be very scalable.

So what Trent suggested is to just have these threads enqueue some data about the completed task and go back to sleep.

That way the main "Python thread" would never loose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).

> 
>> 2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
> 
> So what code runs in the other threads? I think I'm confused...

Almost nothing. They sleep on the IOCP, wake up on i/o completion, put the completed task in a queue, and go back to waiting/sleeping on the port. But they never attempt to acquire the GIL.

Sturla

 





From sturla at molden.no  Wed Nov 28 13:10:42 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 13:10:42 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B59F1E.8060108@canterbury.ac.nz>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<50B59F1E.8060108@canterbury.ac.nz>
Message-ID: <FCBC9F0E-E63B-4824-8F94-7284FBA62B88@molden.no>



Den 28. nov. 2012 kl. 06:20 skrev Greg Ewing <greg.ewing at canterbury.ac.nz>:

> Trent Nelson wrote:
>>    When you're dealing with that sort of load
>>    on a many-core machine (let's say 16+ cores), an interlocked list
>>    is going to reduce latency versus 16+ threads constantly vying for
>>    the GIL.
> 
> I don't understand. Why is vying for access to an interlocked
> list any less latentful than vying for the GIL?

Because ensuring and releasing the GIL is more expensive than an atomic read/write.

And because the thread running Python would go to sleep while an IOCP thread is active, and perhaps be moved out of CPU cache.

Sturla


From trent at snakebite.org  Wed Nov 28 13:36:01 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 07:36:01 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128080258.52321462@pitrou.net>
References: <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<20121128080258.52321462@pitrou.net>
Message-ID: <20121128123600.GB93849@snakebite.org>

On Tue, Nov 27, 2012 at 11:02:58PM -0800, Antoine Pitrou wrote:
> On Tue, 27 Nov 2012 19:15:14 -0500
> Trent Nelson <trent at snakebite.org> wrote:
> 
> > On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
> > > On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla at molden.no> wrote:
> > > >
> > > > Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent at snakebite.org>:
> > > >
> > > >>
> > > >>    Right, but with things like interlocked lists, you can make that
> > > >>    CPython|background_IO synchronization barrier much more performant
> > > >>    than relying on GIL acquisition.
> > > >
> > > > You always need the GIL to call back to Python.  You don't need it for anything else.
> > > 
> > > You also need it for any use of an object, even INCREF, unless you
> > > know no other thread yet knows about it.
> > 
> >     Right, that's why I proposed using non-Python types as buffers
> >     whilst in the background IO threads.
> 
> Trent, once again, please read about Py_buffer.

    Sorry, I did see your previous e-mail, honest.  Please interpret
    that sentence as "that's why I proposed using something that doesn't
    need to hold the GIL in the background IO threads".  Where 'something'
    sounds like it should be Py_buffer ;-)

        Trent.


From trent at snakebite.org  Wed Nov 28 13:49:12 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 07:49:12 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B59F1E.8060108@canterbury.ac.nz>
References: <20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<50B59F1E.8060108@canterbury.ac.nz>
Message-ID: <20121128124912.GC93849@snakebite.org>

On Tue, Nov 27, 2012 at 09:20:30PM -0800, Greg Ewing wrote:
> Trent Nelson wrote:
> >     When you're dealing with that sort of load
> >     on a many-core machine (let's say 16+ cores), an interlocked list
> >     is going to reduce latency versus 16+ threads constantly vying for
> >     the GIL.
> 
> I don't understand. Why is vying for access to an interlocked
> list any less latentful than vying for the GIL?

    I think not having to contend with the interpreter would make a big
    difference under load.  A push to an interlocked list will be more
    performant than having all threads attempt to do GIL acquire ->
    PyList_Append() -> GIL release.  The key to getting high performance
    (either low latency or high throughput) with the background IO stuff
    is ensuring the threads complete their work as quickly as possible.

    The quicker they process an event, the quicker they can process
    another event, the higher the overall throughput and the lower the
    overall latency.  Doing a GIL acquire/PyList_Append()/GIL release
    at the end of the event would add a huge overhead that simply would
    not be present with an interlocked push.

    Also, as soon as you call back into CPython from the background
    thread, your cache footprint explodes, which isn't desirable.

        Trent.


From trent at snakebite.org  Wed Nov 28 13:52:31 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 07:52:31 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
References: <17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
Message-ID: <20121128125230.GD93849@snakebite.org>

On Wed, Nov 28, 2012 at 04:07:22AM -0800, Sturla Molden wrote:
> 
> Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido at python.org>:
> 
> > 
> > Then why not just have one thread?
> 
> Because of the way IOCPs work on Windows: A pool of threads is waiting
> on the i/o completion port, one thread from the pool is woken up on
> i/o completion. There is nothing to do about that, it is an event
> driven thread pool by design. 
> 
> The question is what a thread woken up on io completion should do. If
> it uses the simplified GIL API to ensure the GIL, this would mean
> excessive GIL shifting with 64k i/o tasks on a port: Each time one of
> the 64k tasks is complete, a thread would ensure the GIL. That is
> unlikely to be very scalable.
> 
> So what Trent suggested is to just have these threads enqueue some
> data about the completed task and go back to sleep.
> 
> That way the main "Python thread" would never lose the GIL to a
> thread from the IOCP. Instead it would shortly busy-wait while a
> completed task is inserted into the queue. Thus synchronization by the
> GIL is replaced by a spinlock protecting a queue (or an interlocked
> list on recent Windows versions).
> 
> > 
> >> 2. It potentially keeps the thread that runs the CPython
> >> interpreter in cache, as it is always active. And thus it also
> >> keeps the objects associated with the CPython interpreter in cache.
> > 
> > So what code runs in the other threads? I think I'm confused...
> 
> Almost nothing. They sleep on the IOCP, wake up on i/o completion,
> put the completed task in a queue, and go back to waiting/sleeping
> on the port. But they never attempt to acquire the GIL.

    Couldn't have said it better myself.  Nice to have you on board
    Sturla ;-)

        Trent.


From shibturn at gmail.com  Wed Nov 28 14:00:35 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Wed, 28 Nov 2012 13:00:35 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
Message-ID: <k951tt$sm2$1@ger.gmane.org>

On 28/11/2012 12:07pm, Sturla Molden wrote:
> That way the main "Python thread" would never lose
> the GIL to a thread from the IOCP. Instead it would shortly
> busy-wait while a completed task is inserted into the queue.
> Thus synchronization by the GIL is replaced by a spinlock
> protecting a queue (or an interlocked list on recent Windows versions).

How do you know the busy wait will be short?

If you are worried about the cost of acquiring and releasing the GIL 
more than necessary, then why not just dequeue as many completion 
packets as possible at once.  (On Vista and later 
GetQueuedCompletionStatusEx() produces an array of completion packets; 
on WinXP you can get a similar effect by calling 
GetQueuedCompletionStatus() in a loop.)

I very much doubt that GetQueuedCompletionStatus*() consumes enough cpu 
time for it to be worth running in a thread pool (particularly if it 
forces busy waiting on you).

-- 
Richard



From sturla at molden.no  Wed Nov 28 16:07:56 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 16:07:56 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k951tt$sm2$1@ger.gmane.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org>
Message-ID: <50B628CC.6070506@molden.no>

On 28.11.2012 14:00, Richard Oudkerk wrote:

> If you are worried about the cost of acquiring and releasing the GIL
> more than necessary, then why not just dequeue as many completion
> packets as possible at once. (On Vista and later
> GetQueuedCompletionStatusEx() produces an array of completion packets;
> on WinXP you can get a similar effect by calling
> GetQueuedCompletionStatus() in a loop.)

Hmm, yes, perhaps---

One could call GetQueuedCompletionStatusEx in a loop and set 
dwMilliseconds to 0 (immediate timeout), possibly with a Sleep(0) if the 
task queue was empty. (Sleep(0) releases the remainder of the time-slice, 
and so prevents spinning on GetQueuedCompletionStatusEx from burning the 
CPU.) If after a while (say 10 ms) there are still no tasks in the queue, 
we release the GIL and call GetQueuedCompletionStatusEx with a longer 
time-out than 0.
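
In pseudo-Python the strategy would be roughly as follows (poll_iocp(timeout_ms)
stands in for a GetQueuedCompletionStatusEx wrapper returning a possibly empty
list of completion packets; the wrapper name, the grace period and the back-off
timeout are all assumptions):

    import time

    def completion_events(poll_iocp, grace_ms=10, slow_timeout_ms=100):
        idle_ms = 0
        while True:
            batch = poll_iocp(0)       # dwMilliseconds = 0: return immediately
            if not batch and idle_ms < grace_ms:
                time.sleep(0)          # like Sleep(0): give up the time slice
                idle_ms += 1           # rough bookkeeping, not wall-clock exact
                continue
            if not batch:
                # Idle for a while: block inside the wrapper with a real
                # timeout (the GIL would be released for that call).
                batch = poll_iocp(slow_timeout_ms)
            if batch:
                idle_ms = 0
            for packet in batch:
                yield packet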

Sturla



From guido at python.org  Wed Nov 28 16:59:04 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 07:59:04 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
Message-ID: <CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>

OK, now I see. (I thought that was how everyone was using IOCP.
Apparently not?) However, the "short busy wait" worries me. What if
your app *doesn't* get a lot of requests?

Isn't the alternative to have a "thread pool" with just one thread,
which runs all the Python code and gets woken up by IOCP when it is
idle and there is a new event? How is Trent's proposal an improvement?

On Wed, Nov 28, 2012 at 4:07 AM, Sturla Molden <sturla at molden.no> wrote:
>
> Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido at python.org>:
>
>>
>> Then why not just have one thread?
>
> Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion. There is nothing to do about that, it is an event driven thread pool by design.
>
> The question is what a thread woken up on io completion should do. If it uses the simplified GIL API to ensure the GIL, this would mean excessive GIL shifting with 64k i/o tasks on a port: Each time one of the 64k tasks is complete, a thread would ensure the GIL. That is unlikely to be very scalable.
>
> So what Trent suggested is to just have these threads enqueue some data about the completed task and go back to sleep.
>
> That way the main "Python thread" would never lose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).
>
>>
>>> 2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
>>
>> So what code runs in the other threads? I think I'm confused...
>
> Almost nothing. They sleep on the IOCP, wake up on i/o completion, put the completed task in a queue, and go back to waiting/sleeping on the port. But they never attempt to acquire the GIL.
>
> Sturla
>
>
>
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas



-- 
--Guido van Rossum (python.org/~guido)


From guido at python.org  Wed Nov 28 18:09:07 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 09:09:07 -0800
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
	<CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
	<CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>
Message-ID: <CAP7+vJ+MmO6TZqs_JoW9W5OLc_i3Wgp47mYOu3D5fBoCaeYsVQ@mail.gmail.com>

On Wed, Nov 28, 2012 at 1:53 AM, Andrew Svetlov
<andrew.svetlov at gmail.com> wrote:
> On Tue, Nov 27, 2012 at 6:06 PM, Guido van Rossum <guido at python.org> wrote:
>> In the past, when I was debugging
>> NDB, I've asked in vain whether someone had already made the necessary
>> changes to pdb to let it jump over a yield instead of following it --
>> I may have to go in and develop a change myself, because this problem
>> isn't going away.
>
> Do you want new pdb command or change behavior of *step* or *next*?

Good question. If it's easier I'd be okay with a new command; but
changing the behavior of "next" would also work, if it's possible. Are
you interested in working on an implementation? I'd be interested in
reviewing it then.

-- 
--Guido van Rossum (python.org/~guido)


From andrew.svetlov at gmail.com  Wed Nov 28 18:39:48 2012
From: andrew.svetlov at gmail.com (Andrew Svetlov)
Date: Wed, 28 Nov 2012 19:39:48 +0200
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAP7+vJ+MmO6TZqs_JoW9W5OLc_i3Wgp47mYOu3D5fBoCaeYsVQ@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
	<CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
	<CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>
	<CAP7+vJ+MmO6TZqs_JoW9W5OLc_i3Wgp47mYOu3D5fBoCaeYsVQ@mail.gmail.com>
Message-ID: <CAL3CFcUwtjOxkPm2enqC8sQYF2H=jzZeT6zM3PXQr80zFU0n7Q@mail.gmail.com>

Probably will try to make a patch this weekend.
Changing behavior of *next* command looks more convenient for end user.

On Wed, Nov 28, 2012 at 7:09 PM, Guido van Rossum <guido at python.org> wrote:
> On Wed, Nov 28, 2012 at 1:53 AM, Andrew Svetlov
> <andrew.svetlov at gmail.com> wrote:
>> On Tue, Nov 27, 2012 at 6:06 PM, Guido van Rossum <guido at python.org> wrote:
>>> In the past, when I was debugging
>>> NDB, I've asked in vain whether someone had already made the necessary
>>> changes to pdb to let it jump over a yield instead of following it --
>>> I may have to go in and develop a change myself, because this problem
>>> isn't going away.
>>
>> Do you want new pdb command or change behavior of *step* or *next*?
>
> Good question. If it's easier I'd be okay with a new command; but
> changing the behavior of "next" would also work, if it's possible. Are
> you interested in working on an implementation? I'd be interested in
> reviewing it then.
>
> --
> --Guido van Rossum (python.org/~guido)



-- 
Thanks,
Andrew Svetlov


From guido at python.org  Wed Nov 28 19:12:21 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 10:12:21 -0800
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAL3CFcUwtjOxkPm2enqC8sQYF2H=jzZeT6zM3PXQr80zFU0n7Q@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
	<CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
	<CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>
	<CAP7+vJ+MmO6TZqs_JoW9W5OLc_i3Wgp47mYOu3D5fBoCaeYsVQ@mail.gmail.com>
	<CAL3CFcUwtjOxkPm2enqC8sQYF2H=jzZeT6zM3PXQr80zFU0n7Q@mail.gmail.com>
Message-ID: <CAP7+vJJDhmu=UzcdFCXew=DmmBqc-LVCqceNZrbRicn9GhrKeA@mail.gmail.com>

Agreed. Would be wonderful!

On Wed, Nov 28, 2012 at 9:39 AM, Andrew Svetlov
<andrew.svetlov at gmail.com> wrote:
> Probably will try to make a patch this weekend.
> Changing behavior of *next* command looks more convenient for end user.
>
> On Wed, Nov 28, 2012 at 7:09 PM, Guido van Rossum <guido at python.org> wrote:
>> On Wed, Nov 28, 2012 at 1:53 AM, Andrew Svetlov
>> <andrew.svetlov at gmail.com> wrote:
>>> On Tue, Nov 27, 2012 at 6:06 PM, Guido van Rossum <guido at python.org> wrote:
>>>> In the past, when I was debugging
>>>> NDB, I've asked in vain whether someone had already made the necessary
>>>> changes to pdb to let it jump over a yield instead of following it --
>>>> I may have to go in and develop a change myself, because this problem
>>>> isn't going away.
>>>
>>> Do you want new pdb command or change behavior of *step* or *next*?
>>
>> Good question. If it's easier I'd be okay with a new command; but
>> changing the behavior of "next" would also work, if it's possible. Are
>> you interested in working on an implementation? I'd be interested in
>> reviewing it then.
>>
>> --
>> --Guido van Rossum (python.org/~guido)
>
>
>
> --
> Thanks,
> Andrew Svetlov



-- 
--Guido van Rossum (python.org/~guido)


From sturla at molden.no  Wed Nov 28 19:57:32 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 19:57:32 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
Message-ID: <50B65E9C.20100@molden.no>

On 28.11.2012 16:59, Guido van Rossum wrote:

> OK, now I see. (I thought that was how everyone was using IOCP.
> Apparently not?) However, the "short busy wait" worries me. What if
> your app *doesn't* get a lot of requests?

Then there would be no busy wait.

> Isn't the alternative to have a "thread pool" with just one thread,
> which runs all the Python code and gets woken up by IOCP when it is
> idle and there is a new event? How is Trent's proposal an improvement?

I'm not sure.

This is what I suggested in my previous mail.



Sturla


From sturla at molden.no  Wed Nov 28 19:59:43 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 19:59:43 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B63277.4@gmail.com>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com>
Message-ID: <50B65F1F.6040302@molden.no>

On 28.11.2012 16:49, Richard Oudkerk wrote:

> You are assuming that GetQueuedCompletionStatus*() will never block
> because of lack of work.

GetQueuedCompletionStatusEx takes a time-out argument, it can be zero.

Sturla


From shibturn at gmail.com  Wed Nov 28 20:11:42 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Wed, 28 Nov 2012 19:11:42 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B65F1F.6040302@molden.no>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
Message-ID: <k95nlo$m7d$1@ger.gmane.org>

On 28/11/2012 6:59pm, Sturla Molden wrote:
>
>> You are assuming that GetQueuedCompletionStatus*() will never block
>> because of lack of work.
>
> GetQueuedCompletionStatusEx takes a time-out argument, it can be zero.
>
> Sturla

According to your (or Trent's) idea the main thread busy waits until the 
interlocked list is non-empty.  If there is no work to do then the 
interlocked list is empty and the main thread will busy wait till there 
is work to do, which might be for a long time.

-- 
Richard



From sturla at molden.no  Wed Nov 28 20:22:34 2012
From: sturla at molden.no (Sturla Molden)
Date: Wed, 28 Nov 2012 20:22:34 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k95nlo$m7d$1@ger.gmane.org>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org>
Message-ID: <50B6647A.502@molden.no>

On 28.11.2012 20:11, Richard Oudkerk wrote:

> According to your (or Trent's) idea the main thread busy waits until the
> interlocked list is non-empty. If there is no work to do then the
> interlocked list is empty and the main thread will busy wait till there
> is work to do, which might be for a long time.

That would not be an advantage. Surely it should time-out or at least 
stop busy-waiting at some point...

But I am not sure if a list like Trent described is better than just 
calling GetQueuedCompletionStatusEx from the Python thread. One could 
busy-wait with 0 timeout for a while, and then at some point use a few 
ms timeouts (1 or 2, or perhaps 10).

IOCPs set up a task queue, so I am back to thinking that stacking two 
task queues after each other does not help very much---


Sturla



From trent at snakebite.org  Wed Nov 28 20:23:47 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 14:23:47 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k95nlo$m7d$1@ger.gmane.org>
References: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org>
Message-ID: <20121128192346.GE93849@snakebite.org>

On Wed, Nov 28, 2012 at 11:11:42AM -0800, Richard Oudkerk wrote:
> On 28/11/2012 6:59pm, Sturla Molden wrote:
> >
> >> You are assuming that GetQueuedCompletionStatus*() will never block
> >> because of lack of work.
> >
> > GetQueuedCompletionStatusEx takes a time-out argument, it can be zero.
> >
> > Sturla
> 
> According to your (or Trent's) idea the main thread busy waits until the 
> interlocked list is non-empty.  If there is no work to do then the 
> interlocked list is empty and the main thread will busy wait till there 
> is work to do, which might be for a long time.

    Oooer, that's definitely not what I had in mind.  This is how I
    envisioned it working (think of events() as similar to poll()):

        with aio.events() as events:
            for event in events:
                # process event
                ...

    That aio.events() call would result in an InterlockedSListFlush,
    returning the entire list of available events.  It then does the
    conversion into a CPython event type, bundles everything into a
    list, then returns.

    (In reality, there'd be a bit more glue to handle an empty list
     a bit more gracefully, and probably a timeout to aio.events().
     Nothing should involve a spinlock though.)
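
    (For illustration only -- none of this is an actual API -- the intent
     could be modelled in pure Python roughly as below, with events()
     doing the single flush internally; _flush_results() is a made-up
     stand-in for the InterlockedSListFlush + conversion step done in C:)

        import contextlib

        @contextlib.contextmanager
        def events(timeout=None):
            # One flush grabs every completion the IO threads have pushed
            # since the last call; the result is already a plain Python
            # list of event objects.
            yield _flush_results(timeout)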

        Trent.


From shibturn at gmail.com  Wed Nov 28 20:57:29 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Wed, 28 Nov 2012 19:57:29 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128192346.GE93849@snakebite.org>
References: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org> <20121128192346.GE93849@snakebite.org>
Message-ID: <k95qbj$elr$1@ger.gmane.org>

On 28/11/2012 7:23pm, Trent Nelson wrote:
>      Oooer, that's definitely not what I had in mind.  This is how I
>      envisioned it working (think of events() as similar to poll()):
>
>          with aio.events() as events:
>              for event in events:
>                  # process event
>                  ...
>
>      That aio.events() call would result in an InterlockedSListFlush,
>      returning the entire list of available events.  It then does the
>      conversion into a CPython event type, bundles everything into a
>      list, then returns.
>
>      (In reality, there'd be a bit more glue to handle an empty list
>       a bit more gracefully, and probably a timeout to aio.events().
>       Nothing should involve a spinlock though.)
>
>          Trent.

That api is fairly similar to what is in the proactor branch of tulip 
where you can write

     for event in proactor.poll(timeout):
         # process event

But why use a thread pool just to take items from one thread-safe 
(FIFO) queue and put them onto another thread-safe (LIFO) queue?

-- 
Richard



From trent at snakebite.org  Wed Nov 28 21:05:32 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 15:05:32 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
References: <20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
Message-ID: <20121128200532.GF93849@snakebite.org>

On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
> OK, now I see. (I thought that was how everyone was using IOCP.
> Apparently not?) However, the "short busy wait" worries me. What if
> your app *doesn't* get a lot of requests?

    From my response to Richard's concern re: busy waits:

    Oooer, that's definitely not what I had in mind.  This is how I
    envisioned it working (think of events() as similar to poll()):

        with aio.events() as events:
            for event in events:
                # process event
                ...

    That aio.events() call would result in an InterlockedSListFlush,
    returning the entire list of available events.  It then does the
    conversion into a CPython event type, bundles everything into a
    list, then returns.

    (In reality, there'd be a bit more glue to handle an empty list
     a bit more gracefully, and probably a timeout to aio.events().
     Nothing should involve a spinlock though.)

> Isn't the alternative to have a "thread pool" with just one thread,
> which runs all the Python code and gets woken up by IOCP when it is
> idle and there is a new event? How is Trent's proposal an improvement?

    I don't really understand this suggestion :/  It's sort of in line
    with how IOCP is used currently, i.e. "let me tell you when I'm
    ready to process events", which I'm advocating against with this
    idea.

        Trent.


From guido at python.org  Wed Nov 28 21:15:22 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 12:15:22 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128200532.GF93849@snakebite.org>
References: <20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
Message-ID: <CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>

On Wed, Nov 28, 2012 at 12:05 PM, Trent Nelson <trent at snakebite.org> wrote:
> On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
>> OK, now I see. (I thought that was how everyone was using IOCP.
>> Apparently not?) However, the "short busy wait" worries me. What if
>> your app *doesn't* get a lot of requests?
>
>     From my response to Richard's concern re: busy waits:
>
>     Oooer, that's definitely not what I had in mind.  This is how I
>     envisioned it working (think of events() as similar to poll()):
>
>         with aio.events() as events:
>             for event in events:
>                 # process event
>                 ...
>
>     That aio.events() call would result in an InterlockedSListFlush,
>     returning the entire list of available events.  It then does the
>     conversion into a CPython event type, bundles everything into a
>     list, then returns.
>
>     (In reality, there'd be a bit more glue to handle an empty list
>      a bit more gracefully, and probably a timeout to aio.events().
>      Nothing should involve a spinlock though.)
>
>> Isn't the alternative to have a "thread pool" with just one thread,
>> which runs all the Python code and gets woken up by IOCP when it is
>> idle and there is a new event? How is Trent's proposal an improvement?
>
>     I don't really understand this suggestion :/  It's sort of in line
>     with how IOCP is used currently, i.e. "let me tell you when I'm
>     ready to process events", which I'm advocating against with this
>     idea.

Well, but since the proposal also seems to be to keep all Python code
in one thread, that thread still has to say when it's ready to process
events. So, again, what's the big deal? Maybe we just need benchmarks
showing events processed per second for various configurations...
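
(A trivial harness along those lines -- just a sketch, and aio.events()
here is Trent's hypothetical API -- might be:)

    import time

    def events_per_second(duration=10.0):
        processed = 0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            with aio.events() as events:
                for event in events:
                    processed += 1   # a real benchmark would also handle it
        return processed / duration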

-- 
--Guido van Rossum (python.org/~guido)


From trent at snakebite.org  Wed Nov 28 21:18:19 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 15:18:19 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k95qbj$elr$1@ger.gmane.org>
References: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org> <20121128192346.GE93849@snakebite.org>
	<k95qbj$elr$1@ger.gmane.org>
Message-ID: <20121128201819.GG93849@snakebite.org>

On Wed, Nov 28, 2012 at 11:57:29AM -0800, Richard Oudkerk wrote:
> On 28/11/2012 7:23pm, Trent Nelson wrote:
> >      Oooer, that's definitely not what I had in mind.  This is how I
> >      envisioned it working (think of events() as similar to poll()):
> >
> >          with aio.events() as events:
> >              for event in events:
> >                  # process event
> >                  ...
> >
> >      That aio.events() call would result in an InterlockedSListFlush,
> >      returning the entire list of available events.  It then does the
> >      conversion into a CPython event type, bundles everything into a
> >      list, then returns.
> >
> >      (In reality, there'd be a bit more glue to handle an empty list
> >       a bit more gracefully, and probably a timeout to aio.events().
> >       Nothing should involve a spinlock though.)
> >
> >          Trent.
> 
> That api is fairly similar to what is in the proactor branch of tulip 
> where you can write
> 
>      for event in proactor.poll(timeout):
>          # process event
> 
> But why use a use a thread pool just to take items from one thread safe 
> (FIFO) queue and put them onto another thread safe (LIFO) queue?

    I'm not sure how "thread pool" got all the focus suddenly.  That's
    just an implementation detail.  The key thing I'm proposing is that
    we reduce the time involved in processing incoming IO requests.

    Let's ignore everything pre-Vista for the sake of example.  From
    Vista onwards, we don't even need to call GetQueuedCompletionStatus,
    we simply tell the new thread pool APIs which C function to invoke
    upon an incoming event.

    This C function should do as little as possible, and should have as
    small a footprint as possible.  So, no calling CPython, no GIL
    acquisition.  It literally just processes completed events, copying
    data where necessary, then doing an interlocked list push of the
    results, then that's it, done.

    Now, on XP, AIX and Solaris, we'd manually have a little thread
    pool, and each thread would wait on GetQueuedCompletionStatus(Ex)
    or port_get().  That's really the only difference; the main method
    body would be identical to what Windows automatically invokes via
    the thread pool approach in Vista onwards.
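
    (Expressed as Python-ish pseudocode purely to show the shape of that
     shared body -- the real thing would be C, and every name below is
     made up:)

        def completion_callback(completion):
            # Runs on an IO thread: no GIL, no CPython calls.
            entry = copy_completed_data(completion)   # copy the bytes out
            interlocked_push(results_list, entry)     # lock-free LIFO push
            # Done -- the thread is immediately free to service the next
            # completion.  On Vista+ the OS thread pool invokes this
            # directly; on XP/AIX/Solaris a small manual pool would loop
            # on GetQueuedCompletionStatus()/port_get() and call the
            # same body.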

        Trent.


From trent at snakebite.org  Wed Nov 28 21:32:39 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 15:32:39 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
References: <20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
Message-ID: <20121128203238.GH93849@snakebite.org>

On Wed, Nov 28, 2012 at 12:15:22PM -0800, Guido van Rossum wrote:
> On Wed, Nov 28, 2012 at 12:05 PM, Trent Nelson <trent at snakebite.org> wrote:
> > On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
> >> OK, now I see. (I thought that was how everyone was using IOCP.
> >> Apparently not?) However, the "short busy wait" worries me. What if
> >> your app *doesn't* get a lot of requests?
> >
> >     From my response to Richard's concern re: busy waits:
> >
> >     Oooer, that's definitely not what I had in mind.  This is how I
> >     envisioned it working (think of events() as similar to poll()):
> >
> >         with aio.events() as events:
> >             for event in events:
> >                 # process event
> >                 ...
> >
> >     That aio.events() call would result in an InterlockedSListFlush,
> >     returning the entire list of available events.  It then does the
> >     conversion into a CPython event type, bundles everything into a
> >     list, then returns.
> >
> >     (In reality, there'd be a bit more glue to handle an empty list
> >      a bit more gracefully, and probably a timeout to aio.events().
> >      Nothing should involve a spinlock though.)
> >
> >> Isn't the alternative to have a "thread pool" with just one thread,
> >> which runs all the Python code and gets woken up by IOCP when it is
> >> idle and there is a new event? How is Trent's proposal an improvement?
> >
> >     I don't really understand this suggestion :/  It's sort of in line
> >     with how IOCP is used currently, i.e. "let me tell you when I'm
> >     ready to process events", which I'm advocating against with this
> >     idea.
> 
> Well, but since the proposal also seems to be to keep all Python code
> in one thread, that thread still has to say when it's ready to process
> events.

    Right, so, I'm arguing that with my approach, because the background
    IO thread stuff is as optimal as it can be -- more IO events would
    be available per event loop iteration, and the latency between the
    event occurring versus when the event loop picks it up would be
    reduced.  The theory being that that will result in higher through-
    put and lower latency in practice.

    Also, from a previous e-mail, this:

        with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
            data = f.read()

    Or even just:

        with aio.open('/dev/zero', 'rb') as f:
            data = f.read(1024 * 1024 * 1024)

    Would basically complete as fast as it is physically possible to read
    the bytes off the device.  If you've got 16+ cores, then you'll have
    16 cores able to service IO interrupts in parallel.  So, the overall
    time to suck in a chunk of data will be vastly reduced.

    There's no other way to get this sort of performance without taking
    my approach.

> So, again, what's the big deal? Maybe we just need benchmarks
> showing events processed per second for various configurations...

    Definitely agree with the need for benchmarks.  (I'm going to set up
    an 8-core Snakebite box w/ Windows 2012 server specifically for this
    purpose, I think.)

        Trent.


From guido at python.org  Wed Nov 28 21:49:51 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 12:49:51 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128203238.GH93849@snakebite.org>
References: <20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
Message-ID: <CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>

On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org> wrote:
>     Right, so, I'm arguing that with my approach, because the background
>     IO thread stuff is as optimal as it can be -- more IO events would
>     be available per event loop iteration, and the latency between the
>     event occurring versus when the event loop picks it up would be
>     reduced.  The theory being that that will result in higher through-
>     put and lower latency in practice.
>
>     Also, from a previous e-mail, this:
>
>         with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
>             data = f.read()
>
>     Or even just:
>
>         with aio.open('/dev/zero', 'rb') as f:
>             data = f.read(1024 * 1024 * 1024)
>
>     Would basically complete as fast as it physically possible to read
>     the bytes off the device.  If you've got 16+ cores, then you'll have
>     16 cores able to service IO interrupts in parallel.  So, the overall
>     time to suck in a chunk of data will be vastly reduced.
>
>     There's no other way to get this sort of performance without taking
>     my approach.

So there's something I fundamentally don't understand. Why do those
calls, made synchronously in today's CPython, not already run as fast
as you can get the bytes off the device? I assume it's just a transfer
from kernel memory to user memory. So what is the advantage of using
aio over

  with open(<file>, 'rb') as f:
      data = f.read()

? Is it just that you can run other Python code *while* the I/O is
happening as fast as possible in a separate thread? But that would
also be the case when using (real) threads in today's CPython. What am
I missing?
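
(For comparison, the plain-threads alternative I mean is just something
like the sketch below; the read releases the GIL while it blocks, so
other Python code keeps running:)

    from concurrent.futures import ThreadPoolExecutor

    def read_file(path):
        with open(path, 'rb') as f:
            return f.read()

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(read_file, '1GB-file-on-a-fast-SSD.raw')
        # ... other work happens here while the read is in flight ...
        data = future.result()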

-- 
--Guido van Rossum (python.org/~guido)


From trent at snakebite.org  Wed Nov 28 22:02:34 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 16:02:34 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
References: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
Message-ID: <20121128210233.GI93849@snakebite.org>

On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
> On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org> wrote:
> >     Right, so, I'm arguing that with my approach, because the background
> >     IO thread stuff is as optimal as it can be -- more IO events would
> >     be available per event loop iteration, and the latency between the
> >     event occurring versus when the event loop picks it up would be
> >     reduced.  The theory being that that will result in higher through-
> >     put and lower latency in practice.
> >
> >     Also, from a previous e-mail, this:
> >
> >         with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
> >             data = f.read()
> >
> >     Or even just:
> >
> >         with aio.open('/dev/zero', 'rb') as f:
> >             data = f.read(1024 * 1024 * 1024)
> >
> >     Would basically complete as fast as it physically possible to read
> >     the bytes off the device.  If you've got 16+ cores, then you'll have
> >     16 cores able to service IO interrupts in parallel.  So, the overall
> >     time to suck in a chunk of data will be vastly reduced.
> >
> >     There's no other way to get this sort of performance without taking
> >     my approach.
> 
> So there's something I fundamentally don't understand. Why do those
> calls, made synchronously in today's CPython, not already run as fast
> as you can get the bytes off the device? I assume it's just a transfer
> from kernel memory to user memory. So what is the advantage of using
> aio over
> 
>   with open(<file>, 'rb') as f:
>       data = f.read()

    Ah, right.  That's where the OVERLAPPED aspect comes into play.
    (Other than Windows and AIX, I don't think any other OS provides
     an overlapped IO facility?)

    The difference being, instead of having one thread writing to a 1GB
    buffer, 4KB at a time, you have 16 threads writing to an overlapped
    1GB buffer, 4KB at a time.

    (Assuming you have 16+ cores, and IO interrupts are coming in whilst
     existing threads are still servicing previous completions.)

        Trent.


From guido at python.org  Wed Nov 28 22:18:48 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 13:18:48 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128210233.GI93849@snakebite.org>
References: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
Message-ID: <CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>

On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent at snakebite.org> wrote:

> On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
> > On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org>
> wrote:
> > >     Right, so, I'm arguing that with my approach, because the
> background
> > >     IO thread stuff is as optimal as it can be -- more IO events would
> > >     be available per event loop iteration, and the latency between the
> > >     event occurring versus when the event loop picks it up would be
> > >     reduced.  The theory being that that will result in higher through-
> > >     put and lower latency in practice.
> > >
> > >     Also, from a previous e-mail, this:
> > >
> > >         with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
> > >             data = f.read()
> > >
> > >     Or even just:
> > >
> > >         with aio.open('/dev/zero', 'rb') as f:
> > >             data = f.read(1024 * 1024 * 1024)
> > >
> > >     Would basically complete as fast as it physically possible to read
> > >     the bytes off the device.  If you've got 16+ cores, then you'll
> have
> > >     16 cores able to service IO interrupts in parallel.  So, the
> overall
> > >     time to suck in a chunk of data will be vastly reduced.
> > >
> > >     There's no other way to get this sort of performance without taking
> > >     my approach.
> >
> > So there's something I fundamentally don't understand. Why do those
> > calls, made synchronously in today's CPython, not already run as fast
> > as you can get the bytes off the device? I assume it's just a transfer
> > from kernel memory to user memory. So what is the advantage of using
> > aio over
> >
> >   with open(<file>, 'rb') as f:
> >       data = f.read()
>
>     Ah, right.  That's where the OVERLAPPED aspect comes into play.
>     (Other than Windows and AIX, I don't think any other OS provides
>      an overlapped IO facility?)
>
>     The difference being, instead of having one thread writing to a 1GB
>     buffer, 4KB at a time, you have 16 threads writing to an overlapped
>     1GB buffer, 4KB at a time.
>
>     (Assuming you have 16+ cores, and IO interrupts are coming in whilst
>      existing threads are still servicing previous completions.)
>
>         Trent.
>

Aha. So these are kernel threads? Is the bandwidth of the I/O channel
really higher than one CPU can copy bytes across a user/kernel boundary?

-- 
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121128/1bc1b7f3/attachment.html>

From greg.ewing at canterbury.ac.nz  Wed Nov 28 22:28:29 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 29 Nov 2012 10:28:29 +1300
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
References: <20121127195913.GE91191@snakebite.org> <k936t7$2hc$1@ger.gmane.org>
	<20121127201946.GF91191@snakebite.org>
	<20121127215414.22ffbefc@pitrou.net>
	<17DA6F12-FE22-4918-84C1-962CA4D31E89@molden.no>
	<20121127214821.GG91191@snakebite.org>
	<96FF3B90-58BA-49B8-AFA5-F09B1AC084CE@molden.no>
	<20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
Message-ID: <50B681FD.9040806@canterbury.ac.nz>

Sturla Molden wrote:
> 
>>Then why not just have one thread?
>
> Because of the way IOCPs work on Windows: A pool of threads is waiting on the
> i/o completion port, one thread from the pool is woken up on i/o completion.

But does it *have* to be used that way? The OP claimed that Twisted was using
a single thread that explicitly polls the IOCP, instead of an OS-managed
thread pool.

If that's possible, the question is then whether the extra complexity of
having I/O threads wake up the Python thread gains you anything, given that
the Python thread will be doing the bulk of the work.

-- 
Greg


From ncoghlan at gmail.com  Wed Nov 28 22:41:24 2012
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 29 Nov 2012 07:41:24 +1000
Subject: [Python-ideas] WSAPoll and tulip
In-Reply-To: <CAP7+vJJDhmu=UzcdFCXew=DmmBqc-LVCqceNZrbRicn9GhrKeA@mail.gmail.com>
References: <20121127123325.GH90314@snakebite.org>
	<20121127154204.5fc81457@pitrou.net>
	<20121127150330.GB91191@snakebite.org>
	<CAP7+vJJPgWh4gY=1=yxWYtP9pyaOyy4DHDT592d=7LiP_BnzVQ@mail.gmail.com>
	<CAL3CFcXT5MkoOpbuwAhDsFjBneS76_zhaHMHgovQ4HZD2pg8eA@mail.gmail.com>
	<CAP7+vJ+MmO6TZqs_JoW9W5OLc_i3Wgp47mYOu3D5fBoCaeYsVQ@mail.gmail.com>
	<CAL3CFcUwtjOxkPm2enqC8sQYF2H=jzZeT6zM3PXQr80zFU0n7Q@mail.gmail.com>
	<CAP7+vJJDhmu=UzcdFCXew=DmmBqc-LVCqceNZrbRicn9GhrKeA@mail.gmail.com>
Message-ID: <CADiSq7cVciR1=C5osZKK1Umi+YqJA9xq=rXr6dt-h3sQVXwa4g@mail.gmail.com>

That will need to be well highlighted in What's New,  as it could be very
confusing if the iterator is never called again.

--
Sent from my phone, thus the relative brevity :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121129/1bace040/attachment.html>

From greg.ewing at canterbury.ac.nz  Wed Nov 28 23:28:35 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 29 Nov 2012 11:28:35 +1300
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128203238.GH93849@snakebite.org>
References: <20121127223644.GJ91191@snakebite.org>
	<471EE574-81CC-4260-A0FA-9198655674C4@molden.no>
	<CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
Message-ID: <50B69013.7030301@canterbury.ac.nz>

Trent Nelson wrote:
>     I'm arguing that with my approach, because the background
>     IO thread stuff is as optimal as it can be -- more IO events would
>     be available per event loop iteration, and the latency between the
>     event occurring versus when the event loop picks it up would be
>     reduced.  The theory being that that will result in higher through-
>     put and lower latency in practice.

But the data still has to wait around somewhere until the Python
thread gets around to dealing with it. I don't see why it's
better for it to sit around in the interlocked list than it is
for the completion packets to just wait in the IOCP until the
Python thread is ready.

-- 
Greg


From trent at snakebite.org  Wed Nov 28 23:40:34 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 17:40:34 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
References: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
Message-ID: <20121128224034.GJ93849@snakebite.org>

On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:
>    On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent at snakebite.org> wrote:
> 
>      On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
> > On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org>
>      wrote:
> > > Right, so, I'm arguing that with my approach, because the background
> > > IO thread stuff is as optimal as it can be -- more IO events would
> > > be available per event loop iteration, and the latency between the
> > > event occurring versus when the event loop picks it up would be
> > > reduced.  The theory being that that will result in higher through-
> > > put and lower latency in practice.
> > >
> > > Also, from a previous e-mail, this:
> > >
> > >     with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
> > >         data = f.read()
> > >
> > > Or even just:
> > >
> > >     with aio.open('/dev/zero', 'rb') as f:
> > >         data = f.read(1024 * 1024 * 1024)
> > >
> > > Would basically complete as fast as it physically possible to read
> > > the bytes off the device.  If you've got 16+ cores, then you'll have
> > > 16 cores able to service IO interrupts in parallel.  So, the overall
> > > time to suck in a chunk of data will be vastly reduced.
> > >
> > > There's no other way to get this sort of performance without taking
> > > my approach.
> >
> > So there's something I fundamentally don't understand. Why do those
> > calls, made synchronously in today's CPython, not already run as fast
> > as you can get the bytes off the device? I assume it's just a transfer
> > from kernel memory to user memory. So what is the advantage of using
> > aio over
> >
> >   with open(<file>, 'rb') as f:
> >       data = f.read()
> 
>          Ah, right.  That's where the OVERLAPPED aspect comes into play.
>          (Other than Windows and AIX, I don't think any other OS provides
>           an overlapped IO facility?)
> 
>          The difference being, instead of having one thread writing to a 1GB
>          buffer, 4KB at a time, you have 16 threads writing to an overlapped
>          1GB buffer, 4KB at a time.
> 
>          (Assuming you have 16+ cores, and IO interrupts are coming in whilst
>           existing threads are still servicing previous completions.)
>              Trent.
> 
> Aha. So these are kernel threads?

    Sort-of-but-not-really.  In Vista onwards, you don't even work with
    threads directly, you just provide a callback, and Windows does all
    sorts of thread pool magic behind the scenes to allow overlapped IO.

> Is the bandwidth of the I/O channel really higher than one CPU can
> copy bytes across a user/kernel boundary?

    Ah, good question!  Sometimes yes, sometimes no.  Depends on the
    hardware.  If you're reading from a single IO source like a file
    on a disk, it would have to be one hell of a fast disk and one
    super slow CPU before that would happen.

    However, consider this:

        aio.readfile('1GB-raw.1', buf1)
        aio.readfile('1GB-raw.2', buf2)
        aio.readfile('1GB-raw.3', buf3)
        ...

        with aio.events() as events:
            for event in events:
                if event.type == EventType.FileReadComplete:
                    aio.writefile(event.fname + '.bak', event.buf)

                if event.type == EventType.FileWriteComplete:
                    log.debug('backed up ' + event.fname)

                if event.type == EventType.FileWriteFailed:
                    log.error('failed to back up ' + event.fname)

    aio.readfile() and writefile() return instantly.  With sufficient
    files being handled in parallel, the ability to have 16+ threads
    handle incoming requests instantly would be very useful.

    Second beneficial example would be if you're a socket server with
    65k active connections.  New interrupts will continually be pouring
    in whilst you're still in the middle of copying data from a previous
    interrupt.

    Using my approach, Windows would be free to use as many threads as
    you have cores to service all these incoming requests concurrently.

    Because the threads are so simple and don't touch any CPython stuff,
    their cache footprint will be very small, which is ideal.  All they
    are doing is copying bytes then a quick interlocked list push, so
    they'll run extremely quickly, often within their first quantum,
    which means they're ready to service another request that much
    quicker.

    An important detail probably worth noting at this point: Windows
    won't spawn more threads than there are cores*.  So, if you've got
    all 16 threads tied up contending for the GIL and messing around
    with PyList_Append() etc, you're going to kill your performance;
    it'll take a lot longer to process new requests because the threads
    take so much longer to do their work.

    And compare that with the ultimate performance killer of a single
    thread that periodically calls GetQueuedCompletionStatus when it's
    ready to process some IO, and you can see how strange it would seem
    to take that approach.  You're getting all the complexity of IOCP
    and overlapped IO with absolutely none of the benefits.


        Trent.


From guido at python.org  Wed Nov 28 23:52:56 2012
From: guido at python.org (Guido van Rossum)
Date: Wed, 28 Nov 2012 14:52:56 -0800
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128224034.GJ93849@snakebite.org>
References: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
	<20121128224034.GJ93849@snakebite.org>
Message-ID: <CAP7+vJKNdO1ftdjPYH+fKZHHoQFqafjhj1+vLg-urSgG49Ub6g@mail.gmail.com>

Well, okay, please go benchmark something and don't let my ignorance of
async I/O on Windows discourage you. (I suppose you've actually written code
like this in C or C++ so you know it all works?)

It still looks to me like you'll have a hard time keeping 16 cores busy if
the premise is that you're doing *some* processing in Python (as opposed to
the rather unlikely use case of backing up 1GB files), but it also looks to
me that, if your approach works, it could be sliced into (e.g.) a Twisted
reactor easily without changing Twisted's high-level interfaces in any way.

Do you have an implementation for the "interlocked list" that you mention?

On Wed, Nov 28, 2012 at 2:40 PM, Trent Nelson <trent at snakebite.org> wrote:

> On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:
> >    On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent at snakebite.org>
> wrote:
> >
> >      On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
> > > On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org>
> >      wrote:
> > > > Right, so, I'm arguing that with my approach, because the background
> > > > IO thread stuff is as optimal as it can be -- more IO events would
> > > > be available per event loop iteration, and the latency between the
> > > > event occurring versus when the event loop picks it up would be
> > > > reduced.  The theory being that that will result in higher through-
> > > > put and lower latency in practice.
> > > >
> > > > Also, from a previous e-mail, this:
> > > >
> > > >     with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
> > > >         data = f.read()
> > > >
> > > > Or even just:
> > > >
> > > >     with aio.open('/dev/zero', 'rb') as f:
> > > >         data = f.read(1024 * 1024 * 1024)
> > > >
> > > > Would basically complete as fast as it physically possible to read
> > > > the bytes off the device.  If you've got 16+ cores, then you'll have
> > > > 16 cores able to service IO interrupts in parallel.  So, the overall
> > > > time to suck in a chunk of data will be vastly reduced.
> > > >
> > > > There's no other way to get this sort of performance without taking
> > > > my approach.
> > >
> > > So there's something I fundamentally don't understand. Why do those
> > > calls, made synchronously in today's CPython, not already run as fast
> > > as you can get the bytes off the device? I assume it's just a transfer
> > > from kernel memory to user memory. So what is the advantage of using
> > > aio over
> > >
> > >   with open(<file>, 'rb') as f:
> > >       data = f.read()
> >
> >          Ah, right.  That's where the OVERLAPPED aspect comes into play.
> >          (Other than Windows and AIX, I don't think any other OS provides
> >           an overlapped IO facility?)
> >
> >          The difference being, instead of having one thread writing to a
> 1GB
> >          buffer, 4KB at a time, you have 16 threads writing to an
> overlapped
> >          1GB buffer, 4KB at a time.
> >
> >          (Assuming you have 16+ cores, and IO interrupts are coming in
> whilst
> >           existing threads are still servicing previous completions.)
> >              Trent.
> >
> > Aha. So these are kernel threads?
>
>     Sort-of-but-not-really.  In Vista onwards, you don't even work with
>     threads directly, you just provide a callback, and Windows does all
>     sorts of thread pool magic behind the scenes to allow overlapped IO.
>
> > Is the bandwidth of the I/O channel really higher than one CPU can
> > copy bytes across a user/kernel boundary?
>
>     Ah, good question!  Sometimes yes, sometimes no.  Depends on the
>     hardware.  If you're reading from a single IO source like a file
>     on a disk, it would have to be one hell of a fast disk and one
>     super slow CPU before that would happen.
>
>     However, consider this:
>
>         aio.readfile('1GB-raw.1', buf1)
>         aio.readfile('1GB-raw.2', buf2)
>         aio.readfile('1GB-raw.3', buf3)
>         ...
>
>         with aio.events() as events:
>             for event in events:
>                 if event.type == EventType.FileReadComplete:
>                     aio.writefile(event.fname + '.bak', event.buf)
>
>                 if event.type == EventType.FileWriteComplete:
>                     log.debug('backed up ' + event.fname)
>
>                 if event.type == EventType.FileWriteFailed:
>                     log.error('failed backed up ' + event.fname)
>
>     aio.readfile() and writefile() return instantly.  With sufficient
>     files being handled in parallel, the ability to have 16+ threads
>     handle incoming requests instantly would be very useful.
>
>     Second beneficial example would be if you're a socket server with
>     65k active connections.  New interrupts will continually be pouring
>     in whilst you're still in the middle of copying data from a previous
>     interrupt.
>
>     Using my approach, Windows would be free to use as many threads as
>     you have cores to service all these incoming requests concurrently.
>
>     Because the threads are so simple and don't touch any CPython stuff,
>     their cache footprint will be very small, which is ideal.  All they
>     are doing is copying bytes then a quick interlocked list push, so
>     they'll run extremely quickly, often within their first quantum,
>     which means they're ready to service another request that much
>     quicker.
>
>     An important detail probably worth noting at this point: Windows
>     won't spawn more threads than there are cores*.  So, if you've got
>     all 16 threads tied up contending for the GIL and messing around
>     with PyList_Append() etc, you're going to kill your performance;
>     it'll take a lot longer to process new requests because the threads
>     take so much longer to do their work.
>
>     And compare that with the ultimate performance killer of a single
>     thread that periodically calls GetQueuedCompletionStatus when it's
>     ready to process some IO, and you can see how strange it would seem
>     to take that approach.  You're getting all the complexity of IOCP
>     and overlapped IO with absolutely none of the benefits.
>
>
>         Trent.
>



-- 
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121128/1e7391cb/attachment.html>

From trent at snakebite.org  Wed Nov 28 23:54:43 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 17:54:43 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <50B69013.7030301@canterbury.ac.nz>
References: <CAP7+vJL7br7Otno_6txJYrmiviox_g12q7Q6uPYJCo4zJAZxVw@mail.gmail.com>
	<20121128001514.GQ91191@snakebite.org>
	<B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<50B69013.7030301@canterbury.ac.nz>
Message-ID: <20121128225443.GK93849@snakebite.org>

On Wed, Nov 28, 2012 at 02:28:35PM -0800, Greg Ewing wrote:
> Trent Nelson wrote:
> >     I'm arguing that with my approach, because the background
> >     IO thread stuff is as optimal as it can be -- more IO events would
> >     be available per event loop iteration, and the latency between the
> >     event occurring versus when the event loop picks it up would be
> >     reduced.  The theory being that that will result in higher through-
> >     put and lower latency in practice.
> 
> But the data still as to wait around somewhere until the Python
> thread gets around to dealing with it. I don't see why it's
> better for it to sit around in the interlocked list than it is
> for the completion packets to just wait in the IOCP until the
> Python thread is ready.

    Hopefully the response I just sent to Guido makes things a little
    clearer?  I gave a few more examples of where I believe my approach
    is going to be much better than the single thread approach, which
    overlaps the concerns you raise here.

        Trent.


From shibturn at gmail.com  Wed Nov 28 23:52:49 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Wed, 28 Nov 2012 22:52:49 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128224034.GJ93849@snakebite.org>
References: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
	<20121128224034.GJ93849@snakebite.org>
Message-ID: <k964kc$fc8$1@ger.gmane.org>

On 28/11/2012 10:40pm, Trent Nelson wrote:
>      And compare that with the ultimate performance killer of a single
>      thread that periodically calls GetQueuedCompletionStatus when it's
>      ready to process some IO, and you can see how strange it would seem
>      to take that approach.  You're getting all the complexity of IOCP
>      and overlapped IO with absolutely none of the benefits.

BTW, GetQueuedCompletionStatusEx() lets you dequeue an array of messages 
instead of working one at a time.
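
Rough ctypes sketch of a batched dequeue (Windows-only, illustrative
only, error handling and argtypes omitted):

    import ctypes
    from ctypes import wintypes

    class OVERLAPPED_ENTRY(ctypes.Structure):
        _fields_ = [("lpCompletionKey", ctypes.c_size_t),
                    ("lpOverlapped", ctypes.c_void_p),
                    ("Internal", ctypes.c_size_t),
                    ("dwNumberOfBytesTransferred", wintypes.DWORD)]

    def drain(port, max_entries=64, timeout_ms=0):
        entries = (OVERLAPPED_ENTRY * max_entries)()
        n = wintypes.ULONG(0)
        ok = ctypes.windll.kernel32.GetQueuedCompletionStatusEx(
            port, entries, max_entries, ctypes.byref(n), timeout_ms, False)
        return entries[:n.value] if ok else []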

-- 
Richard



From trent at snakebite.org  Thu Nov 29 00:04:51 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 18:04:51 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <k964kc$fc8$1@ger.gmane.org>
References: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
	<20121128224034.GJ93849@snakebite.org> <k964kc$fc8$1@ger.gmane.org>
Message-ID: <20121128230451.GL93849@snakebite.org>

On Wed, Nov 28, 2012 at 02:52:49PM -0800, Richard Oudkerk wrote:
> On 28/11/2012 10:40pm, Trent Nelson wrote:
> >      And compare that with the ultimate performance killer of a single
> >      thread that periodically calls GetQueuedCompletionStatus when it's
> >      ready to process some IO, and you can see how strange it would seem
> >      to take that approach.  You're getting all the complexity of IOCP
> >      and overlapped IO with absolutely none of the benefits.
> 
> BTW, GetQueuedCompletionStatusEx() lets you dequeue an array of messages 
> instead of working one at a time.

    Right.  The funny thing about that is it's only available in Vista
    onwards.  And if you're on Vista onwards, the new thread pool APIs are
    available, which negate the need to call GetQueuedCompletionStatus*
    at all.

        Trent.


From trent at snakebite.org  Thu Nov 29 00:37:48 2012
From: trent at snakebite.org (Trent Nelson)
Date: Wed, 28 Nov 2012 18:37:48 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <CAP7+vJKNdO1ftdjPYH+fKZHHoQFqafjhj1+vLg-urSgG49Ub6g@mail.gmail.com>
References: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
	<20121128224034.GJ93849@snakebite.org>
	<CAP7+vJKNdO1ftdjPYH+fKZHHoQFqafjhj1+vLg-urSgG49Ub6g@mail.gmail.com>
Message-ID: <20121128233748.GM93849@snakebite.org>

On Wed, Nov 28, 2012 at 02:52:56PM -0800, Guido van Rossum wrote:
>    Well, okay, please go benchmark something and don't let my ignorance of
>    async I/O on Windows discourage me.

    Great!

> (I suppose you've actually written code like this in C or C++ so you
> know it all works?)

    Not recently :-)

    (I have some helpers though: http://trent.snakebite.net/books.jpg)

>    It still looks to me like you'll have a hard time keeping 16 cores busy if
>    the premise is that you're doing *some* processing in Python (as opposed
>    to the rather unlikely use case of backing up 1GB files),

    Yeah it won't take much for Python to become the single-core
    bottleneck once this is in place.  But we can explore ways to
    address this down the track.  (I still think multiprocessing
    might be a viable approach for map/reducing all the events
    over multiple cores.)

>    but it also looks to me that, if your approach works, it could be
>    sliced into (e.g.) a Twisted reactor easily without changing
>    Twisted's high-level interfaces in any way.

    It's almost ridiculous how well suited this approach would be for
    Twisted.  Which is funny, given their aversion to Windows ;-)
    I spent a few days reviewing their overall code/approach when I
    was looking at iocpreactor, and it definitely influenced the
    approach I'm proposing.

    I haven't mentioned all the other async Windows ops (especially all
    the new ones in 8) that this approach would also support.  Juicy
    stuff like async accept, getaddrinfo, etc.  That would slot straight
    in just as nicely:

        if event.type == EventType.GetAddrInfoComplete:
            ...
        elif event.type == EventType.AcceptComplete:
            ...

>    Do you have an implementation for the "interlocked list" that you mention?

    Do MSDN docs count? ;-)

    http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).aspx

    It's a pretty simple API.  Shouldn't be too hard to replicate with
    gcc/clang primitives.
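
    (The semantics -- though obviously not the lock-free implementation --
     can be modelled in a few lines of Python; all names here are mine:)

        import threading

        class InterlockedList:
            """LIFO list: push adds one item, flush atomically takes all."""
            def __init__(self):
                self._lock = threading.Lock()
                self._items = []

            def push(self, item):
                with self._lock:
                    self._items.append(item)

            def flush(self):
                with self._lock:
                    items, self._items = self._items, []
                items.reverse()   # newest first, like InterlockedFlushSList
                return items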

        Trent.


From solipsis at pitrou.net  Thu Nov 29 10:16:54 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 29 Nov 2012 10:16:54 +0100
Subject: [Python-ideas] An alternate approach to async IO
References: <B1B8327C-F7B4-4487-A2B0-3BD81B8041E8@molden.no>
	<CAP7+vJ+-4DW8A-32G+wK5KK02YE8Q5zs=oUmZFWAD+qe1odUoQ@mail.gmail.com>
	<D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org> <20121128192346.GE93849@snakebite.org>
	<k95qbj$elr$1@ger.gmane.org> <20121128201819.GG93849@snakebite.org>
Message-ID: <20121129101654.1d9a8c9c@pitrou.net>

Le Wed, 28 Nov 2012 15:18:19 -0500,
Trent Nelson <trent at snakebite.org> a ?crit :
> > 
> > That api is fairly similar to what is in the proactor branch of
> > tulip where you can write
> > 
> >      for event in proactor.poll(timeout):
> >          # process event
> > 
> > But why use a use a thread pool just to take items from one thread
> > safe (FIFO) queue and put them onto another thread safe (LIFO)
> > queue?
> 
>     I'm not sure how "thread pool" got all the focus suddenly.  That's
>     just an implementation detail.  The key thing I'm proposing is
> that we reduce the time involved in processing incoming IO requests.

At this point, I propose you start writing some code and come back with
benchmark numbers, before claiming that your proposal improves
performance at all. Further speculating about thread pools, async APIs
and whatnot sounds completely useless to me.

Regards

Antoine.




From trent at snakebite.org  Thu Nov 29 13:24:26 2012
From: trent at snakebite.org (Trent Nelson)
Date: Thu, 29 Nov 2012 07:24:26 -0500
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121129101654.1d9a8c9c@pitrou.net>
References: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org> <20121128192346.GE93849@snakebite.org>
	<k95qbj$elr$1@ger.gmane.org> <20121128201819.GG93849@snakebite.org>
	<20121129101654.1d9a8c9c@pitrou.net>
Message-ID: <20121129122425.GA97055@snakebite.org>

On Thu, Nov 29, 2012 at 01:16:54AM -0800, Antoine Pitrou wrote:
> Le Wed, 28 Nov 2012 15:18:19 -0500,
> Trent Nelson <trent at snakebite.org> a ?crit :
> > > 
> > > That api is fairly similar to what is in the proactor branch of
> > > tulip where you can write
> > > 
> > >      for event in proactor.poll(timeout):
> > >          # process event
> > > 
> > > But why use a use a thread pool just to take items from one thread
> > > safe (FIFO) queue and put them onto another thread safe (LIFO)
> > > queue?
> > 
> >     I'm not sure how "thread pool" got all the focus suddenly.  That's
> >     just an implementation detail.  The key thing I'm proposing is
> > that we reduce the time involved in processing incoming IO requests.
> 
> At this point, I propose you start writing some code and come back with
> benchmark numbers, before claiming that your proposal improves
> performance at all.

    That's the plan.  (Going to put aside a few hours each day to work
    on it.)  So, watch this space, I guess.

        Trent.


From sturla at molden.no  Thu Nov 29 17:02:32 2012
From: sturla at molden.no (Sturla Molden)
Date: Thu, 29 Nov 2012 17:02:32 +0100
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121129122425.GA97055@snakebite.org>
References: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<k951tt$sm2$1@ger.gmane.org> <50B622EF.1080500@molden.no>
	<50B63277.4@gmail.com> <50B65F1F.6040302@molden.no>
	<k95nlo$m7d$1@ger.gmane.org> <20121128192346.GE93849@snakebite.org>
	<k95qbj$elr$1@ger.gmane.org> <20121128201819.GG93849@snakebite.org>
	<20121129101654.1d9a8c9c@pitrou.net>
	<20121129122425.GA97055@snakebite.org>
Message-ID: <50B78718.6050400@molden.no>

On 29.11.2012 13:24, Trent Nelson wrote:

>> At this point, I propose you start writing some code and come back with
>> benchmark numbers, before claiming that your proposal improves
>> performance at all.
>
>      That's the plan.  (Going to put aside a few hours each day to work
>      on it.)  So, watch this space, I guess.

I'd also like to compare with a single-threaded design where the Python 
code calls GetQueuedCompletionStatusEx with a time-out. The idea here is 
an initial busy-wait with immediate time-out without releasing the GIL. 
Then after e.g. 2 ms we release the GIL and do a longer wait. That 
should also avoid excessive GIL shifting with "64k tasks". Personally I 
don't think a thread-pool will add to the scalability as long as the 
Python code just runs on a single core.
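
Roughly (a sketch only; gqcs_ex() stands in for a thin wrapper around
GetQueuedCompletionStatusEx taking a timeout in milliseconds, and the
release_gil flag is hypothetical):

    import time

    def poll_events(port):
        deadline = time.monotonic() + 0.002            # ~2 ms busy-wait
        while time.monotonic() < deadline:
            events = gqcs_ex(port, timeout_ms=0)       # GIL held, returns at once
            if events:
                return events
        # Nothing arrived: release the GIL and block for a bit longer.
        return gqcs_ex(port, timeout_ms=10, release_gil=True)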

Sturla







From benhoyt at gmail.com  Thu Nov 29 21:10:13 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 30 Nov 2012 09:10:13 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CA+OGgf7-ygtbp7bs4L_ipP1qQsMf4pAPSHh2d=zqGXBDoi0ZzA@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AEE81D.5060707@fastmail.us>
	<CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>
	<CA+OGgf7-ygtbp7bs4L_ipP1qQsMf4pAPSHh2d=zqGXBDoi0ZzA@mail.gmail.com>
Message-ID: <CAL9jXCH+2Fb66NfCwUwAUif9-mDxoTDokv1tVLMcb2q_xjp6qg@mail.gmail.com>

> So?  Consistency would be better, but that horse left before the barn
> was even built.  It is called filename "globbing" because even the
> wild inconsistency between regular expression implementations
> doesn't quite encompass most file globbing rules.
>
> I'll grant that better documentation would be nice.  But at this point,
> matching the platform expectation (at the cost of some additional
> cross-platform inconsistency) may be the lesser of evils.

So you're proposing that the "pattern" argument is passed directly to
FindFirstFile as the wildcard on Windows, and to Python's fnmatch on
Linux? That's not terrible, but there is the "my program deleted files
it shouldn't have" problem for edge cases.

Isn't fnmatch's behaviour quite well defined? I think if it's not too
difficult it'd be better to mimic that (emulating "bad patterns" using
fnmatch where necessary on Windows).
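
For example, a sketch of that emulation: always ask the OS for every
entry and do the matching with fnmatch, so the semantics are identical
on both platforms (find_first_file_all() below is hypothetical and would
wrap FindFirstFile with a plain "*" wildcard):

    import fnmatch

    def iterdir(path, pattern='*'):
        for name in find_first_file_all(path):
            if fnmatch.fnmatch(name, pattern):
                yield name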

-Ben


From p.f.moore at gmail.com  Thu Nov 29 23:59:14 2012
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 29 Nov 2012 22:59:14 +0000
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CAL9jXCH+2Fb66NfCwUwAUif9-mDxoTDokv1tVLMcb2q_xjp6qg@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AEE81D.5060707@fastmail.us>
	<CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>
	<CA+OGgf7-ygtbp7bs4L_ipP1qQsMf4pAPSHh2d=zqGXBDoi0ZzA@mail.gmail.com>
	<CAL9jXCH+2Fb66NfCwUwAUif9-mDxoTDokv1tVLMcb2q_xjp6qg@mail.gmail.com>
Message-ID: <CACac1F9+691WrQXeh=J1RcS9-7PyjCoECfU=NWxZWMHsMn4gdA@mail.gmail.com>

On 29 November 2012 20:10, Ben Hoyt <benhoyt at gmail.com> wrote:

> So you're proposing that the "pattern" argument is passed directly to
> FindFirstFile as the wildcard on Windows, and to Python's fnmatch on
> Linux? That's not terrible, but there is the "my program deleted files
> it shouldn't have" problem for edge cases.
>
> Isn't fnmatch's behaviour quite well defined? I think if it's not too
> difficult it'd be better to mimic that (emulating "bad patterns" using
> fnmatch where necessary on Windows).
>

Personally, I would far prefer cross-platform consistency. The fnmatch
behaviour is both better defined and more useful than the Windows behaviour.

Paul

From benhoyt at gmail.com  Fri Nov 30 00:01:25 2012
From: benhoyt at gmail.com (Ben Hoyt)
Date: Fri, 30 Nov 2012 12:01:25 +1300
Subject: [Python-ideas] BetterWalk,
	a better and faster os.walk() for Python
In-Reply-To: <CACac1F9+691WrQXeh=J1RcS9-7PyjCoECfU=NWxZWMHsMn4gdA@mail.gmail.com>
References: <CAL9jXCFJ_gh7C-StSupVh43hkA2LgZnLUrqJuZCuoA7=j4EEKQ@mail.gmail.com>
	<50AEE81D.5060707@fastmail.us>
	<CAL9jXCEYXBagxkmEt+k9QOVOzFJsRGOAc36tH6q_686y-oT+7Q@mail.gmail.com>
	<CA+OGgf7-ygtbp7bs4L_ipP1qQsMf4pAPSHh2d=zqGXBDoi0ZzA@mail.gmail.com>
	<CAL9jXCH+2Fb66NfCwUwAUif9-mDxoTDokv1tVLMcb2q_xjp6qg@mail.gmail.com>
	<CACac1F9+691WrQXeh=J1RcS9-7PyjCoECfU=NWxZWMHsMn4gdA@mail.gmail.com>
Message-ID: <CAL9jXCGVQvgiTZC+H0e59fq5f+t2ae0mZJ9z94NMPoqUk3+LQg@mail.gmail.com>

Agreed -- especially on the "cross-platform consistency" part, and the
well-defined fnmatch being a good place to start. -Ben


Isn't fnmatch's behaviour quite well defined? I think if it's not too
>> difficult it'd be better to mimic that (emulating "bad patterns" using
>> fnmatch where necessary on Windows).
>>
>
> Personally, I would far prefer cross-platform consistency. The fnmatch
> behaviour is both better defined and more useful than the Windows behaviour.
>
> Paul
>

From shibturn at gmail.com  Fri Nov 30 16:20:28 2012
From: shibturn at gmail.com (Richard Oudkerk)
Date: Fri, 30 Nov 2012 15:20:28 +0000
Subject: [Python-ideas] An alternate approach to async IO
In-Reply-To: <20121128230451.GL93849@snakebite.org>
References: <D678D9ED-5FFD-4461-B72A-F65E0FC68030@molden.no>
	<CAP7+vJJUw+O7rg5+Hsg9um9JFtBa_0u+DsMBMQ7HpmMOW2oGrA@mail.gmail.com>
	<20121128200532.GF93849@snakebite.org>
	<CAP7+vJKpoOw5Gk8A59Bc58V09402tvprPqLHvrFcrpvRrn0w-g@mail.gmail.com>
	<20121128203238.GH93849@snakebite.org>
	<CAP7+vJ+UZoa2u0mKUODM=8g61ph71-V=mSyYS0Wfk3b_PdbqyQ@mail.gmail.com>
	<20121128210233.GI93849@snakebite.org>
	<CAP7+vJ+QVwcMEwGo7X7f3_7AAx2cPY2vAPE6d=7gOAjCy9eRuQ@mail.gmail.com>
	<20121128224034.GJ93849@snakebite.org> <k964kc$fc8$1@ger.gmane.org>
	<20121128230451.GL93849@snakebite.org>
Message-ID: <k9ais8$esm$1@ger.gmane.org>

On 28/11/2012 11:04pm, Trent Nelson wrote:
>      Right.  The funny thing about that is it's only available in Vista
>      onwards.  And if you're on Vista onwards, the new thread pool APIs
>      available, which negate the need to call GetQueuedCompletionStatus*
>      at all.
>
>          Trent.

Completion ports are just thread-safe queues, equivalent to queue.Queue
in Python: PostQueuedCompletionStatus() corresponds to Queue.put() and
GetQueuedCompletionStatus() corresponds to Queue.get().  They are not
specific to IO and can be used for general message passing between threads.
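
In other words, the correspondence is roughly this (ignoring the actual
Win32 signatures):

    import queue

    q = queue.Queue()

    # PostQueuedCompletionStatus(port, nbytes, key, overlapped)  ~  Queue.put()
    q.put((0, 1234, None))        # (bytes transferred, completion key, overlapped)

    # GetQueuedCompletionStatus(port, timeout)  ~  Queue.get()
    nbytes, key, overlapped = q.get(timeout=1.0)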

I suspect that registering a file handle to a completion port is simply 
implemented (at least on XP) by using BindIoCompletionCallback() to 
register a callback which calls PostQueuedCompletionStatus() whenever an
operation completes on that handle.

You seem to be proposing to do *exactly the same thing*, but using a 
different queue implementation (and using a Vista only equivalent of 
BindIoCompletionCallback()).

I also suspect that if you try to implement a thread-safe queue which
does not busy-wait, you will end up having to add synchronization that
makes using an interlocked list unnecessary, and slower than using a
normal list.  Will it really be superior to the operating system's
built-in queue implementation?

-- 
Richard



From trent at snakebite.org  Fri Nov 30 17:14:22 2012
From: trent at snakebite.org (Trent Nelson)
Date: Fri, 30 Nov 2012 11:14:22 -0500
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
Message-ID: <20121130161422.GB536@snakebite.org>

    [ It's tough coming up with unique subjects for these async
      discussions.  I've dropped python-dev and cc'd python-ideas
      instead as the stuff below follows on from the recent msgs. ]

    TL;DR version:

        Provide an async interface that is implicitly asynchronous;
        all calls return immediately, callbacks are used to handle
        success/error/timeout.

            class async:
                def accept(): ...
                def read(): ...
                def write(): ...
                def getaddrinfo(): ...
                def submit_work(): ...

        How the asynchronicity (not a word, I know) is achieved is
        an implementation detail, and will differ for each platform.

        (Windows will be able to leverage all its async APIs to full
         extent, Linux et al can keep mimicking asynchronicity via
         the usual non-blocking + multiplexing (poll/kqueue etc),
         thread pools, etc.)


On Wed, Nov 28, 2012 at 11:15:07AM -0800, Glyph wrote:
>    On Nov 28, 2012, at 12:04 PM, Guido van Rossum <guido at python.org> wrote:
>    I would also like to bring up <https://github.com/lvh/async-pep> again.

    So, I spent yesterday working on the IOCP/async stuff.  Then I saw this
    PEP and the sample async/abstract.py.  That got me thinking: why don't
    we have a low-level async facade/API?  Something where all calls are
    implicitly asynchronous.

    On systems with extensive support for asynchronous 'stuff', primarily
    Windows and AIX/Solaris to a lesser extent, we'd be able to leverage
    the platform-provided async facilities to full effect.

    On other platforms, we'd fake it, just like we do now, with select,
    poll/epoll, kqueue and non-blocking sockets.

    Consider the following:

        class Callback:
            __slots__ = [
                'success',
                'failure',
                'timeout',
                'cancel',
            ]

        class AsyncEngine:
            def getaddrinfo(host, port, ..., cb):
                ...

            def getaddrinfo_then_connect(.., callbacks=(cb1, cb2))
                ...

            def accept(sock, cb):
                ...

            def accept_then_write(sock, buf, (cb1, cb2)):
                ...

            def accept_then_expect_line(sock, line, (cb1, cb2)):
                ...

            def accept_then_expect_multiline_regex(sock, regex, cb):
                ...

            def read_until(fd_or_sock, bytes, cb):
                ...

            def read_all(fd_or_sock, cb):
                return self.read_until(fd_or_sock, EOF, cb)

            def read_until_lineglob(fd_or_sock, cb):
                ...

            def read_until_regex(fd_or_sock, cb):
                ...

            def read_chunk(fd_or_sock, chunk_size, cb):
                ...

            def write(fd_or_sock, buf, cb):
                ...

            def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
                ...

            def connect_then_expect_line(..):
                ...

            def connect_then_write_line(..):
                ...

            def submit_work(callable, cb):
                ...

            def run_once(..):
                """Run the event loop once."""

            def run(..):
                """Keep running the event loop until exit."""

    All methods always take at least one callback.  Chained methods can
    take multiple callbacks (e.g. accept_then_expect_line()).  You fill
    in the success, failure (both callables) and timeout (an int) slots.
    The engine will populate cb.cancel with a callable that you can call
    at any time to (try and) cancel the IO operation.  (How quickly that
    works depends on the underlying implementation.)
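
    Usage would look roughly like this (names invented on the spot, just
    to show the shape; assume sock is an accepted socket):

        cb = Callback()
        cb.success = lambda data: print('got', data)
        cb.failure = lambda exc: print('error:', exc)
        cb.timeout = 5                  # seconds

        engine = AsyncEngine()
        engine.read_until(sock, b'\r\n', cb)

        # The engine fills in cb.cancel; call it to abort the pending read.
        cb.cancel()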

    I like this approach for two reasons: a) it allows platforms with
    great async support to work at their full potential, and b) it
    doesn't leak implementation details like non-blocking sockets, fds,
    multiplexing (poll/kqueue/select, IOCP, etc).  Those are all details
    that are taken care of by the underlying implementation.

    getaddrinfo is a good example here.  Guido, in tulip, you have this
    implemented as:

        def getaddrinfo(host, port, af=0, socktype=0, proto=0):
            infos = yield from scheduling.call_in_thread(
                socket.getaddrinfo,
                host, port, af,
                socktype, proto
            )

    That's very implementation specific.  It assumes the only way to
    perform an async getaddrinfo is by calling it from a separate
    thread.  On Windows, there's native support for async getaddrinfo(),
    which we wouldn't be able to leverage here.

    The biggest benefit is that no assumption is made as to how the
    asynchronicity is achieved.  Note that I didn't mention IOCP or
    kqueue or epoll once.  Those are all implementation details that
    the writer of an asynchronous Python app doesn't need to care about.

    Thoughts?

        Trent.


From Steve.Dower at microsoft.com  Fri Nov 30 18:57:04 2012
From: Steve.Dower at microsoft.com (Steve Dower)
Date: Fri, 30 Nov 2012 17:57:04 +0000
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <20121130161422.GB536@snakebite.org>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
Message-ID: <A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>

Trent Nelson wrote:
>    TL;DR version:
>
>        Provide an async interface that is implicitly asynchronous;
>        all calls return immediately, callbacks are used to handle
>        success/error/timeout.

This is the central idea of what I've been advocating - the use of Future. Rather than adding an extra parameter to the initial call, asynchronous methods return an object that can have callbacks added.
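
A minimal sketch of the difference (not any particular library's API; 'process' is just a stand-in for whatever you do with the result):

    from concurrent.futures import Future

    # Callback style: the callback is an extra parameter.
    def read_cb(sock, nbytes, callback):
        ...  # arrange for callback(data) / callback(exc) when the read completes

    # Future style: the method returns an object you attach callbacks to.
    def read_f(sock, nbytes):
        f = Future()
        ...  # arrange for f.set_result(data) / f.set_exception(exc)
        return f

    f = read_f(sock, 4096)
    f.add_done_callback(lambda fut: process(fut.result()))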

>    The biggest benefit is that no assumption is made as to how the
>    asynchronicity is achieved.  Note that I didn't mention IOCP or
>    kqueue or epoll once.  Those are all implementation details that
>    the writer of an asynchronous Python app doesn't need to care about.

I think this is why I've been largely ignored (except by Guido) - I don't even mention sockets, let alone the implementation details :). There are all sorts of operations that can be run asynchronously that do not involve sockets, though it seems that the driving force behind most of the effort is just to make really fast web servers.

My code contribution is at http://bitbucket.org/stevedower/wattle, though I have not updated it in a while and there are certainly aspects that I would change. You may find it interesting if you haven't seen it yet.

Cheers,
Steve



From guido at python.org  Fri Nov 30 20:04:09 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 30 Nov 2012 11:04:09 -0800
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
Message-ID: <CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>

Futures or callbacks, that's the question...

Richard and I have even been considering APIs like this:

res = obj.some_call(<args>)
if isinstance(res, Future):
    res = yield res

or

res = obj.some_call(<args>)
if res is None:
    res = yield <magic>

where <magic> is some call on the scheduler/eventloop/proactor that
pulls the future out of a hat.

The idea of the first version is simply to avoid the Future when the
result happens to be immediately ready (e.g. when calling readline()
on some buffering stream, most of the time the next line is already in
the buffer); the point of the second version is that "res is None" is
way faster than "isinstance(res, Future)" -- however the magic is a
little awkward.
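
For concreteness, a buffered readline() using the first idiom might look
roughly like this (just a sketch, not actual tulip code; the
_schedule_fill_until() helper is invented here):

    def readline(self):
        i = self._buffer.find(b'\n')
        if i >= 0:
            # Fast path: the line is already in the buffer -- return it
            # directly, no Future allocated at all.
            line, self._buffer = self._buffer[:i + 1], self._buffer[i + 1:]
            return line
        # Slow path: return a Future that the event loop completes once
        # enough data has arrived.  (Hypothetical helper.)
        return self._schedule_fill_until(b'\n')

and at the call site:

    res = stream.readline()
    if isinstance(res, Future):
        res = yield res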

The debate is still open.

--Guido

On Fri, Nov 30, 2012 at 9:57 AM, Steve Dower <Steve.Dower at microsoft.com> wrote:
> Trent Nelson wrote:
>>    TL;DR version:
>>
>>        Provide an async interface that is implicitly asynchronous;
>>        all calls return immediately, callbacks are used to handle
>>        success/error/timeout.
>
> This is the central idea of what I've been advocating - the use of Future. Rather than adding an extra parameter to the initial call, asynchronous methods return an object that can have callbacks added.
>
>>    The biggest benefit is that no assumption is made as to how the
>>    asynchronicity is achieved.  Note that I didn't mention IOCP or
>>    kqueue or epoll once.  Those are all implementation details that
>>    the writer of an asynchronous Python app doesn't need to care about.
>
> I think this is why I've been largely ignored (except by Guido) - I don't even mention sockets, let alone the implementation details :). There are all sorts of operations that can be run asynchronously that do not involve sockets, though it seems that the driving force behind most of the effort is just to make really fast web servers.
>
> My code contribution is at http://bitbucket.org/stevedower/wattle, though I have not updated it in a while and there are certainly aspects that I would change. You may find it interesting if you haven't seen it yet.
>
> Cheers,
> Steve



-- 
--Guido van Rossum (python.org/~guido)


From Steve.Dower at microsoft.com  Fri Nov 30 20:18:34 2012
From: Steve.Dower at microsoft.com (Steve Dower)
Date: Fri, 30 Nov 2012 19:18:34 +0000
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
Message-ID: <A7269F03D11BC245BD52843B195AC4F0019E4758@TK5EX14MBXC293.redmond.corp.microsoft.com>

Guido van Rossum wrote:
> Futures or callbacks, that's the question...

I know the C++ standards committee is looking at the same thing right now, and they're probably going to provide both: futures for those who prefer them (which is basically how the code looks) and callbacks for when every cycle is critical or if the developer prefers them. C++ has the advantage that futures can often be optimized out, so implementing a Future-based wrapper around a callback-based function is very cheap, but the two-level API will probably happen.

> Richard and I have even been considering APIs like this:
>
> res = obj.some_call(<args>)
> if isinstance(res, Future):
>     res = yield res
>
> or
>
> res = obj.some_call(<args>)
> if res is None:
>     res = yield <magic>
>
> where <magic> is some call on the scheduler/eventloop/proactor that pulls the future out of a hat.
>
> The idea of the first version is simply to avoid the Future when the result happens to be immediately
> ready (e.g. when calling readline() on some buffering stream, most of the time the next line is
> already in the buffer); the point of the second version is that "res is None" is way faster than
> "isinstance(res, Future)" -- however the magic is a little awkward.
>
> The debate is still open.

How about:

value, future = obj.some_call(...)
if value is None:
    value = yield future

Or:

future = obj.some_call(...)
if future.done():
    value = future.result()
else:
    value = yield future

I like the second one because it doesn't require the methods to do anything special to support always yielding vs. only yielding futures that aren't ready - the caller gets to decide how performant they want to be. (I would also like to see Future['s base class] be implemented in C and possibly even preallocated to reduce overhead. 'done()' could also be an attribute rather than a method, though that would break the existing Future class.)
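
To be clear about what "nothing special" means: the implementation can always return a Future and simply complete it immediately on the fast path, something like (sketch only, attribute names invented):

    from concurrent.futures import Future

    def some_call(self):
        f = Future()
        if self._buffer:                        # fast path: data already available
            f.set_result(self._buffer.pop(0))   # future is done before it's returned
        else:
            self._pending.append(f)             # slow path: event loop completes it later
        return f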

Cheers,
Steve



From solipsis at pitrou.net  Fri Nov 30 20:27:10 2012
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 30 Nov 2012 20:27:10 +0100
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
Message-ID: <20121130202710.635e2244@pitrou.net>

On Fri, 30 Nov 2012 11:04:09 -0800
Guido van Rossum <guido at python.org> wrote:
> Futures or callbacks, that's the question...
> 
> Richard and I have even been considering APIs like this:
> 
> res = obj.some_call(<args>)
> if isinstance(res, Future):
>     res = yield res
> 
> or
> 
> res = obj.some_call(<args>)
> if res is None:
>     res = yield <magic>
> 
> where <magic> is some call on the scheduler/eventloop/proactor that
> pulls the future out of a hat.
> 
> The idea of the first version is simply to avoid the Future when the
> result happens to be immediately ready (e.g. when calling readline()
> on some buffering stream, most of the time the next line is already in
> the buffer); the point of the second version is that "res is None" is
> way faster than "isinstance(res, Future)" -- however the magic is a
> little awkward.

This premature optimization looks really ugly to me. I'm strongly -1
on both idioms.

Regards

Antoine.




From guido at python.org  Fri Nov 30 20:29:15 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 30 Nov 2012 11:29:15 -0800
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <A7269F03D11BC245BD52843B195AC4F0019E4758@TK5EX14MBXC293.redmond.corp.microsoft.com>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
	<A7269F03D11BC245BD52843B195AC4F0019E4758@TK5EX14MBXC293.redmond.corp.microsoft.com>
Message-ID: <CAP7+vJ+qr414pz_FqEh2DPKncdvuT9XvdgpXA7kkYhMrkiCiAQ@mail.gmail.com>

On Fri, Nov 30, 2012 at 11:18 AM, Steve Dower <Steve.Dower at microsoft.com> wrote:
> Guido van Rossum wrote:
>> Futures or callbacks, that's the question...
>
> I know the C++ standards committee is looking at the same thing right now, and they're probably going to provide both: futures for those who prefer them (which is basically how the code looks) and callbacks for when every cycle is critical or if the developer prefers them. C++ has the advantage that futures can often be optimized out, so implementing a Future-based wrapper around a callback-based function is very cheap, but the two-level API will probably happen.

Well, for Python 3 we will definitely have two layers already:
callbacks and yield-from-based-coroutines. The question is whether
there's room for Futures in between (I like layers of abstraction, but
I don't like having too many layers).

>> Richard and I have even been considering APIs like this:
>>
>> res = obj.some_call(<args>)
>> if isinstance(res, Future):
>>     res = yield res
>>
>> or
>>
>> res = obj.some_call(<args>)
>> if res is None:
>>     res = yield <magic>
>>
>> where <magic> is some call on the scheduler/eventloop/proactor that pulls the future out of a hat.
>>
>> The idea of the first version is simply to avoid the Future when the result happens to be immediately
>> ready (e.g. when calling readline() on some buffering stream, most of the time the next line is
>> already in the buffer); the point of the second version is that "res is None" is way faster than
>> "isinstance(res, Future)" -- however the magic is a little awkward.
>>
>> The debate is still open.
>
> How about:
>
> value, future = obj.some_call(...)
> if value is None:
>     value = yield future

Also considered; I don't really like having to allocate a tuple here
(which is impossible to optimize out completely, even though its
allocation may use a fast free list).

> Or:
>
> future = obj.some_call(...)
> if future.done():
>     value = future.result()
> else:
>     value = yield future

That seems the most expensive option of all because of the call to
done() that's always there.

> I like the second one because it doesn't require the methods to do anything special to support always yielding vs. only yielding futures that aren't ready - the caller gets to decide how performant they want to be. (I would also like to see Future['s base class] be implemented in C and possibly even preallocated to reduce overhead. 'done()' could also be an attribute rather than a method, though that would break the existing Future class.)

Note that in all cases the places where this idiom is *used* should be
few and far between -- it should only be needed in the "glue" between
the callback-based world and the coroutine-based world. You'd only be
writing new calls like this if you're writing new glue, which should
only be necessary if you are writing wrappers for (probably
platform-specific) new primitive operations supported by the lowest
level event loop.

This is why I am looking for the pattern that executes fastest rather
than the pattern that is easiest to write for end users -- the latter
would be to always return a Future and let the user write

  res = yield obj.some_call(<args>)

-- 
--Guido van Rossum (python.org/~guido)


From guido at python.org  Fri Nov 30 20:30:34 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 30 Nov 2012 11:30:34 -0800
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <20121130202710.635e2244@pitrou.net>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
	<20121130202710.635e2244@pitrou.net>
Message-ID: <CAP7+vJ+upW_T9zHJG18hs5RPPfzrjGQsN0hQneCKuxC_GC-udw@mail.gmail.com>

On Fri, Nov 30, 2012 at 11:27 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Fri, 30 Nov 2012 11:04:09 -0800
> Guido van Rossum <guido at python.org> wrote:
>> Futures or callbacks, that's the question...
>>
>> Richard and I have even been considering APIs like this:
>>
>> res = obj.some_call(<args>)
>> if isinstance(res, Future):
>>     res = yield res
>>
>> or
>>
>> res = obj.some_call(<args>)
>> if res is None:
>>     res = yield <magic>
>>
>> where <magic> is some call on the scheduler/eventloop/proactor that
>> pulls the future out of a hat.
>>
>> The idea of the first version is simply to avoid the Future when the
>> result happens to be immediately ready (e.g. when calling readline()
>> on some buffering stream, most of the time the next line is already in
>> the buffer); the point of the second version is that "res is None" is
>> way faster than "isinstance(res, Future)" -- however the magic is a
>> little awkward.
>
> This premature optimization looks really ugly to me. I'm strongly -1
> on both idioms.

Read my explanation in my response to Steve.

-- 
--Guido van Rossum (python.org/~guido)


From greg.ewing at canterbury.ac.nz  Fri Nov 30 23:37:42 2012
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 01 Dec 2012 11:37:42 +1300
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
Message-ID: <50B93536.30104@canterbury.ac.nz>

Guido van Rossum wrote:
> Futures or callbacks, that's the question...
> 
> Richard and I have even been considering APIs like this:
> 
> res = obj.some_call(<args>)
> if isinstance(res, Future):
>     res = yield res

I thought you had decided against the idea of yielding
futures?

-- 
Greg


From guido at python.org  Fri Nov 30 23:55:33 2012
From: guido at python.org (Guido van Rossum)
Date: Fri, 30 Nov 2012 14:55:33 -0800
Subject: [Python-ideas] An async facade? (was Re: [Python-Dev] Socket
 timeout and completion based sockets)
In-Reply-To: <50B93536.30104@canterbury.ac.nz>
References: <EFE3877620384242A686D52278B7CCD329DEB553@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJWDEwAvKB1VVrz-RX6b9kO3TpxnwUGnoq57746MV1WFg@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DEBF4B@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJKpeEDV9ZpvJqN2_joOt4cKj9B1BJ5gYeZusnwcnu_VwQ@mail.gmail.com>
	<EFE3877620384242A686D52278B7CCD329DED074@RKV-IT-EXCH103.ccp.ad.local>
	<CAP7+vJJKJrYbXFF8EjBSYdicVMKZv1J4A_rc4rdw6VkMkEg6Fg@mail.gmail.com>
	<1D9BE0CD-5BF4-480D-8D40-5A409E40760D@twistedmatrix.com>
	<20121130161422.GB536@snakebite.org>
	<A7269F03D11BC245BD52843B195AC4F0019E46B6@TK5EX14MBXC293.redmond.corp.microsoft.com>
	<CAP7+vJ+np39bRs-F3YCcsyvZVSzDRdC7A2XvMHTDqHd3emW8mw@mail.gmail.com>
	<50B93536.30104@canterbury.ac.nz>
Message-ID: <CAP7+vJL=7jNuTJ8VuT1LDw9dKo8TdxPTH+6cs7-gJA6GmnGROQ@mail.gmail.com>

On Fri, Nov 30, 2012 at 2:37 PM, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
>>
>> Futures or callbacks, that's the question...
>>
>> Richard and I have even been considering APIs like this:
>>
>> res = obj.some_call(<args>)
>> if isinstance(res, Future):
>>     res = yield res
>
>
> I thought you had decided against the idea of yielding
> futures?

As a user-facing API style, yes. But this is meant for an internal API
-- the equivalent of your bare 'yield'. If you want, I can consider
another style as well:


res = obj.some_call(<args>)
if isinstance(res, Future):
    res.<magic_call>()
    yield

But I don't see a fundamental advantage to this.

-- 
--Guido van Rossum (python.org/~guido)